SED: An Algorithm for Automatic Identification of Section and Subsection Headings in Text Documents

Main Authors: Muhammad Bello Aliyu, Rahat Iqbal, Anne James, Dianabasi Nkantah
Format: Journal Article
Published: 2020
Subjects:
Online Access: https://zenodo.org/record/4431057
Table of Contents:
  • Word processing applications such as Microsoft Word have advanced features like the automatic table of contents (ToC). The ToC is a representation of the headings of the sections and subsections within the document. Currently, there is no computational procedure to traverse the document and identify sections and subsections in order to extract the information needed for the ToC and other text analytics purposes. All the applications rely on users to identify and highlight the text (headings and subheadings) within the document that is to appear in the ToC. Text documents are organised into sections and subsections, each with a named heading or subheading. This paper presents a novel algorithm for automatically identifying the headings and subheadings of all sections within text documents. By leveraging this algorithm, the generation of the table of contents can be fully automated, so that users do not have to identify or select the headings and subheadings manually. The algorithm is simple, rule-based and unsupervised. This improves the process and saves a great deal of time, as there is no training involved. The algorithm has been tested on several documents (papers) and achieved an accuracy of over 82%. The algorithm also improves the computational capabilities of current natural language processing approaches. It is also useful for automating some tasks in systematic literature reviews and would speed up the analysis and evaluation of natural language resources and text analytics in general.
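The abstract describes the algorithm only as simple, rule-based and unsupervised, and does not list its rules. As a rough illustration of what rule-based heading detection can look like, the sketch below flags lines that start with a section number, are short, begin with an uppercase letter, and do not end like a sentence. These rules, the function name `find_headings`, and the `max_words` threshold are all assumptions made for this example; they are not taken from the paper and should not be read as the SED algorithm itself.

```python
import re

# Hypothetical pattern (not from the paper): a section number such as
# "1", "2.3" or "4.1.2", an optional trailing dot, whitespace, then a title.
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def find_headings(text, max_words=12):
    """Return (level, heading) pairs for lines that look like numbered headings.

    The rules here are illustrative assumptions, not the published SED rules.
    """
    headings = []
    for raw in text.splitlines():
        line = raw.strip()
        match = NUMBERED.match(line)
        if not match:
            continue
        number, title = match.groups()
        # Assumed rule: headings are short and do not end like a sentence.
        if len(title.split()) > max_words or title.endswith((".", ",", ";", ":")):
            continue
        # Assumed rule: a heading title starts with an uppercase letter.
        if not title[0].isupper():
            continue
        # Nesting depth follows the dots in the number: "1.1" is level 2.
        level = number.count(".") + 1
        headings.append((level, f"{number} {title}"))
    return headings


sample = (
    "1 Introduction\n"
    "Text documents are organised into sections.\n"
    "1.1 Background\n"
    "2020 was the year the paper appeared.\n"
    "2. Method\n"
    "3 results were reported in the study.\n"
)

for level, heading in find_headings(sample):
    # Indent each heading by its level to mimic a table of contents.
    print("  " * (level - 1) + heading)
```

Because no training data is involved, a detector of this kind can be run directly on new documents, which is the property the abstract highlights. A sketch this small will miss unnumbered headings and can be fooled by short numbered list items.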