Table of contents

In a large HTML text document, you have to scroll to reach a certain section. If the document is huge, an EPUB book can be created from it. Websites sometimes have a table of contents that is generated automatically when the page content is parsed and displayed.

If neither of these options works for you but you still want a table of contents in your document, this script can be used.

The script takes an HTML file as an argument and places the table of contents inside the document. The original input document is not modified: you may specify the file that the output is written to, otherwise the resulting HTML is sent to standard output.

A table of contents is a navigation element. Therefore, by default the table of contents is embedded into an HTML block like this:

<nav role="doc-toc" aria-labelledby="some_id">
  <div id="some_id">
     [ ... the label of the table of contents ... ]
  </div>
  [ ... the list with the headlines as items ... ]
</nav>

Screen readers can recognize this markup and distinguish the table of contents from other content on the page.

You can replace this HTML snippet with something else by using the --nav argument.
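As a rough sketch of how the default wrapper and a custom --nav string might be applied: the function name wrap_toc and the sample strings are illustrative assumptions, only the template and the default placeholder are taken from the script listed further down.

```python
# Illustrative sketch; wrap_toc and the sample values are assumptions.
DEFAULT_NAV = '<nav role="doc-toc" aria-labelledby="{0}"><div id="{0}">{1}</div>{2}</nav>'
PLACEHOLDER = '%%__TOC__%%'


def wrap_toc(toc_list, label_id, label, custom_nav=None, placeholder=PLACEHOLDER):
    """Embed the list either in the default <nav> block or in a custom string."""
    if custom_nav is None:
        # Default: the label gets the id that aria-labelledby points to.
        return DEFAULT_NAV.format(label_id, label, toc_list)
    # Custom: the placeholder inside the --nav string is replaced by the list.
    return custom_nav.replace(placeholder, toc_list)


toc_list = '<ul><li><a href="#intro">Intro</a></li></ul>'
print(wrap_toc(toc_list, 'toc1', 'Content'))
print(wrap_toc(toc_list, 'toc1', 'Content', '<aside>%%__TOC__%%</aside>'))
```

Note that a custom string without the placeholder would be returned unchanged, so the list would be missing from the output.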

The list itself is an unordered list with sub-items depending on the headline level. You can also request an ordered list with --list-type ol, in which case the entries in the table of contents are numbered. The headlines in the text remain unchanged.
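The nesting can be illustrated with a simplified sketch. The function build_list and its (level, text) input format are assumptions made for this example; the script below works on a list of dictionaries and also adds the links.

```python
def build_list(headlines, list_type='ul'):
    """Turn (level, text) pairs into a nested HTML list."""
    if not headlines:
        return ''
    top = min(level for level, _ in headlines)
    # Start one level above the top level so the first item opens a list.
    depth = top - 1
    html = ''
    for level, text in headlines:
        if level > depth:
            # Deeper headline: open one sub list per level.
            html += f'<{list_type}><li>' * (level - depth)
        else:
            # Same or higher level: close the sub lists, then start a new item.
            html += f'</li></{list_type}>' * (depth - level) + '</li><li>'
        html += text
        depth = level
    # Close every list that is still open.
    html += f'</li></{list_type}>' * (depth - top + 1)
    return html


print(build_list([(1, 'A'), (2, 'B'), (2, 'C'), (1, 'D')]))
# <ul><li>A<ul><li>B</li><li>C</li></ul></li><li>D</li></ul>
```

Passing 'ol' as the second argument produces the same structure with ordered lists, which a browser then numbers on its own.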

The parsing process

The script searches for any headline identified by h1 to h5. The order of appearance in the text is preserved, so the headlines in the list appear in the same order as in the text.

Navigation works by placing a link on each list item whose target is the existing headline in the text. Therefore, the headline elements need an id attribute. If a headline already has that attribute, the script uses it for the list item in the table of contents. Otherwise a random string is created and set as the id, which makes clashes with existing id attributes of the same value unlikely.
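The id handling can be sketched in isolation as follows. The class HeadlineCollector is a hypothetical, reduced example that only collects the headlines and their plain text; it neither rewrites the document nor keeps inline tags inside a headline.

```python
import random
import string
from html.parser import HTMLParser

HEADLINE_TAGS = ('h1', 'h2', 'h3', 'h4', 'h5')


class HeadlineCollector(HTMLParser):
    """Collect (level, id, text) for every h1 to h5 in document order."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._current = None  # [level, id, text] while inside a headline

    def handle_starttag(self, tag, attrs):
        if tag in HEADLINE_TAGS:
            # Reuse an existing id, otherwise make up a random one.
            existing = dict(attrs).get('id')
            new_id = existing or ''.join(
                random.choices(string.ascii_letters + string.digits, k=6))
            self._current = [int(tag[1]), new_id, '']

    def handle_data(self, data):
        if self._current is not None:
            self._current[2] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag in HEADLINE_TAGS:
            self.headlines.append(tuple(self._current))
            self._current = None


collector = HeadlineCollector()
collector.feed('<h1 id="top">Title</h1><p>Text</p><h2>Part <b>one</b></h2>')
for level, headline_id, text in collector.headlines:
    print(level, headline_id, text)
```

The first headline keeps its id "top", while the second one receives a six-character random id that differs on every run.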

Because the html needs to be modified and thus may be broken (although that should not happen), the resulting HTML does not overwrite the original source file.

The toc.py script

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

"""
This script creates a table of content from the headlines that are contained
in the html document.
The html document/fragment is parsed and all h1 to h5 tags are searched. The
header tags get an id attribute set, if it is not already contained. The TOC
is built as an (un)ordered list in html containing anchor elements that link
to the headlines.
The resulting table of contents is embedded into a <nav> element that also
contains a headline like "Content". The whole structure as well as the
headline can be customized. Furthermore the table of contents can be embedded
into the existing HTML by using a placeholder, which itself can be customized.
The default placeholder for embedding the table of contents is %%__TOC__%%.

General usage is:
toc.py --in <html_file> [ --out result.html ] [ --toc-only ]
       [ --header <html> ] [ --nav <html> ] [ --list-type ul|ol ]
       [ --placeholder <string> ]

Parameters are:
--help            Print this help information.
--in <file>       The file with the html snippet that is parsed.

Optional parameters:
--header <s>      Label string of the headline above the TOC HTML. This may
                  also contain HTML itself.
--list-type <t>   List type element that is used for the entries of the TOC.
                  By default this is "ul".
--nav <s>         The string for the table of contents that the list with
                  the headlines is embedded in. When this option is used
                  the string must contain a placeholder where the list is
                  inserted.
--out <file>      The file where the resulting html is written to. If not
                  given, the resulting html is written to stdout.
--placeholder <s> Placeholder string where the TOC is placed inside the HTML
                  document. Default is "%%__TOC__%%". When the HTML contains
                  no placeholder then the TOC is placed before the HTML.
                  This placeholder is also used in the --nav string, when
                  used.
--toc-only        Print the TOC only. In this case there is no automatic
                  linking to the headlines in the HTML content.


Example calls:
python3 toc.py --in foo.html --header "foo baar" --nav '<div class="foo">Inhalt<div>%%__TOX__%%</div></div>' --placeholder %%__TOX__%%

"""
import random
import string
import sys
from html.parser import HTMLParser

class tocParser(HTMLParser):
    
    def __init__(self):
        super().__init__()
        # Where the parsed result is stored (the changed input html)
        self._html = ''
        # The toc html.
        self.toc = None
        # An array with a dictionary element for each headline found.
        self._headers = []
        # Current headline level that is being added to self._headers.
        self._level = 0
        # Current id attribute value that is being added to self._headers.
        self._id = ''
        # Current headline label that is being added to self._headers.
        self._headline = ''
        # Embed the toc into this structure, if not set self._navhtml is used.
        # The toc_structure content must contain the placeholder (default %%__TOC__%%) where the toc items are placed.
        self.toc_structure = None
        # The label that is supposed to appear on top of the toc html, when not set self._navlabel is used.
        self.toc_header = None
        # The list type that is used for the toc items.
        self.list_type = 'ul'
        # Flag whether to produce the TOC only or change the full document.
        self.toc_only = False
        # The placeholder in the html that is replaced with the toc.
        self.placeholder = '%%__TOC__%%'
        # The html fragment for the <nav> element to place the toc inside.
        self._navhtml = '<nav role="doc-toc" aria-labelledby="{0}"><div id="{0}">{1}</div>{2}</nav>'
        # The default label for the content headline
        self._navlabel = '<div class="toc-label">Content</div>'

    def handle_starttag(self, tag, attrs):
        """ Handle all opening tags. When a headline tag is found we need to add an id attribute
        or remember it, so that the toc entry can be linked to that headline. """

        inHeader = False
        if tag in ['h1', 'h2', 'h3', 'h4', 'h5']:
            inHeader = True
            self._id = ''
            # Check for an already set id attribute.
            for attr in attrs:
                if attr[0] == 'id':
                    self._id = attr[1]
                    break
            # When we have no id attribute, create a random value and add it to the headline opening tag.
            if self._id == '':
                self._id = self.get_random_id()
                if not self.toc_only:
                    attrs.append(('id', self._id))
            self._level = int(tag[1])

        # The opening element needs to be kept in the result as it is.
        attr_str = ''.join(
            f' {name}="{value}"' if value is not None else f" {name}"
            for name, value in attrs
        )
        # Write the opening tag as it is to the result document.
        self._html += f"<{tag}{attr_str}>"

        # We have an opening tag inside a headline (e.g. a <b> or similar).
        if not inHeader and self._level > 0:
            self._headline += f"<{tag}{attr_str}>"

    def handle_endtag(self, tag):
        """ Handle the closing tags, when a headline is closed, the information of that headline
        is stored in the self._headers array. """

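        if tag in ['h1', 'h2', 'h3', 'h4', 'h5']:
            self._headers.append({"id": self._id, "level": self._level, "text": self._headline})
            self._level = 0
            self._headline = ''
            self._id = ''
        elif self._level > 0:
            # A closing tag inside a headline (e.g. </b>) belongs to the headline label too.
            self._headline += f"</{tag}>"
        self._html += f"</{tag}>"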

    def handle_data(self, data):
        """ Handle everything outside a html element. """
        if self._level > 0:
            self._headline += data
        self._html += data

    def get_toc(self):
        """ Build the TOC from the information stored in self._headers. This is an array containing
        the headlines as they appear in the html document. """

        if self.toc is not None:
            return self.toc
        self.toc = ''
        if len(self._headers) == 0:
            return self.toc
        # Find the headline with the lowest level, i.e. the highest in the hierarchy.
        startlevel = 5
        for line in self._headers:
            if startlevel > line['level']:
                startlevel = line['level']
        # Start one level up so that at least one list block is written.
        level = startlevel - 1
        for line in self._headers:
            item = line['text'] if self.toc_only else f'<a href="#{line["id"]}">{line["text"]}</a>'
            # If the current headline is deeper in the hierarchy (e.g. an h3 after an h2), open a sub list.
            if level < line['level']:
                while level < line['level']:
                    self.toc += f'<{self.list_type}><li>'
                    level += 1
                self.toc += item
            # If the current headline is higher in the hierarchy (e.g. an h2 after an h3), close the sub list before adding this item.
            elif level > line['level']:
                while level > line['level']:
                    self.toc += f'</li></{self.list_type}>'
                    level -= 1
                self.toc += f'</li><li>{item}'
            # Same level, just add another li with the current headline.
            else:
                if self.toc[-5:] != '</li>':
                    self.toc += '</li>'
                self.toc += f'<li>{item}'
        # We are done with all headlines, but to have clean html we need to close the list elements
        # that were opened above.
        while level > startlevel - 1:
            self.toc += f'</li></{self.list_type}>'
            level -= 1
            
        # Return the toc html.
        return self.toc
        
    def get_random_id(self, length=6):
        """ Return a random string to be used as an id. """
        return ''.join(random.choices(string.ascii_letters + string.digits, k=length))

    def get_html(self):
        """ Return the parsed html and the toc, embed the toc within the nav template
        and combine both to a single string. """

        label = self.toc_header if self.toc_header is not None else self._navlabel
        id = self.get_random_id()
        toc = self._navhtml.format(id, label, self.get_toc()) if self.toc_structure is None else self.toc_structure.replace(self.placeholder, self.get_toc())

        if self.toc_only:
            return toc
        if self.placeholder in self._html:
            return self._html.replace(self.placeholder, toc)
        return toc + self._html


def main():
    # available options that can be changed via the command line
    options = ['in', 'out', 'help', 'header', 'nav', 'list-type', 'toc-only', 'placeholder']

    toc = tocParser()
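    outFile = None
    # No command seen yet; avoids an UnboundLocalError when the first argument has no leading --.
    currentCmd = ''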

    for i in range(len(sys.argv)):
        if i == 0:
            continue
        arg = sys.argv[i]
        # We have a command identified by -- remember it in currentCmd
        # in case this command needs an argument, or just set the
        # appropriate parameter without argument.
        if arg[0:2] == '--':
            currentCmd = arg[2:]
            if currentCmd not in options:
                print("Invalid argument %s" % currentCmd)
                sys.exit(1)
            if currentCmd == 'help':
                print(__doc__)
                sys.exit(0)
            if currentCmd == 'toc-only':
                toc.toc_only = True
                currentCmd = ''
        # We have an argument: apply it to the command that was
        # remembered from the previous iteration.
        elif len(currentCmd) > 0:
            if currentCmd == 'in':
                with open(arg, 'r', encoding='utf-8') as fp:
                    toc.feed(fp.read())
            elif currentCmd == 'out':
                outFile = arg
            elif currentCmd == 'header':
                toc.toc_header = arg
            elif currentCmd == 'nav':
                toc.toc_structure = arg
            elif currentCmd == 'list-type':
                if arg not in ['ul', 'ol']:
                    print("Invalid list-type %s" % arg)
                    sys.exit(1)
                toc.list_type = arg
            elif currentCmd == 'placeholder':
                toc.placeholder = arg
            currentCmd = ''
        else:
            print("Invalid or missing command for argument %s" % arg)
            sys.exit(1)
    
    if outFile is None:
        print(toc.get_html())
    else:
        with open(outFile, 'w', encoding='utf-8') as fp:
            fp.write(toc.get_html())


if __name__ == '__main__':
    main()

This is the complete script. Run python3 toc.py --help to list all possible arguments and see a short documentation of the script itself.