Tuesday, February 10, 2015

Recursively dumping the structure of an HTML5 document

By Vasudev Ram





A while ago I wrote this post:

The html5lib Python library (and Animatron :-)

It shows basic usage of html5lib, a Python library that lets you parse HTML5 documents and then walk through their structure.

That post uses this HTML5 document as input for the program shown in it:


Yesterday I modified the program (test_html5lib.py) shown in that earlier post to make it recursive, which also simplifies it. Here is the code for the resulting program, html5_dump.py.
# Demo program to show how to dump the structure of 
# an HTML5 document to text, using html5lib.
# Author: Vasudev Ram.
# Copyright 2015 Vasudev Ram - http://www.dancingbison.com

import html5lib

# Define a function to dump HTML5 element info recursively, 
# given a top-level element.
def print_element(elem, indent, level):
    for sub_elem in elem:
        # Parenthesized single-argument print works on Python 2 and 3.
        print("{}{}".format(indent * level, sub_elem))
        # Recursive call to print_element().
        print_element(sub_elem, indent, level + 1)

# Parse the HTML document; the with block closes the file afterwards.
with open("html5doc.html") as f:
    tree = html5lib.parse(f)
indent = '----'
level = 0
print_element(tree, indent, level)
I ran the program with:
$ py html5_dump.py

where py in the command is the Python Launcher for Windows.

Here is the program output. It is essentially the same as the output of the previous version, but produced using recursion.
<Element u'{http://www.w3.org/1999/xhtml}head' at 0x02978938>
<Element u'{http://www.w3.org/1999/xhtml}body' at 0x02978968>
----<Element u'{http://www.w3.org/1999/xhtml}header' at 0x02978980>
--------<Element u'{http://www.w3.org/1999/xhtml}h1' at 0x02978920>
--------<Element u'{http://www.w3.org/1999/xhtml}h2' at 0x02978B00>
--------<Element u'{http://www.w3.org/1999/xhtml}h3' at 0x02978AB8>
----<Element u'{http://www.w3.org/1999/xhtml}p' at 0x02978AE8>
----<Element u'{http://www.w3.org/2000/svg}svg' at 0x02978788>
--------<Element u'{http://www.w3.org/2000/svg}defs' at 0x02A12050>
--------<Element u'{http://www.w3.org/2000/svg}rect' at 0x02A12020>
--------<Element u'{http://www.w3.org/2000/svg}text' at 0x02A12068>
----<Element u'{http://www.w3.org/1999/xhtml}footer' at 0x02A12080>
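
The lines above show each element's repr, with the namespace in braces in front of the tag name. If only the bare tag names are wanted, something like the following could work. It is a minimal sketch, not part of html5_dump.py: local_name and dump_names are hypothetical names, and the tree is built with the standard library's xml.etree.ElementTree as a stand-in, on the assumption that every node has a plain string tag like the ones in the output above.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    # Assumption: tag is a string (ordinary elements only, no comment nodes).
    if tag.startswith("{"):
        return tag.split("}", 1)[1]
    return tag

def dump_names(elem, indent="----", level=0):
    """Recursively collect the indented local tag names of elem's descendants."""
    lines = []
    for sub_elem in elem:
        lines.append("{}{}".format(indent * level, local_name(sub_elem.tag)))
        lines.extend(dump_names(sub_elem, indent, level + 1))
    return lines

# Small stand-in document using the same XHTML namespace as the output above.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head/><body><header><h1/></header><p/></body></html>'
)
print("\n".join(dump_names(doc)))
```

Returning a list of lines instead of printing inside the function makes the result easy to check or write to a file.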

The recursion helps in two ways: 1) it prints sub-elements at any nesting depth, and 2) we do not have to keep track of the indentation level ourselves. Each nested call receives level + 1, and when the call returns, the caller carries on with its own level, so the Python interpreter's handling of nested calls takes care of that for us. See the line:
print_element(sub_elem, indent, level + 1)
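However, with very deeply nested documents, we have to keep Python's recursion depth limit in mind: CPython raises an error once the call stack grows past sys.getrecursionlimit(), which is 1000 by default.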
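
One way to sidestep the depth limit is to replace the call stack with an explicit list. The following is a minimal sketch, not part of the original program: dump_lines is a hypothetical name, and it uses the standard library's xml.etree.ElementTree to build a small stand-in tree, on the assumption that the tree returned by html5lib.parse() can be iterated in the same way (the Element reprs in the output above suggest it can).

```python
import xml.etree.ElementTree as ET

def dump_lines(root, indent="----"):
    """Return one line per descendant of root, indented by depth, without recursion."""
    lines = []
    # Stack of (element, level) pairs. Children are pushed in reverse
    # so that they are popped, and therefore listed, in document order.
    stack = [(child, 0) for child in reversed(list(root))]
    while stack:
        elem, level = stack.pop()
        lines.append("{}{}".format(indent * level, elem.tag))
        for child in reversed(list(elem)):
            stack.append((child, level + 1))
    return lines

# Small stand-in tree; a real run would pass the tree from html5lib.parse().
root = ET.fromstring(
    "<html><head/><body><header><h1/><h2/></header><p/></body></html>"
)
for line in dump_lines(root):
    print(line)
```

The trade-off is that the level now has to be stored alongside each element, which is exactly the bookkeeping the recursive version gets for free.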

Enjoy.

- Vasudev Ram - online Python trainer and freelance programmer
