jugad2 - Vasudev Ram on software innovation: html5lib

Tuesday, February 10, 2015

Recursively dumping the structure of an HTML5 document

A while ago I wrote this post,

The html5lib Python library (and Animatron :-)

which shows basic usage of a Python library called html5lib, which lets you parse HTML5 documents and then walk through their structure.

That post uses this HTML5 document as input for the program shown in it:

Yesterday I modified the program (test_html5lib.py) shown in that earlier post to make it recursive, which simplifies it. Here is the code for the resulting program, html5_dump.py.

# Demo program to show how to dump the structure of 
# an HTML5 document to text, using html5lib.
# Author: Vasudev Ram.
# Copyright 2015 Vasudev Ram - http://www.dancingbison.com

import html5lib

# Define a function to dump HTML5 element info recursively, 
# given a top-level element.
def print_element(elem, indent, level):
    for sub_elem in elem:
        print "{}{}".format(indent * level, sub_elem)
        # Recursive call to print_element().
        print_element(sub_elem, indent, level + 1)

# Parse the HTML document; the with block closes the file afterwards.
with open("html5doc.html") as f:
    tree = html5lib.parse(f)
indent = '----'
level = 0
print_element(tree, indent, level)

I ran the program with:
$ py html5_dump.py

where the py in the command refers to py, the Python Launcher for Windows.

Here is the program output. As you can see, it is essentially the same as the output of the previous version, but produced using recursion.

<Element u'{http://www.w3.org/1999/xhtml}head' at 0x02978938>
<Element u'{http://www.w3.org/1999/xhtml}body' at 0x02978968>
----<Element u'{http://www.w3.org/1999/xhtml}header' at 0x02978980>
--------<Element u'{http://www.w3.org/1999/xhtml}h1' at 0x02978920>
--------<Element u'{http://www.w3.org/1999/xhtml}h2' at 0x02978B00>
--------<Element u'{http://www.w3.org/1999/xhtml}h3' at 0x02978AB8>
----<Element u'{http://www.w3.org/1999/xhtml}p' at 0x02978AE8>
----<Element u'{http://www.w3.org/2000/svg}svg' at 0x02978788>
--------<Element u'{http://www.w3.org/2000/svg}defs' at 0x02A12050>
--------<Element u'{http://www.w3.org/2000/svg}rect' at 0x02A12020>
--------<Element u'{http://www.w3.org/2000/svg}text' at 0x02A12068>
----<Element u'{http://www.w3.org/1999/xhtml}footer' at 0x02A12080>

The recursion helps in two ways: 1) it prints sub-elements at any depth, not just the three levels the earlier program handled, and 2) we do not have to keep track of the indentation level ourselves - each nested call receives level + 1, and when the call returns, the caller's own level is unchanged. See the line:

print_element(sub_elem, indent, level + 1)
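To make the second point concrete, here is a minimal sketch of the same recursion that returns the lines instead of printing them. It is not the program above: dump_lines is a hypothetical name, the tree is a small hand-built xml.etree.ElementTree stand-in (assuming html5lib's tree behaves like ElementTree elements, as the output above suggests), and it prints the tag names rather than the element reprs.

```python
# Sketch only: a hand-built ElementTree stands in for the tree that
# html5lib.parse() would return, so this runs without html5lib installed.
import xml.etree.ElementTree as ET

def dump_lines(elem, indent="----", level=0):
    """Return one line per descendant of elem, indented by depth."""
    lines = []
    for sub_elem in elem:
        lines.append("{}{}".format(indent * level, sub_elem.tag))
        # The recursive call gets level + 1; when it returns, this
        # frame's own level is unchanged, so no manual bookkeeping is needed.
        lines.extend(dump_lines(sub_elem, indent, level + 1))
    return lines

# Hypothetical miniature document (not the html5doc.html from the post).
root = ET.Element("html")
ET.SubElement(root, "head")
body = ET.SubElement(root, "body")
header = ET.SubElement(body, "header")
ET.SubElement(header, "h1")
ET.SubElement(body, "p")

for line in dump_lines(root):
    print(line)
```

Returning a list rather than printing makes the function easier to test and reuse, at the cost of holding all the lines in memory.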

However, for very deeply nested documents, we have to keep Python's recursion depth limit in mind.
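One way around the limit, sketched here under assumptions, is to replace the recursion with an explicit stack. dump_lines_iterative is a hypothetical name, and the artificially deep tree is built by hand with xml.etree.ElementTree just to exercise it; real HTML documents are rarely nested this deeply.

```python
import sys
import xml.etree.ElementTree as ET

def dump_lines_iterative(elem, indent="----"):
    """Same traversal order as a recursive dump, using an explicit stack."""
    lines = []
    # Each stack entry is (element, level). Children are pushed in reverse
    # so that they are popped in document order.
    stack = [(child, 0) for child in reversed(list(elem))]
    while stack:
        node, level = stack.pop()
        lines.append("{}{}".format(indent * level, node.tag))
        stack.extend((child, level + 1) for child in reversed(list(node)))
    return lines

# Hypothetical test tree, nested deeper than the current recursion limit.
depth = sys.getrecursionlimit() + 100
root = ET.Element("root")
node = root
for _ in range(depth):
    node = ET.SubElement(node, "div")

lines = dump_lines_iterative(root)
print(len(lines))
```

The trade-off is that the level has to be carried along explicitly in the stack entries, which is exactly the bookkeeping the recursive version gets for free.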

Enjoy.

- Vasudev Ram - online Python trainer and freelance programmer

Saturday, March 8, 2014

The html5lib Python library (and Animatron :-)

By Vasudev Ram

I came across the html5lib Python library recently. The site describes it as follows:

"html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers."

So it doesn't say explicitly that it is for parsing HTML5, though the library's name includes "5". But I tried it out on a simple HTML5 document and it seems to be able to parse HTML5 - at least the few HTML5 elements I tried it on.

Here's the code I used to try out html5lib:

# test_html5lib.py
# A program to try out the html5lib Python library.
# Author: Vasudev Ram - www.dancingbison.com
import html5lib

# Parse the HTML document; the with block closes the file afterwards.
with open("html5doc.html") as f:
    tree = html5lib.parse(f)
print "tree:"
print repr(tree)
print
print "items in tree:"

for item in tree:
    print item
    for item2 in item:
        print "-" * 4, item2
        for item3 in item2:
            print "-" * 8, item3

And here is the output of running python test_html5lib.py:

<Element u'{http://www.w3.org/1999/xhtml}head' at 0x02B663C8>
<Element u'{http://www.w3.org/1999/xhtml}body' at 0x02B66488>
---- <Element u'{http://www.w3.org/1999/xhtml}header' at 0x02B664B8>
-------- <Element u'{http://www.w3.org/1999/xhtml}h1' at 0x02B66530>
-------- <Element u'{http://www.w3.org/1999/xhtml}h2' at 0x02B664E8>
-------- <Element u'{http://www.w3.org/1999/xhtml}h3' at 0x02B665F0>
---- <Element u'{http://www.w3.org/1999/xhtml}p' at 0x02B66650>
---- <Element u'{http://www.w3.org/2000/svg}svg' at 0x02B66BC0>
-------- <Element u'{http://www.w3.org/2000/svg}defs' at 0x02B66B60>
-------- <Element u'{http://www.w3.org/2000/svg}rect' at 0x02B66B30>
-------- <Element u'{http://www.w3.org/2000/svg}text' at 0x02B66BD8>
---- <Element u'{http://www.w3.org/1999/xhtml}footer' at 0x02B66BF0>
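The element names in this output are in the "{namespace}name" form, with the XHTML or SVG namespace in braces before the local name. As a rough sketch, assuming the elements expose that string as an ElementTree-style tag attribute (which the repr above suggests), a small helper can split the two parts. split_tag is a hypothetical name, not part of html5lib.

```python
def split_tag(tag):
    """Split a '{namespace}local' tag into (namespace, local).

    Tags without a namespace are returned as (None, tag).
    """
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return None, tag

# Tag strings copied from the output above.
print(split_tag("{http://www.w3.org/1999/xhtml}head"))
print(split_tag("{http://www.w3.org/2000/svg}rect"))
```

Printing only the local name would make the dump shorter, but it would hide the fact that the svg elements live in a different namespace from the rest of the page.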

Here is the documentation for html5lib.

And speaking of HTML5, I coincidentally came across Animatron via Hacker News today:

Animatron is "a simple and powerful online tool that allows you to create stunning HTML5 animations and interactive content." Animatron is not really related to html5lib, except for the fact that both of them are about HTML5, but it looks cool. Check it out.

Hacker News thread about Animatron.

Enjoy.

- Vasudev Ram - Dancing Bison Enterprises
