HTML Parse tree using Python 2.7

Theodros Zelleke

This answer comes a bit late, but still I'd like to share it:

I used networkx and lxml (which I found allows much more elegant traversal of the DOM tree). However, the tree layout depends on graphviz and pygraphviz being installed; networkx on its own would just distribute the nodes somehow on the canvas. The code is longer than strictly required because I draw the labels myself to have them boxed (networkx can draw the labels, but in the version I used it doesn't pass the bbox keyword on to matplotlib).

import networkx as nx
from lxml import html
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout

raw = "...your raw html"

def traverse(parent, graph, labels):
    labels[parent] = parent.tag
    for node in parent.getchildren():
        graph.add_edge(parent, node)
        traverse(node, graph, labels)

G = nx.DiGraph()
labels = {}     # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)

pos = graphviz_layout(G, prog='dot')

label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}

nx.draw_networkx_edges(G, pos, arrows=True)
ax = plt.gca()

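# Draw each tag name as a boxed text label at its node position.
for node, label in labels.items():
    x, y = pos[node]
    ax.text(x, y, label,
            bbox=bbox_props,
            **label_props)

plt.show()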
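
If graphviz and pygraphviz are not available, a tree-shaped `pos` dictionary can also be computed by hand. The following is only a minimal sketch, not part of networkx: `tree_layout` and its parameters are hypothetical names, and it assumes the graph really is a tree (one root, no cycles). Leaves are spaced evenly from left to right and each parent is centred above its first and last child.

```python
def tree_layout(root, children, x_gap=1.0, y_gap=1.0):
    """Return {node: (x, y)} for a tree.

    `children` is any callable mapping a node to an iterable of its
    child nodes (hypothetical interface, chosen for this sketch).
    """
    pos = {}
    next_leaf_x = [0.0]  # mutable counter shared with the nested function

    def place(node, depth):
        kids = list(children(node))
        if not kids:
            # Leaves take the next free horizontal slot.
            x = next_leaf_x[0]
            next_leaf_x[0] += x_gap
        else:
            # Parents sit midway between their first and last child.
            xs = [place(kid, depth + 1) for kid in kids]
            x = (xs[0] + xs[-1]) / 2.0
        pos[node] = (x, -depth * y_gap)  # root at the top, deeper nodes lower
        return x

    place(root, 0)
    return pos


# Small stand-in tree; a plain dict plays the role of the graph here.
demo = {"html": ["head", "body"], "head": ["title"], "body": ["div", "p"]}
demo_pos = tree_layout("html", lambda n: demo.get(n, []))
```

With the graph built above, something like `pos = tree_layout(html_tag, G.successors)` should give a dictionary that can be used wherever `pos` is used. The function is recursive, so a very deeply nested document could hit Python's recursion limit.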


Changes to the code if you prefer (or have) to use BeautifulSoup:

I'm no expert, I just looked at BS4 for the first time, but it works:

#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString


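def traverse(parent, graph, labels):
    # BeautifulSoup tags expose their tag name as .name
    labels[hash(parent)] = parent.name
    for node in parent.children:
        # Skip text nodes; only tags become graph nodes.
        if isinstance(node, NavigableString):
            continue
        graph.add_edge(hash(parent), hash(node))
        traverse(node, graph, labels)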


#html_tag = html.document_fromstring(raw)
soup = BeautifulSoup(raw)
html_tag = next(soup.children)


Python modules:
1. ETE, but it requires Newick format data.
2. GraphViz + pydot. See this SO answer.

JavaScript: the amazing d3 TreeLayout, which uses JSON format.
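
For the d3 route, the element tree first has to be turned into nested JSON. Below is a rough sketch of one way to do that. The function name `element_to_dict` is made up for this example, and the `"name"`/`"children"` keys follow the convention seen in many d3 tree examples rather than anything d3 strictly requires. The demo parses a tiny well-formed snippet with the standard library's ElementTree so that it runs without extra packages; lxml elements offer the same `.tag` attribute and child iteration, so the function should work on them as well.

```python
import json
import xml.etree.ElementTree as ET  # stand-in parser for this demo only


def element_to_dict(element):
    """Convert an element and its descendants into nested dicts."""
    node = {"name": element.tag}
    # Comments and processing instructions have a callable .tag; skip them.
    kids = [element_to_dict(child) for child in element
            if not callable(child.tag)]
    if kids:
        node["children"] = kids
    return node


sample = ET.fromstring(
    "<html><head><title/></head><body><div/><p/></body></html>")
tree_json = json.dumps(element_to_dict(sample))
```

The resulting `tree_json` string can be written to a file and loaded on the JavaScript side.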

If you're using ETE, you'll need to convert the HTML to Newick format. Here's a small example I made:

from lxml import html
from urllib import urlopen  # Python 2.7; on Python 3 use: from urllib.request import urlopen

def getStringFromNode(node):
    # Customize this according to
    # your requirements.
    node_string = node.tag
    if node.get('id'):
        node_string += '-' + node.get('id')
    if node.get('class'):
        node_string += '-' + node.get('class')
    return node_string

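def xmlToNewick(node):
    node_string = getStringFromNode(node)
    # Recursively convert every child element first.
    nwk_children = []
    for child in node.iterchildren():
        nwk_children.append(xmlToNewick(child))
    # Inner nodes are written as "(child1,child2,...)name", leaves as "name".
    if nwk_children:
        return "(%s)%s" % (','.join(nwk_children), node_string)
    else:
        return node_string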

def main():
    html_page = html.fromstring(urlopen('').read())
    newick_page = xmlToNewick(html_page)
    return newick_page


Output (in Newick format):

'((meta,title,script,style,style,script)head,(script,textarea-csi,(((b-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,(u)a-gb1)nobr)div-gbar,((span-gbn-gbi,span-gbf-gbf,span-gbe,a-gb4,a-gb4,a-gb_70-gb4)nobr)div-guser,div-gbh,div-gbh)div-mngb,(br-lgpd,(((div)div-hplogo)div,br)div-lga,(((td,(input,input,input,(input-lst)div-ds,br,((input-lsb)span-lsbb)span-ds,((input-lsb)span-lsbb)span-ds)td,(a,a)td-fl sblc)tr)table,input-gbv)form,div-gac_scont,(br,((a,a,a,a,a,a,a,a,a)font-addlang,br,br)div-als)div,(((a,a,a,a,a-fehl)div-fll)div,(a)p)span-footer)center,div-xjsd,(script)div-xjsi,script)body)html'
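
Note that one label in this output, `td-fl sblc`, contains a space, because a class attribute can hold several class names. In Newick, parentheses, commas, colons, semicolons and whitespace are structural in unquoted labels, so ids or classes containing them may confuse a parser. One possible precaution, sketched here with a hypothetical helper `sanitize_newick_label` that is not part of the code above, is to replace anything outside a conservative character set before building the string:

```python
import re


def sanitize_newick_label(label):
    """Replace characters that could be structural in Newick with '_'.

    Keeps only letters, digits, '_', '.' and '-' (a conservative choice
    made for this sketch, not a rule taken from ETE).
    """
    return re.sub(r"[^A-Za-z0-9_.-]", "_", label)


cleaned = sanitize_newick_label("td-fl sblc")
```

It could be applied to the return value of `getStringFromNode`. A Newick tree is also normally terminated with a semicolon, so a parser may expect `';'` appended to the final string.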

After that you can use ETE as shown in their examples.

Hope that helps.
