If I have a nested HTML (unordered) list that looks like this:
You can take a recursive approach:
from pprint import pprint
from bs4 import BeautifulSoup

text = """your html goes here"""

def find_li(element):
    # For every direct-child <ul>, map each of its direct-child <li>'s link
    # target to the result of recursing into that <li>.
    return [{li.a['href']: find_li(li)}
            for ul in element('ul', recursive=False)
            for li in ul('li', recursive=False)]

soup = BeautifulSoup(text, 'html.parser')
data = find_li(soup)
pprint(data)
Prints:
[{u'Page1_Level1.html': [{u'Page1_Level2.html': [{u'Page1_Level3.html': []},
                                                 {u'Page2_Level3.html': []},
                                                 {u'Page3_Level3.html': []}]}]},
 {u'Page2_Level1.html': [{u'Page2_Level2.html': []}]}]
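Since the list itself isn't shown above, here is one possible input, reconstructed from that output (so only a guess at the original markup, with made-up link text), that produces the same structure when run through find_li:

# Hypothetical input, reverse-engineered from the pprint output above --
# not the asker's actual HTML; the link texts are invented.
text = """
<ul>
    <li><a href="Page1_Level1.html">Page 1</a>
        <ul>
            <li><a href="Page1_Level2.html">Page 1.1</a>
                <ul>
                    <li><a href="Page1_Level3.html">Page 1.1.1</a></li>
                    <li><a href="Page2_Level3.html">Page 1.1.2</a></li>
                    <li><a href="Page3_Level3.html">Page 1.1.3</a></li>
                </ul>
            </li>
        </ul>
    </li>
    <li><a href="Page2_Level1.html">Page 2</a>
        <ul>
            <li><a href="Page2_Level2.html">Page 2.1</a></li>
        </ul>
    </li>
</ul>
"""

soup = BeautifulSoup(text, 'html.parser')
pprint(find_li(soup))  # same nesting as above (no u'' prefixes on Python 3)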
FYI, here is why I had to use html.parser here:
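The short version, as I understand it: parsers like lxml wrap a bare fragment in <html><body>...</body></html>, so the top-level <ul> is no longer a direct child of the soup and the recursive=False lookup comes back empty, while html.parser leaves the fragment as-is. A quick sketch of the difference (it assumes lxml is installed and uses a made-up fragment):

# Illustrative fragment only -- not the asker's markup.
from bs4 import BeautifulSoup

fragment = '<ul><li><a href="a.html">a</a></li></ul>'

plain = BeautifulSoup(fragment, 'html.parser')
print(plain('ul', recursive=False))    # finds the <ul>: it is a direct child of the soup

wrapped = BeautifulSoup(fragment, 'lxml')
print(wrapped('ul', recursive=False))  # [] -- lxml inserted <html><body>, so the <ul>
                                       # is no longer a direct child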