Using BeautifulSoup to extract specific dl and dd list elements

陌路散爱 提交于 2019-12-01 21:52:55

问题


My first time posting. I am using BeautifulSoup 4 and python 2.7 (pycharm). I have a webpage containing elements and I need to extract specific elements where the tags are either 'Salary:' or 'Date:', the page contains multiple lists .

The problem: I cannot seem to identify and extract specific text. I have searched this site and tried without success.

Example html:

<dl><dt>Date:</dt><dd>13 September 2015</dd><dt>Salary:</dt><dd>Starting at £40,130 per annum.</dd></dl><dl><dt>Date:</dt><dd>15 December 2015</dd><dt>Salary:</dt><dd>Starting at £22,460 per annum.</dd></dl><dl><dt>Date:</dt><dd>10 January 2014</dd><dt>Salary:</dt><dd>Starting at £18,160 per annum.</dd></dl>

Code which I have tried without success:

r = requests.get("http://www.mywebsite.com/test.html")
soup = BeautifulSoup(r.content, "html.parser")
dl_data = soup.find_all("dl")
for dlitem in dl_data: 
    print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
    print dlitem.find("dt",text="Salary:").parent.findNext("dd").contents[0]

Expected Result:

13 September 2015
15 December 2015
10 January 2014
Starting at £40,130 per annum.
Starting at £22,460 per annum.
Starting at £18,160 per annum.

Actual Result:

print dlitem.find("dt",text="Date:").parent.findNext("dd").contents[0]
AttributeError: 'NoneType' object has no attribute 'parent'

I have tried numerous variations of this code and gone round in circles, I figured out how to print out all dd elements to screen, just not specific dd elements!

Thanks


回答1:


If order is not important just make some changes:

...
dl_data = soup.find_all("dd")
for dlitem in dl_data:
    print dlitem.string

Result:

13 September 2015
Starting at £40,130 per annum.
15 December 2015
Starting at £22,460 per annum.
10 January 2014
Starting at £18,160 per annum.

For your latest request:

for item in list(zip(soup.find_all("dd")[0::3],soup.find_all("dd")[2::3])):
    date, salary = item
    print ', '.join([date.string, salary.string])

Output:

13 September 2015, 100
14 September 2015, 200


来源:https://stackoverflow.com/questions/32475700/using-beautifulsoup-to-extract-specific-dl-and-dd-list-elements

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!