Web crawler to extract in between the list

Submitted by 谁说我不能喝 on 2019-12-13 20:01:07

Question


I am writing a web crawler in Python. I want to get all the content between <li> and </li> tags. For example:

<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>

Here I want to:

a.) extract the date and convert it into dd/mm/yyyy format

b.) extract the number before "people".

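from bs4 import BeautifulSoup

# page1 holds the HTML of the fetched page
soup = BeautifulSoup(page1, "html.parser")
for li in soup.find_all("li"):
    # print the text of each <li>, dropping non-ASCII characters
    print(li.get_text().encode("ascii", "ignore").decode("ascii"))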

I can only extract the text right now.


Answer 1:


Get the element's text with .text and split it at the first occurrence of ":". Convert the date part to a datetime with strptime(), passing the existing format %B %d, %Y, then turn it back into a string with strftime() and the desired format %d/%m/%Y. Extract the number with the regular expression At least (\d+), where (\d+) is a capturing group that matches one or more digits:

from datetime import datetime
import re

from bs4 import BeautifulSoup


data = '<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>'
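# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(data, 'html.parser')

# split at the first colon: date on the left, the rest on the right
date_string, rest = soup.li.text.split(':', 1)

print(datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y'))
print(re.match(r'At least (\d+)', rest.strip()).group(1))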

Prints:

13/01/1991
40
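
Since the question loops over every <li> on the page, the same steps could be wrapped in a small helper that tolerates items that do not follow the pattern. This is only a sketch: parse_item is a hypothetical name, not part of the original answer, and it assumes each item looks like "Month day, year: ... N people". It works on plain text, so it needs only the standard library:

```python
from datetime import datetime
import re


def parse_item(text):
    """Return (date as dd/mm/yyyy, count) for one list item, or None.

    Hypothetical helper; assumes the item text looks like
    'January 13, 1991: At least 40 people'.
    """
    # Split at the first colon: date on the left, description on the right.
    date_string, sep, rest = text.partition(':')
    if not sep:
        return None
    try:
        date = datetime.strptime(date_string.strip(), '%B %d, %Y')
    except ValueError:
        # The part before the colon is not a date in the expected format.
        return None
    # The number directly before the word "people".
    match = re.search(r'(\d+)\s+people', rest)
    if match is None:
        return None
    return date.strftime('%d/%m/%Y'), int(match.group(1))


items = [
    'January 13, 1991: At least 40 people  ',
    'Some unrelated list item',
]
print([parse_item(text) for text in items])
# prints [('13/01/1991', 40), None]
```

With BeautifulSoup, the strings could come from li.get_text() for each element returned by soup.find_all("li"). Counts written with thousands separators, such as "1,000 people", would need a different pattern.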


Source: https://stackoverflow.com/questions/27822862/web-crawler-to-extract-in-between-the-list
