Python: scraping a date from an HTML page (June 10, 2017)

Question

How can I extract the date "June 03, 2017" from an HTML page that contains the table data below? The date changes with the order number. I am not sure whether I am using BeautifulSoup correctly. Please advise.

<tr>
   <td style="font:bold 24px Arial;">Order #12345</td>
    <td style="font:13px Arial;"><strong>Order Date:</strong> June 03, 2017</td>
</tr>

Below is the sample code I have written:

import re
import requests
from bs4 import BeautifulSoup

# 'url' stands for the actual link to the HTML page
data = requests.get('url').content
soup = BeautifulSoup(data, "html.parser")

on = soup.find_all(text=re.compile("Order #"))
print (on)

od = soup.find_all(text=re.compile("Order Date")).next_element()
print (od)

I get the following error after running the code above.

Error:
['Order #12345']
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    od = soup.find_all(text=re.compile("Order Date")).next_element()
AttributeError: 'ResultSet' object has no attribute 'next_element'

Answer 1:

This approach may not be ideal if the page you are parsing contains other tables, because it assumes every row has exactly two cells. If there is only one table, it should work.

EDIT: added an example of how to parse the actual date from the string.

In[19]: from datetime import datetime
   ...: 
   ...: from bs4 import BeautifulSoup
   ...: 
   ...: html = '''\
   ...: <tr>
   ...:    <td style="font:bold 24px Arial;">Order #12345</td>
   ...:     <td style="font:13px Arial;"><strong>Order Date:</strong> June 03, 2017</td>
   ...: </tr>
   ...: '''
   ...: soup = BeautifulSoup(html, 'lxml')
   ...: 
   ...: for row in soup.find_all('tr'):
   ...:     order_number, order_date = row.find_all('td')
   ...:     print(order_number.text)
   ...:     print(order_date.text)
   ...:     d = datetime.strptime(order_date.text, 'Order Date: %B %d, %Y')
   ...:     print(d.year, d.month, d.day)
   ...: 
Order #12345
Order Date: June 03, 2017
2017 6 3
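
The strptime call above expects the cell text to match 'Order Date: %B %d, %Y' exactly, while the question also writes the date as "June 03,2017" with no space after the comma. A minimal sketch of a more tolerant parser follows. The helper name parse_order_date and the regular expression are assumptions made for illustration, not part of the original answer, and %B only matches English month names when the process runs in an English or C locale.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed pattern: month name, day, comma, optional whitespace, four-digit year.
DATE_RE = re.compile(r"([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})")


def parse_order_date(cell_text: str) -> Optional[date]:
    """Return the date found in cell_text, or None if there is none.

    Hypothetical helper: it ignores the "Order Date:" label and accepts
    both "June 03, 2017" and "June 03,2017".
    """
    match = DATE_RE.search(cell_text)
    if match is None:
        return None
    month, day, year = match.groups()
    try:
        # %B is the full month name; this depends on the current locale.
        return datetime.strptime(f"{month} {day} {year}", "%B %d %Y").date()
    except ValueError:
        # For example, a word that is not a month name, or day 32.
        return None


print(parse_order_date("Order Date: June 03, 2017"))
print(parse_order_date("Order Date: June 03,2017"))
print(parse_order_date("Order #12345"))
```

Returning None instead of raising keeps a loop over many rows running when one cell holds no date; whether that is preferable to failing loudly depends on the page being scraped.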

Answer 2:

Alternatively,

>>> import requests
>>> import bs4
>>> soup = bs4.BeautifulSoup('''\
... <tr>
...     <td style="font:bold 24px Arial;">Order #12345</td>
...     <td style="font:13px Arial;"><strong>Order Date:</strong> June 03, 2017</td>
... </tr>''', 'lxml')
>>> soup.find_all(text=bs4.re.compile("Order #"))[0][7:]
'12345'
>>> soup.find_all(text=bs4.re.compile("Order Date:"))[0].parent.next.next.strip()
'June 03, 2017'

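There is no need to import re separately here because bs4 itself imports it, so it is reachable as bs4.re. That is an implementation detail rather than a documented API, so importing re directly is the safer choice. I followed your approach: I looked for the text and then navigated from there.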
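
For reference, the AttributeError in the question has two causes: find_all() returns a ResultSet (a list of matches) rather than a single node, and next_element is an attribute, not a method. A minimal sketch of the question's own approach with both points corrected is shown below. It assumes the sample HTML from the question, the built-in "html.parser" backend, and a bs4 version that accepts the string= argument (4.4.0 or later); on older versions text= plays the same role.

```python
import re

from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<tr>
   <td style="font:bold 24px Arial;">Order #12345</td>
   <td style="font:13px Arial;"><strong>Order Date:</strong> June 03, 2017</td>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching text node (or None), not a ResultSet.
label = soup.find(string=re.compile(r"Order Date"))

order_date = None
if label is not None:
    # next_element is an attribute (no parentheses). Here it is the text
    # node that follows the <strong> label inside the same <td>.
    following = label.next_element
    if isinstance(following, str):
        order_date = following.strip()

print(order_date)
```

If the page can contain several orders, iterating over soup.find_all(string=...) and applying the same next_element step to each match would be one way to extend this.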

Source: https://stackoverflow.com/questions/44530033/python-scraping-date-from-html-page-june-10-2017

Tags

python

beautifulsoup

screen-scraping