Extract date from multiple webpages with Python


Question:


I want to extract the date a news article was published on a website. For some websites I have the exact HTML element where the date/time lives (div, p, time), but for others I do not:

Here are links to some of those websites (German municipal sites):

(3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226

(Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=

(10/22/2020) http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905

I have tried 3 different solutions with Python libraries such as requests, htmldate, and date_guesser, but I always get None, or, in the case of htmldate, I always get the same date (2020-01-01):

from bs4 import BeautifulSoup
import requests
from htmldate import find_date
from date_guesser import guess_date, Accuracy

# Lib find_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
my_date = find_date(response.content, extensive_search=True)
print(my_date, '\n')


# Lib guess_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
my_date = guess_date(url=url, html=requests.get(url).text)
print(my_date.date, '\n')


# Lib requests  # I do NOT get a Last-Modified header
my_date = requests.head('http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226')
print(my_date.headers, '\n')

Am I doing something wrong?

Can you please tell me whether there is a way to extract the publication date from websites like these, where there are no specific div, p, or datetime elements?

IMPORTANT! I want to make the date extraction universal, so that I can put these links in a for loop and run the same function on all of them.


Answer 1:


I have never had much success with date-parsing libraries, so I usually go another route. I believe the best method to extract the date strings from the sites in your question is regular expressions.

website: linden.ch

import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime

url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[1], '%d. %b. %Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# Output: 03-11-2020

website: buchholterberg.ch

import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime

url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Veröffentlicht)\s\w+:\s(\d{1,2}:\d{1,2}:\d{1,2})\s(\d{1,2}\.\d{1,2}\.\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[2], '%d.%m.%Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# Output: 22-10-2020

Update 12-04-2020

I looked at the source code of the two Python libraries you mentioned, htmldate and date_guesser. Neither of them can currently extract the date from the 3 sources listed in your question. The primary reason is the date formats and the language (German) of these target sites.
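To illustrate the locale problem: `datetime.strptime` with `%b`/`%B` only understands English month names unless the process locale is changed, so a German date such as "3. November 2020" needs either `locale.setlocale` or a manual mapping. Below is a minimal sketch of the mapping approach; the `GERMAN_MONTHS` dict and `parse_german_date` helper are my own illustration, not part of htmldate or date_guesser:

```python
from datetime import datetime

# Full German month names mapped to month numbers (illustrative helper)
GERMAN_MONTHS = {
    'Januar': 1, 'Februar': 2, 'März': 3, 'April': 4,
    'Mai': 5, 'Juni': 6, 'Juli': 7, 'August': 8,
    'September': 9, 'Oktober': 10, 'November': 11, 'Dezember': 12,
}

def parse_german_date(text):
    """Parse a date like '3. November 2020' without touching the locale."""
    day, month_name, year = text.replace('.', '').split()
    return datetime(int(year), GERMAN_MONTHS[month_name], int(day))

print(parse_german_date('3. November 2020').strftime('%d-%m-%Y'))  # 03-11-2020
```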

I had some free time, so I put this together for you. The answer below can easily be modified to extract from any website and can be refined as needed based on the format of your target sources. It currently extracts the dates from all the links contained in urls.


All URLs:

import requests
import re as regex
from bs4 import BeautifulSoup

def extract_date(can_of_soup):
    page_body = can_of_soup.find('body')
    clean_body = str(page_body).replace('\n', '')
    if 'Datum der Neuigkeit' in clean_body or 'Veröffentlicht' in clean_body:
        # Two alternatives: "Datum der Neuigkeit 3. Nov. 2020" or
        # "Veröffentlicht am: 10:22:07 22.10.2020"
        date_formats = (r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})'
                        r'|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}\.\d{1,2}\.\d{4})')
        find_date = regex.search(date_formats, clean_body, regex.IGNORECASE)
        if find_date:
            # Drop the unmatched alternative's empty groups, keep the date part
            clean_tuples = [i for i in find_date.groups() if i]
            return clean_tuples[1]
    else:
        # Class names of the divs that hold the date on the remaining sites
        tags = ['extra', 'elementStandard elementText',
                'icms-block icms-information-date icms-text-gemeinde-color']
        for tag in tags:
            date_tag = page_body.find('div', {'class': tag})
            if date_tag is not None:
                children = date_tag.findChildren()
                if children:
                    find_date = regex.search(r'(\d{1,2}\.\d{1,2}\.\d{4})', str(children))
                    return ''.join(find_date.groups())
                else:
                    return ''.join(date_tag.contents)


def get_soup(target_url):
    response = requests.get(target_url)
    return BeautifulSoup(response.content, 'html.parser')
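One robustness note: requests.get without a timeout can hang indefinitely on a slow server. A hedged variant of the fetch helper (my own addition, not part of the answer above) that fails fast and surfaces HTTP errors instead of silently parsing an error page:

```python
import requests
from bs4 import BeautifulSoup

def get_soup_safe(target_url, timeout=10):
    """Fetch a page with a timeout and raise on HTTP errors (illustrative variant)."""
    response = requests.get(target_url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return BeautifulSoup(response.content, 'html.parser')
```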


urls = {'http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226',
    'http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0'
    '&sq=&kategorie_id=&date_from=&date_to=',
    'http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905',
    'https://www.steffisburg.ch/de/aktuelles/meldungen/Hochwasserschutz-und-Laengsvernetzung-Zulg.php',
    'https://www.wallisellen.ch/aktuellesinformationen/924227',
    'http://www.winkel.ch/de/aktuellesre/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id'
    '=1093910&ls=0&sq=&kategorie_id=&date_from=&date_to=',
    'https://www.aeschi.ch/de/aktuelles/mitteilungen/artikel/?tx_news_pi1%5Bnews%5D=87&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5Baction%5D=detail&cHash=ab4d329e2f1529d6e3343094b416baed'}


for url in urls:
    soup = get_soup(url)
    article_date = extract_date(soup)
    print(article_date)
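One caveat with the loop above: extract_date returns whatever string the page happened to contain, so "3. November 2020" and "22.10.2020" come back in different shapes. If you need a uniform format, a small normalizer could follow. This is my own sketch (the `normalize_date` helper is not part of the answer); it handles only the two shapes seen on these sites and returns None for anything else:

```python
import re
from datetime import date

# Full German month names mapped to month numbers (illustrative helper)
GERMAN_MONTHS = {
    'Januar': 1, 'Februar': 2, 'März': 3, 'April': 4,
    'Mai': 5, 'Juni': 6, 'Juli': 7, 'August': 8,
    'September': 9, 'Oktober': 10, 'November': 11, 'Dezember': 12,
}

def normalize_date(date_string):
    """Normalize '22.10.2020' or '3. November 2020' to ISO 'YYYY-MM-DD'."""
    text = date_string.strip()
    # Numeric form, e.g. '22.10.2020'
    m = re.fullmatch(r'(\d{1,2})\.(\d{1,2})\.(\d{4})', text)
    if m:
        day, month, year = map(int, m.groups())
        return date(year, month, day).isoformat()
    # Textual form, e.g. '3. November 2020'
    m = re.fullmatch(r'(\d{1,2})\.?\s+(\w+)\s+(\d{4})', text)
    if m and m.group(2) in GERMAN_MONTHS:
        return date(int(m.group(3)), GERMAN_MONTHS[m.group(2)],
                    int(m.group(1))).isoformat()
    return None  # unrecognized shape

print(normalize_date('22.10.2020'))        # 2020-10-22
print(normalize_date('3. November 2020'))  # 2020-11-03
```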


Source: https://stackoverflow.com/questions/65095206/extract-date-from-multiple-webpages-with-python
