问题
I need to extract tweets embedded in text articles. The problem with the pages I'm testing is that they load tweets in ~5 out of 10 runs. So I need to use Selenium to wait for the page to load but I cannot make it work. I followed steps from their official website:
url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path='/Users/ME/Downloads/chromedriver', chrome_options=options)
driver.implicitly_wait(15)
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
I cannot use the option to wait for a certain element to appear because I'm scanning different pages and not all of them have embedded tweets. So to check if Selenium actually works or not I run the above script together with the script which doesn't use Selenium and compare their results:
url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
I will really appreciate the help of this wonderful community!
来源:https://stackoverflow.com/questions/51214853/scrap-embedded-tweets-from-a-webpage-with-selenium-and-beautifulsoup