beautifulsoup

Trying to collect data from local files using BeautifulSoup

Submitted by 旧巷老猫 on 2019-12-24 11:58:41
Question: I want to run a Python script to parse HTML files and collect a list of all the links with a target="_blank" attribute. I've tried the following, but it's not getting anything from bs4. The docs say SoupStrainer takes arguments the same way as findAll etc., so should this work? Am I missing some stupid error?

    import os
    import sys
    from bs4 import BeautifulSoup, SoupStrainer
    from unipath import Path

    def main():
        ROOT = Path(os.path.realpath(__file__)).ancestor(3)
        src = ROOT.child("src")
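
SoupStrainer does accept the same filters as find_all, so a strainer built from the target attribute should find those links in local files. A minimal sketch, assuming the HTML files sit in a folder named src (the path is hypothetical):

    import os
    from bs4 import BeautifulSoup, SoupStrainer

    # parse only <a target="_blank"> elements; SoupStrainer accepts the
    # same filters as find_all
    only_blank_links = SoupStrainer("a", target="_blank")

    links = []
    for name in os.listdir("src"):  # hypothetical folder of local HTML files
        if not name.endswith(".html"):
            continue
        with open(os.path.join("src", name)) as f:
            soup = BeautifulSoup(f, "html.parser", parse_only=only_blank_links)
        links.extend(a.get("href") for a in soup.find_all("a"))

    print(links)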

Failure in scraping the flight data table from an airport website

Submitted by 你离开我真会死。 on 2019-12-24 11:55:11
Question: I have been trying to scrape arrival and departure data for domestic flights from the website of New Delhi International Airport. I have tried almost everything, but I cannot extract the data: when I run the code, it returns nothing. Similar code worked on another airport website. Here is the code I wrote:

    res = requests.get("https://m.newdelhiairport.in/live-flight-information-all.aspx?FLMode=A&FLType=D")
    soup = BeautifulSoup(res.content, 'html5lib')
    table = soup.find_all('tbody'
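
An empty result on a flight-status page usually means the table rows are injected by JavaScript after the page loads, so they never appear in the HTML that requests receives. A minimal diagnostic sketch along those lines:

    import requests
    from bs4 import BeautifulSoup

    url = ("https://m.newdelhiairport.in/"
           "live-flight-information-all.aspx?FLMode=A&FLType=D")
    res = requests.get(url)

    # if the rows are rendered client-side, they are absent from the raw HTML
    print("tbody" in res.text)

    soup = BeautifulSoup(res.content, "html5lib")
    print(soup.find_all("tbody"))  # an empty list points the same way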

BeautifulSoup findall gets stuck without processing

Submitted by 我与影子孤独终老i on 2019-12-24 11:43:15
Question: I'm trying to understand BeautifulSoup and want to find all the links within facebook.com, then iterate over each and every link within it. Here is my code. It works fine, but once it finds Linkedin.com and iterates over it, it gets stuck at a point after this URL: http://www.linkedin.com/redir/redirect?url=http%3A%2F%2Fbusiness%2Elinkedin%2Ecom%2Ftalent-solutions%3Fsrc%3Dli-footer&urlhash=f9Nj When I run Linkedin.com separately, I don't have any problem. Could this be a limitation within
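
A crawl that gets stuck on one URL is usually a request that never returns rather than a BeautifulSoup limit. A minimal sketch, assuming the crawler uses requests: pass a timeout so a slow redirect chain raises instead of blocking forever (the 10-second value is illustrative):

    import requests
    from bs4 import BeautifulSoup

    def get_links(url):
        try:
            # without a timeout, a non-responding server blocks indefinitely
            res = requests.get(url, timeout=10)
        except requests.RequestException:
            return []
        soup = BeautifulSoup(res.text, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)]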

Why is BeautifulSoup related to 'Task exception was never retrieved'?

Submitted by 爷，独闯天下 on 2019-12-24 11:30:01
Question: I want to use coroutines to crawl and parse webpages, so I wrote a sample to test. The program runs well under Python 3.5 on Ubuntu 16.04 and quits when all the work is done. The source code is below.

    import aiohttp
    import asyncio
    from bs4 import BeautifulSoup

    async def coro():
        coro_loop = asyncio.get_event_loop()
        url = u'https://www.python.org/'
        for _ in range(4):
            async with aiohttp.ClientSession(loop=coro_loop) as coro_session:
                with aiohttp.Timeout(30, loop=coro_session
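
"Task exception was never retrieved" is asyncio's warning for a task whose exception nobody ever awaited; it is raised wherever the failure happens (here, inside the parsing step), not by BeautifulSoup itself. A minimal sketch, separate from the poster's code, showing how retrieving results with gather silences the warning:

    import asyncio

    async def worker(i):
        if i == 2:
            raise ValueError("parse failed")  # stands in for a bs4 error
        return i

    async def main():
        tasks = [asyncio.ensure_future(worker(i)) for i in range(4)]
        # return_exceptions=True hands every exception back to the caller,
        # so each task's exception counts as "retrieved"
        results = await asyncio.gather(*tasks, return_exceptions=True)
        print(results)

    asyncio.get_event_loop().run_until_complete(main())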

Python web scraping on large html webpages

Submitted by 被刻印的时光 ゝ on 2019-12-24 11:24:31
Question: I am trying to get all the historical information for a particular stock from Yahoo Finance. I am new to Python and web scraping. I want to download all the historical data into a CSV file. The problem is that the code downloads only the first 100 entries for any stock on the website. When a stock is viewed in the browser, you have to scroll to the bottom of the page for more table entries to load, and I think the same thing is happening when I download using the library. Some kind of optimization
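
The page only renders the first 100 rows until the browser scrolls, so plain requests never sees the rest. A hedged sketch, assuming Selenium and a Chrome driver are available (the URL is illustrative): scroll until the page height stops growing, then hand the fully rendered HTML to BeautifulSoup:

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://finance.yahoo.com/quote/AAPL/history")  # illustrative

    last_height = 0
    while True:
        # scroll to the bottom so the next batch of rows is rendered
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude wait for the new rows to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(len(soup.select("table tbody tr")))
    driver.quit()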

Beautiful Soup returns empty list

Submitted by 非 Y 不嫁゛ on 2019-12-24 11:02:41
Question: I am new to web scraping. I have been given a task to extract data from a site; here I am choosing the "comments" dataset. Below is my code for scraping:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.kaggle.com/hacker-news/hacker-news'
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    response.status_code
    response.content
    soup = BeautifulSoup(response.content, 'html.parser')
    soup.find_all('tbody', class_='TableBody-kSbjpE jGqIxa')

When
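
Kaggle pages are rendered client-side, and class names like TableBody-kSbjpE are generated by the styling framework, so the raw HTML that requests downloads likely contains neither the class nor the table. A minimal diagnostic sketch:

    import requests

    url = "https://www.kaggle.com/hacker-news/hacker-news"
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

    # if the markup is built by JavaScript, the generated class name will be
    # absent from the server response, so find_all() can only return []
    print("TableBody" in response.text)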

BeautifulSoup functionality not working properly in a specific scenario

Submitted by 喜夏-厌秋 on 2019-12-24 10:59:27
Question: I am trying to read in the following URL using urllib2: http://frcwest.com/ and then search the data for the meta redirect. It reads in the following data:

    <!--?xml version="1.0" encoding="UTF-8"?--><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta content="0;url= Home.html" http-equiv="refresh"/></head><body></body></html>

Reading it into Beautifulsoup
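
Given that markup, the redirect target sits in the content attribute of the refresh meta tag. A minimal sketch for pulling it out (Python 2, to match the urllib2 in the question):

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen("http://frcwest.com/").read()
    soup = BeautifulSoup(html, "html.parser")

    # http-equiv contains a hyphen, so it must go through the attrs dict
    meta = soup.find("meta", attrs={"http-equiv": "refresh"})
    if meta is not None:
        # content looks like "0;url= Home.html": split off the url part
        target = meta["content"].split("url=")[-1].strip()
        print(target)  # expected: "Home.html"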

BeautifulSoup on multiple .html files

Submitted by 偶尔善良 on 2019-12-24 10:53:31
Question: I'm trying to extract information between fixed tags with BeautifulSoup, using the model suggested here. I have a lot of .html files in my folder and I want to save the results obtained with a BeautifulSoup script into another folder as individual .txt files. These .txt files should have the same names as the original files but contain only the extracted content. The script I wrote (see below) processes the files successfully but does not write the extracted bits out
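
A hedged sketch of that batch layout, with hypothetical folder names input_html and output_txt, and get_text() standing in for whatever extraction the fixed tags require:

    import os
    from bs4 import BeautifulSoup

    src_dir, out_dir = "input_html", "output_txt"  # hypothetical names
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    for name in os.listdir(src_dir):
        if not name.endswith(".html"):
            continue
        with open(os.path.join(src_dir, name)) as f:
            soup = BeautifulSoup(f, "html.parser")
        extracted = soup.get_text(separator="\n", strip=True)
        # same base name as the source file, but with a .txt extension
        out_name = os.path.splitext(name)[0] + ".txt"
        with open(os.path.join(out_dir, out_name), "w") as out:
            out.write(extracted)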

How can I grab the entire body text from a web page using BeautifulSoup?

Submitted by 折月煮酒 on 2019-12-24 10:49:30
Question: I would like to grab some text from a webpage of a medical document for a Natural Language Processing project, and I am having issues extracting the necessary information using BeautifulSoup. The page I am viewing is at: https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=2332-Abdominal%20Abscess%20I&D What I would like to do is grab the entire text body from this page; doing so with my cursor and simply applying a copy/paste would give me
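
Calling get_text() on the <body> tag flattens all nested markup into roughly the plain text a copy/paste would produce. A minimal sketch:

    import requests
    from bs4 import BeautifulSoup

    url = ("https://www.mtsamples.com/site/pages/sample.asp"
           "?Type=24-Gastroenterology&Sample=2332-Abdominal%20Abscess%20I&D")
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    # collapse every tag inside <body> into one whitespace-joined string
    text = soup.body.get_text(separator=" ", strip=True)
    print(text[:200])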

BeautifulSoup 4 + Python: string returns 'None'

Submitted by …衆ロ難τιáo~ on 2019-12-24 10:47:17
Question: I'm trying to parse some HTML with BeautifulSoup 4 and Python 2.7.6, but .string keeps returning None. The HTML I'm trying to parse is:

    <div class="booker-booking">
        2 rooms · USD 0
        <!-- Commission: USD -->
    </div>

The Python snippet I have is:

    data = soup.find('div', class_='booker-booking').string

I've also tried the following two:

    data = soup.find('div', class_='booker-booking').text
    data = soup.find('div', class_='booker-booking').contents[0]

Which both return: u'\n\t\t2\xa0rooms \n\t
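
.string only yields a value when a tag has exactly one child; here the <div> holds a text node plus an HTML comment, so it returns None. A minimal sketch that strips the whitespace from the first text child instead:

    from bs4 import BeautifulSoup

    html = '''<div class="booker-booking">
    2 rooms · USD 0
    <!-- Commission: USD -->
    </div>'''

    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="booker-booking")

    print(div.string)               # None: two children (text + comment)
    print(div.contents[0].strip())  # '2 rooms · USD 0'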