beautifulsoup

Parse BeautifulSoup element into Selenium

随声附和 submitted on 2019-12-30 10:57:45
Question: I want to get the source code of a website using Selenium, find a particular element with BeautifulSoup, and then pass it back to Selenium as a selenium.webdriver.remote.webelement object. Like so:

    driver.get("www.google.com")
    soup = BeautifulSoup(driver.page_source)
    element = soup.find(title="Search")
    element = Selenium.webelement(element)  # pseudocode -- no such API exists
    element.click()

How can I achieve this?

Answer 1: A general solution that worked for me is to compute the XPath of the bs4 element, then use that to find the
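The truncated answer's XPath-bridge approach can be sketched as follows. `xpath_of` is a hypothetical helper name (not a bs4 or Selenium API); it builds an absolute XPath for a bs4 Tag by counting same-name siblings, and the final Selenium call is shown only as a comment because it needs a live driver:

```python
from bs4 import BeautifulSoup

def xpath_of(element):
    """Build an absolute XPath for a bs4 Tag by walking up its parents."""
    parts = []
    child = element
    for parent in element.parents:
        # Position among siblings with the same tag name (XPath is 1-indexed).
        # Match by identity ('is'), since bs4 Tags compare equal by value.
        siblings = parent.find_all(child.name, recursive=False)
        index = next(i for i, s in enumerate(siblings) if s is child) + 1
        parts.insert(0, "%s[%d]" % (child.name, index))
        child = parent
    return "/" + "/".join(parts)

soup = BeautifulSoup("<html><body><p>one</p><p>two</p></body></html>", "html.parser")
second_p = soup.find_all("p")[1]
print(xpath_of(second_p))  # /html[1]/body[1]/p[2]

# With a live driver (not run here), the XPath bridges back to Selenium:
# driver.find_element("xpath", xpath_of(second_p)).click()
```

The XPath stays valid only as long as the page's DOM does not change between parsing and clicking.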

Scrape Yahoo Finance Income Statement with Python

巧了我就是萌 submitted on 2019-12-30 10:05:05
Question: I'm trying to scrape data from income statements on Yahoo Finance using Python. Specifically, say I want Apple's most recent Net Income figure. The data is structured as a bunch of nested HTML tables. I am using the requests module to retrieve the HTML and BeautifulSoup 4 to sift through the structure, but I can't figure out how to get at the figure. Here is a screenshot of the analysis with Firefox. My code so far:

    from bs4 import BeautifulSoup
    import
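The usual pattern for pulling one figure out of a labelled table is to find the label cell and read its sibling. The HTML below is a hypothetical, simplified stand-in for Yahoo's nested income-statement markup (the real markup differs and changes often), and `figure_for` is an illustrative helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for Yahoo's nested income-statement tables
HTML = """
<table><tr><td>
  <table>
    <tr><td>Total Revenue</td><td>182,795,000</td></tr>
    <tr><td>Net Income</td><td>39,510,000</td></tr>
  </table>
</td></tr></table>
"""

def figure_for(html, label):
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", string=label)   # the cell holding the row label
    if cell is None:
        return None
    # the figure sits in the next <td> of the same row
    return cell.find_next_sibling("td").get_text(strip=True)

print(figure_for(HTML, "Net Income"))  # 39,510,000
```

Nesting is not a problem here: `find` searches the whole tree, so only the label text has to be known, not the table depth.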

Python Beautifulsoup Find_all except

风流意气都作罢 submitted on 2019-12-30 09:30:18
Question: I'm struggling to find a simple way to solve this problem and hope you might be able to help. I've been using BeautifulSoup's find_all, and trying some regex, to match all the items except the 'emptyLine' one in the HTML below:

    <div class="product_item0 ">...</div>
    <div class="product_item1 ">...</div>
    <div class="product_item2 ">...</div>
    <div class="product_item0 ">...</div>
    <div class="product_item1 ">...</div>
    <div class="product_item2 ">...</div>
    <div class="product_item0 ">...</div>
    <div
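Since the wanted classes share the `product_item` prefix, one way is to pass a compiled regex as the `class_` filter, which bs4 tests against each class token. A minimal sketch with hypothetical content:

```python
import re
from bs4 import BeautifulSoup

HTML = """
<div class="product_item0 ">A</div>
<div class="product_item1 ">B</div>
<div class="emptyLine">skip me</div>
<div class="product_item2 ">C</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# class_ accepts a regex; only product_item0/1/2... divs match
items = soup.find_all("div", class_=re.compile(r"^product_item\d"))
print([d.get_text() for d in items])  # ['A', 'B', 'C']
```

This selects the wanted classes directly rather than excluding 'emptyLine', which is more robust if other unwanted classes appear later.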

Python HTML parsing with beautiful soup and filtering stop words

僤鯓⒐⒋嵵緔 submitted on 2019-12-30 07:23:53
Question: I am parsing specific information from a website into a file. Right now the program looks at a webpage, finds the right HTML tag, and parses out its contents. Now I want to filter these "results" further. For example, on the site http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx I am parsing out the ingredients, which are located in the <div class="ingredients" ...> tag. The parser does the job nicely, but I want to process these results further. When I run
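Filtering stop words is a plain-Python step after the bs4 extraction. The markup and stop-word list below are assumptions for illustration, not the real allrecipes.com structure:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the site's ingredients markup
HTML = """
<div class="ingredients">
  <span>4 pork chops</span>
  <span>1 cup of brown sugar</span>
</div>
"""
STOP_WORDS = {"of", "cup", "cups", "a", "the"}  # assumed stop-word list

soup = BeautifulSoup(HTML, "html.parser")
lines = [span.get_text(strip=True) for span in soup.select("div.ingredients span")]

# Post-process the parsed results: drop stop words token by token
filtered = [" ".join(w for w in line.split() if w.lower() not in STOP_WORDS)
            for line in lines]
print(filtered)  # ['4 pork chops', '1 brown sugar']
```

Keeping extraction (bs4) and filtering (plain list/set work) as separate steps makes each easy to test on its own.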

beautifulsoup “list object has no attribute” error

人走茶凉 submitted on 2019-12-30 06:56:18
Question: I'm trying to scrape temperatures from a weather site using the following:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    f = open('airport_temp.tsv', 'w')
    f.write("Location" + "\t" + "High Temp (F)" + "\t" + "Low Temp (F)" + "\t" + "Mean Humidity" + "\n")

    # eventually parse from http://www.wunderground.com/history/airport/\w{4}/2012/\d{2}/1/DailyHistory.html
    for x in range(10):
        locationstamp = "Location " + str(x)
        print "Getting data for " + locationstamp
        url = 'http://www
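The error in the title usually means a Tag method was called on the list that findAll returns, rather than on its elements. The question uses BeautifulSoup 3 on Python 2; the same principle in bs4/Python 3, on a made-up snippet:

```python
from bs4 import BeautifulSoup

HTML = "<table><tr><td>High</td><td>75</td></tr></table>"
soup = BeautifulSoup(HTML, "html.parser")

cells = soup.find_all("td")   # a ResultSet (a list subclass), NOT a single Tag
# cells.get_text()            # would raise: 'list' object has no attribute ...

# Call Tag methods on the elements, not on the list itself:
texts = [cell.get_text() for cell in cells]
print(texts)  # ['High', '75']
```

If exactly one match is expected, `soup.find("td")` returns a single Tag (or None) and its methods can be called directly.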

Extracting href with Beautiful Soup

只谈情不闲聊 submitted on 2019-12-30 06:39:15
Question: I use this code to get access to my link:

    links = soup.find("span", { "class" : "hsmall" })
    links.findNextSiblings('a')
    for link in links:
        print link['href']
        print link.string

The link has no ID or class or anything; it's just a classic link with an href attribute. The response of my script is:

    print link['href']
    TypeError: string indices must be integers

Can you help me get the href value? Thanks!

Answer 1: links is still referring to your soup.find result. So you could do something like:

    links = soup.find(
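The bug is that the findNextSiblings result is never assigned, so the loop iterates over the span's own contents (NavigableStrings), and indexing a string with `'href'` raises the TypeError. A working sketch on a made-up snippet, using the bs4 spelling `find_next_siblings`:

```python
from bs4 import BeautifulSoup

HTML = '<span class="hsmall">label</span><a href="/a">one</a><a href="/b">two</a>'
soup = BeautifulSoup(HTML, "html.parser")

span = soup.find("span", {"class": "hsmall"})
links = span.find_next_siblings("a")   # assign the result: a list of <a> Tags

for link in links:
    print(link["href"], link.string)

hrefs = [link["href"] for link in links]
print(hrefs)  # ['/a', '/b']
```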

get text after specific tag with beautiful soup

三世轮回 submitted on 2019-12-30 03:22:08
Question: I have a page whose content looks like:

    page.content = <body><b>Title:</b> Test title</body>

I can get the Title tag with:

    soup = BeautifulSoup(page.content)
    record_el = soup('body')[0]
    b_el = record_el.find('b', text='Title:')

but how can I get the text after the b tag? I would like to get the text after the element containing "Title:" by referring to that element, not the body element.

Answer 1: Referring to the docs, you might want to use the next_sibling of your b_el:

    b_el.next_sibling  # contains " Test title"
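The answer's next_sibling approach, run end to end on the question's own snippet (`string=` is the current bs4 name for the older `text=` argument):

```python
from bs4 import BeautifulSoup

content = "<body><b>Title:</b> Test title</body>"
soup = BeautifulSoup(content, "html.parser")

b_el = soup.find("b", string="Title:")
title = b_el.next_sibling      # the NavigableString right after </b>
print(repr(title))             # ' Test title'
print(title.strip())           # 'Test title'
```

next_sibling returns the node immediately after the tag, whitespace included, so a strip() is usually wanted.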

Find a specific tag with BeautifulSoup

Deadly submitted on 2019-12-30 02:38:07
Question: I can traverse generic tags easily with BS, but I don't know how to find specific tags. For example, how can I find all occurrences of <div style="width=300px;">? Is this possible with BS?

Answer 1: The following should work:

    soup = BeautifulSoup(htmlstring)
    soup.findAll('div', style="width=300px;")

There are a couple of ways to search for tags; see http://www.crummy.com/software/BeautifulSoup/documentation.html for more on how to use them, and http://lxml.de/elementsoup.html.

Answer 2: with bs4 things
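The same attribute filter in bs4 spelling (`find_all`; `findAll` still works as an alias), on a made-up snippet:

```python
from bs4 import BeautifulSoup

HTML = ('<div style="width=300px;">a</div>'
        '<div>b</div>'
        '<div style="width=300px;">c</div>')
soup = BeautifulSoup(HTML, "html.parser")

# Any tag attribute can be used as a keyword filter; here the exact
# style string must match.
divs = soup.find_all("div", style="width=300px;")
print([d.get_text() for d in divs])  # ['a', 'c']
```

Note this is an exact string match on the attribute; a regex via `re.compile` is needed if the style value varies.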

Multithreading in Python/BeautifulSoup scraping doesn't speed up at all

若如初见. submitted on 2019-12-29 18:56:51
Question: I have a csv file ("SomeSiteValidURLs.csv") listing all the links I need to scrape. The code works: it goes through the URLs in the csv, scrapes the information, and records/saves it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. Each link takes about 1 s to crawl and save, which is too slow for the magnitude of the project. So I have incorporated the
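A common reason threaded scraping shows no speed-up is that each thread is started and joined immediately (so the work is still serial), or that one lock is held around the whole request. Network I/O releases the GIL, so a thread pool genuinely overlaps downloads. A sketch with `concurrent.futures`; `scrape` here is a trivial stub standing in for the real requests + BeautifulSoup + csv-writing worker:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape(url):
    # Stand-in for the real worker: requests.get(url), parse with
    # BeautifulSoup, append a row to Output.csv. While a thread waits on
    # the network, the others keep running.
    return url.upper()

urls = ["http://example.com/page%d" % i for i in range(5)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # map submits all URLs at once and yields results in input order
    results = list(pool.map(scrape, urls))

print(len(results))  # 5
```

If rows are written to Output.csv from multiple threads, the writes themselves need a lock or a single writer thread fed by a queue.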