screen-scraping

How to select some URLs with BeautifulSoup?

Submitted by 扶醉桌前 on 2020-01-01 18:48:25
Question: I want to scrape the following information, except the last row and the cell with class="Region":

...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="left">New York</td>
<td bgcolor="" align="left" class="Region">N/A</td>
<td bgcolor="" align="left">1,863</td>
<td bgcolor="" align="left">565</td>
<td bgcolor="" align="left">1,133</td>
<td bgcolor="" align="left">$160,000</td>
<td bgcolor=""
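A minimal Beautiful Soup sketch of one way to do this, assuming the cells sit in <tr> rows of a single table; the file name and row structure are assumptions, not taken from the question:

from bs4 import BeautifulSoup

# assumed: the page was saved locally; real code might fetch it over HTTP instead
html = open("page.html", encoding="utf-8").read()
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")
for row in rows[:-1]:  # skip the last row
    for cell in row.find_all("td"):
        if "Region" in (cell.get("class") or []):
            continue  # skip the class="Region" cells
        for link in cell.find_all("a", href=True):
            print(link["href"], link.get_text(strip=True))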

Scrape using multiple POST payloads on the same URL

Submitted by 若如初见. on 2020-01-01 14:34:12
Question: I have already created one spider that collects a list of company names with matching phone numbers, which is then saved to a CSV file. I then want to scrape data from another site, using the phone numbers in the CSV file as POST data. I want it to loop through the same start URL, scraping the data that each phone number produces, until there are no numbers left in the CSV file. This is what I have so far:

from scrapy.spider import BaseSpider
from scrapy.http import
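A sketch of how that loop could look, keeping the question's imports (older Scrapy); the spider name, URL, CSV file name, and form field names are all placeholders:

import csv

from scrapy.spider import BaseSpider  # imports as in the question (older Scrapy)
from scrapy.http import FormRequest


class PhoneLookupSpider(BaseSpider):
    name = "phone_lookup"                     # placeholder name
    lookup_url = "http://example.com/search"  # hypothetical lookup-form URL

    def start_requests(self):
        # one POST to the same URL per phone number from the first spider's CSV
        with open("companies.csv") as f:
            for row in csv.DictReader(f):
                yield FormRequest(
                    self.lookup_url,
                    formdata={"phone": row["phone"]},  # assumed column/field name
                    callback=self.parse_result,
                    dont_filter=True,  # same URL each time, so bypass the dupe filter
                )

    def parse_result(self, response):
        # extract whatever the lookup page returns for this number
        pass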

What is the best way to programmatically log into a web site in order to screen scrape? (Preferably in Python)

Submitted by 梦想与她 on 2020-01-01 07:10:14
Question: I want to be able to log into a website programmatically and periodically obtain some information from the site. What tool(s) would make this as simple as possible? I'd prefer a Python library of some type, because I want to become more proficient in Python, but I'm open to any suggestions.

Answer 1: You can try Mechanize (http://wwwsearch.sourceforge.net/mechanize/) for programmatic web browsing, and definitely use Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) for
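Along the lines of that answer, a minimal sketch of a Mechanize login followed by a Beautiful Soup parse; the URLs and form field names here are hypothetical:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)           # some sites serve a robots.txt that blocks mechanize
br.open("https://example.com/login")  # hypothetical login page

br.select_form(nr=0)  # assumes the login form is the first form on the page
br["username"] = "me"  # assumed field names
br["password"] = "secret"
br.submit()

# mechanize keeps the session cookies, so later pages come back logged in
page = br.open("https://example.com/account").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.title.string if soup.title else "no title")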

Memory leak in Node.js scraper

Submitted by ∥☆過路亽.° on 2019-12-31 13:52:13
Question: This is a simple scraper, written in JavaScript with Node.js, that scrapes Wikipedia for periodic-table element data. The dependencies are jsdom for DOM manipulation and chain-gang for queuing. It works fine most of the time (it doesn't handle errors gracefully), and the code isn't too bad, dare I say, for a first attempt, but there is a serious fault with it: it leaks memory horribly, anywhere from 0.3% to 0.6% of the computer's memory for each element, such that by the time it gets to lead it

PhantomJS download using a JavaScript link

Submitted by 大憨熊 on 2019-12-31 13:13:42
Question: I am attempting to scrape the website below:

http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2011&month=0&season1=2011&ind=0&team=0&rost=0&players=0

If you click the small button titled "export data" at the top right of the table, a JavaScript function runs and my browser downloads the file in .csv form. I'd like to be able to write a PhantomJS script that does this automatically. Any ideas? The button is coded in the HTML as:

<a id="LB_cmdCSV"
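Not PhantomJS, but as an alternative sketch in Python (the language most entries in this digest use), Selenium can click that same link by the id shown above; where the downloaded file lands depends on browser profile settings, which are assumed to be configured:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any Selenium-supported browser
driver.get("http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all"
           "&qual=0&type=8&season=2011&month=0&season1=2011&ind=0&team=0"
           "&rost=0&players=0")
driver.find_element(By.ID, "LB_cmdCSV").click()  # fires the same JavaScript the button runs
driver.quit()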

Scrape a website with an XML HTTP request in Excel VBA: wait for the page to fully load

Submitted by 亡梦爱人 on 2019-12-31 04:51:27
Question: I'm trying to scrape a product price from a webpage using Excel VBA. The following code works when using an Internet Explorer navigate request in VBA, but I would like to use an XML HTTP request instead to speed up the scraping. In the IE version I tell the application to wait three seconds so the page is fully loaded before scraping the product price; if this line is not included, it won't find the price. I tried to change this with an XML HTTP request (see the second
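Worth noting: an XML HTTP request returns the raw HTML and never executes the page's JavaScript, so if the price is injected client-side, no amount of waiting will make it appear in the response. A quick Python sketch (the URL and selector are hypothetical) to check whether the price is in the static HTML at all:

import requests
from bs4 import BeautifulSoup

# if the price is missing from this raw response, it is rendered by JavaScript,
# and an XML HTTP request (VBA's MSXML2.XMLHTTP included) will never see it
html = requests.get("http://example.com/product").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one(".price"))  # assumed selector; None means the price is JS-rendered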

OpenUri causing 401 Unauthorized error with HTTPS URL

Submitted by 99封情书 on 2019-12-31 03:00:26
Question: I am adding functionality that scrapes an XML page from a source that requires an HTTPS connection with authentication. I am trying to use Ryan Bates' Railscast #190 solution, but I'm running into a 401 Unauthorized error. Here is my test Ruby script:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "https://biblesearch.americanbible.org/passages.xml?q[]=john+3:1-5&version=KJV"
doc = Nokogiri::XML(open(url, :http_basic_authentication => ['username', 'password']))
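As a cross-check outside Ruby, the same request can be issued from Python with requests; if it also returns 401, the credentials or the API are at fault rather than open-uri (the credentials here are placeholders, as in the question):

import requests

url = "https://biblesearch.americanbible.org/passages.xml?q[]=john+3:1-5&version=KJV"
resp = requests.get(url, auth=("username", "password"))
print(resp.status_code)  # a 401 here too points at the credentials or endpoint,
print(resp.text[:200])   # not at open-uri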
