screen-scraping

How to select some URLs with BeautifulSoup?

Submitted by 扶醉桌前 on 2020-01-01 18:48:25
Question: I want to scrape the following information, except the last row and the cell with class="Region":

...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td bgcolor="" align="left">New York</td>
<td bgcolor="" align="left" class="Region">N/A</td>
<td bgcolor="" align="left">1,863</td>
<td bgcolor="" align="left">565</td>
<td bgcolor="" align="left">1,133</td>
<td bgcolor="" align="left">$160,000</td>
<td bgcolor=""
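A minimal Beautiful Soup sketch of one way to do this, assuming the cells sit in <tr> rows of a single table; the file name and row structure are assumptions, not taken from the question:

from bs4 import BeautifulSoup

# assumed: the page was saved locally; real code might fetch it over HTTP instead
html = open("page.html", encoding="utf-8").read()
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")
for row in rows[:-1]:  # skip the last row
    for cell in row.find_all("td"):
        if "Region" in (cell.get("class") or []):
            continue  # skip the class="Region" cells
        for link in cell.find_all("a", href=True):
            print(link["href"], link.get_text(strip=True))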

Scrape using multiple POST payloads on the same URL

Submitted by 若如初见. on 2020-01-01 14:34:12
Question: I have already created one spider that collects a list of company names with matching phone numbers, which is then saved to a CSV file. I then want to scrape data from another site, using the phone numbers in the CSV file as POST data. I want it to loop through the same start URL, scraping the data that each phone number produces, until there are no numbers left in the CSV file. This is what I have so far:

from scrapy.spider import BaseSpider
from scrapy.http import
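A sketch of how that loop could look, keeping the question's imports (older Scrapy); the spider name, URL, CSV file name, and form field names are all placeholders:

import csv

from scrapy.spider import BaseSpider  # imports as in the question (older Scrapy)
from scrapy.http import FormRequest


class PhoneLookupSpider(BaseSpider):
    name = "phone_lookup"                     # placeholder name
    lookup_url = "http://example.com/search"  # hypothetical lookup-form URL

    def start_requests(self):
        # one POST to the same URL per phone number from the first spider's CSV
        with open("companies.csv") as f:
            for row in csv.DictReader(f):
                yield FormRequest(
                    self.lookup_url,
                    formdata={"phone": row["phone"]},  # assumed column/field name
                    callback=self.parse_result,
                    dont_filter=True,  # same URL each time, so bypass the dupe filter
                )

    def parse_result(self, response):
        # extract whatever the lookup page returns for this number
        pass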

What is the best way to programmatically log into a web site in order to screen scrape? (Preferably in Python)

Submitted by 梦想与她 on 2020-01-01 07:10:14
Question: I want to be able to log into a website programmatically and periodically obtain some information from the site. What tool(s) would make this as simple as possible? I'd prefer a Python library of some type, because I want to become more proficient in Python, but I'm open to any suggestions.

Answer 1: You can try Mechanize (http://wwwsearch.sourceforge.net/mechanize/) for programmatic web browsing, and definitely use Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) for
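Along the lines of that answer, a minimal sketch of a Mechanize login followed by a Beautiful Soup parse; the URLs and form field names here are hypothetical:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)           # some sites serve a robots.txt that blocks mechanize
br.open("https://example.com/login")  # hypothetical login page

br.select_form(nr=0)  # assumes the login form is the first form on the page
br["username"] = "me"  # assumed field names
br["password"] = "secret"
br.submit()

# mechanize keeps the session cookies, so later pages come back logged in
page = br.open("https://example.com/account").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.title.string if soup.title else "no title")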

Memory leak in Node.js scraper

Submitted by ∥☆過路亽.° on 2019-12-31 13:52:13
Question: This is a simple scraper, written in JavaScript with Node.js, that scrapes Wikipedia for periodic-table element data. The dependencies are jsdom for DOM manipulation and chain-gang for queuing. It works fine most of the time (it doesn't handle errors gracefully), and the code isn't too bad, dare I say, for a first attempt, but there is a serious fault with it: it leaks memory horribly, anywhere from 0.3% to 0.6% of the computer's memory for each element, such that by the time it gets to lead it

PhantomJS download using a JavaScript link

Submitted by 大憨熊 on 2019-12-31 13:13:42
Question: I am attempting to scrape the website below:

http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2011&month=0&season1=2011&ind=0&team=0&rost=0&players=0

If you click the small button titled "export data" at the top right of the table, a JavaScript function runs and my browser downloads the file in .csv form. I'd like to be able to write a PhantomJS script that does this automatically. Any ideas? The button is coded in the HTML as:

<a id="LB_cmdCSV"
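Not PhantomJS, but as an alternative sketch in Python (the language most entries in this digest use), Selenium can click that same link by the id shown above; where the downloaded file lands depends on browser profile settings, which are assumed to be configured:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any Selenium-supported browser
driver.get("http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all"
           "&qual=0&type=8&season=2011&month=0&season1=2011&ind=0&team=0"
           "&rost=0&players=0")
driver.find_element(By.ID, "LB_cmdCSV").click()  # fires the same JavaScript the button runs
driver.quit()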

Scrape a website with an XML HTTP request in Excel VBA: wait for the page to fully load

Submitted by 亡梦爱人 on 2019-12-31 04:51:27
Question: I'm trying to scrape a product price from a webpage using Excel VBA. The following code works when using an Internet Explorer navigate request in VBA, but I would like to use an XML HTTP request instead to speed up the scraping. In the IE version I tell the application to wait three seconds so the page is fully loaded before scraping the product price; if this line is not included, it won't find the price. I tried to change this with an XML HTTP request (see the second
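Worth noting: an XML HTTP request returns the raw HTML and never executes the page's JavaScript, so if the price is injected client-side, no amount of waiting will make it appear in the response. A quick Python sketch (the URL and selector are hypothetical) to check whether the price is in the static HTML at all:

import requests
from bs4 import BeautifulSoup

# if the price is missing from this raw response, it is rendered by JavaScript,
# and an XML HTTP request (VBA's MSXML2.XMLHTTP included) will never see it
html = requests.get("http://example.com/product").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one(".price"))  # assumed selector; None means the price is JS-rendered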

OpenUri causing 401 Unauthorized error with HTTPS URL

Submitted by 99封情书 on 2019-12-31 03:00:26
Question: I am adding functionality that scrapes an XML page from a source that requires an HTTPS connection with authentication. I am trying to use Ryan Bates' Railscast #190 solution, but I'm running into a 401 Unauthorized error. Here is my test Ruby script:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "https://biblesearch.americanbible.org/passages.xml?q[]=john+3:1-5&version=KJV"
doc = Nokogiri::XML(open(url, :http_basic_authentication => ['username', 'password']))
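As a cross-check outside Ruby, the same request can be issued from Python with requests; if it also returns 401, the credentials or the API are at fault rather than open-uri (the credentials here are placeholders, as in the question):

import requests

url = "https://biblesearch.americanbible.org/passages.xml?q[]=john+3:1-5&version=KJV"
resp = requests.get(url, auth=("username", "password"))
print(resp.status_code)  # a 401 here too points at the credentials or endpoint,
print(resp.text[:200])   # not at open-uri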
