scrape

How can I input data into a webpage to scrape the resulting output using Python?

北城以北 submitted on 2019-11-29 00:21:21
I am familiar with using BeautifulSoup and urllib2 to scrape data from a webpage. However, what if a parameter needs to be entered into the page before the result I want to scrape is returned? I'm trying to obtain the geographic distance between two addresses using this website: http://www.freemaptools.com/how-far-is-it-between.htm. I want to be able to go to the page, enter the two addresses, click "Show", and then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary. Is there any way to input data into a webpage using Python? Take a look at
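
A minimal sketch of the general pattern, using requests (a swap for the urllib2 mentioned in the question) to submit form data and BeautifulSoup to read the reply. The endpoint, field names, and element ids below are hypothetical: the real freemaptools page computes its result with Javascript, so you would need to find the actual request it makes (e.g. in your browser's network inspector) and replicate that.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical endpoint and form fields -- discover the real ones
    # with your browser's network inspector before relying on this.
    payload = {"from": "London", "to": "Paris"}
    resp = requests.post("http://example.com/route", data=payload)
    soup = BeautifulSoup(resp.text, "html.parser")

    distances = {}
    crow = soup.find(id="crow_flies_distance")   # hypothetical element id
    land = soup.find(id="land_distance")         # hypothetical element id
    if crow and land:
        distances = {"crow_flies": crow.get_text(),
                     "land_transport": land.get_text()}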

scrapy xpath selector repeats data

杀马特。学长 韩版系。学妹 submitted on 2019-11-28 14:17:36
I am trying to extract the business name and address from each listing and export them to a CSV, but I am having problems with the output file. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problems.

yp_spider.py:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from yp.items import Biz

    class MySpider(BaseSpider):
        name = "ypages"
        allowed_domains = ["yellowpages.com"]
        start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            bizs = hxs.select("//div[@class='listing_content']")
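
The usual cause of repeated rows with this pattern is running an absolute XPath (one starting with //) inside the per-listing loop, which searches the whole page rather than the current div, so every row gets the first listing's data. A sketch of the fix, assuming Biz declares name and address fields; the field selectors are hypothetical:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []
        for biz in bizs:
            item = Biz()
            # The leading "." anchors the query to this listing's div.
            # Without it, //h3/a/text() matches every listing on the page
            # and each exported row repeats the same business.
            item["name"] = biz.select(".//h3/a/text()").extract()
            item["address"] = biz.select(
                ".//span[@class='street-address']/text()").extract()
            items.append(item)
        return items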

Html-Agility-Pack not loading the page with full content?

亡梦爱人 submitted on 2019-11-28 13:58:41
I am using the Html Agility Pack to fetch data from a website (scraping). My problem is that the website I am fetching from loads some of its content a few seconds after the page loads, so whenever I try to read that particular data from its div, I get null. In the page variable I never get the reviewBox division, because it has not loaded yet.

    public void FetchAllLinks(String Url)
    {
        Url = "http://www.tripadvisor.com/";
        HtmlDocument page = new HtmlWeb().Load(Url);
        var link_list = page.DocumentNode.SelectNodes("//div[@class='reviewBox']");
        foreach (var link in link_list)
        {
            htmlpage
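
The question is C#, but the underlying issue is language-independent: HtmlWeb.Load fetches only the initial HTML, so content inserted later by a script never appears. One common workaround is to drive a real browser and wait explicitly for the late element. A sketch in Python (the language used elsewhere on this page) with Selenium, reusing the reviewBox class from the question:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("http://www.tripadvisor.com/")
    try:
        # Block for up to 10 seconds until the JS-inserted divs exist.
        boxes = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "div.reviewBox")))
        for box in boxes:
            print(box.text)
    finally:
        driver.quit()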

How to properly use mechanize to scrape AJAX sites

蓝咒 submitted on 2019-11-28 09:08:59
Question: So I am fairly new to web scraping. There is a site with a table on it whose values are controlled by Javascript. Those values determine the addresses of the further pages that the Javascript tells my browser to request; these new pages return JSON responses that the script uses to update the table in my browser. So I wanted to build a class with a mechanize method that takes in a URL and spits out the body response, the first time HTML, and afterwards the body response
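
A minimal sketch of such a class, assuming mechanize is installed. Note that mechanize does not execute Javascript, so the JSON URLs must be discovered manually (e.g. from the browser's network tab) and requested directly; everything here beyond the mechanize calls themselves is an assumption about the site:

    import json
    import mechanize

    class TableScraper:
        def __init__(self):
            self.br = mechanize.Browser()
            self.br.set_handle_robots(False)  # only if the site's terms allow it

        def fetch(self, url):
            """Return the raw body for url: HTML on the first call,
            JSON text for the follow-up requests the page's JS would make."""
            return self.br.open(url).read()

        def fetch_json(self, url):
            return json.loads(self.fetch(url))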

Extract / Identify Tables from PDF python [closed]

混江龙づ霸主 submitted on 2019-11-28 02:55:43
Are there any open source libraries that support table identification and extraction? By this I mean:

1. Identify that a table structure exists
2. Classify the table from its contents
3. Extract data from the table in a useful output format, e.g. JSON/CSV

I have looked through similar questions on this topic and found the following: PDFMiner, which addresses problem 3, but it seems the user is required to tell PDFMiner where a table structure exists for each table (correct me if I'm wrong); and pdf-table-extract, which attempts to address problem 1 but, according to its to-do list, cannot currently
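
For what it's worth, a library not named in the question, Camelot, attempts both detection (problem 1) and extraction (problem 3). A sketch, assuming a file named report.pdf:

    # pip install camelot-py[cv]
    import camelot

    tables = camelot.read_pdf("report.pdf", pages="all")
    print(tables.n)                       # number of tables detected
    df = tables[0].df                     # each table is a pandas DataFrame
    tables.export("tables.csv", f="csv")  # or f="json" for JSON output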

How to scrape dynamic webpages by Python

女生的网名这么多〃 submitted on 2019-11-28 00:15:14
[What I'm trying to do] Scrape the webpage below for used car data: http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1

[Issue] Scraping all of the pages. At the URL above, only the first 30 items are shown. Those can be scraped by the code below, which I wrote. Links to the other pages are displayed like "1 2 3...", but the link addresses seem to be in Javascript. I googled for useful information but couldn't find any.

    from bs4 import BeautifulSoup
    import urllib.request

    html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php
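
Since the pager links run Javascript rather than pointing at real URLs, one common approach is to click them in a real browser and read each rendered page. A sketch with Selenium; the item selector and the pager's link text are guesses, not verified against goo-net:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()
    driver.get("http://www.goo-net.com/php/search/summary.php"
               "?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
    cars = []
    while True:
        soup = BeautifulSoup(driver.page_source, "html.parser")
        cars.extend(soup.select("div.car_info"))  # hypothetical selector
        try:
            # "次へ" ("next") is a guess at the pager's link text.
            driver.find_element(By.LINK_TEXT, "次へ").click()
        except NoSuchElementException:
            break  # no next-page link: we are on the last page
    driver.quit()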

Scrape web site generated by Javascript

放肆的年华 submitted on 2019-11-27 16:26:42
I think this is a really challenging one! I write a website for my local football league, www.rdyfl.co.uk, and include Javascript code snippets from the F.A.'s Full-Time system, where we generate our fixtures, linking in tables, fixtures, recent results, etc. For another feature I want to add to the site I need to scrape the 'Upcoming Fixtures' for each age group and division, but when I examine the source I have two problems. The fixtures content is generated by Javascript, and therefore I need to see the generated source, not just the source. When I view the generated source using Firefox, the team
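
One way to get at the generated source programmatically is to let a browser run the snippet and then read the DOM it produced. A sketch in Python with Selenium; the selector for fixture rows is hypothetical:

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.rdyfl.co.uk/")  # page embedding the Full-Time snippet
    time.sleep(5)   # crude; an explicit wait (see the Selenium sketch above) is more robust
    generated = driver.page_source         # the DOM after the Javascript has run
    driver.quit()

    soup = BeautifulSoup(generated, "html.parser")
    for row in soup.select("table.fixtures tr"):  # hypothetical selector
        print(row.get_text(" ", strip=True))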

How to scrape tables inside a comment tag in html with R?

半城伤御伤魂 submitted on 2019-11-27 14:55:16
I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag for the table I want to be #advanced. However, I noticed rvest wasn't picking it up. Looking at the page source, I noticed that the tables are inside an HTML comment tag (<!--). What is the best way to get the tables from inside the comment tags? Thanks!

Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

Ok..got it.

    library(stringi)
    library(knitr)
    library(rvest)

    any_version_html <- function(x){
      XML:
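
The question is about rvest, but for comparison, the same trick in Python (the language used elsewhere on this page): pull out the comment nodes, re-parse each comment's text as its own document, and look for the table there. Only the URL and the #advanced id come from the question; the rest is a sketch:

    from bs4 import BeautifulSoup, Comment
    import urllib.request

    url = "http://www.basketball-reference.com/teams/CHI/2015.html"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # The #advanced table ships inside an HTML comment, so walk the comment
    # nodes and re-parse each one's text as a document of its own.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id="advanced")
        if table is not None:
            print(len(table.find_all("tr")))  # proof we found it
            break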
