scrape

How can I input data into a webpage to scrape the resulting output using Python?

北城以北 submitted on 2019-11-29 00:21:21
I am familiar with using BeautifulSoup and urllib2 to scrape data from a webpage. However, what if a parameter needs to be entered into the page before the result I want to scrape is returned? I'm trying to obtain the geographic distance between two addresses using this website: http://www.freemaptools.com/how-far-is-it-between.htm. I want to be able to go to the page, enter the two addresses, click "Show", and then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary. Is there any way to input data into a webpage using Python? Take a look at
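
A minimal sketch of the general pattern, using requests (a swap for the urllib2 mentioned in the question) to submit form data and BeautifulSoup to read the reply. The endpoint, field names, and element ids below are hypothetical: the real freemaptools page computes its result with Javascript, so you would need to find the actual request it makes (e.g. in your browser's network inspector) and replicate that.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical endpoint and form fields -- discover the real ones
    # with your browser's network inspector before relying on this.
    payload = {"from": "London", "to": "Paris"}
    resp = requests.post("http://example.com/route", data=payload)
    soup = BeautifulSoup(resp.text, "html.parser")

    distances = {}
    crow = soup.find(id="crow_flies_distance")   # hypothetical element id
    land = soup.find(id="land_distance")         # hypothetical element id
    if crow and land:
        distances = {"crow_flies": crow.get_text(),
                     "land_transport": land.get_text()}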

scrapy xpath selector repeats data

杀马特。学长 韩版系。学妹 submitted on 2019-11-28 14:17:36
I am trying to extract the business name and address from each listing and export them to a CSV, but I am having problems with the output file. I think bizs = hxs.select("//div[@class='listing_content']") may be causing the problems.

yp_spider.py:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from yp.items import Biz

    class MySpider(BaseSpider):
        name = "ypages"
        allowed_domains = ["yellowpages.com"]
        start_urls = ["http://www.yellowpages.com/sanfrancisco/restaraunts"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            bizs = hxs.select("//div[@class='listing_content']")
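
The usual cause of repeated rows with this pattern is running an absolute XPath (one starting with //) inside the per-listing loop, which searches the whole page rather than the current div, so every row gets the first listing's data. A sketch of the fix, assuming Biz declares name and address fields; the field selectors are hypothetical:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        bizs = hxs.select("//div[@class='listing_content']")
        items = []
        for biz in bizs:
            item = Biz()
            # The leading "." anchors the query to this listing's div.
            # Without it, //h3/a/text() matches every listing on the page
            # and each exported row repeats the same business.
            item["name"] = biz.select(".//h3/a/text()").extract()
            item["address"] = biz.select(
                ".//span[@class='street-address']/text()").extract()
            items.append(item)
        return items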

Html-Agility-Pack not loading the page with full content?

亡梦爱人 submitted on 2019-11-28 13:58:41
I am using the Html Agility Pack to fetch data from a website (scraping). My problem is that the website I am fetching from loads some of its content a few seconds after the page loads, so whenever I try to read that particular data from its div, I get null. In the page variable I never get the reviewBox division, because it has not loaded yet.

    public void FetchAllLinks(String Url)
    {
        Url = "http://www.tripadvisor.com/";
        HtmlDocument page = new HtmlWeb().Load(Url);
        var link_list = page.DocumentNode.SelectNodes("//div[@class='reviewBox']");
        foreach (var link in link_list)
        {
            htmlpage
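
The question is C#, but the underlying issue is language-independent: HtmlWeb.Load fetches only the initial HTML, so content inserted later by a script never appears. One common workaround is to drive a real browser and wait explicitly for the late element. A sketch in Python (the language used elsewhere on this page) with Selenium, reusing the reviewBox class from the question:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("http://www.tripadvisor.com/")
    try:
        # Block for up to 10 seconds until the JS-inserted divs exist.
        boxes = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "div.reviewBox")))
        for box in boxes:
            print(box.text)
    finally:
        driver.quit()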

How to properly use mechanize to scrape AJAX sites

蓝咒 submitted on 2019-11-28 09:08:59
Question: So I am fairly new to web scraping. There is a site with a table on it whose values are controlled by Javascript. Those values determine the addresses of the further pages that the Javascript tells my browser to request; these new pages return JSON responses that the script uses to update the table in my browser. So I wanted to build a class with a mechanize method that takes in a URL and spits out the body response, the first time HTML, and afterwards the body response
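
A minimal sketch of such a class, assuming mechanize is installed. Note that mechanize does not execute Javascript, so the JSON URLs must be discovered manually (e.g. from the browser's network tab) and requested directly; everything here beyond the mechanize calls themselves is an assumption about the site:

    import json
    import mechanize

    class TableScraper:
        def __init__(self):
            self.br = mechanize.Browser()
            self.br.set_handle_robots(False)  # only if the site's terms allow it

        def fetch(self, url):
            """Return the raw body for url: HTML on the first call,
            JSON text for the follow-up requests the page's JS would make."""
            return self.br.open(url).read()

        def fetch_json(self, url):
            return json.loads(self.fetch(url))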

Extract / Identify Tables from PDF python [closed]

混江龙づ霸主 submitted on 2019-11-28 02:55:43
Are there any open source libraries that support table identification and extraction? By this I mean:

1. Identify that a table structure exists
2. Classify the table from its contents
3. Extract data from the table in a useful output format, e.g. JSON/CSV

I have looked through similar questions on this topic and found the following: PDFMiner, which addresses problem 3, but it seems the user is required to tell PDFMiner where a table structure exists for each table (correct me if I'm wrong); and pdf-table-extract, which attempts to address problem 1 but, according to its to-do list, cannot currently
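
For what it's worth, a library not named in the question, Camelot, attempts both detection (problem 1) and extraction (problem 3). A sketch, assuming a file named report.pdf:

    # pip install camelot-py[cv]
    import camelot

    tables = camelot.read_pdf("report.pdf", pages="all")
    print(tables.n)                       # number of tables detected
    df = tables[0].df                     # each table is a pandas DataFrame
    tables.export("tables.csv", f="csv")  # or f="json" for JSON output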

How to scrape dynamic webpages by Python

女生的网名这么多〃 submitted on 2019-11-28 00:15:14
[What I'm trying to do] Scrape the webpage below for used car data: http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1

[Issue] Scraping all of the pages. At the URL above, only the first 30 items are shown. Those can be scraped by the code below, which I wrote. Links to the other pages are displayed like "1 2 3...", but the link addresses seem to be in Javascript. I googled for useful information but couldn't find any.

    from bs4 import BeautifulSoup
    import urllib.request

    html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php
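
Since the pager links run Javascript rather than pointing at real URLs, one common approach is to click them in a real browser and read each rendered page. A sketch with Selenium; the item selector and the pager's link text are guesses, not verified against goo-net:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()
    driver.get("http://www.goo-net.com/php/search/summary.php"
               "?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
    cars = []
    while True:
        soup = BeautifulSoup(driver.page_source, "html.parser")
        cars.extend(soup.select("div.car_info"))  # hypothetical selector
        try:
            # "次へ" ("next") is a guess at the pager's link text.
            driver.find_element(By.LINK_TEXT, "次へ").click()
        except NoSuchElementException:
            break  # no next-page link: we are on the last page
    driver.quit()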

Scrape web site generated by Javascript

放肆的年华 submitted on 2019-11-27 16:26:42
I think this is a really challenging one! I write a website for my local football league, www.rdyfl.co.uk, and include Javascript code snippets from the F.A.'s Full-Time system, where we generate our fixtures, linking in tables, fixtures, recent results, etc. For another feature I want to add to the site I need to scrape the 'Upcoming Fixtures' for each age group and division, but when I examine the source I have two problems. The fixtures content is generated by Javascript, and therefore I need to see the generated source, not just the source. When I view the generated source using Firefox, the team
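
One way to get at the generated source programmatically is to let a browser run the snippet and then read the DOM it produced. A sketch in Python with Selenium; the selector for fixture rows is hypothetical:

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.rdyfl.co.uk/")  # page embedding the Full-Time snippet
    time.sleep(5)   # crude; an explicit wait (see the Selenium sketch above) is more robust
    generated = driver.page_source         # the DOM after the Javascript has run
    driver.quit()

    soup = BeautifulSoup(generated, "html.parser")
    for row in soup.select("table.fixtures tr"):  # hypothetical selector
        print(row.get_text(" ", strip=True))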

How to scrape tables inside a comment tag in html with R?

半城伤御伤魂 submitted on 2019-11-27 14:55:16
I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag for the table I want to be #advanced. However, I noticed rvest wasn't picking it up. Looking at the page source, I noticed that the tables are inside an HTML comment tag (<!--). What is the best way to get the tables from inside the comment tags? Thanks!

Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

Ok..got it.

    library(stringi)
    library(knitr)
    library(rvest)

    any_version_html <- function(x){
      XML:
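
The question is about rvest, but for comparison, the same trick in Python (the language used elsewhere on this page): pull out the comment nodes, re-parse each comment's text as its own document, and look for the table there. Only the URL and the #advanced id come from the question; the rest is a sketch:

    from bs4 import BeautifulSoup, Comment
    import urllib.request

    url = "http://www.basketball-reference.com/teams/CHI/2015.html"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # The #advanced table ships inside an HTML comment, so walk the comment
    # nodes and re-parse each one's text as a document of its own.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id="advanced")
        if table is not None:
            print(len(table.find_all("tr")))  # proof we found it
            break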
