web-scraping

Use Pandas to Get Multiple Tables From Webpage

安稳与你 submitted on 2021-02-08 09:56:49
Question: I am using Pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014. To get the data, I am writing: dfs = pd.read_html(url). The data looks great and is perfectly parsed, except it only takes the first 40 rows. It seems to be a problem with the separation of the tables that makes pandas not get all the information. How do you get pandas to get all the data from all the tables on that webpage? Answer 1: The HTML of the page you have posted has
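A likely explanation is that the page's markup splits one logical table into several <table> (or <tbody>) chunks, and pd.read_html returns one DataFrame per chunk, so only the first chunk's 40 rows end up being used. A minimal sketch on an inline HTML stand-in (the real kenpom markup isn't reproduced here) shows how concatenating the returned list recovers all rows:

```python
from io import StringIO
import pandas as pd

# Hypothetical page whose rows are split across several <table> chunks,
# mimicking how kenpom-style markup can break one logical table apart.
html = """
<table><tr><th>Team</th><th>W</th></tr>
       <tr><td>A</td><td>30</td></tr></table>
<table><tr><th>Team</th><th>W</th></tr>
       <tr><td>B</td><td>28</td></tr></table>
"""

# read_html returns a list with one DataFrame per table found...
dfs = pd.read_html(StringIO(html))

# ...so stitch every chunk back into a single frame.
full = pd.concat(dfs, ignore_index=True)
print(len(full))
```

If the chunks share the same columns, pd.concat with ignore_index=True yields one clean frame covering the whole page.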

Scraping a website with python 3 that requires login

我是研究僧i submitted on 2021-02-08 09:13:38
Question: Just a question regarding some scraping authentication, using BeautifulSoup:

# importing the requests lib
import requests
from bs4 import BeautifulSoup

# specifying the page
page = requests.get("http://localhost:8080/login?from=%2F")

# parsing through the api
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

From here, the part of the output that I think is important:

<table> <tr> <td> User: </td> <td> <input autocapitalize="off" autocorrect="off" id="j_username" name="j_username"
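A plain requests.get only fetches the login form; to authenticate, the credentials have to be POSTed to the form's action URL using a requests.Session so the auth cookie persists across requests. The j_username field suggests a Jenkins-style j_security_check endpoint, but that is an assumption; the field-discovery step can be sketched offline on a stand-in snippet:

```python
from bs4 import BeautifulSoup

# Stand-in for the login page fetched with requests.get(); the
# j_username/j_password names match the form shown in the question.
login_html = """
<form action="j_security_check" method="post">
  <input id="j_username" name="j_username"/>
  <input id="j_password" name="j_password" type="password"/>
</form>
"""

soup = BeautifulSoup(login_html, "html.parser")
form = soup.find("form")
fields = [inp["name"] for inp in form.find_all("input")]
print(form["action"], fields)

# With the names known, a Session keeps the cookie across requests
# (hypothetical URLs; adjust to your server):
# import requests
# with requests.Session() as s:
#     s.post("http://localhost:8080/j_security_check",
#            data={"j_username": "user", "j_password": "pass"})
#     page = s.get("http://localhost:8080/")
```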

Scrape website data, insert into an Excel cell, then move on to next

牧云@^-^@ submitted on 2021-02-08 08:22:26
Question: My project is to insert a car reg into the tax and MOT website, click the buttons, load the page, and then take the dates. An issue I had is extracting data within a strong li element, which is the date/dates for the tax and MOT, which I need in two cells.

Sub searchbot()
    'dimension (declare or set aside memory for) our variables
    Dim objIE As InternetExplorer 'special object variable representing the IE browser
    Dim liEle As HTMLLinkElement 'special object variable for an <li> (link)
</gr-replace>
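The extraction step itself — pulling the text of a <strong> inside each <li> — is the crux; the VBA equivalent would walk the document's li elements via getElementsByTagName. A runnable Python sketch of that step, on invented sample markup resembling the tax/MOT result panels:

```python
from html.parser import HTMLParser

class StrongInLi(HTMLParser):
    """Collect text that sits inside <li><strong>...</strong></li>."""
    def __init__(self):
        super().__init__()
        self.in_li = self.in_strong = False
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
        elif tag == "strong" and self.in_li:
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False
        elif tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.dates.append(data.strip())

# Invented sample resembling the tax/MOT result panels.
html = ("<ul><li>Tax due: <strong>1 March 2021</strong></li>"
        "<li>MOT due: <strong>14 June 2021</strong></li></ul>")
p = StrongInLi()
p.feed(html)
print(p.dates)  # one entry per Excel cell to fill
```

Each collected string then maps to one worksheet cell, e.g. Cells(row, 1) for tax and Cells(row, 2) for MOT.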

Can't find a way to scrape the resultant table after search using Selenium through Python

与世无争的帅哥 submitted on 2021-02-08 08:20:37
Question: I've been doing web scraping with BeautifulSoup, Selenium and Scrapy for a few months, mainly for research purposes. After ups and downs I always managed to achieve my web-scraping objectives (a lot of them thanks to this site), until I faced this site: https://euclid.eba.europa.eu/register/cir/search. The page uses JavaScript and needs to be rendered in order to get the results. With Selenium, I managed to click on Continue, select the EEA-Branch type, and click on Search, but after getting the page
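The usual pattern after the clicks is an explicit wait until the JS-rendered results table exists, then handing driver.page_source to an ordinary parser. The wait needs a live browser, so it is left as comments; the parsing step is runnable on a stand-in snippet (the rows below are invented):

```python
from bs4 import BeautifulSoup

# With Selenium, wait until the rendered table appears, e.g.:
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# WebDriverWait(driver, 20).until(
#     EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr")))
# html = driver.page_source

# Stand-in for driver.page_source after rendering (invented rows):
html = """<table><tbody>
  <tr><td>Entity A</td><td>DE</td></tr>
  <tr><td>Entity B</td><td>FR</td></tr>
</tbody></table>"""

rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in BeautifulSoup(html, "html.parser").find_all("tr")]
print(rows)
```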

Retrieve Amazon Reviews for a particular product

会有一股神秘感。 submitted on 2021-02-08 08:08:09
Question: I'm currently working on a research project which needs to analyze reviews of a particular product and get an overall idea of the product. I heard that Amazon is a good place to get product reviews/comments. Is there any way to retrieve those user reviews/comments from Amazon via an API? I tried several Python scripts but they don't work. Do I need to write a spider if there is no API to retrieve the data? Are there any approaches/places to retrieve user reviews for a given product? Answer 1: www
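Amazon does not offer a public API that returns full review text, so the practical options are third-party data providers or parsing the review pages yourself, subject to Amazon's terms of service and anti-bot measures. A sketch of the parsing step on an invented saved-page snippet — the data-hook="review-body" selector reflects Amazon's markup at the time of writing, but treat it as an assumption that can change:

```python
from bs4 import BeautifulSoup

# Invented stand-in for a saved review page.
html = """
<div data-hook="review-body"><span>Great battery life.</span></div>
<div data-hook="review-body"><span>Screen scratches easily.</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
reviews = [div.get_text(strip=True)
           for div in soup.find_all(attrs={"data-hook": "review-body"})]
print(reviews)
```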

R download.file with “wget”-method and specifying extra wget options

与世无争的帅哥 submitted on 2021-02-08 07:52:38
Question: I have a probably rather basic question about using the download.file function in R with the wget method and some of wget's extra options, but I just cannot get it to work. What I want to do: download a local copy of a webpage (actually several webpages, but for now the challenge is getting it to work with even one). Challenge: I need the local copy to look exactly like the online version, which also means including links, icons, etc. I found wget to be a good tool for this, and I
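The wget flags that make a self-contained local copy are -p (fetch page requisites such as images and CSS), -k (rewrite links for local viewing), and -E (add .html extensions); in R these would go through download.file(url, destfile, method = "wget", extra = "-p -k -E"). A small Python helper that assembles the equivalent command line without running it (wget itself is not invoked here):

```python
def build_wget_cmd(url, extra=("-p", "-k", "-E")):
    """Assemble a wget invocation that mirrors one page with its
    requisites (-p), rewrites links for local viewing (-k), and adds
    .html extensions (-E). Returns the argv list without running it."""
    return ["wget", *extra, url]

cmd = build_wget_cmd("http://example.com/page")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```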

Scrapy FormRequest login not working

僤鯓⒐⒋嵵緔 submitted on 2021-02-08 07:51:41
Question: I'm trying to log in with Scrapy but keep receiving lots of "Redirecting (302)" messages. This happens when I use my real login and also with fake login info. I also tried it with another site and still had no luck.

import scrapy
from scrapy.http import FormRequest, Request

class LoginSpider(scrapy.Spider):
    name = 'SOlogin'
    allowed_domains = ['stackoverflow.com']
    login_url = 'https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f'
    test_url = 'http:/
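A frequent cause of the 302 loop is posting only the username and password while the login form carries hidden anti-CSRF fields (Stack Overflow's form includes an fkey token). Scrapy's FormRequest.from_response(response, formdata=...) merges those hidden fields in automatically; the extraction it performs looks roughly like this (the sample form is invented):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the login page; the hidden "fkey" token is the
# part a plain POST of email/password would miss.
html = """
<form action="/users/login" method="post">
  <input type="hidden" name="fkey" value="abc123"/>
  <input type="email" name="email"/>
  <input type="password" name="password"/>
</form>
"""

hidden = {inp["name"]: inp.get("value", "")
          for inp in BeautifulSoup(html, "html.parser")
                     .find_all("input", type="hidden")}
print(hidden)

# In the spider, let Scrapy merge these for you:
# yield FormRequest.from_response(
#     response, formdata={"email": "...", "password": "..."},
#     callback=self.after_login)
```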

NameError: global name 'NAME' is not defined

好久不见. submitted on 2021-02-08 07:04:14
Question: I have been having an interesting time building a little web scraper, and I think I am doing something wrong with my variable or function scope. Whenever I try to pull some of the functionality out into separate functions, it gives me NameError: global name 'NAME' is not defined. I see that a lot of people are having a similar problem, but there seems to be a lot of variation with the same error and I can't figure it out.

import urllib2, sys, urlparse, httplib, imageInfo
from BeautifulSoup
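This error typically means a name assigned inside one function is being read in another function, where it was never defined: local names do not leak between functions. The fix is to pass values as parameters and return results. A minimal before/after sketch (function names are invented for illustration):

```python
# Broken: fetch() assigns `soup` locally, so parse() cannot see it and
# raises NameError: global name 'soup' is not defined.
#
# def fetch(url):
#     soup = make_soup(url)
# def parse():
#     return soup.title        # NameError here

# Fixed: each function returns what the next one needs.
def fetch(url):
    # stand-in for downloading and soup-ifying the page
    return {"title": "Example", "url": url}

def parse(soup):
    return soup["title"]

page = fetch("http://example.com")
print(parse(page))  # "Example"
```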

Scrapy simulate XHR request - returning 400

扶醉桌前 submitted on 2021-02-08 06:59:51
Question: I'm trying to get data from a site using Ajax. The page loads, and then JavaScript requests the content. See this page for details: https://www.tele2.no/mobiltelefon.aspx. The problem is that when I try to simulate this process by calling this URL: https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters I get a 400 response telling me that the request is not allowed. This is my code:

# -*- coding: utf-8 -*-
import scrapy
import json

class Tele2Spider(scrapy.Spider):
    name =
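A 400 from a .svc (WCF) endpoint usually means the POST body or headers don't match what the page's JavaScript sends: such services typically expect Content-Type: application/json and a JSON-encoded body rather than form data. A sketch of building such a request — the payload keys here are invented, so copy the real ones from the browser's network tab:

```python
import json

def build_xhr(url, payload):
    """Return (headers, body) matching what an Ajax call to a JSON
    service sends; scrapy.Request(url, method="POST", headers=headers,
    body=body) would then carry them."""
    headers = {
        "Content-Type": "application/json; charset=utf-8",
        "X-Requested-With": "XMLHttpRequest",  # many endpoints check this
    }
    return headers, json.dumps(payload)

headers, body = build_xhr(
    "https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters",
    {"filters": []},  # invented payload shape
)
print(headers["Content-Type"], body)
```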