scrape

Scraping a non-RSS page to generate a feed

Submitted by  ̄綄美尐妖づ on 2019-12-11 03:26:25
Question: I want to scrape a page that updates regularly (adding new articles with exactly the same structure as previous ones) in order to generate an RSS feed. I can write the code to analyse the page easily, but how do I emulate a ping, i.e. when the page updates, how can my PHP script know? Does it have to be a cron job? (Probably a duplicate question, I know, but I searched for a direct answer with no luck. The closest I got was Scrape and generate RSS feed, which has a scraping script but no info on how…
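A cron job is the usual answer: there is no push mechanism from an arbitrary HTML page, so a scheduled script fetches the page and compares it against what it saw last time. A minimal sketch of the change-detection part (Python shown as a stand-in for the PHP script; the function names are hypothetical):

```python
import hashlib

def page_fingerprint(html):
    """Hash the fetched page body so successive cron runs can compare it
    against the hash stored from the previous run."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def has_updated(html, last_hash):
    """True when the page differs from the previously stored fingerprint
    (or when there is no stored fingerprint yet)."""
    return page_fingerprint(html) != last_hash
```

A cron entry such as `*/15 * * * * php scrape.php` would run the check every 15 minutes; only when `has_updated` is true does the script re-parse the articles and rewrite the feed. Hashing only the article-list region of the page, rather than the whole document, avoids false positives from rotating ads or timestamps.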

Unable to get results from jsoup when sending a POST request

Submitted by 半城伤御伤魂 on 2019-12-11 03:13:51
Question: This is the code snippet; it always returns the error page: try { String url = "http://kepler.sos.ca.gov/"; Connection.Response response = Jsoup.connect(url) .method(Connection.Method.GET) .execute(); Document responseDocument = response.parse(); Element eventValidation = responseDocument.select("input[name=__EVENTVALIDATION]").first(); Element viewState = responseDocument.select("input[name=__VIEWSTATE]").first(); response = Jsoup.connect(url) .data("__VIEWSTATE", viewState.attr("value")) .data(…
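The snippet follows the right pattern for an ASP.NET form: GET the page, harvest the hidden `__VIEWSTATE` and `__EVENTVALIDATION` tokens, and echo them back in the POST (sites typically also require the session cookies from the first response, via `response.cookies()` in jsoup). The token-harvesting step can be sketched like this (Python shown as a stand-in for the jsoup code):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects the name/value pairs of <input type="hidden"> elements,
    e.g. __VIEWSTATE and __EVENTVALIDATION on ASP.NET pages."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

def extract_hidden_fields(html):
    """Return every hidden form field as a dict to merge into the POST data."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields
```

The same idea in jsoup is selecting `input[type=hidden]` and adding each `attr("name")`/`attr("value")` pair with `.data(...)` before `.method(Connection.Method.POST).execute()`.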

Scrape a website (javascript website) using php

Submitted by 三世轮回 on 2019-12-10 11:18:40
Question: I am trying to scrape a website (I believe it uses JavaScript) using a simple PHP script. I am a beginner, so any help would be greatly appreciated. The URL of the webpage is: http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet/Yes-Bank-Ltd/532648 So here, for example, I would like to pass the name of the company (Yes-Bank-Ltd) and the code (532648) into file_get_contents. Not sure how to do it, so can somebody please help? Thanks, Nidhi Answer 1: Why not just append the string…
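As the truncated answer suggests, the URL is just a string template: substitute the company slug and code into the fixed path. A sketch of the idea (Python shown as a stand-in for the PHP `file_get_contents` call; the function name is hypothetical):

```python
def balance_sheet_url(company, code):
    """Build the Balance-Sheet URL from a company slug and numeric code,
    mirroring the path structure of the URL in the question."""
    base = "http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet"
    return f"{base}/{company}/{code}"
```

The PHP equivalent is string interpolation into the argument of `file_get_contents("...Balance-Sheet/$company/$code")`. Note that if the figures on the page are rendered by JavaScript, a plain HTTP fetch returns only the initial HTML, so the data may need to come from the page source or a different endpoint.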

rvest - scraping 2 classes in 1 tag

Submitted by 非 Y 不嫁゛ on 2019-12-10 06:47:17
Question: I am new to rvest. How do I extract elements that have two class names, or only one class name, in a tag? This is my code and the issue: doc <- paste("<html>", "<body>", "<span class='a1 b1'> text1 </span>", "<span class='b1'> text2 </span>", "</body>", "</html>" ) library(rvest) read_html(doc) %>% html_nodes(".b1") %>% html_text() #output: text1, text2 #what I want: text2 #I also want to extract only elements with 2 class names read_html(doc) %>% html_nodes(".a1 .b1") %>% html_text() # Output that I…
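The issue is CSS-selector syntax: `.a1 .b1` (with a space) is a descendant selector, while `.a1.b1` (no space) matches a single element carrying both classes; `.b1:not(.a1)` selects elements with `b1` but not `a1`. rvest's `html_nodes` accepts these CSS selectors. The set logic behind them can be sketched as follows (Python shown as a stand-in for the R code; the helper names are hypothetical):

```python
from html.parser import HTMLParser

class SpanClassCollector(HTMLParser):
    """Records each <span>'s class list as a set, paired with its text."""
    def __init__(self):
        super().__init__()
        self.spans = []        # list of (class_set, text)
        self._classes = None   # class set of the currently open span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._classes = set(dict(attrs).get("class", "").split())

    def handle_data(self, data):
        if self._classes is not None and data.strip():
            self.spans.append((self._classes, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._classes = None

def texts_with_classes(html, required, forbidden=frozenset()):
    """Texts of spans whose class set contains all `required` classes
    and none of the `forbidden` ones -- the logic of .a1.b1 / .b1:not(.a1)."""
    collector = SpanClassCollector()
    collector.feed(html)
    return [text for classes, text in collector.spans
            if required <= classes and not (forbidden & classes)]
```

So in rvest, `html_nodes(".a1.b1")` should give text1 and `html_nodes(".b1:not(.a1)")` should give text2.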

How to scrape data in an authenticated session within a dynamic page?

Submitted by 我怕爱的太早我们不能终老 on 2019-12-08 08:31:05
Question: I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/), taking this post as a reference for dynamic webpages. This is the code: class MySpider(CrawlSpider): login_user = 'myusername' login_pass = 'mypassword' name = "tv" allowed_domains = [] start_urls = ["https://twitter.com/Acrocephalus/followers"] rules = ( Rule(SgmlLinkExtractor(allow=('https://twitter\.com/.*')), callback='parse_items', follow=True), ) def…
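Two separate things have to hold for this to work: the session cookies obtained at login must be carried on every later request (Scrapy does this automatically per spider), and the content must actually be present in the HTML rather than loaded by JavaScript afterwards, which for a dynamic page like a Twitter followers list usually is not the case. The session half of the problem, stripped to stdlib Python as an illustration (not the Scrapy mechanism itself):

```python
import http.cookiejar
import urllib.request

def make_session_opener():
    """Opener that stores cookies from responses and replays them on
    subsequent requests, so pages fetched after a login POST stay inside
    the authenticated session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

For the JavaScript half, the spider either has to render the page (e.g. via a headless browser) or target whatever endpoint the page's scripts fetch the data from; scraping the raw HTML of a dynamic page yields only the empty shell.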

Foreign Keys on Scrapy

Submitted by 懵懂的女人 on 2019-12-08 04:17:56
Question: I'm doing a scrape with Scrapy, and my model in Django is: class Creative(models.Model): name = models.CharField(max_length=200) picture = models.CharField(max_length=200, null = True) class Project(models.Model): title = models.CharField(max_length=200) description = models.CharField(max_length=500, null = True) creative = models.ForeignKey(Creative) class Image(models.Model): url = models.CharField(max_length=500) project = models.ForeignKey(Project) And my Scrapy model: from scrapy.contrib…
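With foreign keys, the pipeline has to create (or look up) the parent rows before it can attach the children: a Creative before its Projects, a Project before its Images. In Django that is the `get_or_create` pattern inside the item pipeline; here is the same dependency order sketched with plain dicts standing in for the models (the item field names are assumptions, not from the question):

```python
def import_item(db, item):
    """Insert one scraped item, creating parent records first so child
    records can reference them (dict stand-in for Django get_or_create)."""
    creative = db["creatives"].setdefault(
        item["creative_name"], {"name": item["creative_name"]})
    project = db["projects"].setdefault(
        item["title"], {"title": item["title"], "creative": creative})
    for url in item["image_urls"]:
        db["images"].append({"url": url, "project": project})
```

`setdefault` plays the role of `get_or_create`: a second item by the same creative reuses the existing parent instead of duplicating it.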

How do I scrape an https page? [duplicate]

Submitted by 送分小仙女□ on 2019-12-07 18:48:59
Question: This question already has answers here: Python Requests throwing SSLError (22 answers). Closed 5 years ago. I'm using a Python script with 'lxml' and 'requests' to scrape a web page. My goal is to grab an element from a page and download it, but the content is on an HTTPS page and I'm getting an error when trying to access the stuff on the page. I'm sure there is some kind of certificate or authentication I have to include, but I'm struggling to find the right resources. I'm using: page =…
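With requests, the usual fixes are pointing `verify=` at the right CA bundle (`requests.get(url, verify="/path/to/ca.pem")`) or, as a debugging-only shortcut, `verify=False` to skip certificate checks entirely. The stdlib equivalent of that shortcut, shown as an illustration of what "skipping verification" means (disabling verification exposes you to man-in-the-middle attacks, so it is a diagnostic, not a fix):

```python
import ssl

def insecure_context():
    """SSL context with certificate verification disabled -- the stdlib
    analogue of requests' verify=False. check_hostname must be cleared
    before verify_mode is relaxed."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The context is then passed to `urllib.request.urlopen(url, context=ctx)`. If this makes the SSLError disappear, the real problem is the local certificate store, and the durable fix is installing or referencing the correct CA bundle.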

Importing URLs for jsoup to scrape via spreadsheet

Submitted by 早过忘川 on 2019-12-06 16:17:48
Question: I finally got IntelliJ to work. I'm using the code below, and it works perfectly. I need it to loop over and over, pulling links from a spreadsheet to find the price of different items. I have a spreadsheet with a few sample URLs in column C, starting at row 2. How can I have jsoup use the URLs in this spreadsheet and then output to column D? public class Scraper { public static void main(String[] args) throws Exception { final Document document = Jsoup.connect("examplesite…
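The loop itself is independent of jsoup: read the rows, take the URL from column C (index 2), scrape, and write the result into column D (index 3). A sketch of that loop (Python as a stand-in for the Java code; in Java the spreadsheet side would typically be a CSV export or Apache POI for .xlsx, and `fetch_price` stands in for the jsoup call):

```python
def scrape_prices(rows, fetch_price):
    """For each data row, read the URL in column C (index 2) and write the
    scraped price into column D (index 3). `fetch_price` is the stand-in
    for the per-URL jsoup/HTTP scrape."""
    out = []
    for row in rows:
        row = list(row) + [""] * (4 - len(row))  # pad so column D exists
        url = row[2]
        if url:
            row[3] = fetch_price(url)
        out.append(row)
    return out
```

The caller passes only the data rows (the question's URLs start at row 2, i.e. below the header), and writes the returned rows back to the sheet.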

How many results does Google allow a request to scrape?

Submitted by 我与影子孤独终老i on 2019-12-06 11:54:26
Question: The following PHP code works fine, but when it is used to scrape 1000 Google results for a specified keyword, it only returns 100 results. Does Google have a limit on results returned, or is there a different problem? <?php require_once ("header.php"); $data2 = getContent("http://www.google.de/search?q=auch&hl=de&num=100&gl=de&ix=nh&sourceid=chrome&ie=UTF-8"); $dom = new DOMDocument(); @$dom->loadHtml($data2); $xpath = new DOMXPath($dom); $hrefs = $xpath->evaluate("//div[@id='ires']//li/h3/a/…
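Google's `num` parameter has historically been capped at 100 results per request, so a single fetch cannot return 1000; getting more means paginating with the `start` parameter (0, 100, 200, …). A sketch of the pagination (Python as a stand-in for the PHP loop; note that automated scraping of Google results violates its terms of service and tends to get rate-limited):

```python
def google_result_urls(query, total=1000, per_page=100):
    """Build the sequence of search URLs needed to page through `total`
    results, `per_page` (max 100) at a time, via the start parameter."""
    base = "http://www.google.de/search"
    return [f"{base}?q={query}&num={per_page}&start={start}"
            for start in range(0, total, per_page)]
```

Each URL in the list is fetched and parsed as in the existing code; 1000 results thus takes ten requests, not one.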

Importing/scraping a website into Excel

Submitted by 人盡茶涼 on 2019-12-06 11:10:46
Question: I am trying to scrape some data from a database, and I have it pretty much set. I look in IE for a tab where I am logged into the database, and paste the query link there through VBA. But how do I extract the data it returns from the IE tab and put it into an Excel cell or array? This is the code I have for opening my query: Sub import() Dim row As Integer Dim strTargetFile As String Dim wb As Workbook Dim test As String Dim ie As Object Call Fill_Array_Cultivar For row = 3 To 4…
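In VBA the usual pattern is to walk `ie.document`'s table elements and copy each cell's text into the worksheet. The flattening step, an HTML table becoming a 2-D array of cell strings ready for cells, can be sketched as follows (Python as a stand-in for the VBA DOM loop):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Flattens table markup into a list of rows, each a list of cell
    texts -- the shape needed to paste into worksheet cells."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_rows(html):
    """Return the table's cells as a list of rows of strings."""
    extractor = TableExtractor()
    extractor.feed(html)
    return extractor.rows
```

The VBA equivalent iterates `ie.document.getElementsByTagName("table")(0).Rows`, then each row's `Cells`, assigning `cell.innerText` into `Worksheets(...).Cells(r, c)`.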