scrape

Scraping a non-RSS page to generate a feed

Submitted by  ̄綄美尐妖づ on 2019-12-11 03:26:25
Question: I want to scrape a page that updates regularly (adding new articles with exactly the same structure as previous ones) in order to generate an RSS feed. I can write the code to analyse the page easily, but how do I emulate a ping, i.e. when the page updates, how can my PHP script know? Does it have to be a cron job? (Probably a duplicate question, I know, but I searched for a direct answer with no luck. The closest I got was Scrape and generate RSS feed, which has a scraping script but no info on how…
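A cron job is the usual answer: there is no push mechanism from an arbitrary HTML page, so a scheduled script fetches the page and compares it against what it saw last time. A minimal sketch of the change-detection part (Python shown as a stand-in for the PHP script; the function names are hypothetical):

```python
import hashlib

def page_fingerprint(html):
    """Hash the fetched page body so successive cron runs can compare it
    against the hash stored from the previous run."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def has_updated(html, last_hash):
    """True when the page differs from the previously stored fingerprint
    (or when there is no stored fingerprint yet)."""
    return page_fingerprint(html) != last_hash
```

A cron entry such as `*/15 * * * * php scrape.php` would run the check every 15 minutes; only when `has_updated` is true does the script re-parse the articles and rewrite the feed. Hashing only the article-list region of the page, rather than the whole document, avoids false positives from rotating ads or timestamps.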

Unable to get results from jsoup when sending a POST request

Submitted by 半城伤御伤魂 on 2019-12-11 03:13:51
Question: This is the code snippet; it always returns the error page: try { String url = "http://kepler.sos.ca.gov/"; Connection.Response response = Jsoup.connect(url) .method(Connection.Method.GET) .execute(); Document responseDocument = response.parse(); Element eventValidation = responseDocument.select("input[name=__EVENTVALIDATION]").first(); Element viewState = responseDocument.select("input[name=__VIEWSTATE]").first(); response = Jsoup.connect(url) .data("__VIEWSTATE", viewState.attr("value")) .data(…
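The snippet follows the right pattern for an ASP.NET form: GET the page, harvest the hidden `__VIEWSTATE` and `__EVENTVALIDATION` tokens, and echo them back in the POST (sites typically also require the session cookies from the first response, via `response.cookies()` in jsoup). The token-harvesting step can be sketched like this (Python shown as a stand-in for the jsoup code):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects the name/value pairs of <input type="hidden"> elements,
    e.g. __VIEWSTATE and __EVENTVALIDATION on ASP.NET pages."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

def extract_hidden_fields(html):
    """Return every hidden form field as a dict to merge into the POST data."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields
```

The same idea in jsoup is selecting `input[type=hidden]` and adding each `attr("name")`/`attr("value")` pair with `.data(...)` before `.method(Connection.Method.POST).execute()`.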

Scrape a website (javascript website) using php

Submitted by 三世轮回 on 2019-12-10 11:18:40
Question: I am trying to scrape a website (I believe it uses JavaScript) using a simple PHP script. I am a beginner, so any help would be greatly appreciated. The URL of the webpage is: http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet/Yes-Bank-Ltd/532648 So here, for example, I would like to pass the name of the company (Yes-Bank-Ltd) and the code (532648) into file_get_contents. Not sure how to do it, so can somebody please help? Thanks, Nidhi Answer 1: Why not just append the string…
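As the truncated answer suggests, the URL is just a string template: substitute the company slug and code into the fixed path. A sketch of the idea (Python shown as a stand-in for the PHP `file_get_contents` call; the function name is hypothetical):

```python
def balance_sheet_url(company, code):
    """Build the Balance-Sheet URL from a company slug and numeric code,
    mirroring the path structure of the URL in the question."""
    base = "http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet"
    return f"{base}/{company}/{code}"
```

The PHP equivalent is string interpolation into the argument of `file_get_contents("...Balance-Sheet/$company/$code")`. Note that if the figures on the page are rendered by JavaScript, a plain HTTP fetch returns only the initial HTML, so the data may need to come from the page source or a different endpoint.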

rvest - scraping 2 classes in 1 tag

Submitted by 非 Y 不嫁゛ on 2019-12-10 06:47:17
Question: I am new to rvest. How do I extract elements that have two class names, or only one class name, in a tag? This is my code and the issue: doc <- paste("<html>", "<body>", "<span class='a1 b1'> text1 </span>", "<span class='b1'> text2 </span>", "</body>", "</html>" ) library(rvest) read_html(doc) %>% html_nodes(".b1") %>% html_text() #output: text1, text2 #what I want: text2 #I also want to extract only elements with 2 class names read_html(doc) %>% html_nodes(".a1 .b1") %>% html_text() # Output that I…
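The issue is CSS-selector syntax: `.a1 .b1` (with a space) is a descendant selector, while `.a1.b1` (no space) matches a single element carrying both classes; `.b1:not(.a1)` selects elements with `b1` but not `a1`. rvest's `html_nodes` accepts these CSS selectors. The set logic behind them can be sketched as follows (Python shown as a stand-in for the R code; the helper names are hypothetical):

```python
from html.parser import HTMLParser

class SpanClassCollector(HTMLParser):
    """Records each <span>'s class list as a set, paired with its text."""
    def __init__(self):
        super().__init__()
        self.spans = []        # list of (class_set, text)
        self._classes = None   # class set of the currently open span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._classes = set(dict(attrs).get("class", "").split())

    def handle_data(self, data):
        if self._classes is not None and data.strip():
            self.spans.append((self._classes, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._classes = None

def texts_with_classes(html, required, forbidden=frozenset()):
    """Texts of spans whose class set contains all `required` classes
    and none of the `forbidden` ones -- the logic of .a1.b1 / .b1:not(.a1)."""
    collector = SpanClassCollector()
    collector.feed(html)
    return [text for classes, text in collector.spans
            if required <= classes and not (forbidden & classes)]
```

So in rvest, `html_nodes(".a1.b1")` should give text1 and `html_nodes(".b1:not(.a1)")` should give text2.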

How to scrape data in an authenticated session within a dynamic page?

Submitted by 我怕爱的太早我们不能终老 on 2019-12-08 08:31:05
Question: I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/), taking this post as a reference for dynamic webpages. This is the code: class MySpider(CrawlSpider): login_user = 'myusername' login_pass = 'mypassword' name = "tv" allowed_domains = [] start_urls = ["https://twitter.com/Acrocephalus/followers"] rules = ( Rule(SgmlLinkExtractor(allow=('https://twitter\.com/.*')), callback='parse_items', follow=True), ) def…
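Two separate things have to hold for this to work: the session cookies obtained at login must be carried on every later request (Scrapy does this automatically per spider), and the content must actually be present in the HTML rather than loaded by JavaScript afterwards, which for a dynamic page like a Twitter followers list usually is not the case. The session half of the problem, stripped to stdlib Python as an illustration (not the Scrapy mechanism itself):

```python
import http.cookiejar
import urllib.request

def make_session_opener():
    """Opener that stores cookies from responses and replays them on
    subsequent requests, so pages fetched after a login POST stay inside
    the authenticated session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

For the JavaScript half, the spider either has to render the page (e.g. via a headless browser) or target whatever endpoint the page's scripts fetch the data from; scraping the raw HTML of a dynamic page yields only the empty shell.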

Foreign Keys on Scrapy

Submitted by 懵懂的女人 on 2019-12-08 04:17:56
Question: I'm doing a scrape with Scrapy, and my model in Django is: class Creative(models.Model): name = models.CharField(max_length=200) picture = models.CharField(max_length=200, null = True) class Project(models.Model): title = models.CharField(max_length=200) description = models.CharField(max_length=500, null = True) creative = models.ForeignKey(Creative) class Image(models.Model): url = models.CharField(max_length=500) project = models.ForeignKey(Project) And my Scrapy model: from scrapy.contrib…
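With foreign keys, the pipeline has to create (or look up) the parent rows before it can attach the children: a Creative before its Projects, a Project before its Images. In Django that is the `get_or_create` pattern inside the item pipeline; here is the same dependency order sketched with plain dicts standing in for the models (the item field names are assumptions, not from the question):

```python
def import_item(db, item):
    """Insert one scraped item, creating parent records first so child
    records can reference them (dict stand-in for Django get_or_create)."""
    creative = db["creatives"].setdefault(
        item["creative_name"], {"name": item["creative_name"]})
    project = db["projects"].setdefault(
        item["title"], {"title": item["title"], "creative": creative})
    for url in item["image_urls"]:
        db["images"].append({"url": url, "project": project})
```

`setdefault` plays the role of `get_or_create`: a second item by the same creative reuses the existing parent instead of duplicating it.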

How do I scrape an https page? [duplicate]

Submitted by 送分小仙女□ on 2019-12-07 18:48:59
Question: This question already has answers here: Python Requests throwing SSLError (22 answers). Closed 5 years ago. I'm using a Python script with 'lxml' and 'requests' to scrape a web page. My goal is to grab an element from a page and download it, but the content is on an HTTPS page and I'm getting an error when trying to access the stuff on the page. I'm sure there is some kind of certificate or authentication I have to include, but I'm struggling to find the right resources. I'm using: page =…
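With requests, the usual fixes are pointing `verify=` at the right CA bundle (`requests.get(url, verify="/path/to/ca.pem")`) or, as a debugging-only shortcut, `verify=False` to skip certificate checks entirely. The stdlib equivalent of that shortcut, shown as an illustration of what "skipping verification" means (disabling verification exposes you to man-in-the-middle attacks, so it is a diagnostic, not a fix):

```python
import ssl

def insecure_context():
    """SSL context with certificate verification disabled -- the stdlib
    analogue of requests' verify=False. check_hostname must be cleared
    before verify_mode is relaxed."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx
```

The context is then passed to `urllib.request.urlopen(url, context=ctx)`. If this makes the SSLError disappear, the real problem is the local certificate store, and the durable fix is installing or referencing the correct CA bundle.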

Importing URLs for jsoup to scrape via spreadsheet

Submitted by 早过忘川 on 2019-12-06 16:17:48
Question: I finally got IntelliJ to work. I'm using the code below, and it works perfectly. I need it to loop over and over, pulling links from a spreadsheet to find the price of different items. I have a spreadsheet with a few sample URLs in column C, starting at row 2. How can I have jsoup use the URLs in this spreadsheet and then output to column D? public class Scraper { public static void main(String[] args) throws Exception { final Document document = Jsoup.connect("examplesite…
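The loop itself is independent of jsoup: read the rows, take the URL from column C (index 2), scrape, and write the result into column D (index 3). A sketch of that loop (Python as a stand-in for the Java code; in Java the spreadsheet side would typically be a CSV export or Apache POI for .xlsx, and `fetch_price` stands in for the jsoup call):

```python
def scrape_prices(rows, fetch_price):
    """For each data row, read the URL in column C (index 2) and write the
    scraped price into column D (index 3). `fetch_price` is the stand-in
    for the per-URL jsoup/HTTP scrape."""
    out = []
    for row in rows:
        row = list(row) + [""] * (4 - len(row))  # pad so column D exists
        url = row[2]
        if url:
            row[3] = fetch_price(url)
        out.append(row)
    return out
```

The caller passes only the data rows (the question's URLs start at row 2, i.e. below the header), and writes the returned rows back to the sheet.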

How many results does Google allow a request to scrape?

Submitted by 我与影子孤独终老i on 2019-12-06 11:54:26
Question: The following PHP code works fine, but when it is used to scrape 1000 Google results for a specified keyword, it only returns 100 results. Does Google have a limit on results returned, or is there a different problem? <?php require_once ("header.php"); $data2 = getContent("http://www.google.de/search?q=auch&hl=de&num=100&gl=de&ix=nh&sourceid=chrome&ie=UTF-8"); $dom = new DOMDocument(); @$dom->loadHtml($data2); $xpath = new DOMXPath($dom); $hrefs = $xpath->evaluate("//div[@id='ires']//li/h3/a/…
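Google's `num` parameter has historically been capped at 100 results per request, so a single fetch cannot return 1000; getting more means paginating with the `start` parameter (0, 100, 200, …). A sketch of the pagination (Python as a stand-in for the PHP loop; note that automated scraping of Google results violates its terms of service and tends to get rate-limited):

```python
def google_result_urls(query, total=1000, per_page=100):
    """Build the sequence of search URLs needed to page through `total`
    results, `per_page` (max 100) at a time, via the start parameter."""
    base = "http://www.google.de/search"
    return [f"{base}?q={query}&num={per_page}&start={start}"
            for start in range(0, total, per_page)]
```

Each URL in the list is fetched and parsed as in the existing code; 1000 results thus takes ten requests, not one.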

Importing/scraping a website into Excel

Submitted by 人盡茶涼 on 2019-12-06 11:10:46
Question: I am trying to scrape some data from a database, and I have it pretty much set. I look in IE for a tab where I am logged into the database, and paste the query link there through VBA. But how do I extract the data it returns from the IE tab and put it into an Excel cell or array? This is the code I have for opening my query: Sub import() Dim row As Integer Dim strTargetFile As String Dim wb As Workbook Dim test As String Dim ie As Object Call Fill_Array_Cultivar For row = 3 To 4…
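In VBA the usual pattern is to walk `ie.document`'s table elements and copy each cell's text into the worksheet. The flattening step, an HTML table becoming a 2-D array of cell strings ready for cells, can be sketched as follows (Python as a stand-in for the VBA DOM loop):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Flattens table markup into a list of rows, each a list of cell
    texts -- the shape needed to paste into worksheet cells."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_rows(html):
    """Return the table's cells as a list of rows of strings."""
    extractor = TableExtractor()
    extractor.feed(html)
    return extractor.rows
```

The VBA equivalent iterates `ie.document.getElementsByTagName("table")(0).Rows`, then each row's `Cells`, assigning `cell.innerText` into `Worksheets(...).Cells(r, c)`.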