web-crawler

Wildcards in robots.txt

Submitted by Deadly on 2019-12-17 17:13:14
问题 (Question): In a WordPress website I have categories in this hierarchy: Parent → Child → Subchild. Permalinks are set to %category%/%postname%. Let's use an example: I create a post named "Sport game", whose slug is sport-game, so its full URL is domain.com/parent/child/subchild/sport-game. The reason I use this kind of permalink is precisely to make it easier to block some content in robots.txt. And this is the part my question is about. In robots.txt:
User-agent: Googlebot
Disallow: /parent/*
Disallow: /parent/*/*
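The excerpt ends before any answer; as a rough illustration (not from the original post), the sketch below checks a Googlebot-style wildcard pattern against the permalink from the question. It assumes the documented Googlebot behaviour where "*" in a robots.txt rule matches any sequence of characters.

```python
# Illustrative sketch only: emulate Googlebot-style wildcard matching for Disallow rules.
import re

def is_blocked(path: str, disallow_pattern: str) -> bool:
    # Turn the robots.txt pattern into a regex anchored at the start of the path;
    # "*" becomes ".*" (any sequence of characters), everything else is literal.
    regex = "^" + re.escape(disallow_pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

# The permalink from the question:
print(is_blocked("/parent/child/subchild/sport-game", "/parent/*"))    # True
print(is_blocked("/parent/child/subchild/sport-game", "/parent/*/*"))  # True
print(is_blocked("/other-category/sport-game", "/parent/*"))           # False
```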

HtmlAgilityPack HtmlWeb.Load returning empty Document

Submitted by 笑着哭i on 2019-12-17 16:55:14
问题 (Question): I have been using HtmlAgilityPack for the last 2 months in a web crawler application with no issues loading web pages. Now when I try to load this particular web page, the document's OuterHtml is empty, so this test fails:
var url = "http://www.prettygreen.com/";
var htmlWeb = new HtmlWeb();
var htmlDoc = htmlWeb.Load(url);
var outerHtml = htmlDoc.DocumentNode.OuterHtml;
Assert.AreNotEqual("", outerHtml);
I can load another page from the site with no problems, such as setting url = "http://www

Scrapy Python Set up User Agent

Submitted by 被刻印的时光 ゝ on 2019-12-17 15:53:42
问题 (Question): I tried to override the user agent of my CrawlSpider by adding an extra line to the project configuration file. Here is the code:
[settings]
default = myproject.settings
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
[deploy]
#url = http://localhost:6800/
project = myproject
But when I run the crawler against my own website, I noticed that the spider did not pick up my customized user agent but used the default one, "Scrapy/0.18.2 (
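The excerpt cuts off before an answer, but the file shown above is scrapy.cfg, which only tells Scrapy where the settings module lives; Scrapy does not read USER_AGENT from it. A minimal sketch, assuming a standard project layout, of where the setting usually goes instead:

```python
# myproject/settings.py — sketch only, not the asker's actual project.
# Scrapy picks up USER_AGENT from the settings module referenced by scrapy.cfg
# (or from a spider's custom_settings in newer Scrapy versions), not from scrapy.cfg itself.
BOT_NAME = "myproject"

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
)
```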

Submit form with no submit button in rvest

Submitted by 此生再无相见时 on 2019-12-17 14:01:30
问题 (Question): I'm trying to write a crawler to download some information, similar to this Stack Overflow post. The answer is useful for creating the filled-in form, but I'm struggling to find a way to submit the form when a submit button is not part of the form. Here is an example:
session <- html_session("www.chase.com")
form <- html_form(session)[[3]]
filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
session <- submit_form(session, filledform)
At this point, I receive

How to print html source to console with phantomjs

Submitted by 依然范特西╮ on 2019-12-17 10:54:47
问题 (Question): I just downloaded and installed PhantomJS on my machine. I copied and pasted the following script into a file called hello.js:
var page = require('webpage').create();
var url = 'https://www.google.com';
page.onLoadStarted = function () { console.log('Start loading...'); };
page.onLoadFinished = function (status) { console.log('Loading finished.'); phantom.exit(); };
page.open(url);
I'd like to print the complete HTML source (in this case from the Google page) to a file or to the console. How do I

HtmlUnit Only Displays Host HTML Page for GWT App

Submitted by 心不动则不痛 on 2019-12-17 09:52:24
问题 (Question): I am using the HtmlUnit API to add crawler support to my GWT app as follows:
PrintWriter out = null;
try {
    resp.setCharacterEncoding(CHAR_ENCODING);
    resp.setContentType("text/html");
    url = buildUrl(req);
    out = resp.getWriter();
    WebClient webClient = webClientProvider.get();
    // set options
    WebClientOptions options = webClient.getOptions();
    options.setCssEnabled(false);
    options.setThrowExceptionOnScriptError(false);
    options.setThrowExceptionOnFailingStatusCode(false);
    options.setRedirectEnabled

Web crawler that can interpret JavaScript [closed]

Submitted by ↘锁芯ラ on 2019-12-17 08:32:09
问题 (Question, closed as needing more focus and no longer accepting answers): I want to write a web crawler that can interpret JavaScript. Basically it's a program in Java or PHP that takes a URL as input and outputs the DOM tree, similar to the output in Firebug's HTML window. The best example is Kayak.com, where you cannot see the resulting DOM
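The question is closed and the excerpt contains no answer; purely as a general illustration (in Python with Selenium, rather than the Java or PHP the asker mentions), a headless browser can execute a page's JavaScript and hand back the rendered DOM:

```python
# General illustration only: render a JavaScript-heavy page and dump the resulting DOM.
# Assumes the selenium package and a local Chrome/chromedriver install.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")      # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://www.kayak.com")     # example site named in the question
# page_source returns the DOM as it stands after JavaScript has run;
# a real crawler would add explicit waits for late-loading content.
rendered_dom = driver.page_source
print(rendered_dom[:500])
driver.quit()
```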

Click a Button in Scrapy

Submitted by 非 Y 不嫁゛ on 2019-12-17 06:34:49
问题 (Question): I'm using Scrapy to crawl a web page. Some of the information I need only pops up when you click on a certain button (and of course it also appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins), as shown here. But the problem is that there is no form to fill out, so that's not exactly what I need. How can I simply click a button, which then shows the information I need? Do I have to use an external library like mechanize or lxml? 回答1 (Answer 1): Scrapy cannot interpret
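The answer excerpt breaks off at "Scrapy cannot interpret"; the usual continuation of this advice is that, since Scrapy does not run JavaScript, you find the HTTP request the button fires (via the browser's network tools) and send that request yourself. A sketch with a hypothetical endpoint and parameters:

```python
# Hypothetical sketch: reproduce the request a button triggers instead of "clicking" it.
# The URL, form data, and selectors below are placeholders, not from the original post.
import scrapy

class ButtonDataSpider(scrapy.Spider):
    name = "button_data"
    start_urls = ["https://example.com/page"]

    def parse(self, response):
        # Send the same request the button's click handler would send.
        yield scrapy.FormRequest(
            url="https://example.com/ajax/details",   # found in the browser's network tab
            formdata={"item_id": "123"},
            callback=self.parse_details,
        )

    def parse_details(self, response):
        yield {"details": response.css("div.details::text").get()}
```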

Loop through links and download PDFs

Submitted by 假装没事ソ on 2019-12-14 04:07:19
问题 (Question): I have code that has been through several questions here already, and it is getting close to its final version. However, I now have a problem: there is a mistake in the code and part of it is not functioning correctly. The idea is to go through the links and grab PDF files. The links are stored in sLinks (see the line with the comment "Check that links are stored in sLinks"). The code goes forward and the files are saved in C:\temp\, but after 12 PDFs are in the folder I am
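The asker's own code is not shown in full here; as a general sketch of the described workflow only (looping over collected links and saving each PDF into C:\temp\), written in Python rather than the asker's original language:

```python
# General sketch only: download each PDF from a list of links into a local folder.
# The URLs are hypothetical stand-ins for the links collected in sLinks.
import os
import requests

links = [
    "https://example.com/reports/a.pdf",
    "https://example.com/reports/b.pdf",
]
out_dir = r"C:\temp"
os.makedirs(out_dir, exist_ok=True)

for url in links:
    filename = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()              # fail loudly rather than saving an error page
    with open(filename, "wb") as f:
        f.write(resp.content)
```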

Scraping links with Scrapy

Submitted by 好久不见. on 2019-12-14 03:04:23
问题 (Question): I am trying to scrape a Swedish real estate website, www.booli.se. However, I can't figure out how to follow the links for each house and extract, for example, price, rooms, age, etc. I only know how to scrape one page, and I can't seem to wrap my head around this. I am looking to do something like:
for link in website:
    follow link
    attribute1 = item.css('cssobject::text').extract()[1]
    attribute2 = item.css('cssobject::text').extract()[2]
    yield {'Attribute 1': attribute1, 'Attribute 2': attribute2}
So
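The excerpt ends before any answer; below is a minimal Scrapy sketch of the "follow each listing, then extract details" pattern the asker describes (the CSS selectors and field names are placeholders, not booli.se's real markup):

```python
# Sketch of the follow-then-parse pattern; selectors and field names are hypothetical.
import scrapy

class ListingsSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["https://www.booli.se/"]

    def parse(self, response):
        # Follow every listing link found on the index page.
        for href in response.css("a.listing-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_listing)

    def parse_listing(self, response):
        # Extract per-house attributes on the detail page.
        yield {
            "price": response.css("span.price::text").get(),
            "rooms": response.css("span.rooms::text").get(),
        }
```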