web-crawler

Running multiple spiders using Scrapyd

Submitted by 不羁的心 on 2019-12-06 03:30:26
I have multiple spiders in my project, so I decided to run them by uploading the project to a Scrapyd server. I uploaded the project successfully, and I can see all the spiders when I run the command curl http://localhost:6800/listspiders.json?project=myproject. When I run the following command, curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2, only one spider runs, because only one spider is given. But I want to run multiple spiders here, so is the following command right for running multiple spiders in Scrapyd? curl http://localhost:6800/schedule.json -d project=myproject -d
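
Scrapyd's schedule.json endpoint starts one spider per request, so the usual approach is simply one scheduling call per spider, either as separate curl commands or as a small loop. A minimal sketch, assuming the same local Scrapyd instance and hypothetical spider names, using Python's requests library:

    import requests

    SCRAPYD_SCHEDULE = "http://localhost:6800/schedule.json"
    spiders = ["spider1", "spider2", "spider3"]   # hypothetical spider names

    for name in spiders:
        # schedule.json starts one spider per request, so call it once per spider
        resp = requests.post(SCRAPYD_SCHEDULE, data={"project": "myproject", "spider": name})
        print(name, resp.json())

The equivalent with curl is one schedule.json request per spider name.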

Is there a way to download a part of a webpage, rather than the whole HTML body, programmatically?

Submitted by 左心房为你撑大大i on 2019-12-06 03:23:58
We only want a particular element from the HTML document at nytimes.com/technology. This page contains many articles, but we only want each article's title, which sits inside a specific element. If we use wget, cURL, or any other tool, or a package like requests in Python, the whole HTML document is returned. Can we limit the returned data to a specific element, such as those title elements? The HTTP protocol knows nothing about HTML or the DOM. Using HTTP you can fetch partial documents from supporting web servers by sending a Range header, but you'll need to know the byte offsets of the data you want. The short answer is that
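
As the answer notes, HTTP itself can only serve byte ranges, so in practice you download the whole page and keep just the element you care about. A minimal sketch, assuming Python requests plus BeautifulSoup; the headline selector is an assumption, since the element name was lost from the question:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.nytimes.com/section/technology"

    # Byte-range requests work only if the server honours them, and you must
    # already know the offsets -- rarely practical for HTML documents.
    partial = requests.get(url, headers={"Range": "bytes=0-2047"})
    print(partial.status_code)  # 206 if the range was honoured, otherwise 200

    # The usual approach: fetch the full document, parse it, keep only the titles.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for headline in soup.select("h3"):          # assumed selector for article titles
        print(headline.get_text(strip=True))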

How to get hover data (AJAX) with any crawler in PHP

Submitted by ﹥>﹥吖頭↗ on 2019-12-06 03:01:56
I am crawling one website's data. I am able to get the whole content of a page, but some of the data only appears after hovering over certain icons and is shown as tooltips. I need that data as well. Is this possible with any crawler? I am using PHP and simplehtmldom for parsing/crawling the page. Hover data can't be obtained by crawlers directly. Crawlers fetch the web page and get the whole data (the HTML page source), i.e. the view you see as soon as you hit the URL. Hover data requires a mouse-move action over an HTML element on the page, i.e. a manual action, and currently no crawlers perform hover actions to fetch that data, as per my
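
Tooltip text is usually either already present in the fetched HTML (for example in a title or data-* attribute) or loaded on hover from an AJAX endpoint that you can call directly once you spot it in the browser's network tab. A sketch of both cases, written in Python for brevity (the same two requests can be made with PHP cURL); the attribute name and endpoint URL are hypothetical:

    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get("https://example.com/page").text, "html.parser")

    # Case 1: the tooltip text is already in the markup, only hidden until hover.
    for icon in soup.select("[data-tooltip]"):
        print(icon["data-tooltip"])

    # Case 2: hovering triggers an AJAX call; find its URL in the browser's
    # network tab and request that endpoint directly instead of simulating hover.
    print(requests.get("https://example.com/ajax/tooltip?id=123").json())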

HTML Snapshot for crawler - Understanding how it works

Submitted by 自作多情 on 2019-12-06 02:26:16
I'm reading this article today. To be honest, I'm really interested in point "2. Much of your content is created by a server-side technology such as PHP or ASP.NET". I want to check whether I have understood it correctly :) I create a PHP script (gethtmlsnapshot.php) where I include the server-side AJAX page (getdata.php) and I escape (for security) the parameters. Then I add it at the end of the static HTML page (index-movies.html). Right? Now... 1 - Where do I put that gethtmlsnapshot.php? In other words, I (or rather, the crawler) need to call that page. But if I don't have a link to it on the main page, the
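
The article is describing Google's old AJAX crawling scheme: when a crawler re-requests the page with an _escaped_fragment_ parameter, the server should answer with a pre-rendered HTML snapshot. A sketch of that front-controller idea, using Flask purely for illustration (the question's own files gethtmlsnapshot.php and getdata.php are PHP, and this is not that code):

    from flask import Flask, request, render_template

    app = Flask(__name__)

    @app.route("/index-movies.html")
    def index_movies():
        # Crawlers following the old AJAX crawling scheme re-request the page
        # with ?_escaped_fragment_=... and expect fully rendered HTML back.
        if "_escaped_fragment_" in request.args:
            return render_template("snapshot.html")      # server-side rendered snapshot
        return render_template("index-movies.html")      # normal page, content loaded via AJAX

    if __name__ == "__main__":
        app.run()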

Small preview when sharing a link on social media (Ruby on Rails)

Submitted by ≡放荡痞女 on 2019-12-06 01:37:50
I'm working on a site whose front end is in AngularJS and whose back end is in Ruby on Rails; the same Rails API is also used in an Android app. Now I have a situation: I need to share my web posts on social media such as Facebook, Twitter, and Google Plus, and along with the link to the single post there should be a small preview (a preview of the post that is crawled before posting, e.g. on Facebook). I did it using Angular plugins, but when it comes to the Android side, what they share and what displays on Facebook is the link only. Then I did some R&D and came to know that it must be done on the server side
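
The preview is built from Open Graph meta tags that Facebook's crawler reads from the server-rendered HTML of the shared URL, which is why tags injected client-side by an Angular plugin are invisible when the link is shared from the Android app. A sketch of rendering those tags on the server, using Flask for illustration only (the real back end is Rails, where the same og: tags would go in the post's layout/view; the route and fields below are assumptions):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/posts/<int:post_id>")
    def show_post(post_id):
        # Hypothetical post data; in the real app this would come from the database.
        title, excerpt, image = "Post title", "Short description", "https://example.com/img.jpg"
        return f"""<!doctype html>
    <html><head>
      <meta property="og:title" content="{title}">
      <meta property="og:description" content="{excerpt}">
      <meta property="og:image" content="{image}">
      <meta property="og:url" content="https://example.com/posts/{post_id}">
    </head><body>{excerpt}</body></html>"""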

Connection refused error when running Nutch 2

Submitted by 一个人想着一个人 on 2019-12-06 01:14:16
Question: I am trying to run the Nutch 2 crawler on my system, but I get the following error:

Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLTransientConnectionException: java.net.ConnectException: Connection refused
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:69
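
The "Connection refused" comes from Gora: Nutch 2.x stores its data through Gora in an external back end (an SQL store by default in early releases), and that store isn't running or isn't reachable. The usual fix is to start that store or point Nutch/Gora at one that is actually up, e.g. HBase as in the Nutch 2.x tutorial. A sketch of the two configuration entries involved, assuming the HBase back end (adjust to whatever store you actually run):

    # conf/gora.properties -- select the data store that is running (HBase here; a sketch)
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

    <!-- conf/nutch-site.xml -->
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
    </property>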

Prevent a Node.js program from exiting

Submitted by 你离开我真会死。 on 2019-12-06 00:26:08
Question: I am creating a Node.js-based crawler that uses the node-cron package, and I need to prevent the entry script from exiting, since the application should run forever as a cron and execute the crawlers at certain intervals, with logging. In a web application the server listens for connections and so keeps the process from terminating, but in a script that is not a server the program exits once all the code has executed and won't wait for the crons. Should I write a while(true) loop for that? What is the best practice in Node for this purpose?

Avoid bad requests due to relative URLs

Submitted by ぐ巨炮叔叔 on 2019-12-06 00:24:32
I am trying to crawl a website using Scrapy, and the URLs of every page I want to scrape are written using a relative path of this kind: <!-- on page https://www.domain-name.com/en/somelist.html (no <base> in the <head>) --> <a href="../../en/item-to-scrap.html">Link</a> Now, in my browser, these links work and take you to URLs like https://www.domain-name.com/en/item-to-scrap.html (despite the relative path going up two levels in the hierarchy instead of one). But my CrawlSpider does not manage to translate these URLs into "correct" ones, and all I get is errors of this kind: 2013-10-13 09:30
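
Browsers resolve the extra ../ by clamping at the site root, as RFC 3986 requires, and recent versions of Python's urljoin (and hence Scrapy's response.urljoin) do the same, so the quickest check is to resolve one of the offending links explicitly. A minimal sketch, assuming Python 3 and a reasonably recent Scrapy:

    from urllib.parse import urljoin

    base = "https://www.domain-name.com/en/somelist.html"
    href = "../../en/item-to-scrap.html"

    # RFC 3986 resolution drops the ".." segments that would climb above the root,
    # matching what the browser does with these links.
    print(urljoin(base, href))
    # -> https://www.domain-name.com/en/item-to-scrap.html

    # Inside a recent Scrapy callback the same resolution is available as:
    #   yield scrapy.Request(response.urljoin(href))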

Strange exceptions on production website from HTTP_USER_AGENT Java/1.6.0_17

Submitted by 六眼飞鱼酱① on 2019-12-05 23:48:30
Today we received some strange exceptions on our production website. They all have the following HTTP_USER_AGENT string: Java/1.6.0_17. I looked it up over at UserAgentString.com, but the info is quite useless. Here's one of the exceptions we're getting (they are all more or less the same): System.NotSupportedException: The given path's format is not supported. The path that is being queried: /klacht/Scripts/,data:c,complete:function(a,b,c){c=a.responseText,a.isResolved()&&(a.done(function(a){c=a}),i.html(g I have a feeling there is a problem with this bot or whatever is being used to

HtmlUnit: An invalid or illegal selector was specified

Submitted by 妖精的绣舞 on 2019-12-05 22:13:08
I am trying to simulate a login with HtmlUnit. Although I wrote my code according to the examples, I have run into an annoying problem. Below are some messages I have picked up from the console:

    runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: *:x).] sourceName=[http://user.mofangge.com/Scripts/inc/jquery-1.10.2.js] line=[1640] lineSource=[null] lineOffset=[0]
    WARNING: Obsolete content type encountered: 'application/x-javascript'.
    CSS error: 'http://user.mofangge.com/Content/Css/Style1/Main.css' [1:1] Error in style sheet. (Invalid