web-crawler

Crawler4j vs. Jsoup for crawling and parsing pages in Java

Submitted by 旧时模样 on 2019-12-09 09:35:45
Question: I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup. Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I'm not sure about is the difference between them. There is a similar question, which is marked as answered: Crawler4j is a crawler, Jsoup is a parser. But I just checked: Jsoup is also capable of crawling a page in addition to a …
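
For reference, a minimal sketch of the fetch-and-extract step both libraries cover, using Jsoup (assuming the jsoup jar is on the classpath; the URL and selector are placeholders):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFetchExample {
    public static void main(String[] args) throws Exception {
        // Fetch the page over HTTP (Jsoup handles the connection itself).
        Document doc = Jsoup.connect("https://example.com/")   // placeholder URL
                .userAgent("Mozilla/5.0 (example-bot)")
                .timeout(10_000)
                .get();
        // Extract a specific part of the page with a CSS selector.
        Element heading = doc.selectFirst("h1");
        System.out.println("Title:   " + doc.title());
        System.out.println("Heading: " + (heading == null ? "(none)" : heading.text()));
    }
}

What Crawler4j adds on top of this is crawl management: a URL frontier, politeness delays, and multithreaded fetching across many pages. Jsoup's "crawling" is limited to fetching one URL at a time, as above.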

Apache Nutch 2.1 different batch id (null)

Submitted by 烂漫一生 on 2019-12-09 06:09:34
Question: I crawl a few sites with Apache Nutch 2.1. While crawling, I see the following message on a lot of pages, e.g.: Skipping http://www.domainname.com/news/subcategory/111111/index.html; different batch id (null). What causes this error? How can I resolve this problem, given that the pages with a different batch id (null) are not stored in the database? The site that I crawled is based on Drupal, but I have tried many other non-Drupal sites as well.

Answer 1: I think the message is not a problem. The batch_id is not assigned to …

PyPI download counts seem unrealistic

Submitted by 家住魔仙堡 on 2019-12-09 04:13:34
Question: I put a package on PyPI for the first time about two months ago, and have made some version updates since then. This week I noticed the download count being recorded, and was surprised to see it had been downloaded hundreds of times. Over the next few days, I was more surprised to see the download count increasing by sometimes hundreds per day, even though this is a niche statistical test toolbox. In particular, older versions of the package are continuing to be downloaded, sometimes at higher rates than …

Is it possible to write a web crawler in JavaScript?

Submitted by 人盡茶涼 on 2019-12-09 04:08:37
Question: I want to crawl a page, check for the hyperlinks on it, follow those hyperlinks, and capture data from the resulting pages.

Answer 1: Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy. If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some …
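
Outside a browser, no Same-Origin restriction applies, so the follow-the-links loop the question describes is easy to write in any language with an HTTP client. A minimal sketch (in Java with Jsoup, to keep one language across this page's examples; the seed URL and page cap are placeholders):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class TinyCrawler {
    public static void main(String[] args) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.com/");                    // placeholder seed URL

        while (!frontier.isEmpty() && visited.size() < 50) {     // hard cap for the sketch
            String url = frontier.poll();
            if (!visited.add(url)) continue;                     // skip already-seen pages
            try {
                Document doc = Jsoup.connect(url).get();
                System.out.println(url + " -> " + doc.title());  // "capture data"
                for (Element link : doc.select("a[href]")) {     // follow hyperlinks
                    String next = link.absUrl("href");
                    if (next.startsWith("https://example.com/")) // stay on one site
                        frontier.add(next);
                }
            } catch (Exception e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }
}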

Scrape all of Google's search results based on certain criteria?

Submitted by 倖福魔咒の on 2019-12-09 03:26:34
Question: I am working on my mapper and I need to get the full map of newegg.com. I could try to scrape NE directly (which kind of violates NE's policies), but they have many products that are not available via direct NE search, only via a google.com search; and I need those links too. Here is the search string that returns 16 million results: https://www.google.com/search?as_q=&as_epq=.com%2FProduct%2FProduct.aspx%3FItem%3D&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=newegg.com&as …

Extracting site data through a web crawler outputs an error due to an array index mismatch

Submitted by 两盒软妹~` on 2019-12-09 02:49:30
Question: I have been trying to extract a site's table text, along with its links, from the given table (which is on site1.com) into my PHP page using a web crawler. But unfortunately, due to an incorrect array index in the PHP code, it produced an error as output.

site1.com:

<table border="0" cellpadding="0" cellspacing="0" width="100%" class="Table2">
  <tbody>
    <tr>
      <td width="1%" valign="top" class="Title2"> </td>
      <td width="65%" valign="top" class="Title2">Subject</td>
      <td width="1%" valign="top" class="Title2"> …
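
The PHP code itself is cut off above, so the exact bug isn't visible, but the usual cause of this kind of error is indexing a cell that some rows (such as the header or spacer rows in the markup above) don't have. A sketch of index-safe extraction against that markup, in Java with Jsoup for consistency with the other examples on this page (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableExtractExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://site1.com/").get();   // placeholder URL
        for (Element row : doc.select("table.Table2 tr")) {
            Elements cells = row.select("td");
            // Guard against header/spacer rows: only index cells that exist.
            if (cells.size() < 2) continue;
            Element subjectCell = cells.get(1);                    // "Subject" column
            Element link = subjectCell.selectFirst("a[href]");
            System.out.println(subjectCell.text()
                    + (link != null ? " -> " + link.absUrl("href") : ""));
        }
    }
}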

Does html5mode(true) affect Google search crawlers?

Submitted by 拥有回忆 on 2019-12-08 23:58:42
Question: I'm reading this specification, which is an agreement between web servers and search-engine crawlers that allows dynamically created content to be visible to crawlers. It states that in order for a crawler to index an HTML5 application, one must implement routing using #! in URLs. With Angular's html5mode(true) we get rid of this hashed part of the URL. I'm wondering whether this is going to prevent crawlers from indexing my website.

Answer 1: Short answer: no, html5mode will not mess up your …

How to send crawler data to PHP via command line?

Submitted by 半城伤御伤魂 on 2019-12-08 19:02:27
Can I send the results to PHP rather than having them stored in the JSON file? I have these two files. settings.json:

{
  "outputFile" : "C:\\wamp\\www\\drestip\\admin\\crawls\\mimshoes.json",
  "logFile" : "C:\\wamp\\www\\drestip\\admin\\crawls\\mimshoes.tsv",
  "pause" : 1,
  "local" : false,
  "connections" : 3,
  "cookiesEnabled" : false,
  "robotsDisabled" : false,
  "advancedMode" : true,
  "crawlTemplate" : [ "www.mimshoes.com/" ],
  "startUrls" : [ PAGES ],
  "maxDepth" : 10,
  "dataTemplate" : [ "www.mimshoes.com/{alpha}-{alpha}_{alpha}-{alpha}$" ],
  "destination" : "JSON",
  "connectorGuid" : …

Google crawling with cookies

Submitted by 烈酒焚心 on 2019-12-08 18:04:24
The content of my site depends on cookies in the request, and when Google's crawler bot visits my site it doesn't index much content, because it doesn't have the specific cookies in each of its requests. Is it possible to set up some rule so that when the crawler bot is crawling my site, it uses the specific cookies?

Answer 1: Googlebot does not honor cookies on purpose: it has to "see" what anybody else will see on your website, the "smallest common denominator" if you will; otherwise search results would be meaningless to an unknown number of searchers. Please google for "Googlebot cookies" to get pointed …

Testing a website using C# [closed]

Submitted by ▼魔方 西西 on 2019-12-08 16:58:30
Folks, I need to accomplish some sophisticated web crawling. The goal, in simple words: log in to a page, enter some values in some text fields, click Submit, then extract some values from the retrieved page. What is the best approach? Some unit-testing third-party lib? Manual crawling in C#? Maybe there is a ready-made lib specifically for this? Any other approach? This needs to be done …
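
The question asks for C#, where an HTTP client plus an HTML parser such as HtmlAgilityPack (or a browser-automation library like Selenium) covers this flow. Purely to illustrate the login, submit, and extract sequence in the same language as the sketches above, here it is with Jsoup; every URL, field name, and selector below is hypothetical:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.HashMap;
import java.util.Map;

public class LoginAndExtractExample {
    public static void main(String[] args) throws Exception {
        // 1. GET the login page to pick up the session cookies.
        Connection.Response loginPage = Jsoup.connect("https://example.com/login") // hypothetical URL
                .method(Connection.Method.GET)
                .execute();
        Map<String, String> cookies = new HashMap<>(loginPage.cookies());

        // 2. POST the form fields, carrying the cookies forward.
        Connection.Response loggedIn = Jsoup.connect("https://example.com/login")  // hypothetical action URL
                .data("username", "me")                                            // hypothetical field names
                .data("password", "secret")
                .cookies(cookies)
                .method(Connection.Method.POST)
                .execute();
        cookies.putAll(loggedIn.cookies());

        // 3. Fetch the page behind the login and extract the wanted values.
        Document page = Jsoup.connect("https://example.com/account")               // hypothetical URL
                .cookies(cookies)
                .get();
        Element value = page.selectFirst(".balance");                              // hypothetical selector
        System.out.println(value == null ? "(not found)" : value.text());
    }
}

The same three steps map one-to-one onto C#: HttpClient with a CookieContainer for steps 1 and 2, and HtmlAgilityPack (or Selenium, if the site needs JavaScript) for step 3.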