web-crawler

Why do I have different document counts in status and index?

那年仲夏 · submitted on 2019-12-02 08:50:51
So I'm following the StormCrawler + Elasticsearch tutorial and playing around with it. When I search in Kibana, I've noticed that the number of hits for the 'status' index is far greater than for 'index'. Example: on the top left you can see there are 846 hits for the 'status' index, which I assume means it has crawled through 846 pages. With the 'index' index, only 31 hits are shown. I understand that 'index' and 'status' are functionally different, since 'status' is only responsible for the link metadata. The problem is that StormCrawler seems to be parsing through many pages and …

How to block a bot that is excessively visiting my site?

心已入冬 · submitted on 2019-12-02 08:06:02
Question: This bot doesn't respect nofollow/noindex or the rules in robots.txt. I have this in robots.txt:

    User-agent: Msnbot
    Disallow: /
    User-Agent: Msnbot/2.0b
    Disallow: /

Until now it was pretty slow, but now it is a monster that won't leave my site at all. It crawls all of WordPress and MyBB 24/7. Should I block IP ranges, or what else can I do to stop these content stealers?

Answer 1: Based on "Block by useragent or empty referer", you could add something like this to your .htaccess:

    Options +FollowSymlinks
    RewriteEngine On
    …
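The pasted answer is cut off right after RewriteEngine On. A minimal sketch of how such a user-agent block is usually completed (the msnbot pattern comes from the bot named in the question, not from the original answer):

    Options +FollowSymlinks
    RewriteEngine On
    # Return 403 Forbidden to any client whose User-Agent contains "msnbot"
    RewriteCond %{HTTP_USER_AGENT} msnbot [NC]
    RewriteRule .* - [F,L]

Note that a crawler ignoring robots.txt may well be a scraper spoofing the Msnbot user-agent, in which case blocking by IP range or rate limiting is the more reliable option.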

How to avoid duplicate download URLs in my Python spider program?

﹥>﹥吖頭↗ · submitted on 2019-12-02 07:37:10
Question: I wrote a spider program in Python that can recursively crawl web pages. I want to avoid downloading the same page twice, so I store the URLs in a list, like this:

    urls = []

    def download(mainPage):  # mainPage is a link
        global urls
        links = getHrefLinks(mainPage)
        for l in links:
            if l not in urls:
                urls.append(l)
                downPage(l)

But there is a problem: when there are many links, the urls list becomes very large and the membership test if l not in urls is slow. How can I solve this? What is the …
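A common fix is to keep the seen URLs in a set, whose membership test is O(1) on average instead of the O(n) scan a list needs. A minimal sketch along the lines of the question's code (getHrefLinks and downPage are the question's own, undefined helpers and are assumed to exist):

    urls = set()  # URLs already seen

    def download(mainPage):  # mainPage is a link
        for l in getHrefLinks(mainPage):
            if l not in urls:  # constant-time lookup instead of scanning a list
                urls.add(l)
                downPage(l)

If the discovery order also matters (for example, for a breadth-first crawl), keep the set for the membership test and a separate list or collections.deque as the work queue.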

Does Facebook crawler currently interpret javascript before parsing the DOM?

爱⌒轻易说出口 · submitted on 2019-12-02 07:19:47
The following link seems to say that it can't: "How does Facebook Sharer select Images and other metadata when sharing my URL?" But I wanted to know whether that is still the case today (the documentation on the Facebook developer site doesn't say anything about this point). In the tests I've run I've never seen it interpret the JS, but that might be contextual / domain-specific (who knows). To test your specific case, use the Facebook linter: https://developers.facebook.com/tools/debug (log into FB first). That's the only way to be 100% sure how FB will parse your page (what properties it …

How to find URLs in HTML using Java

萝らか妹 · submitted on 2019-12-02 07:15:13
I have the following... I wouldn't say problem, but situation. I have some HTML with tags and everything, and I want to search that HTML for every URL. I'm doing it now by checking where it says 'h', then 't', then 't', then 'p', but I don't think that is a great solution. Any good ideas? Added: I'm looking for some kind of pseudocode, but just in case, I'm using Java for this particular project. Answer: Try using an HTML parsing library, then search for <a> tags in the document:

    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    Elements links = doc.select("a[href]"); // a with href, not all …

Scrapy view returns a blank page

╄→гoц情女王★ · submitted on 2019-12-02 05:27:55
I'm new to Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/. When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page, but if I download the page and run the command against the local file, it displays as usual. Why is this happening? Answer: Pretend to be a real browser by providing a User-Agent header:

    scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"

Worked for me. Note that the -s option here overrides the built-in USER_AGENT setting …
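If the override should apply to the whole project rather than to a single scrapy view call, the same value can go into the project's settings.py (a sketch; the user-agent string is just the example browser UA from the answer):

    # settings.py of the Scrapy project
    USER_AGENT = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"
    )

Scrapy then sends that header on every request made by the project's spiders.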

Is it possible to crawl ASP.NET pages?

荒凉一梦 · submitted on 2019-12-02 05:06:18
Question: Is there a way to crawl ASP.NET pages that use doPostBack for their event handling? Example: Page1.aspx contains a LinkButton that redirects to Page2.aspx. The code-behind for the LinkButton Click event is Response.Redirect("Page2.aspx"), and on the client side this code is generated for the click event: doPostBack(... Is it possible to crawl such pages using only HttpWebRequest? I know that using Response.Redirect is not a good idea in this case, but I don't have a choice. Answer 1: Yes, it's possible if the code follows a well …
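For the mechanics behind that answer: doPostBack is just JavaScript that submits the page's form with two extra fields, so a crawler can reproduce the click by re-posting the hidden WebForms fields (__VIEWSTATE, __EVENTVALIDATION, and friends) together with __EVENTTARGET set to the clicked control. A rough sketch in Python with requests and BeautifulSoup (not the HttpWebRequest code the asker wants, and the control ID LinkButton1 is hypothetical):

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    page = session.get("http://example.com/Page1.aspx")
    soup = BeautifulSoup(page.text, "html.parser")

    # Echo back every hidden WebForms field (__VIEWSTATE, __EVENTVALIDATION, ...)
    form_data = {inp["name"]: inp.get("value", "")
                 for inp in soup.select("input[type=hidden]") if inp.get("name")}

    # These are the two fields doPostBack() would fill in on the client
    form_data["__EVENTTARGET"] = "LinkButton1"  # hypothetical control ID
    form_data["__EVENTARGUMENT"] = ""

    # The server handles the postback and redirects; requests follows the redirect
    response = session.post("http://example.com/Page1.aspx", data=form_data)
    print(response.url)  # ends in Page2.aspx if the redirect was followed

The same sequence translates directly to HttpWebRequest: GET the page, collect the hidden inputs, and POST them back with the event fields set.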

I can't get the whole source code of an HTML page

一世执手 · submitted on 2019-12-02 04:31:52
Question: Using Python, I want to crawl data from a web page whose source is quite big (it is the Facebook page of some user). Say url is the URL I am trying to crawl. I run the following code:

    import urllib2
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()

data is supposed to contain the source of the page I am crawling, but for some reason it doesn't contain all the characters that are available when I compare it directly with the source of the page. I don't know what I am doing wrong. I …
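Two common causes, neither confirmed by the truncated question: the server may serve a reduced page to clients that don't look like a browser, and anything Facebook builds with JavaScript after page load will never appear in a plain HTTP fetch. A sketch of the first check, reusing the question's url variable and Python 2's urllib2:

    import urllib2

    # Send a browser-like User-Agent; some servers return a stripped-down page otherwise
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    usock = urllib2.urlopen(request)
    data = usock.read()  # read() with no size argument returns the whole body
    usock.close()

If the missing content is generated client-side, the usual options are a browser-driving tool such as Selenium or, for Facebook specifically, the Graph API rather than HTML scraping.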

Web-scraping across multiple pages without even knowing the last page number

北战南征 · submitted on 2019-12-02 04:03:08
Running my code against a site to crawl the titles of different tutorials spread across several pages, I found it working flawlessly. I tried to write the code so that it does not depend on the last page number in the URL but instead keeps requesting pages until the status code shows http.status <> 200. The code I'm pasting below works impeccably in this case. However, trouble comes up when I try another URL to see whether the loop breaks automatically: the code fetches all the results but never breaks. What is the workaround here, so that the code breaks when it is done and stops the macro? Here is …
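The macro in the question is VBA, but the stopping pattern itself is language-independent: request page N, and leave the loop as soon as the response status is no longer 200 or the page yields no new items (some sites keep answering 200 with an empty or repeated last page, which is exactly the case where a status check alone never breaks). A sketch of that pattern in Python with requests and BeautifulSoup (the URL template and the h2.title selector are placeholders, not the asker's site):

    import requests
    from bs4 import BeautifulSoup

    page = 1
    while True:
        resp = requests.get("http://example.com/tutorials?page={}".format(page))
        if resp.status_code != 200:  # the server signals there is no such page
            break
        titles = [h.get_text(strip=True)
                  for h in BeautifulSoup(resp.text, "html.parser").select("h2.title")]
        if not titles:               # 200 but nothing new: treat it as the end
            break
        print(page, titles)
        page += 1

The second check (no new items) is the usual workaround when the status code alone never changes.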

JSOUP - How to crawl a “login required” page using JSOUP

拜拜、爱过 · submitted on 2019-12-02 03:44:20
Question: I'm having trouble crawling a certain website. The problem is that after successfully logging in to that website, I can't access a link which requires a valid login. For example:

    public Document executeLogin(String user, String password) {
        try {
            Connection.Response loginForm = Jsoup.connect(url)
                .method(Connection.Method.GET)
                .execute();
            Document mainPage = Jsoup.connect(login-validation-url)
                .data("user", user)
                .data("senha", password)
                .cookies(loginForm.cookies())
                .post();
    …