web-crawler

Why do I have different document counts in status and index?

那年仲夏 · submitted on 2019-12-02 08:50:51
So I'm following the StormCrawler + Elasticsearch tutorial and playing around with it. When I search in Kibana, I've noticed that the number of hits for the 'status' index is far greater than for 'index'. Example: on the top left you can see there are 846 hits for the 'status' index, which I assume means it has crawled through 846 pages. With the 'index' index, only 31 hits are shown. I understand that 'index' and 'status' are functionally different, since 'status' is only responsible for the link metadata. The problem is that StormCrawler seems to be parsing through many pages and …

How to block a bot that is excessively visiting my site?

心已入冬 · submitted on 2019-12-02 08:06:02
Question: This bot doesn't respect nofollow/noindex or the rules in robots.txt. I have this in robots.txt:

    User-agent: Msnbot
    Disallow: /
    User-Agent: Msnbot/2.0b
    Disallow: /

Until now it was pretty slow, but now it is a monster that won't leave my site at all. It crawls all of WordPress and MyBB 24/7. Should I block IP ranges, or what else can I do to stop these content stealers?

Answer 1: Based on "Block by useragent or empty referer", you could add something like this to your .htaccess:

    Options +FollowSymlinks
    RewriteEngine On
    …
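The pasted answer is cut off right after RewriteEngine On. A minimal sketch of how such a user-agent block is usually completed (the msnbot pattern comes from the bot named in the question, not from the original answer):

    Options +FollowSymlinks
    RewriteEngine On
    # Return 403 Forbidden to any client whose User-Agent contains "msnbot"
    RewriteCond %{HTTP_USER_AGENT} msnbot [NC]
    RewriteRule .* - [F,L]

Note that a crawler ignoring robots.txt may well be a scraper spoofing the Msnbot user-agent, in which case blocking by IP range or rate limiting is the more reliable option.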

How to avoid duplicate download URLs in my Python spider program?

﹥>﹥吖頭↗ · submitted on 2019-12-02 07:37:10
Question: I wrote a spider program in Python that can recursively crawl web pages. I want to avoid downloading the same page twice, so I store the URLs in a list, like this:

    urls = []

    def download(mainPage):  # mainPage is a link
        global urls
        links = getHrefLinks(mainPage)
        for l in links:
            if l not in urls:
                urls.append(l)
                downPage(l)

But there is a problem: when there are many links, the urls list becomes very large and the membership test if l not in urls is slow. How can I solve this? What is the …
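A common fix is to keep the seen URLs in a set, whose membership test is O(1) on average instead of the O(n) scan a list needs. A minimal sketch along the lines of the question's code (getHrefLinks and downPage are the question's own, undefined helpers and are assumed to exist):

    urls = set()  # URLs already seen

    def download(mainPage):  # mainPage is a link
        for l in getHrefLinks(mainPage):
            if l not in urls:  # constant-time lookup instead of scanning a list
                urls.add(l)
                downPage(l)

If the discovery order also matters (for example, for a breadth-first crawl), keep the set for the membership test and a separate list or collections.deque as the work queue.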

Does Facebook crawler currently interpret javascript before parsing the DOM?

爱⌒轻易说出口 · submitted on 2019-12-02 07:19:47
The following link seems to say that it can't: "How does Facebook Sharer select Images and other metadata when sharing my URL?" But I wanted to know whether that is still the case today (the documentation on the Facebook developer site doesn't say anything about this point). In the tests I've run I've never seen it interpret the JS, but that might be contextual / domain-specific (who knows). To test your specific case, use the Facebook linter: https://developers.facebook.com/tools/debug (log into FB first). That's the only way to be 100% sure how FB will parse your page (what properties it …

How to find URLs in HTML using Java

萝らか妹 · submitted on 2019-12-02 07:15:13
I have the following... I wouldn't say problem, but situation. I have some HTML with tags and everything, and I want to search that HTML for every URL. I'm doing it now by checking where it says 'h', then 't', then 't', then 'p', but I don't think that is a great solution. Any good ideas? Added: I'm looking for some kind of pseudocode, but just in case, I'm using Java for this particular project. Answer: Try using an HTML parsing library, then search for <a> tags in the document:

    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    Elements links = doc.select("a[href]"); // a with href, not all …

Scrapy view returns a blank page

╄→гoц情女王★ · submitted on 2019-12-02 05:27:55
I'm new to Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/. When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page, but if I download the page and run the command against the local file, it displays as usual. Why is this happening? Answer: Pretend to be a real browser by providing a User-Agent header:

    scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"

Worked for me. Note that the -s option here overrides the built-in USER_AGENT setting …
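If the override should apply to the whole project rather than to a single scrapy view call, the same value can go into the project's settings.py (a sketch; the user-agent string is just the example browser UA from the answer):

    # settings.py of the Scrapy project
    USER_AGENT = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"
    )

Scrapy then sends that header on every request made by the project's spiders.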

Is it possible to crawl ASP.NET pages?

荒凉一梦 · submitted on 2019-12-02 05:06:18
Question: Is there a way to crawl ASP.NET pages that use doPostBack for their event handling? Example: Page1.aspx contains a LinkButton that redirects to Page2.aspx. The code-behind for the LinkButton Click event is Response.Redirect("Page2.aspx"), and on the client side this code is generated for the click event: doPostBack(... Is it possible to crawl such pages using only HttpWebRequest? I know that using Response.Redirect is not a good idea in this case, but I don't have a choice. Answer 1: Yes, it's possible if the code follows a well …
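For the mechanics behind that answer: doPostBack is just JavaScript that submits the page's form with two extra fields, so a crawler can reproduce the click by re-posting the hidden WebForms fields (__VIEWSTATE, __EVENTVALIDATION, and friends) together with __EVENTTARGET set to the clicked control. A rough sketch in Python with requests and BeautifulSoup (not the HttpWebRequest code the asker wants, and the control ID LinkButton1 is hypothetical):

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    page = session.get("http://example.com/Page1.aspx")
    soup = BeautifulSoup(page.text, "html.parser")

    # Echo back every hidden WebForms field (__VIEWSTATE, __EVENTVALIDATION, ...)
    form_data = {inp["name"]: inp.get("value", "")
                 for inp in soup.select("input[type=hidden]") if inp.get("name")}

    # These are the two fields doPostBack() would fill in on the client
    form_data["__EVENTTARGET"] = "LinkButton1"  # hypothetical control ID
    form_data["__EVENTARGUMENT"] = ""

    # The server handles the postback and redirects; requests follows the redirect
    response = session.post("http://example.com/Page1.aspx", data=form_data)
    print(response.url)  # ends in Page2.aspx if the redirect was followed

The same sequence translates directly to HttpWebRequest: GET the page, collect the hidden inputs, and POST them back with the event fields set.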

I can't get the whole source code of an HTML page

一世执手 · submitted on 2019-12-02 04:31:52
Question: Using Python, I want to crawl data from a web page whose source is quite big (it is the Facebook page of some user). Say url is the URL I am trying to crawl. I run the following code:

    import urllib2
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()

data is supposed to contain the source of the page I am crawling, but for some reason it doesn't contain all the characters that are available when I compare it directly with the source of the page. I don't know what I am doing wrong. I …
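Two common causes, neither confirmed by the truncated question: the server may serve a reduced page to clients that don't look like a browser, and anything Facebook builds with JavaScript after page load will never appear in a plain HTTP fetch. A sketch of the first check, reusing the question's url variable and Python 2's urllib2:

    import urllib2

    # Send a browser-like User-Agent; some servers return a stripped-down page otherwise
    request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    usock = urllib2.urlopen(request)
    data = usock.read()  # read() with no size argument returns the whole body
    usock.close()

If the missing content is generated client-side, the usual options are a browser-driving tool such as Selenium or, for Facebook specifically, the Graph API rather than HTML scraping.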

Web-scraping across multiple pages without even knowing the last page number

北战南征 · submitted on 2019-12-02 04:03:08
Running my code against a site to crawl the titles of different tutorials spread across several pages, I found it working flawlessly. I tried to write the code so that it does not depend on the last page number in the URL but instead keeps requesting pages until the status code shows http.status <> 200. The code I'm pasting below works impeccably in this case. However, trouble comes up when I try another URL to see whether the loop breaks automatically: the code fetches all the results but never breaks. What is the workaround here, so that the code breaks when it is done and stops the macro? Here is …
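The macro in the question is VBA, but the stopping pattern itself is language-independent: request page N, and leave the loop as soon as the response status is no longer 200 or the page yields no new items (some sites keep answering 200 with an empty or repeated last page, which is exactly the case where a status check alone never breaks). A sketch of that pattern in Python with requests and BeautifulSoup (the URL template and the h2.title selector are placeholders, not the asker's site):

    import requests
    from bs4 import BeautifulSoup

    page = 1
    while True:
        resp = requests.get("http://example.com/tutorials?page={}".format(page))
        if resp.status_code != 200:  # the server signals there is no such page
            break
        titles = [h.get_text(strip=True)
                  for h in BeautifulSoup(resp.text, "html.parser").select("h2.title")]
        if not titles:               # 200 but nothing new: treat it as the end
            break
        print(page, titles)
        page += 1

The second check (no new items) is the usual workaround when the status code alone never changes.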

JSOUP - How to crawl a “login required” page using JSOUP

拜拜、爱过 · submitted on 2019-12-02 03:44:20
Question: I'm having trouble crawling a certain website. The problem is that after successfully logging in to that website, I can't access a link which requires a valid login. For example:

    public Document executeLogin(String user, String password) {
        try {
            Connection.Response loginForm = Jsoup.connect(url)
                .method(Connection.Method.GET)
                .execute();
            Document mainPage = Jsoup.connect(login-validation-url)
                .data("user", user)
                .data("senha", password)
                .cookies(loginForm.cookies())
                .post();
    …