web-crawler

Ruby on Rails: How to determine if a request was made by a robot or search engine spider?

Submitted by 强颜欢笑 on 2019-11-28 19:36:47
I have a Rails app that records an IP address for every request to a specific URL, but in my IP database I've found Facebook IP blocks like 66.220.15.* and Google IPs (I suspect these come from bots). Is there any way to determine whether a request from a given IP was made by a robot or a search-engine spider? Thanks.

Robots are required (by common sense / courtesy more than any kind of law) to send along a User-Agent with their request. You can check for this using request.env["HTTP_USER_AGENT"] and filter as you please. Since the well-behaved bots at least typically include a reference URI in the UA string, they …
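
The same check can be sketched outside Rails; here it is in Python, used for consistency with the other snippets on this page. The list of bot-name substrings is an assumption for illustration rather than an exhaustive list, and in a Rails app the input would simply be the value of request.env["HTTP_USER_AGENT"].

    import re

    # Common substrings seen in crawler user-agent strings (illustrative, not exhaustive).
    BOT_PATTERN = re.compile(
        r"googlebot|bingbot|baiduspider|facebookexternalhit|bot|crawler|spider",
        re.IGNORECASE,
    )

    def looks_like_bot(user_agent):
        """Return True if the User-Agent header appears to belong to a crawler."""
        if not user_agent:
            # Many crawlers (and plenty of scripts) send no User-Agent at all.
            return True
        return bool(BOT_PATTERN.search(user_agent))

    print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
    print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"))              # False

Note that user agents are trivially spoofed, so this only filters the crawlers that identify themselves honestly.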

.NET Does NOT Have Reliable Asynchronous Socket Communication?

Submitted by a 夏天 on 2019-11-28 18:59:34
I once wrote a crawler in .NET. In order to improve its scalability, I tried to take advantage of .NET's asynchronous API. System.Net.HttpWebRequest has the asynchronous pair BeginGetResponse/EndGetResponse. However, this pair only gets the HTTP response headers and a Stream instance from which we can extract the HTTP response content. So my strategy is to use BeginGetResponse/EndGetResponse to asynchronously get the response Stream, then use BeginRead/EndRead to asynchronously read bytes from that Stream instance. Everything seems perfect until the crawler goes into a stress test. …
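
Setting the .NET specifics aside, the pattern described is: fetch the response headers first, then pull the body off the stream in chunks so that many requests can be in flight at once. A rough analogue, sketched in Python with a thread pool instead of the Begin/End asynchronous calls (URLs, worker count, and chunk size are arbitrary):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url, chunk_size=8192):
        """Open the response (headers arrive first), then read the body in chunks,
        mirroring the BeginGetResponse / BeginRead split described above."""
        with urllib.request.urlopen(url, timeout=10) as response:
            body = bytearray()
            while True:
                chunk = response.read(chunk_size)  # analogous to reading the response Stream
                if not chunk:
                    break
                body.extend(chunk)
            return url, response.status, len(body)

    if __name__ == "__main__":
        urls = ["http://example.com/", "http://example.org/"]
        # Overlap many requests so one slow server does not stall the rest.
        with ThreadPoolExecutor(max_workers=8) as pool:
            for url, status, size in pool.map(fetch, urls):
                print(url, status, size)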

Scrapy - how to stop Redirect (302)

Submitted by 为君一笑 on 2019-11-28 18:50:46
I'm trying to crawl a URL using Scrapy, but it redirects me to a page that doesn't exist:

Redirecting (302) to <GET http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197> from <GET http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx>

The problem is that http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx exists, but http://www.shop.inonit.in/mobile/Products …
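
One common way to keep Scrapy from following such a redirect is to mark the request so the redirect middleware leaves it alone and the 302 response is delivered to your callback. A minimal sketch (the spider name is a placeholder; the URL is the one from the question):

    import scrapy

    class NoRedirectSpider(scrapy.Spider):
        name = "no_redirect"

        def start_requests(self):
            url = ("http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/"
                   "Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx")
            # dont_redirect tells RedirectMiddleware to leave the 302 alone;
            # handle_httpstatus_list lets the 302 response reach the callback
            # instead of being dropped as an error.
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={"dont_redirect": True, "handle_httpstatus_list": [302]},
            )

        def parse(self, response):
            self.logger.info("Got %s for %s", response.status, response.url)

Alternatively, REDIRECT_ENABLED = False in settings.py turns off redirect handling for the whole crawl rather than per request.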

Should I create pipeline to save files with scrapy?

Submitted by 点点圈 on 2019-11-28 17:38:27
I need to save a file (.pdf), but I'm unsure how to do it. I need to save .pdfs and store them in such a way that they are organized in directories much like they are stored on the site I'm scraping them from. From what I can gather I need to make a pipeline, but from what I understand pipelines save "Items", and "items" are just basic data like strings and numbers. Is saving files a proper use of pipelines, or should I save the file in the spider instead?

Yes and no[1]. If you fetch a PDF it will be stored in memory, but as long as the PDFs are not big enough to fill up your available memory, that is OK. You …
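
To make the pipeline route concrete, here is a minimal sketch of a custom item and pipeline that write each downloaded PDF to disk under a directory tree mirroring the site. The item fields, class names, and the FILES_ROOT setting are assumptions for illustration; Scrapy's built-in FilesPipeline covers the common case with less code.

    import os
    import scrapy

    class PdfItem(scrapy.Item):
        path = scrapy.Field()  # relative path mirroring the site's directory layout
        body = scrapy.Field()  # raw bytes of the downloaded PDF

    class SavePdfPipeline:
        """Writes each PDF to disk under FILES_ROOT, preserving the site's structure."""

        def __init__(self, files_root):
            self.files_root = files_root

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.get("FILES_ROOT", "downloads"))

        def process_item(self, item, spider):
            target = os.path.join(self.files_root, item["path"])
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            with open(target, "wb") as f:
                f.write(item["body"])
            return item

The spider would yield PdfItem(path=..., body=response.body) from the callback that receives each PDF response, and the pipeline would be enabled via ITEM_PIPELINES in settings.py.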

Submit data via web form and extract the results

Submitted by 风格不统一 on 2019-11-28 17:38:12
My Python level is novice. I have never written a web scraper or crawler. I have written Python code to connect to an API and extract the data that I want, but for some of the extracted data I want to determine the gender of the author. I found this web site, http://bookblog.net/gender/genie.php, but the downside is that there isn't an API available. I was wondering how to write a Python script to submit data to the form on that page and extract the returned data. It would be a great help if I could get some guidance on this. This is the form DOM:

    <form action="analysis.php" method="POST">
    <textarea cols="75" rows="13" …
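
A minimal sketch of posting to that form with the standard library. The field names below are assumptions, since the excerpt cuts off before the textarea's name attribute; inspect the live form and adjust them accordingly.

    from urllib.parse import urlencode, urljoin
    from urllib.request import Request, urlopen

    FORM_PAGE = "http://bookblog.net/gender/genie.php"
    # The form posts to analysis.php relative to the page above.
    ACTION_URL = urljoin(FORM_PAGE, "analysis.php")

    # Hypothetical field names; check the real <textarea name="..."> and any other inputs.
    fields = {
        "text": "Sample passage whose author's gender the Gender Genie should guess.",
        "genre": "fiction",
    }

    data = urlencode(fields).encode("utf-8")
    request = Request(ACTION_URL, data=data)  # supplying data makes this a POST, matching method="POST"
    with urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")

    print(html[:500])  # the result page; parse it with html.parser, BeautifulSoup, etc.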

Is there CURRENTLY any way to fetch Instagram user media without authentication?

Submitted by 佐手、 on 2019-11-28 17:18:34
Question: Until recently there were several ways to retrieve Instagram user media without the need for API authentication, but apparently the website has stopped all of them. Some of the old methods:

https://api.instagram.com/v1/users/user-id/media/recent/
https://www.instagram.com/user-id/media
https://www.instagram.com/user-id/?__a=1

And some old related questions are: "How can I get a user's media from Instagram without authenticating as a user?" and "Is there still a way to fetch instagram feed without using …"
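
For reference, this is roughly how the last of the old endpoints listed above (?__a=1) used to be queried. As the question itself notes, these unauthenticated methods have largely been shut down, so expect a login redirect or an error rather than JSON; the username is a placeholder.

    import json
    import urllib.request

    USERNAME = "instagram"  # any public account name
    URL = f"https://www.instagram.com/{USERNAME}/?__a=1"  # one of the old endpoints listed above

    request = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            data = json.loads(response.read().decode("utf-8"))
            print(json.dumps(data, indent=2)[:500])
    except Exception as exc:
        # Instagram now usually answers unauthenticated requests with a login page,
        # a 429, or an error instead of the old JSON payload.
        print("Request failed or did not return JSON:", exc)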

How do I use the Python Scrapy module to list all the URLs from my website?

Submitted by 烈酒焚心 on 2019-11-28 17:13:15
Question: I want to use the Python Scrapy module to scrape all the URLs from my website and write the list to a file. I looked in the examples but didn't see any simple example of how to do this.

Answer 1: Here's the Python program that worked for me:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    DOMAIN = 'example.com'
    URL = 'http://%s' % DOMAIN

    class MySpider(BaseSpider):
        name = DOMAIN
        allowed_domains = [DOMAIN]
        start_urls = [URL]

        def parse …
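
The excerpt is cut off, and BaseSpider and HtmlXPathSelector have since been removed from Scrapy. Here is a completed sketch of the same idea using current class names (scrapy.Spider, response.css, response.follow); the domain is the placeholder from the answer above.

    import scrapy

    DOMAIN = "example.com"
    URL = "http://%s" % DOMAIN

    class UrlListSpider(scrapy.Spider):
        """Visits every same-domain page it can reach and prints each URL."""
        name = DOMAIN
        allowed_domains = [DOMAIN]
        start_urls = [URL]

        def parse(self, response):
            # Record the page we just fetched.
            print(response.url)
            # Follow every link on the page; allowed_domains keeps the crawl on-site,
            # and Scrapy's duplicate filter stops the spider from revisiting URLs.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Running it with scrapy runspider spider_file.py > urls.txt writes the list to a file, since Scrapy's own log goes to stderr.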

Detect Search Crawlers via JavaScript

Submitted by 匆匆过客 on 2019-11-28 17:11:38
I am wondering how I would go about detecting search crawlers. The reason I ask is that I want to suppress certain JavaScript calls if the user agent is a bot. I have found an example of how to detect a certain browser, but I am unable to find examples of how to detect a search crawler:

    /MSIE (\d+\.\d+);/.test(navigator.userAgent); // test for MSIE x.x

Examples of search crawlers I want to block:

Google
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
Googlebot/2.1 (+http://www.google.com/bot.html)
Baidu …

Following links, Scrapy web crawler framework

Submitted by 夙愿已清 on 2019-11-28 17:07:10
After several readings of the Scrapy docs I'm still not catching the difference between using CrawlSpider rules and implementing my own link-extraction mechanism in the callback method. I'm about to write a new web crawler using the latter approach, but only because I had a bad experience in a past project using rules. I'd really like to know exactly what I'm doing and why. Anyone familiar with this tool? Thanks for your help!

CrawlSpider inherits from BaseSpider. It just adds rules to extract and follow links. If these rules are not flexible enough for you, use BaseSpider:

    class USpider(BaseSpider): …
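
To make the contrast concrete, here is a side-by-side sketch using current class names (CrawlSpider and scrapy.Spider have replaced the BaseSpider shown in the excerpt); the domain and the /items/ pattern are placeholders.

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    # Rule-based approach: CrawlSpider follows matching links for you.
    # CrawlSpider reserves parse() for its own machinery, so the callback
    # must have a different name.
    class RuleSpider(CrawlSpider):
        name = "rule_spider"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]
        rules = (
            Rule(LinkExtractor(allow=r"/items/"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    # Manual approach: a plain Spider where you decide in parse() which links to follow.
    class ManualSpider(scrapy.Spider):
        name = "manual_spider"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}
            for href in response.css("a::attr(href)").getall():
                if "/items/" in href:  # your own link-selection logic goes here
                    yield response.follow(href, callback=self.parse)

The rule-based spider delegates link following to LinkExtractor; the manual spider keeps that decision in parse(), which is exactly the flexibility trade-off the question is about.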

What's a good Web Crawler tool [closed]

Submitted by 心不动则不痛 on 2019-11-28 16:35:09
I need to index a whole lot of webpages; what good web-crawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper. What I really need is something I can give a site URL to, and it will follow every link and store the content for indexing.

HTTrack -- http://www.httrack.com/ -- is a very good website copier. It works pretty well; I have been using it for a long time.

Nutch is a web crawler (a crawler is the type of program you're looking for) -- http://lucene.apache.org/nutch/ -- and it is built on Lucene, a top-notch search library.

Crawler4j is an open …