web-crawler

Websites that are particularly challenging to crawl and scrape? [closed]

萝らか妹 submitted on 2019-12-03 12:35:16
I'm interested in public-facing sites (nothing behind a login / authentication) that have things like:
- heavy use of internal 301 and 302 redirects
- anti-scraping measures (but not banning crawlers via robots.txt)
- non-semantic or invalid mark-up
- content loaded via AJAX, in the form of onclicks or infinite scrolling
- lots of parameters used in URLs
- canonical problems
- convoluted internal link structure
and anything else that generally makes crawling a website a headache! I have built a crawler / spider that performs a range of analysis on a website, and I'm on the lookout for sites that will make it…
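A quick way to see one of these headaches in practice is to inspect redirect chains; a minimal sketch, assuming the Python requests library and a hypothetical URL (neither comes from the original question):

```python
# Sketch: surface a page's chain of internal 301/302 redirects.
import requests

resp = requests.get("http://example.com/some-page", allow_redirects=True)
for hop in resp.history:  # each intermediate redirect response, in order
    print(hop.status_code, hop.url, "->", hop.headers.get("Location"))
print("final:", resp.status_code, resp.url)
```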

Is there a hashing algorithm that is tolerant of minor differences?

旧巷老猫 submitted on 2019-12-03 12:17:51
I'm doing some web-crawling work where I look for certain terms in web pages, find their location on the page, and then cache it for later use. I'd like to be able to check the page periodically for any major changes. Something like MD5 can be foiled by simply putting the current date and time on the page. Are there any hashing algorithms that work for something like this? A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content-defined chunking for a way to split up the document. I read a paper a few years…
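To make the shingling suggestion concrete, here is a minimal sketch of w-shingling plus Jaccard similarity (my own illustration, not from the original answer); a page whose only change is a timestamp still scores high, where an MD5 comparison would flip completely:

```python
# Compare two page versions by their sets of overlapping 5-word shingles.
def shingles(text, w=5):
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

body = ("Widgets Inc makes widgets in many sizes and colours. Our widgets are tested "
        "for durability and shipped worldwide. Contact our sales team for bulk pricing "
        "and see the catalogue page for the full range of widget accessories.")
page_v1 = "Page generated on 2019-12-03 10:00. " + body
page_v2 = "Page generated on 2019-12-04 09:30. " + body

# High similarity despite the timestamp change; identical pages score 1.0,
# an unrelated page scores near 0.0.
print(jaccard(shingles(page_v1), shingles(page_v2)))
```

If a single compact fingerprint is needed rather than a set comparison, simhash or MinHash over these shingles is the usual next step.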

How do I remove a query string from a URL?

[亡魂溺海] submitted on 2019-12-03 11:08:40
I am using Scrapy to crawl a site that seems to append random values to the query string at the end of each URL. This is turning the crawl into a sort of infinite loop. How do I make Scrapy ignore the query-string part of the URLs? See the urlparse module (urllib.parse in Python 3). Example code: from urlparse import urlparse o = urlparse('http://url.something.com/bla.html?querystring=stuff') url_without_query_string = o.scheme + "://" + o.netloc + o.path Example output: Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Type "help", "copyright", "credits" or…
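The same idea in Python 3, where the module lives at urllib.parse; a small sketch (the helper name is mine, not from the answer):

```python
# Strip the query string (and fragment) from a URL before scheduling it.
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("http://url.something.com/bla.html?querystring=stuff"))
# -> http://url.something.com/bla.html
```

If the links come from Scrapy's LinkExtractor, its process_value callback is a natural place to hook in a helper like this so every extracted link is normalized before it is requested.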

How to prevent getting blacklisted while scraping Amazon [closed]

大城市里の小女人 submitted on 2019-12-03 10:17:01
Question: I am trying to scrape Amazon with Scrapy, but I get this error: DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> (failed 1 times): 503 Service Unavailable. I think that it's because Amazon is very good…
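The 503 responses are Amazon throttling the spider. A common first step is to slow the crawl down and identify it with a realistic User-Agent; a settings.py sketch with purely illustrative values (none of this comes from the original question, and none of it guarantees Amazon will stop blocking):

```python
# settings.py -- illustrative politeness settings for a Scrapy project
DOWNLOAD_DELAY = 5                  # seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay so requests look less mechanical
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one in-flight request per domain
AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
USER_AGENT = "Mozilla/5.0 (compatible; MyResearchBot/1.0)"  # hypothetical UA string
```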

Scrapy Crawling Speed is Slow (60 pages / min)

非 Y 不嫁゛ submitted on 2019-12-03 09:47:22
Question: I am experiencing slow crawl speeds with Scrapy (around 1 page / sec). I'm crawling a major website from AWS servers, so I don't think it's a network issue. CPU utilization is nowhere near 100%, and if I start multiple Scrapy processes the crawl speed is much faster. Scrapy seems to crawl a bunch of pages, then hang for several seconds, and then repeat. I've tried playing with CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_DOMAIN = 500, but this doesn't really seem to move the needle past about 20.
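Raw concurrency is rarely the only cap; a sketch of other Scrapy settings that commonly throttle a single process (illustrative values, not a diagnosis of this particular crawl):

```python
# settings.py -- settings that commonly limit throughput (illustrative values)
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
DOWNLOAD_DELAY = 0               # any non-zero delay throttles requests per domain
AUTOTHROTTLE_ENABLED = False     # AutoThrottle deliberately limits concurrency based on latency
REACTOR_THREADPOOL_MAXSIZE = 20  # thread pool used for DNS resolution, among other things
LOG_LEVEL = "INFO"               # DEBUG logging of every request adds noticeable overhead
```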

Wikipedia text download

大城市里の小女人 submitted on 2019-12-03 09:44:24
I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online? To give you some overview of my project: I want to find the interesting words of a few articles I am interested in. To find these interesting words, I am planning to apply tf/idf to calculate the term frequency for each word and pick the ones with a high frequency. But to calculate the tf, I need to know the total occurrences in the whole of Wikipedia. How can this be done? From Wikipedia: http://en.wikipedia.org…
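Wikipedia publishes full database dumps, so crawling is normally unnecessary. For the scoring step, here is a minimal tf-idf sketch over a toy corpus (my own illustration, not from the thread); the idf term is the part that needs document counts across the whole collection:

```python
# Minimal tf-idf over a toy corpus; the toy documents stand in for article text.
import math
from collections import Counter

docs = {
    "article_a": "the cat sat on the mat".split(),
    "article_b": "the dog chased the cat".split(),
    "article_c": "wikipedia is a free encyclopedia".split(),
}

N = len(docs)
df = Counter()                       # number of documents each term appears in
for words in docs.values():
    df.update(set(words))

def tf_idf(term, doc_words):
    tf = doc_words.count(term) / len(doc_words)
    idf = math.log(N / df[term]) if df[term] else 0.0
    return tf * idf

for name, words in docs.items():
    scored = sorted(set(words), key=lambda t: tf_idf(t, words), reverse=True)
    print(name, scored[:3])          # the three most "interesting" words per document
```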

Creating a bot/crawler

╄→尐↘猪︶ㄣ submitted on 2019-12-03 09:04:38
I would like to make a small bot in order to automatically and periodically surf a few partner websites. This would save a lot of employees here several hours. The bot must be able to: connect to these websites, log itself in as a user on some of them, and access and parse particular information on the website. The bot must be integrated into our website and change its settings (which user is used…) with data from our website. Eventually it must summarize the parsed information. Preferably this operation should be done from the client side, not on the server. I tried Dart last month and loved it… I would like…

How to set up a robots.txt which only allows the default page of a site

家住魔仙堡 submitted on 2019-12-03 08:35:24
Question: Say I have a site at http://example.com. I would really like to allow bots to see the home page, but any other page needs to be blocked, as it is pointless to spider. In other words, http://example.com and http://example.com/ should be allowed, but http://example.com/anything and http://example.com/someendpoint.aspx should be blocked. Further, it would be great if I could allow certain query strings to pass through to the home page: http://example.com?okparam=true but not http://example.com…
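A sketch of one way to express this (not from the original thread): the Allow directive and the $ end-of-URL anchor are honoured by major crawlers such as Googlebot and Bingbot, but they are extensions, not part of the original robots.txt standard, and query-string matching is even less portable across crawlers.

```
# robots.txt -- sketch only; Allow and the $ anchor are crawler extensions
User-agent: *
Disallow: /          # block everything by default...
Allow: /$            # ...except the bare root URL
Allow: /?okparam=    # hypothetical: let one query string through (longest-match rules)
```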

How to understand this raw HTML of Yahoo! Finance when retrieving data using Python?

倾然丶 夕夏残阳落幕 submitted on 2019-12-03 08:18:39
I've been trying to retrieve the stock price from Yahoo! Finance, for example for Apple Inc. My code is like this (using Python 2): import requests from bs4 import BeautifulSoup as bs html='http://finance.yahoo.com/quote/AAPL/profile?p=AAPL' r = requests.get(html) soup = bs(r.text) The problem is that when I look at the raw HTML behind this web page, the class names are dynamic (see the figure below), which makes it hard for BeautifulSoup to get tags. How can I understand these classes, and how can I get the data? [Figure: HTML of the Yahoo! Finance page] PS: 1) I know pandas_datareader.data, but that's for historical data; I want real-time stock data; 2…
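A sketch of one way to look past the generated class names (my own illustration, not the original code): fetch the page, parse it with an explicit parser, and dump each tag's other attributes to look for stable hooks such as id or data-* attributes. Note that much of this page is rendered by JavaScript, so the figures may not be present in the static HTML at all:

```python
# Inspect the page for attributes more stable than the generated class names.
import requests
from bs4 import BeautifulSoup

url = "http://finance.yahoo.com/quote/AAPL/profile?p=AAPL"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})  # a browser-like UA often helps
soup = BeautifulSoup(r.text, "html.parser")                   # pass the parser explicitly

for tag in soup.find_all(True)[:50]:      # first 50 tags, just for manual inspection
    stable = {k: v for k, v in tag.attrs.items() if k != "class"}
    if stable:
        print(tag.name, stable)
```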

How to allow known web crawlers and block spammers and harmful robots from scanning an ASP.NET website

北战南征 submitted on 2019-12-03 08:06:31
How can I configure my site to allow crawling from well-known robots like Google, Bing, Yahoo, Alexa, etc., and stop other harmful spammers and robots? Should I block particular IPs? Please discuss any pros and cons. Is there anything to be done in web.config or IIS? Can I do it server-wide if I have a VPS with root access? Thanks. Kiril: I'd recommend that you take a look at the answer I posted to a similar question: How to identify web-crawler? Robots.txt: robots.txt is useful for polite bots, but spammers are generally not polite, so they tend to ignore it; it's great if you have a robots.txt, since it…
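The usual way to tell a real crawler from an impostor is reverse-DNS verification: reverse-resolve the visiting IP, check the hostname's domain, then forward-resolve it and confirm it maps back to the same IP. A sketch of the checks in Python (the site itself is ASP.NET, so treat this as pseudocode; the domain suffixes are the commonly published ones for Googlebot and Bingbot):

```python
# Verify a visitor claiming to be a major search-engine crawler.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip):
    try:
        hostname = socket.gethostbyaddr(ip)[0]        # reverse DNS lookup
    except socket.herror:
        return False                                  # no PTR record at all
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False                                  # claims to be a bot, wrong domain
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        return False
    return ip in addresses                            # forward-confirm the lookup
```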