web-crawler

Make Scrapy follow links and collect data

你说的曾经没有我的故事 submitted on 2019-12-05 16:17:23
Question: I am trying to write a program in Scrapy that opens links and collects data from this tag: <p class="attrgroup"></p>. I've managed to make Scrapy collect all the links from a given URL, but not to follow them. Any help is much appreciated.

Answer 1: You need to yield Request instances for the links you want to follow, assign a callback, and extract the text of the desired p element in the callback:

    # -*- coding: utf-8 -*-
    import scrapy

    # item class included here
    class DmozItem(scrapy.Item):
        # define the fields for
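A minimal sketch of what the answer describes (the spider name, start URL, and link selector below are assumptions, not the original code): yield a Request for every link found, and read the <p class="attrgroup"> text in the callback.

    import scrapy

    class AttrSpider(scrapy.Spider):
        name = "attr_spider"
        start_urls = ["https://example.com/listings"]  # hypothetical start URL

        def parse(self, response):
            # follow every link found on the listing page
            for href in response.css("a::attr(href)").getall():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

        def parse_item(self, response):
            # extract the text inside <p class="attrgroup">
            yield {
                "url": response.url,
                "attrs": response.css("p.attrgroup ::text").getall(),
            }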

Is Scrapy single-threaded or multi-threaded?

若如初见. submitted on 2019-12-05 14:51:57
Question: There are a few concurrency settings in Scrapy, like CONCURRENT_REQUESTS. Does that mean the Scrapy crawler is multi-threaded? So if I run scrapy crawl my_crawler, will it literally fire multiple simultaneous requests in parallel? I'm asking because I've read that Scrapy is single-threaded.

Answer 1: Scrapy is single-threaded, except the interactive shell and some tests, see source. It's built on top of Twisted, which is single-threaded too, and makes use of its own asynchronous concurrency
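In other words, the concurrency comes from Twisted's event loop keeping many requests in flight within a single thread, not from multiple threads. A sketch of the relevant settings (the values are illustrative, not recommendations):

    # settings.py
    CONCURRENT_REQUESTS = 16            # total requests in flight at once
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
    DOWNLOAD_DELAY = 0.25               # polite delay between requests, in seconds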

Crawling a website that needs authentication

限于喜欢 submitted on 2019-12-05 13:48:10
How would I write a simple script (in cURL/python/ruby/bash/perl/java) that logs in to OkCupid and tallies how many messages I've received each day? The output would be something like:

    1/21/2011    1 messages
    1/22/2011    0 messages
    1/23/2011    2 messages
    1/24/2011    1 messages

The main issue is that I have never written a web crawler before. I have no idea how to programmatically log in to a site like OkCupid. How do you make the authentication persist while loading different pages? etc. Once I get access to the raw HTML, I'll be okay with regex and maps etc. Here's a solution using cURL that downloads
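The cURL answer is cut off above; as a sketch of the same idea in Python (an alternative, not the answer's actual code), the requests library's Session object stores the login cookies, so authentication persists across page loads. The login URL and form field names below are assumptions, not OkCupid's real ones.

    import requests

    session = requests.Session()
    # log in once; the session keeps the cookies the server sets
    session.post(
        "https://www.example.com/login",        # hypothetical login URL
        data={"username": "me@example.com", "password": "secret"},
    )
    # later requests reuse those cookies automatically
    inbox_html = session.get("https://www.example.com/messages").text
    print(len(inbox_html))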

How to set Robots.txt or Apache to allow crawlers only at certain hours?

独自空忆成欢 submitted on 2019-12-05 11:34:50
As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them during non-busy hours. Is there a method to achieve this? Edit: thanks for all the good advice. This is another solution we found. 2bits.com has an article on setting up an IPTables firewall to limit the number of connections from certain IP addresses. The article describes the use of connlimit: in newer Linux kernels, there is a connlimit module for iptables. It can be used like this:

    iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT

This limits connections
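A different angle on the original question (a sketch, not taken from the answers above): robots.txt has no time-of-day syntax, so one option is to serve a restrictive robots.txt during peak hours and a permissive one otherwise. Well-behaved crawlers cache robots.txt, so this is approximate at best; the peak-hour window below is an assumption.

    from datetime import datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PEAK_HOURS = range(9, 18)  # assumed peak window: 09:00-17:59

    ALLOW_ALL = b"User-agent: *\nDisallow:\n"
    DISALLOW_ALL = b"User-agent: *\nDisallow: /\n"

    class RobotsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/robots.txt":
                body = DISALLOW_ALL if datetime.now().hour in PEAK_HOURS else ALLOW_ALL
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8000), RobotsHandler).serve_forever()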

Python multithreading crawler

守給你的承諾、 submitted on 2019-12-05 11:24:04
Hello! I am trying to write a web crawler with Python. I wanted to use Python multithreading. Even after reading earlier suggested papers and tutorials, I still have a problem. My code is here (the whole source code is here):

    class Crawler(threading.Thread):
        global g_URLsDict
        varLock = threading.Lock()
        count = 0

        def __init__(self, queue):
            threading.Thread.__init__(self)
            self.queue = queue
            self.url = self.queue.get()

        def run(self):
            while 1:
                print self.getName() + " started"
                self.page = getPage(self.url)
                self.parsedPage = getParsedPage(self.page, fix=True)
                self.urls = getLinksFromParsedPage(self
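The excerpt is cut off and relies on helpers (getPage, getParsedPage, getLinksFromParsedPage) that aren't shown. For comparison, a self-contained sketch of the same pattern (worker threads pulling URLs from a shared Queue) using only the standard library; the start URL and page cap are arbitrary choices for the example:

    import queue
    import threading
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    MAX_PAGES = 50  # safety cap so the sketch terminates

    class LinkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def worker(q, seen, lock):
        while True:
            url = q.get()
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
                parser = LinkParser()
                parser.feed(html)
                for link in parser.links:
                    absolute = urljoin(url, link)
                    with lock:
                        if absolute not in seen and len(seen) < MAX_PAGES:
                            seen.add(absolute)
                            q.put(absolute)
            except Exception as exc:
                print("failed", url, exc)
            finally:
                q.task_done()

    if __name__ == "__main__":
        q = queue.Queue()
        seen = {"https://example.com/"}      # hypothetical start URL
        lock = threading.Lock()
        q.put("https://example.com/")
        for _ in range(4):                   # four worker threads
            threading.Thread(target=worker, args=(q, seen, lock), daemon=True).start()
        q.join()                             # wait until the queue is drained
        print("crawled", len(seen), "URLs")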

PHP crawl - JavaScript enabled

偶尔善良 submitted on 2019-12-05 10:46:29
Question: Hello, does anyone know of a way of creating a spider that acts as if it has JavaScript enabled?

PHP Code:

    file_get_contents("http://www.google.co.uk/search?hl=en&q=".$keyword."&start=".($x*10)."&sa=N")

would retrieve the output of that page. If you used:

PHP Code:

    file_get_contents("http://www.facebook.com/something/something.something.php")

(I'm not sure, I just know Facebook is a good example) it would return the output, which I'm guessing would include something along the lines of
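The answer isn't included in this excerpt. One common way to get JavaScript-rendered HTML (shown here in Python rather than PHP, as a sketch and not the original answer) is to drive a real browser with Selenium and read the page source after scripts have run:

    from selenium import webdriver

    # assumption: Chrome and a matching chromedriver are installed and on PATH
    driver = webdriver.Chrome()
    driver.get("http://www.google.co.uk/search?hl=en&q=keyword")
    html = driver.page_source   # the HTML after JavaScript has executed
    driver.quit()
    print(len(html))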

What does it mean to say a web crawler is I/O bound and not CPU bound?

强颜欢笑 submitted on 2019-12-05 10:17:20
I've seen this in some answers on S/O where the point is made that the programming language doesn't matter as much for a crawler, so C++ is overkill versus, say, Python. Can someone please explain this in layman's terms so that there's no ambiguity about what is implied? Clarification of the underlying assumption here is also appreciated. Thanks.

It means that I/O is the bottleneck here. The act of going out to the net to retrieve a page (I/O) is slower than analysing the page (CPU). So making the CPU bit ten times faster will have little effect on the overall time taken. On the other hand,
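A back-of-the-envelope illustration with made-up numbers (not from the answer): if fetching a page takes 500 ms of waiting on the network and parsing it takes 10 ms of CPU, a ten-times-faster parser changes very little.

    fetch_time = 0.500  # seconds waiting on the network (I/O), assumed
    parse_time = 0.010  # seconds parsing the page (CPU), assumed

    current_total = fetch_time + parse_time
    faster_cpu_total = fetch_time + parse_time / 10

    print(f"now:            {current_total:.3f}s per page")    # 0.510s
    print(f"10x faster CPU: {faster_cpu_total:.3f}s per page") # 0.501s, under 2% better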

Make a web crawler/spider

ぐ巨炮叔叔 submitted on 2019-12-05 10:13:59
I'm looking into making a web crawler/spider, but I need someone to point me in the right direction to get started. Basically, my spider is going to search for audio files and index them. I'm just wondering if anyone has any ideas for how I should do it. I've heard that doing it in PHP would be extremely slow. I know VB.NET, so could that come in handy? I was thinking about using Google's filetype search to get links to crawl. Would that be OK?

Chris Diver: In VB.NET you will need to get the HTML first, so use the WebClient class or the HttpWebRequest and HttpWebResponse classes. There is plenty of
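For illustration only (a sketch in Python rather than the answer's VB.NET, with a made-up URL and extension list): fetch one page and keep the links that look like audio files, which is the core of what such a spider would index.

    import re
    import urllib.request

    AUDIO_EXTENSIONS = (".mp3", ".ogg", ".wav", ".flac")

    def audio_links(url):
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        hrefs = re.findall(r'href=["\'](.*?)["\']', html, flags=re.IGNORECASE)
        return [h for h in hrefs if h.lower().endswith(AUDIO_EXTENSIONS)]

    if __name__ == "__main__":
        for link in audio_links("https://example.com/music/"):  # hypothetical URL
            print(link)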

Can I execute a scrapy (python) crawl outside the project dir?

放肆的年华 submitted on 2019-12-05 09:18:55
Question: The docs say I can only execute the crawl command inside the project dir:

    scrapy crawl tutor -o items.json -t json

but I really need to execute it from my Python code (the Python file is not inside the current project dir). Is there an approach that fits my requirement? My project tree:

    .
    ├── etao
    │   ├── etao
    │   │   ├── __init__.py
    │   │   ├── items.py
    │   │   ├── pipelines.py
    │   │   ├── settings.py
    │   │   └── spiders
    │   │       ├── __init__.py
    │   │       ├── etao_spider.py
    │   ├── items.json
    │   ├── scrapy.cfg
    │   └── start.py
    └──
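The answer is cut off above; a commonly used way to do this (a sketch, not the original answer's code) is to drive the crawl from your own script with CrawlerProcess and point SCRAPY_SETTINGS_MODULE at the project's settings, so the script can live outside the project dir. This assumes the etao package from the tree above is importable (its parent directory is on sys.path / PYTHONPATH).

    import os
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # tell Scrapy which settings module to load, as scrapy.cfg normally would
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "etao.settings")

    settings = get_project_settings()
    settings.set("FEED_URI", "items.json")   # equivalent of -o items.json (older-style feed settings)
    settings.set("FEED_FORMAT", "json")      # equivalent of -t json

    process = CrawlerProcess(settings)
    process.crawl("tutor")                   # the spider name from the question's command
    process.start()                          # blocks until the crawl finishes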

Data scraping with scrapy [closed]

橙三吉。 submitted on 2019-12-05 08:19:34
Question (closed as unclear 6 years ago): I want to make a new betting tool, but I need a database of odds and results and can't find anything on the web. I found this site that has a great archive: OddsPortal. All I want to do is scrape the results and the