web-crawler

Is Scrapy single-threaded or multi-threaded?

邮差的信 posted on 2019-12-03 23:22:17
There are a few concurrency settings in Scrapy, like CONCURRENT_REQUESTS. Does that mean the Scrapy crawler is multi-threaded? So if I run scrapy crawl my_crawler, will it literally fire multiple simultaneous requests in parallel? I'm asking because I've read that Scrapy is single-threaded. Scrapy is single-threaded, except for the interactive shell and some tests, see source. It's built on top of Twisted, which is single-threaded too, and makes use of its own asynchronous concurrency capabilities, such as twisted.internet.interfaces.IReactorThreads.callFromThread, see source. Scrapy does most
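
A minimal sketch of how that asynchronous concurrency shows up in practice, assuming an illustrative spider name, URL and setting values that are not from the question: many requests are in flight at once, yet everything runs on one thread.

import scrapy

class ConcurrencyDemoSpider(scrapy.Spider):
    name = "concurrency_demo"
    start_urls = ["https://example.com/"]  # placeholder URL

    # Concurrency in Scrapy is configured, not threaded: Twisted's reactor
    # multiplexes these requests on a single thread via non-blocking I/O.
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    }

    def parse(self, response):
        # Each yielded Request goes to the scheduler; the reactor keeps up to
        # CONCURRENT_REQUESTS of them active at the same time.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)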

Scrapy crawl all sitemap links

馋奶兔 posted on 2019-12-03 23:12:12
Question: I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is: class MySpider(SitemapSpider): name = "xyz" allowed_domains = ["xyz.nl"] sitemap_urls = ["http://www.xyz.nl/sitemap.xml"] def parse(self, response): print response.url Answer 1: You need to add sitemap_rules to process the data
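
A hedged sketch of what adding sitemap_rules could look like, assuming every URL in the sitemap should go to a single callback; the xyz.nl values come from the question, while the callback body is only illustrative.

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    # Each (pattern, callback) pair routes matching sitemap URLs to a callback;
    # an empty pattern matches every URL listed in the sitemap.
    sitemap_rules = [("", "parse_page")]

    def parse_page(self, response):
        # The spider now visits each sitemap link and hands the response here.
        self.logger.info("Crawled %s", response.url)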

Crawler4j with authentication

孤者浪人 posted on 2019-12-03 22:48:37
I'm trying to execute crawler4j against a personal Redmine instance for testing purposes. I want to authenticate and crawl several levels of depth in the application. I followed this tutorial from the crawler4j FAQ and created the following snippet: import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.crawler4j.crawler.WebCrawler; import edu.uci.ics.crawler4j.parser.HtmlParseData; import edu.uci.ics.crawler4j.url.WebURL; public class CustomWebCrawler extends WebCrawler{ @Override public void visit(final Page pPage) { if (pPage.getParseData() instanceof HtmlParseData) { System.out.println("URL: " +

Mechanize form submission causes 'Assertion Error' in response when .read() is attempted

故事扮演 posted on 2019-12-03 22:42:37
I am writing a web-crawling program with Python and am unable to log in using mechanize. The form on the site looks like: <form method="post" action="PATLogon"> <h2 align="center"><img src="/myaladin/images/aladin_logo_rd.gif"></h2> <!-- ALADIN Request parameters --> <input type=hidden name=req value="db"> <input type=hidden name=key value="PROXYAUTH"> <input type=hidden name=url value="http://eebo.chadwyck.com/search"> <input type=hidden name=lib value="8"> <table> <tr><td><b>Last Name:</b></td> <td><input name=LN size=20 maxlength=26></td> <tr><td><b>University ID or Library Barcode:</b></td>
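
A minimal mechanize sketch for a form like the one above. The login URL is a placeholder, the form is selected by its PATLogon action because it has no name attribute, and the barcode/ID input's name is truncated in the excerpt, so "PIN" below is purely hypothetical.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # the login page may disallow robots
br.open("https://library.example.edu/myaladin/PATLogon")  # placeholder URL

# Select the form that posts to PATLogon; the hidden req/key/url/lib inputs
# are submitted automatically with their preset values.
br.select_form(predicate=lambda f: f.action and f.action.endswith("PATLogon"))
br["LN"] = "Lastname"         # Last Name field from the form
br["PIN"] = "12345678901234"  # hypothetical name for the truncated ID/barcode field

response = br.submit()
print(response.read()[:500])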

Data scraping with scrapy [closed]

你说的曾经没有我的故事 posted on 2019-12-03 21:59:47
I want to make a new betting tool, but I need a database of odds and results and can't find anything on the web. I found this site that has a great archive: OddsPortal. All I want to do is scrape the results and the odds from a page like the one above. I found that a tool called Scrapy can do it; is that true? Can someone help me with some hints? I don't know about Scrapy, but JSoup should help you get started. http://jsoup.org/ Download the .jar file. Right-click your project folder > Properties > Java build path > libraries > add external jars > find the jar and click it. It's a nice little
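
Since the question asks specifically about Scrapy, here is a minimal sketch of the kind of spider it enables. The start URL and CSS selectors are placeholders, because the real OddsPortal markup is not reproduced here, and heavily JavaScript-driven pages may need extra tooling beyond plain Scrapy.

import scrapy

class OddsSpider(scrapy.Spider):
    name = "odds"
    start_urls = ["https://www.oddsportal.com/results/"]  # placeholder listing page

    def parse(self, response):
        # Illustrative selectors only; adapt them to the actual page structure.
        for row in response.css("table tr"):
            yield {
                "match": row.css("td.name::text").get(),
                "odds": row.css("td.odds::text").getall(),
            }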

What are the best prebuilt libraries for doing Web Crawling in Python [duplicate]

﹥>﹥吖頭↗ posted on 2019-12-03 21:54:35
I need to crawl and store locally, for future analysis, the contents of a finite list of websites. I basically want to slurp in all pages and follow all internal links to get the entire publicly available site. Are there existing free libraries to get me there? I've seen Chilkat, but it's for pay. I'm just looking for baseline functionality here. Thoughts? Suggestions? Exact duplicate: Anyone know of a good python based web crawler that I could use? Use Scrapy. It is a Twisted-based web crawler framework. Still under heavy development, but it works already. Has many goodies: Built-in support for
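
A minimal Scrapy sketch of the "slurp a whole site" pattern described above, assuming example.com stands in for one of the target sites and that yielding the raw HTML for local storage is enough for later analysis.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SiteMirrorSpider(CrawlSpider):
    name = "site_mirror"
    allowed_domains = ["example.com"]      # keeps the crawl on internal links
    start_urls = ["https://example.com/"]

    # Follow every internal link and hand each fetched page to parse_page.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # Store the raw body for offline analysis (e.g. via a feed export).
        yield {"url": response.url, "html": response.text}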

Can I execute a scrapy (Python) crawl outside the project dir?

南笙酒味 posted on 2019-12-03 21:47:37
The docs say I can only execute the crawl command inside the project dir: scrapy crawl tutor -o items.json -t json but I really need to execute it from my Python code (the Python file is not inside the current project dir). Is there any approach that fits my requirement? My project tree: . ├── etao │ ├── etao │ │ ├── __init__.py │ │ ├── items.py │ │ ├── pipelines.py │ │ ├── settings.py │ │ └── spiders │ │ ├── __init__.py │ │ ├── etao_spider.py │ ├── items.json │ ├── scrapy.cfg │ └── start.py └── start.py <-------------- I want to execute the script here. Anyway, here's my code, following this link, but it
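
One commonly used approach, sketched under the assumption that the paths match the tree above: point Scrapy at the project's settings module before starting the crawl, so the script can live outside the project directory. SCRAPY_SETTINGS_MODULE is Scrapy's standard environment variable; the sys.path line assumes the outer etao directory sits next to this script.

import os
import sys

# Make the project package importable and tell Scrapy where its settings live,
# so this script can run from outside the project directory.
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), "etao"))
os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "etao.settings")

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("tutor")   # spider name from the question's crawl command
process.start()          # blocks until the crawl finishes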

How to limit scrapy request objects?

与世无争的帅哥 posted on 2019-12-03 21:43:19
So I have a spider that I thought was leaking memory; it turns out it is just grabbing too many links from link-rich pages (sometimes it queues upwards of 100,000) when I check the telnet console with >>> prefs(). Now I have been over the docs and Google again and again and I can't find a way to limit the requests that the spider takes in. What I want is to be able to tell it to hold back on taking requests once a certain amount goes into the scheduler. I have tried setting a DEPTH_LIMIT, but that only lets it grab a large amount and then run the callback on the ones that it has grabbed. It seems like a
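
One possible workaround, sketched rather than a built-in setting: have the spider stop yielding new requests once it has scheduled a chosen number, so the backlog stays bounded. The cap value and spider skeleton are illustrative.

import scrapy

class CappedSpider(scrapy.Spider):
    name = "capped"
    start_urls = ["https://example.com/"]  # placeholder
    max_queued = 5000                      # illustrative cap on scheduled requests

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.queued = 0

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # Stop feeding the scheduler once the cap is reached; requests
            # already scheduled are still processed normally.
            if self.queued >= self.max_queued:
                return
            self.queued += 1
            yield response.follow(href, callback=self.parse)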

How to increase number of documents fetched by Apache Nutch crawler

被刻印的时光 ゝ posted on 2019-12-03 21:26:05
I am using Apache Nutch 2.3 for crawling. There were about 200 URLs in the seed at the start. Now, as time has elapsed, the number of documents crawled is decreasing, or at most staying the same as at the start. How can I configure Nutch so that the number of documents crawled increases? Is there any parameter that can be used to control the number of documents? Second, how can I count the number of documents crawled per day by Nutch? m5khan: One crawl cycle consists of four steps: Generate, Fetch, Parse and Update DB. For detailed information, read my answer here. The limited URL fetch can be caused by the

How to crawl a website after logging in to it with a username and password

爱⌒轻易说出口 posted on 2019-12-03 21:23:04
I have written a web crawler that crawls a website with a keyword, but I want to log in to my specified website and filter information by keyword. How do I achieve that? I am posting the code I have done so far. public class DB { public Connection conn = null; public DB() { try { Class.forName("com.mysql.jdbc.Driver"); String url = "jdbc:mysql://localhost:3306/test"; conn = DriverManager.getConnection(url, "root","root"); System.out.println("conn built"); } catch (SQLException e) { e.printStackTrace(); } catch (ClassNotFoundException e) { e.printStackTrace(); } } public ResultSet runSql(String sql) throws