web-crawler

Find all the web pages in a domain and its subdomains

三世轮回 submitted on 2019-12-05 21:37:48
I am looking for a way to find all the web pages and subdomains in a domain. For example, in the uoregon.edu domain, I would like to find all the web pages in that domain and in all of its subdomains (e.g., cs.uoregon.edu). I have been looking at Nutch, and I think it can do the job. However, it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans each page for URLs belonging to the same domain. Furthermore, it seems that Nutch saves the linkdb in a serialized format. How can I read it? I tried Solr, and it can read Nutch's
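If the goal is only to enumerate URLs inside a domain and its subdomains, rather than to index page content as Nutch does, a small Scrapy CrawlSpider is one alternative. A minimal sketch, with the spider name and output handling being illustrative:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DomainMapSpider(CrawlSpider):
        # Hypothetical spider; adjust the domain and start URL to your target
        name = "domain_map"
        allowed_domains = ["uoregon.edu"]        # also matches subdomains such as cs.uoregon.edu
        start_urls = ["https://www.uoregon.edu/"]

        # Follow every in-domain link; record the URL instead of indexing page content
        rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

        def parse_page(self, response):
            yield {"url": response.url}

Running it with, for example, scrapy runspider domain_map.py -o urls.json gives a flat list of in-domain URLs rather than a serialized linkdb.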

scrapy not printing out stacktrace on exception

不打扰是莪最后的温柔 submitted on 2019-12-05 20:58:38
Is there a special mechanism to force Scrapy to print out every Python exception/stacktrace? I made a simple mistake of getting a list attribute wrong, resulting in an AttributeError that did not show up in full in the logs. What showed up was:

    2015-11-15 22:13:50 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 264,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 40342,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 11, 15, 22, 13,
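One way to surface full tracebacks, assuming the exception happens inside a spider callback, is to hook Scrapy's spider_error signal. The extension name and wiring below are illustrative:

    from scrapy import signals

    class FullTracebackLogger:
        """Illustrative extension: log the complete traceback of callback errors."""

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(ext.on_spider_error, signal=signals.spider_error)
            return ext

        def on_spider_error(self, failure, response, spider):
            # failure is a Twisted Failure; getTraceback() returns the full stack trace
            spider.logger.error("Callback error on %s:\n%s", response.url, failure.getTraceback())

Enable it through the EXTENSIONS setting, e.g. EXTENSIONS = {'myproject.extensions.FullTracebackLogger': 500} (the module path is hypothetical).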

Any way to tell selenium not to execute js at some point?

你离开我真会死。 submitted on 2019-12-05 20:37:47
I want to crawl a site whose content is partly generated by JS. The site runs a JS update every 5 seconds (it requests a new encrypted JS file that I can't parse). My code:

    from selenium import webdriver

    driver = webdriver.PhantomJS()
    driver.set_window_size(1120, 550)
    driver.get(url)

    trs = driver.find_elements_by_css_selector('.table tbody tr')
    print len(trs)

    items = []
    for tr in trs:
        try:
            items.append(tr.text)
        except:
            # because the JS updates the content, this tr may already be gone
            pass
    print len(items)

len(items) does not match len(trs). How can I tell Selenium to stop executing JS, or to stop working, after I run trs =
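A common workaround, rather than disabling JS, is to take a single snapshot of the rendered DOM and parse that snapshot outside the browser, so later JS updates cannot invalidate the elements. A minimal sketch, assuming lxml (with cssselect) is installed and using a placeholder URL:

    from lxml import html
    from selenium import webdriver

    url = "http://www.example.com"          # placeholder for the site in the question

    driver = webdriver.PhantomJS()
    driver.get(url)

    # Freeze the rendered DOM at this instant; later JS updates won't touch this string
    snapshot = html.fromstring(driver.page_source)
    driver.quit()

    items = [tr.text_content() for tr in snapshot.cssselect('.table tbody tr')]
    print(len(items))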

Scrapy SgmlLinkExtractor is ignoring allowed links

匆匆过客 submitted on 2019-12-05 20:08:32
Question: Please take a look at the spider example in the Scrapy documentation. The explanation is: "This spider would start crawling example.com's home page, collecting category links and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it." I copied the same spider exactly and replaced "example.com" with another initial URL.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from
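For reference, a rough present-day equivalent of that documentation example (the old scrapy.contrib path and SgmlLinkExtractor are deprecated). The allow and deny patterns are regular expressions matched against the full URL, so they must fit the URL layout of the new site; copied rules that match nothing on the new domain are the usual reason links get silently ignored:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MySpider(CrawlSpider):
        name = "example"                              # illustrative name
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com"]

        rules = (
            # These regexes come from the docs example; adapt them to the
            # target site's URLs or no links will be extracted.
            Rule(LinkExtractor(allow=(r"category\.php",), deny=(r"subsection\.php",))),
            Rule(LinkExtractor(allow=(r"item\.php",)), callback="parse_item"),
        )

        def parse_item(self, response):
            self.logger.info("Hi, this is an item page! %s", response.url)
            yield {"url": response.url}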

Crawl website using wget and limit total number of crawled links

℡╲_俬逩灬. submitted on 2019-12-05 19:24:14
I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

    wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

You can't. wget doesn't support this, so if you want something like this you would have to write a tool yourself. You could fetch the main page, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not
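A rough sketch of the hand-rolled approach the answer describes, assuming requests and BeautifulSoup are available (the 100-link cutoff and same-host check are illustrative choices):

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def first_n_links(start_url, n=100):
        host = urlparse(start_url).netloc
        seen, queue, found = set(), deque([start_url]), []
        while queue and len(found) < n:
            page = queue.popleft()
            if page in seen:
                continue
            seen.add(page)
            try:
                resp = requests.get(page, timeout=10)
            except requests.RequestException:
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(page, a["href"])
                if link not in found:
                    found.append(link)
                if len(found) >= n:
                    break
                if urlparse(link).netloc == host:
                    queue.append(link)      # only keep crawling links on the same host
        return found

    print("\n".join(first_n_links("http://www.example.com", 100)))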

How to crawl links on all pages of a web site with Scrapy

泄露秘密 submitted on 2019-12-05 18:56:28
I'm learning about Scrapy and I'm trying to extract all links of the form "http://lattes.cnpq.br/" followed by a sequence of numbers, for example: http://lattes.cnpq.br/0281123427918302. But I don't know which page on the web site contains this information. For example, on this web site: http://www.ppgcc.ufv.br/ the links that I want are on this page: http://www.ppgcc.ufv.br/?page_id=697. What could I do? I'm trying to use rules but I don't know how to use regular expressions correctly. Thank you. EDIT: I need to search all pages of the main site (ppgcc.ufv.br) for that kind of link (http://lattes
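One way to approach this, assuming the goal is to visit every page under ppgcc.ufv.br and collect any lattes.cnpq.br link found along the way (the spider name and regex are illustrative):

    import scrapy

    class LattesLinkSpider(scrapy.Spider):
        name = "lattes_links"
        allowed_domains = ["ppgcc.ufv.br"]
        start_urls = ["http://www.ppgcc.ufv.br/"]

        def parse(self, response):
            # Collect every href matching the lattes pattern on this page
            for link in response.css("a::attr(href)").re(r"http://lattes\.cnpq\.br/\d+"):
                yield {"lattes_url": link}
            # Keep following internal links so every page of the site gets visited
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Scrapy's duplicate filter and the allowed_domains setting keep the crawl from revisiting pages or leaving the ppgcc.ufv.br site.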

Is it safe to use the same CookieContainer across multiple HttpWebRequests?

与世无争的帅哥 submitted on 2019-12-05 18:51:47
I am building a kind of web crawler and I need to persist cookie state between requests. I download all pages asynchronously, creating a new HttpWebRequest instance each time but setting the same CookieContainer. The pages can write and read cookies. Can I do this safely? Is there any alternative other than subclassing CookieContainer and putting locks around every method? MSDN says that this class isn't thread safe, but in practice, can I do it? According to the documentation: "Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe." So

How big is the difference between html parsing and web crawling in python? [closed]

非 Y 不嫁゛ submitted on 2019-12-05 18:11:44
Closed as needing more focus. I need to grab some data from websites for my Django website. Now I am confused whether I should use Python parsing libraries or web crawling libraries. Do search engine libraries also fall in the same category? I want to know how big the difference is between the two, and if I want to use those functions inside my website, which should I use. If you can get away with background web
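To make the distinction concrete: a parser extracts data from one document you already have, while a crawler also decides which URLs to fetch next. A minimal illustration, assuming requests and BeautifulSoup and a placeholder URL:

    import requests
    from bs4 import BeautifulSoup

    # Parsing: extract data from a single page you already fetched
    html = requests.get("http://www.example.com").text
    soup = BeautifulSoup(html, "html.parser")
    headings = [h.get_text(strip=True) for h in soup.find_all("h2")]

    # Crawling: parsing plus following the discovered links to new pages
    for a in soup.find_all("a", href=True)[:5]:
        print("a crawler would also fetch:", a["href"])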

Scrapy process.crawl() to export data to json

拟墨画扇 submitted on 2019-12-05 16:38:34
This might be a subquestion of "Passing arguments to process.crawl in Scrapy python", but the author marked an answer (one that doesn't answer the subquestion I'm asking) as satisfying. Here's my problem: I cannot use

    scrapy crawl mySpider -a start_urls(myUrl) -o myData.json

Instead I want/need to use crawlerProcess.crawl(spider). I have already figured out several ways to pass the arguments (and anyway it is answered in the question I linked), but I can't grasp how I am supposed to tell it to dump the data into myData.json, i.e. the -o myData.json part. Anyone got a suggestion? Or am I just
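The command-line -o flag maps onto Scrapy's feed export settings, so one way to get the same effect from code is to pass those settings to CrawlerProcess. A sketch assuming a spider class MySpider with a hypothetical import path (the FEEDS form needs Scrapy 2.1+; older versions use FEED_URI and FEED_FORMAT instead):

    from scrapy.crawler import CrawlerProcess
    from myproject.spiders.my_spider import MySpider   # hypothetical import path

    process = CrawlerProcess(settings={
        "FEEDS": {
            "myData.json": {"format": "json"},          # same effect as -o myData.json
        },
    })
    process.crawl(MySpider, start_urls=["http://www.example.com"])  # -a equivalents go here
    process.start()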

Scrape multiple accounts aka multiple logins

为君一笑 submitted on 2019-12-05 16:24:52
I successfully scrape data for a single account. I want to scrape multiple accounts on a single website; multiple accounts need multiple logins, and I want a way to manage login/logout. dangra: you can scrape multiple accounts in parallel by using one cookiejar per account session; see the "cookiejar" request meta key at http://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=cookiejar#std:reqmeta-cookiejar To clarify: suppose we have an array of accounts in settings.py:

    MY_ACCOUNTS = [
        {'login': 'my_login_1', 'pwd': 'my_pwd_1'},
        {'login': 'my_login_2', 'pwd': 'my_pwd_2'},
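A minimal sketch of how the cookiejar meta key keeps the sessions separate, assuming the MY_ACCOUNTS setting above, a login form at /login, and illustrative form field names and URLs:

    import scrapy

    class MultiAccountSpider(scrapy.Spider):
        name = "multi_account"

        def start_requests(self):
            accounts = self.settings.get("MY_ACCOUNTS", [])
            for i, account in enumerate(accounts):
                # A distinct cookiejar id per account keeps each session's cookies isolated
                yield scrapy.FormRequest(
                    "http://www.example.com/login",
                    formdata={"login": account["login"], "password": account["pwd"]},
                    meta={"cookiejar": i},
                    callback=self.after_login,
                )

        def after_login(self, response):
            # Propagate the same cookiejar id so later requests reuse this account's session
            yield response.follow(
                "/private/data",
                meta={"cookiejar": response.meta["cookiejar"]},
                callback=self.parse_data,
            )

        def parse_data(self, response):
            yield {"account": response.meta["cookiejar"], "url": response.url}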