web-crawler

How to crawl billions of pages? [closed]

非 Y 不嫁゛ submitted on 2019-12-02 14:19:36
Is it possible to crawl billions of pages on a single server? Not if you want the data to be up to date. Even a small player in the search game would number the pages crawled in the multiple billions. "In 2006, Google has indexed over 25 billion web pages,[32] 400 million queries per day,[32] 1.3 billion images, and over one billion Usenet messages." (Wikipedia) And remember that the quote cites numbers from 2006; that is ancient history, and the state of the art is well beyond it. Freshness of content: new content is constantly added at a very large rate (reality), and existing pages often

Scrape articles from WSJ with requests, cURL and BeautifulSoup

孤街醉人 submitted on 2019-12-02 14:13:29
Question: I'm a paid member of WSJ and I tried to scrape articles for my NLP project. I thought I kept the session.

rs = requests.session()
login_url="https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!
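As a rough sketch of the idea in the question (not the actual WSJ SSO flow; the Dow Jones login is an OAuth2 redirect flow, and every URL and form field below is a placeholder assumption), a requests.Session persists the cookies set at login so that later article requests are authenticated:

import requests
from bs4 import BeautifulSoup

# Minimal sketch: reuse one Session so login cookies are sent on later requests.
# The login URL, form field names, and article URL are placeholders, not the
# real WSJ/Dow Jones endpoints.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

login_url = "https://example.com/login"          # placeholder
article_url = "https://example.com/article/123"  # placeholder

# A plain form POST is only an approximation of the real SSO handshake.
session.post(login_url, data={"username": "me@example.com", "password": "secret"})

# Any later request on the same Session carries the cookies set at login.
response = session.get(article_url)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.get_text()[:500])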

Reading article content using Goose returns nothing

夙愿已清 submitted on 2019-12-02 14:11:23
I am trying to use Goose to read from .html files (a URL is specified here for the sake of convenience in the examples) [1]. But at times it doesn't show any text. Please help me out with this issue. Goose version used: https://github.com/agolo/python-goose/ (the present version gives some errors).

from goose import Goose
from requests import get

response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text
print text

Goose indeed uses several predefined elements which are likely a good starting point for
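As a defensive pattern (my own addition, not part of the question or answer): check whether Goose came back with an empty cleaned_text and fall back to a crude BeautifulSoup text dump so the pipeline still gets something. This sketch assumes goose3, the Python 3 fork; with the original python-goose the import and print syntax differ.

import requests
from goose3 import Goose
from bs4 import BeautifulSoup

response = requests.get('http://www.highbeam.com/doc/1P3-979471971.html')
article = Goose().extract(raw_html=response.content)
text = article.cleaned_text

if not text.strip():
    # Goose found no article body; strip tags ourselves as a last resort.
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text(separator=' ', strip=True)

print(text[:500])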

How do you archive an entire website for offline viewing?

房东的猫 submitted on 2019-12-02 14:00:56
We have actually burned static/archived copies of our ASP.NET websites for customers many times. We have used WebZip until now, but we have had endless problems with crashes, downloaded pages not being re-linked correctly, etc. We basically need an application that crawls and downloads static copies of everything on our ASP.NET website (pages, images, documents, CSS, etc.) and then processes the downloaded pages so that they can be browsed locally without an internet connection (get rid of absolute URLs in links, etc.). The more idiot-proof the better. This seems like a pretty common and
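One common answer to this kind of request is wget's mirroring mode (my suggestion here; the excerpt is cut off before the answers name a tool). As a minimal sketch, assuming wget is installed and a static mirror is acceptable, it can be driven from Python like this:

import subprocess

# Sketch only: mirror a site for offline browsing with wget.
# --mirror            recursive download with timestamping
# --convert-links     rewrite links (including absolute URLs) so pages work locally
# --page-requisites   also fetch the CSS, images, and scripts needed to render pages
# --adjust-extension  save pages with .html extensions where appropriate
subprocess.run([
    "wget",
    "--mirror",
    "--convert-links",
    "--page-requisites",
    "--adjust-extension",
    "--no-parent",
    "https://example.com/",   # placeholder for the ASP.NET site to archive
], check=True)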

Change IP address dynamically?

随声附和 submitted on 2019-12-02 13:56:52
Consider this case: I want to crawl websites frequently, but my IP address gets blocked after some days/limit. So how can I change my IP address dynamically, or are there any other ideas? aberna: An approach using Scrapy makes use of two components, RandomProxy and RotateUserAgentMiddleware. Modify DOWNLOADER_MIDDLEWARES as follows; you will have to insert the new components in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    'tutorial.randomproxy.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
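The settings snippet above is cut off. As a rough sketch of what a complete proxy-rotation configuration in settings.py often looks like today (using the scrapy-proxies package rather than the answer's tutorial.randomproxy module; the priorities and proxy-list path are assumptions for illustration):

# Sketch of a settings.py fragment for proxy rotation with scrapy-proxies.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/path/to/proxy/list.txt'   # one proxy URL per line
PROXY_MODE = 0                           # 0 = pick a random proxy for every request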

Scrapy Tutorial Example

房东的猫 submitted on 2019-12-02 10:46:30
Question: Hoping someone can point me in the right direction regarding using Scrapy in Python. I've been trying to follow the example for several days and still can't get the expected output. I used the Scrapy tutorial, http://doc.scrapy.org/en/latest/intro/tutorial.html#defining-our-item, and even downloaded an exact project from the GitHub repo, but the output I get is not what is described in the tutorial.

from scrapy.spiders import Spider
from scrapy.selector import Selector
from dirbot
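For reference, a minimal spider in the style of that tutorial looks roughly like this. The dmoz URLs and item fields mirror the old dirbot example and are assumptions about what the asker was running (dmoz.org itself no longer exists):

from scrapy.spiders import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    # Rough sketch of the old dirbot/tutorial spider; names and URLs are illustrative.
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath("//ul/li"):
            # Yield plain dicts; the tutorial defines an Item class instead.
            yield {
                "title": site.xpath("a/text()").extract_first(),
                "link": site.xpath("a/@href").extract_first(),
                "desc": site.xpath("text()").extract_first(),
            }

Run it from inside a Scrapy project with something like: scrapy crawl dmoz -o items.json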

Why can't I fetch www.google.com with Perl's LWP::Simple?

a 夏天 submitted on 2019-12-02 10:22:29
I can't seem to get this piece of code to work:

$self->{_current_page} = $href;
my $response = $ua->get($href);
my $responseCode = $response->code;
if( $responseCode ne "404" ) {
    my $content = LWP::Simple->get($href);
    die "get failed: " . $href if (!defined $content);
}

It returns the error: get failed: http://www.google.com

The full code is as follows:

#!/usr/bin/perl
use strict;
use URI;
use URI::http;
use File::Basename;
use DBI;
use LWP::Simple;
require LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
$ua->max_redirect(0);

package Crawler;

sub new {
    my $class =

Authorization issue with cron crawler inserting data into Google spreadsheet using Google API in Ruby

我是研究僧i submitted on 2019-12-02 10:03:11
My project is to crawl certain web data and put it into my Google spreadsheet every morning at 9:00. It has to get authorization to read and write, which is why the code below sits at the top.

# Google API
CLIENT_ID = blah blah
CLIENT_SECRET = blah blah
OAUTH_SCOPE = blah blah
REDIRECT_URI = blah blah

# Authorization_code
def get_authorization_code
  client = Google::APIClient.new
  client.authorization.client_id = CLIENT_ID
  client.authorization.client_secret = CLIENT_SECRET
  client.authorization.scope = OAUTH_SCOPE
  client.authorization.redirect_uri = REDIRECT_URI
  uri =

How to find URLs in HTML using Java

≯℡__Kan透↙ submitted on 2019-12-02 09:52:40
Question: I have the following... I wouldn't say problem, but situation. I have some HTML with tags and everything, and I want to search the HTML for every URL. I'm doing it now by checking where it says 'h' then 't' then 't' then 'p', but I don't think that is a great solution. Any good ideas? Added: I'm looking for some kind of pseudocode but, just in case, I'm using Java for this project in particular. Answer 1: Try using an HTML parsing library, then search for <a> tags in the HTML document. Document doc = Jsoup
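Since the asker explicitly asked for pseudocode, here is the same parser-based idea sketched in Python with BeautifulSoup. The answer itself uses Jsoup in Java; this is only an illustration of the approach, not the answer's code:

from bs4 import BeautifulSoup

html = "<p>See <a href='http://example.com'>this</a> and <a href='https://example.org/x'>that</a>.</p>"

# Parse the HTML, then collect the href attribute of every <a> tag,
# instead of scanning the raw text for the characters 'h', 't', 't', 'p'.
soup = BeautifulSoup(html, "html.parser")
urls = [a["href"] for a in soup.find_all("a", href=True)]
print(urls)   # ['http://example.com', 'https://example.org/x']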

How to 'Grab' content from another website

…衆ロ難τιáo~ submitted on 2019-12-02 09:50:13
A friend asked me this, and I couldn't answer. He asked: I am making this site where you can archive your site... It works like this: you enter your site, like something.com, and then our site grabs the content on that website, like images and all that, and uploads it to our site. Then people can view an exact copy of the site at oursite.com/something.com even if the server hosting something.com is down. How could he do this? (PHP?) And what would the requirements be? It sounds like you need to create a web crawler. Web crawlers can be written in any language, although I would
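As a rough sketch of what such a crawler's core loop looks like (my own illustration in Python; the answer is cut off before recommending a language), the idea is: fetch a page, save it, extract its links, and queue anything on the same domain:

import os
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def mirror(start_url, out_dir="mirror", max_pages=50):
    """Very small breadth-first crawler: saves pages from one domain to disk."""
    domain = urllib.parse.urlparse(start_url).netloc
    queue, seen = deque([start_url]), {start_url}

    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        # Save the raw page under a file name derived from its path.
        path = urllib.parse.urlparse(url).path.strip("/") or "index"
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, path.replace("/", "_") + ".html"), "wb") as f:
            f.write(resp.content)

        # Queue same-domain links we have not seen yet.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urllib.parse.urljoin(url, a["href"])
            if urllib.parse.urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

mirror("https://example.com/")

A real archiving service would also need to download images, CSS, and scripts, rewrite links so the saved copy browses correctly offline, respect robots.txt, and rate-limit itself; this sketch only shows the crawl loop.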