web-crawler

How to crawl billions of pages? [closed]

北城以北 submitted on 2019-12-03 00:56:00
Question: Is it possible to crawl billions of pages on a single server? Answer 1: Not if you want the data to be up to date. Even a small player in the search game would number the pages crawled in the multiple billions. "In 2006, Google has indexed over 25 billion web pages,[32] 400 million
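A back-of-the-envelope calculation makes the scale concrete; the sustained fetch rate below is an assumed figure for a single well-tuned server, not a number from the answer:

```python
pages = 2_000_000_000      # "billions of pages" (illustrative lower bound)
pages_per_second = 500     # assumed sustained fetch + parse rate for one server
seconds = pages / pages_per_second
print(f"{seconds / 86_400:.0f} days for a single pass")  # ~46 days, before any re-crawl for freshness
```

Even under generous assumptions, one pass over the corpus takes weeks on one machine, which is why the answer ties the single-server question to freshness.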

Is Erlang the right choice for a webcrawler?

﹥>﹥吖頭↗ submitted on 2019-12-03 00:53:20
I am planning to write a web crawler for an NLP project that reads in the thread structure of a forum at a specific interval and parses each thread with new content. The author, the date, and the content of new posts are extracted via regular expressions, and the result is then stored in a database. The language and platform used for the crawler have to match the following criteria: easily scalable on multiple cores and CPUs, suited for high I/O loads, fast regular expression matching, and easy to maintain with little operational overhead. After some research I think Erlang might be a fitting candidate
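As an illustration of the extraction step described above (independent of the language finally chosen), here is a minimal Python sketch; the markup class names and field layout are hypothetical, not taken from the question:

```python
import re

# Hypothetical forum markup: one <div class="post"> per post with author,
# date, and content spans. Adjust the pattern to the real forum's HTML.
POST_PATTERN = re.compile(
    r'<div class="post">\s*'
    r'<span class="author">(?P<author>.*?)</span>\s*'
    r'<span class="date">(?P<date>.*?)</span>\s*'
    r'<div class="content">(?P<content>.*?)</div>',
    re.DOTALL,
)

def extract_posts(html: str):
    """Yield (author, date, content) tuples for every post found in a thread page."""
    for match in POST_PATTERN.finditer(html):
        yield match.group("author"), match.group("date"), match.group("content")
```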

How to prevent getting blacklisted while scraping Amazon [closed]

ε祈祈猫儿з submitted on 2019-12-03 00:46:06
I am trying to scrape Amazon with Scrapy, but I get this error: DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> (failed 1 times): 503 Service Unavailable. I think that it is because Amazon is very good at detecting bots. How can I prevent this? I used time.sleep(6) before every request. I don't want to use their API. I tried using Tor and Polipo. You have to be very careful with Amazon and follow the Amazon Terms of Use and policies related to web scraping. Amazon is quite good at
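Within those constraints, the usual first step in Scrapy is to slow down and spread out requests rather than sleep manually; a minimal settings sketch (the values are illustrative, and throttling does not guarantee Amazon will stop returning 503s):

```python
# settings.py (illustrative values for a Scrapy project)
ROBOTSTXT_OBEY = True              # respect robots.txt
DOWNLOAD_DELAY = 6                 # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # jitter the delay so requests look less mechanical
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True        # adapt the delay to observed response latencies
RETRY_HTTP_CODES = [503]           # retry the 503 responses instead of failing outright
USER_AGENT = "my-crawler (contact: me@example.com)"  # hypothetical identifying user agent
```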

How do you archive an entire website for offline viewing?

随声附和 submitted on 2019-12-03 00:20:54
Question: We have actually burned static/archived copies of our asp.net websites for customers many times. We have used WebZip until now, but we have had endless problems with crashes, downloaded pages not being re-linked correctly, etc. We basically need an application that crawls and downloads static copies of everything on our asp.net website (pages, images, documents, CSS, etc.) and then processes the downloaded pages so that they can be browsed locally without an internet connection (get rid of
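A common answer to this kind of requirement is wget's mirroring mode, which rewrites links so the copy browses correctly offline. A sketch wrapped in Python so it can be scripted, assuming wget is installed and with http://example.com/ standing in for the real site:

```python
import subprocess

# Mirror a site for offline browsing with wget (assumes wget is on PATH).
# --mirror            recursive download with timestamping
# --convert-links     rewrite links so pages work from the local copy
# --adjust-extension  save pages with matching file extensions (.html, .css)
# --page-requisites   also fetch the images, CSS, and scripts each page needs
# --no-parent         stay inside the starting directory
subprocess.run([
    "wget", "--mirror", "--convert-links", "--adjust-extension",
    "--page-requisites", "--no-parent",
    "http://example.com/",   # placeholder URL
], check=True)
```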

Difference between crawling and getting links with Html Agility Pack

ぃ、小莉子 submitted on 2019-12-02 21:33:26
Question: I am getting the links of a website using Html Agility Pack in a C# console application, by specifying the divs I want and extracting the links from those divs. My question is: is what I am doing crawling or parsing, and if it is not crawling, what is crawling? Source: https://stackoverflow.com/questions/36324098/difference-between-crawling-and-getiting-links-with-html-agility-pack
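To make the distinction concrete: parsing is extracting data (such as links) from a page you already have, while crawling is repeatedly following those extracted links to fetch further pages. A rough Python sketch of the two ideas (the question's own code uses Html Agility Pack in C#, and the regex here is a deliberately crude stand-in for a real HTML parser):

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

def parse_links(base_url: str, html: str):
    """Parsing: pull href values out of one page that has already been fetched."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(start_url: str, limit: int = 10):
    """Crawling: fetch pages and keep following the links parsed out of them."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        queue.extend(parse_links(url, html))
    return seen
```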

Crawling Google Search with PHP

99封情书 submitted on 2019-12-02 20:59:45
I am trying to get my head around how to fetch Google search results with PHP or JavaScript. I know it has been possible before, but now I can't find a way. I am trying to duplicate (somewhat) the functionality of http://www.getupdated.se/sokmotoroptimering/seo-verktyg/kolla-ranking/ but really the core issue I want to solve is just to get the search results via PHP or JavaScript; the rest I can figure out. Fetching the results using file_get_contents() or cURL doesn't seem to work. Example: $ch = curl_init(); $timeout = 5; curl_setopt($ch, CURLOPT_URL, 'http://www.google.se/#hl=sv&q=dogs'); curl
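One likely reason the snippet returns the homepage rather than results is that everything after the # is a URL fragment, which the client never sends to the server; the query needs to go in the query string of /search instead. A hedged Python sketch of that request shape (Google may still block automated requests or require its official API, so treat this as illustrative only):

```python
import urllib.parse
import urllib.request

# Put the query in the query string of /search, not behind a '#' fragment
# (fragments are never transmitted to the server).
params = urllib.parse.urlencode({"hl": "sv", "q": "dogs"})
url = "https://www.google.se/search?" + params

req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})  # placeholder UA
html = urllib.request.urlopen(req, timeout=5).read().decode("utf-8", errors="ignore")
print(len(html), "bytes fetched")  # parsing the result markup is left out here
```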

PyPi download counts seem unrealistic

左心房为你撑大大i submitted on 2019-12-02 19:54:28
I put a package on PyPI for the first time about 2 months ago and have made some version updates since then. I noticed the download counts this week and was surprised to see the package had been downloaded hundreds of times. Over the next few days, I was more surprised to see the count increasing by sometimes hundreds per day, even though this is a niche statistical test toolbox. In particular, older versions of the package are continuing to be downloaded, sometimes at higher rates than the newest version. What is going on here? Is there a bug in PyPI's download counting, or is there an

Is it possible to develop a powerful web search engine using Erlang, Mnesia & Yaws?

99封情书 submitted on 2019-12-02 19:47:19
I am thinking of developing a web search engine using Erlang, Mnesia & Yaws. Is it possible to build a powerful and fast web search engine with this software? What would it take to accomplish this, and where do I start? Erlang can make the most powerful web crawler today. Let me take you through my simple crawler. Step 1: I create a simple parallelism module, which I call mapreduce. -module(mapreduce). -export([compute/2]). %%===================================================================== %% usage example %% Module = string %% Function = tokens %% List_of_arg_lists = [["file
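The Erlang module in the excerpt is cut off, but the pattern it describes is a parallel map over argument lists followed by collecting the results. A rough equivalent of that pattern in Python (not the author's code, and the tokens helper is only a stand-in for Erlang's string:tokens/2):

```python
from multiprocessing import Pool

def tokens(text, separator):
    """Stand-in for Erlang's string:tokens/2: split `text` on `separator`."""
    return [piece for piece in text.split(separator) if piece]

def compute(function, list_of_arg_lists):
    """Apply `function` to each argument list in parallel and collect the results."""
    with Pool() as pool:
        return pool.starmap(function, list_of_arg_lists)

if __name__ == "__main__":
    arg_lists = [["one two three", " "], ["a,b,c", ","]]
    print(compute(tokens, arg_lists))
    # [['one', 'two', 'three'], ['a', 'b', 'c']]
```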

Authorization issue with cron crawler inserting data into Google spreadsheet using Google API in Ruby

守給你的承諾、 submitted on 2019-12-02 19:16:34
Question: My project is to crawl certain web data and put it into my Google spreadsheet every morning at 9:00, and it has to be authorized to read and write. That's why the code below is located at the top. # Google API CLIENT_ID = blah blah CLIENT_SECRET = blah blah OAUTH_SCOPE = blah blah REDIRECT_URI = blah blah # Authorization_code def get_authorization_code client = Google::APIClient.new client.authorization.client_id = CLIENT_ID client.authorization.client_secret = CLIENT
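For an unattended cron job, the usual workaround is to avoid the interactive authorization-code flow entirely and authenticate with a service account whose key file lives on the server. A hedged Python sketch using the gspread library (the question's own code is Ruby with Google::APIClient; the file name, sheet name, and row values below are placeholders):

```python
import gspread

# Authenticate as a service account: no browser round-trip, so it works under cron.
# "service-account.json" is a placeholder path to the downloaded key file.
gc = gspread.service_account(filename="service-account.json")

# The spreadsheet must be shared with the service account's email address.
sheet = gc.open("Crawl results").sheet1
sheet.append_row(["2019-12-02", "example.com", 42])  # one crawled record per row
```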

unknown command: crawl error

旧城冷巷雨未停 submitted on 2019-12-02 19:03:44
I am a newbie to Python. I am running the 32-bit build of Python 2.7.3 on a 64-bit OS (I tried 64-bit but it didn't work out). I followed the tutorial and installed Scrapy on my machine. I have created one project, demoz, but when I enter scrapy crawl demoz it shows an error. I came across this when I ran the scrapy command under C:\python27\scripts; it shows: C:\Python27\Scripts>scrapy Scrapy 0.14.2 - no active project Usage: scrapy <command> [options] [args] Available commands: fetch Fetch a URL using the Scrapy downloader runspider Run a self-contained spider (without creating a project)
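The key hint is "no active project": Scrapy only offers the crawl command when it is run from inside a project directory, i.e. the folder containing scrapy.cfg, so the usual fix is to change into the generated demoz project before running it. Roughly (the path is a placeholder):

```text
cd C:\path\to\demoz
scrapy crawl demoz
```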