web-crawler

How should I scrape these images without errors?

Submitted by 假装没事ソ on 2019-12-12 03:30:00
Question: I'm trying to scrape the images (or the image links) from this forum (http://www.xossip.com/showthread.php?t=1384077). I've tried Beautiful Soup 4, and here is the code I tried:

    import requests
    from bs4 import BeautifulSoup

    def spider(max_pages):
        page = 1
        while page <= max_pages:
            url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
            sourcecode = requests.get(url)
            plaintext = sourcecode.text
            soup = BeautifulSoup(plaintext)
            for link in soup.findAll('a', {'class': 'alt1'}):
                src =
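The excerpt above cuts off mid-loop. A minimal sketch of the same idea, assuming the goal is to collect the src attribute of every image inside the posts and assuming the post bodies sit in table cells with class "alt1" (the cell class and the 'html.parser' choice are assumptions, not the original poster's code):

    import requests
    from bs4 import BeautifulSoup

    def image_links(max_pages):
        links = []
        for page in range(1, max_pages + 1):
            url = 'http://www.xossip.com/showthread.php?t=1384077&page=' + str(page)
            soup = BeautifulSoup(requests.get(url).text, 'html.parser')
            # post bodies are assumed to live in cells with class "alt1";
            # collect the src of every <img> found inside them
            for cell in soup.find_all('td', {'class': 'alt1'}):
                for img in cell.find_all('img'):
                    src = img.get('src')
                    if src:
                        links.append(src)
        return links

    if __name__ == '__main__':
        print(image_links(1))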

Generate only unfetched urls instead of scored Nutch 2.3

Submitted by 半世苍凉 on 2019-12-12 03:27:28
Question: Is there any way to generate only the unfetched URLs, instead of selecting by score, in Nutch 2.x?

Answer 1: Well, for Nutch 1.x you could use the JEXL support that has shipped since Nutch 1.12 (I think):

    $ bin/nutch generate -expr "status == db_unfetched"

With this command you ensure that only URLs with a db_unfetched status are considered when generating the segments you want to crawl. This feature is still not available on the 2.x branch, but writing a custom GeneratorJob could do the trick.

Scrapy Crawl all websites in start_url even if redirect

Submitted by 南笙酒味 on 2019-12-12 03:08:15
Question: I am trying to crawl a long list of websites. Some of the websites in the start_urls list redirect (301). I want Scrapy to crawl the redirected websites from the start_urls list as if they were also on the allowed_domains list (which they are not). For example, example.com was on my start_urls list and allowed_domains list, and example.com redirects to foo.com; I want to crawl foo.com.

    DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>

I tried dynamically adding
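The excerpt ends before the poster's attempt, but one common way around the problem is to drop allowed_domains entirely and restrict link-following to whatever host each start URL finally resolves to, so a 301 target is treated as allowed automatically. A minimal sketch of that idea (not the poster's code; the spider name and start URL are placeholders):

    from urllib.parse import urlparse

    import scrapy

    class RedirectAwareSpider(scrapy.Spider):
        name = "redirect_aware"                  # hypothetical name
        start_urls = ["http://www.example.com"]  # would come from the long site list
        # no allowed_domains here, so the offsite filter never rejects foo.com

        def parse(self, response):
            # response.url is the final URL after any 301/302 hops
            allowed_host = urlparse(response.url).netloc
            for href in response.css("a::attr(href)").getall():
                url = response.urljoin(href)
                if urlparse(url).netloc == allowed_host:
                    yield scrapy.Request(url, callback=self.parse)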

How to get the content of a page that updates frequently using WebBrowser in C#

Submitted by 痞子三分冷 on 2019-12-12 03:04:05
Question: I would like to get the latest content of a page that updates very often, for example https://www.oanda.com/currency/live-exchange-rates/ (prices update every five seconds on weekdays). I am using the following code:

    var webBrowser = new WebBrowser();
    webBrowser.ScriptErrorsSuppressed = true;
    webBrowser.AllowNavigation = true;
    webBrowser.Navigate("https://www.oanda.com/currency/live-exchange-rates/");
    while (webBrowser.ReadyState != WebBrowserReadyState.Complete)
    {
        Application.DoEvents();
    }

Query a search widget using jsoup

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-12 02:53:48
Question: I want to query the site below and write all the results to a CSV file: http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SSearchWidget

I already have a program for this (written by the previous programmer; I am trying to understand the code, as I am a beginner in jsoup and web crawling), but the site has since been updated and the query no longer works. I think I need to update the URL. Below is the URL string I am currently using:

    private final static String URL = "http://services2.hdb

How to use Scrapy to crawl all items on a website

Submitted by 风流意气都作罢 on 2019-12-12 02:34:09
Question: I want to crawl all the links on a website recursively and parse every linked page in order to extract all the detail links those pages contain. If a page link conforms to a rule, that link is an item whose detail page I want to parse. I use the code below:

    class DmovieSpider(BaseSpider):
        name = "dmovie"
        allowed_domains = ["movie.douban.com"]
        start_urls = ['http://movie.douban.com/']

        def parse(self, response):
            item = DmovieItem()
            hxl = HtmlXPathSelector(response)
            urls = hxl.select("//a/@href").extract
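The excerpt stops mid-method, and BaseSpider/HtmlXPathSelector belong to a long-deprecated Scrapy API. In current Scrapy, "follow every link but only parse pages that match a rule" is usually expressed with a CrawlSpider and LinkExtractor rules. A minimal sketch under that assumption (the /subject/\d+ pattern and the yielded fields are illustrative, not taken from the original post):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DmovieCrawlSpider(CrawlSpider):
        name = "dmovie_crawl"
        allowed_domains = ["movie.douban.com"]
        start_urls = ["http://movie.douban.com/"]

        rules = (
            # URLs that look like detail pages are parsed as items
            Rule(LinkExtractor(allow=r"/subject/\d+"), callback="parse_detail", follow=True),
            # every other internal link is simply followed
            Rule(LinkExtractor(), follow=True),
        )

        def parse_detail(self, response):
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }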

Give more memory to my jar file

Submitted by 自闭症网瘾萝莉.ら on 2019-12-12 02:26:24
Question: I have a multithreaded crawler. If I load a lot of seeds into this program, I get an error. I saw the java.lang.OutOfMemoryError and thought maybe there is not enough memory. I tried running the crawler.jar file with these arguments:

    java -Xms512m -Xmx3G -jar crawler.jar

but so far, no luck. This is the stack trace of the program:

    Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect

What does this error mean: ValueError: unknown POST form encoding type ' ' (and how to solve it?)

Submitted by 给你一囗甜甜゛ on 2019-12-12 02:16:16
Question: I'm trying to crawl a website (http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam) using mechanize, but I am getting an error I cannot understand (and therefore cannot solve). That is probably due to my poor knowledge of web development. Here's what I'm trying to do:

    import mechanize

    # this is the website I want to crawl
    LINK = "http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"

    br = mechanize.Browser()
    br.open(LINK)
    request = mechanize.Request(LINK)
    response =
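The excerpt cuts off before any form is submitted. mechanize raises "unknown POST form encoding type" when the selected form's enctype attribute is empty or unrecognized, so one common workaround (an assumption here, since the accepted answer is not part of the excerpt) is to force a standard encoding before submitting. A hedged sketch, with the form index chosen arbitrarily:

    import mechanize

    LINK = "http://www.dataescolabrasil.inep.gov.br/dataEscolaBrasil/home.seam"

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(LINK)

    br.select_form(nr=0)                                   # first form on the page (assumed)
    br.form.enctype = "application/x-www-form-urlencoded"  # replace the blank/unknown enctype
    response = br.submit()
    print(response.read()[:500])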

How does a web crawler work?

Submitted by 人走茶凉 on 2019-12-12 02:15:47
Question: Using some basic website scraping, I am trying to build a database for price comparison that will ease users' search experience. Now, I have several questions. Should I use file_get_contents() or curl to get the contents of the required web page?

    $link = "http://xyz.com";
    $res55 = curl_init($link);
    curl_setopt($res55, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($res55, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($res55);

Further, every time I crawl a web page, I fetch a lot of links

Can I allow indexing (by search engines) of restricted content without making it public?

Submitted by 允我心安 on 2019-12-12 01:58:14
Question: I have a site with some restricted content. I want my site to appear in search results, but I do not want the content to become public. Is there a way I can allow crawlers to crawl my site but prevent them from making it public? The closest solution I have found is Google First Click Free, but even that requires me to show the content the first time.

Answer 1: Why do you want to allow people to search for a page that they can't access when they click the link? It's technically possible to make