web-crawler

How to fix "HTTP error fetching URL. Status=500" in Java while crawling?

那年仲夏 submitted on 2019-12-05 03:33:19
I am trying to crawl users' ratings of cinema movies from IMDb review pages (my database holds about 600,000 movies). I used jsoup to parse the pages as below (sorry, I didn't include the whole program here since it is too long):

```java
try {
    // connecting to the MySQL db
    ResultSet res = st.executeQuery("SELECT id, title, production_year "
            + "FROM title "
            + "WHERE kind_id = 1 "
            + "LIMIT 0, 100000");
    while (res.next()) {
        // .......
        // .......
        String baseUrl = "http://www.imdb.com/search/title?release_date="
                + year + "," + year
                + "&title=" + movieName
                + "&title_type=feature,short,documentary,unknown";
```
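The error in the title is what jsoup's Connection throws (an HttpStatusException) whenever the server answers with an error status and ignoreHttpErrors is left at its default of false, so a single bad response kills the whole crawl. One common culprit when building search URLs by plain concatenation is an unencoded movie title (spaces, apostrophes, non-ASCII characters). A minimal, hedged sketch that encodes the title and skips failing pages instead of aborting — movieName and year come from the snippet above, everything else is an assumption:

```java
import java.net.URLEncoder;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ImdbFetch {

    // movieName and year come from the result set in the snippet above.
    static Document fetchSearchPage(String movieName, int year) throws Exception {
        String baseUrl = "http://www.imdb.com/search/title?release_date=" + year + "," + year
                + "&title=" + URLEncoder.encode(movieName, "UTF-8")   // encode spaces, quotes, non-ASCII
                + "&title_type=feature,short,documentary,unknown";

        Connection.Response response = Jsoup.connect(baseUrl)
                .userAgent("Mozilla/5.0")   // some sites reject the default Java user agent
                .timeout(30_000)
                .ignoreHttpErrors(true)     // don't throw on 4xx/5xx; inspect the status instead
                .execute();

        if (response.statusCode() != 200) {
            // Log and skip (or retry later) rather than let one bad page abort a 600,000-movie crawl.
            System.err.println("Skipping " + baseUrl + " (HTTP " + response.statusCode() + ")");
            return null;
        }
        return response.parse();
    }
}
```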

Scrapy randomly crashing with Celery in Django

不羁的心 submitted on 2019-12-05 02:04:04
Question: I am running my Scrapy project within Django on an Ubuntu server. The problem is that Scrapy randomly crashes, even when only one spider is running. Below is a snippet of the traceback. Not being an expert, I have googled "_SIGCHLDWaker Scrapy" but could not make sense of the solutions I found for the snippet below:

```
--- <exception caught here> ---
  File "/home/b2b/virtualenvs/venv/local/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 602, in _doReadOrWrite
    why = selectable.doWrite()
```
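The traceback ends inside Twisted's reactor internals, which often points at the reactor (and its signal handling) living inside a long-running worker rather than at a bug in the spider itself. A hedged sketch of one common workaround: start each crawl in its own child process from the Celery task, so the Twisted reactor gets a fresh process every time and a crash stays contained. MySpider and its import path are hypothetical:

```python
from multiprocessing import Process

from celery import shared_task
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myapp.spiders.my_spider import MySpider  # hypothetical spider import


def _run_spider():
    # CrawlerProcess starts its own Twisted reactor; keeping it inside a
    # short-lived child process means the reactor never has to be restarted
    # in the worker process itself.
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes


@shared_task
def run_crawl():
    p = Process(target=_run_spider)
    p.start()
    p.join()
```

One caveat: Celery's prefork workers are daemonic and may refuse to spawn plain multiprocessing children, in which case billiard.Process (Celery's own fork of multiprocessing) is the usual substitute.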

Creating a loop to parse table data in Scrapy/Python

若如初见. submitted on 2019-12-05 01:33:40
Question: I have a Python script using Scrapy that scrapes data from a website, allocates it to three fields, and then generates a .csv. It works OK, but with one major problem: every field contains all of the data, rather than the data being separated out per table row. I'm sure this is because my loop isn't working; when it finds the XPath it just grabs all the data for every row before moving on to get the data for the other two fields, instead of creating separate rows.

```python
def parse(self, response):
    hxs =
```
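The usual cause of every field collecting the whole page is an absolute XPath (one starting with //) inside the loop, which searches the entire document instead of the current row. A hedged sketch using recent Scrapy selectors — the table id, field names, and column positions are made up, and the original code uses the older HtmlXPathSelector, but the idea is the same:

```python
import scrapy


class TableSpider(scrapy.Spider):
    name = "table"
    start_urls = ["http://example.com/table-page"]  # hypothetical URL

    def parse(self, response):
        # Iterate over each table row, then use RELATIVE XPaths (note the
        # leading ".") so each field is extracted from that row only,
        # not from the whole page.
        for row in response.xpath('//table[@id="data"]//tr'):  # hypothetical table id
            yield {
                "field1": row.xpath("./td[1]/text()").get(),
                "field2": row.xpath("./td[2]/text()").get(),
                "field3": row.xpath("./td[3]/text()").get(),
            }
```

Running it with `scrapy crawl table -o output.csv` then produces one CSV row per table row.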

Ban robots from website [closed]

若如初见. submitted on 2019-12-04 23:49:25
Question: My website is often down because a spider is accessing too many resources. This is what my hosting provider told me. They told me to ban these IP addresses:

46.229.164.98
46.229.164.100
46.229.164.101

But I have no idea how to do this. I've googled a bit and have now added these lines to the .htaccess in the root:

```
# allow
```
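A hedged sketch of what those .htaccess lines typically look like with the classic Apache 2.2 access directives (on Apache 2.4 the equivalent is `Require all granted` plus `Require not ip ...` inside a `<RequireAll>` block). The IP addresses are the ones quoted by the host:

```
# allow everyone, except the spider's IP addresses
order allow,deny
allow from all
deny from 46.229.164.98
deny from 46.229.164.100
deny from 46.229.164.101
```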

Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

回眸只為那壹抹淺笑 submitted on 2019-12-04 22:42:16
Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration purposes and would be different in a real robots.txt file. I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions and I do not know which is the correct/proper method. Questions:

(1) Can each user agent have its own crawl-delay? (I assume yes.)
(2) Where do you put the crawl-delay line for each user agent, before or after the Allow/Disallow lines?
(3) Does there have to be a blank ...
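The sample file referred to above is not included in this excerpt, so here is a hedged illustration of the layout robots.txt parsers generally expect: one record per user agent (or group of user agents), each record carrying its own Crawl-delay alongside its Allow/Disallow rules, and records separated by a blank line. The agents and values below are purely illustrative, and note that Googlebot is documented as ignoring Crawl-delay (its crawl rate is set in Search Console instead):

```
User-agent: Bingbot
Crawl-delay: 5
Disallow: /private/

User-agent: YandexBot
Crawl-delay: 10
Disallow: /private/

# All other crawlers
User-agent: *
Crawl-delay: 20
Disallow: /private/
```

As far as the individual questions go: Crawl-delay is a per-record directive, its position relative to Allow/Disallow within a record does not matter to the crawlers that honour it (conventionally it goes after them), and records are conventionally separated by a blank line.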

Crawler in Groovy (JSoup vs Crawler4j)

懵懂的女人 submitted on 2019-12-04 22:33:17
Question: I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that can crawl a website, creating a list of the site's URLs along with their resource types, their content, the response times, and the number of redirects involved. I am debating between JSoup and Crawler4j. I have read about what they basically do, but I cannot clearly understand the difference between the two. Can anyone suggest which would be better for the above functionality? Or is it totally incorrect ...
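For the functionality listed above, the practical difference is that JSoup only fetches and parses one page at a time, while crawler4j also manages the crawl itself (URL frontier, politeness delays, multi-threading, robots.txt); many projects therefore use crawler4j for crawling and JSoup for fine-grained parsing. A hedged, JSoup-only sketch of the per-page bookkeeping you would otherwise write yourself — plain Java, though the same Jsoup calls can be made from Groovy:

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class PageProbe {

    // Fetch one URL with JSoup and record the metrics the question asks for:
    // status, content (resource) type, response time, and number of redirects.
    // The crawl loop / URL frontier is NOT handled here -- that is the part
    // crawler4j would manage for you.
    static Map<String, Object> probe(String url) throws Exception {
        int redirects = 0;
        String current = url;
        long start = System.currentTimeMillis();
        Connection.Response res;
        while (true) {
            res = Jsoup.connect(current)
                       .followRedirects(false)   // follow redirects manually so we can count them
                       .ignoreHttpErrors(true)
                       .ignoreContentType(true)
                       .execute();
            String location = res.header("Location");
            if (res.statusCode() >= 300 && res.statusCode() < 400 && location != null) {
                redirects++;
                current = new URL(new URL(current), location).toString(); // resolve relative redirects
            } else {
                break;
            }
        }
        long elapsedMs = System.currentTimeMillis() - start;

        Map<String, Object> info = new LinkedHashMap<>();
        info.put("url", url);
        info.put("finalUrl", current);
        info.put("status", res.statusCode());
        info.put("contentType", res.contentType());
        info.put("responseTimeMs", elapsedMs);
        info.put("redirects", redirects);
        info.put("body", res.body());
        return info;
    }
}
```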

Importing URLs for JSOUP to Scrape via Spreadsheet

浪子不回头ぞ submitted on 2019-12-04 22:03:11
I finally got IntelliJ to work. I'm using the code below, and it works perfectly. I need it to loop over and over, pulling links from a spreadsheet so it can find the price on different items again and again. I have a spreadsheet with a few sample URLs located in column C, starting at row 2. How can I have JSoup use the URLs in this spreadsheet and then output to column D?

```java
public class Scraper {
    public static void main(String[] args) throws Exception {
        final Document document = Jsoup.connect("examplesite.com").get();
        for (Element row : document.select("#price")) {
            final String price = row.select("#price")
```
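A hedged sketch of the spreadsheet round trip, assuming an .xlsx file and Apache POI for reading and writing it. The file names are hypothetical; the #price selector and the column positions follow the description above:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Scraper {
    public static void main(String[] args) throws Exception {
        Workbook workbook;
        try (FileInputStream in = new FileInputStream("urls.xlsx")) {   // hypothetical file name
            workbook = new XSSFWorkbook(in);
        }
        Sheet sheet = workbook.getSheetAt(0);

        // Column C (index 2) holds the URLs, starting at row 2 (index 1);
        // the scraped price goes into column D (index 3) of the same row.
        for (int r = 1; r <= sheet.getLastRowNum(); r++) {
            Row row = sheet.getRow(r);
            if (row == null || row.getCell(2) == null) continue;

            String url = row.getCell(2).getStringCellValue();
            Document document = Jsoup.connect(url).get();
            String price = document.select("#price").text();

            Cell out = row.createCell(3);
            out.setCellValue(price);
        }

        try (FileOutputStream outFile = new FileOutputStream("urls-with-prices.xlsx")) {
            workbook.write(outFile);
        }
        workbook.close();
    }
}
```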

Crawl links of a sitemap.xml through the wget command

回眸只為那壹抹淺笑 submitted on 2019-12-04 21:57:46
Question: I am trying to crawl all the links in a sitemap.xml in order to re-cache a website, but the recursive option of wget does not work; all I get in response is:

Remote file exists but does not contain any link -- not retrieving.

Yet the sitemap.xml is certainly full of "http://..." links. I tried almost every option of wget, but nothing worked for me:

```
wget -r --mirror http://mysite.com/sitemap.xml
```

Does anyone know how to open all the links inside a website's sitemap.xml? Thanks, Dominic

Answer 1: It seems that wget can't ...
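The excerpt cuts off mid-answer, but the usual explanation is that wget's recursive mode only follows links found in HTML (and CSS), not in an XML sitemap. A hedged workaround sketch: pull the URLs out of the <loc> tags yourself and feed them back to wget on stdin. It assumes GNU grep for the -P flag; --delete-after and --no-cache are handy when the goal is only to warm the cache:

```sh
wget -qO- http://mysite.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  | wget -i - --delete-after --no-cache
```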

Download images from Google image search (Python)

老子叫甜甜 submitted on 2019-12-04 21:43:07
I am a web-scraping beginner. I first referred to https://www.youtube.com/watch?v=ZAUNEEtzsrg to download images for a specific tag (e.g. cat), and it works! But I ran into a new problem: I can only download about 100 images. The issue seems to be "AJAX", which loads only the first page of HTML rather than everything, so it seems we must simulate scrolling down to download the next 100 images or more. My code: https://drive.google.com/file/d/0Bwjk-LKe_AohNk9CNXVQbGRxMHc/edit?usp=sharing To sum up, the problem is the following: how to download all the images in a Google image search from source code ...
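Simulating the scroll is exactly what a browser-automation tool is for. A hedged sketch using Selenium: load the image-search results in a real browser, scroll to the bottom a few times so the AJAX requests fire, then hand the fully grown page source to whatever parser the existing script already uses. The query, the number of scrolls, the sleep time, and the URL format are assumptions (Google changes its markup regularly):

```python
import time

from selenium import webdriver

driver = webdriver.Chrome()  # assumes a matching chromedriver is available
driver.get("https://www.google.com/search?q=cat&tbm=isch")  # image-search results page

# Google Images loads more thumbnails as you scroll, so scroll to the bottom
# a few times and give the AJAX requests time to finish before parsing.
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

html = driver.page_source  # now contains far more than the first ~100 results
driver.quit()

# feed `html` to the existing parsing/downloading code from the linked script
```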

How to select data from specific tags in Nutch

守給你的承諾、 submitted on 2019-12-04 19:48:23
I am a newbie with Apache Nutch and I would like to know whether it's possible to crawl only a selected area of a web page; for instance, to select a particular div and crawl the contents of that div only. Any help would be appreciated. Thanks!

Answer 1: You will have to write a plugin that extends HtmlParseFilter to achieve your goal. I reckon you will be doing some of the work yourself, such as parsing the specific section of the HTML, extracting the URLs that you want, and adding them as outlinks. HtmlParseFilter implementation (the code below gives the general idea):

```java
ParseResult filter(Content content, ParseResult parseResult,
```
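A hedged sketch of what such a plugin might look like, assuming the Nutch 1.x HtmlParseFilter interface (which is where the truncated signature above comes from). The div id is hypothetical, and a real plugin would still have to be registered in its plugin.xml descriptor and feed the extracted text and outlinks back into the ParseResult:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DivSectionParseFilter implements HtmlParseFilter {

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        // Walk the DOM that Nutch has already built and keep only the div we care about.
        StringBuilder text = new StringBuilder();
        collectDivText(doc, "main-content", text);   // "main-content" is a hypothetical id
        // A real plugin would now replace or augment the parse text/metadata for
        // content.getUrl() in parseResult with this extract (and add any outlinks).
        return parseResult;
    }

    private void collectDivText(Node node, String targetId, StringBuilder out) {
        if (node instanceof Element) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName()) && targetId.equals(el.getAttribute("id"))) {
                out.append(el.getTextContent());
                return;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectDivText(children.item(i), targetId, out);
        }
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}
```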