web-crawler

How can I handle JavaScript in a Perl web crawler?

Submitted by 风流意气都作罢 on 2019-11-26 19:01:25
I would like to crawl a website. The problem is that it is full of JavaScript features, such as buttons that, when pressed, do not change the URL but do change the data on the page. I usually use LWP / Mechanize etc. to crawl sites, but neither supports JavaScript. Any ideas?

One option is Selenium, via the WWW::Selenium module. The WWW::Scripter module also has a JavaScript plugin that may be useful; I can't say I've used it myself, however. Finally, WWW::Mechanize::Firefox might be of use: that way you can have Firefox handle the complex JavaScript issues and then extract the rendered content.
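For reference, the answers above all describe the same approach: let a real browser execute the JavaScript, then read the rendered DOM. A minimal sketch of that idea, shown here in Python with Selenium since the mechanism is identical to what WWW::Selenium wraps (the URL and element id are placeholders):

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()              # a real browser executes the JS
    try:
        driver.get("http://example.com")      # placeholder URL
        # Click a button that changes the page data without changing the URL.
        driver.find_element(By.ID, "load-more").click()   # placeholder id
        time.sleep(2)   # crude wait for the AJAX update; WebDriverWait is the robust option
        html = driver.page_source             # the DOM after JavaScript has run
        print(html)
    finally:
        driver.quit()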

Do Google's crawlers interpret JavaScript? What if I load a page through AJAX? [closed]

Submitted by 不羁岁月 on 2019-11-26 18:56:47
When a user enters my page, I make another AJAX call to load data inside a div; that's just how my application works. The problem is that when I view the source of this page, it does not contain the output of that AJAX call. Of course, when I do wget URL, it also does not show the AJAX-loaded HTML, which makes sense. But what about Google? Will Google be able to crawl the content as if it were a browser? How do I allow Google to crawl my page just like a user would see it?

jldupont answered: Updated: from the answer to this question about "Ajax generated content, crawling and black listing", I found this document.
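The usual remedy is to make sure a crawler that does not execute JavaScript can still get the content, for example by rendering the same data into the initial HTML on the server and letting the AJAX call merely refresh it. A minimal sketch of that idea (Flask and all names here are my own assumptions, not part of the question):

    from flask import Flask, render_template_string

    app = Flask(__name__)

    PAGE = """
    <div id="content">{{ data }}</div>  <!-- crawlers see this without running JS -->
    <script>
      // Users still get the dynamic behaviour via AJAX.
      fetch('/api/data').then(r => r.text()).then(t => {
        document.getElementById('content').textContent = t;
      });
    </script>
    """

    def load_data():
        return "The same data the AJAX call would load."

    @app.route("/")
    def index():
        # Render the data into the initial HTML instead of leaving the div empty.
        return render_template_string(PAGE, data=load_data())

    @app.route("/api/data")
    def api_data():
        return load_data()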

How to get a web page's source code from Java [duplicate]

Submitted by 旧时模样 on 2019-11-26 18:14:27
Question: This question already has answers here: How do you Programmatically Download a Webpage in Java (10 answers). Closed 4 years ago. I just want to retrieve any web page's source code from Java. I have found lots of solutions so far, but I couldn't find any code that works for all of the links below:

http://www.cumhuriyet.com.tr?hn=298710
http://www.fotomac.com.tr/Yazarlar/Olcay%20%C3%87ak%C4%B1r/2011/11/23/hesap-makinesi
http://www.sabah.com.tr/Gundem/2011/12/23/basbakan-konferansta-konusuyor#

The main problem is that no single approach works for every one of these links.
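The question asks for Java, but the reasons a single fetch routine fails on different sites are language-independent: redirect chains, pages that declare non-UTF-8 charsets, and servers that reject default user agents. A hedged Python sketch of a fetch that accounts for all three:

    import requests

    def fetch_source(url):
        resp = requests.get(
            url,
            headers={"User-Agent": "Mozilla/5.0"},  # some servers reject bot agents
            allow_redirects=True,                   # follow 301/302 chains
            timeout=30,
        )
        # If no charset was declared, fall back to a content-based guess.
        if not resp.encoding or resp.encoding.lower() == "iso-8859-1":
            resp.encoding = resp.apparent_encoding
        return resp.text

    print(fetch_source("http://www.cumhuriyet.com.tr?hn=298710")[:200])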

python: [Errno 10054] An existing connection was forcibly closed by the remote host

Submitted by 浪子不回头ぞ on 2019-11-26 18:07:24
Question: I am writing Python code to crawl Twitter using Twitter-py. I have set the crawler to sleep for a while (2 seconds) between each request to api.twitter.com. However, after running for some time (around 1), when Twitter's rate limit has not yet been exceeded, I get this error: [Errno 10054] An existing connection was forcibly closed by the remote host. What are the possible causes of this problem, and how can I solve it? I have searched and found that the Twitter server itself may forcibly close the connection.
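Whatever the server's exact reason for dropping the connection (idle timeouts and transient network faults are common), the practical fix is to catch the error and retry with a growing delay. A minimal sketch using the requests library rather than Twitter-py's internals (the URL is a placeholder):

    import time
    import requests

    def get_with_retries(url, max_retries=5):
        # Retry on dropped connections with exponential backoff.
        for attempt in range(max_retries):
            try:
                return requests.get(url, timeout=30)
            except requests.exceptions.ConnectionError:
                # Covers 'connection forcibly closed by the remote host'.
                time.sleep(2 ** attempt)
        raise RuntimeError("giving up after %d retries" % max_retries)

    resp = get_with_retries("https://api.twitter.com/...")  # placeholder URL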

Writing items to a MySQL database in Scrapy

Submitted by 你。 on 2019-11-26 17:58:28
Question: I am new to Scrapy. I have this spider code:

    # Imports implied by the snippet (Scrapy 0.x-era API):
    from urlparse import urljoin
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    class Example_spider(BaseSpider):
        name = "example"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                # i[1:] strips the leading slash before joining with the base URL
                yield Request(urljoin("http://www.example.com/", i[1:]),
                              callback=self.parse_url)

        def parse_url(self, response):
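For the actual question (writing items to MySQL), a minimal sketch of an item pipeline, assuming the MySQLdb driver, a local database, and item fields named 'title' and 'url' (adjust all of these to your real schema):

    import MySQLdb

    class MySQLStorePipeline(object):

        def open_spider(self, spider):
            self.conn = MySQLdb.connect(host='localhost', user='user',
                                        passwd='password', db='scrapydb',
                                        charset='utf8')
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Parameterized query: never interpolate scraped data directly.
            self.cursor.execute(
                "INSERT INTO books (title, url) VALUES (%s, %s)",
                (item.get('title'), item.get('url')))
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.conn.close()

Remember to enable the pipeline in settings.py (ITEM_PIPELINES) so Scrapy actually calls it.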

How can I use different pipelines for different spiders in a single Scrapy project

Submitted by 孤人 on 2019-11-26 17:55:30
Question: I have a Scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable to every spider. Thanks.

Answer 1: Building on the solution from Pablo Hoffman, you can use the following decorator on the process_item method of a pipeline object so that it checks the pipeline attribute of your spider to decide whether or not it should be executed. For example:

    def check_spider_pipeline(process_item_method):
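A sketch of how that decorator can be completed, assuming each spider declares a `pipeline` attribute listing the pipeline classes that apply to it (the names follow the excerpt above; the body is my reconstruction of the approach, not the verbatim answer):

    import functools

    def check_spider_pipeline(process_item_method):
        # Run the pipeline step only if the spider opted in to this pipeline.
        @functools.wraps(process_item_method)
        def wrapper(self, item, spider):
            if self.__class__ in getattr(spider, 'pipeline', []):
                return process_item_method(self, item, spider)
            return item  # pass the item through untouched
        return wrapper

    class MyPipeline(object):
        @check_spider_pipeline
        def process_item(self, item, spider):
            # ... do the work ...
            return item

A spider then opts in with: pipeline = [MyPipeline]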

Parse HTML content in VBA

Submitted by 只愿长相守 on 2019-11-26 17:48:26
I have a question relating to HTML parsing. I have a website with some products, and I would like to capture the text from each page into my current spreadsheet. This spreadsheet is quite big, but it contains the ItemNbr in the 3rd column; I expect the text in the 14th column, and one row corresponds to one product (item). My idea is to fetch the 'Material' on the webpage, which is inside the InnerText after the tag. The id number changes from page to page (sometimes). Here is the structure of the website:

    <div style="position:relative;">
      <div></div>
      <table id="list-table" width="100%" tabindex="1" cellspacing="0"
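The question asks for VBA, but the extraction logic itself may be easier to see in a compact sketch first. Here it is in Python with BeautifulSoup; the table id comes from the fragment above, while the URL and the 'label cell followed by value cell' layout are assumptions to verify against the real page:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/product?item=12345").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    table = soup.find("table", id="list-table")
    if table is not None:
        for cell in table.find_all("td"):
            if "Material" in cell.get_text():
                # Assumed layout: the value sits in the cell after the label.
                value = cell.find_next("td")
                print(value.get_text(strip=True) if value else "not found")
                break

In VBA the same walk can be done over an MSHTML.HTMLDocument, matching on innerText and writing the result into column 14 of the row whose column 3 holds the ItemNbr.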

How can I scrape pages with dynamic content using node.js?

Submitted by ﹥>﹥吖頭↗ on 2019-11-26 17:32:24
I am trying to scrape a website, but I don't get some of the elements, because those elements are created dynamically. I use cheerio in node.js, and my code is below:

    var request = require('request');
    var cheerio = require('cheerio');

    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    request(url, function (err, res, html) {
        var $ = cheerio.load(html);
        $('.listMain > li').each(function () {
            console.log($(this).find('a').attr('href'));
        });
    });

This code returns an empty result, because when the page is first loaded, the <ul id="store_list" class="listMain"> is empty; the content has not been loaded yet. It is filled in afterwards by JavaScript, which cheerio does not execute.
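Since cheerio only parses the static HTML and never runs the page's scripts, a common workaround is to open the browser's network tab, find the XHR request that fills #store_list, and call that endpoint directly. A hedged sketch of the idea in Python (the endpoint and parameters below are placeholders, not the site's real API):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "http://www.bdtong.co.kr/ajax/store_list.php",  # placeholder XHR URL
        params={"c_category": "C02"},
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.select("#store_list li a"):
        print(a.get("href"))

The alternative is to render the page in a real or headless browser first and parse the resulting DOM, as in the Selenium sketch earlier in this digest.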

Scrapy LinkExtractor duplicating(?)

Submitted by 自闭症网瘾萝莉.ら on 2019-11-26 17:21:53
Question: I have the crawler implemented as below. It is working, and it goes through the sites regulated by the link extractor. Basically, what I am trying to do is extract information from different places on the page:

- the href and text() under the class 'news' (if it exists)
- the image URL under the class 'think block' (if it exists)

I have three problems with my Scrapy crawler:

1) Duplicating link extractor: it seems to duplicate processed pages. (I checked against the export file and found that the same pages appear more than once.)
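On the duplication point, Scrapy's scheduler already de-duplicates request URLs (unless dont_filter=True is set), and LinkExtractor de-duplicates links within a page. A minimal hedged sketch of a rule set that leans on both; the allow pattern, domain, and selectors are illustrative assumptions:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NewsSpider(CrawlSpider):
        name = "news"
        allowed_domains = ["example.com"]          # assumption
        start_urls = ["http://example.com/"]

        rules = (
            # unique=True (the default, shown for clarity) drops repeated
            # links within a page; the scheduler's dupefilter then skips
            # URLs it has already scheduled.
            Rule(LinkExtractor(allow=(r"/news/",), unique=True),
                 callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            for sel in response.css("a.news"):
                yield {
                    "href": sel.css("::attr(href)").get(),
                    "text": sel.css("::text").get(),
                }

If the export still shows repeats, check whether several rules match the same URL or whether the same item is yielded from more than one callback.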

Pulling data from a webpage, parsing it for specific pieces, and displaying it

Submitted by 我只是一个虾纸丫 on 2019-11-26 17:21:14
I've been using this site for a long time to find answers to my questions, but I wasn't able to find the answer to this one. I am working with a small group on a class project. We're to build a small "game trading" website that allows people to register, put in a game they have and want to trade, and accept trades from others or request a trade. We have the site functioning well ahead of schedule, so we're trying to add more to it. One thing I want to do myself is to link the games that are put in to Metacritic. Here's what I need to do: I need to (using ASP and C# in Visual Studio 2012) pull data from the relevant Metacritic page, parse it for specific pieces, and display it on our site.
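The project calls for C#/ASP.NET, but the flow is the same in any stack: request the page, parse the HTML, pull out the fields, display them. A hedged sketch of that flow in Python (the Metacritic URL pattern and CSS class are assumptions to verify, and the site's terms of use should be checked before scraping):

    import requests
    from bs4 import BeautifulSoup

    def metacritic_score(game_slug):
        # Placeholder URL pattern and selector -- verify against the real site.
        url = "http://www.metacritic.com/game/%s" % game_slug
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(resp.text, "html.parser")
        node = soup.select_one(".metascore_w")   # assumed class name
        return node.get_text(strip=True) if node else None

    print(metacritic_score("example-game"))

In C# the same steps map onto HttpClient for the request and a parser such as HtmlAgilityPack for the extraction.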