web-crawler

JSP Page Crawler that extracts all input parameters

旧巷老猫 submitted on 2019-12-08 05:08:32
Do you happen to know of an open-source Java component that can scan a set of dynamic pages (JSP) and extract all the input parameters from them? Of course, a crawler would only be able to crawl static code, not dynamic code, but my idea here is to extend it to crawl a web server including all the server-side code. Naturally, I am assuming that the tool will have full access to the crawled web server rather than relying on any hacks. The idea is to build a static analyzer that can detect all parameter fields (request.getParameter() and the like) from all dynamic
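A minimal sketch of the static-scan idea, not the open-source component being asked about: walk a JSP source tree and collect every literal parameter name passed to request.getParameter(). The directory layout and file handling below are assumptions.

```python
# Walk a JSP source tree and collect every literal parameter name passed to
# request.getParameter(...). Illustration only; dynamically built parameter
# names will not be caught by a simple regex scan like this.
import os
import re

PARAM_RE = re.compile(r'request\.getParameter\(\s*"([^"]+)"\s*\)')

def find_jsp_parameters(root_dir):
    params = {}  # parameter name -> list of files where it appears
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(".jsp"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                for match in PARAM_RE.finditer(fh.read()):
                    params.setdefault(match.group(1), []).append(path)
    return params

if __name__ == "__main__":
    for param, files in find_jsp_parameters("./webapp").items():
        print(param, "->", files)
```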

How to get HTML element coordinates using C#?

≯℡__Kan透↙ submitted on 2019-12-08 04:36:12
Question: I am planning to develop a web crawler that would extract the coordinates of HTML elements from web pages. I have found out that it is possible to get HTML element coordinates by using the "mshtml" assembly. Now I would like to know whether it is possible, and how, to fetch only the necessary information (HTML, CSS) from a web page and then, using the appropriate mshtml classes, get the correct coordinates of all HTML elements? Thank you! Answer 1: I use these C# functions to determine element positions. You need to pass
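The answer's C# code is cut off above. As a hedged illustration of the same goal in a different stack, the sketch below reads element positions and sizes with Selenium from Python rather than with mshtml; the URL and selector are placeholders.

```python
# Not the mshtml approach from the truncated answer; the same idea (element
# coordinates after layout) via a Selenium-driven browser.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()            # assumes a chromedriver on PATH
driver.get("https://example.com")      # placeholder URL

for element in driver.find_elements(By.CSS_SELECTOR, "a"):
    loc, size = element.location, element.size   # pixel position and box size
    print(element.tag_name, loc["x"], loc["y"], size["width"], size["height"])

driver.quit()
```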

How to parse dynamic JavaScript content on an HTML web page using Python?

偶尔善良 submitted on 2019-12-08 04:34:37
Question: I am building a spider and I am using Beautiful Soup to parse the content of a particular URL. Now, some sites use JavaScript to show dynamic content that is only displayed to the user once some action (clicking, or time passing) happens. Beautiful Soup just parses the static content that exists before the JavaScript has run. I want the content after the JavaScript runs. Is there any way to do this? I can think of one way: grab the URL, open a browser, load the URL and run its JavaScript tags as well. And
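A hedged sketch of that idea: let a real browser load the page and execute its JavaScript, then hand the rendered DOM to Beautiful Soup. Selenium is assumed here as the way to drive the browser, and the URL is a placeholder.

```python
# Render the page in a real browser so its JavaScript runs, then parse the
# resulting DOM with Beautiful Soup.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()           # any WebDriver-backed browser works
driver.get("https://example.com")      # scripts execute as on a normal visit
html_after_js = driver.page_source     # DOM serialized after JavaScript ran
driver.quit()

soup = BeautifulSoup(html_after_js, "html.parser")
print(soup.title.string if soup.title else "no <title> found")
```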

Data crawler or something else

旧街凉风 submitted on 2019-12-08 04:27:45
Question: I'm looking for something, and I don't know exactly how it can be done. I don't have deep knowledge of crawling, scraping and so on, but I believe this is the kind of technology I'm looking for. I have a list of around 100 websites that I'd like to monitor constantly, at least once every 3 or 4 days. On these websites I'd look for some logical matches, like: text contains 'ABC' AND doesn't contain 'BCZ', OR text contains 'XYZ' AND doesn't contain 'ATM', and so on and so forth. The tool would have
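A minimal sketch of such a monitor, assuming plain HTTP fetches are enough (no logins or JavaScript rendering); the site list and the match rule are placeholders standing in for the roughly 100 real sites and their conditions. It could be run from cron every 3 or 4 days.

```python
# Fetch each monitored site and apply the boolean text rules to its body.
import requests

SITES = ["https://example.com", "https://example.org"]   # ~100 in practice

def matches(text):
    # "contains 'ABC' AND doesn't contain 'BCZ', OR contains 'XYZ' AND
    #  doesn't contain 'ATM'"
    return ("ABC" in text and "BCZ" not in text) or \
           ("XYZ" in text and "ATM" not in text)

for url in SITES:
    body = requests.get(url, timeout=30).text
    if matches(body):
        print("match:", url)
```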

How can I get a web page into a string using JavaScript?

旧街凉风 submitted on 2019-12-08 04:24:15
Question: I need to get the HTML content of a page using JavaScript; the page could also be on another domain, rather like what wget does but in JavaScript. I want to use it for a kind of web crawler. Using JavaScript, how can I get the content of a page into a string, given its URL? Answer 1: Try this: function cbfunc(html) { alert(html.results[0]); } $.getScript('http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22' + encodeURIComponent(url) + '%22&format

Scrapy Deploy Doesn't Match Debug Result

烂漫一生 submitted on 2019-12-08 04:02:29
Question: I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic: go to the homepage, where there is a category list used to build the second wave of links. The second round of links is usually the first page of each category. Also, the different pages inside a category follow the same regular expression patterns, wholesale/something/something/request or wholesale/pagenumber . I want to follow those patterns to keep crawling and
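A hedged sketch of that rule-based crawl: the domain and the two wholesale/... URL patterns come from the question, while the spider name, callback and yielded fields are assumptions.

```python
# CrawlSpider that follows links matching the two "wholesale/..." patterns
# and parses every matching page. A sketch, not the asker's actual spider.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WholesaleSpider(CrawlSpider):
    name = "wholesale"
    allowed_domains = ["myproject.com"]
    start_urls = ["http://www.myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=(r"wholesale/[^/]+/[^/]+/request",
                                  r"wholesale/\d+$")),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}
```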

Scrapy :: Issues with JSON export

北战南征 submitted on 2019-12-08 03:45:37
Question: So, I have spent quite a bit of time going through the Scrapy documentation and tutorials, and I have since been plugging away at a very basic crawler. However, I am not able to get the output into a JSON file. I feel like I am missing something obvious, but I haven't been able to turn anything up after looking at a number of other examples and trying several different things. To be thorough, I will include all of the relevant code. What I am trying to get here is some specific items and
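For reference, a hedged sketch of the export mechanics: items only reach the JSON feed if the spider callback yields (or returns) them. The field names and URL below are made up.

```python
# Minimal spider whose yielded items end up in a JSON file.
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Items must be yielded/returned from the callback; items that are
        # built but never yielded never reach the feed exporter.
        yield {"title": response.css("title::text").get(),
               "url": response.url}

# Run with:
#   scrapy crawl basic -o items.json
```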

Scrapy CLOSESPIDER_PAGECOUNT setting doesn't work as it should

浪尽此生 submitted on 2019-12-08 03:12:24
Question: I use Scrapy 1.0.3 and can't figure out how the CLOSESPIDER extension works. For the command: scrapy crawl domain_links --set=CLOSESPIDER_PAGECOUNT=1 there is correctly one request, but for a page count of two: scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2 there is an endless stream of requests. So please explain to me how it works with a simple example. This is my spider code: class DomainLinksSpider(CrawlSpider): name = "domain_links" #allowed_domains = ["www.example.org"] start_urls = [ "www.example.org/",] rules = ( #
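For comparison, the limit can also be declared on the spider itself; a sketch follows. As I understand the CloseSpider extension, it counts downloaded responses and then asks the engine to stop, so requests that are already scheduled or in flight still complete and the crawl can overshoot the limit slightly rather than stopping at an exact count.

```python
# The same page limit set in the spider's own settings instead of on the
# command line. A sketch; names other than the setting itself are made up.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LimitedLinksSpider(CrawlSpider):
    name = "domain_links_limited"
    allowed_domains = ["www.example.org"]
    start_urls = ["http://www.example.org/"]

    # Ask the CloseSpider extension to stop the crawl after ~2 responses.
    custom_settings = {"CLOSESPIDER_PAGECOUNT": 2}

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url}
```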

Facing an issue with the Elasticsearch mapping of Nutch-crawled documents

╄→гoц情女王★ submitted on 2019-12-08 02:47:48
Question: Facing some serious issues while using Nutch and Elasticsearch for crawling. We have two data storage engines in our app: MySQL and Elasticsearch. Let's say I have 10 URLs stored in the urls table of the MySQL db. Now I want to fetch these URLs from the table at run time and write them into seed.txt for crawling. I have written all these URLs into seed.txt in one go. Now my crawl starts, and then I index these docs into Elasticsearch in an index (let's say a url index). But I want to maintain a
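A hedged sketch of the "pull the URLs from MySQL and (re)write seed.txt before each crawl" step; the connection details, table and column names are assumptions, and the Nutch crawl itself would then be run against the resulting seed directory.

```python
# Read the URLs from the MySQL table and write them into the Nutch seed file.
# Host, credentials, and the urls(url) table are placeholders.
import os
import pymysql

conn = pymysql.connect(host="localhost", user="crawler",
                       password="secret", database="crawldb")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT url FROM urls")
        rows = cur.fetchall()
finally:
    conn.close()

os.makedirs("urls", exist_ok=True)
with open("urls/seed.txt", "w", encoding="utf-8") as seed:
    for (url,) in rows:
        seed.write(url + "\n")

# The Nutch crawl (inject/generate/fetch/parse/index) can then be started
# against the urls/ seed directory and its output indexed into Elasticsearch.
```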

Servlet filter is not honoured for the welcome file

坚强是说给别人听的谎言 submitted on 2019-12-08 02:43:48
Question: I am using a Filter to dynamically generate content that is visible to web crawlers (https://developers.google.com/webmasters/ajax-crawling/docs/specification). This filter works fine if the incoming URL contains a path (http://www.unclestock.com/app.jsp#!s=GOOG). If the incoming URL contains just my domain (and a fragment), say http://www.unclestock.com#!s=GOOG, the welcome file (app.jsp) is returned, but the filter is not honoured. My web.xml contains the following filter map: <filter