web-crawler

Find text inside javascript tag using PHP Simple HTML DOM Parser

China☆狼群 submitted on 2019-11-29 13:09:30
I'm trying to extract a piece of text that changes regularly inside a javascript tag:

```html
<script type="text/javascript">
jwplayer("mediaplayer").setup({
    flashplayer: "player.swf",
    file: "filename",
    provider: "rtmp",
    streamer: "rtmp://192.168.1.1/file?wmsAuthSign=RANDOM-114-Character==",
    height: 500,
    width: 500,
});
</script>
```

How can I get RANDOM-114-Character (or the full value of the 'streamer' flashvar) using PHP Simple HTML DOM Parser? I just have no idea how to do this.

Answer: You can do it with a regular expression:

```php
preg_match($pattern, $java_script, $matches);
```

The pattern depends on whether the variable 'wmsAuthSign' is unique. For example:
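The example is cut off at this point. A minimal sketch of how the two pieces could fit together, assuming the script tag is pulled out with Simple HTML DOM and wmsAuthSign appears only once on the page (the URL and the pattern are illustrative, not from the original answer):

```php
<?php
include 'simple_html_dom.php'; // PHP Simple HTML DOM Parser

// Hypothetical page URL, for illustration only.
$html = file_get_html('http://example.com/player-page.html');

foreach ($html->find('script') as $script) {
    $java_script = $script->innertext;
    // Capture everything between wmsAuthSign= and the closing quote.
    if (preg_match('/wmsAuthSign=([^"]+)"/', $java_script, $matches)) {
        echo $matches[1]; // the RANDOM-114-Character token
        break;
    }
}
```

To grab the full streamer value instead, widen the pattern to something like `/streamer:\s*"([^"]+)"/`.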

Scrapy import module items error

余生长醉 submitted on 2019-11-29 12:38:04
My project structure:

```
kmss/
├── kmss
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── first.py
├── README.rst
├── scrapy.cfg
└── setup.py
```

I am running it on a Mac, and the project folder is created at /user/username/kmss. Within items.py I have a class named KmssItem. When I run first.py (my spider), I have to import from items.py, which sits one level higher. I am having a problem with the following line:

```python
from kmss.items import KmssItem
```

Within items.py, the code is:

```python
from scrapy import Item, Field

class KmssItem(Item):
    ...  # the question is cut off here; the item's fields are not shown
```
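The question never shows how the spider is launched, but the usual cause of this ImportError is running first.py directly with the Python interpreter instead of through Scrapy. A minimal sketch of the conventional setup, assuming the spider is named first (the start URL is a placeholder):

```python
# kmss/spiders/first.py -- a sketch, not the asker's actual spider
import scrapy
from kmss.items import KmssItem  # resolves when launched via the Scrapy CLI

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ["http://example.com"]  # placeholder

    def parse(self, response):
        yield KmssItem()
```

Started from /user/username/kmss (the directory holding scrapy.cfg) with `scrapy crawl first`, Scrapy puts the inner kmss package on the import path, so `from kmss.items import KmssItem` resolves; invoking `python first.py` from inside spiders/ does not.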

Recommendations for a spidering tool to use with Lucene or Solr? [closed]

坚强是说给别人听的谎言 submitted on 2019-11-29 07:38:44
Question (closed 6 years ago as likely to solicit debate rather than answers supported by facts): What is a good crawler (spider) to use against HTML and XML documents (local or web-based) that works well in the Lucene / Solr

Save all image files from a website

删除回忆录丶 submitted on 2019-11-29 04:37:55
I'm creating a small app for myself where I run a Ruby script and save all of the images from my blog. I can identify the image files, but I can't figure out how to save them once I have. Any help would be much appreciated.

```ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = '[my blog url]'
doc = Nokogiri::HTML(open(url))

doc.css("img").each do |item|
  # something
end
```

Answer (Phrogz):

```ruby
URL = '[my blog url]'

require 'nokogiri'  # gem install nokogiri
require 'open-uri'  # already part of your ruby install

Nokogiri::HTML(open(URL)).xpath("//img/@src").each do |src|
  uri = URI.join( URL, src ).to_s # make the URL absolute
  # save to the current directory under the image's own filename
  File.open(File.basename(uri), 'wb'){ |f| f.write(open(uri).read) }
end
```

how to identify web crawlers of google/yahoo/msn by PHP?

人走茶凉 submitted on 2019-11-29 04:27:05
Question: AFAIK, $_SERVER['REMOTE_HOST'] should end up as "google.com" or "yahoo.com". But is that the most reliable method? Is there any other way?

Answer 1: You identify search engines by user agent and IP address. More info can be found in "How to identify search engine spiders and webbots". It's also worth noting this list. You shouldn't treat user agents (or even remote hosts) as necessarily definitive, however. A user agent is really nothing more than what the other end tells you it is, and it is of course trivially easy to fake.
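A sketch of the verification approach Google itself documents for Googlebot, a reverse DNS lookup confirmed by a forward lookup (the function name is mine, not from the answer):

```php
<?php
// Returns true if the request plausibly comes from Googlebot:
// 1) the user agent claims to be Googlebot,
// 2) reverse DNS of the IP resolves to googlebot.com or google.com,
// 3) a forward lookup of that hostname returns the same IP.
function is_googlebot($ip, $user_agent) {
    if (stripos($user_agent, 'Googlebot') === false) {
        return false;
    }
    $host = gethostbyaddr($ip);
    if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip;
}

var_dump(is_googlebot($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT']));
```

Yahoo and Bing publish analogous reverse-DNS suffixes (crawl.yahoo.net, search.msn.com), so the same pattern extends to them.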

Crawl a website, get the links, crawl the links with PHP and XPATH

一曲冷凌霜 submitted on 2019-11-29 04:25:42
I want to crawl an entire website. I have read several threads, but I cannot manage to get data at the second level. That is, I can return the links from a starting page, but I cannot find a way to parse those links and get the content of each one. The code I use is:

```php
<?php
// SELECT STARTING PAGE
$url = 'http://mydomain.com/';
$html = file_get_contents($url);

// GET ALL THE LINKS OF EACH PAGE
// create a dom object
$dom = new DOMDocument();
@$dom->loadHTML($html);

// run xpath for the dom
$xPath = new DOMXPath($dom);

// get links from starting page
$elements = $xPath->query("//a/@href");
```
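The question breaks off here. A minimal sketch of the missing second level, assuming the extracted hrefs are absolute URLs (relative ones would first need to be resolved against $url):

```php
<?php
// Continuing from the code above: visit each extracted link and parse
// its content with the same DOMDocument/DOMXPath approach.
foreach ($elements as $element) {
    $link = $element->nodeValue;   // the href attribute's value

    $subHtml = @file_get_contents($link);
    if ($subHtml === false) {
        continue;                  // skip links that fail to load
    }

    $subDom = new DOMDocument();
    @$subDom->loadHTML($subHtml);
    $subXPath = new DOMXPath($subDom);

    // Example: print each linked page's <title>.
    $titles = $subXPath->query('//title');
    if ($titles->length > 0) {
        echo $link . ' => ' . trim($titles->item(0)->nodeValue) . "\n";
    }
}
```

For a whole site rather than two levels, the same idea goes into a loop with a queue of pending URLs and a set of already-visited ones, so no page is fetched twice.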

Lucene crawler (it needs to build lucene index)

眉间皱痕 submitted on 2019-11-29 04:18:06
Question: I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, which is the reason Nutch is eliminated, for example... Does anybody know whether such a web crawler exists, and if the answer is yes, where I can find it? Tnx...

Answer 1: What you're asking for is two components: a web crawler, and a Lucene-based automated indexer. First, a word of encouragement: been there, done that. I'll tackle both of
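The answer is cut off here. As a minimal sketch of the second component, here is a Lucene indexer that a crawler could feed fetched pages into (the API is version-dependent; this is written against the Lucene 8.x style, and the class and field names are illustrative):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PageIndexer implements AutoCloseable {
    private final IndexWriter writer;

    public PageIndexer(String indexDir) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        writer = new IndexWriter(FSDirectory.open(Paths.get(indexDir)), config);
    }

    // Called by the crawler for every fetched page.
    public void index(String url, String body) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES)); // exact-match key
        doc.add(new TextField("body", body, Field.Store.NO));  // analyzed text
        writer.addDocument(doc);
    }

    @Override
    public void close() throws Exception {
        writer.close(); // commits and releases the index lock
    }
}
```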

How to programmatically fill input elements built with React?

六月ゝ 毕业季﹏ submitted on 2019-11-29 03:42:06
I'm tasked with crawling a website built with React. I'm trying to fill in input fields and submit the form using javascript injected into the page (via either Selenium or a WebView on mobile). This works like a charm on every other site and technology, but React seems to be a real pain. Here is a sample:

```javascript
var email = document.getElementById('email');
email.value = 'example@mail.com';
```

The value changes on the DOM input element, but React does not trigger the change event. I've been trying a plethora of different ways to get React to update the state, for example:

```javascript
var event = new Event('change', { bubbles: true });
email.dispatchEvent(event);
```
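A commonly cited workaround (an addition here, not part of the original question) is to set the value through the native property setter, so that React's internal value tracking doesn't swallow the change, and then dispatch an input event:

```javascript
// Bypass React's value tracking by calling the native HTMLInputElement
// value setter, then notify React with an 'input' event.
var email = document.getElementById('email');

var nativeSetter = Object.getOwnPropertyDescriptor(
    window.HTMLInputElement.prototype, 'value'
).set;
nativeSetter.call(email, 'example@mail.com');

// React 16+ listens for 'input' rather than 'change' on text fields.
email.dispatchEvent(new Event('input', { bubbles: true }));
```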

Nutch No agents listed in 'http.agent.name'

≯℡__Kan透↙ submitted on 2019-11-29 02:54:15
Question:

```
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:135)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect
```
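The exception itself names the fix: Nutch refuses to fetch until the crawler identifies itself. The usual remedy is to set http.agent.name in conf/nutch-site.xml (the agent name below is a placeholder; pick something identifying your crawler):

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
</configuration>
```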

crawl dynamic web page using htmlunit

∥☆過路亽.° submitted on 2019-11-29 02:14:22
I am crawling data with HtmlUnit from a dynamic webpage that uses infinite scrolling to fetch data, just like Facebook's news feed. I used the following lines to simulate the scroll-down event:

```java
webclient.setJavaScriptEnabled(true);
webclient.setAjaxController(new NicelyResynchronizingAjaxController());
ScriptResult sr = myHtmlPage.executeJavaScript("window.scrollBy(0,600)");
webclient.waitForBackgroundJavaScript(10000);
myHtmlPage = (HtmlPage) sr.getNewPage();
```

But myHtmlPage seems to stay the same as before, i.e., new data is not appended to myHtmlPage, as a result
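The question breaks off here. One direction worth sketching (an assumption, not a confirmed fix): HtmlUnit has no real viewport, so window.scrollBy on its own may never fire the scroll handler the page registered. Dispatching the scroll event explicitly and then re-reading the page from its window can make the appended content visible:

```java
// Fire the scroll event the infinite-scroll handler listens for
// (createEvent/initEvent used for compatibility with HtmlUnit's JS engine),
// then wait for the resulting AJAX request to finish.
myHtmlPage.executeJavaScript(
    "window.scrollBy(0, document.body.scrollHeight);"
    + "var ev = document.createEvent('HTMLEvents');"
    + "ev.initEvent('scroll', true, false);"
    + "window.dispatchEvent(ev);");
webclient.waitForBackgroundJavaScript(10000);

// The new items are appended to the existing DOM rather than delivered as
// a new page, so refresh the reference from the enclosing window instead
// of relying on ScriptResult.getNewPage().
myHtmlPage = (HtmlPage) myHtmlPage.getEnclosingWindow().getEnclosedPage();
```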