web-crawler

Scrapy import module items error

和自甴很熟 submitted on 2019-12-29 08:19:30

Question: My project structure:

kmss/
├── kmss
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── first.py
├── README.rst
├── scrapy.cfg
└── setup.py

I am running it on a Mac and the project folder is created at /user/username/kmss. Within items.py I have a class named "KmssItem". To run first.py (my spider), I have to import items.py, which sits at a higher level in the project. I am having a problem with the following line
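The excerpt breaks off before the failing import, but with the standard Scrapy layout shown above the item class is normally imported through the project package. A minimal sketch, assuming the spider is named "first" and the crawl is started with scrapy crawl from the directory containing scrapy.cfg (which puts the kmss package on the import path); the URL is a placeholder, not from the question:

# kmss/spiders/first.py
import scrapy

# Import via the project package rather than a relative path.
from kmss.items import KmssItem

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ["http://example.com"]  # placeholder, not from the question

    def parse(self, response):
        item = KmssItem()
        yield item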

Nightmare conditional wait()

六眼飞鱼酱① submitted on 2019-12-29 08:07:40

Question: I'm trying to crawl a webpage using Nightmare, but I want to wait for #someelem to be present only if it actually exists. Otherwise, I want Nightmare to move on. How can this be done using .wait()? I can't use .wait(ms). Using .wait(selector) means Nightmare will keep waiting until the element is present, but if the page never has this element, Nightmare will wait forever. The last option is to use .wait(fn), and I've tried something like this: .wait(function(cheerio) { var $ =

Crawl a website, get the links, crawl the links with PHP and XPATH

核能气质少年 submitted on 2019-12-29 06:19:11

Question: I want to crawl an entire website. I have read several threads but I cannot manage to get data at the second level. That is, I can return the links from a starting page, but then I cannot find a way to parse those links and get the content of each one... The code I use is:

<?php
// SELECT STARTING PAGE
$url = 'http://mydomain.com/';
$html = file_get_contents($url);

// GET ALL THE LINKS OF EACH PAGE
// create a dom object
$dom = new DOMDocument();
@$dom->loadHTML($html);
// run xpath for the dom
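The PHP excerpt stops before the second-level fetch, so here is a language-agnostic sketch of the same two-level idea, written in Python with requests and BeautifulSoup instead of DOMDocument/XPath: fetch the start page, collect its links, then fetch and parse each linked page in turn. The domain is the placeholder from the question.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "http://mydomain.com/"  # placeholder domain from the question
start_html = requests.get(start_url).text
links = {urljoin(start_url, a["href"])
         for a in BeautifulSoup(start_html, "html.parser").find_all("a", href=True)}

# Second level: visit every link collected from the starting page.
for link in sorted(links):
    page = BeautifulSoup(requests.get(link).text, "html.parser")
    print(link, "->", page.title.string if page.title else "(no title)")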

how to exclude all title with find?

穿精又带淫゛_ submitted on 2019-12-25 19:02:34

Question: I have a function that gets all the titles from my website, but I don't want the titles of certain products. Is this the right way? I don't want titles from products containing the words "OLP NL", "Arcserve", "LicSAPk" or "symantec".

def get_title(u):
    html = requests.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    title = str(bsObj.title).replace('<title>', '').replace('</title>', '')
    if (title.find('Arcserve') or title.find('OLP NL') or title.find('LicSAPk') or title
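The excerpt is cut off mid-condition, but the chained .find() calls are already a red flag: str.find() returns -1 when the word is absent, and -1 is truthy, so the or-chain cannot work as an exclusion test. A minimal sketch of the exclusion check using `in` and any(), assuming the goal is simply to skip titles that mention a blocked word (html.parser is used here in place of the xml parser from the excerpt):

import requests
from bs4 import BeautifulSoup

BLOCKED = ("Arcserve", "OLP NL", "LicSAPk", "symantec")

def get_title(u):
    html = requests.get(u)
    soup = BeautifulSoup(html.content, "html.parser")
    title = soup.title.get_text() if soup.title else ""
    # str.find() returns -1 (truthy) on a miss, so `in` is the safer test here.
    if any(word.lower() in title.lower() for word in BLOCKED):
        return None  # skip titles of excluded products
    return title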

Tools to convert asp.net dynamic site into static site

一个人想着一个人 submitted on 2019-12-25 18:24:12

Question: Are there any tools that will spider an asp.net website and create a static site?

Answer 1: http://www.httrack.com/ I have used it for this purpose a few times. You may need to do a little tidying up of URLs, and some CSS-linked images might not make it; it depends on how good a job you want to do. If you have Dreamweaver, you can use it to manage the links if you need to clean up the file names afterwards. Optionally, use the link checker extension for Firefox to check it all afterwards.

Answer 2: Another

Web crawling evaluation?

≯℡__Kan透↙ submitted on 2019-12-25 17:07:46

Question: In focused web crawling (a.k.a. topical web crawling) I have seen the evaluation metric "harvest ratio" defined as: after crawling t pages, harvest ratio = number_of_relevant_pages / pages_crawled(t). So, for example, if after crawling 100 pages I get 80 true positives, then the harvest ratio of the crawler at that point is 0.8. But the crawler might have skipped pages that are totally relevant to the crawling domain, and these are not accounted for in this ratio. What is
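A short sketch of the metric as defined above: the share of crawled pages that turned out to be relevant after t pages have been fetched.

def harvest_ratio(relevant_pages: int, pages_crawled: int) -> float:
    return relevant_pages / pages_crawled

print(harvest_ratio(80, 100))  # 0.8 for the example in the question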

configuring nutch regex-normalize.xml

丶灬走出姿态 submitted on 2019-12-25 15:22:08

Question: I am using the Java-based Nutch web-search software. To prevent duplicate (URL) results from being returned in my search query results, I am trying to remove (i.e. normalize away) the 'jsessionid' expressions from the URLs being indexed when running the Nutch crawler against my intranet. However, my modifications to $NUTCH_HOME/conf/regex-normalize.xml (made prior to running my crawl) do not seem to have any effect. How can I ensure that my regex-normalize.xml configuration is being
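Independent of where the configuration is picked up, it helps to verify that the rewrite pattern itself does what is intended. Below is a small Python check of the kind of jsessionid-stripping rule a regex-normalize.xml entry would encode; the pattern is an illustration of the idea, not the exact rule shipped with Nutch.

import re

# Strip a ";jsessionid=..." segment up to the query string or fragment.
JSESSIONID = re.compile(r"(?i);jsessionid=[^?#&]*")

for url in ("http://intranet/page.do;jsessionid=ABC123?x=1",
            "http://intranet/page.do?x=1"):
    print(JSESSIONID.sub("", url))
# both lines print: http://intranet/page.do?x=1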

Ajax generated content, crawling and black listing

此生再无相见时 submitted on 2019-12-25 13:17:12

Question: My website uses Ajax. I have a user-list page which lists users in an Ajax table (with paging, extra information, and so on). The URL of this page is /user-list, and the user list itself is created by Ajax. When a visitor clicks on a user, he is redirected to a page whose URL is /member/memberName. So Ajax is used here to generate content, not to manage navigation (with the # character). I want to detect bots so that all pages can be indexed. So, in Ajax I want to display an Ajax table with paging
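A minimal, framework-agnostic sketch of one simple way to make that decision: inspect the User-Agent header and serve a pre-rendered (non-Ajax) table when a known crawler is detected. The signature list and the handler snippet are illustrative assumptions, not taken from the question.

CRAWLER_SIGNATURES = ("googlebot", "bingbot", "yandexbot", "baiduspider", "duckduckbot")

def is_crawler(user_agent: str) -> bool:
    """Crude bot check based on well-known User-Agent substrings."""
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in CRAWLER_SIGNATURES)

# In a request handler (framework-specific details omitted):
# if is_crawler(request.headers.get("User-Agent", "")):
#     render the static HTML user list instead of the Ajax-driven table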