web-crawler

Scrapy import module items error

和自甴很熟 submitted on 2019-12-29 08:19:30

Question: My project structure:

kmss/
├── kmss
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── first.py
├── README.rst
├── scrapy.cfg
└── setup.py

I am running it on a Mac and the project folder is created at /user/username/kmss. Within items.py I have a class named "KmssItem". To run first.py (my spider), I have to import items.py, which sits at a higher level in the project. I am having a problem with the following line
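The excerpt breaks off before the failing import, but with the standard Scrapy layout shown above the item class is normally imported through the project package. A minimal sketch, assuming the spider is named "first" and the crawl is started with scrapy crawl from the directory containing scrapy.cfg (which puts the kmss package on the import path); the URL is a placeholder, not from the question:

# kmss/spiders/first.py
import scrapy

# Import via the project package rather than a relative path.
from kmss.items import KmssItem

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ["http://example.com"]  # placeholder, not from the question

    def parse(self, response):
        item = KmssItem()
        yield item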

Nightmare conditional wait()

六眼飞鱼酱① submitted on 2019-12-29 08:07:40

Question: I'm trying to crawl a webpage using Nightmare, but I want to wait for #someelem to be present only if it actually exists. Otherwise, I want Nightmare to move on. How can this be done using .wait()? I can't use .wait(ms). Using .wait(selector) means Nightmare will keep waiting until the element is present, but if the page never has this element, Nightmare will wait forever. The last option is to use .wait(fn), and I've tried something like this: .wait(function(cheerio) { var $ =

Crawl a website, get the links, crawl the links with PHP and XPATH

核能气质少年 submitted on 2019-12-29 06:19:11

Question: I want to crawl an entire website. I have read several threads but I cannot manage to get data at the second level. That is, I can return the links from a starting page, but then I cannot find a way to parse those links and get the content of each one... The code I use is:

<?php
// SELECT STARTING PAGE
$url = 'http://mydomain.com/';
$html = file_get_contents($url);

// GET ALL THE LINKS OF EACH PAGE
// create a dom object
$dom = new DOMDocument();
@$dom->loadHTML($html);
// run xpath for the dom
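The PHP excerpt stops before the second-level fetch, so here is a language-agnostic sketch of the same two-level idea, written in Python with requests and BeautifulSoup instead of DOMDocument/XPath: fetch the start page, collect its links, then fetch and parse each linked page in turn. The domain is the placeholder from the question.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "http://mydomain.com/"  # placeholder domain from the question
start_html = requests.get(start_url).text
links = {urljoin(start_url, a["href"])
         for a in BeautifulSoup(start_html, "html.parser").find_all("a", href=True)}

# Second level: visit every link collected from the starting page.
for link in sorted(links):
    page = BeautifulSoup(requests.get(link).text, "html.parser")
    print(link, "->", page.title.string if page.title else "(no title)")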

how to exclude all title with find?

穿精又带淫゛_ submitted on 2019-12-25 19:02:34

Question: I have a function that gets all the titles from my website, but I don't want the titles of certain products. Is this the right way? I don't want titles from products containing the words "OLP NL", "Arcserve", "LicSAPk" or "symantec".

def get_title(u):
    html = requests.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    title = str(bsObj.title).replace('<title>', '').replace('</title>', '')
    if (title.find('Arcserve') or title.find('OLP NL') or title.find('LicSAPk') or title
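The excerpt is cut off mid-condition, but the chained .find() calls are already a red flag: str.find() returns -1 when the word is absent, and -1 is truthy, so the or-chain cannot work as an exclusion test. A minimal sketch of the exclusion check using `in` and any(), assuming the goal is simply to skip titles that mention a blocked word (html.parser is used here in place of the xml parser from the excerpt):

import requests
from bs4 import BeautifulSoup

BLOCKED = ("Arcserve", "OLP NL", "LicSAPk", "symantec")

def get_title(u):
    html = requests.get(u)
    soup = BeautifulSoup(html.content, "html.parser")
    title = soup.title.get_text() if soup.title else ""
    # str.find() returns -1 (truthy) on a miss, so `in` is the safer test here.
    if any(word.lower() in title.lower() for word in BLOCKED):
        return None  # skip titles of excluded products
    return title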

Tools to convert asp.net dynamic site into static site

一个人想着一个人 submitted on 2019-12-25 18:24:12

Question: Are there any tools that will spider an asp.net website and create a static site?

Answer 1: http://www.httrack.com/ I have used it for this purpose a few times. You may need to do a little tidying up of URLs, and some CSS-linked images might not make it; it depends on how good a job you want to do. If you have Dreamweaver, you can use it to manage the links if you need to clean up the file names afterwards. Optionally, use the link checker extension for Firefox to check it all afterwards.

Answer 2: Another

Web crawling evaluation?

≯℡__Kan透↙ submitted on 2019-12-25 17:07:46

Question: In focused web crawling (a.k.a. topical web crawling) I have seen the evaluation metric "harvest ratio" defined as: after crawling t pages, harvest ratio = number_of_relevant_pages / pages_crawled(t). So, for example, if after crawling 100 pages I get 80 true positives, then the harvest ratio of the crawler at that point is 0.8. But the crawler might have skipped pages that are totally relevant to the crawling domain, and these are not accounted for in this ratio. What is
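A short sketch of the metric as defined above: the share of crawled pages that turned out to be relevant after t pages have been fetched.

def harvest_ratio(relevant_pages: int, pages_crawled: int) -> float:
    return relevant_pages / pages_crawled

print(harvest_ratio(80, 100))  # 0.8 for the example in the question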

configuring nutch regex-normalize.xml

丶灬走出姿态 submitted on 2019-12-25 15:22:08

Question: I am using the Java-based Nutch web-search software. To prevent duplicate (URL) results from being returned in my search query results, I am trying to remove (i.e. normalize away) the 'jsessionid' expressions from the URLs being indexed when running the Nutch crawler against my intranet. However, my modifications to $NUTCH_HOME/conf/regex-normalize.xml (made prior to running my crawl) do not seem to have any effect. How can I ensure that my regex-normalize.xml configuration is being
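Independent of where the configuration is picked up, it helps to verify that the rewrite pattern itself does what is intended. Below is a small Python check of the kind of jsessionid-stripping rule a regex-normalize.xml entry would encode; the pattern is an illustration of the idea, not the exact rule shipped with Nutch.

import re

# Strip a ";jsessionid=..." segment up to the query string or fragment.
JSESSIONID = re.compile(r"(?i);jsessionid=[^?#&]*")

for url in ("http://intranet/page.do;jsessionid=ABC123?x=1",
            "http://intranet/page.do?x=1"):
    print(JSESSIONID.sub("", url))
# both lines print: http://intranet/page.do?x=1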

Ajax generated content, crawling and black listing

此生再无相见时 submitted on 2019-12-25 13:17:12

Question: My website uses Ajax. I have a user-list page which lists users in an Ajax table (with paging, extra information, and so on). The URL of this page is /user-list, and the user list itself is created by Ajax. When a visitor clicks on a user, he is redirected to a page whose URL is /member/memberName. So Ajax is used here to generate content, not to manage navigation (with the # character). I want to detect bots so that all pages can be indexed. So, in Ajax I want to display an Ajax table with paging
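A minimal, framework-agnostic sketch of one simple way to make that decision: inspect the User-Agent header and serve a pre-rendered (non-Ajax) table when a known crawler is detected. The signature list and the handler snippet are illustrative assumptions, not taken from the question.

CRAWLER_SIGNATURES = ("googlebot", "bingbot", "yandexbot", "baiduspider", "duckduckbot")

def is_crawler(user_agent: str) -> bool:
    """Crude bot check based on well-known User-Agent substrings."""
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in CRAWLER_SIGNATURES)

# In a request handler (framework-specific details omitted):
# if is_crawler(request.headers.get("User-Agent", "")):
#     render the static HTML user list instead of the Ajax-driven table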