web-crawler

dynamic start_urls in scrapy

Submitted by 天涯浪子 on 2019-12-20 08:51:44

Question: I'm using Scrapy to crawl multiple pages on a site. The variable start_urls defines the pages to be crawled. I initially start with the first page, so I define start_urls = [1st page] in the file example_spider.py. On getting more info from the first page, I determine the next pages to be crawled and assign start_urls accordingly. Hence I have to overwrite example_spider.py above with the change start_urls = [1st page, 2nd page, ..., Kth page], then run scrapy crawl …
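Rather than rewriting start_urls on disk, Scrapy lets a spider discover follow-up pages at runtime by yielding new Request objects from its parse callback. The control flow can be sketched without Scrapy, using a queue seeded with only the first page (the page graph below is a hypothetical stand-in for real HTTP responses):

```python
from collections import deque

# Hypothetical site: crawling a page reveals which pages to crawl next.
PAGES = {
    "page1": ["page2", "page3"],
    "page2": ["page4"],
    "page3": [],
    "page4": [],
}

def crawl(start_url):
    """Breadth-first crawl: the frontier grows as each page is parsed."""
    frontier = deque([start_url])
    seen = {start_url}
    crawled = []
    while frontier:
        url = frontier.popleft()
        crawled.append(url)
        for next_url in PAGES.get(url, []):  # the "parse" step finds next pages
            if next_url not in seen:
                seen.add(next_url)
                frontier.append(next_url)
    return crawled

print(crawl("page1"))  # ['page1', 'page2', 'page3', 'page4']
```

In Scrapy itself the same idea is `yield scrapy.Request(next_url, callback=self.parse)` inside `parse`, so `start_urls` only ever needs to contain the first page.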

How to extract URLs from an HTML page in Python [closed]

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-20 08:49:58

Question: I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from the HTML. Where should I go and what should I study to write such a program? In other words, is there a simple Python program …
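As a starting point, Python's standard library can already do this: html.parser walks the tags, and collecting every <a href> gives the page's outgoing links (third-party parsers such as BeautifulSoup or lxml are more forgiving with broken real-world markup):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<p><a href="http://example.com/a">A</a> <a href="/b">B</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['http://example.com/a', '/b']
```

Feeding this the body of an HTTP response (e.g. from urllib.request) and resolving relative links with urllib.parse.urljoin is the core of a minimal crawler.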

Strategy for how to crawl/index frequently updated webpages?

Submitted by ぐ巨炮叔叔 on 2019-12-20 08:19:35

Question: I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index its frontpage or any of its main pages, then within hours my index for that page will be out of date. Does a large search engine such as Google have an algorithm to re-crawl frequently updated pages very often, even hourly? Or does it just score frequently updated pages very low so they don't get …
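One common approach in incremental crawling is to adapt each page's re-crawl interval to how often it has actually changed: shrink the interval when a fetch finds new content, grow it when nothing changed, within fixed bounds. A minimal sketch of that policy (the bounds and growth factors here are arbitrary illustrative choices, not Nutch's or Google's actual values):

```python
MIN_INTERVAL = 0.5    # hours; floor for very hot pages (e.g. a news frontpage)
MAX_INTERVAL = 168.0  # hours; ceiling of one week for essentially static pages

def next_interval(current, changed):
    """Shrink the re-crawl interval when the page changed, grow it otherwise."""
    new = current / 2 if changed else current * 1.5
    return max(MIN_INTERVAL, min(MAX_INTERVAL, new))

# A frontpage that changes on every visit converges toward the hot-page floor:
interval = 24.0
for _ in range(8):
    interval = next_interval(interval, changed=True)
print(interval)  # clamped at the 0.5-hour floor
```

Nutch exposes a similar idea through its AdaptiveFetchSchedule, which raises or lowers each URL's fetch interval based on detected modifications.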

Difference between BeautifulSoup and Scrapy crawler?

Submitted by 给你一囗甜甜゛ on 2019-12-20 07:56:52

Question: I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.

Answer 1: Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling, while BeautifulSoup is a …
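The practical split: BeautifulSoup only parses HTML you have already downloaded, while Scrapy also handles the fetching, scheduling, throttling, and item pipelines around the parsing. For a two-site price comparison the parsing half is small either way; here it is with only the standard library (the markup and the "price" class name are made up for illustration, not Amazon's or eBay's real markup):

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Grabs the text inside the first element with class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if self.price is None and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()
            self.in_price = False

def extract_price(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.price

amazon_html = '<div><span class="price">$19.99</span></div>'
ebay_html = '<div><span class="price">$17.50</span></div>'
print(extract_price(amazon_html), extract_price(ebay_html))
```

With BeautifulSoup the whole parser class collapses to roughly `soup.find(class_="price").get_text()`; Scrapy's contribution is everything around that call when you need to crawl many pages on a schedule.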

Reading article content using Goose retrieves nothing

Submitted by a 夏天 on 2019-12-20 07:45:20

Question: I am trying to use Goose to read from .html files (a URL is specified here for convenience in the examples). But at times it doesn't show any text. Please help me out with this issue. Goose version used: https://github.com/agolo/python-goose/ (the present version gives some errors).

    from goose import Goose
    from requests import get

    response = get('http://www.highbeam.com/doc/1P3-979471971.html')
    extractor = Goose()
    article = extractor.extract(raw_html=response.content)
    text = article.cleaned_text

Selenium find_elements from a tag

Submitted by 独自空忆成欢 on 2019-12-20 07:39:48

Question: I want to scrape some hotel information from Booking.com. The website provides some hotel information, in this particular case how many rooms are still available. The following shows the span tag from the Booking.com website; I want to extract only the value of data-x-left-count for all listed hotels.

    <span class="only_x_left sr_rooms_left_wrap " data-x-left-count="6"> Nur noch 6 Zimmer auf unserer Seite verfügbar! </span>

I tried to approach it by finding the elements and returning an …
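With Selenium the usual route is driver.find_elements plus element.get_attribute("data-x-left-count"); the attribute extraction itself can be shown on the quoted markup with only the standard library:

```python
from html.parser import HTMLParser

class RoomsLeftParser(HTMLParser):
    """Collects the data-x-left-count attribute of every span that has one."""
    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "data-x-left-count" in attrs:
            self.counts.append(int(attrs["data-x-left-count"]))

html = ('<span class="only_x_left sr_rooms_left_wrap " data-x-left-count="6">'
        ' Nur noch 6 Zimmer auf unserer Seite verf\u00fcgbar! </span>')
parser = RoomsLeftParser(convert_charrefs=True) if False else RoomsLeftParser()
parser.feed(html)
print(parser.counts)  # [6]
```

The Selenium equivalent would be roughly `[e.get_attribute("data-x-left-count") for e in driver.find_elements(By.CSS_SELECTOR, "span.only_x_left")]`, which has the advantage of seeing the attribute even when Booking.com fills it in with JavaScript.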

Crawler script php

Submitted by 醉酒当歌 on 2019-12-20 04:57:11

Question: I've grabbed a piece of script off here to crawl a website, put it up on my server, and it works. The only issue is that if I set the depth to anything above 4 it doesn't work. I'm wondering if it's due to the server's lack of resources or the code itself.

    <?php
    error_reporting(E_ALL);

    function crawl_page($url, $depth) {
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }
        $seen[$url] = true;
        $dom = new DOMDocument('1.0');
        @$dom->loadHTMLFile($url);
        $anchors = $dom-…
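The script's logic (a seen-set plus a depth counter) is sound, so a failure above depth 4 is more likely a server limit than the algorithm: the number of fetched pages can grow exponentially with depth, and PHP's max_execution_time or memory_limit will cut a long crawl off. The same depth-limited recursion, sketched in Python over a hypothetical link graph instead of live HTTP fetches:

```python
# Hypothetical link graph standing in for loadHTMLFile + anchor extraction.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

def crawl_page(url, depth, seen=None):
    """Depth-limited DFS with a seen-set, mirroring the PHP crawl_page()."""
    if seen is None:
        seen = set()
    if url in seen or depth == 0:
        return seen
    seen.add(url)
    for link in LINKS.get(url, []):
        crawl_page(link, depth - 1, seen)
    return seen

print(sorted(crawl_page("a", 3)))  # ['a', 'b', 'c', 'd'] -- 'e' is one hop too deep
```

Timing a depth-5 run and watching the page count is a quick way to tell whether the crawl is simply exceeding the server's execution-time limit.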

Does Facebook crawler currently interpret javascript before parsing the DOM?

Submitted by 送分小仙女□ on 2019-12-20 04:30:22

Question: The following link seems to say that it can't: How does Facebook Sharer select Images and other metadata when sharing my URL? But I wanted to know whether that is still the case at the current date. (The documentation on the Facebook dev site doesn't give any precision on this point.)

Answer 1: In the tests I've run I've never seen it interpret the JS, but that might be contextual / domain-specific (who knows). To test your specific case, use the Facebook linter: https://developers.facebook.com/tools/debug (log …

Scrapy view returns a blank page

Submitted by 柔情痞子 on 2019-12-20 04:25:11

Question: I'm new to Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/. When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page, but if I download the page and run the command on the local file, it displays as usual. Why is this happening?

Answer 1: Pretend to be a real browser by providing a User-Agent header:

    scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357…
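What the -s USER_AGENT override does is replace Scrapy's default User-Agent (which some sites answer with an empty or error page) with a browser-like one; the same header matters for any HTTP client. A standard-library illustration of attaching the header (the exact User-Agent string is just an example, and the network call itself is left commented out):

```python
import urllib.request

# A browser-like User-Agent string; any realistic browser value works here.
UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357 Safari/537.36")

req = urllib.request.Request("http://www.diseasesdatabase.com/",
                             headers={"User-Agent": UA})
print(req.get_header("User-agent"))  # the header that will be sent
# urllib.request.urlopen(req) would then fetch the page as a "real" browser.
```

For a whole Scrapy project, setting USER_AGENT once in settings.py is cleaner than passing -s on every run.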

Scrapy and Django import error

Submitted by 孤人 on 2019-12-20 03:23:22

Question: When I call the spider through a Python script, it gives me an ImportError:

    ImportError: No module named app.models

My items.py is like this:

    from scrapy.item import Item, Field
    from scrapy.contrib.djangoitem import DjangoItem
    from app.models import Person

    class aqaqItem(DjangoItem):
        django_model = Person

My settings.py is like this:

    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    # http://doc.scrapy…
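The usual fix is to make the Django project importable before anything imports app.models: put the project directory on sys.path and point DJANGO_SETTINGS_MODULE at its settings module. A sketch of that setup step (the path and module name below are hypothetical; with Django 1.7+ you would also call django.setup() afterwards):

```python
import os
import sys

# Hypothetical location of the Django project that defines app.models.
DJANGO_PROJECT_DIR = "/path/to/django_project"

def configure_django(project_dir, settings_module="myproject.settings"):
    """Make a Django project importable from a standalone Scrapy script."""
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    os.environ["DJANGO_SETTINGS_MODULE"] = settings_module
    # With Django >= 1.7 you would now run:
    #   import django; django.setup()
    return sys.path[0], os.environ["DJANGO_SETTINGS_MODULE"]

path, settings = configure_django(DJANGO_PROJECT_DIR)
print(path, settings)
```

This must run before Scrapy imports items.py, since the `from app.models import Person` at its top executes at import time.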