web-crawler

dynamic start_urls in scrapy

Submitted by 天涯浪子 on 2019-12-20 08:51:44

Question: I'm using Scrapy to crawl multiple pages on a site. The variable start_urls defines the pages to be crawled. I initially start with the first page, so I define start_urls = [1st page] in the file example_spider.py. On getting more info from the first page, I determine the next pages to be crawled and assign start_urls accordingly. Hence I have to overwrite example_spider.py above with the change start_urls = [1st page, 2nd page, ..., Kth page], then run scrapy crawl …
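Rather than rewriting start_urls on disk, Scrapy lets a spider discover follow-up pages at runtime by yielding new Request objects from its parse callback. The control flow can be sketched without Scrapy, using a queue seeded with only the first page (the page graph below is a hypothetical stand-in for real HTTP responses):

```python
from collections import deque

# Hypothetical site: crawling a page reveals which pages to crawl next.
PAGES = {
    "page1": ["page2", "page3"],
    "page2": ["page4"],
    "page3": [],
    "page4": [],
}

def crawl(start_url):
    """Breadth-first crawl: the frontier grows as each page is parsed."""
    frontier = deque([start_url])
    seen = {start_url}
    crawled = []
    while frontier:
        url = frontier.popleft()
        crawled.append(url)
        for next_url in PAGES.get(url, []):  # the "parse" step finds next pages
            if next_url not in seen:
                seen.add(next_url)
                frontier.append(next_url)
    return crawled

print(crawl("page1"))  # ['page1', 'page2', 'page3', 'page4']
```

In Scrapy itself the same idea is `yield scrapy.Request(next_url, callback=self.parse)` inside `parse`, so `start_urls` only ever needs to contain the first page.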

How to extract URLs from an HTML page in Python [closed]

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-20 08:49:58

Question: I have to write a web crawler in Python. I don't know how to parse a page and extract the URLs from the HTML. Where should I go and what should I study to write such a program? In other words, is there a simple Python program …
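As a starting point, Python's standard library can already do this: html.parser walks the tags, and collecting every <a href> gives the page's outgoing links (third-party parsers such as BeautifulSoup or lxml are more forgiving with broken real-world markup):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<p><a href="http://example.com/a">A</a> <a href="/b">B</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['http://example.com/a', '/b']
```

Feeding this the body of an HTTP response (e.g. from urllib.request) and resolving relative links with urllib.parse.urljoin is the core of a minimal crawler.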

Strategy for how to crawl/index frequently updated webpages?

Submitted by ぐ巨炮叔叔 on 2019-12-20 08:19:35

Question: I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index its frontpage or any of its main pages, then within hours my index for that page will be out of date. Does a large search engine such as Google have an algorithm to re-crawl frequently updated pages very often, even hourly? Or does it just score frequently updated pages very low so they don't get …
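One common approach in incremental crawling is to adapt each page's re-crawl interval to how often it has actually changed: shrink the interval when a fetch finds new content, grow it when nothing changed, within fixed bounds. A minimal sketch of that policy (the bounds and growth factors here are arbitrary illustrative choices, not Nutch's or Google's actual values):

```python
MIN_INTERVAL = 0.5    # hours; floor for very hot pages (e.g. a news frontpage)
MAX_INTERVAL = 168.0  # hours; ceiling of one week for essentially static pages

def next_interval(current, changed):
    """Shrink the re-crawl interval when the page changed, grow it otherwise."""
    new = current / 2 if changed else current * 1.5
    return max(MIN_INTERVAL, min(MAX_INTERVAL, new))

# A frontpage that changes on every visit converges toward the hot-page floor:
interval = 24.0
for _ in range(8):
    interval = next_interval(interval, changed=True)
print(interval)  # clamped at the 0.5-hour floor
```

Nutch exposes a similar idea through its AdaptiveFetchSchedule, which raises or lowers each URL's fetch interval based on detected modifications.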

Difference between BeautifulSoup and Scrapy crawler?

Submitted by 给你一囗甜甜゛ on 2019-12-20 07:56:52

Question: I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.

Answer 1: Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling, while BeautifulSoup is a …
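The practical split: BeautifulSoup only parses HTML you have already downloaded, while Scrapy also handles the fetching, scheduling, throttling, and item pipelines around the parsing. For a two-site price comparison the parsing half is small either way; here it is with only the standard library (the markup and the "price" class name are made up for illustration, not Amazon's or eBay's real markup):

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Grabs the text inside the first element with class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if self.price is None and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()
            self.in_price = False

def extract_price(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.price

amazon_html = '<div><span class="price">$19.99</span></div>'
ebay_html = '<div><span class="price">$17.50</span></div>'
print(extract_price(amazon_html), extract_price(ebay_html))
```

With BeautifulSoup the whole parser class collapses to roughly `soup.find(class_="price").get_text()`; Scrapy's contribution is everything around that call when you need to crawl many pages on a schedule.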

Reading article content using Goose retrieves nothing

Submitted by a 夏天 on 2019-12-20 07:45:20

Question: I am trying to use Goose to read from .html files (a URL is specified here for convenience in the examples). But at times it doesn't show any text. Please help me out with this issue. Goose version used: https://github.com/agolo/python-goose/ (the present version gives some errors).

    from goose import Goose
    from requests import get

    response = get('http://www.highbeam.com/doc/1P3-979471971.html')
    extractor = Goose()
    article = extractor.extract(raw_html=response.content)
    text = article.cleaned_text

Selenium find_elements from a tag

Submitted by 独自空忆成欢 on 2019-12-20 07:39:48

Question: I want to scrape some hotel information from Booking.com. The website provides some hotel information, in this particular case how many rooms are still available. The following shows the span tag from the Booking.com website; I want to extract only the value of data-x-left-count for all listed hotels.

    <span class="only_x_left sr_rooms_left_wrap " data-x-left-count="6"> Nur noch 6 Zimmer auf unserer Seite verfügbar! </span>

I tried to approach it by finding the elements and returning an …
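With Selenium the usual route is driver.find_elements plus element.get_attribute("data-x-left-count"); the attribute extraction itself can be shown on the quoted markup with only the standard library:

```python
from html.parser import HTMLParser

class RoomsLeftParser(HTMLParser):
    """Collects the data-x-left-count attribute of every span that has one."""
    def __init__(self):
        super().__init__()
        self.counts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "data-x-left-count" in attrs:
            self.counts.append(int(attrs["data-x-left-count"]))

html = ('<span class="only_x_left sr_rooms_left_wrap " data-x-left-count="6">'
        ' Nur noch 6 Zimmer auf unserer Seite verf\u00fcgbar! </span>')
parser = RoomsLeftParser(convert_charrefs=True) if False else RoomsLeftParser()
parser.feed(html)
print(parser.counts)  # [6]
```

The Selenium equivalent would be roughly `[e.get_attribute("data-x-left-count") for e in driver.find_elements(By.CSS_SELECTOR, "span.only_x_left")]`, which has the advantage of seeing the attribute even when Booking.com fills it in with JavaScript.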

Crawler script php

Submitted by 醉酒当歌 on 2019-12-20 04:57:11

Question: I've grabbed a piece of script off here to crawl a website, put it up on my server, and it works. The only issue is that if I set the depth to anything above 4 it doesn't work. I'm wondering if it's due to the server's lack of resources or the code itself.

    <?php
    error_reporting(E_ALL);

    function crawl_page($url, $depth) {
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }
        $seen[$url] = true;
        $dom = new DOMDocument('1.0');
        @$dom->loadHTMLFile($url);
        $anchors = $dom-…
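The script's logic (a seen-set plus a depth counter) is sound, so a failure above depth 4 is more likely a server limit than the algorithm: the number of fetched pages can grow exponentially with depth, and PHP's max_execution_time or memory_limit will cut a long crawl off. The same depth-limited recursion, sketched in Python over a hypothetical link graph instead of live HTTP fetches:

```python
# Hypothetical link graph standing in for loadHTMLFile + anchor extraction.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

def crawl_page(url, depth, seen=None):
    """Depth-limited DFS with a seen-set, mirroring the PHP crawl_page()."""
    if seen is None:
        seen = set()
    if url in seen or depth == 0:
        return seen
    seen.add(url)
    for link in LINKS.get(url, []):
        crawl_page(link, depth - 1, seen)
    return seen

print(sorted(crawl_page("a", 3)))  # ['a', 'b', 'c', 'd'] -- 'e' is one hop too deep
```

Timing a depth-5 run and watching the page count is a quick way to tell whether the crawl is simply exceeding the server's execution-time limit.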

Does Facebook crawler currently interpret javascript before parsing the DOM?

Submitted by 送分小仙女□ on 2019-12-20 04:30:22

Question: The following link seems to say that it can't: How does Facebook Sharer select Images and other metadata when sharing my URL? But I wanted to know whether that is still the case at the current date. (The documentation on the Facebook dev site doesn't give any precision on this point.)

Answer 1: In the tests I've run I've never seen it interpret the JS, but that might be contextual / domain-specific (who knows). To test your specific case, use the Facebook linter: https://developers.facebook.com/tools/debug (log …

Scrapy view returns a blank page

Submitted by 柔情痞子 on 2019-12-20 04:25:11

Question: I'm new to Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/. When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page, but if I download the page and run the command on the local file, it displays as usual. Why is this happening?

Answer 1: Pretend to be a real browser by providing a User-Agent header:

    scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357…
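What the -s USER_AGENT override does is replace Scrapy's default User-Agent (which some sites answer with an empty or error page) with a browser-like one; the same header matters for any HTTP client. A standard-library illustration of attaching the header (the exact User-Agent string is just an example, and the network call itself is left commented out):

```python
import urllib.request

# A browser-like User-Agent string; any realistic browser value works here.
UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357 Safari/537.36")

req = urllib.request.Request("http://www.diseasesdatabase.com/",
                             headers={"User-Agent": UA})
print(req.get_header("User-agent"))  # the header that will be sent
# urllib.request.urlopen(req) would then fetch the page as a "real" browser.
```

For a whole Scrapy project, setting USER_AGENT once in settings.py is cleaner than passing -s on every run.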

Scrapy and Django import error

Submitted by 孤人 on 2019-12-20 03:23:22

Question: When I call the spider through a Python script, it gives me an ImportError:

    ImportError: No module named app.models

My items.py is like this:

    from scrapy.item import Item, Field
    from scrapy.contrib.djangoitem import DjangoItem
    from app.models import Person

    class aqaqItem(DjangoItem):
        django_model = Person

My settings.py is like this:

    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    # http://doc.scrapy…
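The usual fix is to make the Django project importable before anything imports app.models: put the project directory on sys.path and point DJANGO_SETTINGS_MODULE at its settings module. A sketch of that setup step (the path and module name below are hypothetical; with Django 1.7+ you would also call django.setup() afterwards):

```python
import os
import sys

# Hypothetical location of the Django project that defines app.models.
DJANGO_PROJECT_DIR = "/path/to/django_project"

def configure_django(project_dir, settings_module="myproject.settings"):
    """Make a Django project importable from a standalone Scrapy script."""
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    os.environ["DJANGO_SETTINGS_MODULE"] = settings_module
    # With Django >= 1.7 you would now run:
    #   import django; django.setup()
    return sys.path[0], os.environ["DJANGO_SETTINGS_MODULE"]

path, settings = configure_django(DJANGO_PROJECT_DIR)
print(path, settings)
```

This must run before Scrapy imports items.py, since the `from app.models import Person` at its top executes at import time.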