scrapy

Difference between BeautifulSoup and Scrapy crawler?

Submitted by 给你一囗甜甜゛ on 2019-12-20 07:56:52
Question: I want to make a website that compares Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler. Answer 1: Scrapy is a web-spider or web-scraper framework: you give Scrapy a root URL to start crawling, then you can specify constraints such as how many URLs you want to crawl and fetch. It is a complete framework for web scraping and crawling. BeautifulSoup, by contrast, is a parsing library: it only parses documents you fetch by other means and does no crawling of its own.
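A minimal sketch contrasting the two tools; the product URL and the .price selector are hypothetical placeholders, not real Amazon or eBay markup.

```python
import requests
from bs4 import BeautifulSoup

import scrapy


# BeautifulSoup: you fetch the page yourself, then parse it.
def get_price_bs(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(".price")  # hypothetical selector
    return tag.get_text(strip=True) if tag else None


# Scrapy: the framework schedules, fetches, and retries for you.
class PriceSpider(scrapy.Spider):
    name = "price"
    start_urls = ["https://example.com/product/123"]  # placeholder URL

    def parse(self, response):
        yield {"price": response.css(".price::text").get()}
```

For a one-off price lookup the requests/BeautifulSoup pair is enough; for continuously crawling many product pages, Scrapy's scheduling and concurrency are the better fit.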

Scrapy spider is not working

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-20 07:48:44
Question: Since nothing so far is working, I started a new project with

```
python scrapy-ctl.py startproject Nu
```

I followed the tutorial exactly, created the folders, and wrote a new spider:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['http://www
```
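For reference, the scrapy.contrib and SgmlLinkExtractor imports belong to long-removed Scrapy APIs. Below is a sketch of the equivalent spider on a modern Scrapy release; the domain, start URL, and the Rule are assumptions, since the original snippet is truncated.

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NuSpider(CrawlSpider):
    name = "wcase"  # modern Scrapy uses `name` instead of `domain_name`
    allowed_domains = ["example.com"]  # placeholder domain
    start_urls = ["https://example.com/"]  # placeholder start URL

    rules = (
        # Follow every extracted link and pass each page to parse_item.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```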

Python Scrapy Get HTML <script> tag

Submitted by 蓝咒 on 2019-12-20 07:46:08
Question: I have a project and I need to get a script out of the HTML code.

```html
<script>
(function() {
... / More Code
Level.grade = "2";
Level.level = "1";
Level.max_line = "5";
Level.cozum = 'adım 12\ndön sağ\nadım 13\ndön sol\nadım 11';
... / More Code
</script>
```

How do I get only "adım 12\ndön sağ\nadım 13\ndön sol\nadım 11"? Thanks for the help. Answer 1: Use a regex for that. First grab the content of that script tag with response.css("script").extract_first(), and then apply this regex: (Level\.cozum = )(.*?)(\;)
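A sketch of how the two steps combine inside a Scrapy callback; the regex is tightened from the answer's version so that a single capture group returns just the quoted value, which assumes the value is always wrapped in single quotes.

```python
import re


def parse(self, response):
    # Option A: Scrapy's built-in regex helper on the selector list.
    cozum = response.css("script").re_first(r"Level\.cozum = '(.*?)';")

    # Option B: the answer's two explicit steps.
    script = response.css("script").extract_first() or ""
    match = re.search(r"Level\.cozum = '(.*?)';", script)
    if match:
        cozum = match.group(1)

    yield {"cozum": cozum}
```

Note that the \n sequences in the page source are literal backslash-n characters rather than newlines, so (.*?) matches across them without needing re.DOTALL.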

Extracting p within h1 with Python/Scrapy

Submitted by ⅰ亾dé卋堺 on 2019-12-20 07:26:31
Question: I am using Scrapy to extract data about musical concerts from websites. At least one website I'm working with uses (incorrectly, according to the W3C; see "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?") a p element within an h1 element. I need to extract the text within the p element nevertheless, and cannot figure out how. I have read the documentation and looked around for example uses, but am relatively new to Scrapy. I understand the solution has
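A minimal sketch of selectors for that nesting, assuming markup like <h1><p>Concert title</p></h1>. Depending on how the HTML parser normalizes the invalid structure, the direct-child axis may or may not match, so the descendant and string() forms are the safer bets.

```python
# Inside a Scrapy callback:
title = response.xpath("//h1/p/text()").get()   # p as a direct child of h1
title = response.xpath("//h1//p/text()").get()  # p anywhere under h1
title = response.xpath("string(//h1)").get()    # all text under the first h1
```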

Order a json by field using scrapy

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-20 05:53:42
Question: I have created a spider to scrape problems from projecteuler.net; here I have concluded my answer to a related question with it. I launch it with the command scrapy crawl euler -o euler.json, and it outputs an array of unordered JSON objects, each corresponding to a single problem. This is fine for me because I'm going to process it with JavaScript, even though I think resolving the ordering problem via Scrapy could be very simple. But unfortunately, ordering the items Scrapy writes to JSON (I
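One common way to get ordered output, sketched here under the assumption that every item carries a numeric id field: buffer the items in a pipeline and sort them before writing, instead of relying on -o.

```python
import json


class SortedJsonPipeline:
    """Collect all scraped items, then write them sorted by their 'id' field.

    A sketch: Scrapy yields items in completion order, so sorting can only
    happen once the crawl is done. The 'id' field name is an assumption.
    """

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        self.items.sort(key=lambda it: int(it["id"]))
        with open("euler.json", "w", encoding="utf-8") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

The pipeline still has to be enabled through ITEM_PIPELINES in settings.py, and the -o flag can then be dropped from the crawl command.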

The Scrapy crawler framework (1)

Submitted by 一个人想着一个人 on 2019-12-20 05:45:02
1. Scrapy is an application framework written for crawling websites and extracting structured data; with only a small amount of code you can quickly scrape what you need. It is built on the Twisted asynchronous networking framework, which speeds up downloading.

2. Workflow. Install the packages: pip install scrapy and pip install pywin32 (the latter on Windows). Create your own crawler: open cmd, use cd to switch to the directory where you want the project, then:

```
scrapy startproject Myspider
cd Myspider
scrapy genspider itcast itcast.cn
tree                 (inspect the generated project)
scrapy crawl itcast  (start the spider)
```

Target URL: http://www.itcast.cn/channel/teacher.shtml

[Figure: console output after starting the crawl with scrapy crawl itcast]

If you don't want so many log lines before the output, you can adjust that in the settings. [Figure: console output with reduced logging] That looks a little better (honestly, still not pretty).

Use yield item to pass items to pipelines.py, then simply print(item) there, and enable the pipeline in settings.py (around line 68). [Figure: the ITEM_PIPELINES setting] From the pipelines.py and settings configuration you can see that the smaller a pipeline's weight number, the higher its priority; see the sketch below. [Figure: pipeline priority example]
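A minimal sketch of the spider and pipeline this walkthrough describes; the CSS selectors for the teacher page are assumptions about its markup.

```python
# Myspider/spiders/itcast.py
import scrapy


class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        for teacher in response.css("div.li_txt"):  # assumed container class
            yield {
                "name": teacher.css("h3::text").get(),
                "title": teacher.css("h4::text").get(),
            }


# Myspider/pipelines.py
class MyspiderPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
```

Enable it in settings.py with ITEM_PIPELINES = {"Myspider.pipelines.MyspiderPipeline": 300}; among several pipelines, the one with the smaller number runs first.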

scrapy can't crawl all links in a page

Submitted by ℡╲_俬逩灬. on 2019-12-20 05:16:33
Question: I am trying to use Scrapy to crawl an AJAX website, http://play.google.com/store/apps/category/GAME/collection/topselling_new_free. I want to get all the links leading to each game. Inspecting an element of the page, it looks like this: [screenshot of the page markup]. I want to extract all links with the pattern /store/apps/details?id=, but when I ran commands in the shell, it returned nothing: [screenshot of the shell command]. I've also tried //a/@href; that didn't work out either, and I don't know what is going wrong. Now
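For the selector side of the problem, a sketch of matching that href pattern; note that if the links are injected by JavaScript, as on an AJAX page like this, the plain HTTP response Scrapy sees won't contain them at all, and the page has to be rendered or its underlying API called instead.

```python
# In scrapy shell or a callback: match hrefs containing the pattern.
links = response.xpath(
    '//a[contains(@href, "/store/apps/details?id=")]/@href'
).getall()

# Equivalent CSS form:
links = response.css(
    'a[href*="/store/apps/details?id="]::attr(href)'
).getall()
```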

Scrapy + selenium requests twice for each url

Submitted by 别来无恙 on 2019-12-20 05:15:09
Question:

```python
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
            try:
                next.click()
                #
```
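The double request in the snippet above comes from Scrapy downloading each URL first and then Selenium fetching the very same URL again inside parse. A common fix, sketched here, is a downloader middleware that answers each request with the browser's rendering, so every URL is fetched exactly once:

```python
# Enabled via DOWNLOADER_MIDDLEWARES in settings.py; a sketch.
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a response here short-circuits Scrapy's own download,
        # so the URL is fetched only by the browser.
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```

The spider's parse callback then receives the rendered HTML directly and no longer needs its own driver.get(response.url).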

Scrapy view returns a blank page

Submitted by 柔情痞子 on 2019-12-20 04:25:11
Question: I'm new to Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/. When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page, but if I download the page and run the command on the local file, it displays as usual. Why is this happening? Answer 1: Pretend to be a real browser by providing a User-Agent header:

```
scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357
```
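The same idea works project-wide rather than per command; a sketch of the settings.py equivalent, where any realistic browser User-Agent string will do (this one is only an example):

```python
# settings.py
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/43.0.2357.134 Safari/537.36"
)
```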

XPath text with children

Submitted by a 夏天 on 2019-12-20 04:22:38
Question: Given this HTML:

```html
<ul>
  <li>This is <a href="#">a link</a></li>
  <li>This is <a href="#">another link</a>.</li>
</ul>
```

How can I use XPath to get the following result: ['This is a link', 'This is another link.']? What I've tried: //ul/li/text(), but this gives me ['This is ', 'This is .'] (without the text in the a tags). Also string(//ul/li), but this gives me ['This is a link'] (so only the first element, since string() converts just the first node of a node-set). Also //ul/li/descendant-or-self::text(), but this gives me ['This is ', 'a link', 'This is ', 'another link', '.'] (the individual fragments, not joined per list item).
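A sketch of the usual fix: select each li node first, then take its XPath string value, which joins all descendant text per item.

```python
from parsel import Selector  # the selector library Scrapy uses internally

html = """
<ul>
  <li>This is <a href="#">a link</a></li>
  <li>This is <a href="#">another link</a>.</li>
</ul>
"""

sel = Selector(text=html)
texts = [li.xpath("string(.)").get() for li in sel.xpath("//ul/li")]
print(texts)  # ['This is a link', 'This is another link.']
```

Inside a Scrapy callback the same pattern is [li.xpath("string(.)").get() for li in response.xpath("//ul/li")]; normalize-space(.) also works when surrounding whitespace should be trimmed.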