Scrapy: Google Crawl doesn't work


Question


When I try to crawl Google for search results, Scrapy just yields the Google home page: http://pastebin.com/FUbvbhN4

Here is my spider:

import scrapy

class GoogleFinanceSpider(scrapy.Spider):
    name = "google"
    start_urls = ['http://www.google.com/#q=finance.google.com:+3m+co']
    allowed_domains = ['www.google.com']

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Is there something wrong with this URL as a start URL? When I open it in my browser -- by putting it in the address bar (not by filling in the search form) -- I get valid search results.


Answer 1:


Yes, it looks like that address redirects to the home page.

Example with scrapy shell http://www.google.com/#q=finance.google.com:+3m+co:

...
[s]   request    <GET http://www.google.com/#q=finance.google.com:+3m+co>
[s]   response   <200 http://www.google.com/>
...

Looking at your URL, this makes sense: it doesn't contain any query parameters. The #q=... part is a fragment, not a URL parameter, so it is never sent to the server; it is the browser that recognizes it and turns it into a Google search. In other words, it is not really part of the URL path that Scrapy requests.

The correct Google search URL is: http://www.google.com/search?q=YOURQUERY
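
For example, here is a minimal sketch of the spider using the /search endpoint with a real ?q= parameter (the simplified filename handling is my own assumption, not from the original question):

import scrapy

class GoogleFinanceSpider(scrapy.Spider):
    name = "google"
    # Use the /search endpoint with a ?q= query parameter so the query
    # is actually sent to Google's server instead of staying in the
    # client-side fragment (#q=...).
    start_urls = ['http://www.google.com/search?q=finance.google.com:+3m+co']
    allowed_domains = ['www.google.com']

    def parse(self, response):
        # Save the raw search results page to disk for inspection.
        filename = 'google-search-results.html'
        with open(filename, 'wb') as f:
            f.write(response.body)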




Answer 2:


In most cases, Google will redirect the spider to a CAPTCHA page; Bing search results are easier to crawl.

There is a project for crawling search results from Google/Bing/Baidu: https://github.com/titantse/seCrawler. A minimal Bing-based sketch is shown below.
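
As a rough illustration of the Bing approach, here is a minimal sketch of a spider against Bing's /search?q= endpoint (the CSS selector for result titles is an assumption about Bing's current markup and may need adjusting):

import scrapy

class BingFinanceSpider(scrapy.Spider):
    name = "bing"
    # Bing accepts the query as a plain ?q= parameter and is generally
    # less aggressive about blocking automated requests than Google.
    start_urls = ['http://www.bing.com/search?q=finance.google.com:+3m+co']
    allowed_domains = ['www.bing.com']

    def parse(self, response):
        # Yield result titles; 'li.b_algo h2 a' is an assumed selector
        # for Bing's organic result entries.
        for title in response.css('li.b_algo h2 a::text').getall():
            yield {'title': title}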



Source: https://stackoverflow.com/questions/33395133/scrapy-google-crawl-doesnt-work
