Having trouble understanding where to look in source code, in order to create a web scraper

≯℡__Kan透↙ 提交于 2019-12-11 11:37:34

问题


I am noob with python, been on and off teaching myself since this summer. I am going through the scrapy tutorial, and occasionally reading more about html/xml to help me understand scrapy. My project to myself is to imitate the scrapy tutorial in order to scrape http://www.gamefaqs.com/boards/916373-pc. I want to get a list of the thread title along with the thread url, should be simple!

My problem lies in not understanding xpath, and also html i guess. When viewing the source code for the gamefaqs site, I am not sure what to look for in order to pull the link and title. I want to say just look at the anchor tag and grab the text, but i am confused on how.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
name = "dmoz"
allowed_domains = ["http://www.gamefaqs.com"]
start_urls = ["http://www.gamefaqs.com/boards/916373-pc"]


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//a')
    items = []
    for site in sites:
        item = DmozItem()
        item['link'] = site.select('a/@href').extract()
        item['desc'] = site.select('text()').extract()
        items.append(item)
    return items

I want to change this to work on gamefaqs, so what would i put in this path? I imagine the program returning results something like this thread name thread url I know the code is not really right but can someone help me rewrite this to obtain the results, it would help me understand the scraping process better.


回答1:


The layout and organization of a web page can change and deep tag based paths can be difficult to deal with. I prefer to pattern match the text of the links. Even if the link format changes, matching the new pattern is simple.

For gamefaqs the article links look like:

http://www.gamefaqs.com/boards/916373-pc/37644384

That's the protocol, domain name, literal 'boards' path. '916373-pc' identifies the forum area and '37644384' is the article ID.

We can match links for a specific forum area using using a regular expression:

reLink = re.compile(r'.*\/boards\/916373-pc\/\d+$')
if reLink.match(link)

Or any forum area using using:

reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')
if reLink.match(link)

Adding link matching to your code we get:

import re
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//a')
    items = []
    for site in sites:
        link = site.select('a/@href').extract()
        if reLink.match(link)
            item = DmozItem()
            item['link'] = link
            item['desc'] = site.select('text()').extract()
            items.append(item)
    return items

Many sites have separate summary and detail pages or description and file links where the paths match a template with an article ID. If needed, you can parse the forum area and article ID like this:

reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')
m = reLink.match(link)
if m:
    areaStr = m.groupdict()['area']
    idStr = m.groupdict()['id']

isStr will be a string which is fine for filling in a URL template, but if you need to calculate the previous ID, etc., then convert it to a number:

idInt = int(idStr)

I hope this helps.



来源:https://stackoverflow.com/questions/13465016/having-trouble-understanding-where-to-look-in-source-code-in-order-to-create-a

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!