问题
I am noob with python, been on and off teaching myself since this summer. I am going through the scrapy tutorial, and occasionally reading more about html/xml to help me understand scrapy. My project to myself is to imitate the scrapy tutorial in order to scrape http://www.gamefaqs.com/boards/916373-pc. I want to get a list of the thread title along with the thread url, should be simple!
My problem lies in not understanding xpath, and also html i guess. When viewing the source code for the gamefaqs site, I am not sure what to look for in order to pull the link and title. I want to say just look at the anchor tag and grab the text, but i am confused on how.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem
class DmozSpider(BaseSpider):
name = "dmoz"
allowed_domains = ["http://www.gamefaqs.com"]
start_urls = ["http://www.gamefaqs.com/boards/916373-pc"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//a')
items = []
for site in sites:
item = DmozItem()
item['link'] = site.select('a/@href').extract()
item['desc'] = site.select('text()').extract()
items.append(item)
return items
I want to change this to work on gamefaqs, so what would i put in this path? I imagine the program returning results something like this thread name thread url I know the code is not really right but can someone help me rewrite this to obtain the results, it would help me understand the scraping process better.
回答1:
The layout and organization of a web page can change and deep tag based paths can be difficult to deal with. I prefer to pattern match the text of the links. Even if the link format changes, matching the new pattern is simple.
For gamefaqs the article links look like:
http://www.gamefaqs.com/boards/916373-pc/37644384
That's the protocol, domain name, literal 'boards' path. '916373-pc' identifies the forum area and '37644384' is the article ID.
We can match links for a specific forum area using using a regular expression:
reLink = re.compile(r'.*\/boards\/916373-pc\/\d+$')
if reLink.match(link)
Or any forum area using using:
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')
if reLink.match(link)
Adding link matching to your code we get:
import re
reLink = re.compile(r'.*\/boards\/\d+-[^/]+\/\d+$')
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//a')
items = []
for site in sites:
link = site.select('a/@href').extract()
if reLink.match(link)
item = DmozItem()
item['link'] = link
item['desc'] = site.select('text()').extract()
items.append(item)
return items
Many sites have separate summary and detail pages or description and file links where the paths match a template with an article ID. If needed, you can parse the forum area and article ID like this:
reLink = re.compile(r'.*\/boards\/(?P<area>\d+-[^/]+)\/(?P<id>\d+)$')
m = reLink.match(link)
if m:
areaStr = m.groupdict()['area']
idStr = m.groupdict()['id']
isStr will be a string which is fine for filling in a URL template, but if you need to calculate the previous ID, etc., then convert it to a number:
idInt = int(idStr)
I hope this helps.
来源:https://stackoverflow.com/questions/13465016/having-trouble-understanding-where-to-look-in-source-code-in-order-to-create-a