I want to scrape a site using GAE and post the results into a Google Entity

Question

I want to scrape this URL: https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc

I then want to follow each of the links, extract various pieces of information (e.g. permissions, prims, etc.), and post the results into an entity on Google App Engine.

What is the best way to go about it?

Chris

Answer 1:

For normalizing HTML with a pure-Python library, I have had better experiences with html5lib than with BeautifulSoup.

However, you only want to extract simply structured information, which doesn't actually require normalizing the HTML. I have a few scraping apps on Google App Engine that use my own XPath library, which works with raw HTML. Alternatively, you can use regular expressions for one-off jobs.
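As a rough illustration of the regular-expression route, here is a minimal sketch that pulls label/value pairs straight out of raw HTML. The markup in SAMPLE_HTML, the pattern, and the name extract_fields are all assumptions made for the example; the real item pages may be laid out quite differently, so the pattern would need adjusting.

```python
import re

# Hypothetical markup: the real item pages may be structured differently.
SAMPLE_HTML = """
<table>
  <tr><td class="label">Permissions:</td><td>Copy, Modify</td></tr>
  <tr><td class="label">Prims:</td><td>12</td></tr>
</table>
"""

# Matches a label cell ("Something:") followed by its value cell.
ROW_RE = re.compile(
    r'<td class="label">\s*(?P<label>[^<:]+):\s*</td>'
    r'\s*<td>\s*(?P<value>[^<]*?)\s*</td>',
    re.IGNORECASE,
)


def extract_fields(html):
    """Return a dict mapping lower-cased labels to their text values."""
    return {
        match.group("label").strip().lower(): match.group("value")
        for match in ROW_RE.finditer(html)
    }


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

The resulting dict could then be copied onto whatever entity you define in the datastore. A pattern like this is brittle if the site changes its markup, which is why it suits one-off jobs best.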

Answer 2:

There are several nice screen scraping libraries you can use in Python.

Perhaps the easiest to build an advanced scraper with is Scrapy. It relies on Twisted to implement the main engine, but provides a very easy-to-use interface for implementing custom scraping code.

Otherwise you can look at doing it more manually with something like BeautifulSoup, or Mechanize which provides a "mechanical" browser implementation.

BeautifulSoup and Mechanize should both work out of the box on App Engine, which provides a wrapper around httplib and urllib that uses urlfetch as a backend. Only Scrapy will be problematic, due to its use of Twisted. [Thanks to Nick Johnson for the update.]
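To make the "more manual" approach concrete, here is a minimal sketch of the first step: collecting the item links from a search-results page. It swaps in Python 3's standard-library html.parser in place of BeautifulSoup so that it is self-contained, and it takes the HTML as a string, so the fetch itself (via urllib or urlfetch) is left out. The URLs, the "item.php" marker, and the names LinkCollector and collect_links are hypothetical, not taken from the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a> tags whose href contains a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:           # skip duplicates, keep order
                self.links.append(url)


def collect_links(html, base_url, marker):
    """Return the de-duplicated item links found in html."""
    parser = LinkCollector(base_url, marker)
    parser.feed(html)
    return parser.links


# Hypothetical search-results fragment and base URL.
SAMPLE_PAGE = (
    '<a href="item.php?id=1">First</a> '
    '<a href="/about">About</a> '
    '<a href="item.php?id=2">Second</a> '
    '<a href="item.php?id=1">First again</a>'
)
BASE_URL = "https://www.example.com/market/search.php"

links = collect_links(SAMPLE_PAGE, BASE_URL, "item.php")
print(links)
```

Each collected URL would then be fetched in turn and its details extracted and stored. With BeautifulSoup the same step is usually shorter, since it can find all anchor tags directly.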

Source: https://stackoverflow.com/questions/2406428/i-want-to-scrape-a-site-using-gae-and-post-the-results-into-a-google-entity

Tags

python

google-app-engine

screen-scraping