I want to scrape a site using GAE and post the results into a Google Entity

回眸只為那壹抹淺笑 submitted on 2020-01-14 03:28:06

Question


I want to scrape this URL: https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc

I then want to follow each of the links in the results, extract various pieces of information (e.g. permissions, prims, etc.), and post the results into an entity on Google App Engine.

What is the best way to go about it?

Chris


Answer 1:


For normalizing HTML with a pure Python library, I have had better experiences with html5lib than with BeautifulSoup.

However, you only want to extract simply structured information, which doesn't actually require normalizing the HTML. I have a few scraping apps on Google App Engine that use my own XPath library, which works with raw HTML. For one-off jobs you can also use regular expressions.
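As a rough illustration of the regular-expression route for a one-off job, here is a minimal sketch. The sample markup, the `ROW_RE` pattern and the `extract_fields` helper are all assumptions made for the example; the real item pages may be laid out quite differently, so the pattern would need adjusting against the actual HTML.

```python
import re

# Hypothetical markup: the real item page layout is not shown in the question.
SAMPLE_HTML = """
<table>
  <tr><td>Prims:</td><td>12</td></tr>
  <tr><td>Permissions:</td><td>Copy, Modify</td></tr>
</table>
"""

# A label cell ending in ":" followed directly by a value cell.
ROW_RE = re.compile(
    r"<td>\s*([^<:]+):\s*</td>\s*<td>\s*([^<]*?)\s*</td>",
    re.IGNORECASE,
)


def extract_fields(html):
    """Return a dict mapping lower-cased labels to their text values."""
    return {label.strip().lower(): value for label, value in ROW_RE.findall(html)}


fields = extract_fields(SAMPLE_HTML)
```

A pattern like this works directly on the raw HTML, but it is brittle: any change in the page markup is likely to break it, which is why it suits one-off jobs better than long-running scrapers.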




Answer 2:


There are several nice screen scraping libraries you can use in Python.

Perhaps the easiest way to knock up an advanced scraper is with Scrapy. It relies on Twisted for its main engine but provides a very easy-to-use interface for implementing custom scraping code.

Otherwise, you can do it more manually with something like BeautifulSoup, or with Mechanize, which provides a "mechanical" browser implementation.

BeautifulSoup and Mechanize should both work out of the box on App Engine, because it provides a wrapper around httplib and urllib that uses urlfetch as a backend. Only Scrapy will be problematic, due to its use of Twisted. [Thanks to Nick Johnson for the update.]
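To give a feel for the manual approach, here is a minimal sketch of the first step: fetching a listing page and collecting the links to follow. It deliberately uses only the Python 3 standard library (`html.parser` and `urllib`) as a stand-in for BeautifulSoup or Mechanize, so it is not the API of either library. The `LinkCollector`, `collect_links` and `fetch` names are hypothetical, and writing the extracted values into a datastore entity is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    """Return every link in the page, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page and decode it as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In use, something like `collect_links(fetch(listing_url), listing_url)` would return the candidate item pages. You would then filter that list down to the item links, fetch each one, and extract the fields you need.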



Source: https://stackoverflow.com/questions/2406428/i-want-to-scrape-a-site-using-gae-and-post-the-results-into-a-google-entity
