What pure Python library should I use to scrape a website?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-06 09:49:12

Question


I currently have some Ruby code used to scrape some websites. I was using Ruby because at the time I was using Ruby on Rails for a site, and it just made sense.

Now I'm trying to port this over to Google App Engine, and keep getting stuck.

I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspection with XPath.

I've tried the built-in ElementTree, but it choked on the first HTML blob I gave it when it ran into '&mdash'.
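
For illustration, here is a minimal sketch of that failure, assuming Python 3 and a made-up one-line HTML snippet. The names `html_blob` and `TextCollector` are hypothetical and not from the original post. ElementTree is a strict XML parser, so an HTML-only entity such as `&mdash;` is rejected as undefined, while the standard library's `html.parser` is lenient and decodes it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample input: valid HTML, but not well-formed XML,
# because &mdash; is not one of XML's five predefined entities.
html_blob = "<p>Scraping&mdash;the hard way</p>"

# Strict XML parsing: expected to raise ParseError on the entity.
try:
    ET.fromstring(html_blob)
    parse_error = None
except ET.ParseError as exc:
    parse_error = str(exc)


class TextCollector(HTMLParser):
    """Collects text nodes; character references are decoded by default."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Lenient HTML parsing: the same input is accepted.
collector = TextCollector()
collector.feed(html_blob)
collector.close()
text = "".join(collector.chunks)

print("ElementTree error:", parse_error)
print("html.parser text:", text)
```

This only shows why a strict XML parser is a poor fit for real-world HTML; it is not a substitute for the libraries suggested in the answers.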

Do I keep trying to hack ElementTree in there, or do I try to use something else?

Thanks, Mark


Answer 1:


Beautiful Soup.




Answer 2:


lxml -- 100x better than ElementTree




Answer 3:


There's also Scrapy, which might be more up your alley.




Answer 4:


There are a number of examples of web page scrapers written using pyparsing, such as this one (extracts all URL links from yahoo.com) and this one (for extracting the NIST NTP server addresses). Be sure to use the pyparsing helper method makeHTMLTags instead of just hand-coding "<" + Literal(tagname) + ">". makeHTMLTags creates a very robust parser, with accommodation for extra spaces, upper/lower case inconsistencies, unexpected attributes, attribute values with various quoting styles, and so on. Pyparsing will also give you more control over special syntax issues, such as custom entities. It is also pure Python, liberally licensed, and has a small footprint (a single source module), so it is easy to drop into your GAE app right alongside your other application code.
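
As a rough sketch of the makeHTMLTags approach described above (not the code behind either linked example), the following pulls the links out of a small made-up HTML string. It assumes pyparsing is installed; the sample markup and the names `sample_html`, `link_expr`, and `link_text` are hypothetical. Newer pyparsing releases also spell the helper `make_html_tags`.

```python
from pyparsing import SkipTo, makeHTMLTags

# Hypothetical sample input with deliberately messy markup:
# mixed-case tags, extra spaces, an unquoted attribute, mixed quoting.
sample_html = """
<A HREF="http://example.com/a" class=nav>First</A>
<a  href='http://example.com/b'>Second</a>
"""

# makeHTMLTags returns a pair of expressions for the opening and closing tag.
a_start, a_end = makeHTMLTags("a")

# An anchor is: opening tag, everything up to the closing tag, closing tag.
link_expr = a_start + SkipTo(a_end)("link_text") + a_end

# scanString yields (tokens, start, end) for every match in the input.
# Attribute names are lowercased, so HREF is available as tokens.href.
links = [
    (tokens.href, tokens.link_text)
    for tokens, _start, _end in link_expr.scanString(sample_html)
]

for href, label in links:
    print(label, "->", href)
```

Because the tag expressions tolerate case and quoting differences, both anchors above should match without any special handling.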




Answer 5:


BeautifulSoup is good, but its API is awkward. Try ElementSoup, which provides an ElementTree interface to BeautifulSoup.



Source: https://stackoverflow.com/questions/1563165/what-pure-python-library-should-i-use-to-scrape-a-website
