How to crawl a website/extract data into database with python?

北慕城南 提交于 2019-12-02 16:25:36
Acorn

If you want to use a powerful scraping framework there's Scrapy. It has some good documentation too. It may be a little overkill depending on your task though.

sharjeel

Scrapy is probably the best Python library for crawling. It can maintain state for authenticated sessions.

Dealing with binary data should be handled separately. For each file type, you'll have to handle it differently according to your own logic. For almost any kind of format, you'll probably be able to find a library. For instance take a look at PyPDF for handling PDFs. For excel files you can try xlrd.

I liked using BeatifulSoup for extracting html data

It's as easy as this:

from BeautifulSoup import BeautifulSoup 
import urllib

ur = urllib.urlopen("http://pragprog.com/podcasts/feed.rss")
soup = BeautifulSoup(ur.read())
items = soup.findAll('item')

urls = [item.enclosure['url'] for item in items]

For this purpose there is a very useful tool called web-harvest Link to their website http://web-harvest.sourceforge.net/ I use this to crawl webpages

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!