Python Scrapy on offline (local) data

SimpleHTTPServer Hosting

If you truly want to host it locally and use Scrapy, you could serve it by navigating to the directory it's stored in and running SimpleHTTPServer (port 8000 shown below):

python -m SimpleHTTPServer 8000
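
If you're on Python 3, the equivalent built-in module is http.server, so the command would be:

python -m http.server 8000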

Then just point Scrapy at 127.0.0.1:8000:

$ scrapy crawl 127.0.0.1:8000

file://

An alternative is to have Scrapy point at the set of files directly:

$ scrapy crawl file:///home/sagi/html_files # Assuming you're on a *nix system

Wrapping up

Once you've set up your spider for Scrapy (see the dirbot example), just run the crawler:

$ scrapy crawl 127.0.0.1:8000

If the links in the HTML files are absolute rather than relative, though, these approaches may not work well; you would need to adjust the files yourself.
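
As an illustration, a minimal spider sketch for this setup might look like the following (the spider name, the start URL, and the link-extraction logic are placeholders to adapt to your own project; with recent Scrapy versions you would run it with the spider name, e.g. scrapy crawl local_files):

import scrapy


class LocalFilesSpider(scrapy.Spider):
    # Hypothetical spider name -- run with: scrapy crawl local_files
    name = "local_files"
    # Crawl the local SimpleHTTPServer; file:// URLs work here as well
    start_urls = ["http://127.0.0.1:8000/"]

    def parse(self, response):
        # Placeholder extraction: yield every link found on the page
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}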

Go to your dataset folder:

import os

# Iterate over every file saved in the current (dataset) directory
for filename in os.listdir(os.getcwd()):
    with open(filename, "r") as f:
        page_content = f.read()
        # Do whatever you want with page_content here,
        # e.g. parse it with lxml or BeautifulSoup.
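
For instance, a minimal sketch of that parsing step with BeautifulSoup (the filename "example.html" is just a placeholder, and beautifulsoup4 is assumed to be installed):

from bs4 import BeautifulSoup

# Hypothetical example: parse one saved page and collect its links
with open("example.html", "r") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

links = [a.get("href") for a in soup.find_all("a")]
print(links)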

No need to go for Scrapy!
