Scrapy - How to extract all blog posts from a category?


There are a few things you can improve in your code, and two problems you want to solve: reading the posts and crawling automatically.

If you want to get the contents of new blog posts, you have to re-run your spider; otherwise you would end up with an endless loop. Naturally, in that case you have to make sure you do not scrape entries you have already scraped (for example by keeping them in a database, or by reading the already saved files at spider start). But you cannot have a spider that runs forever and waits for new entries; that is not what a spider is for.
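As a rough sketch of that de-duplication idea (nothing here comes from your code; the file name seen_urls.txt and the spider name are made up for the example), you could load the already scraped URLs when the spider starts and skip them in the callback:

import os

import scrapy


class BlogPostSpider(scrapy.Spider):
    name = 'blogposts'
    seen_file = 'seen_urls.txt'  # hypothetical file with one scraped URL per line

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Load the URLs that earlier runs already scraped.
        self.seen_urls = set()
        if os.path.exists(self.seen_file):
            with open(self.seen_file) as f:
                self.seen_urls = {line.strip() for line in f}

    def parse_page(self, response):
        if response.url in self.seen_urls:
            return  # this entry was scraped in a previous run
        # ... extract and yield the post here ...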

Your approach of storing the posts into a file is wrong. Why do you scrape a list of items and then do nothing with them? And why do you save the items inside the parse_page function? That is what item pipelines are for: write one and do the exporting there. The f.close() is also unnecessary, because the with statement already closes the file for you at the end.
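A minimal sketch of such a pipeline, assuming an output file called posts.jl and a class name BlogPostPipeline (both just examples): it writes each item as one JSON line and closes the file when the spider finishes.

import json


class BlogPostPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open('posts.jl', 'w')

    def close_spider(self, spider):
        # Close it when the spider finishes, much like your with statement did.
        self.file.close()

    def process_item(self, item, spider):
        # Write every scraped item as one JSON line.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

To activate it, register the class in settings.py under ITEM_PIPELINES, for example {'yourproject.pipelines.BlogPostPipeline': 300}, where yourproject is a placeholder for your project package.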

Your rules variable will raise an error because it is not iterable; I wonder if you even tested your code. The Rule is also more complex than it needs to be. You can simplify it to this:

rules = [Rule(LinkExtractor(allow='page/*'), follow=True, callback='parse_page'),]

This rule follows every URL that has page in it and passes the response to parse_page.
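To show the rule in context, here is a stripped-down CrawlSpider sketch; the class name, the start URL and the CSS selectors in parse_page are guesses for a standard WordPress theme, not taken from your code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EdumineSpider(CrawlSpider):
    name = 'edumine'
    # allowed_domains is discussed below; it has to match the blog's host.
    start_urls = ['https://edumine.wordpress.com/category/ide-configuration/environment-setup/']

    rules = [Rule(LinkExtractor(allow='page/*'), follow=True, callback='parse_page')]

    def parse_page(self, response):
        # The selectors assume a default WordPress theme; adjust them to the real markup.
        for post in response.css('article'):
            yield {
                'title': post.css('.entry-title a::text').get(),
                'url': post.css('.entry-title a::attr(href)').get(),
            }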

If you start your scraper, you will see that the requests are filtered because of your allowed domains:

Filtered offsite request to 'edumine.wordpress.com': <GET https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/>

To solve this, change your allowed domains to:

allowed_domains = ["edumine.wordpress.com"]

If you want to scrape other WordPress sites as well, simply change it to:

allowed_domains = ["wordpress.com"]