Scrapy crawler is being blocked and gets 404

偶尔善良 submitted on 2021-01-29 06:00:56

Question


I'm trying to scrape the page 'https://zhuanlan.zhihu.com/wangzhenotes' with Scrapy, using the configuration from the referenced post, which is reproduced at the end of this post.

This command

scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes'

gets me

2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://zhuanlan.zhihu.com/robots.txt> (referer: None)
2020-07-02 05:50:04 [protego] DEBUG: Rule at line 19 without any user agent to enforce it on.
...
2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://zhuanlan.zhihu.com/wangzhenotes> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10ac98790>
[s]   item       {}
[s]   request    <GET https://zhuanlan.zhihu.com/wangzhenotes>
[s]   response   <200 https://zhuanlan.zhihu.com/wangzhenotes>
...

I guess the crawler is being blocked, because this command returns only 3:

len(response.xpath('//span'))

whereas searching for "span" in the page source in Chrome finds over 80,

and response.css("h2.ContentItem-title") returns an empty list [].

How do I get those spans?


Here is the configuration I'm using, which is the same as the one in the referenced post.

# middlewares.py
class CustomMiddleware(object):
    def process_request(self, request, spider):
        # Send a desktop Chrome User-Agent with every outgoing request.
        request.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"


# settings.py
DOWNLOADER_MIDDLEWARES = {
    'projectname.middlewares.CustomMiddleware': 543,
}

Answer 1:


The problem is that the spans and the h2.ContentItem-title elements are not present in the page source that Scrapy downloads. The browser loads them afterwards through a separate request, so the crawler is not being blocked; it simply never receives that content.

Here is an example of how to fetch that data with the requests module; the same approach works in Scrapy as well:
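One way to confirm this diagnosis is to count the tags in the raw HTML that the crawler received, without running any JavaScript. The following is a minimal sketch using only the standard library; `TagCounter`, `count_tag`, and the `raw_html` sample are hypothetical names and data made up for illustration, not part of the original post or of Scrapy.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given tag is opened in raw HTML."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    """Return the number of opening `tag` elements found in `html`."""
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Hypothetical stand-in for a page whose real content is rendered client-side:
# the downloaded HTML holds only a placeholder plus the script that fills it in.
raw_html = """
<html><body>
  <div id="root"><span>Loading</span></div>
  <script>/* client-side code that later fetches and renders the list */</script>
</body></html>
"""

print(count_tag(raw_html, "span"))  # prints 1
print(count_tag(raw_html, "h2"))    # prints 0
```

In the Scrapy shell, the same check could be run as `count_tag(response.text, "span")`. A count far below what the browser's element inspector shows suggests that the missing elements are added by JavaScript after the initial page load.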

import requests

headers = {
    'authority': 'www.zhihu.com',
    'x-requested-with': 'fetch',
    'x-ab-param': 'top_quality=0;li_topics_search=0;qap_question_visitor= 0;se_click_v_v=0;se_topicfeed=0;se_aa_base=0;se_video200=0;tsp_hotlist_ui=1;zr_expslotpaid=1;pf_fuceng=1;qap_labeltype=1;zr_km_answer=open_cvr;zr_zr_search_sims=0;top_v_album=1;pf_adjust=0;li_svip_cardshow=1;tp_dingyue_video=0;top_test_4_liguangyi=1;li_se_section=1;zr_rel_search=base;zr_training_first=false;tp_discover=0;tp_move_scorecard=0;li_salt_hot=1;zr_intervene=0;zr_slotpaidexp=1;se_college=default;se_colorfultab=1;se_entity22=0;tp_m_intro_re_topic=1;top_universalebook=1;zr_training_boost=false;zr_ans_rec=gbrank;se_whitelist=0;se_searchvideo=3;li_video_section=0;li_vip_verti_search=0;zr_topic_rpc=0;zr_rec_answer_cp=open;se_cla_v2=1;se_col_boost=0;se_v_v005=0;top_ebook=0;zr_search_topic=0;tp_sft=a;tsp_ad_cardredesign=0;li_paid_answer_exp=0;tsp_ios_cardredesign=0;pf_newguide_vertical=0;ls_video_commercial=0;tp_header_style=1;se_v040=0;zw_sameq_sorce=999;zr_art_rec=base;se_web0answer=0;se_bsi=0;ls_videoad=2;li_svip_tab_search=1;zr_test_aa1=0;se_multi_images=0;tp_club_qa_entrance=1;li_yxzl_new_style_a=1;se_sim_bst=1;se_bert_eng=0;tp_club_fdv4=0;soc_notification=1;ug_follow_topic_1=2;li_car_meta=0;li_panswer_topic=0;zr_search_sim2=0;tp_club_entrance=1;ug_newtag=1;li_answer_card=0;tp_movie_ux=0;se_sug_term=0;pf_noti_entry_num=0;se_mobilecard=0;se_cardrank_3=0;se_oneboxtopic=1;se_v045=0;tp_club_feed=0;tp_topic_tab_new=0-0-0;se_adsrank=4;tp_contents=2;se_v_rate=0;se_major=0;tp_meta_card=0;tp_topic_style=0;top_hotcommerce=1;li_catalog_card=1;se_searchwiki=0;se_ffzx_jushen1=0;se_v040_2=0;pf_creator_card=1;pf_profile2_tab=0;li_ebook_gen_search=0;tp_club_top=0;pf_foltopic_usernum=0;li_viptab_name=0;tp_topic_tab=0;tp_club_bt=0;se_wil_act=0;se_content0=1;se_vdnn_4=0;se_v_v006=0;se_videobox=0;soc_feed_intelligent=0;top_root=0;ls_fmp4=0;zr_slot_training=1;qap_question_author=0;zr_search_paid=1;zr_search_sims=0;se_hi_trunc=0;tp_club__entrance2=1;ls_recommend_test=0',
    'x-zse-86': '1.0_a720UDu0k8YxUC28Zw2qUJuqHU2Ygu28B0xqFruBoH2p',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'x-zse-83': '3_2.0',
    'accept': '*/*',
    'origin': 'https://zhuanlan.zhihu.com',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://zhuanlan.zhihu.com/wangzhenotes',
    'accept-language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
}

response = requests.get('https://www.zhihu.com/api/v4/columns/wangzhenotes/items', headers=headers)

print(response.json())

To parse the JSON response in Scrapy, you can use the following code:

import json

# response.text is the decoded body; body_as_unicode() is deprecated in recent Scrapy releases
j_obj = json.loads(response.text)



Source: https://stackoverflow.com/questions/62686000/scrapy-crawler-is-being-blocked-and-gets-404
