Scrapy crawler is being blocked and gets 404

Question

I'm trying to scrape the page 'https://zhuanlan.zhihu.com/wangzhenotes' with Scrapy, using the configuration from a referenced post, which is reproduced at the end of this post.

This command

scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes'

gets me

2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://zhuanlan.zhihu.com/robots.txt> (referer: None)
2020-07-02 05:50:04 [protego] DEBUG: Rule at line 19 without any user agent to enforce it on.
...
2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://zhuanlan.zhihu.com/wangzhenotes> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10ac98790>
[s]   item       {}
[s]   request    <GET https://zhuanlan.zhihu.com/wangzhenotes>
[s]   response   <200 https://zhuanlan.zhihu.com/wangzhenotes>
...

I suspect the crawler is being blocked, because this command returns only 3:

len(response.xpath('//span'))

Searching for "span" in the page source in the Chrome browser, however, finds over 80 matches, and response.css("h2.ContentItem-title") returns an empty list [].

How do I get those spans?

Here is the configuration I'm using, the same as the one in the referenced post.

class CustomMiddleware(object):
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"


DOWNLOADER_MIDDLEWARES = {
    'projectname.middlewares.CustomMiddleware': 543,
}

Answer 1:

The problem is that the spans and the h2.ContentItem-title elements are not present in the page source. They come from a separate request that the page makes after it loads.
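One way to check this outside the browser is to count the tags in the raw HTML that the server returns, before any JavaScript runs. The sketch below uses only the standard library; `TagCounter`, `count_tags`, and the sample string are illustrative names and data, not part of the original post. In the Scrapy shell the raw HTML would be `response.text`.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags with a given name in a raw HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return how many <tag> elements the raw HTML contains."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up stand-in for the raw page source; not the real Zhihu markup.
sample = "<html><body><div id='root'></div><span>a</span></body></html>"
print(count_tags(sample, "span"))
```

If the count on the raw source is far lower than what the browser's element inspector shows, the missing elements are most likely built client-side from data fetched by another request.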

The following example fetches the data with the requests module; the same approach works in Scrapy as well:

import requests

headers = {
    'authority': 'www.zhihu.com',
    'x-requested-with': 'fetch',
    'x-ab-param': 'top_quality=0;li_topics_search=0;qap_question_visitor= 0;se_click_v_v=0;se_topicfeed=0;se_aa_base=0;se_video200=0;tsp_hotlist_ui=1;zr_expslotpaid=1;pf_fuceng=1;qap_labeltype=1;zr_km_answer=open_cvr;zr_zr_search_sims=0;top_v_album=1;pf_adjust=0;li_svip_cardshow=1;tp_dingyue_video=0;top_test_4_liguangyi=1;li_se_section=1;zr_rel_search=base;zr_training_first=false;tp_discover=0;tp_move_scorecard=0;li_salt_hot=1;zr_intervene=0;zr_slotpaidexp=1;se_college=default;se_colorfultab=1;se_entity22=0;tp_m_intro_re_topic=1;top_universalebook=1;zr_training_boost=false;zr_ans_rec=gbrank;se_whitelist=0;se_searchvideo=3;li_video_section=0;li_vip_verti_search=0;zr_topic_rpc=0;zr_rec_answer_cp=open;se_cla_v2=1;se_col_boost=0;se_v_v005=0;top_ebook=0;zr_search_topic=0;tp_sft=a;tsp_ad_cardredesign=0;li_paid_answer_exp=0;tsp_ios_cardredesign=0;pf_newguide_vertical=0;ls_video_commercial=0;tp_header_style=1;se_v040=0;zw_sameq_sorce=999;zr_art_rec=base;se_web0answer=0;se_bsi=0;ls_videoad=2;li_svip_tab_search=1;zr_test_aa1=0;se_multi_images=0;tp_club_qa_entrance=1;li_yxzl_new_style_a=1;se_sim_bst=1;se_bert_eng=0;tp_club_fdv4=0;soc_notification=1;ug_follow_topic_1=2;li_car_meta=0;li_panswer_topic=0;zr_search_sim2=0;tp_club_entrance=1;ug_newtag=1;li_answer_card=0;tp_movie_ux=0;se_sug_term=0;pf_noti_entry_num=0;se_mobilecard=0;se_cardrank_3=0;se_oneboxtopic=1;se_v045=0;tp_club_feed=0;tp_topic_tab_new=0-0-0;se_adsrank=4;tp_contents=2;se_v_rate=0;se_major=0;tp_meta_card=0;tp_topic_style=0;top_hotcommerce=1;li_catalog_card=1;se_searchwiki=0;se_ffzx_jushen1=0;se_v040_2=0;pf_creator_card=1;pf_profile2_tab=0;li_ebook_gen_search=0;tp_club_top=0;pf_foltopic_usernum=0;li_viptab_name=0;tp_topic_tab=0;tp_club_bt=0;se_wil_act=0;se_content0=1;se_vdnn_4=0;se_v_v006=0;se_videobox=0;soc_feed_intelligent=0;top_root=0;ls_fmp4=0;zr_slot_training=1;qap_question_author=0;zr_search_paid=1;zr_search_sims=0;se_hi_trunc=0;tp_club__entrance2=1;ls_recommend_test=0',
    'x-zse-86': '1.0_a720UDu0k8YxUC28Zw2qUJuqHU2Ygu28B0xqFruBoH2p',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'x-zse-83': '3_2.0',
    'accept': '*/*',
    'origin': 'https://zhuanlan.zhihu.com',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://zhuanlan.zhihu.com/wangzhenotes',
    'accept-language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
}

response = requests.get('https://www.zhihu.com/api/v4/columns/wangzhenotes/items', headers=headers)

print(response.json())

To parse the JSON response in Scrapy, you can use the following code:

import json

# response.text replaces body_as_unicode(), which is deprecated in newer Scrapy releases
j_obj = json.loads(response.text)
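Once the body is decoded, pulling fields out is ordinary dictionary access. The helper below is a minimal sketch: `extract_titles` is a hypothetical name, and the "data" list and "title" key are assumptions about the payload shape that should be checked against a real response.

```python
import json


def extract_titles(payload_text):
    """Return the titles found in a JSON payload given as text."""
    payload = json.loads(payload_text)
    # Assumption: entries live in a top-level "data" list, each with a "title".
    entries = payload.get("data", []) if isinstance(payload, dict) else []
    return [
        entry["title"]
        for entry in entries
        if isinstance(entry, dict) and "title" in entry
    ]


# Made-up sample standing in for response.text; not real API output.
sample = '{"data": [{"title": "First post"}, {"id": 2}, {"title": "Second post"}]}'
print(extract_titles(sample))
```

Inside a Scrapy callback this would be called as `extract_titles(response.text)`.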

Source: https://stackoverflow.com/questions/62686000/scrapy-crawler-is-being-blocked-and-gets-404

Tags

scrapy