Need to scrape information from a webpage with a “show more” button, any recommendations?

为君一笑 submitted on 2019-12-06 08:35:12

Question


I am currently developing a "crawler" for educational reasons.

Everything is working fine: I can extract URLs and information and save it in a JSON file. Everything is fine and dandy... EXCEPT

the page has a "load more" button that I NEED to interact with in order for the crawler to continue looking for more URLs.

This is where I could use you amazing guys & girls!

Any recommendations on how to do this?

I would like to interact with the "load more" button and re-send the HTML information to my crawler.

I would really appreciate any amount of help from you guys!

Website: http://virali.se/photo/gallery/

A bit of example code for finding business names:

import re

import requests
from bs4 import BeautifulSoup


def base_spider(self, max_pages, max_CIDS):
    url = "http://virali.se/photo/gallery/photog/"  # Input URL

    for pages in range(0, max_pages):
        source_code = requests.get(url)  # gets the source_code from the URL
        plain_text = source_code.text  # Pure text transform for BeautifulSoup
        soup = BeautifulSoup(plain_text, "html.parser")  # Use HTML parser to read the plain_text var
        # Parse the articles of each fetched page (this loop belongs inside the page loop)
        for article in soup.find_all("article"):
            business_name_pattern = re.compile(r"<h1>(.*?)</?h1>")
            business_name_raw = str(re.findall(business_name_pattern, str(article)))
            business_name_clean = re.sub(r"[\[\]\'\"]", "", business_name_raw)
            self.myprint(business_name_clean)  # custom print function for weird chars

This code only looks for the business names, but of course it will run out of business names to find if the "show more results" button on the page is not interacted with.


Answer 1:


If you look at the site with a developer tool (I used Chrome), you can see that an XHR POST request is fired when you click the "Show more results" button.

In this case you can emulate this request to gather the data:

import requests

# Replay the XHR POST request that the "Show more results" button fires
with requests.Session() as session:
    response = session.post("http://virali.se/photo/gallery/search", data={'start': 0})
    print(response.content)

The "magic" is in the data parameter of session.post: start is the required argument that tells the server which offset to load images from. In the example above, 0 returns the first batch of images, the ones you see by default on the site.
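To go beyond the first batch, the same request can be repeated with increasing start values. Below is a minimal sketch of that idea, not a tested client for this site: fetch_batches, step and max_batches are hypothetical names, and both the batch size and the assumption that an empty body marks the end of the results would need to be checked in the developer tools.

```python
SEARCH_URL = "http://virali.se/photo/gallery/search"  # endpoint from the example above


def fetch_batches(post, step, max_batches):
    """Collect response bodies for start = 0, step, 2 * step, ...

    `post` is any callable that accepts (url, data=...) and returns an
    object with a `.text` attribute, for example `session.post`.
    """
    batches = []
    for index in range(max_batches):
        response = post(SEARCH_URL, data={"start": index * step})
        body = response.text
        # Assumption: the server sends an empty body once nothing is left.
        if not body.strip():
            break
        batches.append(body)
    return batches
```

With the session from the example above, it could be called as fetch_batches(session.post, step=20, max_batches=5), where 20 is only a guessed batch size.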

You can then parse response.content with BeautifulSoup.
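As a rough illustration of that parsing step, the sketch below pulls the h1 text out of each article element. It assumes the search response uses the same article/h1 markup that the question's code looks for, which is not confirmed here; extract_business_names is a made-up helper name.

```python
from bs4 import BeautifulSoup


def extract_business_names(html):
    """Return the text of the first <h1> inside every <article> element."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for article in soup.find_all("article"):
        heading = article.find("h1")
        if heading is not None:  # skip articles without a heading
            names.append(heading.get_text(strip=True))
    return names
```

Calling it as extract_business_names(response.content) avoids the regular expression and the string clean-up used in the question, since BeautifulSoup already returns the heading text.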

I hope this helps you get started. The example uses Python 3, but it can be solved in the same manner with Python 2 (without using the with construct).



Source: https://stackoverflow.com/questions/32246714/need-to-scrape-information-from-a-webpage-with-a-show-more-button-any-recomme
