Question
I'm currently developing a "crawler" for educational reasons.
Everything is working fine: I can extract URLs and information and save them to a JSON file. Everything is fine and dandy... EXCEPT
the page has a "load more" button that I NEED to interact with in order for the crawler to continue looking for more URLs.
This is where I could use you amazing guys & girls!
Any recommendations on how to do this?
I would like to interact with the "load more" button and re-send the HTML information to my crawler.
I would really appreciate any amount of help from you guys!
Website: http://virali.se/photo/gallery/
A bit of example code for finding business names:
    import re

    import requests
    from bs4 import BeautifulSoup

    def base_spider(self, max_pages, max_CIDS):
        url = "http://virali.se/photo/gallery/photog/"  # Input URL
        for pages in range(0, max_pages):
            source_code = requests.get(url)  # gets the source_code from the URL
            plain_text = source_code.text  # Pure text transform for BeautifulSoup
            soup = BeautifulSoup(plain_text, "html.parser")  # Use HTML parser to read the plain_text var
            for article in soup.find_all("article"):
                business_name_pattern = re.compile(r"<h1>(.*?)</?h1>")
                business_name_raw = str(re.findall(business_name_pattern, str(article)))
                # Raw string avoids invalid-escape warnings for \[ and \]
                business_name_clean = re.sub(r"[\[\]'\"]", "", business_name_raw)
                self.myprint(business_name_clean)  # custom print function for weird chars
This code only looks for the business names, but of course it will run out of business names to find if the "show more results" button on the page is not interacted with.
Answer 1:
If you look at the site with a developer tool (I used Chrome), you can see that an XHR POST request is fired when you click the "Show more results" button.
In this case you can emulate this request to gather the data:
    import requests

    with requests.Session() as session:
        response = session.post("http://virali.se/photo/gallery/search", data={'start': 0})
        print(response.content)
The "magic" is in the data parameter of session.post: it is the required argument to load the images from this offset. In the example above, 0 is the first bunch of images you see by default on the site.
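To get past the first bunch, the same request can be repeated with a growing offset. Here is a minimal sketch of that loop. It rests on assumptions I have not checked against the site: that start counts items, so it advances by a fixed page_size that you would read off the XHR calls in the developer tools, and that an empty response body means there are no more results. The function name collect_pages and its parameters are made up for illustration.

```python
SEARCH_URL = "http://virali.se/photo/gallery/search"  # endpoint used in the answer


def collect_pages(session, page_size, max_pages):
    """Request successive offsets and return the raw HTML of each page.

    `session` is anything with a requests-style `post(url, data=...)` method.
    `page_size` is an assumption: the number of items returned per request.
    """
    pages = []
    for page in range(max_pages):
        start = page * page_size          # offset of the first item on this page
        response = session.post(SEARCH_URL, data={"start": start})
        response.raise_for_status()       # raise on HTTP errors
        if not response.content.strip():  # assumed stop condition: empty body
            break
        pages.append(response.content)
    return pages
```

With the real site you would pass a requests.Session() as session and the step size you observed as page_size. If the site signals the end of the results differently, the stop condition needs to be adapted.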
And you can parse response.content with BeautifulSoup.
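As a rough sketch of that parsing step: the snippet below pulls the business names out of an HTML fragment by reading the h1 inside each article element, instead of running a regex over the stringified tag. It assumes the XHR response uses the same article/h1 markup as the gallery page in the question, which I have not verified. The sample fragment and the function name extract_business_names are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_business_names(html):
    """Return the text of the first <h1> inside each <article> element."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for article in soup.find_all("article"):
        heading = article.find("h1")
        if heading is not None:  # skip articles without a heading
            names.append(heading.get_text(strip=True))
    return names


# Hypothetical fragment standing in for response.content
sample = b"<article><h1>Studio A</h1></article><article><h1> Studio B </h1></article>"
print(extract_business_names(sample))
```

Passing response.content (bytes) or response.text (str) both work, since BeautifulSoup accepts either.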
I hope this helps you get started. The example uses Python 3, but it can be solved in Python 2 in the same manner (without using the with construct).
Source: https://stackoverflow.com/questions/32246714/need-to-scrape-information-from-a-webpage-with-a-show-more-button-any-recomme