Python Requests/BeautifulSoup access to pagination

Submitted by 陌路散爱 on 2019-12-22 09:47:38

Question


I am trying to access different pages of a website to get a list of items (20 per page). One extra parameter has to be sent to select the page, but somehow I am not able to pass it along properly: the parameter has to be sent in the body of the request. I tried both params and data without any success. What is the proper way to add something to the "body" of a request?

Here is what I have. It returns the first page six times.

import requests
from bs4 import BeautifulSoup
import time

def SP_STK_LIST(numpage):
    payload = {}
    for i in range(0, numpage):
        payload['ct100$Contenido$GoPag'] = i
        header = {'Content-Type': 'text/html; charset=utf-8'}
        req = requests.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header, params = payload)
        print(req.url)
        page = req.text
        soup = BeautifulSoup(page)            
        table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr"):
            print(row)
        time.sleep(1)

SP_STK_LIST(6)

I don't think I clearly understand the differences between 'data' and 'params', or even 'files', which I have seen but which does not seem (I think) to relate to my problem.
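To make the difference concrete, here is a minimal sketch that builds the same POST twice without sending anything, using requests.Request(...).prepare(). With params the field ends up in the URL query string and the body stays empty; with data the field is form-encoded into the body. The field name is spelled ctl00$Contenido$GoPag here on the assumption that it shares the ctl00 prefix of the table id used above; the page number 2 is arbitrary.

```python
import requests

URL = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
# Assumed field name (ctl00 prefix, as in the table id); value is arbitrary.
fields = {"ctl00$Contenido$GoPag": 2}

# prepare() builds the request object locally; nothing is sent over the network.
with_params = requests.Request("POST", URL, params=fields).prepare()
with_data = requests.Request("POST", URL, data=fields).prepare()

# params: the field is appended to the URL, the body is empty.
print(with_params.url)
print(with_params.body)

# data: the URL is untouched, the field is form-encoded into the body.
print(with_data.url)
print(with_data.body)
print(with_data.headers["Content-Type"])
```

Note that requests only fills in the form Content-Type when the caller has not supplied one, so passing a hand-written 'Content-Type': 'text/html; charset=utf-8' header together with data would label a form-encoded body as HTML. The files argument is meant for multipart uploads and is probably not needed for this page.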

1st EDIT: I want to thank Selcuk for the quick answer. I managed to get it running on my system, but as jlaur pointed out, it is extremely slow, and despite the "headless" option a command window still opens on the screen. Following jlaur's suggestion, I came up with this (still not working, but I am sure not much is missing):

import requests
from bs4 import BeautifulSoup
import time
import collections

def SPAIN_STK_LIST(numpage):
    payload = collections.OrderedDict()
    for i in range(0, numpage):
        header = {'Content-Type': 'text/html; charset=utf-8'}
        ses = requests.session()
        req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", data = payload)
        print(req.request.body)
        page = req.text
        soup = BeautifulSoup(page)
        # find next __VIEWSTATE and __EVENTVALIDATION
        viewstate = soup.find("input", {"id":"__VIEWSTATE"})['value']
        print("VIEWSTATE: ", viewstate)
        eventval = soup.find("input", {"id":"__EVENTVALIDATION"})['value']
        print("EVENTVALIDATION: ", eventval)        
        payload['__EVENTTARGET'] = ""
        payload['__EVENTARGUMENT'] = ""
        payload['__VIEWSTATE'] = viewstate
        payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
        payload['__EVENTVALIDATION'] = eventval
        payload['ct100$Contenido$GoPag'] = i + 1
        table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            cells = row.findAll("td")
            print(cells[0].find("a").get_text().replace(",","").replace("S.A.", ""))
        time.sleep(1)


SPAIN_STK_LIST(3)

Each page generates __VIEWSTATE and __EVENTVALIDATION values that are stored in hidden input tags on the page. I use them to request the next page. I also used a session as suggested, but this is still not working, even though the request body has exactly the same format (I used an OrderedDict) as the one the browser sends. Any ideas what might be missing?
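As a sketch of the idea described above, the hidden fields can be collected generically instead of one by one, so that any extra hidden input the page adds is carried over as well. The sample HTML below is made up for illustration, and hidden_fields and next_page_payload are hypothetical helper names. The page field defaults to ctl00$Contenido$GoPag, assuming it shares the ctl00 prefix of the table id rather than the ct100 spelling used in the payloads above; whether the server accepts this payload has not been verified.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page, containing only the hidden state fields.
SAMPLE_HTML = """
<form id="aspnetForm">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="65A1DED9" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""


def hidden_fields(html):
    """Return {name: value} for every named hidden <input> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        tag["name"]: tag.get("value", "")
        for tag in soup.find_all("input", {"type": "hidden"})
        if tag.has_attr("name")
    }


def next_page_payload(html, page_number, page_field="ctl00$Contenido$GoPag"):
    """Copy the page's hidden state and add the requested page number."""
    payload = hidden_fields(html)
    payload[page_field] = page_number  # assumed field name, see note above
    return payload


payload = next_page_payload(SAMPLE_HTML, 2)
print(payload)
```

In the loop above, the resulting dictionary would be passed as data= to the next ses.post call. Creating the session once, before the loop, would also let cookies persist between requests.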


Answer 1:


I don't know whether you can use Selenium, but if you are going to interact with the page, you should.

You can install Selenium with pip (pip install selenium).

I used pandas only to display the results. You do not have to use it.

First, download the Chrome driver for Selenium from https://chromedriver.storage.googleapis.com/index.html?path=2.40/ and extract it to your workspace, or specify its location in the executable_path parameter. It's up to you.

This will collect all the data in the table until there is no next page.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"chromedriver.exe",options=options)

driver.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx")
next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
data = []
try:
    while (next_button):    
        soup = BeautifulSoup(driver.page_source,'html.parser')
        table = soup.find('table',{'id':'ctl00_Contenido_tblEmisoras'})
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:            
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]                
            data.append([ele for ele in cols])        
        next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
        next_button.click()
        # Wait for the next page's table to load before parsing it
        time.sleep(2)
except NoSuchElementException:
    print('No more pages to load')

df = pd.DataFrame(columns= ('Name','Sector - Subsector','Market','Indices'), data = data)

print(df.mask(df.eq('None')).dropna())

The output is

 Name                        ...                                                                    Indices
1                               ABENGOA, S.A.                        ...
2              ABERTIS INFRAESTRUCTURAS, S.A.                        ...
3                                ACCIONA,S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
4                              ACERINOX, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
5    ACS,ACTIVIDADES DE CONST.Y SERVICIOS S.A                        ...                                              IBEX 35®, IBEX TOP Dividendo®
6                      ADOLFO DOMINGUEZ, S.A.                        ...
7             ADVEO GROUP INTERNATIONAL, S.A.                        ...
8                           AEDAS HOMES, S.A.                        ...
9                          AENA, S.M.E., S.A.                        ...                                                                   IBEX 35®
10                                  AIRBUS SE                        ...
11                     ALANTRA PARTNERS, S.A.                        ...
12                       ALFA, S.A.B. DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
13                             ALMIRALL, S.A.                        ...
14                     AMADEUS IT GROUP, S.A.                        ...                                                                   IBEX 35®
15              AMERICA MOVIL, S.A.B. DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
16                                AMPER, S.A.                        ...
17                    APERAM, SOCIETE ANONYME                        ...
18                      APPLUS SERVICES, S.A.                        ...
19                        ARCELORMITTAL, S.A.                        ...                                                                   IBEX 35®
20    ATRESMEDIA CORP. DE MEDIOS DE COM. S.A.                        ...                                                        IBEX TOP Dividendo®
21                     AUDAX RENOVABLES, S.A.                        ...
22             AXIARE PATRIMONIO SOCIMI, S.A.                        ...
23              AYCO GRUPO INMOBILIARIO, S.A.                        ...
24                               AZKOYEN S.A.                        ...
25                          AZORA ALTUS, S.A.                        ...
27      BANCO BILBAO VIZCAYA ARGENTARIA, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
28                        BANCO BRADESCO S.A.                        ...                          FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
29                    BANCO DE SABADELL, S.A.                        ...                                                                   IBEX 35®
30                  BANCO SANTANDER RIO, S.A.                        ...                                                     FTSE Latibex All Share
31                      BANCO SANTANDER, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
..                                        ...                        ...                                                                        ...
145                       RENTA 4 BANCO, S.A.                        ...
146       RENTA CORPORACION REAL ESTATE, S.A.                        ...
147                              REPSOL, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
148                               SACYR, S.A.                        ...
149                         SAETA YIELD, S.A.                        ...
150              SARE HOLDING, S.A.B, DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
151             SERVICE POINT SOLUTIONS, S.A.                        ...
152     SIEMENS GAMESA RENEWABLE ENERGY, S.A.                        ...                                                                   IBEX 35®
153                              SNIACE, S.A.                        ...
154    SOLARIA ENERGIA Y MEDIO AMBIENTE, S.A.                        ...
155                               TALGO, S.A.                        ...
157                   TECNICAS REUNIDAS, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
158                          TELEFONICA, S.A.                        ...                                              IBEX 35®, IBEX TOP Dividendo®
159                     TELEPIZZA GROUP, S.A.                        ...
160             TR HOTEL JARDIN DEL MAR, S.A.                        ...
161                             TUBACEX, S.A.                        ...
162                       TUBOS REUNIDOS,S.A.                        ...
163                   TV AZTECA, S.A. DE C.V.                        ...                                   FTSE Latibex TOP, FTSE Latibex All Share
164                       UNICAJA BANCO, S.A.                        ...
165           UNION CATALANA DE VALORES, S.A.                        ...
166                    URBAR INGENIEROS, S.A.                        ...
167              URBAS GRUPO FINANCIERO, S.A.                        ...
168  USINAS SIDERURGICAS DE MINAS GERAIS,S.A.                        ...                          FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
169                                VALE, S.A.                        ...                          FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
170  VERTICE TRESCIENTOS SESENTA GRADOS, S.A.                        ...
171                              VIDRALA S.A.                        ...
172                            VISCOFAN, S.A.                        ...                                                                   IBEX 35®
173                             VOCENTO, S.A.                        ...
174            VOLCAN, COMPAñIA MINERA S.A.A.                        ...                                                     FTSE Latibex All Share
175                        ZARDOYA OTIS, S.A.                        ...

[169 rows x 4 columns]


Source: https://stackoverflow.com/questions/50920242/python-requests-beautifulsoup-access-to-pagination
