Question
I am trying to access different pages of a website to get a list of items (20 per page). There is one extra parameter to send to select the page, but somehow I am not able to pass it along properly - the parameter has to be sent in the body of the request. I tried with params and with data without any success. What is the proper method to add something to the "body" of a request?
Here is what I have. It gives me the first page 6 times.
import requests
from bs4 import BeautifulSoup
import time

def SP_STK_LIST(numpage):
    payload = {}
    for i in range(0, numpage):
        payload['ct100$Contenido$GoPag'] = i
        header = {'Content-Type': 'text/html; charset=utf-8'}
        req = requests.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers=header, params=payload)
        print(req.url)
        page = req.text
        soup = BeautifulSoup(page)
        table = soup.find("table", {"id": "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr"):
            print(row)
        time.sleep(1)

SP_STK_LIST(6)
I don't think I clearly understand the differences between 'data' and 'params', or even 'files', which I have seen but which does not seem (I think) to relate to my problem.
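The difference can be seen without hitting the network by preparing a request and inspecting where the payload ends up (a sketch; the URL is a placeholder, and the field name is the one from the code above):

```python
import requests

payload = {"ct100$Contenido$GoPag": 2}
url = "http://example.com/page.aspx"  # placeholder URL for illustration

# params: the payload is URL-encoded into the query string, like a GET form
with_params = requests.Request("POST", url, params=payload).prepare()

# data: the payload is form-encoded into the request body, which is what
# an ASP.NET form post expects
with_data = requests.Request("POST", url, data=payload).prepare()

print(with_params.url)   # the payload ends up in the URL; the body is empty
print(with_data.body)    # the payload ends up in the body; the URL is unchanged
```

So for a server that reads the page number from the POST body, the payload has to go in `data=`, not `params=`.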
1st EDIT: I want to thank Selcuk for his quick answer. I managed to implement it on my system but, as jlaur pointed out, it is extremely slow, and despite the "headless" option a command box opens on the screen. Using jlaur's suggestion, I came up with this (still not working, but I am sure not much is missing):
import requests
from bs4 import BeautifulSoup
import time
import collections

def SPAIN_STK_LIST(numpage):
    payload = collections.OrderedDict()
    for i in range(0, numpage):
        header = {'Content-Type': 'text/html; charset=utf-8'}
        ses = requests.session()
        req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", data=payload)
        print(req.request.body)
        page = req.text
        soup = BeautifulSoup(page)
        # find next __VIEWSTATE and __EVENTVALIDATION
        viewstate = soup.find("input", {"id": "__VIEWSTATE"})['value']
        print("VIEWSTATE: ", viewstate)
        eventval = soup.find("input", {"id": "__EVENTVALIDATION"})['value']
        print("EVENTVALIDATION: ", eventval)
        payload['__EVENTTARGET'] = ""
        payload['__EVENTARGUMENT'] = ""
        payload['__VIEWSTATE'] = viewstate
        payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
        payload['__EVENTVALIDATION'] = eventval
        payload['ct100$Contenido$GoPag'] = i + 1
        table = soup.find("table", {"id": "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            cells = row.findAll("td")
            print(cells[0].find("a").get_text().replace(",", "").replace("S.A.", ""))
        time.sleep(1)

SPAIN_STK_LIST(3)
Each page generates __VIEWSTATE and __EVENTVALIDATION values that are stored in hidden tags on the page. I use them to go to the next page. I also used a session as suggested, but this is still not working, even though the request body has the exact same format (I used an OrderedDict) as the one the web page sends. Any ideas what could be missing?
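For debugging, the hidden-field scraping can be isolated in a small helper (a sketch; hidden_fields is a made-up name, and it assumes the standard ASP.NET hidden input names used above). One detail worth checking in the code above: the table id uses ctl00 (lowercase L) while the payload key uses ct100, so the page-number field name may simply be misspelled.

```python
from bs4 import BeautifulSoup

# The standard ASP.NET postback fields that must be echoed back with each POST
ASPNET_FIELDS = ("__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE",
                 "__VIEWSTATEGENERATOR", "__EVENTVALIDATION")

def hidden_fields(html):
    """Hypothetical helper (not from the original post): pull the ASP.NET
    postback fields out of a page so they can be re-sent with the next POST."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for name in ASPNET_FIELDS:
        tag = soup.find("input", {"name": name})
        # Fields missing from the page default to "", matching the
        # empty strings used for __EVENTTARGET / __EVENTARGUMENT above
        fields[name] = tag.get("value", "") if tag else ""
    return fields
```

The paging loop would then be: fetch the page once, read the fields with hidden_fields(req.text), add the page-number key, and POST the combined dict via data= on the same session, repeating with the freshly returned fields each time.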
Answer 1:
I don't know if you can use Selenium, but if you are going to interact with the page, you should.
You can install Selenium with pip.
I used pandas for visualisation purposes only; you do not have to use it.
First, download the Chrome driver for Selenium from https://chromedriver.storage.googleapis.com/index.html?path=2.40/ and extract it to your workspace, or specify its location in the executable_path parameter. It's up to you.
This will get all the data in the table until there is no next page.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path=r"chromedriver.exe", options=options)
driver.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx")

next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
data = []
try:
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find('table', {'id': 'ctl00_Contenido_tblEmisoras'})
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]
            data.append([ele for ele in cols])
        # Wait for table to load
        time.sleep(2)
        next_button = driver.find_element_by_id("ctl00_Contenido_SiguientesArr")
        next_button.click()
except NoSuchElementException:
    print('No more pages to load')

df = pd.DataFrame(columns=('Name', 'Sector - Subsector', 'Market', 'Indices'), data=data)
print(df.mask(df.eq('None')).dropna())
The output is:
Name ... Indices
1 ABENGOA, S.A. ...
2 ABERTIS INFRAESTRUCTURAS, S.A. ...
3 ACCIONA,S.A. ... IBEX 35®, IBEX TOP Dividendo®
4 ACERINOX, S.A. ... IBEX 35®, IBEX TOP Dividendo®
5 ACS,ACTIVIDADES DE CONST.Y SERVICIOS S.A ... IBEX 35®, IBEX TOP Dividendo®
6 ADOLFO DOMINGUEZ, S.A. ...
7 ADVEO GROUP INTERNATIONAL, S.A. ...
8 AEDAS HOMES, S.A. ...
9 AENA, S.M.E., S.A. ... IBEX 35®
10 AIRBUS SE ...
11 ALANTRA PARTNERS, S.A. ...
12 ALFA, S.A.B. DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
13 ALMIRALL, S.A. ...
14 AMADEUS IT GROUP, S.A. ... IBEX 35®
15 AMERICA MOVIL, S.A.B. DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
16 AMPER, S.A. ...
17 APERAM, SOCIETE ANONYME ...
18 APPLUS SERVICES, S.A. ...
19 ARCELORMITTAL, S.A. ... IBEX 35®
20 ATRESMEDIA CORP. DE MEDIOS DE COM. S.A. ... IBEX TOP Dividendo®
21 AUDAX RENOVABLES, S.A. ...
22 AXIARE PATRIMONIO SOCIMI, S.A. ...
23 AYCO GRUPO INMOBILIARIO, S.A. ...
24 AZKOYEN S.A. ...
25 AZORA ALTUS, S.A. ...
27 BANCO BILBAO VIZCAYA ARGENTARIA, S.A. ... IBEX 35®, IBEX TOP Dividendo®
28 BANCO BRADESCO S.A. ... FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
29 BANCO DE SABADELL, S.A. ... IBEX 35®
30 BANCO SANTANDER RIO, S.A. ... FTSE Latibex All Share
31 BANCO SANTANDER, S.A. ... IBEX 35®, IBEX TOP Dividendo®
.. ... ... ...
145 RENTA 4 BANCO, S.A. ...
146 RENTA CORPORACION REAL ESTATE, S.A. ...
147 REPSOL, S.A. ... IBEX 35®, IBEX TOP Dividendo®
148 SACYR, S.A. ...
149 SAETA YIELD, S.A. ...
150 SARE HOLDING, S.A.B, DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
151 SERVICE POINT SOLUTIONS, S.A. ...
152 SIEMENS GAMESA RENEWABLE ENERGY, S.A. ... IBEX 35®
153 SNIACE, S.A. ...
154 SOLARIA ENERGIA Y MEDIO AMBIENTE, S.A. ...
155 TALGO, S.A. ...
157 TECNICAS REUNIDAS, S.A. ... IBEX 35®, IBEX TOP Dividendo®
158 TELEFONICA, S.A. ... IBEX 35®, IBEX TOP Dividendo®
159 TELEPIZZA GROUP, S.A. ...
160 TR HOTEL JARDIN DEL MAR, S.A. ...
161 TUBACEX, S.A. ...
162 TUBOS REUNIDOS,S.A. ...
163 TV AZTECA, S.A. DE C.V. ... FTSE Latibex TOP, FTSE Latibex All Share
164 UNICAJA BANCO, S.A. ...
165 UNION CATALANA DE VALORES, S.A. ...
166 URBAR INGENIEROS, S.A. ...
167 URBAS GRUPO FINANCIERO, S.A. ...
168 USINAS SIDERURGICAS DE MINAS GERAIS,S.A. ... FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
169 VALE, S.A. ... FTSE Latibex BRASIL, FTSE Latibex TOP, FTSE La...
170 VERTICE TRESCIENTOS SESENTA GRADOS, S.A. ...
171 VIDRALA S.A. ...
172 VISCOFAN, S.A. ... IBEX 35®
173 VOCENTO, S.A. ...
174 VOLCAN, COMPAñIA MINERA S.A.A. ... FTSE Latibex All Share
175 ZARDOYA OTIS, S.A. ...
[169 rows x 4 columns]
Source: https://stackoverflow.com/questions/50920242/python-requests-beautifulsoup-access-to-pagination