Question
I have an IMDb lists URL that I want to parse; I refer to it as base_url. I have searched online extensively but could not find anyone who managed to log in to IMDb, probably because the FormRequest formdata needs almost 10 items, or because of some other complexity. I need to log in to IMDb before parsing, and the login is not working at all. I strongly suspect there are multiple errors in this code that will surface once the currently active error is fixed, so please be patient while this gets resolved. Here is what I have and what the terminal is giving me.
The starting code is below. The logic is: if already logged in, go on to parse base_url; otherwise submit the login form.
import scrapy
from scrapy.http import FormRequest


class lisTopSpider(scrapy.Spider):
    name = 'imdbLog'
    allowed_domains = ['imdb.com']
    start_urls = [
        'https://www.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.imdb.com/registration/ap-signin-handler/imdb_us&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=imdb_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRwczovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0&tag=imdbtag_reg-20'
    ]

    def parse(self, response):
        # Hidden inputs of the sign-in form
        token = response.xpath('//form/input[@name="appActionToken"]/@value').get()
        appAction = response.xpath('//form/input[@name="appAction"]/@value').get()
        siteState = response.xpath('//form/input[@name="siteState"]/@value').get()
        openid = response.xpath('//form/input[@name="openid.return_to"]/@value').get()
        prevRID = response.xpath('//form/input[@name="prevRID"]/@value').get()
        workflowState = response.xpath('//form/input[@name="workflowState"]/@value').get()
        create = response.xpath('//input[@name="create"]/@value').get()
        metadata1 = response.xpath('//input[@name="metadata1"]/@value').get()
        base_url = 'https://www.imdb.com/lists/tt0120852'
        if 'login' in response.url:
            # Logged in: go on to the lists page
            return scrapy.Request(base_url, callback=self.listParse)
        else:
            # Not logged in yet: submit the sign-in form
            return FormRequest.from_response(response, formdata={
                'appActionToken': token,
                'appAction': appAction,
                'siteState': siteState,
                'openid.return_to': openid,
                'prevRID': prevRID,
                'workflowState': workflowState,
                'email': '.......@.....com',
                'create': create,
                'password': '........',
                'metadata1': metadata1
            }, callback=self.parse)
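As a side check on the 'login' in response.url test: the siteState query parameter in the start URL looks like URL-safe Base64. The sketch below decodes it with the standard library to see where the sign-in flow is asked to redirect afterwards. decode_site_state is a hypothetical helper name, and treating the value as Base64-encoded JSON is an assumption that happens to hold for this particular string.

```python
import base64
import json

# siteState value copied from the start URL above.
SITE_STATE = (
    "eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRw"
    "czovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9"
)


def decode_site_state(value: str) -> dict:
    """Decode a URL-safe Base64 JSON blob (hypothetical helper).

    '=' padding is added when the length is not a multiple of 4.
    """
    padded = value + "=" * (-len(value) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


state = decode_site_state(SITE_STATE)
print(state["redirectTo"])  # prints https://www.imdb.com/?ref_=login
```

If the sign-in succeeds and the redirect is followed, this redirectTo URL is presumably the one the 'login' in response.url branch is waiting for.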
Next, I have the following test print code:
#Test Prints
print('token:'+token)
print('appAction:'+appAction)
print('siteState:'+siteState)
print('openId:'+openid)
print('prevRID:'+prevRID)
print('workflowState:'+workflowState)
print('create:'+create)
print(metadata1)
The test print results in the VS Code terminal are as follows:
token:scIhj2FOCtxr39z7eUIj2FWeNOWxtIwj3D
appAction:SIGNIN
siteState:ape:ZXlKdmNHVnVhV1F1WVhOemIyTmZhR0Z1Wkd4bElqb2lhVzFrWWw5MWN5SXNJbkpsWkdseVpXTjBWRzhpT2lKb2RIUndjem92TDNkM2R5NXBiV1JpTG1OdmJTOF9jbVZtWHoxc2IyZHBiaUo5
openId:ape:aHR0cHM6Ly93d3cuaW1kYi5jb20vcmVnaXN0cmF0aW9uL2FwLXNpZ25pbi1oYW5kbGVyL2ltZGJfdXM=
prevRID:ape:Qzk5NEUwNjJLOFBSUzVHQktUQ1c=
workflowState:eyJ6aXAiOiJERUYiLCJlbmMiOiJBMjU2R0NNIiwiYWxnIjoiQTI1NktXIn0.r2wu9Fca1h4JT_iSzKWG4FT_F6SxlriABvmYnEdAKpkSdMYGVZbJNw.HdQghAIsJKttKgLN.jS-NSMh66f7pBmcLN07pzHwgz2oc1D2SWGntJUxY1yqNl9PT2v7BDCP-A3p4ao7_TBjJXrJuhZo3Az8DKd2GKS77TPDM8e1FLtaAUEYzzlpoTD7D9dTie-E0ig1h9TNqSniyKs9NMuufKscAqKixn9tArddoMqhAuzpOAvFV9CmRLG5AYnonPSBTE4GgST7BQ3l6IpMMRCaTGJMofGwbhzyYwtEJJkIl5zMx47wrgILy6QO9SL0z-zPbMCCtzZ-75gwd-UeuF5h7wSSR3_UaQjtWaxBaShHVTpP5DvuT.E_coQgOJ93WNKJvv53n3bw
create:0
None
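The final None means the XPath for metadata1 matched nothing in the HTML that was downloaded. One way to see which expected inputs are absent from the static markup is sketched below. It uses only the standard library so it runs without Scrapy; SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the sign-in page (the form name and values are made up), and InputCollector and missing_fields are illustrative names.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name -> value for every named <input> tag in the markup."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr = dict(attrs)
            if "name" in attr:
                self.fields[attr["name"]] = attr.get("value")


# Hypothetical stand-in for the page as served without JavaScript.
SAMPLE_HTML = """
<form name="signIn" method="post" action="/ap/signin">
  <input type="hidden" name="appActionToken" value="abc123">
  <input type="hidden" name="appAction" value="SIGNIN">
  <input type="hidden" name="create" value="0">
  <input type="email" name="email">
  <input type="password" name="password">
</form>
"""


def missing_fields(html, expected):
    """Return the expected input names that do not occur in the HTML."""
    collector = InputCollector()
    collector.feed(html)
    return [name for name in expected if name not in collector.fields]


print(missing_fields(SAMPLE_HTML, ["appActionToken", "create", "metadata1"]))
# prints ['metadata1']
```

Inside a spider, the same check could be run on response.text instead of the sample string.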
The following code is supposed to parse base_url once the login succeeds.
    def listParse(self, response):
        listsLinks = response.xpath('//div[2]/strong')
        for link in listsLinks:
            list_url = response.urljoin(link.xpath('.//a/@href').get())
            yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})
        next_page_url = response.xpath('//a[@class="flat-button next-page "]/@href').get()
        if next_page_url is not None:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.listParse)

    # Link of each list
    def parse_list(self, response):
        list_url = response.meta['list_url']
        myRatings = response.xpath('//div[@class="ipl-rating-star small"]/span[2]/text()').getall()
        yield {
            'list': list_url,
            'ratings': myRatings,
        }
I have seen many different terminal messages. Each time I change the code I get a different error, such as "Form element not present". Currently I am getting a new error; the relevant section is as follows:
2020-05-05 15:31:56 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.imdb.com/ap/signin> (referer: https://www.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.imdb.com/registration/ap-signin-handler/imdb_us&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=imdb_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRwczovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0&tag=imdbtag_reg-20)
2020-05-05 15:31:56 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.imdb.com/ap/signin> (referer: https://www.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.imdb.com/registration/ap-signin-handler/imdb_us&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=imdb_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRwczovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0&tag=imdbtag_reg-20)
Traceback (most recent call last):
File "c:\users\abdul\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Python Projects\Scrapy\imdbscrappervenv\imdb_project\imdb_project\spiders\Loginner.py", line 23, in parse
print('token:'+token)
TypeError: can only concatenate str (not "NoneType") to str
2020-05-05 15:31:57 [scrapy.core.engine] INFO: Closing spider (finished)
Please suggest what is wrong with the code. Thanks
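The traceback above is raised while processing the POST response, which suggests parse ran a second time (the FormRequest uses callback=self.parse) and that the appActionToken XPath matched nothing on that page, so .get() returned None. Concatenating a str with None raises exactly this TypeError. A minimal sketch of None-safe debug output follows; describe is a hypothetical helper, and the sample values are assumptions chosen to mimic that second pass.

```python
def describe(label, value):
    """Format a label/value pair without '+' concatenation, so None is fine."""
    return f"{label}: {value!r}"


# Assumed values on the second pass through parse(): the token selector
# matched nothing, so .get() gave None.
token = None
app_action = "SIGNIN"

print(describe("token", token))
print(describe("appAction", app_action))

# Reproduce the original failure for comparison.
try:
    "token:" + token
except TypeError as exc:
    print("reproduced:", exc)
```

Printing with repr also makes it obvious whether a value is the string 'None' or the object None.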
UPDATE1: String concatenation error removed. I changed print('token:'+token) to print(token), because token was None and a str cannot be concatenated with None. That is resolved, and now I get the following error in the terminal:
2020-05-05 16:08:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.imdb.com/registration/ap-signin-handler/imdb_us&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=imdb_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRwczovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0&tag=imdbtag_reg-20> (referer: None)
2020-05-05 16:08:02 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.imdb.com/ap/signin> (referer: https://www.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.imdb.com/registration/ap-signin-handler/imdb_us&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=imdb_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRwczovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0&tag=imdbtag_reg-20)
2020-05-05 16:08:02 [scrapy.core.scraper] ERROR: Spider error processing <POST https://www.imdb.com/ap/signin> (referer: https://www.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.imdb.com/registration/ap-signin-handler/imdb_us&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=imdb_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRwczovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0&tag=imdbtag_reg-20)
Traceback (most recent call last):
File "c:\users\abdul\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Python Projects\Scrapy\imdbscrappervenv\imdb_project\imdb_project\spiders\Loginner.py", line 37, in parse
},callback=self.parse)
File "c:\users\abdul\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\http\request\form.py", line 48, in from_response
form = _get_form(response, formname, formid, formnumber, formxpath)
File "c:\users\abdul\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\http\request\form.py", line 83, in _get_form
raise ValueError("No <form> element found in %s" % response)
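FormRequest.from_response raises this ValueError when the response it is given contains no <form> element. Here that response is the result of the POST, because parse is also the callback of the login request. The sketch below shows one way the decision could be split into three outcomes before from_response is ever called. It is plain Python on a URL and an HTML string so it runs without Scrapy; next_step and its return labels are hypothetical, and the 'login' substring test simply mirrors the check already used in parse.

```python
import re

# Matches an opening <form ...> tag, case-insensitively.
FORM_RE = re.compile(r"<form\b", re.IGNORECASE)


def next_step(url: str, html: str) -> str:
    """Decide what a login callback should do with a response (sketch)."""
    if "login" in url:
        return "crawl_lists"   # same test as in parse(): treat as logged in
    if FORM_RE.search(html):
        return "submit_form"   # a form exists, so from_response can find it
    return "give_up"           # no form: from_response would raise ValueError


print(next_step("https://www.imdb.com/?ref_=login", ""))
print(next_step("https://www.imdb.com/ap/signin", "<form action='/ap/signin'></form>"))
print(next_step("https://www.imdb.com/ap/signin", "<html><body>no form here</body></html>"))
```

In a spider, the third branch would be the place to log response.text and stop, rather than re-entering the form-submission path.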
HELPFUL INFO: I loaded the login page with JavaScript disabled and found that one element, 'metadata1', did not appear in the inspector. That suggests it is generated by JavaScript. Does that mean I can't log in with Scrapy? Is there any shortcut, if possible?
Source: https://stackoverflow.com/questions/61611585/scrapy-formrequest-login-to-imdb-involving-javascript-form-field