Why does an XPath copied from Chrome not work?

Posted by 隐身守侯 on 2021-02-08 11:19:27

Question


I am trying to scrape data from Web of Science.

And here is the specific page I am going to work with.

Below is the code I use to extract the abstract:

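import requests
from lxml import etree  # etree must be imported explicitly; "import lxml" alone does not expose it

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
d = s.get(url)
# Parse the downloaded HTML into an element tree
soup1 = etree.HTML(d.text)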

And here is the XPath I got from Chrome's "Copy XPath" option:

//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()

So I tried to get the abstract like this:

path = '//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()'
print(soup1.xpath(path))

However, I just got an empty list! Then I tried another way to test the XPath.

First, I saved the page as a local HTML file:

with open('1.html', 'w', encoding='UTF-8') as f:
    f.write(d.text)

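Then I opened the file:

from requests_file import FileAdapter  # assumption: FileAdapter comes from the requests-file package

s.mount('file://', FileAdapter())
d = s.get('file:///K:/single_paper.html')
soup2 = etree.HTML(d.text)
soup2.xpath('//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()')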

And it gives me the abstract I want! Could anyone tell me why that happens?

Weirdly, when I repeat these steps with another page saved as a local file, it returns an empty list again!

I checked that the XPath given by Chrome is the same for these two pages.

So could anyone tell me what's wrong with my code and how to fix it?


Answer 1:


The full XPaths that browsers generate are usually unhelpful because they depend on the exact nesting of the page. Prefer short, relative expressions anchored on attributes (such as id or class) or on other identifying features, for example contains(@href, 'image').
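As a rough illustration of that advice, here is a minimal sketch run against a small inline HTML snippet instead of the live page. The markup in SAMPLE is hypothetical and only borrows the records_form id and block-record-info class that appear in this thread; the real Web of Science page will differ.

```python
from lxml import html

# Hypothetical markup for illustration only; not the real page structure.
SAMPLE = """
<html><body>
  <form id="records_form">
    <div class="wrapper">
      <div class="block-record-info"><p>Title text</p></div>
      <div class="block-record-info"><p>Abstract text</p></div>
      <a href="/static/image/cover.png">cover</a>
    </div>
  </form>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Positional path in the style of a browser's "Copy XPath":
# it only matches if every level of nesting is exactly as expected.
brittle = tree.xpath('//*[@id="records_form"]/div/div/div/div[1]/div/div[4]/p/text()')

# Attribute-based path: the parentheses make [2] select the second
# matching div in document order, wherever it is nested.
by_class = tree.xpath('(//div[@class="block-record-info"])[2]/p/text()')

# Matching on part of an attribute value with contains().
by_href = tree.xpath('//a[contains(@href, "image")]/text()')

print(brittle)
print(by_class)
print(by_href)
```

On this sample the positional path matches nothing because the nesting is shallower than it assumes, while the two attribute-based expressions still find their targets.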

You could try a more specific XPath expression, (//div[@class="block-record-info"])[2]/p/text(), and rewrite your code like this:

import requests
from lxml import html

url = 'https://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=2&SID=Q1yAnqE4al4KxALF7RM&page=1&doc=3&cacheurlFromRightClick=no'
s = requests.Session()
r = s.get(url)
tree = html.fromstring(r.content)
element = tree.xpath('(//div[@class="block-record-info"])[2]/p/text()')
print(element)

Output:



Source: https://stackoverflow.com/questions/43090530/why-xpath-derived-from-chrome-does-not-work
