Python scraping Reuters site: bad XPath?

Submitted by 浪尽此生 on 2020-01-05 12:16:48

Question


I am trying to do something that appeared to be simple: scrape the company names in the Reuters list at this link:

http://www.reuters.com/finance/markets/index?symbol=us!spx&sortBy=&sortDir=&pn=

However, I just can't access the company names. After playing around with a lot of XPath queries, I still have problems accessing the table. I am trying to grab names such as "3M Company" and "Abbott Laboratories".

Here are snippets of code I have used:

scrape = []
companies =[]
import lxml
import lxml.html
import lxml.etree

urlbase = 'http://reuters.com/finance/markets/index?symbol=us!spx&sortBy=&sortDir=&pn='
for i in range(1:18):
    url = urlbase+str(i)
    content = lxml.html.parse(url)
    item = content.xpath('XPATH HERE')
    ticker = [thing.text for thing in item]

Here are the XPaths I have been playing with:

'//*[@id="topContent"]/div/div[2]/div[1]/table/tr[2]/td[1]/a'
'//*[@id="topContent"]/div/div[2]/div[1]/table/tbody/tr[2]/td[1]/a'
'/html/body/div[3]/div[3]/div/div[2]/div/table/tbody/tr[3]/td/a'
'/html/body/div[3]/div[3]/div/div[2]/div/table/tr[3]/td/a'

I have also tried accessing that particular table through '//table[@class="dataTable sortable"]', but have not had any luck.

Can anyone help? I feel like this is something that someone who knows what they are doing will be able to fix rather quickly. Thanks!


Answer 1:


The page you're trying to scrape has a form inside the table, so the rows are children of that form element rather than direct children of the table. The correct XPath should be '//table[@class="dataTable sortable"]/form/tr/td[1]/a'
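To see why the extra /form step matters, here is a minimal offline sketch. It is an illustration under assumptions, not the real page: the markup is hand-written, well-formed, and only mimics the table-form-row nesting described above, and it uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) instead of lxml so that it runs without network access. The helper name `names` and the `SAMPLE` string are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written markup that mimics the structure described
# above: the rows are children of a <form> placed directly in the <table>.
SAMPLE = """
<html><body>
  <table class="dataTable sortable">
    <form>
      <tr><th>Name</th><th>Other</th></tr>
      <tr><td><a href="#">3M Company</a></td><td>-</td></tr>
      <tr><td><a href="#">Abbott Laboratories</a></td><td>-</td></tr>
    </form>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE)


def names(xpath):
    """Return the text of every element matched by the given path."""
    return [a.text for a in root.findall(xpath)]


# Path that steps through the <form>: matches the company links.
with_form = names(".//table[@class='dataTable sortable']/form/tr/td[1]/a")

# Paths that expect <tr> (or <tbody>) directly under <table>: match nothing.
without_form = names(".//table[@class='dataTable sortable']/tr/td[1]/a")
with_tbody = names(".//table[@class='dataTable sortable']/tbody/tr/td[1]/a")

print(with_form)
print(without_form)
print(with_tbody)
```

Only the path that goes through form finds the links. Paths copied from a browser's inspector often include tbody or skip the form, because the browser may rearrange the document before showing it, so they can fail against the raw HTML that the parser sees.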

Also, you probably have a typo in your code: it should be range(1,18) instead of range(1:18). Here's the final code that works on my side:

import lxml.html

scrape = []
companies = []

urlbase = 'http://reuters.com/finance/markets/index?symbol=us!spx&sortBy=&sortDir=&pn='
for i in range(1, 18):  # page numbers 1 through 17
    url = urlbase + str(i)
    content = lxml.html.parse(url)  # fetch and parse the page
    # the rows sit inside a <form> within the table
    item = content.xpath('//table[@class="dataTable sortable"]/form/tr/td[1]/a')
    ticker = [thing.text for thing in item]
    print(ticker)
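One detail worth checking in the loop above is the range bound. The sketch below only builds the page URLs, without fetching anything, to show that range(1, 18) stops at page 17. The names `URL_BASE` and `page_urls` are made up for this illustration, and how many pages the listing actually has is not stated in the question.

```python
URL_BASE = 'http://reuters.com/finance/markets/index?symbol=us!spx&sortBy=&sortDir=&pn='

# range(1, 18) yields 1, 2, ..., 17: the stop value itself is excluded.
page_urls = [URL_BASE + str(page) for page in range(1, 18)]

print(len(page_urls))   # number of pages that would be requested
print(page_urls[0])     # first URL, ends with pn=1
print(page_urls[-1])    # last URL, ends with pn=17
```

If the listing turns out to have 18 pages, the loop would need range(1, 19) to reach the last one.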


Source: https://stackoverflow.com/questions/10907469/python-scraping-reuters-site-bad-xpath
