Reading a particular line from a webpage in Python

Submitted by 霸气de小男生 on 2019-12-06 21:22:32

Use XPath. It's exactly what we need.

XPath, the XML Path Language, is a query language for selecting nodes from an XML document.

The lxml Python library will help us with this. It is one of several options; libxml2, ElementTree, and PyXML are others.

Using XPath

Something like the following, based on your existing code, will work:

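from urllib.request import urlopen

import numpy as np
from lxml import html

ids = np.arange(1, 5)  # page ids 1 through 4; renamed to avoid shadowing the builtin id()
for n in ids:
    # Fetch the raw page, then parse it into an element tree
    response = urlopen("http://www.cv.edu/id={}".format(n))
    page = response.read()
    tree = html.fromstring(page)
    # Print the text of the first <b> element on the page
    print(tree.xpath("//b/text()")[0])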

The XPath query //b/text() says "get the text from the <b> elements on the page". The tree.xpath function call returns a list, and we select its first item using [0]. Easy.
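Because tree.xpath returns a plain Python list, indexing with [0] raises an IndexError on any page that has no <b> element. A minimal sketch of a guard follows; first_bold_text is a hypothetical helper name, not part of lxml, and the HTML strings are made-up test input.

```python
from lxml import html


def first_bold_text(page_source, default=None):
    """Return the text of the first <b> element, or `default` if there is none.

    `first_bold_text` is an illustrative helper, not part of lxml.
    """
    tree = html.fromstring(page_source)
    matches = tree.xpath("//b/text()")  # always a list, possibly empty
    return matches[0] if matches else default


print(first_bold_text("<html><body><b>Old car</b><b>CDS</b></body></html>"))
print(first_bold_text("<html><body><p>no bold here</p></body></html>", default="(none)"))
```

The first call prints the text of the first match only; the second falls back to the supplied default instead of raising.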

An aside about Requests

The Requests library is a popular, higher-level way to fetch webpages in Python. It may save you some headaches later.

The complete program might look like this:

from lxml import html
import requests

for nn in range(1, 6):
    page = requests.get("http://www.cv.edu/id=%d" % nn)
    tree = html.fromstring(page.text)
    print(tree.xpath("//b/text()")[0])
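If some of the pages are missing or slow, the loop above stops at the first failure. One possible way to keep going is sketched below. The helper names fetch_first_bold and fetch_all and the 10-second timeout are assumptions made for illustration; requests.get, raise_for_status, and requests.RequestException are the library's own.

```python
from lxml import html
import requests


def fetch_first_bold(url, timeout=10):
    """Fetch `url` and return the first <b> text, or None if the page has none.

    Hypothetical helper: the name and the 10-second timeout are assumptions.
    """
    page = requests.get(url, timeout=timeout)
    page.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    matches = html.fromstring(page.text).xpath("//b/text()")
    return matches[0] if matches else None


def fetch_all(urls):
    """Map each URL to its first <b> text; failed requests are recorded as None."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch_first_bold(url)
        except requests.RequestException as exc:
            # Covers connection errors, timeouts, bad URLs, and HTTP error codes
            print("skipping {}: {}".format(url, exc))
            results[url] = None
    return results
```

Calling fetch_all("http://www.cv.edu/id=%d" % nn for nn in range(1, 6)) would cover the same five pages as the loop above.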

Caveats

The URLs didn't work for me, so you might have to tinker a bit. The concept is sound, though.

Reading from the webpages aside, you can use the following to test the XPath:

from lxml import html

tree = html.fromstring("""<html>
    <head>
        <link rel="stylesheet">
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")

print(tree.xpath("//b/text()")[0])  # "Old car"
Michael Bird

If you need to do this across many webpages with differing markup, you might find BeautifulSoup helpful.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

As shown at the bottom of the Quick Start section of that documentation, you can extract all the text from the page and then take whichever line you are interested in.
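A rough sketch of that idea, using BeautifulSoup's get_text and then indexing into the resulting lines. The helper name text_line and the sample HTML are made up for illustration, and the 1-based numbering of non-empty lines is an assumption about what "line" should mean here.

```python
from bs4 import BeautifulSoup


def text_line(page_source, line_number):
    """Return the `line_number`-th non-empty line of page text (1-based).

    `text_line` is an illustrative helper; it returns None if out of range.
    """
    soup = BeautifulSoup(page_source, "html.parser")
    # get_text() with a newline separator puts each text fragment on its own line
    lines = [ln.strip() for ln in soup.get_text("\n").splitlines() if ln.strip()]
    if 1 <= line_number <= len(lines):
        return lines[line_number - 1]
    return None


sample = "<html><body><b>Old car</b><br><sup>13</sup>CO<br>ID: 02910<br></body></html>"
print(text_line(sample, 1))
print(text_line(sample, 4))
```

Note that with this separator, inline tags such as <sup> also start a new line, so "13" and "CO" count as separate lines in the sample. You may need to adjust the splitting to match how the page looks in a browser.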

Keep in mind that this only works for text present in the HTML itself. Some webpages use JavaScript extensively, and Requests/BeautifulSoup will not be able to read content generated by JavaScript.

Using Requests and BeautifulSoup - Python returns tag with no text

See also an issue I have had in the past, which was clarified by user avi: Want to pull a journal title from an RCSB Page using python & BeautifulSoup
