I'm trying to read the first line of text from a webpage into a variable in Python. At the moment I'm using urlopen to fetch the whole page for each link I want to read. How do I read only the first line of text on the webpage?
My code:
import urllib2
import numpy as np

line_number = 10
id = (np.arange(1,5))
for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
I want to extract the text "Old car" from the following HTML code of the webpage:
<html>
<head>
<link rel="stylesheet">
<style>
.norm { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
.norm:Visited { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
.norm:Hover { font-family: arial; font-size: 8.5pt; color : #000000; text-decoration : underline; }
</style>
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>
Use XPath. It's exactly what we need.
XPath, the XML Path Language, is a query language for selecting nodes from an XML document.
The lxml Python library will help us with this. It is one of many options: libxml2, ElementTree, and PyXML are some of the others.
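To get a feel for XPath-style selection without installing anything, here is a minimal sketch using the standard library's ElementTree, which supports only a limited XPath subset (for example, it has no text() step, so the text is read from the .text attribute instead). It is written for Python 3 and assumes a small, well-formed snippet made up for illustration. Real pages like the one in the question are often not well-formed XML, which is where lxml's HTML parser helps.

```python
import xml.etree.ElementTree as ET

# A tiny well-formed snippet, made up for illustration (not the real page).
SNIPPET = "<body><b>Old car</b><br/><p><b>CDS</b></p></body>"

root = ET.fromstring(SNIPPET)

# ".//b" selects every <b> element below the root, in document order.
bold_texts = [b.text for b in root.findall(".//b")]

print(bold_texts[0])  # Old car
```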
Using XPath
Something like the following, based on your existing code, will work:
import urllib2
import numpy as np
from lxml import html

line_number = 10
id = (np.arange(1,5))
for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
    tree = html.fromstring(l)            # parse the raw HTML into an element tree
    print(tree.xpath("//b/text()")[0])   # text of the first <b> element
The XPath query //b/text() basically says "get the text from the <b> elements on a page". The tree.xpath function call returns a list of matches in document order, and we select the first one using [0]. Easy.
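One thing to watch: if a page has no <b> element, tree.xpath returns an empty list and [0] raises an IndexError. A minimal sketch of a guard follows, written for Python 3. The helper name first_or_default is hypothetical (it is not part of lxml), and plain lists stand in for the XPath result so the snippet runs without lxml installed.

```python
def first_or_default(matches, default=None):
    """Return the first item of a result list, or `default` if the list is empty.

    `first_or_default` is a hypothetical helper name, not an lxml function.
    """
    return matches[0] if matches else default


# Plain lists stand in for the value returned by tree.xpath("//b/text()").
print(first_or_default(["Old car", "CDS"]))         # Old car
print(first_or_default([], default="<no match>"))   # <no match>
```

In the loop above, this would be called as first_or_default(tree.xpath("//b/text()")), so that one page without a <b> element does not stop the whole run.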
An aside about Requests
The Requests library is the state-of-the-art when it comes to reading webpages in code. It may save you some headaches later.
The complete program might look like this:
from lxml import html
import requests

for nn in range(1, 5):  # ids 1 to 4, matching np.arange(1,5) above
    page = requests.get("http://www.cv.edu/id=%d" % nn)
    tree = html.fromstring(page.text)
    print(tree.xpath("//b/text()")[0])
Caveats
The URLs didn't work for me, so I could not test against the real pages and you might have to tinker a bit. The concept is sound, though.
Reading from the webpages aside, you can use the following to test the XPath:
from lxml import html
tree = html.fromstring("""<html>
<head>
<link rel="stylesheet">
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")
print(tree.xpath("//b/text()")[0])  # "Old car"
If you are going to do this on many different webpages that might be written differently, you might find that BeautifulSoup is helpful.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
As you can see at the bottom of quick start, it should be possible for you to extract all the text from the page and then take whatever line you are interested in.
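The same idea (extract all the text, then pick the line you want) can be sketched with only the standard library's html.parser, in case BeautifulSoup is not installed. This is a stand-in for illustration, not BeautifulSoup's API: the class name TextLines is hypothetical, the code is written for Python 3, and the sample is an abridged copy of the HTML from the question. Each collected entry is a run of text between tags, not a rendered line.

```python
from html.parser import HTMLParser


class TextLines(HTMLParser):
    """Collect visible text chunks, skipping <style> and <script> content."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.lines = []        # non-empty text chunks in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.lines.append(text)


# Abridged copy of the page from the question.
SAMPLE = """<html>
<head>
<link rel="stylesheet">
<style>
.norm { font-family: arial; }
</style>
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>"""

parser = TextLines()
parser.feed(SAMPLE)
parser.close()

print(parser.lines[0])  # Old car
```

Skipping <style> and <script> matters here because the stylesheet rules would otherwise be collected as the first "text" on the page.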
Keep in mind that this will only work for text that is present in the HTML itself. Some webpages use JavaScript extensively, and requests/BeautifulSoup will not be able to read content that is generated by the JavaScript.
Related question: Using Requests and BeautifulSoup - Python returns tag with no text
See also an issue I have had in the past, which was clarified by user avi: Want to pull a journal title from an RCSB Page using python & BeautifulSoup
Source: https://stackoverflow.com/questions/31248245/reading-a-particular-line-from-a-webpage-in-python