Question
I don't know much about HTML. How do you extract just the text from a page? For example, if the HTML page reads:
<meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers">
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>
I just want to extract this:
How can I make money at home online? No gimmicks please? - Yahoo! Answers
I am using the re module:
import re

def striphtml(data):
    # Replace anything that looks like a tag with a space.
    p = re.compile(r'<.*?>')
    return p.sub(' ', data)
but it still isn't doing what I intend.
The function above is called like this:
for lines in filehandle.readlines():
    #k = str(section[6].strip())
    myFile.write(lines)
    lines = striphtml(lines)
    content.append(lines)
Answer 1:
Don't use regular expressions for HTML/XML parsing. Try BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) instead.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('Your resource<title>hi</title>')
soup.title.string # Your title string.
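Applied to the markup from the question, the same idea might look like the sketch below. It assumes the newer BeautifulSoup 4 package (imported from bs4) rather than the BeautifulSoup 3 import shown above, and it uses Python's built-in "html.parser" backend. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# The two lines of markup quoted in the question.
html = (
    '<meta name="title" content="How can I make money at home online? '
    'No gimmacks please? - Yahoo! Answers">\n'
    '<title>How can I make money at home online? '
    'No gimmicks please? - Yahoo! Answers</title>'
)

soup = BeautifulSoup(html, "html.parser")

# Text between <title> and </title>.
title = soup.title.string

# The <meta> tag keeps its text in an attribute, not in a text node,
# so it has to be read from the "content" attribute instead.
meta = soup.find("meta", attrs={"name": "title"})
meta_title = meta["content"]

print(title)
print(meta_title)
```

Note that the two values differ here, because the meta tag in the question spells the word as "gimmacks".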
Answer 2:
Use an HTML parser for that. BeautifulSoup is one option.
To get text content of the page:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(your_html)
text_nodes = soup.findAll(text = True)
result = ' '.join(text_nodes)
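If installing a third-party package is not an option, a similar result can be sketched with the html.parser module from Python's standard library. This is a swapped-in alternative, not BeautifulSoup; the class name TextExtractor and the helper extract_text are hypothetical names chosen for this example, and skipping script and style contents is an added assumption about what counts as page text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return all visible text nodes of `html`, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>hi</title><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
print(extract_text(sample))
```

Unlike the regex in the question, this does not try to match tags textually, so attribute values and stylesheet contents do not leak into the output.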
Answer 3:
I usually use lxml (http://lxml.de/) for HTML parsing. It is easy to use, and you can select tags with XPath, which keeps things both simple and fast.
I have an example of its use in a script I wrote to read an XML feed and count the words:
https://gist.github.com/1425228
You can also find more examples in the documentation: http://lxml.de/lxmlhtml.html
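For the title case in the question, an XPath lookup with lxml might look roughly like this. It is a minimal sketch that assumes the third-party lxml package is installed; the sample document and variable names are made up for illustration and are not taken from the linked gist.

```python
import lxml.html  # assumption: the third-party lxml package is installed

# Illustrative document built around the title from the question.
html = (
    "<html><head><title>How can I make money at home online? "
    "No gimmicks please? - Yahoo! Answers</title></head>"
    "<body><p>Some <b>body</b> text.</p></body></html>"
)

doc = lxml.html.fromstring(html)

# XPath: the text node directly inside the <title> element.
title = doc.xpath("//title/text()")[0]

# text_content() returns all text under an element, with tags removed.
body_text = doc.xpath("//body")[0].text_content()

print(title)
print(body_text)
```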
Source: https://stackoverflow.com/questions/8783385/processing-html-files-python