I am using BeautifulSoup to scrape a url and I had the following code
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
Now in the above code we can use findAll
to get tags and information related to them, but I want to use xpath. Is it possible to use xpath with BeautifulSoup? If possible, can anyone please provide me an example code so that it will be more helpful?
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath()
method to search for elements.
import urllib2
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
Of possible interest to you is the CSS Selector support; the CSSSelector
class translates CSS statements into XPath expressions, making your search for td.empformbody
that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
# Do something with these table cells.
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
# Do something with these table cells.
I can confirm that there is no XPath support within Beautiful Soup.
Martijn's code no longer works properly (it is 4+ years old by now...), the etree.parse()
line prints to the console and doesn't assign the value to the tree
variable. Referencing this, I was able to figure out this works using requests and lxml:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print 'Buyers: ', buyers
print 'Prices: ', prices
BeautifulSoup has a function named findNext from current element directed childern,so:
father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')
Above code can imitate the following xpath:
div[class=class_value]/div[id=id_value]
when you use lxml all simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
but when use BeautifulSoup BS4 all simple too:
- first remove "//" and "@"
- second - add star before "="
try this magic:
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select ('a[class*="shared-components"]')
as you see, this does not support sub-tag, so i remove "/@href" part
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it thru BeautifulSoup, search for the xpath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
来源:https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup