BeautifulSoup

How do I get rid of all the smart quotes while parsing a web page?

喜你入骨 submitted on 2019-12-24 02:23:58
Question: This is my code:

```python
name = namestr.decode("utf-8")
name.replace(u"\u2018", "").replace(u"\u2019", "").replace(u"\u201c", "").replace(u"\u201d", "")
```

This doesn't seem to work. I still find &ldquo;, &rdquo;, etc. in my text. This text has also been parsed using Beautiful Soup.

Answer 1: Replace the last line of your code with this one:

```python
name = name.replace(u"\u2018", "").replace(u"\u2019", "").replace(u"\u201c", "").replace(u"\u201d", "")
```

The replace method returns a modified string, but it does not affect the original string.
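The chain of replace calls can also be collapsed into a single translation table; a minimal stand-alone sketch (names like `normalize_quotes` are illustrative, not from the original):

```python
# Map Unicode "smart" quotes to plain ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

def normalize_quotes(text):
    """Return a NEW string with smart quotes replaced; str.replace never mutates in place."""
    return text.translate(str.maketrans(SMART_QUOTES))

print(normalize_quotes("\u201cHello\u201d and \u2018world\u2019"))
```

Assigning the result back (as in the answer) is the essential part; `str` objects are immutable in Python.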

BeautifulSoup doesn't find correctly parsed elements

那年仲夏 submitted on 2019-12-24 02:10:03
Question: I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon something very bizarre. The HTML comes from this page: http://www.wvdnr.gov/ It contains multiple errors, like multiple `<html></html>` tags, a `<title>` outside the `<head>`, etc. However, html5lib usually works well even in these cases. In fact, when I do:

```python
soup = BeautifulSoup(document, "html5lib")
```

and pretty-print `soup`, I see the following output: http://pastebin.com/8BKapx88 which contains a lot of `<a` …
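As a hedged illustration of this kind of error recovery (using the stdlib-backed `"html.parser"` so the snippet runs without extra dependencies; html5lib performs the same sort of repair, but strictly per the HTML5 spec), BeautifulSoup still exposes every element of malformed markup:

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: unclosed <p> tags and <title> outside <head>.
broken = "<html><body><title>oops</title><p>first<p>second</body></html>"

soup = BeautifulSoup(broken, "html.parser")
print(len(soup.find_all("p")))       # both <p> start tags become tags: 2
print(soup.title.get_text())         # the misplaced title is still reachable: oops
```

Different tree builders repair errors differently, so the same dirty document can yield differently shaped trees under `html.parser`, `lxml`, and `html5lib`.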

soup.findAll is not working for table

穿精又带淫゛_ submitted on 2019-12-24 02:02:30
Question: I am trying to parse this site https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017 using the following code:

```python
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import ssl

context = ssl._create_unverified_context()
dibbsurl = 'https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017'
uClient = uReq(dibbsurl, context=context)
dibbshtml = uClient.read()
uClient.close()

# html parser
dibbssoup = …
```
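A sketch of the table-parsing step against a static stand-in (the real RFQ page's markup and class name are assumptions here, not verified; if `find` returns nothing on the live page, the table is likely built client-side by JavaScript and simply absent from the raw HTML):

```python
from bs4 import BeautifulSoup

# Static stand-in for the fetched page.
html = """
<table class="results">
  <tr><th>Solicitation</th><th>Item</th></tr>
  <tr><td>SPE4A7</td><td>Bearing</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="results")
# One list per row; header row has no <td>, so it comes back empty.
rows = [[td.get_text() for td in tr.find_all("td")] for tr in table.find_all("tr")]
print(rows)  # [[], ['SPE4A7', 'Bearing']]
```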

Getting Tag Names with BeautifulSoup

℡╲_俬逩灬. submitted on 2019-12-24 01:56:13
Question:

```python
from bs4 import BeautifulSoup
source_code = """<a href="#" name="linkName">ok</a>"""
soup = BeautifulSoup(source_code)
print soup.a.? #find the object name
```

Using the code displayed above, I am trying to print the anchor tag's 'name', which is linkName, but I'm not sure which module or object to use. I have tried contents, name, and tag_name_re. Can anybody help me out? Thanks!

Answer 1: You already answered your question:

```python
soup.a['name']
```

Edit: If you have more than one a element, you can do …
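A runnable sketch covering both the single-tag case and several tags (the second anchor is invented for illustration):

```python
from bs4 import BeautifulSoup

source_code = '<a href="#" name="linkName">ok</a><a href="#" name="other">no</a>'
soup = BeautifulSoup(source_code, "html.parser")

# Attribute access on the first <a>:
print(soup.a["name"])  # linkName

# With several <a> elements, collect every name attribute;
# .get() returns None instead of raising KeyError when the attribute is missing.
names = [a.get("name") for a in soup.find_all("a")]
print(names)  # ['linkName', 'other']
```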

BeautifulSoup exception: list index out of range

南笙酒味 submitted on 2019-12-24 01:44:55
Question: I'm using BeautifulSoup to do the following:

```python
section = soup.findAll('tbody')[0]
```

How can I set a variable like that, using the first list item, without it throwing `IndexError: list index out of range` if BS4 can't find tbody? Any ideas?

Answer 1: You can store the result of findAll and check its length first:

```python
x = soup.findAll("tbody")
if x is not None and len(x) > 0:
    section = x[0]
```

Answer 2: Everyone who parses HTML will run into this type of question. The element you are looking for is …
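The answer above can be condensed: `find_all` always returns a list (possibly empty), never None, so a plain truthiness check is enough; a self-contained sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>no table here</div>", "html.parser")

# find_all returns [] when nothing matches, so indexing [0] is what raises IndexError.
tbodies = soup.find_all("tbody")
section = tbodies[0] if tbodies else None
print(section)  # None

# Alternatively, find() returns the first match or None directly:
section = soup.find("tbody")
print(section)  # None
```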

Scraping text in h3 and p tags using BeautifulSoup in Python

血红的双手。 submitted on 2019-12-24 01:39:08
Question: I have experience with Python and BeautifulSoup, and I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need (a single row) is coded as follows:

```html
...body and not nested divs...
<h3 class="college">
  <span class="num">1.</span>
  <a href="https://www.stanford.edu/">Stanford University</a>
</h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
...body and not nested divs...
<h3 id="MIT" class="college">
  <span …
```
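A hedged sketch of the pairing step against a trimmed copy of the sample row (CSV output is left out; each entry of `rows` can go straight to `csv.writer.writerow`):

```python
from bs4 import BeautifulSoup

html = """
<h3 class="college"><span class="num">1.</span>
  <a href="https://www.stanford.edu/">Stanford University</a></h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for h3 in soup.find_all("h3", class_="college"):
    name = h3.a.get_text()
    # find_next walks forward in document order to the matching location paragraph,
    # skipping the unrelated <div> between them.
    location = h3.find_next("p", class_="school-location").get_text()
    rows.append([name, location])
print(rows)  # [['Stanford University', 'Stanford, CA']]
```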

Scrape Finviz Page for Specific Values in Table

僤鯓⒐⒋嵵緔 submitted on 2019-12-24 01:38:06
Question: I will start out by saying I'm not endorsing scraping of sites that do not allow it in their terms of service; this is purely academic research into hypothetically gathering financial data from various websites. Suppose one wanted to look at this link: https://finviz.com/screener.ashx?v=141&f=geo_usa,ind_stocksonly,sh_avgvol_o100,sh_price_o1&o=ticker ...which is stored in a URLs.csv file, and wanted to scrape columns 2-5 (i.e. Ticker, Perf Week, Perf Month, Perf Quarter) and wanted to export …
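A sketch against a static stand-in table (the real screener's column layout is an assumption here), slicing out columns 2-5 and writing them as CSV:

```python
import csv
import io

from bs4 import BeautifulSoup

# Static stand-in for one screener results table.
html = """
<table>
  <tr><td>1</td><td>AAPL</td><td>1.2%</td><td>3.4%</td><td>5.6%</td><td>extra</td></tr>
  <tr><td>2</td><td>MSFT</td><td>0.8%</td><td>2.1%</td><td>4.3%</td><td>extra</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

out = io.StringIO()  # swap for open("out.csv", "w", newline="") to write a file
writer = csv.writer(out)
writer.writerow(["Ticker", "Perf Week", "Perf Month", "Perf Quarter"])
for tr in soup.find_all("tr"):
    cells = [td.get_text() for td in tr.find_all("td")]
    writer.writerow(cells[1:5])  # columns 2-5, i.e. a 0-indexed slice
print(out.getvalue())
```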

Python Web Scraping; Beautiful Soup

岁酱吖の submitted on 2019-12-24 01:24:07
Question: This was covered in this post: Python web scraping involving HTML tags with attributes. But I haven't been able to do something similar for this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland? I'm trying to scrape the values of:

```html
<td class="price city-2">
  NZ$15.62
  <span style="white-space:nowrap;">(AU$12.10)</span>
</td>
<td class="price city-1">
  AU$15.82
</td>
```

Basically price city-2 and price city-1 (NZ$15.62 and AU$15.82). Currently I have:

```python
import urllib2
from …
```
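One way to read just the `<td>`'s own text while skipping the nested `<span>` is the `stripped_strings` generator; a self-contained sketch over the quoted markup:

```python
from bs4 import BeautifulSoup

html = """
<td class="price city-2"> NZ$15.62 <span style="white-space:nowrap;">(AU$12.10)</span></td>
<td class="price city-1"> AU$15.82 </td>
"""
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each whitespace-trimmed text fragment in order;
# the first fragment of city-2 is its own text, before the nested span's.
city2 = next(soup.find("td", class_="city-2").stripped_strings)
city1 = next(soup.find("td", class_="city-1").stripped_strings)
print(city2, city1)  # NZ$15.62 AU$15.82
```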

Not iterating the list in web scraping

断了今生、忘了曾经 submitted on 2019-12-24 00:58:43
Question: From a link, I am trying to create two lists: one for country and the other for currency. However, I'm stuck at the point where it only gives me the first country name and doesn't iterate over the list of all countries. Any help on how I can fix this will be appreciated. Thanks in advance. Here is my try:

```python
from bs4 import BeautifulSoup
import urllib.request

url = "http://www.worldatlas.com/aatlas/infopage/currency.htm"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) …
```
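A common cause of "only the first item" is calling `find()` (first match only) instead of looping over `find_all()`; a sketch against a static stand-in table (the real page's markup is assumed, not verified):

```python
from bs4 import BeautifulSoup

# Static stand-in for the country/currency table.
html = """
<table>
  <tr><td>Albania</td><td>Lek</td></tr>
  <tr><td>Japan</td><td>Yen</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

countries, currencies = [], []
# Loop over EVERY row instead of taking only the first match:
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    countries.append(cells[0].get_text())
    currencies.append(cells[1].get_text())
print(countries, currencies)  # ['Albania', 'Japan'] ['Lek', 'Yen']
```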

Python get request returning different HTML than view source

你说的曾经没有我的故事 submitted on 2019-12-24 00:44:33
Question: I'm trying to extract the fanfiction from an Archive of Our Own URL in order to use the NLTK library to do some linguistic analysis on it. However, every attempt at scraping the HTML from the URL returns everything BUT the fanfic (and the comments form, which I don't need). First I tried the built-in urllib library (with BeautifulSoup):

```python
import urllib
from bs4 import BeautifulSoup

html = request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html, "html …
```
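One frequent cause of "different HTML than view source" is that the server varies its response on request headers, cookies, or redirects. A hedged sketch that only builds a browser-like request, without fetching anything (if the missing content is instead injected by JavaScript, no header will help and a browser-driven tool such as Selenium is needed):

```python
import urllib.request

url = "http://archiveofourown.org/works/6846694"

# Send a browser-like User-Agent so the server is more likely to return
# the same HTML the browser's "view source" shows.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))  # Mozilla/5.0

# The actual fetch (not performed here) would then be:
# html = urllib.request.urlopen(req).read()
```

Comparing `len(html)` from the script against the browser's source size is a quick way to confirm the two responses really differ.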