BeautifulSoup

How do I get rid of all the smart quotes while parsing a web page?

喜你入骨 submitted on 2019-12-24 02:23:58
Question: This is my code:

```python
name = namestr.decode("utf-8")
name.replace(u"\u2018", "").replace(u"\u2019", "").replace(u"\u201c", "").replace(u"\u201d", "")
```

This doesn't seem to work. I still find &ldquo;, &rdquo;, etc. in my text. This text has also been parsed using Beautiful Soup.

Answer 1: Replace the last line of your code with this one:

```python
name = name.replace(u"\u2018", "").replace(u"\u2019", "").replace(u"\u201c", "").replace(u"\u201d", "")
```

The replace method returns a modified string, but it does not affect the original string.
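The chain of replace calls can also be collapsed into a single translation table; a minimal stand-alone sketch (names like `normalize_quotes` are illustrative, not from the original):

```python
# Map Unicode "smart" quotes to plain ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

def normalize_quotes(text):
    """Return a NEW string with smart quotes replaced; str.replace never mutates in place."""
    return text.translate(str.maketrans(SMART_QUOTES))

print(normalize_quotes("\u201cHello\u201d and \u2018world\u2019"))
```

Assigning the result back (as in the answer) is the essential part; `str` objects are immutable in Python.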

BeautifulSoup doesn't find correctly parsed elements

那年仲夏 submitted on 2019-12-24 02:10:03
Question: I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon something very bizarre. The HTML comes from this page: http://www.wvdnr.gov/ It contains multiple errors, like multiple `<html></html>` tags, a `<title>` outside the `<head>`, etc. However, html5lib usually works well even in these cases. In fact, when I do:

```python
soup = BeautifulSoup(document, "html5lib")
```

and pretty-print `soup`, I see the following output: http://pastebin.com/8BKapx88 which contains a lot of `<a` …
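As a hedged illustration of this kind of error recovery (using the stdlib-backed `"html.parser"` so the snippet runs without extra dependencies; html5lib performs the same sort of repair, but strictly per the HTML5 spec), BeautifulSoup still exposes every element of malformed markup:

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: unclosed <p> tags and <title> outside <head>.
broken = "<html><body><title>oops</title><p>first<p>second</body></html>"

soup = BeautifulSoup(broken, "html.parser")
print(len(soup.find_all("p")))       # both <p> start tags become tags: 2
print(soup.title.get_text())         # the misplaced title is still reachable: oops
```

Different tree builders repair errors differently, so the same dirty document can yield differently shaped trees under `html.parser`, `lxml`, and `html5lib`.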

soup.findAll is not working for table

穿精又带淫゛_ submitted on 2019-12-24 02:02:30
Question: I am trying to parse this site https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017 using the following code:

```python
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import ssl

context = ssl._create_unverified_context()
dibbsurl = 'https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017'
uClient = uReq(dibbsurl, context=context)
dibbshtml = uClient.read()
uClient.close()

# html parser
dibbssoup = …
```
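A sketch of the table-parsing step against a static stand-in (the real RFQ page's markup and class name are assumptions here, not verified; if `find` returns nothing on the live page, the table is likely built client-side by JavaScript and simply absent from the raw HTML):

```python
from bs4 import BeautifulSoup

# Static stand-in for the fetched page.
html = """
<table class="results">
  <tr><th>Solicitation</th><th>Item</th></tr>
  <tr><td>SPE4A7</td><td>Bearing</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="results")
# One list per row; header row has no <td>, so it comes back empty.
rows = [[td.get_text() for td in tr.find_all("td")] for tr in table.find_all("tr")]
print(rows)  # [[], ['SPE4A7', 'Bearing']]
```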

Getting Tag Names with BeautifulSoup

℡╲_俬逩灬. submitted on 2019-12-24 01:56:13
Question:

```python
from bs4 import BeautifulSoup
source_code = """<a href="#" name="linkName">ok</a>"""
soup = BeautifulSoup(source_code)
print soup.a.? #find the object name
```

Using the code displayed above, I am trying to print the anchor tag's 'name', which is linkName, but I'm not sure which module or object to use. I have tried contents, name, and tag_name_re. Can anybody help me out? Thanks!

Answer 1: You already answered your question:

```python
soup.a['name']
```

Edit: If you have more than one a element, you can do …
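A runnable sketch covering both the single-tag case and several tags (the second anchor is invented for illustration):

```python
from bs4 import BeautifulSoup

source_code = '<a href="#" name="linkName">ok</a><a href="#" name="other">no</a>'
soup = BeautifulSoup(source_code, "html.parser")

# Attribute access on the first <a>:
print(soup.a["name"])  # linkName

# With several <a> elements, collect every name attribute;
# .get() returns None instead of raising KeyError when the attribute is missing.
names = [a.get("name") for a in soup.find_all("a")]
print(names)  # ['linkName', 'other']
```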

BeautifulSoup exception: list index out of range

南笙酒味 submitted on 2019-12-24 01:44:55
Question: I'm using BeautifulSoup to do the following:

```python
section = soup.findAll('tbody')[0]
```

How can I set a variable like that, using the first list item, without it throwing `IndexError: list index out of range` if BS4 can't find tbody? Any ideas?

Answer 1: You can store the result of findAll and check its length first:

```python
x = soup.findAll("tbody")
if x is not None and len(x) > 0:
    section = x[0]
```

Answer 2: Everyone who parses HTML will run into this type of question. The element you are looking for is …
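The answer above can be condensed: `find_all` always returns a list (possibly empty), never None, so a plain truthiness check is enough; a self-contained sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>no table here</div>", "html.parser")

# find_all returns [] when nothing matches, so indexing [0] is what raises IndexError.
tbodies = soup.find_all("tbody")
section = tbodies[0] if tbodies else None
print(section)  # None

# Alternatively, find() returns the first match or None directly:
section = soup.find("tbody")
print(section)  # None
```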

Scraping text in h3 and p tags using BeautifulSoup in Python

血红的双手。 submitted on 2019-12-24 01:39:08
Question: I have experience with Python and BeautifulSoup, and I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need (a single row) is coded as follows:

```html
...body and not nested divs...
<h3 class="college">
  <span class="num">1.</span>
  <a href="https://www.stanford.edu/">Stanford University</a>
</h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
...body and not nested divs...
<h3 id="MIT" class="college">
  <span …
```
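A hedged sketch of the pairing step against a trimmed copy of the sample row (CSV output is left out; each entry of `rows` can go straight to `csv.writer.writerow`):

```python
from bs4 import BeautifulSoup

html = """
<h3 class="college"><span class="num">1.</span>
  <a href="https://www.stanford.edu/">Stanford University</a></h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for h3 in soup.find_all("h3", class_="college"):
    name = h3.a.get_text()
    # find_next walks forward in document order to the matching location paragraph,
    # skipping the unrelated <div> between them.
    location = h3.find_next("p", class_="school-location").get_text()
    rows.append([name, location])
print(rows)  # [['Stanford University', 'Stanford, CA']]
```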

Scrape Finviz Page for Specific Values in Table

僤鯓⒐⒋嵵緔 submitted on 2019-12-24 01:38:06
Question: I will start out by saying I'm not endorsing scraping of sites that do not allow it in their terms of service; this is purely academic research into hypothetically gathering financial data from various websites. Suppose one wanted to look at this link: https://finviz.com/screener.ashx?v=141&f=geo_usa,ind_stocksonly,sh_avgvol_o100,sh_price_o1&o=ticker ...which is stored in a URLs.csv file, and wanted to scrape columns 2-5 (i.e. Ticker, Perf Week, Perf Month, Perf Quarter) and wanted to export …
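A sketch against a static stand-in table (the real screener's column layout is an assumption here), slicing out columns 2-5 and writing them as CSV:

```python
import csv
import io

from bs4 import BeautifulSoup

# Static stand-in for one screener results table.
html = """
<table>
  <tr><td>1</td><td>AAPL</td><td>1.2%</td><td>3.4%</td><td>5.6%</td><td>extra</td></tr>
  <tr><td>2</td><td>MSFT</td><td>0.8%</td><td>2.1%</td><td>4.3%</td><td>extra</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

out = io.StringIO()  # swap for open("out.csv", "w", newline="") to write a file
writer = csv.writer(out)
writer.writerow(["Ticker", "Perf Week", "Perf Month", "Perf Quarter"])
for tr in soup.find_all("tr"):
    cells = [td.get_text() for td in tr.find_all("td")]
    writer.writerow(cells[1:5])  # columns 2-5, i.e. a 0-indexed slice
print(out.getvalue())
```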

Python Web Scraping; Beautiful Soup

岁酱吖の submitted on 2019-12-24 01:24:07
Question: This was covered in this post: Python web scraping involving HTML tags with attributes. But I haven't been able to do something similar for this web page: http://www.expatistan.com/cost-of-living/comparison/melbourne/auckland? I'm trying to scrape the values of:

```html
<td class="price city-2">
  NZ$15.62
  <span style="white-space:nowrap;">(AU$12.10)</span>
</td>
<td class="price city-1">
  AU$15.82
</td>
```

Basically price city-2 and price city-1 (NZ$15.62 and AU$15.82). Currently I have:

```python
import urllib2
from …
```
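One way to read just the `<td>`'s own text while skipping the nested `<span>` is the `stripped_strings` generator; a self-contained sketch over the quoted markup:

```python
from bs4 import BeautifulSoup

html = """
<td class="price city-2"> NZ$15.62 <span style="white-space:nowrap;">(AU$12.10)</span></td>
<td class="price city-1"> AU$15.82 </td>
"""
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each whitespace-trimmed text fragment in order;
# the first fragment of city-2 is its own text, before the nested span's.
city2 = next(soup.find("td", class_="city-2").stripped_strings)
city1 = next(soup.find("td", class_="city-1").stripped_strings)
print(city2, city1)  # NZ$15.62 AU$15.82
```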

Not iterating the list in web scraping

断了今生、忘了曾经 submitted on 2019-12-24 00:58:43
Question: From a link, I am trying to create two lists: one for country and the other for currency. However, I'm stuck at the point where it only gives me the first country name and doesn't iterate over the list of all countries. Any help on how I can fix this will be appreciated. Thanks in advance. Here is my try:

```python
from bs4 import BeautifulSoup
import urllib.request

url = "http://www.worldatlas.com/aatlas/infopage/currency.htm"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) …
```
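A common cause of "only the first item" is calling `find()` (first match only) instead of looping over `find_all()`; a sketch against a static stand-in table (the real page's markup is assumed, not verified):

```python
from bs4 import BeautifulSoup

# Static stand-in for the country/currency table.
html = """
<table>
  <tr><td>Albania</td><td>Lek</td></tr>
  <tr><td>Japan</td><td>Yen</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

countries, currencies = [], []
# Loop over EVERY row instead of taking only the first match:
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    countries.append(cells[0].get_text())
    currencies.append(cells[1].get_text())
print(countries, currencies)  # ['Albania', 'Japan'] ['Lek', 'Yen']
```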

Python get request returning different HTML than view source

你说的曾经没有我的故事 submitted on 2019-12-24 00:44:33
Question: I'm trying to extract the fanfiction from an Archive of Our Own URL in order to use the NLTK library to do some linguistic analysis on it. However, every attempt at scraping the HTML from the URL returns everything BUT the fanfic (and the comments form, which I don't need). First I tried the built-in urllib library (with BeautifulSoup):

```python
import urllib
from bs4 import BeautifulSoup

html = request.urlopen("http://archiveofourown.org/works/6846694").read()
soup = BeautifulSoup(html, "html …
```
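One frequent cause of "different HTML than view source" is that the server varies its response on request headers, cookies, or redirects. A hedged sketch that only builds a browser-like request, without fetching anything (if the missing content is instead injected by JavaScript, no header will help and a browser-driven tool such as Selenium is needed):

```python
import urllib.request

url = "http://archiveofourown.org/works/6846694"

# Send a browser-like User-Agent so the server is more likely to return
# the same HTML the browser's "view source" shows.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))  # Mozilla/5.0

# The actual fetch (not performed here) would then be:
# html = urllib.request.urlopen(req).read()
```

Comparing `len(html)` from the script against the browser's source size is a quick way to confirm the two responses really differ.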