bs4 | 易学教程

Compiled with CX_FREEZE, Beautiful Soup program won't run in Console

Question: This is the error I get when I run the EXE file of the program. The program runs fine in PyCharm but produces this error in the console:

    bs4.FeatureNotFound: Couldn't find a Tree Builder with features you requested. Do you need to install a parser library?

    import sys
    from cx_Freeze import setup, Executable

    build_exe_options = {"packages": ["bs4, urllib, requests"], "excludes": [""]}
    base = None

    setup(
        name = "Weather",
        version = "0.9.0",
        options = {"program": build_exe_options},
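One way to make this kind of failure less abrupt is to fall back to Python's built-in parser when the requested one is missing. The sketch below is only an illustration, not a fix for the frozen build itself: the helper name `make_soup` and the choice of `"lxml"` as the preferred parser are assumptions, since the excerpt does not show which parser the program requests.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    `make_soup` and the default `preferred` value are hypothetical; the
    original program's parser choice is not visible in the question.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser library is not available (for example, it was
        # not bundled into the executable), so use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
```

Separately, the `packages` entry in the question is a single string, `"bs4, urllib, requests"`, rather than three separate names, which may be worth checking.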

regex not working in bs4

Question: I am trying to extract links for a specific file hoster from the watchseriesfree.to website. In the following case I want the rapidvideo links, so I use a regex to filter for tags whose text contains rapidvideo:

    import re
    import urllib2
    from bs4 import BeautifulSoup

    def gethtml(link):
        req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
        con = urllib2.urlopen(req)
        html = con.read()
        return html

    def findLatest():
        url = "https://watchseriesfree.to/serie/Madam-Secretary"
        head =

BeautifulSoup (bs4) parsing wrong

Question: Parsing this sample document with bs4, from Python 2.7.6:

    <html> <body> HTML allows omitting P end-tags. Like that and this. And this, too. What happened? And can we nest a paragraph, too? </body> </html>

Using:

    from bs4 import BeautifulSoup as BS
    ...
    tree = BS(fh)

HTML has, for ages, allowed omitted end-tags for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it doesn't end any of those

Python - AttributeError: 'NoneType' object has no attribute 'get_text'

Question: I am following a tutorial for bs4. I am trying to call get_text() on the 'a' elements in the example below. The tutorial returns the results McDermott International and MDR without a problem, but when I run it I get AttributeError: 'NoneType' object has no attribute 'get_text'. Please help. Many thanks!

    with open('Energy.htm') as f:
        soup = BeautifulSoup(f,"lxml")
    energylist = soup.find_all('td', {"style" : "text-align:left;"})
    for stock in energylist:
        try:
            stock_name = stock.find('a').get_text()
        except:
            stock_name = ''
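This error generally means that `find('a')` returned `None` for a cell that contains no link. A minimal sketch of an explicit check follows; the sample markup is hypothetical, since Energy.htm is not shown in the question, and `link_texts` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for Energy.htm, which the question does not include.
SAMPLE_HTML = """
<table>
  <tr><td style="text-align:left;"><a href="#">McDermott International</a></td></tr>
  <tr><td style="text-align:left;">no link in this cell</td></tr>
</table>
"""


def link_texts(html):
    """Return the link text of each matching cell, or '' when a cell has no <a>."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for cell in soup.find_all("td", {"style": "text-align:left;"}):
        link = cell.find("a")  # None when the cell contains no <a> element
        names.append(link.get_text(strip=True) if link is not None else "")
    return names


names = link_texts(SAMPLE_HTML)
```

Testing for `None` explicitly, rather than wrapping the call in a bare `except`, keeps unrelated errors visible.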

How Do I Remove An XML Declaration Using BeautifulSoup4

Question: I have an XHTML file that is structured like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html>
    <html lang="en">
    <head> ... </head>
    <body> ... </body>
    </html>

I'm using BeautifulSoup and I want to remove the XML declaration from the document, so that what I have looks like this:

    <!DOCTYPE html>
    <html lang="en">
    <head> ... </head>
    <body> ... </body>
    </html>

I can't find a way to get at the XML declaration to remove it. It doesn't appear to be a Doctype, Declaration, Tag, or NavigableString
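One possible approach, sketched here under the assumption that the document is parsed with the built-in `html.parser`: with that parser, the `<?xml ...?>` line is stored as a `bs4.element.ProcessingInstruction` node, which can be located by type and extracted. Other parsers may represent the declaration differently. The sample document and the helper name `strip_processing_instructions` are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction

# Illustrative document following the structure described in the question.
XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head><title>t</title></head>
<body><p>text</p></body>
</html>"""


def strip_processing_instructions(markup):
    """Parse markup with html.parser and drop every processing instruction."""
    soup = BeautifulSoup(markup, "html.parser")
    # Collect first, then extract, so the tree is not modified while iterating.
    targets = [node for node in soup.descendants
               if isinstance(node, ProcessingInstruction)]
    for node in targets:
        node.extract()
    return soup


cleaned = str(strip_processing_instructions(XHTML))
```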

Scraping a list of urls

Question: I am using Python 3.5 and trying to scrape a list of URLs (from the same website), with code as follows:

    import urllib.request
    from bs4 import BeautifulSoup

    url_list = ['URL1', 'URL2', 'URL3']

    def soup():
        for url in url_list:
            sauce = urllib.request.urlopen(url)
            for things in sauce:
                soup_maker = BeautifulSoup(things, 'html.parser')
                return soup_maker

    # Scraping
    def getPropNames():
        for propName in soup.findAll('div', class_="property-cta"):
            for h1 in propName.findAll('h1'):
                print(h1.text)

    def getPrice(
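A sketch of one way to structure this so that every URL is parsed as a whole document, rather than one line at a time, and so that the scraping functions receive a soup object as an argument. The helper names (`fetch_html`, `soups_for`, `property_names`) are hypothetical; the `property-cta` class and the nested `h1` lookup come from the question.

```python
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the full response body for one URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def soups_for(urls, fetch=fetch_html):
    """Yield (url, soup) pairs, parsing each page as a complete document."""
    for url in urls:
        yield url, BeautifulSoup(fetch(url), "html.parser")


def property_names(soup):
    """Collect the text of every <h1> inside a div with class 'property-cta'."""
    names = []
    for block in soup.find_all("div", class_="property-cta"):
        for h1 in block.find_all("h1"):
            names.append(h1.get_text(strip=True))
    return names
```

Usage would look like `for url, page in soups_for(url_list): print(property_names(page))`. The `fetch` parameter exists only so the download step can be swapped out, for example when trying the code on saved pages.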

Using bs4 to find an HTML tag (h2) having text

Question: For this part of the HTML code:

    html3= """<a name="definition"> </a> <h2>3.342.2323 Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2> <hr/> <div><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td>Code</td><td>Display</td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History
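The `h2` here contains both text and a nested `a` element, so its `.string` attribute is `None`, and filters that rely on `.string` will not see the heading text. A minimal sketch of an alternative is to match on `get_text()` with a function filter. Only the heading portion of the markup is reproduced below, and `find_heading` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Heading portion of the markup quoted in the question.
HTML = (
    '<a name="definition"> </a> '
    '<h2>3.342.2323 Content Logical Definition '
    '<a title="link to here" class="self-link" href="valueset-investigation">'
    '<img src="ta.png"/></a></h2> <hr/>'
)


def find_heading(soup, phrase):
    """Return the first <h2> whose combined text contains phrase, else None."""
    return soup.find(lambda tag: tag.name == "h2" and phrase in tag.get_text())


soup = BeautifulSoup(HTML, "html.parser")
heading = find_heading(soup, "Content Logical Definition")
```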

Extracting text between tags using BeautifulSoup

Question: I am trying to use BeautifulSoup to extract text from a series of webpages that all follow a similar format. The HTML for the text I wish to extract is below. The actual link is here: http://www.p2016.org/ads1/bushad120215.html.

    <big></big><span style="font-family:

Accessing untagged text using beautifulsoup

Question: I am using Python and beautifulsoup4 to extract some address information. More specifically, I need help retrieving non-US zip codes. Consider the following HTML data for a US-based company (already a soup object):

    <div class="compContent curvedBottom" id="companyDescription">
    <div class="vcard clearfix"> 999 State St Ste 100 Salt Lake City, UT <span class="zip"

Accessing untagged text using beautifulsoup

I am using Python and beautifulsoup4 to extract some address information. More specifically, I need help retrieving non-US zip codes. Consider the following HTML data for a US-based company (already a soup object): 999 State St Ste 100 Salt Lake City, UT 84114-0002, United States
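Text that sits directly inside an element, outside any child tag, is stored as `NavigableString` children of that element. The sketch below collects those direct text nodes. The sample markup is an assumption: the HTML in the excerpt is cut off, so the layout of the `span` and the trailing country text is a guess at its shape, and `direct_text` is an illustrative helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: the excerpt's HTML is truncated, so this is an assumption.
SAMPLE_HTML = """
<div class="vcard clearfix">
  999 State St Ste 100
  Salt Lake City, UT <span class="zip">84114-0002</span>, United States
</div>
"""


def direct_text(tag):
    """Return the non-empty text nodes that are direct children of tag."""
    pieces = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            text = " ".join(child.split())  # collapse runs of whitespace
            if text:
                pieces.append(text)
    return pieces


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
vcard = soup.find("div", class_="vcard")
untagged = direct_text(vcard)
```

For a non-US address with no `span` around the postal code, the code would have to be pulled out of these untagged strings by some other rule, which the excerpt does not specify.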