beautifulsoup

BeautifulSoup Tag Removal

懵懂的女人 submitted on 2019-12-24 15:48:03
Question: I am looking to parse an HTML table with Python/BeautifulSoup... This is my first attempt at coding anything in Python, so it's probably not the most efficient. I grabbed a function from another post here (works great for the most part), but I am running into a couple of problems. The code I am running is here: def strip_tags(html, invalid_tags): bs2 = BeautifulSoup(str(html)) for tag in bs2.findAll(True): if tag.name in invalid_tags: s = "" for c in tag.contents: if not isinstance(c,
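
For reference, a minimal sketch of one way to do this with current bs4, using Tag.unwrap() instead of rebuilding the contents by hand as the question's helper does (the tag names passed in are purely illustrative):

```python
from bs4 import BeautifulSoup

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(str(html), "html.parser")
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()              # replace the tag with its children, keeping the text
    return str(soup)

print(strip_tags("<p>Hello <b>bold</b> world</p>", ["b"]))
# <p>Hello bold world</p>
```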

BeautifulSoup find the next specific tag following a found tag

纵然是瞬间 submitted on 2019-12-24 15:40:11
Question: Given the following (simplified from a larger document) <tr class="row-class"> <td>Age</td> <td>16</td> </tr> <tr class="row-class"> <td>Height</td> <td>5.6</td> </tr> <tr class="row-class"> <td>Weight</td> <td>103.4</td> </tr> I have tried to return the 16 from the appropriate row using bs4 and lxml. The issue seems to be that there is a NavigableString between the two td tags, so that page.find_all("tr", {"class":"row-class"}) yields a result set with result[0] = {Tag} <tr class="row
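
One way to skip the whitespace NavigableString between the cells is find_next_sibling("td"); a minimal sketch against the markup quoted above:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr class="row-class"> <td>Age</td> <td>16</td> </tr>
<tr class="row-class"> <td>Height</td> <td>5.6</td> </tr>
<tr class="row-class"> <td>Weight</td> <td>103.4</td> </tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr", class_="row-class"):
    label = row.find("td")
    value = label.find_next_sibling("td")   # skips the whitespace NavigableString between cells
    print(label.get_text(strip=True), "->", value.get_text(strip=True))
# Age -> 16
# Height -> 5.6
# Weight -> 103.4
```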

Search for text inside a tag using beautifulsoup and returning the text in the tag after it

前提是你 submitted on 2019-12-24 14:28:37
Question: I'm trying to parse the following HTML in Python using Beautiful Soup. I would like to be able to search for text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do the same for the other tags so that for a given text category I can return its corresponding information. However, I'm finding it very difficult to find the right code to do this. <h2>Details</h2> <div class="section-inner"> <div class="_UCu"> <h3 class="_mEu">General</h3> <div class="_JDu"> <span
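
A minimal sketch of one possible approach, assuming (from the excerpt) that each label span is followed by a sibling span holding the value; note that string= requires bs4 4.4+, older versions use text= instead:

```python
from bs4 import BeautifulSoup

# markup assumed from the excerpt: a label span followed by a sibling span with the value
html = '<div class="_JDu"><span>Color</span><span>Slate, mykonos</span></div>'

soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", string="Color")   # string= needs bs4 4.4+; older versions use text=
if label is not None:
    value = label.find_next_sibling("span")
    print(value.get_text(strip=True))       # Slate, mykonos
```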

Empty value from web scraping with Python Beautiful Soup

浪子不回头ぞ submitted on 2019-12-24 14:28:18
Question: I am trying to scrape this website but am having issues extracting the right values. The website lists the prices of silver, gold, palladium and platinum. http://www.lbma.org.uk/precious-metal-prices The HTML of the website is below. <div id="header-tabs-content" data-tabs-content="header-tabs"> <div class="tabs-panel is-active" id="header-tabs-panel1" role="tabpanel" aria-hidden="false" aria-labelledby="header-tabs-panel1-label"> <a href="/precious-metal-prices"> <p>Gold Price</p> <p>AM:
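
A minimal sketch of how one might check what the raw HTML actually contains, using the URL from the question; empty strings here usually mean the figures are injected by JavaScript, which requests or urllib alone will never see:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://www.lbma.org.uk/precious-metal-prices", timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

panel = soup.find("div", id="header-tabs-content")
if panel is None:
    print("panel not found in the raw HTML")
else:
    for p in panel.find_all("p"):
        # empty strings here suggest the figures are filled in client-side
        print(repr(p.get_text(strip=True)))
```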

After starting my scraper I do not get any output

时间秒杀一切 submitted on 2019-12-24 14:16:22
Question: I am running a scraper to retrieve Product name, Cat No, Size and Price, but when I run the script it gives me neither output nor an error message. I am using Jupyter Notebook for this and am not sure if that is the problem. I am also not sure whether writing the results to a CSV file is causing issues. Any help would be greatly appreciated. This is the code that I am running. from selenium import webdriver import csv, os from bs4 import BeautifulSoup os.chdir(r'C:\Users\kevin
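
A rough sketch of the general selenium + bs4 + csv pattern with a debug print so a silent run is at least visible in the notebook; the URL and selector below are placeholders, since the real ones are cut off above:

```python
import csv
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                      # assumes chromedriver is on PATH
driver.get("https://example.com/products")       # placeholder URL; the real one is cut off above
soup = BeautifulSoup(driver.page_source, "html.parser")
items = soup.select("div.product")               # placeholder selector

print("found", len(items), "items")              # make a silent run visible in Jupyter

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Product name", "Cat No", "Size", "Price"])
    for item in items:
        writer.writerow([item.get_text(strip=True), "", "", ""])   # map the real fields here

driver.quit()
```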

Extracting comments from news articles

浪尽此生 submitted on 2019-12-24 13:44:14
Question: My question is similar to the one asked here: https://stackoverflow.com/questions/14599485/news-website-comment-analysis I am trying to extract comments from any news article. For example, I have a news URL here: http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/ I am trying to use BeautifulSoup in Python to extract the comments. However, it seems the comment section is either embedded within an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal the
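
If the comments really are loaded through JavaScript or an iframe, one possible approach is to let a browser driver render the page and switch into the frame before parsing; the selectors below are guesses for a Disqus-style widget, not CNN's actual markup:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/")

# if the widget sits in an iframe, switch into it before grabbing the source;
# the selectors are guesses for a Disqus-style widget, not CNN's actual markup
frame = driver.find_element(By.CSS_SELECTOR, "iframe[src*='disqus']")
driver.switch_to.frame(frame)

soup = BeautifulSoup(driver.page_source, "html.parser")
for post in soup.select("div.post-message"):
    print(post.get_text(strip=True))

driver.quit()
```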

Webscraping with Python: WinError 10061: Target machine actively refused

微笑、不失礼 submitted on 2019-12-24 13:42:41
Question: I am writing code to scrape data from a website. The code was working fine until I decided to hide my IP address. Now I get the following error: "urlopen error [WinError 10061] No connection could be made because the target machine actively refused it". I have disabled the firewalls and antivirus on my machine; Tor is installed and running, and the internet connection is fine (obviously). Could someone help me figure out where the problem is, and whether it can be fixed (I cannot change the website I am
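
That error usually means nothing is listening on the port the proxy settings point at; Tor's SOCKS listener defaults to 127.0.0.1:9050 (9150 for the Tor Browser bundle). A minimal sketch of one way to verify the proxy itself, using requests with the PySocks extra installed rather than urllib:

```python
import requests   # needs the SOCKS extra: pip install requests[socks]

# Tor's SOCKS listener defaults to 127.0.0.1:9050 (9150 for the Tor Browser bundle);
# WinError 10061 typically means nothing is listening on the port the proxy points at.
proxies = {
    "http":  "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
resp = requests.get("https://check.torproject.org/", proxies=proxies, timeout=30)
print(resp.status_code)
```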

Passing lxml output to BeautifulSoup

核能气质少年 submitted on 2019-12-24 13:33:15
Question: My offline code works fine, but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication, then lxml to parse (it gives a good result with the specific pages we need to scrape), and then BeautifulSoup. #! /usr/bin/python import urllib.request import urllib.error from io import StringIO from bs4 import BeautifulSoup from lxml import etree from lxml import html file = open("sample.html") doc = file.read() parser = etree.HTMLParser() html
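
A minimal sketch of the hand-off, assuming the sample.html mentioned in the code: serialize the lxml tree back to markup with lxml.html.tostring() and feed that string to BeautifulSoup:

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

with open("sample.html") as f:
    doc = f.read()

tree = lxml_html.fromstring(doc)
# ... any lxml-side work on the tree happens here ...

# serialize the tree back to markup before handing it to BeautifulSoup
markup = lxml_html.tostring(tree, encoding="unicode")
soup = BeautifulSoup(markup, "lxml")
print(soup.title)
```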

Extract attribute value in beautiful soup

泪湿孤枕 submitted on 2019-12-24 13:25:33
Question: The following is part of a website from which I am trying to extract the video titles: </div> <div class="yt-lockup-content"> <h3 class="yt-lockup-title"> <a class="yt-uix-sessionlink yt-uix-tile-link yt-uix-contextlink yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="Harder Polynomials" data-sessionlink="ei=fYsHUvSLA8uzigLq74CABQ&ved=CB8Qvxs&feature=c4-videos-u" href="/watch?v=LHvQeBRLFn8" > Harder Polynomials </a> I wish to extract the video title (Harder Polynomials) from this. I have
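
A minimal sketch against the snippet quoted above: once the anchor is located, the title attribute can be read like a dictionary key:

```python
from bs4 import BeautifulSoup

html = '''
<h3 class="yt-lockup-title">
  <a class="yt-uix-tile-link" title="Harder Polynomials" href="/watch?v=LHvQeBRLFn8">
    Harder Polynomials
  </a>
</h3>
'''

soup = BeautifulSoup(html, "html.parser")
for a in soup.select("h3.yt-lockup-title a[title]"):
    print(a["title"])                # attribute value: Harder Polynomials
    print(a.get_text(strip=True))    # or the link text itself
```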

ImportError: No module named html.entities

戏子无情 submitted on 2019-12-24 12:46:32
Question: I am new to Python. I am using Python 2.7.5. I want to write a web crawler. For that I have installed BeautifulSoup 4.3.2. I installed it using this command (I haven't used pip): python setup.py install I am using Eclipse 4.2 with PyDev installed. When I try to import this library in my script with from bs4 import BeautifulSoup I get this error: ImportError: No module named html.entities Please explain what I should do to rectify it. Answer 1: Is there any reason why you are not using pip
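
That error usually means the bs4 package was installed for (or converted for) a different Python version than the one running the script. A quick sanity check to see which interpreter and which bs4 install the import resolves to:

```python
import sys
print(sys.version)        # should start with 2.7.5 if that is the interpreter you intend to use

import bs4
print(bs4.__version__)    # e.g. 4.3.2
print(bs4.__file__)       # shows which site-packages directory the import was resolved from
```

If the path points at an installation made for a different interpreter, reinstalling with pip for the intended one (as the answer suggests) normally clears the ImportError.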