beautifulsoup

beautiful soup findall multiple class using one query

徘徊边缘 submitted on 2020-01-03 16:52:01
Question: I searched thoroughly for a solution on many websites and on here, but none of them work. I am trying to scrape flashscores.com and I want to parse a <td> with the class name cell_ab team-home or cell_ab team-home bold. I tried using re: soup.find_all('td', { 'class'= re.compile(r"^(cell_ab team-home |cell_ab team-home bold )$")) and soup.find_all('td', { 'class' : ['cell_ab team-home ','cell_ab team-home bold ']). Neither of them works. Someone requested the code, so here it is: from tkinter
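
One possible way to match both variants with a single query is a CSS selector that requires each class separately. This is only a sketch: the sample markup below is invented for illustration, since the real page structure is not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real flashscores.com structure is an assumption.
html = """
<table><tr>
  <td class="cell_ab team-home">Team A</td>
  <td class="cell_ab team-home bold">Team B</td>
  <td class="cell_ac team-away">Team C</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# The selector requires both classes to be present and ignores any extra
# class such as "bold", so it matches either variant.
cells = soup.select("td.cell_ab.team-home")
names = [td.get_text(strip=True) for td in cells]
print(names)
```

Because the selector tests each class on its own, trailing spaces and class order in the attribute value do not matter.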

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. With Requests and BeautifulSoup

无人久伴 submitted on 2020-01-03 16:00:32
Question: I had this web scraping code working a few minutes ago, but now I get this warning and an encoding problem. Since this request doesn't return HTML, BeautifulSoup returns None when I search for the contents of a tag. What is going wrong here? I tried to google this encoding problem, but couldn't find a clear answer. import requests from bs4 import BeautifulSoup url = 'http://finance.yahoo.com/q?s=aapl&fr=uh3_finance_web&uhb=uhb2' data = requests.get(url) soup = BeautifulSoup(data
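
The warning comes from BeautifulSoup failing to decode the bytes it was given. One thing worth checking, sketched here under the assumption that the body really is HTML in a known encoding, is to pass the raw bytes with an explicit `from_encoding`. The sample bytes stand in for `data.content`; this does not reproduce the Yahoo response from the question.

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get(url).content; assumed to be UTF-8 encoded HTML.
raw = "<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")

# Tell BeautifulSoup which encoding to use instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# find() returns None when the tag is absent, so guard before using it.
paragraph = soup.find("p")
text = paragraph.get_text() if paragraph is not None else None
print(text)
```

With Requests, inspecting `data.headers.get("Content-Type")` before parsing can show whether the server returned HTML at all.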

BeautifulSoup - How to get all text between two different tags?

好久不见. submitted on 2020-01-03 13:00:29
Question: I would like to get all text between two tags: <div class="lead">I DONT WANT this</div> #many different tags - p, table, h2 including text that I want <div class="image">...</div> I started this way: url = "http://......." req = urllib.request.Request(url) source = urllib.request.urlopen(req) soup = BeautifulSoup(source, 'lxml') start = soup.find('div', {'class': 'lead'}) end = soup.find('div', {'class': 'image'}) And I have no idea what to do next. Answer 1: Try using the code below: from bs4
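
The answer's code is cut off in this excerpt, so the following is not that code. It is a separate minimal sketch of one approach: walk `start.next_siblings` and stop at `end`. It assumes both `<div>` elements share the same parent, and the markup between them is made up.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page fragment; only the two <div> markers come from the question.
html = """
<div class="lead">I DONT WANT this</div>
<p>first paragraph</p>
<h2>a heading</h2>
<div class="image">...</div>
"""
soup = BeautifulSoup(html, "html.parser")
start = soup.find("div", {"class": "lead"})
end = soup.find("div", {"class": "image"})

texts = []
for node in start.next_siblings:
    if node is end:            # stop once the closing marker is reached
        break
    if isinstance(node, Tag):  # skip whitespace-only strings between tags
        texts.append(node.get_text(strip=True))
print(texts)
```

If the two markers are not siblings, `start.next_elements` could be used instead, at the cost of visiting nested nodes as well.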

How to ignore empty lines while using .next_sibling in BeautifulSoup4 in python

本小妞迷上赌 submitted on 2020-01-03 11:28:19
Question: Because I want to remove duplicated placeholders in an HTML page, I use the .next_sibling attribute of BeautifulSoup. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, so I want .next_sibling to ignore it (have a look at data2). This is the code: from bs4 import BeautifulSoup, Tag data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>" data2 = """<p>method-removed-here</p> <p>method
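
A possible workaround, offered as a sketch, is `find_next_sibling("p")`, which searches for the next sibling tag and therefore skips the whitespace strings that `.next_sibling` returns. The `data2` value below is an assumed stand-in, because the original string is cut off.

```python
from bs4 import BeautifulSoup

# Assumed sample; the question's data2 is truncated in the excerpt.
data2 = "<p>method-removed-here</p>\n\n<p>method-removed-here</p>\n<p>other</p>"
soup = BeautifulSoup(data2, "html.parser")

for p in soup.find_all("p"):
    # find_next_sibling("p") ignores text nodes such as "\n\n".
    nxt = p.find_next_sibling("p")
    while nxt is not None and nxt.get_text() == p.get_text():
        nxt.extract()                     # remove the duplicate from the tree
        nxt = p.find_next_sibling("p")

remaining = [p.get_text() for p in soup.find_all("p")]
print(remaining)
```

Note that this also skips any non-`<p>` tags between two paragraphs, which may or may not be what is wanted.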

python beautifulsoup new_tag: assign class as an attribute

眉间皱痕 submitted on 2020-01-03 08:39:12
Question: I'm new to both Python and BeautifulSoup, so maybe there is a simple answer I can't find. When I call .new_tag('name') I can also assign attributes, like .new_tag('a', href='#', id='link1'). But I can't assign class this way, because it is a reserved word. I also can't add name this way, because it's used as the keyword for the tag name. I know I can add them later, using tag['class'] for example, but I would like to know: is this the only way to add a class to a new tag? Or is there a way to
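
One option, assuming a BeautifulSoup release whose `new_tag` accepts an `attrs` dictionary, is to pass the awkward attribute names through that dictionary. The attribute values below are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div></div>", "html.parser")

# "class" and "name" cannot be plain keyword arguments, so they go in attrs;
# ordinary attributes such as href can still be passed as keywords.
tag = soup.new_tag("a", href="#", attrs={"class": "button", "name": "link1"})
soup.div.append(tag)

print(tag.name, tag["href"], tag["name"], tag.get_attribute_list("class"))
```

On releases without the `attrs` parameter, setting `tag['class']` after creation, as mentioned in the question, remains available.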

How do I remove a spurious tag in BeautifulSoup

为君一笑 submitted on 2020-01-03 05:05:38
Question: I'm pulling text from the Presidential debates. I got to one that has an issue: it errantly turns every mention of the word "debate" into a tag <debate>. Go ahead, search for "Welcome back to the Republican presidential"; notice an obvious word missing? Cool, so BeautifulSoup does a superb job of cleaning up messy HTML and adding closing tags where they should have been. But in this case, that mucks me up, because <debate> is now a child of a <p> and the closing </debate> is added allllll the
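
A sketch of one way to undo this: put the missing word back as plain text, then call `unwrap()` so the spurious tag is replaced by its own contents. The fragment below is invented around the phrase quoted in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment showing the parsed result described in the question.
html = "<p>Welcome back to the Republican presidential <debate>. Tonight</debate></p>"
soup = BeautifulSoup(html, "html.parser")

for spurious in soup.find_all("debate"):
    spurious.insert_before("debate")  # restore the word that became a tag
    spurious.unwrap()                 # drop the tag but keep its children

text = soup.p.get_text()
print(text)
```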

Real Estate Market Scraping using Python and BeautifulSoup

▼魔方 西西 submitted on 2020-01-03 04:57:08
Question: I need a concept for how to parse a real estate market using Python. I've searched for information about parsing websites, and I even did this in VBA, but I would like to do it in Python. This is the site that will be parsed (it's only one offer now, but it will work on the full range of real estate offers, multiple pages from kontrakt.szczecin.pl): http://www.kontrakt.szczecin.pl/mieszkanie-sprzedaz-100m2-335000pln-grudziadzka-pomorzany-szczecin-zachodniopomorskie,351149 First of all,

Nested For Loop with Unequal Entities

天大地大妈咪最大 submitted on 2020-01-03 04:55:07
Question: I would like to scrape the contents of a website with a structure similar to https://www.wellstar.org/locations/pages/default.aspx Using the provided website as a framework, I would like to extract each location's name and the heading associated with that location. I want to be able to produce the following: WellStar Hospitals WELLSTAR ATLANTA MEDICAL CENTER WellStar Hospitals WELLSTAR ATLANTA MEDICAL CENTER SOUTH ... WellStar Health Parks ACWORTH HEALTH PARK ... Thus far I have attempted a
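
The general pattern for pairing each heading with a varying number of entries can be sketched as an outer loop over groups and an inner loop over each group's items. The markup, tag names, and the `group` class below are hypothetical, since the real page structure is not shown; only the heading and location names come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical structure: one container per heading, with a list of locations.
html = """
<div class="group"><h3>WellStar Hospitals</h3>
  <ul><li>WELLSTAR ATLANTA MEDICAL CENTER</li>
      <li>WELLSTAR ATLANTA MEDICAL CENTER SOUTH</li></ul></div>
<div class="group"><h3>WellStar Health Parks</h3>
  <ul><li>ACWORTH HEALTH PARK</li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for group in soup.find_all("div", class_="group"):
    heading = group.find("h3").get_text(strip=True)
    # The inner loop runs once per location, however many a group has.
    for item in group.find_all("li"):
        rows.append((heading, item.get_text(strip=True)))

for heading, location in rows:
    print(heading, location)
```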

BeautifulSoup .children or .contents without whitespace between tags

你离开我真会死。 submitted on 2020-01-03 04:32:12
Question: I want all children of a tag without the whitespace between the tags. But BeautifulSoup's .contents and .children also return the whitespace between the tags. html = """ <div id="list"> <span>1</span> <a href="2.html">2</a> <a href="3.html">3</a> </div> """ soup = BeautifulSoup(html, 'html.parser') print(soup.find(id='list').contents) This prints: ['\n', <span>1</span>, '\n', <a href="2.html">2</a>, '\n', <a href="3.html">3</a>, '\n'] Same with print(list(soup.find(id='list').children)) What
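
One way to get only the tag children, shown as a sketch using the markup from the question, is `find_all(True, recursive=False)`: `True` matches any tag, and `recursive=False` limits the search to direct children, so the whitespace strings are left out.

```python
from bs4 import BeautifulSoup

html = """
<div id="list">
<span>1</span>
<a href="2.html">2</a>
<a href="3.html">3</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="list")

# Only direct child tags are returned; the "\n" strings are not tags.
children = container.find_all(True, recursive=False)
child_names = [child.name for child in children]
print(children)
```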

beautiful soup extract a href from google search

狂风中的少年 submitted on 2020-01-03 04:18:08
Question: A Google search gives me the following first result in HTML: <h3 class="r"><a href="https://rads.stackoverflow.com/amzn/click/com/0470284889" rel="nofollow noreferrer" class="l vst" onmousedown="return rwt(this,'','','','1','AFQjCNEv1W9YC2jcSKYdEo2kNqBMJ-Utmg','k89K9hF4cVNpxQYHtEKiUQ','0CCoQFjAA',null,event)"><em>Quantitative Trading</em>: <em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</a></h3> I would like to extract the link http://www.amazon.com/Quantitative-Trading-Build
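
A minimal sketch of reading the `href` from that result, assuming the `<h3 class="r">` wrapper shown in the question: select the anchor inside it and index its attribute. The markup below is a shortened copy of the question's snippet with the `onmousedown` handler removed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question.
html = (
    '<h3 class="r"><a href="https://rads.stackoverflow.com/amzn/click/com/0470284889" '
    'rel="nofollow noreferrer" class="l vst"><em>Quantitative Trading</em>: '
    '<em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</a></h3>'
)
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match or None, so guard before indexing.
link = soup.select_one("h3.r a")
href = link["href"] if link is not None else None
print(href)
```

This returns whatever URL the `href` attribute actually contains, which in the snippet above is the redirect URL, not the Amazon address.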