beautifulsoup

BeautifulSoup4: change text inside xml tag

放荡痞女 posted on 2019-12-23 05:39:17
Question: I simply want to change the text inside an XML tag after it becomes a BeautifulSoup object. Current code:

    example_string = '<conversion><person>John</person></conversion>'
    bsoup = BeautifulSoup(example_string)
    bsoup.person.text = 'Michael'

Running this code in my console raises this error:

    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    AttributeError: can't set attribute

How can I change the value inside the person XML tag?

Answer 1: You need to set the .string attribute,
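
A minimal sketch of assigning to .string on the question's own example. It assumes bs4 is installed and uses the built-in html.parser; the question's code names no parser, so that choice is an assumption.

```python
from bs4 import BeautifulSoup

example_string = '<conversion><person>John</person></conversion>'

# 'html.parser' is an assumption; the original call did not name a parser.
bsoup = BeautifulSoup(example_string, 'html.parser')

# .text is a read-only property, but .string can be assigned to.
bsoup.person.string = 'Michael'

result = str(bsoup)
print(result)
```

Assigning to .string replaces whatever the tag contained, so any child tags inside person would be dropped as well.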

Beautiful soup missing some html table tags

老子叫甜甜 posted on 2019-12-23 05:17:45
Question: I'm trying to extract data from a website using Beautiful Soup to parse the HTML. I'm currently trying to get the table data from the following webpage: link to webpage. I want to get the data from the table. First I save the page as an HTML file on my computer (this part works fine, I checked that I got all the information), but when I try to parse it with the following code:

    soup = BeautifulSoup(fh, 'html.parser')
    table = soup.find_all('table')
    cols = table[0].find_all('tr')
    cells = cols[1]

BeautifulSoup - How to extract text after specified string

岁酱吖の posted on 2019-12-23 05:15:06
Question: I have HTML like:

    <tr> <td>Title:</td> <td>Title value</td> </tr>

I need to pick a <td> by its text and grab the text of the <td> that follows it. Something like: grab the text of the first <td> after the <td> that contains the text Title:. The result should be: Title value. I have some basic understanding of Python and BeautifulSoup, and I have no idea how to do this when there is no class to select on. I have tried this:

    row = soup.find_all('td', string='Title:')
    text = str(row.nextSibling)
    print(text)
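
One possible approach, sketched on the question's own HTML: find() returns a single tag where find_all() returns a list, and find_next_sibling('td') skips the whitespace text node that sits between the two cells. The html.parser choice and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

html = "<tr> <td>Title:</td> <td>Title value</td> </tr>"
soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag (or None); find_all() would return a list.
label = soup.find("td", string="Title:")

value = None
if label is not None:
    # The next sibling *tag* named td, ignoring whitespace between the cells.
    cell = label.find_next_sibling("td")
    if cell is not None:
        value = cell.get_text(strip=True)

print(value)
```

The None checks are there because either lookup can fail on a page that lacks the label cell.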

HTML Link parsing using BeautifulSoup

微笑、不失礼 posted on 2019-12-23 04:39:32
Question: Here is the Python code I'm using to extract specific HTML from the page links I pass in as a parameter. I'm using BeautifulSoup. This code sometimes works fine and sometimes gets stuck!

    import urllib
    from bs4 import BeautifulSoup

    rawHtml = ''
    url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
    for i in range(1, 49):
        #iterate url and capture content
        sock = urllib.urlopen(url+ str(i))
        html = sock.read()
        sock.close()
        rawHtml += html
        print i

Here I'm
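
The excerpt does not say why the loop gets stuck. One possible cause, and it is only an assumption here, is a request that never returns because no timeout is set. A rough Python 3 sketch of the same loop with a timeout follows; fetch_pages and its opener parameter are hypothetical names added for illustration.

```python
import urllib.request

BASE_URL = "http://iasexamportal.com/civilservices/tag/voice-notes?page="


def fetch_pages(first, last, timeout=10, opener=urllib.request.urlopen):
    """Fetch pages first..last and return their raw bodies as a list of bytes."""
    pages = []
    for i in range(first, last + 1):
        # With a timeout, a stalled connection raises an exception
        # rather than blocking indefinitely.
        with opener(BASE_URL + str(i), timeout=timeout) as sock:
            pages.append(sock.read())
    return pages
```

Calling b"".join(fetch_pages(1, 48)) would cover the same page range as the original range(1, 49) loop. The caller still has to decide whether to retry or skip a page that times out.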

how to scrape product details on amazon webpage using beautifulsoup [closed]

…衆ロ難τιáo~ posted on 2019-12-23 04:37:10
Question (closed 4 years ago: it needs to be more focused and is not currently accepting answers): For the webpage http://www.amazon.com/Harry-Potter-Prisoner-Azkaban-Rowling/dp/0439136369/ref=pd_sim_b_2?ie=UTF8&refRID=1MFBRAECGPMVZC5MJCWG, how could I scrape the product details and output them as a dict in Python? In the above case, the dict output I want is: Age Range: 9 - 12 years

How to set the path to a browser executable with python webbrowser

与世无争的帅哥 posted on 2019-12-23 04:37:07
Question: I am trying to build a utility function that outputs Beautiful Soup code to a browser. I have the following code:

    def bs4_to_browser(bs4Tag):
        import os
        import webbrowser
        html = str(bs4Tag)
        # html = '<html> ... generated html string ...</html>'
        path = os.path.abspath('temp.html')
        url = 'file://' + path
        with open(path, 'w') as f:
            f.write(html)
        webbrowser.open(url)
        return

This works great and opens the HTML in the default browser. However, I would like to set the path to a portable firefox
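
A sketch of one way to point the standard-library webbrowser module at a specific executable: register a controller under a name of your choosing, then fetch it with webbrowser.get(). The path and the name "portable-firefox" below are hypothetical placeholders, not values from the question.

```python
import webbrowser

# Hypothetical location of the portable Firefox executable; adjust as needed.
FIREFOX_PATH = "/opt/firefox-portable/firefox"

# Register a controller for that executable under an arbitrary name.
webbrowser.register(
    "portable-firefox", None, webbrowser.BackgroundBrowser(FIREFOX_PATH)
)

# Look the controller up by name; nothing is launched until open() is called.
browser = webbrowser.get("portable-firefox")


def open_in_portable_firefox(url):
    """Open url with the registered executable instead of the default browser."""
    return browser.open(url)
```

Inside bs4_to_browser, the webbrowser.open(url) call could then be swapped for open_in_portable_firefox(url).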

Extracting text without tags of HTML with Beautifulsoup Python

心不动则不痛 posted on 2019-12-23 04:24:39
Question: I'm trying to extract this part of the text but I can't figure out how to do it. I'm working with several HTML files locally.

    <HTML><HEAD><STYLE>SOME STYLE CODE</STYLE></HEAD><META http-equiv=Content-Type content="text/html; charset=utf-8">
    <BODY>
    <H1>SOME TEXT I DONT WANT</H1>
    THIS TEXT IS WHICH I WANT
    <H1>ANOTHER TEXT I DONT WANT</H1>
    ANOTHER TEXT THAT I WANT
    [.. Continues ..]
    </BODY></HTML>

Thanks for your help.

EDIT: I have tried this code, but sometimes it prints the h1 tags:

    import glob
    from
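
One possible sketch for this layout: the wanted text sits directly inside BODY as bare text nodes, so iterating over the body's direct children and keeping only the strings leaves the H1 contents out. The sample is the question's HTML without the "[.. Continues ..]" placeholder; html.parser and the variable names are assumptions.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<HTML><HEAD><STYLE>SOME STYLE CODE</STYLE></HEAD>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
<BODY>
<H1>SOME TEXT I DONT WANT</H1>
THIS TEXT IS WHICH I WANT
<H1>ANOTHER TEXT I DONT WANT</H1>
ANOTHER TEXT THAT I WANT
</BODY></HTML>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only direct children of <body> that are text nodes, not tags,
# and drop the ones that are pure whitespace.
wanted = [
    child.strip()
    for child in soup.body.children
    if isinstance(child, NavigableString) and child.strip()
]

print(wanted)
```

Text nested inside any tag, not only H1, is skipped by this filter, which may or may not be what the full files need.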

Extracting/Scraping text from a href inside p inside div

杀马特。学长 韩版系。学妹 posted on 2019-12-23 04:22:05
Question: I am using Beautiful Soup (bs4) and Python. I currently have this structure:

    <div class="class1">
      <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
      <p class="specialties"><a href="/location/abcd">ab cd</a></p>
      <p class="doc-clinic-name"> <a class="light_grey link" href="/clinic/fff">f ff</a> </p>
    </div>
    <div class="class2">
      <p class="locality"> <a class="link grey" href="/location/doctors/ccc">c cc</a> </p>
      <p class="fees">INR 999</p>
      <div class="timings">
        <p><span class=
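
The excerpt is cut off before saying which values are wanted, so the following is only a sketch on the first, complete div: locate each p by its class, then read the text of the a inside it. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="class1">
  <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
  <p class="specialties"><a href="/location/abcd">ab cd</a></p>
  <p class="doc-clinic-name"> <a class="light_grey link" href="/clinic/fff">f ff</a> </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="class1")

# tag.a is shorthand for the first <a> found inside that tag.
name = card.find("a", class_="name").get_text(strip=True)
specialty = card.find("p", class_="specialties").a.get_text(strip=True)
clinic = card.find("p", class_="doc-clinic-name").a.get_text(strip=True)

print(name, specialty, clinic, sep=" | ")
```

On a real page each find() can return None, so a None check before chaining .a or .get_text() would be prudent.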

BeautifulSoup - getting rid of paragraph whitespace/line breaks

不打扰是莪最后的温柔 posted on 2019-12-23 04:03:05
Question:

    similarlist = res.find_all_next("div", class_="result-wrapper")
    for item in similarlist:
        print(item)

This returns:

    <div class="result-wrapper">
      <div class="row-fluid result-row">
        <div class="span6 result-left">
          <p>
            <a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
            <a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
          </p>
        </div>
        <div class="span6 result-right row-fluid">
          <span class="span9">
            <a class=
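
Going by the title, the goal appears to be the text without the surrounding whitespace and line breaks. A minimal sketch, assuming that reading is right: take get_text() for each wrapper and collapse every run of whitespace with split() and join(). The sample HTML is a shortened stand-in for the truncated output above.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for one result-wrapper block.
html = """<div class="result-wrapper">
  <p>
    <a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
  </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

cleaned = []
for item in soup.find_all("div", class_="result-wrapper"):
    # split() with no arguments splits on any whitespace, including newlines,
    # so joining with single spaces normalizes the text.
    cleaned.append(" ".join(item.get_text().split()))

print(cleaned)
```

This flattens each wrapper into a single line; if the left and right columns need to stay separate, the same normalization can be applied to each inner div instead.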

Scraping a page for URLs using Beautifulsoup

烈酒焚心 posted on 2019-12-23 03:57:47
Question: I can scrape the page for the headlines, no problem. The URLs are another story. They are fragments that get appended to the end of the base URL, I understand that. What do I need to do to pull the related URLs for storage in the format base_url.scraped_fragment?

    from urllib2 import urlopen
    import requests
    from bs4 import BeautifulSoup
    import csv
    import MySQLdb
    import re

    html = urlopen("http://advances.sciencemag.org/")
    soup = BeautifulSoup(html.read().decode('utf-8'),"lxml")
    #links = soup.findAll(
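
A sketch of joining scraped href fragments onto the base URL with urljoin from the standard library. It is written for Python 3 (the question's code uses the Python 2 urllib2 module), and the sample markup and hrefs are hypothetical stand-ins for the downloaded page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://advances.sciencemag.org/"

# Hypothetical stand-in for the downloaded page; real hrefs will differ.
html = (
    '<h2><a href="/content/1/1/e0000001">Headline one</a></h2>'
    '<h2><a href="content/1/1/e0000002">Headline two</a></h2>'
)

soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors without an href attribute; urljoin handles
# fragments with or without a leading slash.
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]

print(links)
```

urljoin also leaves hrefs that are already absolute untouched, which plain string concatenation would not.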