beautifulsoup | 易学教程

BeautifulSoup4: change text inside xml tag

Question: I simply want to change the text inside an XML tag after it has been parsed into a BeautifulSoup object. Current code:

    example_string = '<conversion><person>John</person></conversion>'
    bsoup = BeautifulSoup(example_string)
    bsoup.person.text = 'Michael'

Running this code in my console raises this error:

    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    AttributeError: can't set attribute

How can I change the value inside the person XML tag?

Answer 1: You need to set the .string attribute,
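
A minimal sketch of assigning to `.string` instead of `.text`, reusing the question's own input. The explicit `'html.parser'` argument is an assumption added here so the example does not depend on which parser happens to be installed.

```python
from bs4 import BeautifulSoup

example_string = '<conversion><person>John</person></conversion>'

# Name the parser explicitly (assumption: the stdlib-backed html.parser is enough here).
bsoup = BeautifulSoup(example_string, 'html.parser')

# .text is a read-only property; .string can be assigned and replaces the tag's contents.
bsoup.person.string = 'Michael'

print(bsoup)  # <conversion><person>Michael</person></conversion>
```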

Beautiful soup missing some html table tags

Question: I'm trying to extract data from a website, using Beautiful Soup to parse the HTML. I want to get the table data from the following webpage: (link to webpage). First I save the page as an HTML file on my computer (this part works fine; I checked that I got all the information), but when I try to parse it with the following code:

    soup = BeautifulSoup(fh, 'html.parser')
    table = soup.find_all('table')
    cols = table[0].find_all('tr')
    cells = cols[1]
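
For orientation, here is a minimal sketch of walking a table row by row. It uses a small inline table invented for illustration, not the page from the question, and it does not diagnose why tags go missing on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page; the real table is not shown in the excerpt.
html = (
    "<table>"
    "<tr><th>Name</th><th>Value</th></tr>"
    "<tr><td>a</td><td>1</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# find_all('tr') yields rows (not columns); each row holds th/td cells.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```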

BeautifulSoup - How to extract text after specified string

Question: I have HTML like:

    <tr>
      <td>Title:</td>
      <td>Title value</td>
    </tr>

I need to name the <td> whose text marks the spot and then grab the text of the <td> that follows it. Something like: grab the text of the first <td> after the <td> that contains the text "Title:". The result should be: Title value

I have some basic understanding of Python and BeautifulSoup, and I have no idea how to do this when there is no class to select on. I have tried this:

    row = soup.find_all('td', string='Title:')
    text = str(row.nextSibling)
    print(text)
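
One way this could work, sketched under the assumption that the label cell contains exactly the text `Title:`. Note that `find_all` returns a list-like result set, which has no sibling attributes, so the sketch uses `find` to get a single tag, then `find_next_sibling('td')` to skip the whitespace between cells.

```python
from bs4 import BeautifulSoup

html = "<table><tr> <td>Title:</td> <td>Title value</td> </tr></table>"
soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag (or None); find_all() would return a result set instead.
label = soup.find("td", string="Title:")

# find_next_sibling('td') skips the whitespace text node between the two cells.
value = label.find_next_sibling("td").get_text(strip=True)

print(value)  # Title value
```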

HTML Link parsing using BeautifulSoup

Question: Here is the Python code I use to extract specific HTML from the page links I pass in as a parameter. I'm using BeautifulSoup. This code sometimes works fine and sometimes gets stuck:

    import urllib
    from bs4 import BeautifulSoup

    rawHtml = ''
    url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
    for i in range(1, 49):
        # iterate url and capture content
        sock = urllib.urlopen(url + str(i))
        html = sock.read()
        sock.close()
        rawHtml += html
        print i

Here I'm
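
The excerpt does not say why the loop hangs. One plausible cause, stated here as an assumption, is a request that never returns because no timeout is set. The sketch below is a Python 3 variant of the loop with a timeout; `fetch_pages` and its `opener` parameter are hypothetical names introduced for illustration.

```python
import urllib.request


def fetch_pages(base_url, first, last, timeout=10, opener=urllib.request.urlopen):
    """Fetch base_url + page number for each page and return the joined text.

    `opener` defaults to urllib.request.urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    chunks = []
    for i in range(first, last + 1):
        try:
            # The timeout makes a stalled connection raise instead of blocking forever.
            with opener(base_url + str(i), timeout=timeout) as sock:
                chunks.append(sock.read().decode("utf-8", errors="replace"))
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(f"page {i} failed: {exc}")
    return "".join(chunks)


# Usage with the URL from the question (requires network access):
# raw_html = fetch_pages(
#     "http://iasexamportal.com/civilservices/tag/voice-notes?page=", 1, 48
# )
```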

how to scrape product details on amazon webpage using beautifulsoup [closed]

Question: For the webpage http://www.amazon.com/Harry-Potter-Prisoner-Azkaban-Rowling/dp/0439136369/ref=pd_sim_b_2?ie=UTF8&refRID=1MFBRAECGPMVZC5MJCWG how can I scrape the product details and output them as a dict in Python? In the above case, the dict output I want is: Age Range: 9 - 12 years

How to set the path to a browser executable with python webbrowser

Question: I am trying to build a utility function that outputs Beautiful Soup code to a browser. I have the following code:

    def bs4_to_browser(bs4Tag):
        import os
        import webbrowser
        html = str(bs4Tag)
        # html = '<html> ... generated html string ...</html>'
        path = os.path.abspath('temp.html')
        url = 'file://' + path
        with open(path, 'w') as f:
            f.write(html)
        webbrowser.open(url)
        return

This works great and opens the HTML in the default browser. However, I would like to set the path to a portable firefox
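
A possible approach, sketched with the standard `webbrowser` module: register a controller that wraps a specific executable, then fetch it by name. The helper name `get_browser`, the registration name `portable-firefox`, and the example path are all hypothetical.

```python
import webbrowser


def get_browser(executable_path, name="portable-firefox"):
    """Register a browser controller for the given executable and return it."""
    # BackgroundBrowser launches the executable with the URL as its argument.
    controller = webbrowser.BackgroundBrowser(executable_path)
    webbrowser.register(name, None, controller)
    return webbrowser.get(name)


# Usage (the path below is a placeholder, not a real install location):
# browser = get_browser(r"C:\path\to\FirefoxPortable.exe")
# browser.open("file:///path/to/temp.html")
```

Inside `bs4_to_browser`, the returned controller's `open(url)` would take the place of `webbrowser.open(url)`.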

Extracting text without tags of HTML with Beautifulsoup Python

Question: I am trying to extract part of the text below but cannot figure out how to do it. I am working with several HTML files locally.

    <HTML><HEAD><STYLE>SOME STYLE CODE</STYLE></HEAD><META http-equiv=Content-Type content="text/html; charset=utf-8">
    <BODY>
    <H1>SOME TEXT I DONT WANT</H1>
    THIS TEXT IS WHICH I WANT
    <H1>ANOTHER TEXT I DONT WANT</H1>
    ANOTHER TEXT THAT I WANT
    [.. Continues ..]
    </BODY></HTML>

Thanks for your help.

EDIT: I have tried the following code, but it sometimes prints the h1 tags: import glob from
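
A minimal sketch of one way to keep only the text that sits directly inside the body, outside any child tag. The HTML is a shortened version of the question's sample, and the sketch assumes the wanted text is always a direct child of `<BODY>`.

```python
from bs4 import BeautifulSoup, NavigableString

html = (
    "<HTML><BODY>"
    "<H1>SOME TEXT I DONT WANT</H1> THIS TEXT IS WHICH I WANT "
    "<H1>ANOTHER TEXT I DONT WANT</H1> ANOTHER TEXT THAT I WANT "
    "</BODY></HTML>"
)
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body> are either Tags (the H1s) or bare strings (the wanted text).
wanted = [
    child.strip()
    for child in soup.body.children
    if isinstance(child, NavigableString) and child.strip()
]
print(wanted)
```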

Extracting/Scraping text from a href inside p inside div

Question: I am using Beautiful Soup (bs4) with Python, and I currently have this structure:

    <div class="class1">
      <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
      <p class="specialties"><a href="/location/abcd">ab cd</a></p>
      <p class="doc-clinic-name">
        <a class="light_grey link" href="/clinic/fff">f ff</a>
      </p>
    </div>
    <div class="class2">
      <p class="locality">
        <a class="link grey" href="/location/doctors/ccc">c cc</a>
      </p>
      <p class="fees">INR 999</p>
      <div class="timings">
        <p><span class=
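
The excerpt is cut off before it states the exact goal. Assuming the aim is the link text and href inside each `<p>` of a card, a sketch over the first `div` from the question might look like this; keying the result by the paragraph's first class is a choice made here for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="class1">
  <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
  <p class="specialties"><a href="/location/abcd">ab cd</a></p>
  <p class="doc-clinic-name">
    <a class="light_grey link" href="/clinic/fff">f ff</a>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="class1")

# Map each paragraph's first class name to the (text, href) of the link inside it.
links = {}
for p in card.find_all("p"):
    a = p.find("a")
    if a is not None:
        links[p["class"][0]] = (a.get_text(strip=True), a["href"])

print(links)
```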

BeautifulSoup - getting rid of paragraph whitespace/line breaks

Question:

    similarlist = res.find_all_next("div", class_="result-wrapper")
    for item in similarlist:
        print(item)

This returns:

    <div class="result-wrapper">
      <div class="row-fluid result-row">
        <div class="span6 result-left">
          <p>
            <a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
            <a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
          </p>
        </div>
        <div class="span6 result-right row-fluid">
          <span class="span9">
            <a class=
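
If the goal is the text without the surrounding whitespace and line breaks (an assumption, since the excerpt is cut off), `stripped_strings` yields each text fragment with whitespace trimmed and skips whitespace-only fragments. The HTML below is a shortened version of the question's output.

```python
from bs4 import BeautifulSoup

html = """
<div class="result-wrapper">
  <div class="row-fluid result-row">
    <div class="span6 result-left">
      <p>
        <a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
      </p>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Join the trimmed fragments of each wrapper with single spaces.
texts = [
    " ".join(item.stripped_strings)
    for item in soup.find_all("div", class_="result-wrapper")
]
print(texts)
```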

Scraping a page for URLs using Beautifulsoup

Question: I can scrape the page for the headlines, no problem. The URLs are another story. They are fragments that get appended to the end of the base URL; I understand that much. What do I need in order to pull the related URLs and store them in the format base_url.scraped_fragment?

    from urllib2 import urlopen
    import requests
    from bs4 import BeautifulSoup
    import csv
    import MySQLdb
    import re

    html = urlopen("http://advances.sciencemag.org/")
    soup = BeautifulSoup(html.read().decode('utf-8'), "lxml")
    #links = soup.findAll(
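
Combining a base URL with scraped href fragments can be sketched with the standard library's `urljoin` (Python 3). The helper name `absolutize` and the example fragments are hypothetical; only the base URL comes from the question.

```python
from urllib.parse import urljoin

BASE_URL = "http://advances.sciencemag.org/"


def absolutize(fragments, base_url=BASE_URL):
    """Resolve each href fragment against the base URL."""
    # urljoin handles leading slashes and leaves already-absolute URLs unchanged.
    return [urljoin(base_url, fragment) for fragment in fragments]


# Hypothetical fragments, as they might appear in <a href="..."> attributes.
print(absolutize(["/content/1/1/e1400001", "content/about"]))
```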