beautifulsoup

Python Conditionally Add Class to <td> Tags in HTML Table

最后都变了 - submitted on 2019-12-25 04:47:09
Question: I have some data in the form of a CSV file that I'm reading into Python and converting to an HTML table using pandas. Here's some example data:

    name  threshold  col1  col2  col3
    A     10         12    9     13
    B     15         18    17    23
    C     20         19    22    25

And some code:

    import pandas as pd
    df = pd.read_csv("data.csv")
    table = df.to_html(index=False)

This creates the following HTML:

    <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>name</th> <th>threshold</th> <th>col1</th> <th>col2</th> <th>col3</th> <
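One way to do what the question asks is to post-process the `to_html()` output with BeautifulSoup, comparing each value cell against that row's threshold. This is a sketch, not the asker's code; the `over-threshold` class name and the column layout (name, threshold, then value columns) are assumptions taken from the example data.

```python
# Hypothetical sketch: after df.to_html(), walk the table with BeautifulSoup
# and tag each value cell whose number exceeds that row's threshold.
import pandas as pd
from bs4 import BeautifulSoup

df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "threshold": [10, 15, 20],
    "col1": [12, 18, 19],
    "col2": [9, 17, 22],
    "col3": [13, 23, 25],
})

soup = BeautifulSoup(df.to_html(index=False), "html.parser")
for row in soup.find("tbody").find_all("tr"):
    cells = row.find_all("td")
    threshold = float(cells[1].get_text())   # column order: name, threshold, col1..col3
    for cell in cells[2:]:                   # only the value columns
        if float(cell.get_text()) > threshold:
            cell["class"] = "over-threshold" # class name is an assumption

html = str(soup)
```

The marked cells can then be styled with a CSS rule for `.over-threshold`.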

Beautiful soup find href

倖福魔咒の - submitted on 2019-12-25 04:46:09
Question: I am trying to select just the href inside a specific tr tag. Here is my code:

    soup = bs(driver.page_source, 'html.parser')
    obj = soup.find(text="test545")
    new = obj.parent.previous_sibling.previous_sibling.previous_sibling
    print new
    if new.has_key('href'):
        new = new['href']
        print "found!"

Here is the output:

    <td headers="LINK"><a href="f?p=106:3:92877880706::NO::P3_ID:5502&cs=tmX92fFLmToJQ69ZOs2w"><img border="0" src="/i_5.0/menu/pencil3416x16.gif"/></a></td>

I want to just select the link inside of
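The printed element is the `<td>`, which has no `href`; the attribute lives on the `<a>` inside it. A minimal sketch, using the `<td>` from the question pasted as a literal string (also note `has_key()` was removed in bs4; `has_attr()` is the replacement):

```python
from bs4 import BeautifulSoup

# The <td> from the question, pasted as a literal string for the sketch.
html = ('<td headers="LINK"><a href="f?p=106:3:92877880706::NO::P3_ID:5502'
        '&cs=tmX92fFLmToJQ69ZOs2w"><img border="0" '
        'src="/i_5.0/menu/pencil3416x16.gif"/></a></td>')
td = BeautifulSoup(html, "html.parser").td

# The href lives on the <a> inside the cell, not on the <td> itself,
# so drill down first; has_key() was removed in bs4 - use has_attr().
link = td.find("a")
href = link["href"] if link is not None and link.has_attr("href") else None
```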

How to remove string unicode from list

杀马特。学长 韩版系。学妹 - submitted on 2019-12-25 04:33:31
Question: I am trying to remove the unicode "u'" marks from my list of strings. The list is a list of actors from this site: http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm. I have a method that gets these strings from the website:

    def getActors(item_url):
        response = requests.get(item_url)
        soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
        tempActors = []
        try:
            tempActors.append(soup.find(text="Actors:").find_parent("tr").find_all(text=True)[1:])
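The `u'...'` marks are usually not part of the data at all: they are how Python 2's `repr()` displays unicode strings when a whole list is printed. Printing each element, or joining them, shows the bare text. A small sketch with made-up actor names:

```python
# -*- coding: utf-8 -*-
# The u'' prefix is the Python 2 repr of a unicode string, not a character
# in the data. Printing the list shows the prefixes; printing or joining
# the elements does not. Actor names below are invented for the example.
actors = [u"Robert Downey, Jr.", u"Chris Evans"]

print(actors)              # list repr shows u'' prefixes under Python 2
joined = ", ".join(actors) # joined text carries no u'' marks
print(joined)
```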

Python BeautifulSoup: parsing multiple tables with same class name

雨燕双飞 - submitted on 2019-12-25 04:24:37
Question: I am trying to parse some tables from a wiki page, e.g. http://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2014. There are four tables with the same class name "wikitable". When I write:

    movieList = soup.find('table', {'class': 'wikitable'})
    rows = movieList.findAll('tr')

it works fine, but when I write:

    movieList = soup.findAll('table', {'class': 'wikitable'})
    rows = movieList.findAll('tr')

it throws an error:

    Traceback (most recent call last):
      File "C:\Python27\movieList.py", line 24, in <module>
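The cause: `find()` returns a single Tag, but `findAll()` returns a ResultSet (a list of Tags), and a list has no `findAll()` method, hence the AttributeError. Iterating over the tables fixes it. A sketch with a small two-table document standing in for the wiki page:

```python
from bs4 import BeautifulSoup

# findAll() returns a ResultSet (a list), which has no findAll() of its own;
# loop over the matched tables and collect rows from each one.
html = """
<table class="wikitable"><tr><td>a</td></tr><tr><td>b</td></tr></table>
<table class="wikitable"><tr><td>c</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for table in soup.find_all("table", {"class": "wikitable"}):
    rows.extend(table.find_all("tr"))

print(len(rows))  # rows gathered across both tables
```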

how to get href link from onclick function in python

独自空忆成欢 - submitted on 2019-12-25 04:23:07
Question: I want to get the href link of a website from an onclick function. Here is the HTML in which the onclick function calls a website:

    <div class="fl">
      <span class="taLnk" onclick="ta.trackEventOnPage('Eatery_Listing', 'Website', 594024, 1); ta.util.cookie.setPIDCookie(15190); ta.call('ta.util.link.targetBlank', event, this, {'aHref':
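Since the URL is buried inside the onclick JavaScript rather than an `href` attribute, one approach is to read the `onclick` attribute with BeautifulSoup and pull the `'aHref'` value out with a regex. The question's actual URL is truncated, so the one below is invented for the sketch:

```python
import re
from bs4 import BeautifulSoup

# The target URL sits inside the onclick JavaScript, so extract the attribute
# and regex out the 'aHref' value. The example URL is made up - the real one
# is truncated in the question.
html = """<div class="fl"><span class="taLnk" onclick="ta.trackEventOnPage('Eatery_Listing', 'Website', 594024, 1); ta.call('ta.util.link.targetBlank', event, this, {'aHref': 'http://example.com/eatery'});">Website</span></div>"""

soup = BeautifulSoup(html, "html.parser")
onclick = soup.find("span", {"class": "taLnk"})["onclick"]

match = re.search(r"'aHref':\s*'([^']+)'", onclick)
href = match.group(1) if match else None
```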

Using BeautifulSoup in CGI without installing

柔情痞子 - submitted on 2019-12-25 04:22:57
Question: I am trying to build a simple scraper in Python, which will run on a web server via CGI. Basically it will return a value determined by a parameter passed to it in a URL. I need BeautifulSoup to do the processing of HTML pages on the web server. However, I'm using HelioHost, which doesn't give me shell access or pip etc.; I can only use FTP. On the BS website, it says you can directly extract it and use it without installing. So I got the tarball on my Win7 machine and used 7-zip to remove bz2
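When the extracted `bs4/` package directory is uploaded next to the CGI script over FTP, adding its parent directory to `sys.path` at the top of the script lets the import succeed without pip. A sketch, assuming the package is uploaded into a subdirectory (the `vendor` name is an assumption):

```python
# Sketch: if the extracted bs4/ package directory is uploaded next to the CGI
# script (e.g. via FTP) under a "vendor" subdirectory, putting that directory
# on sys.path lets the import work with no installation step.
import os
import sys

vendor_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "vendor")
sys.path.insert(0, vendor_dir)

# A plain import now finds the uploaded copy of the library:
# from bs4 import BeautifulSoup
```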

Read HTML with Beautifulsoup and find typical data

南笙酒味 - submitted on 2019-12-25 04:14:31
Question: I wrote a similar question before, but I need something different from what I got in the previous question. I have some HTML data, written below (the part of the data where I need it). I have already got the rcpNo value, but eleId changes from 1 to 33, and offset and length don't have any regular pattern. Three of the values are numbers, sometimes with a different number of digits. I need to read rcpNo, eleId, offset, length and dtd. (dtd is fixed as 'dart3.xsd', but I tried this on only one HTML page, so there is a possibility
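The question's HTML sample is truncated, but the identifiers (rcpNo, eleId, dtd) suggest JavaScript calls in which the five values appear as quoted arguments. Under that assumption, a regex with named groups can pull all of them out at once; the `viewDoc(...)` call and its values below are invented for the sketch:

```python
import re

# Hypothetical sketch: assume the parameters appear as quoted arguments of a
# JavaScript call. The function name and all values below are made up, since
# the question's HTML is truncated.
script = "viewDoc('20190401004321', '5', '1234', '5678', 'dart3.xsd');"

match = re.search(
    r"viewDoc\('(?P<rcpNo>\d+)',\s*'(?P<eleId>\d+)',\s*"
    r"'(?P<offset>\d+)',\s*'(?P<length>\d+)',\s*'(?P<dtd>[^']+)'\)",
    script,
)
params = match.groupdict() if match else {}
```

Named groups keep the extraction readable even when offset and length have no regular pattern, since `\d+` matches any digit count.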

Print URL from two different BeautifulSoup outputs

时光毁灭记忆、已成空白 - submitted on 2019-12-25 03:30:43
Question: I am scraping a few URLs in batch using BeautifulSoup. Here is my script (only the relevant parts):

    import urllib2
    from bs4 import BeautifulSoup

    quote_page = 'https://example.com/foo/bar'
    page = urllib2.urlopen(quote_page)
    soup = BeautifulSoup(page, 'html.parser')
    url_box = soup.find('div', attrs={'class': 'player'})
    print url_box

This gives two different kinds of print depending on the HTML of the URL (about half the pages give the first print and the rest give the second). Here's the first kind of print:

    <div
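The question's two `<div class="player">` variants are truncated, so the sketch below invents two plausible shapes: one with the URL in an `<a href>`, one with it in a `data-url` attribute. The point is the pattern, not those particular shapes: try each known structure in turn and take whichever matches.

```python
from bs4 import BeautifulSoup

# Two invented markup variants standing in for the question's truncated ones;
# the branching logic is the part that carries over.
pages = [
    '<div class="player"><a href="http://example.com/a.mp4">play</a></div>',
    '<div class="player" data-url="http://example.com/b.mp4"></div>',
]

urls = []
for html in pages:
    box = BeautifulSoup(html, "html.parser").find("div", attrs={"class": "player"})
    if box.find("a") is not None:      # first variant: link nested in the div
        urls.append(box.find("a")["href"])
    elif box.has_attr("data-url"):     # second variant: attribute on the div
        urls.append(box["data-url"])

print(urls)
```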

Iterate through multiple files and append text from HTML using Beautiful Soup

只愿长相守 - submitted on 2019-12-25 03:13:36
Question: I have a directory of downloaded HTML files (46 of them) and I am attempting to iterate through each of them, read their contents, strip the HTML, and append only the text into a text file. However, I'm unsure where I'm messing up, as nothing gets written to my text file.

    import os
    import glob
    from bs4 import BeautifulSoup

    path = "/"
    for infile in glob.glob(os.path.join(path, "*.html")):
        markup = (path)
        soup = BeautifulSoup(markup)
        with open("example.txt", "a") as myfile:
            myfile.write
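The visible bug is `markup = (path)`: it hands the directory path string to BeautifulSoup instead of each file's contents, so there is no real text to extract. Opening and reading `infile` (the loop variable) fixes it. A sketch of the corrected loop, wrapped in a function so the paths are explicit:

```python
import glob
import os
from bs4 import BeautifulSoup

# Corrected version of the snippet's loop: read each matched file's contents
# (not the directory path) before parsing, then append the stripped text.
def strip_html_files(path, out_path):
    with open(out_path, "a") as out:
        for infile in sorted(glob.glob(os.path.join(path, "*.html"))):
            with open(infile) as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            out.write(soup.get_text() + "\n")
```

Usage: `strip_html_files("/path/to/html/dir", "example.txt")` (both paths here are placeholders).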

BeautifulSoup with an invalid HTML document

痞子三分冷 - submitted on 2019-12-25 02:27:13
Question: I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm. I want to extract everything before "Commission:". (I need BeautifulSoup because the second step is to extract the country and person names.) If I do:

    import urllib
    import re
    from bs4 import BeautifulSoup

    url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
    soup = BeautifulSoup(urllib.urlopen(url))
    print soup.find_all(text=re.compile(
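One way to sidestep the invalid markup: rather than matching text nodes one by one, flatten the whole document with `get_text()` and slice off everything before the "Commission:" marker. A sketch, where the HTML below is a small invented stand-in for the Council press release:

```python
from bs4 import BeautifulSoup

# Stand-in document: the names and structure are made up, only the
# "Commission:" marker mirrors the real page.
html = """<html><body>
<p>Mr John DOE, Minister for Finance (Atlantis)</p>
<p>Commission:</p>
<p>Ms Jane ROE, Commissioner</p>
</body></html>"""

text = BeautifulSoup(html, "html.parser").get_text()
before_commission = text.split("Commission:", 1)[0]
```

The country and person names can then be extracted from `before_commission` with a regex as a second step.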