beautifulsoup

Apply BeautifulSoup function to Pandas DataFrame

Submitted by 家住魔仙堡 on 2020-01-06 05:56:09
Question: I have a Pandas DataFrame that I got from reading a CSV, and that file contains HTML tags I want to remove. I want to remove the tags with BeautifulSoup because it is more reliable than using a simple regex like <.*?>. I usually remove HTML tags from strings by executing text = BeautifulSoup(text, 'html.parser').get_text() Now I want to do this with every element in my DataFrame, so I tried the following: df.apply(lambda text: BeautifulSoup(text, 'html.parser').get_text()) But that returns the …
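
A likely cause is that DataFrame.apply passes whole columns (Series) to the lambda rather than individual cells, so BeautifulSoup receives a Series instead of a string. A minimal element-wise sketch, assuming a hypothetical column name "text" and sample data:

    from bs4 import BeautifulSoup
    import pandas as pd

    # Hypothetical data; in the question the DataFrame comes from a CSV.
    df = pd.DataFrame({"text": ["<p>Hello <b>world</b></p>", "plain text"]})

    def strip_html(value):
        # Guard against non-string cells (NaN, numbers) before parsing.
        if isinstance(value, str):
            return BeautifulSoup(value, "html.parser").get_text()
        return value

    # Element-wise over every cell of the DataFrame ...
    df = df.applymap(strip_html)
    # ... or over a single column, which is usually what is wanted:
    df["text"] = df["text"].apply(strip_html)

applymap works cell by cell across the whole DataFrame; applying to one Series is usually enough if only a single column holds HTML.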

Scrape .aspx form with Python

Submitted by 夙愿已清 on 2020-01-06 05:56:08
Question: I'm trying to scrape https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx, which on paper seems like an easy task, with plenty of resources in other SO questions. Nonetheless, I'm getting the same error no matter how I change my request. I've tried the following: import requests from bs4 import BeautifulSoup url = "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx" with requests.Session() as s: s.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10 …
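
ASP.NET WebForms pages like this one usually reject a POST unless the hidden state fields (__VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION) from the initial GET are echoed back. A minimal sketch under that assumption; the form control name and value near the end are placeholders that must be read from the real page's markup:

    import requests
    from bs4 import BeautifulSoup

    url = "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx"

    with requests.Session() as s:
        s.headers["User-Agent"] = "Mozilla/5.0"
        # First GET the page to collect the hidden ASP.NET state fields.
        soup = BeautifulSoup(s.get(url).text, "html.parser")
        data = {
            tag["name"]: tag.get("value", "")
            for tag in soup.select("input[type=hidden]")
            if tag.get("name")
        }
        # Add the form selections; this control name is hypothetical and must
        # match the actual <select>/<input> name attributes on the page.
        data["ctl00$MainContent$ddlSomething"] = "some value"
        resp = s.post(url, data=data)
        print(resp.status_code)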

Scrape Emails from a list of URLs saved in CSV - BeautifulSoup

Submitted by 醉酒当歌 on 2020-01-06 05:50:09
Question: I am trying to parse through a list of URLs saved in CSV format to scrape email addresses. However, the code below only manages to fetch email addresses from a single website. I need advice on how to modify the code to loop through the list and save the outcome (the list of emails) to a CSV file. import requests import re import csv from bs4 import BeautifulSoup allLinks = [];mails=[] with open(r'url.csv', newline='') as csvfile: urls = csv.reader(csvfile, delimiter=' ', quotechar='|') links = [] for …
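
A minimal sketch of the loop-and-save shape, assuming url.csv holds one URL per row in its first column and that a plain regex over each page's text is enough for this use:

    import csv
    import re
    import requests

    email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    results = []

    # Assumes url.csv has one URL per row in the first column.
    with open("url.csv", newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            url = row[0].strip()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip unreachable sites instead of stopping the loop
            for mail in set(email_re.findall(html)):
                results.append([url, mail])

    # Write one (url, email) pair per row.
    with open("emails.csv", "w", newline="") as f:
        csv.writer(f).writerows([["url", "email"]] + results)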

Scrape America's Career InfoNet

Submitted by 喜欢而已 on 2020-01-06 05:49:35
Question: I've got employer IDs, which can be used to get the business area: https://www.careerinfonet.org/employ4.asp?emp_id=558742391 The HTML contains the data in tr/td tables: Business Description: Exporters (Whls) Primary Industry: Other Miscellaneous Durable Goods Merchant Wholesalers Related Industry: Sporting and Athletic Goods Manufacturing So I would like to get Exporters (Whls), Other Miscellaneous Durable Goods Merchant Wholesalers, Sporting and Athletic Goods Manufacturing. My example code …
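
One common pattern for label/value table rows is to find the <td> holding the label and read its sibling cell. A sketch under the assumption (taken from the snippet) that each label sits in one cell with its value in the next cell of the same row:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.careerinfonet.org/employ4.asp?emp_id=558742391"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Labels as they appear in the question; the exact markup is assumed.
    wanted = ["Business Description:", "Primary Industry:", "Related Industry:"]
    for td in soup.find_all("td"):
        label = td.get_text(strip=True)
        if label in wanted:
            value_cell = td.find_next_sibling("td")
            if value_cell:
                print(label, value_cell.get_text(strip=True))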

HTML table scraping and exporting to CSV: attribute error

Submitted by 心不动则不痛 on 2020-01-06 05:24:25
Question: I'm trying to scrape this HTML table with BeautifulSoup on Python 3.6 in order to export it to CSV, as in the script below. I adapted an earlier example, trying to fit my case. url = 'http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/2050540010/cod/4/anno/2015/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/03' html =urlopen(url).read soup = BeautifulSoup(html(), "lxml") table = soup.select_one("table.tabfin") headers = [th.text("iso-8859-1") for th in table.select("tr …
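
Two likely trip-ups in the snippet: th.text is a plain string, so th.text("iso-8859-1") is not callable, and if select_one("table.tabfin") matches nothing, table is None and the later .select call raises AttributeError. A corrected sketch, keeping the table.tabfin selector from the question (if the encoding needs forcing, from_encoding="iso-8859-1" can be passed to BeautifulSoup):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import csv

    url = ("http://finanzalocale.interno.it/apps/floc.php/certificati/index/"
           "codice_ente/2050540010/cod/4/anno/2015/md/0/cod_modello/CCOU/"
           "tipo_modello/U/cod_quadro/03")

    html = urlopen(url).read()              # call read(), don't just reference it
    soup = BeautifulSoup(html, "lxml")

    table = soup.select_one("table.tabfin")
    headers = [th.get_text(strip=True) for th in table.select("tr th")]
    rows = [[td.get_text(strip=True) for td in tr.select("td")]
            for tr in table.select("tr") if tr.select("td")]

    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)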

Unable to scrape all data

Submitted by 萝らか妹 on 2020-01-06 05:15:25
Question: from bs4 import BeautifulSoup import requests , sys ,os import pandas as pd URL = r"https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/" My_list = ['2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019','2020'] Year= [] CompanyName = [] Rank = [] Score = [] print('\n>>Process started please wait\n\n') for I, Page in enumerate(My_list, start=1): url = r'https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms …
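
A sketch of the year-loop shape, with the caveat that the row selector below is a placeholder (the real class names have to be read from the page) and that the Vault rankings may be filled in by JavaScript, in which case requests alone will not see all the data:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    base = ("https://www.vault.com/best-companies-to-work-for/law/"
            "top-100-law-firms-rankings/year/")
    years = [str(y) for y in range(2007, 2021)]
    records = []

    for year in years:
        resp = requests.get(base + year, timeout=15)
        soup = BeautifulSoup(resp.text, "html.parser")
        # Hypothetical selector: replace with the element that actually wraps
        # one ranking row on the rendered page.
        for row in soup.select("div.rankingItem"):
            records.append({"Year": year,
                            "CompanyName": row.get_text(strip=True)})

    df = pd.DataFrame(records)
    print(df.head())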

Beautiful Soup find_all doesn't find CSS selector with multiple classes

Submitted by 徘徊边缘 on 2020-01-06 03:55:06
Question: On the website there is this <a> element: <a role="listitem" aria-level="1" href="https://www.rest.co.il" target="_blank" class="icon rest" title="this is main title" iconwidth="35px" aria-label="website connection" style="width: 30px; overflow: hidden;"></a> So I use this code to catch the element (note the find_all argument a.icon.rest): import requests from bs4 import BeautifulSoup url = 'http://www.zap.co.il/models.aspx?sog=e-cellphone&pageinfo=1' source_code = requests.get(url) plain_text …
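
find_all("a.icon.rest") looks for a tag literally named a.icon.rest, so it matches nothing; CSS selectors belong in select(). A minimal sketch of both approaches:

    import requests
    from bs4 import BeautifulSoup

    url = "http://www.zap.co.il/models.aspx?sog=e-cellphone&pageinfo=1"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # CSS selector that requires both classes on the same <a>:
    links = soup.select("a.icon.rest")

    # find_all matches classes one at a time, so filter the second class manually:
    links_alt = [a for a in soup.find_all("a", class_="icon")
                 if "rest" in a.get("class", [])]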

Pass variable in soup.find() method - BeautifulSoup Python

Submitted by 梦想与她 on 2020-01-06 03:53:31
Question: Currently, I'm searching for elements using the syntax below: priceele = soup.find(itemprop='price').string.strip() The page contains a <span> element with an attribute named itemprop whose value is price, but I don't need to look for the <span> element because there is only one element with the attribute itemprop. Now, what I want is to pass itemprop='price' as a variable to the soup.find() method because I'm loading these two things from a database dynamically. Is it possible? Answer 1: If by "two …
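
Yes: the attribute name and value can be supplied at runtime either by unpacking a dict into keyword arguments or through the attrs parameter. A minimal sketch with stand-in markup and hypothetical variable names:

    from bs4 import BeautifulSoup

    html = '<span itemprop="price">19.99</span>'   # minimal stand-in markup
    soup = BeautifulSoup(html, "html.parser")

    # Values loaded dynamically, e.g. from a database row (hypothetical names).
    attr_name = "itemprop"
    attr_value = "price"

    # Either build the keyword argument with a dict...
    priceele = soup.find(**{attr_name: attr_value}).string.strip()

    # ...or use the attrs parameter, which accepts a plain dictionary.
    priceele = soup.find(attrs={attr_name: attr_value}).string.strip()
    print(priceele)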

Trouble scraping a page

Submitted by 本小妞迷上赌 on 2020-01-05 12:16:55
Question: Referring to one of my previous questions, I have to scrape the reviews (all reviews) of a hotel, for example this hotel. Using BeautifulSoup, what I have done is first get all the review page links from the pagination inside the div with class BVRRPager BVRRPageBasedPager, and then scrape the reviews from all pages. The problem with BeautifulSoup is that the content in div.BVRRRatingSummary does not come along (try loading that page with JS disabled). I have scraped the reviews using Selenium but …
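
Because the review markup is injected by JavaScript, one workable pattern is to let Selenium render the page, wait explicitly for the review container, and then hand the rendered DOM to BeautifulSoup. A sketch under assumptions: the URL is a placeholder and the review class name near the end is a guess in the style of the Bazaarvoice names mentioned in the question:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    url = "https://www.example.com/hotel-reviews"  # placeholder for the hotel page

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until the JavaScript-rendered rating summary actually exists.
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.BVRRRatingSummary"))
        )
        # Hand the rendered DOM to BeautifulSoup for the usual parsing.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for review in soup.select("div.BVRRReviewDisplayStyle3"):  # hypothetical class
            print(review.get_text(" ", strip=True))
    finally:
        driver.quit()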

Detecting BeautifulSoup Freeze

Submitted by 狂风中的少年 on 2020-01-05 12:08:13
Question: Parsing the following site using BeautifulSoup with the html5lib parser hangs, but the same works well with html.parser. Site: http://www.webmasterworld.com/google/3584866.htm #contents is fetched via urllib2 soup = BeautifulSoup(contents, 'html5lib') soup('a') hangs, but the following works just fine: soup = BeautifulSoup(contents1, 'html.parser') soup('a') > [.....] Is there something wrong with what I am doing, or is there a way to detect if soup will fail before making the call to soup('a') from …
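
There is no built-in way to know in advance that html5lib will hang on a given document, but the parse can be isolated in a separate process with a timeout, falling back to html.parser when it does not return. A minimal sketch, assuming a 30-second budget is acceptable:

    import multiprocessing
    from bs4 import BeautifulSoup

    def parse_links(contents, queue):
        # Both the parse and the soup('a') call run inside the child process.
        soup = BeautifulSoup(contents, "html5lib")
        queue.put([a.get("href") for a in soup("a")])

    def parse_with_timeout(contents, seconds=30):
        """Return the links, or None if html5lib parsing hangs past the timeout."""
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=parse_links, args=(contents, queue))
        proc.start()
        proc.join(seconds)
        if proc.is_alive():          # still parsing: assume it is stuck
            proc.terminate()
            proc.join()
            return None
        return queue.get()

    if __name__ == "__main__":
        html = "<html><body><a href='http://example.com'>x</a></body></html>"
        links = parse_with_timeout(html)
        if links is None:
            # Fall back to the parser that is known to work on this document.
            links = [a.get("href") for a in BeautifulSoup(html, "html.parser")("a")]
        print(links)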