beautifulsoup

CSS selectors to be used for scraping specific links

ぃ、小莉子 submitted on 2019-12-24 07:29:32
Question: I am new to Python and working on a scraping project. I am using Firebug to copy the CSS path of the required links. I am trying to collect the links under the "UPCOMING EVENTS" tab from http://kiascenehai.pk/, but this is just for learning how to get specified links. I am looking for a fix for this problem and also for suggestions on how to retrieve specified links using CSS selectors.
from bs4 import BeautifulSoup
import requests
url = "http://kiascenehai.pk/"
r = requests.get(url)
data =
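
A minimal sketch of selecting links with a CSS selector through BeautifulSoup's `select()`. The markup, the `#upcoming` id and the `events` class are assumptions made up for illustration; the real page's structure is not shown in the excerpt, so the selector would need to be adapted to whatever Firebug reports.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (not taken from the site).
html = """
<div id="upcoming">
  <ul class="events">
    <li><a href="/event/1">First event</a></li>
    <li><a href="/event/2">Second event</a></li>
  </ul>
</div>
<div id="other"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns every matching tag.
# a[href] keeps only anchors that actually carry an href attribute.
links = [a["href"] for a in soup.select("div#upcoming ul.events a[href]")]
print(links)
```

Browser-copied CSS paths are often longer than necessary; a short selector anchored on a stable id or class tends to survive page changes better.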

Scrape tables with python

最后都变了- submitted on 2019-12-24 07:25:58
Question: I am trying to scrape tables and convert them into data.tables in Python, but I have had little luck with US election data. This is the HTML of the data I want to scrape:
<tr class="type-republican"> <th class="results-name" scope="row"><a href="xxxxx"><span class="name-combo"><span class="token token-party"><abbr title="Republican">R</abbr></span> <span class="token token-winner"><b aria-hidden="true" class="icon icon-check"></b> <span class="icon-text">Winner</span></span> D. Trump</span></a></th
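
A rough sketch of turning table rows like these into records. The excerpt's HTML is cut off, so the table below is a simplified, hypothetical stand-in: only the `type-...` row class and the `results-name` header cell come from the question, while the `td` cell, the candidate names and the values are placeholders.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical table; the values are placeholders, not real results.
html = """
<table>
  <tr class="type-republican">
    <th class="results-name" scope="row">Candidate A</th><td>10</td>
  </tr>
  <tr class="type-democrat">
    <th class="results-name" scope="row">Candidate B</th><td>20</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    name_cell = tr.find("th", class_="results-name")
    value_cell = tr.find("td")
    rows.append({
        # "type-republican" -> "republican"
        "party": tr["class"][0].split("-", 1)[1],
        "name": name_cell.get_text(strip=True),
        "value": value_cell.get_text(strip=True),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` if a tabular object is wanted.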

Extracting a table from a website

假如想象 submitted on 2019-12-24 07:24:42
Question: I've tried many times to retrieve the table at this website: http://www.whoscored.com/Players/845/History/Tomas-Rosicky (the one under "Historical Participations").
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.whoscored.com/Players/845/').read())
This is the Python code I am using to retrieve the table HTML, but I am getting an empty string. Help me out!
Answer 1: The desired table is formed via an asynchronous API call to the http://www.whoscored

Get content-type from HTML page with BeautifulSoup

非 Y 不嫁゛ submitted on 2019-12-24 07:12:16
Question: I am trying to get the character encoding for pages that I scrape, but in some cases it fails. Here is what I am doing:
resp = urllib2.urlopen(request)
self.COOKIE_JAR.extract_cookies(resp, request)
content = resp.read()
encodeType= resp.headers.getparam('charset')
resp.close()
That is my first attempt. But if charset comes back as None, I do this:
soup = BeautifulSoup(html)
if encodeType == None:
    try:
        encodeType = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content
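
One way the meta-tag fallback could look, as a sketch rather than the asker's final code. The helper name `charset_from_meta` is made up here; it checks the HTML5 `<meta charset>` form first and then the older `http-equiv="Content-Type"` form, matching the attribute case-insensitively with a regular expression so that a tag without `http-equiv` cannot raise an error.

```python
import re
from bs4 import BeautifulSoup

def charset_from_meta(html):
    """Return the charset declared in a <meta> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")

    # HTML5 style: <meta charset="utf-8">
    tag = soup.find("meta", charset=True)
    if tag:
        return tag["charset"].strip().lower()

    # Older style: <meta http-equiv="Content-Type" content="text/html; charset=...">
    tag = soup.find("meta", attrs={"http-equiv": re.compile(r"^content-type$", re.I)})
    if tag and tag.get("content"):
        match = re.search(r"charset=([\w-]+)", tag["content"], re.I)
        if match:
            return match.group(1).lower()
    return None

print(charset_from_meta('<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'))
```

A lambda such as `lambda v: v.lower() == ...` is also called for tags that lack the attribute, in which case `v` is `None`; a compiled pattern avoids that case.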

Beautiful Soup - Results to CSV for all items in lists

試著忘記壹切 submitted on 2019-12-24 07:01:54
Question: The snippet below "works" but only outputs the first record to the CSV. I'm trying to get the same output for each gun in the list of gun URLs in all_links. Every modification I've made, such as printing the output just to see it working, prints the same result, and if I make a gun_details list and try to print it, I get the same one-item output. How would I go about writing all the gun_details labels and spans to a CSV?
import csv
import urllib.request
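
The asker's code is cut off, so the cause cannot be confirmed from the excerpt. A common reason for a single-row CSV is that the write happens once, outside the loop over the links. The sketch below shows the row being written inside the loop; `scrape_details`, the example URLs and the field names are hypothetical stand-ins for the real scraping code.

```python
import csv
import io

# Hypothetical list of detail-page URLs (stands in for all_links).
all_links = ["https://example.com/item/1", "https://example.com/item/2"]

def scrape_details(url):
    """Hypothetical stand-in for the per-page scraping code."""
    return {"url": url, "label": "Item " + url.rsplit("/", 1)[-1]}

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("out.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "label"])
writer.writeheader()

for link in all_links:
    # One writerow() call per link, inside the loop.
    writer.writerow(scrape_details(link))

csv_text = buffer.getvalue()
print(csv_text)
```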

Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

这一生的挚爱 submitted on 2019-12-24 07:01:34
Question: I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with Python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object. Currently, parse_thead does the following when the <thead> tag is present: In BeautifulSoup, I get table objects with doc.find_all('table') and
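
One possible heuristic, sketched with BeautifulSoup only: treat the leading rows whose cells are all `<th>` as the header and everything after them as the body. This is an assumption about what counts as a header, not an established rule, and the function name `split_header_rows` and the sample table are made up for illustration. It does not handle nested tables or footers.

```python
from bs4 import BeautifulSoup

def split_header_rows(table):
    """Split a table's rows into (header_rows, body_rows) without relying on <thead>."""
    header, body = [], []
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"], recursive=False)
        # Leading rows made up entirely of <th> cells count as header rows;
        # once a body row has been seen, later all-<th> rows stay in the body.
        if not body and cells and all(cell.name == "th" for cell in cells):
            header.append(row)
        else:
            body.append(row)
    return header, body

# Hypothetical table without <thead>, in the style MediaWiki produces.
html = """
<table>
  <tr><th>Name</th><th>Year</th></tr>
  <tr><th scope="row">Alpha</th><td>2001</td></tr>
  <tr><th scope="row">Beta</th><td>2002</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
header_rows, body_rows = split_header_rows(table)
print([th.get_text() for th in header_rows[0].find_all("th")])
```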

Pandas: Trouble Stripping HTML Tags From DataFrame Column

人盡茶涼 submitted on 2019-12-24 06:57:27
Question: I have a Pandas DataFrame with a text column containing HTML. I want to get just the text, that is, strip the tags. I tried to do this as follows:
from bs4 import BeautifulSoup
result_df['text'] = BeautifulSoup(result_df['text']).get_text()
However, I end up getting this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What am I doing incorrectly? Thanks!
Answer 1: Try this:
from bs4 import BeautifulSoup
result_df['text'] =
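
The answer above is truncated, so the following is not necessarily what it proposed. It is one common approach: `BeautifulSoup` expects a single string, not a whole Series, so the parsing is applied to each cell with `Series.apply`. The sample DataFrame is made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical data standing in for result_df.
result_df = pd.DataFrame({"text": ["<p>Hello <b>world</b></p>", "<div>plain</div>"]})

# Parse each cell on its own; passing the whole Series to BeautifulSoup
# is what triggers the "truth value of a Series is ambiguous" error.
result_df["text"] = result_df["text"].apply(
    lambda html: BeautifulSoup(html, "html.parser").get_text()
)

print(result_df["text"].tolist())
```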

Cannot figure out what's wrong with beautifulsoup4 in my python 3 script

风格不统一 submitted on 2019-12-24 06:47:50
Question:
Traceback (most recent call last):
  File "urlgrabber.py", line 1, in <module>
    from bs4 import BeautifulSoup
  File "/Users/asdf/Desktop/Scraper/bs4/__init__.py", line 29, in <module>
    from .builder import builder_registry
  File "/Users/asdf/Desktop/Scraper/bs4/builder/__init__.py", line 4, in <module>
    from bs4.element import (
  File "/Users/asdf/Desktop/Scraper/bs4/element.py", line 5, in <module>
    from bs4.dammit import EntitySubstitution
  File "/Users/asdf/Desktop/Scraper/bs4/dammit.py", line 13,
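
The traceback is cut off before the actual error, so the cause is not known from the excerpt. What it does show is that `bs4` is being loaded from a folder next to the script rather than from the interpreter's installed packages. A small diagnostic sketch for checking which copy of a module Python would import (the helper name `module_origin` is made up here):

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# Prints the path of the bs4 package Python would import, or None if it is not installed.
print(module_origin("bs4"))
```

If the printed path points into the project directory, a local copy is shadowing the installed package, which may or may not be related to the error in this case.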

How to get src attribute from <image/> with Python

ε祈祈猫儿з submitted on 2019-12-24 06:30:51
Question: I am scraping data from one site, and I need to find one img. I get it, but the output is not what I need. I have tried looking online for solutions and changing the code, but nothing worked.
r = requests.get(baseurl)
content = r.content
soup = BeautifulSoup(content, "html.parser")
images = soup.findAll('img')[1]
print(images)
Output I get: <img src="https://cdn.rubyrealms.com/images/WKpivrdGBJJ9p6etIY2aJpixikFj4vnpmpPR9pXjK4Y8K.png" style="border-radius: 5px"/>
Output I need: cdn.rubyrealms.com
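
A short sketch of reading the attribute instead of printing the whole tag. A `Tag` can be indexed like a dictionary, so `img["src"]` gives the URL. The desired output in the excerpt is cut off, so it is unclear whether the full URL or only the host was wanted; both are shown, with the host extracted by `urllib.parse.urlparse`.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# The tag from the question, parsed from a string to keep the sketch offline.
html = ('<img src="https://cdn.rubyrealms.com/images/'
        'WKpivrdGBJJ9p6etIY2aJpixikFj4vnpmpPR9pXjK4Y8K.png" '
        'style="border-radius: 5px"/>')

soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

src = img["src"]             # full URL from the src attribute
host = urlparse(src).netloc  # just the host part of that URL

print(src)
print(host)
```

`img.get("src")` behaves the same way but returns `None` instead of raising `KeyError` when the attribute is missing.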

Parsing xml file using Python3 and BeautifulSoup

烈酒焚心 submitted on 2019-12-24 05:52:13
Question: I know there are several answers to questions regarding XML parsing with Python 3, but I can't find the answer to two that I have. I am trying to parse and extract information from a BoardGameGeek XML file that looks like the following (it's too long for me to paste in here): https://www.boardgamegeek.com/xmlapi/boardgame/10
1) I am having trouble extracting the primary game name from these two lines:
<name sortindex="1" primary="true">Elfenland</name> <name sortindex="1">Elfenland (Волшебное
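
A minimal sketch for the first question: filtering `<name>` elements by their `primary` attribute. The second `<name>` element in the excerpt is cut off, so a placeholder is used in its place, and the `<boardgame>` wrapper is an assumption about the surrounding document. The built-in `html.parser` is used to avoid extra dependencies; for real XML, `BeautifulSoup(xml, "xml")` with lxml installed is generally the better fit.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second <name> is a placeholder for the truncated one.
xml = """
<boardgame>
  <name sortindex="1" primary="true">Elfenland</name>
  <name sortindex="1">Placeholder alternate name</name>
</boardgame>
"""

soup = BeautifulSoup(xml, "html.parser")

# attrs= is used because "name" is also the first positional parameter of find().
primary = soup.find("name", attrs={"primary": "true"})
primary_name = primary.get_text(strip=True)
print(primary_name)
```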