beautifulsoup

Scraping a website whose encoding is iso-8859-1 instead of utf-8: how do I store the correct unicode in my database?

家住魔仙堡 · Submitted on 2019-12-24 10:46:59
Question: I'd like to scrape a website with Python that is full of horrible problems, one of them being the wrong encoding declared at the top: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">. This is wrong because the page is full of occurrences like the following: Nell’ambito instead of Nell'ambito (notice that ’ replaces '). If I understand correctly, this is happening because UTF-8 bytes (probably the database encoding) are being interpreted as ISO-8859-1 bytes (forced by the charset in the…
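One common repair for this kind of mojibake, assuming the text really is UTF-8 that was mis-decoded: re-encode the string and decode it again. Note the sketch uses cp1252 rather than iso-8859-1, because browsers (and the characters ’ shown above, which include € and ™) actually follow the Windows-1252 superset; the resulting clean str can then be stored in the database as UTF-8.

```python
def fix_mojibake(text):
    # UTF-8 bytes were decoded with the wrong codec (cp1252, the
    # superset browsers substitute for iso-8859-1); reverse that step.
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # text was already clean, leave it alone

print(fix_mojibake("Nell’ambito"))  # Nell’ambito (curly apostrophe)
```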

Parsing a Table from the following website

假如想象 · Submitted on 2019-12-24 10:39:48
Question: I want to collect the past weather details of a particular city in India for each day in the year 2016. The following website has this data: "https://www.timeanddate.com/weather/india/kanpur/historic?month=1&year=2016". This link has the data for January 2016. There is a nice table there that I want to extract. I have tried plenty, and I could only extract a different table, but I do not want that one; it does not serve my purpose. I want the other, big table with the data…
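When a page holds several tables, `soup.find("table")` returns only the first one, which is the usual reason the "wrong" table keeps coming back. Targeting the wanted table by a distinguishing attribute fixes this; the sketch below uses a local stand-in snippet, and the id `wt-his` is an assumption to verify against the real page in your browser's devtools.

```python
from bs4 import BeautifulSoup

# Stand-in for a page with a small summary table plus the big data table
html = ("<table class='summary'><tr><td>small table</td></tr></table>"
        "<table id='wt-his'>"
        "<tr><th>Time</th><th>Temp</th></tr>"
        "<tr><td>00:00</td><td>12 °C</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="wt-his")  # not soup.find("table")
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
print(rows)  # [['Time', 'Temp'], ['00:00', '12 °C']]
```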

Python - Issue Scraping with BeautifulSoup

一笑奈何 · Submitted on 2019-12-24 09:51:11
Question: I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I'm facing an issue where I'm trying to scrape all the links to the 50 jobs listed on each page, using a regex to identify those links. Even though I reference the tag properly, I am facing two specific issues: instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link)…
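Without the actual regex it's hard to diagnose the 25-vs-50 mismatch, but common culprits are a pattern that matches only one of two link styles a job card exposes, or duplicate links being collapsed or double-counted. A minimal sketch of matching hrefs with `find_all` plus a regex and de-duplicating while preserving order (the `/jobs/` pattern and markup below are illustrative, not the real page):

```python
import re
from bs4 import BeautifulSoup

# Cut-down stand-in: one job can expose the same URL via several <a> tags
html = ("<a href='/jobs/1/dev'>Dev</a>"
        "<a href='/jobs/1/dev'>Dev (company link)</a>"
        "<a href='/jobs/2/ops'>Ops</a>"
        "<a href='/users/9'>nav link</a>")
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/jobs/\d+/"))]
unique = list(dict.fromkeys(links))  # keep first occurrence, drop repeats
print(unique)  # ['/jobs/1/dev', '/jobs/2/ops']
```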

BeautifulSoup select all href in some element with specific class

。_饼干妹妹 · Submitted on 2019-12-24 09:29:39
Question: I'm trying to scrape images from this website. I tried with Scrapy (using Docker) and with Scrapy/Selenium. Scrapy does not seem to work on Windows 10 Home, so I'm now trying with Selenium/BeautifulSoup. I'm using Python 3.6 with Spyder in an Anaconda env. This is what the href elements I need look like: <a class="emblem" href="detail/emblem/av1615001">. I have two major problems: how should I select the href with BeautifulSoup? (Below in my code you can see what I tried, but it didn't work.) As it is…
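For the first problem, `select()` with the CSS selector `a.emblem` picks up every `<a>` carrying that class; since the hrefs are relative, `urljoin` resolves them against the page they came from. A sketch using the element quoted above (the `page_url` is a hypothetical placeholder for the page you actually fetched):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

html = '<a class="emblem" href="detail/emblem/av1615001">emblem</a>'
soup = BeautifulSoup(html, "html.parser")
page_url = "http://www.example.com/search"  # hypothetical page address
# CSS selector: every <a> with class "emblem"; urljoin makes hrefs absolute
hrefs = [urljoin(page_url, a["href"]) for a in soup.select("a.emblem")]
print(hrefs)  # ['http://www.example.com/detail/emblem/av1615001']
```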

Send POST data to an input form and scrape the page (Python, Requests library)

痴心易碎 · Submitted on 2019-12-24 09:19:16
Question: I have a problem: I don't know how I can send POST data and then scrape the content of the next page. A simple example for better understanding: Facebook's profile-recovery page with one input: http://m.facebook.com/login/identify?ctx=recover Input: <input autocapitalize="off" class="y z ba" id="login_identify_search_placeholder" name="email" autofocus="1" placeholder="Adres e-mail lub numer telefonu" type="text"> I want to make a script that recovers my account, so I want to send my email to the input via POST and…
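The input's `name` attribute ("email" above) becomes the form-field key in the POST body. A robust pattern is to parse the `<form>`, collect every named input (hidden fields often carry required tokens), fill in your value, and post the dict with `requests`. Sketch below on a local snippet; the `lsd` hidden field and the form markup are hypothetical stand-ins:

```python
from bs4 import BeautifulSoup

form_html = ("<form method='post' action='/login/identify?ctx=recover'>"
             "<input name='email' type='text'>"
             "<input name='lsd' type='hidden' value='AVqKj3'>"
             "</form>")
soup = BeautifulSoup(form_html, "html.parser")
form = soup.find("form")
# Start from every named input so hidden tokens are not lost
payload = {inp["name"]: inp.get("value", "") for inp in form.find_all("input")}
payload["email"] = "me@example.com"
print(payload)
# Then, with requests (a Session keeps cookies across the two steps):
#   with requests.Session() as s:
#       resp = s.post(urljoin(page_url, form["action"]), data=payload)
#       next_page = BeautifulSoup(resp.text, "html.parser")
```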

Scrape embedded tweets from a webpage with Selenium and BeautifulSoup

∥☆過路亽.° · Submitted on 2019-12-24 08:23:40
Question: I need to extract tweets embedded in text articles. The problem with the pages I'm testing is that they load the tweets in only ~5 out of 10 runs, so I need to use Selenium to wait for the page to load, but I cannot make it work. I followed the steps from the official website: url = 'https://www.bbc.co.uk/news/world-us-canada-44648563' options = webdriver.ChromeOptions() options.add_argument("headless") driver = webdriver.Chrome(executable_path='/Users/ME/Downloads/chromedriver', chrome_options=options)…
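The usual missing piece here is an explicit wait: after `driver.get()`, block until a tweet element has actually rendered before handing `driver.page_source` to BeautifulSoup. A sketch under the assumption that embedded tweets appear as `blockquote.twitter-tweet` or Twitter widget iframes (verify the selector in devtools); note that newer Selenium deprecates the `chrome_options=` keyword in favour of `options=`:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://www.bbc.co.uk/news/world-us-canada-44648563"
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    # Block (up to 15 s) until at least one embedded tweet has rendered;
    # the CSS selector is an assumption to check against the live page.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR,
             "blockquote.twitter-tweet, iframe[id^='twitter-widget']")))
    soup = BeautifulSoup(driver.page_source, "html.parser")
finally:
    driver.quit()
```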

Beautiful Soup Error: '<class 'bs4.element.Tag'>' object has no attribute 'contents'?

痞子三分冷 · Submitted on 2019-12-24 08:10:13
Question: I'm writing a script that extracts the content out of an article and removes any unnecessary stuff, e.g. scripts and styling. Beautiful Soup keeps raising the following exception: '<class 'bs4.element.Tag'>' object has no attribute 'contents' Here's the code of the trim function (element is the HTML element that contains the content of the webpage): def trim(element): elements_to_remove = ('script', 'style', 'link', 'form', 'object', 'iframe') for i in elements_to_remove: remove_all_elements…
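Without the full traceback the exact cause is unclear, but this kind of AttributeError typically comes from walking `.contents` manually while mutating the tree (or treating a NavigableString as a Tag). BeautifulSoup's `decompose()` removes a tag and everything inside it in place, which sidesteps the problem entirely; a sketch of the trim function on that idea:

```python
from bs4 import BeautifulSoup

html = ("<div><p>Keep this.</p><script>alert(1)</script>"
        "<style>p {}</style><iframe src='x'></iframe></div>")
soup = BeautifulSoup(html, "html.parser")

def trim(element):
    # decompose() deletes each matched tag and its children in place,
    # so there is no need to iterate .contents while editing the tree
    for bad in element.find_all(["script", "style", "link", "form",
                                 "object", "iframe"]):
        bad.decompose()
    return element

cleaned = trim(soup).get_text(strip=True)
print(cleaned)  # Keep this.
```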

Beautiful Soup: remove superscripts

强颜欢笑 · Submitted on 2019-12-24 07:57:56
Question: How do I remove the superscripts from all of the text? I have code below that gets all visible text, but the superscripts used for footnoting are messing things up. How do I remove them? For example, in Active accounts (1),(2), the (1),(2) are visible superscripts. from bs4 import BeautifulSoup from bs4.element import Comment import requests f_url='https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/exhibit991prq12018pypl.htm' def tag_visible(element): if element.parent.name in ['style',…
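Assuming the footnote markers are wrapped in `<sup>` elements (worth confirming in the filing's HTML), the simplest approach is to `decompose()` every `<sup>` before extracting the visible text, so the markers never reach the `tag_visible` filter at all. Minimal sketch on a local snippet:

```python
from bs4 import BeautifulSoup

html = "<p>Active accounts<sup>(1),(2)</sup> grew.</p>"
soup = BeautifulSoup(html, "html.parser")
for sup in soup.find_all("sup"):
    sup.decompose()  # drop the footnote marker and its text in place
text = soup.get_text()
print(text)  # Active accounts grew.
```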

Scraping Wikipedia tables with Python selectively

一世执手 · Submitted on 2019-12-24 07:52:28
Question: I am having trouble parsing a wiki table and hope someone who has done it before can give me advice. From List_of_current_heads_of_state_and_government I need the countries (works with the code below) and then only the first mention of the head of state, plus their name. I am not sure how to isolate the first mention, as they all come in one cell, and my attempt to pull the names gives me this error: IndexError: list index out of range. I will appreciate your help! import requests from bs4 import…
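On that Wikipedia page the multiple office-holders in one cell are separated by `<br>` tags, so the "first mention" is everything before the cell's first `<br>`. One way to isolate it is to walk the cell's children and stop at the first `<br>`; the snippet below runs on a simplified stand-in row (the real table markup is richer, so treat this as a sketch of the technique, not the final scraper):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one row: country, then a cell where <br>
# separates the head of state from further office-holders
html = ("<table><tr>"
        "<td><a>France</a></td>"
        "<td><a>President</a> – <a>Emmanuel Macron</a><br>"
        "<a>Prime Minister</a> – <a>Jean Castex</a></td>"
        "</tr></table>")
soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    country = tds[0].get_text(strip=True)
    parts = []
    for node in tds[1].children:
        if getattr(node, "name", None) == "br":
            break  # everything after the first <br> is a later mention
        parts.append(node.get_text() if getattr(node, "name", None)
                     else str(node))
    rows.append((country, "".join(parts)))
print(rows)  # [('France', 'President – Emmanuel Macron')]
```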

How to change the encoding for a Python list?

廉价感情. · Submitted on 2019-12-24 07:48:23
Question: I use the following code to scrape a table from a Chinese website. It works fine, but it seems that the contents I stored in the list are not shown properly. import requests from bs4 import BeautifulSoup import pandas as pd x = requests.get('http://www.sohu.com/a/79780904_126549') bs = BeautifulSoup(x.text,'lxml') clg_list = [] for tr in bs.find_all('tr'): tds = tr.find_all('td') for i in range(len(tds)): clg_list.append(tds[i].text) print(tds[i].text) When I print the text, it shows Chinese…
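If the stored strings come out garbled rather than merely escaped in the list's repr, the likely culprit is `x.text`: requests falls back to ISO-8859-1 when the server omits a charset, which mangles Chinese text. Two common fixes are setting `x.encoding = x.apparent_encoding` before reading `x.text`, or passing the raw bytes `x.content` to BeautifulSoup so its own encoding detection runs. A minimal sketch of the second option with a local byte string standing in for the response body:

```python
from bs4 import BeautifulSoup

raw = "<tr><td>清华大学</td></tr>".encode("utf-8")  # stand-in for x.content
# Feeding bytes (not a mis-decoded str) lets BeautifulSoup detect the
# page's real encoding itself
soup = BeautifulSoup(raw, "html.parser")
clg_list = [td.get_text() for td in soup.find_all("td")]
print(clg_list)  # ['清华大学']
```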