beautifulsoup

Scraper in Python gives “Access Denied”

大兔子大兔子 submitted on 2020-07-15 19:22:55

Question: I'm trying to write a scraper in Python to get some info from a page, such as the titles of the offers that appear on this page: https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585 So far I use this code:

    import bs4
    import requests

    def extract_source(url):
        source = requests.get(url).text
        return source

    def extract_data(source):
        soup = bs4.BeautifulSoup(source)
        names = soup.findAll('title')
        for i in names:
            print i

    extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))
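The "Access Denied" response usually comes from the site rejecting requests that carry the default python-requests User-Agent. Below is a minimal sketch of one common workaround: sending browser-like headers. The header values and the choice of parser are illustrative assumptions, not taken from the original question.

    import bs4
    import requests

    # Browser-like headers; many sites block the default python-requests User-Agent.
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    def extract_source(url):
        # Pass the headers with every request so the server sees a "normal" browser.
        return requests.get(url, headers=HEADERS).text

    def extract_data(source):
        # Name the parser explicitly to avoid bs4's "no parser specified" warning.
        soup = bs4.BeautifulSoup(source, 'html.parser')
        for title in soup.find_all('title'):
            print(title.text)

    extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))

If the site still blocks the request, the page is probably protected beyond simple header checks, and a browser-driven approach such as Selenium may be needed.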

Parsing HTML files in the same directory in Python

。_饼干妹妹 submitted on 2020-07-10 10:32:49

Question: I have written code to parse HTML files:

    from bs4 import BeautifulSoup
    import re
    import os
    from os.path import join

    for (dirname, dirs, files) in os.walk('.'):
        for filename in files:
            if filename.endswith('.html'):
                thefile = os.path.join(dirname, filename)
                with open(thefile, 'r') as f:
                    contents = f.read()
                    soup = BeautifulSoup(contents, 'lxml')
                    Initialtext = soup.get_text()
                    MediumText = Initialtext.lower().split()
                    clean_tokens = [t for t in text2 if re.match(r'[^\W\d]*$', t)]
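A minimal sketch of the same walk-and-parse loop follows. It assumes the list comprehension was meant to filter MediumText (the original refers to an undefined name, text2) and that the lxml parser is installed; the explicit file encoding and the final print are additions for illustration.

    import os
    import re
    from bs4 import BeautifulSoup

    for dirname, dirs, files in os.walk('.'):
        for filename in files:
            if not filename.endswith('.html'):
                continue
            thefile = os.path.join(dirname, filename)
            # Read with an explicit encoding so odd characters don't raise on some platforms.
            with open(thefile, 'r', encoding='utf-8', errors='ignore') as f:
                contents = f.read()
            soup = BeautifulSoup(contents, 'lxml')
            initial_text = soup.get_text()
            tokens = initial_text.lower().split()
            # Keep only tokens made of letters (no digits, no punctuation).
            clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
            print(thefile, len(clean_tokens))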

Can't get data in table form using Selenium Python

拜拜、爱过 submitted on 2020-07-10 03:19:04

Question: I am new to scraping with Selenium in Python. I can retrieve some of the data, but I want it in table form, as it is displayed on the web page. Here is what I have so far:

    url = 'https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
    browser = webdriver.Chrome(r"C:\task\chromedriver")
    browser.get(url)
    time.sleep(25)
    rows_in_table = browser.find_elements_by_xpath('//table[@class="dgrid-row-table"]//tr[th or td]')
    for element in rows_in_table:
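A sketch of one way to rebuild the rows as a list of lists by reading the th/td cells of each matched row. The row XPath and the driver path are reused from the question; the per-row cell handling is an assumption about the page's grid markup.

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = ('https://definitivehc.maps.arcgis.com/home/item.html'
           '?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data')

    browser = webdriver.Chrome(r"C:\task\chromedriver")
    browser.get(url)
    time.sleep(25)  # crude wait for the JavaScript grid to render

    table_data = []
    rows = browser.find_elements(By.XPATH, '//table[@class="dgrid-row-table"]//tr[th or td]')
    for row in rows:
        # Collect the text of every header/data cell in this row, in order.
        cells = row.find_elements(By.XPATH, './/th | .//td')
        table_data.append([cell.text for cell in cells])

    for row in table_data:
        print(row)

    browser.quit()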

Gibberish text output because of encoding in web scraping

人盡茶涼 submitted on 2020-07-09 14:20:37

Question: I'm trying to get text in Persian from Google Translate, and the best encoding for Persian is UTF-8. Google Translate uses JavaScript to render its HTML, so I'm using the requests-html module for this. The problem is the output I get each time, both when I use print() and when I try to write it to a file: both give me gibberish, non-Persian text, and I know it's because of the encoding or something like that. So I was trying to change
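A minimal sketch of one common fix: forcing UTF-8 before reading the response text and writing the output file with an explicit encoding. It assumes the requests-html package (which downloads Chromium the first time render() runs); the URL and the body selector are placeholders, not taken from the question.

    from requests_html import HTMLSession

    url = 'https://example.com/persian-page'  # placeholder URL

    session = HTMLSession()
    r = session.get(url)
    # requests guesses the encoding from the headers; override it before touching the text.
    r.encoding = 'utf-8'
    r.html.render()  # execute the page's JavaScript (downloads Chromium on first run)

    text = r.html.find('body', first=True).text

    print(text)
    # Write with an explicit encoding so the file does not come out as mojibake.
    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write(text)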

How to extract the text in the textarea frame of the DeepL page?

旧巷老猫 submitted on 2020-07-09 12:52:43

Question: From https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F we see this: [screenshot of the translated page]. But the translated text "Bonjour, comment allez-vous aujourd'hui?" doesn't appear anywhere in the page's source, and the frame's code looks like:

    <textarea class="lmt__textarea lmt__target_textarea lmt__textarea_base_style" data-gramm_editor="false" tabindex="110" dl-test="translator-target-input" lang="fr-FR" style="height: 300px;"></textarea>

And no matter how I read the text or source
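The translation is filled in by JavaScript after the page loads, so it never appears in the static source. A sketch of one way to read it with Selenium is to wait until the target textarea's value attribute is non-empty; the 15-second timeout and the use of the dl-test attribute as a CSS selector are assumptions.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    url = 'https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F'

    driver = webdriver.Chrome()
    driver.get(url)

    # The translated text lives in the textarea's value, which JavaScript fills in
    # after the page loads, so wait until it is non-empty instead of reading page_source.
    selector = 'textarea[dl-test="translator-target-input"]'
    translated = WebDriverWait(driver, 15).until(
        lambda d: d.find_element(By.CSS_SELECTOR, selector).get_attribute('value').strip()
    )

    print(translated)
    driver.quit()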

Max retries exceeded with URL Selenium [duplicate]

六眼飞鱼酱① submitted on 2020-07-09 12:05:32

Question: This question already has answers here: MaxRetryError: HTTPConnectionPool: Max retries exceeded (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused'))) (2 answers). Closed 8 months ago. I'm looking to traverse an array of URLs and open each one for web scraping with Selenium. The problem is that as soon as I hit the second browser.get(url), I get 'Max retries exceeded with URL' and 'No connection could be made because the target machine actively refused it'.
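That error typically means the WebDriver session was shut down before the second get() call, for example by calling browser.quit() or browser.close() inside the loop. A sketch of the usual pattern, reusing one driver for every URL and quitting only at the end; the urls list here is a placeholder.

    from selenium import webdriver

    urls = [
        'https://example.com/page1',  # placeholder URLs
        'https://example.com/page2',
    ]

    browser = webdriver.Chrome()  # create the driver once, outside the loop

    for url in urls:
        browser.get(url)          # reuse the same session for every URL
        print(browser.title)      # ...scrape whatever is needed here...
        # do NOT call browser.quit() or browser.close() inside the loop

    browser.quit()                # shut the session down once, after the loop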

Web Scraping Extract Javascript Table Selenium+Python

佐手、 submitted on 2020-07-07 13:07:06

Question: I've read several articles on web scraping with Selenium, but I didn't understand how to find the elements on the site. The site whose table I want to scrape is below: http://www.bmfbovespa.com.br/pt_br/servicos/market-data/cotacoes/mercado-de-derivativos/?symbol=DI1 I want to scrape the tables "TB01", "TB02", "TB03" and "TB04"; these are the ids of the tables:

    <tbody>
      <tr>
        <td id="TB01">...</td>
        <td id="TB02">...</td>
        <td id="TB03">...</td>
        <td id="TB04">...</td>
      </tr>
    </tbody>

I've tried all the find_element methods
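A sketch of one way to pick those cells up by id with Selenium, waiting for the JavaScript-rendered table before reading the text; the 20-second timeout is an assumption.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    url = ('http://www.bmfbovespa.com.br/pt_br/servicos/market-data/cotacoes/'
           'mercado-de-derivativos/?symbol=DI1')

    driver = webdriver.Chrome()
    driver.get(url)

    # The table is built by JavaScript, so wait until the first cell exists.
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, 'TB01')))

    # Read each cell by its id attribute.
    for cell_id in ('TB01', 'TB02', 'TB03', 'TB04'):
        cell = driver.find_element(By.ID, cell_id)
        print(cell_id, cell.text)

    driver.quit()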
