web-scraping

Scraping string from a large number of URLs with Julia

我的未来我决定 submitted on 2021-01-24 06:57:00
Question: Happy New Year! I have just started learning Julia, and the first mini challenge I have set myself is to scrape data from a large list of URLs. I have about 50k URLs in a CSV file (which I successfully parsed from a JSON file with Julia using a regex). I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer). I managed to do so using HTTP and Queryverse (I had started with CSV and CSVFiles, but am trying out other packages for learning purposes) but the script
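The matching step described above can be sketched as follows. This is a minimal illustration in Python (the language most questions on this page use) rather than Julia, and it covers only the pattern match, not the HTTP requests. The name `find_page_view` and the sample HTML are hypothetical; only the "/page/12345/view" shape comes from the question.

```python
import re

# Matches "/page/<integer>/view", the shape described in the question.
PAGE_VIEW = re.compile(r"/page/\d+/view")

def find_page_view(html):
    """Return the first "/page/<integer>/view" substring, or None if absent."""
    match = PAGE_VIEW.search(html)
    return match.group(0) if match else None

# Hypothetical page content, used only to exercise the function.
sample_html = '<a href="/page/12345/view">Open</a>'
print(find_page_view(sample_html))
print(find_page_view("<p>nothing here</p>"))
```

Returning None for pages without a match keeps a long batch run from stopping on the first URL that lacks the string.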

Python's requests triggers Cloudflare's security while urllib does not

扶醉桌前 submitted on 2021-01-21 17:36:29
Question: I'm working on an automated web scraper for a restaurant website, but I'm having an issue. The website uses Cloudflare's anti-bot security, which I would like to bypass: not the Under-Attack-Mode, but a captcha test that only triggers when it detects a non-American IP or a bot. I'm trying to bypass it because Cloudflare's security doesn't trigger when I clear cookies, disable JavaScript, or use an American proxy. Knowing this, I tried using Python's requests library as such: import

How to scrape json data from an interactive chart?

爱⌒轻易说出口 submitted on 2021-01-21 05:48:28
Question: I want to scrape data from a specific section of a website. I inspected the elements of that section and noticed that it is rendered within a canvas tag. However, I also checked the website's source code and found that the data is embedded there in a format I'm not familiar with. Here's a sample of that data: JSON.parse('\x5B\x7B\x22id\x22\x3A\x2232522\x22,\x22minute\x22\x3A\x2222\x22,\x22result\x22\x3A\x22MissedShots\x22,
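The sample above looks like ordinary JSON in which punctuation has been written as JavaScript `\xNN` hex escapes (`\x5B` is `[`, `\x7B` is `{`, `\x22` is `"`, `\x3A` is `:`). One possible way to decode it in Python is sketched below. The function name `decode_js_hex_json` and the variable name `shots` are hypothetical, and the sample string is shortened and closed by hand because the original excerpt is truncated.

```python
import json
import re

def decode_js_hex_json(js_source):
    """Find JSON.parse('...') in JavaScript source and return the decoded data."""
    match = re.search(r"JSON\.parse\('(.*?)'\)", js_source)
    if match is None:
        return None
    escaped = match.group(1)
    # Replace each \xNN escape with the character it encodes.
    unescaped = re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        escaped,
    )
    return json.loads(unescaped)

# Shortened sample; the closing \x7D\x5D is an assumption, not from the page.
sample = r"var shots = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2232522\x22,\x22minute\x22\x3A\x2222\x22\x7D\x5D');"
shots = decode_js_hex_json(sample)
print(shots)
```

Substituting the escapes with a regular expression, instead of decoding the whole string with a codec, leaves any non-ASCII characters in the page untouched.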

How to scrape json data from an interactive chart?

天大地大妈咪最大 submitted on 2021-01-21 05:47:07
Question: I want to scrape data from a specific section of a website. I inspected the elements of that section and noticed that it is rendered within a canvas tag. However, I also checked the website's source code and found that the data is embedded there in a format I'm not familiar with. Here's a sample of that data: JSON.parse('\x5B\x7B\x22id\x22\x3A\x2232522\x22,\x22minute\x22\x3A\x2222\x22,\x22result\x22\x3A\x22MissedShots\x22,

Using selenium to retrieve data from webpage - not retrieving all data

醉酒当歌 submitted on 2021-01-20 11:28:21
Question: I am trying to retrieve data (coin name, price, market cap and circulating supply) from coinmarketcap.com, but when I run the code below I only get 11 coin names. I am also unable to retrieve the other data. I have tried several options, but none was successful. My goal is to store the data in a dataframe so I can analyze it. driver = webdriver.Chrome(r'C:\Users\Ejer\PycharmProjects\pythonProject\chromedriver') driver.get('https://coinmarketcap.com/') Crypto = driver.find_elements_by_xpath("/

How can I bypass a cookie agreement page while web scraping using Python?

陌路散爱 submitted on 2021-01-20 07:24:08
Question: I have run into a cookie agreement page. What I am doing: import requests url = "https://stockhouse.com/community/bullboards/" r = requests.get(url) soup = BeautifulSoup(r.content, "html.parser") print(soup) which returns the HTML of a cookie agreement page. What I am looking for is a way to bypass this page and scrape the content of the actual page once the cookies are accepted. I tried the code from this question: cookies = dict(BCPermissionLevel='PERSONAL') html = requests.get(website,

'NoneType' object has no attribute 'text' in BeautifulSoup

给你一囗甜甜゛ submitted on 2021-01-18 19:27:35
Question: I am trying to scrape Google's results for the search "What is 2+2", but the following code raises 'NoneType' object has no attribute 'text'. Please help me achieve this. text="What is 2+2" search=text.replace(" ","+") link="https://www.google.com/search?q="+search headers={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'} source=requests.get(link,headers=headers).text soup=BeautifulSoup(source,
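The error quoted above typically appears when BeautifulSoup's `find()` matches nothing and returns None, and `.text` is then read from that None. The excerpt is cut off before the lookup, so the selector is unknown. The sketch below uses a hypothetical `div` with class `answer` and a hypothetical helper, `first_text`, only to show the guard. It assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

def first_text(soup, name, class_name, default=None):
    """Return the stripped text of the first matching tag, or `default`."""
    node = soup.find(name, class_=class_name)
    # find() returns None when nothing matches, so check before reading text.
    if node is None:
        return default
    return node.get_text(strip=True)

# Hypothetical markup standing in for the downloaded page.
html = '<div class="answer">4</div>'
soup = BeautifulSoup(html, "html.parser")
print(first_text(soup, "div", "answer"))
print(first_text(soup, "div", "missing", default="not found"))
```

If the guard reports that nothing matched, it can help to print the downloaded HTML and check whether the expected element is present in it at all.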

'NoneType' object has no attribute 'text' in BeautifulSoup

放肆的年华 submitted on 2021-01-18 19:21:28
Question: I am trying to scrape Google's results for the search "What is 2+2", but the following code raises 'NoneType' object has no attribute 'text'. Please help me achieve this. text="What is 2+2" search=text.replace(" ","+") link="https://www.google.com/search?q="+search headers={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'} source=requests.get(link,headers=headers).text soup=BeautifulSoup(source,

'NoneType' object has no attribute 'text' in BeautifulSoup

情到浓时终转凉″ submitted on 2021-01-18 19:19:36
Question: I am trying to scrape Google's results for the search "What is 2+2", but the following code raises 'NoneType' object has no attribute 'text'. Please help me achieve this. text="What is 2+2" search=text.replace(" ","+") link="https://www.google.com/search?q="+search headers={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'} source=requests.get(link,headers=headers).text soup=BeautifulSoup(source,

'NoneType' object has no attribute 'text' in BeautifulSoup

China☆狼群 submitted on 2021-01-18 19:17:21
Question: I am trying to scrape Google's results for the search "What is 2+2", but the following code raises 'NoneType' object has no attribute 'text'. Please help me achieve this. text="What is 2+2" search=text.replace(" ","+") link="https://www.google.com/search?q="+search headers={'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'} source=requests.get(link,headers=headers).text soup=BeautifulSoup(source,