web-scraping

Why is the HTML returned by requests different from the real page HTML?

Submitted by 笑着哭i on 2021-01-29 07:08:25
Question: Hi friends, I'm trying to scrape a web page to get some data to work with. One of the pages I want to scrape is https://www.etoro.com/people/sparkliang/portfolio. The problem comes when I scrape the page using:

import requests
h = requests.get('https://www.etoro.com/people/sparkliang/portfolio')
h.content

It gives me completely different HTML from the original, for example adding a lot of meta tags and dropping the text and the HTML elements I am searching for. […]
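The short answer to this one: requests returns only the initial HTML document, while eToro renders the portfolio client-side with JavaScript, so the data never appears in h.content. A minimal sketch of one workaround, assuming Selenium with a local chromedriver (the wait condition is a generic placeholder, not eToro-specific):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://www.etoro.com/people/sparkliang/portfolio')

# Wait until the client-side app has injected some content into <body>.
WebDriverWait(driver, 15).until(
    lambda d: d.find_element(By.TAG_NAME, 'body').text.strip() != ''
)

soup = BeautifulSoup(driver.page_source, 'html.parser')  # rendered HTML
driver.quit()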

How to accelerate web scraping using the combination of requests and BeautifulSoup in Python?

Submitted by 末鹿安然 on 2021-01-29 06:12:00
Question: The objective is to scrape multiple pages using BeautifulSoup, whose input comes from the requests.get module. The steps are: first, load the HTML using requests:

page = requests.get('https://oatd.org/oatd/' + url_to_pass)

Then scrape the HTML content using the definition below:

def get_each_page(page_soup):
    return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

Say we have a hundred unique URLs to be scraped […]
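Since each requests.get spends most of its time waiting on the network, the standard speed-up is to fetch the pages concurrently. A sketch with a thread pool, reusing the question's oatd.org URL pattern (the list of paths is a hypothetical placeholder):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def fetch_and_parse(url_to_pass):
    page = requests.get('https://oatd.org/oatd/' + url_to_pass)
    page_soup = BeautifulSoup(page.text, 'html.parser')
    node = page_soup.find(attrs={'itemprop': 'name'})
    return dict(paper_title=node.text if node else None)

urls_to_pass = ['...']  # the ~100 unique paths from the question (elided)
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_and_parse, urls_to_pass))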

How to scrape data bypassing a radio button using requests in Python 3?

Submitted by 依然范特西╮ on 2021-01-29 05:47:05
Question: I want to scrape data from this website. After visiting it, we need to select the radio-button criterion 'TIN', then enter the TIN number '27680809621V' and click the submit button. I don't know how to do it; I'm stuck, as there is no name or value.

import requests
from bs4 import BeautifulSoup

s = requests.session()
req = s.get('https://mahagst.gov.in/en/know-your-taxpayer')
soup = BeautifulSoup(req.text, 'lxml')
dictinfo = {i['name']: i.get('value', '') for i in soup.select('input[name]')}

Someone please […]
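The generic pattern for form-driven pages like this: open the browser's Network tab, submit the form once by hand, and replay the resulting POST with the same session (which already carries the cookies). Every field name and the endpoint below are hypothetical placeholders to be read off the real request:

import requests
from bs4 import BeautifulSoup

s = requests.Session()
req = s.get('https://mahagst.gov.in/en/know-your-taxpayer')
soup = BeautifulSoup(req.text, 'lxml')

# Start from the hidden fields the page already carries...
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
# ...then add the user-visible choices (field names are HYPOTHETICAL).
payload.update({
    'criteria': 'TIN',             # assumed name of the radio group
    'tin_number': '27680809621V',  # assumed name of the text box
})
resp = s.post('https://mahagst.gov.in/en/know-your-taxpayer', data=payload)
print(resp.status_code)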

Target text after a <br> tag using cheerio

Submitted by 喜欢而已 on 2021-01-29 05:46:31
Question: I'm practicing creating an API by scraping with cheerio. I'm scraping this fairly convoluted site: http://www.vegasinsider.com/nfl/odds/las-vegas/. I'm trying to target the text after these <br> tags within the anchor tag in this <td> element:

<td class="viCellBg1 cellTextNorm cellBorderL1 center_text nowrap" width="56">
  <a class="cellTextNorm" href="/nfl/odds/las-vegas/line-movement/packers-@-bears.cfm/date/9-05-19/time/2020#BT" target="_blank">&nbsp;<br>46u-10<br>-3½ -10</a>
</td>

[…]
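The question's code is cheerio (JavaScript); to keep this page's examples in one language, the same traversal is sketched below with BeautifulSoup as a stand-in: the text after a <br> is simply that <br> node's next sibling.

from bs4 import BeautifulSoup

html = ('<td class="viCellBg1 cellTextNorm cellBorderL1 center_text nowrap" width="56">'
        '<a class="cellTextNorm" href="#BT" target="_blank">&nbsp;<br>46u-10<br>-3&#189; -10</a>'
        '</td>')

soup = BeautifulSoup(html, 'html.parser')
for br in soup.select('a.cellTextNorm br'):
    text = br.next_sibling          # the node right after this <br>
    if isinstance(text, str):       # skip the following <br> tag itself
        print(text.strip())         # prints '46u-10' then '-3½ -10'

In cheerio the idea is the same: iterate the anchor's child nodes and keep the text nodes that follow the br elements.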

How to click a link that has no text in Python

Submitted by |▌冷眼眸甩不掉的悲伤 on 2021-01-29 05:30:27
Question: I am trying to scrape wine data from vivino.com, using Selenium to automate it and scrape as much data as possible. My code looks like this:

import time
from selenium import webdriver

browser = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')
browser.get('https://www.vivino.com/explore?e=eJwFwbEOQDAUBdC_uaNoMN7NZhQLEXmqmiZaUk3x987xkVXRwLtAVcLLy7qE_tiN0Bz6FhcV7M4s0ZkkB86VUZIL9l4kmyjW4ORmbo0nTTPVDxlkGvg%3D&cart_item_source=nav-explore')  # Vivino page with 5 wines for now

[…]
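When a link renders no text, locate it by another attribute instead of its (empty) link text: an href pattern, a class, or a data attribute. A sketch with assumed selectors (inspect Vivino's DOM for the real ones):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.vivino.com/explore?...')  # URL truncated, as in the question

# Match by a fragment of the href rather than by the missing link text.
wait = WebDriverWait(browser, 10)
links = wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, 'a[href*="/w/"]')))  # assumed wine-page URL pattern
links[0].click()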

Web-scraping with xpathSApply. Getting xmlValue

Submitted by 别等时光非礼了梦想. on 2021-01-29 05:27:36
Question: For example, I want to extract the price (top right) and the space details (Accommodates: 2, Bathrooms: 1, etc.) from https://www.airbnb.com/rooms/12949270?guests=1&s=_JaPbz-J. Here is my code for the price:

remDr$navigate(url)
doc <- htmlParse(remDr$getPageSource()[[1]])
var <- remDr$findElement('id', 'details')
varxml <- htmlTreeParse(vartxt, useInternalNodes = T)
Price <- xpathApply(varxml, "//div[@class='book-it__price-amount h3 text-special pull-left']", xmlValue)

But it returns an empty list. Maybe it happened […]
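Two things stand out in the snippet: vartxt is never defined, and the price block is injected by JavaScript after the page loads, so parsing too early yields an empty list. The thread's code is R (RSelenium + XML); the same fix is sketched in Python to keep this page's examples in one language (the XPath is copied from the question and is likely stale on today's Airbnb markup):

from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.airbnb.com/rooms/12949270?guests=1')

xpath = "//div[@class='book-it__price-amount h3 text-special pull-left']"
# Wait for the dynamically injected price node before reading the DOM.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.XPATH, xpath)))

doc = html.fromstring(driver.page_source)
print([node.text_content() for node in doc.xpath(xpath)])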

Click event does nothing when triggered

Submitted by 强颜欢笑 on 2021-01-29 03:20:44
Question: When I trigger a .click() event in non-headless mode in puppeteer, nothing happens, not even an error (non-headless mode so I could visually monitor what is being clicked).

const scraper = {
  test: async () => {
    let browser, page;
    try {
      browser = await puppeteer.launch({
        headless: false,
        args: ["--no-sandbox", "--disable-setuid-sandbox"]
      });
      page = await browser.newPage();
    } catch (err) {
      console.log(err);
    }
    try {
      await page.goto("https://www.betking.com/sports/s/eventOdds/1-840-841-0-0,1 […]
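A silent no-op click usually means the target node was not yet attached or visible when .click() fired, so the usual guard is waiting for the selector first. The question's code is puppeteer (JavaScript); the same guard is sketched below with pyppeteer, the Python port whose API mirrors puppeteer's (the selector is a placeholder):

import asyncio
from pyppeteer import launch

async def test():
    browser = await launch(headless=False,
                           args=['--no-sandbox', '--disable-setuid-sandbox'])
    page = await browser.newPage()
    await page.goto('https://www.betking.com/sports')
    # Wait until the node is attached AND rendered before clicking it.
    await page.waitForSelector('.event-odds button', {'visible': True})
    await page.click('.event-odds button')  # placeholder selector
    await browser.close()

asyncio.get_event_loop().run_until_complete(test())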

Return the HTML of a dynamic page using Selenium

Submitted by 偶尔善良 on 2021-01-29 03:10:16
Question: I'm trying to crawl this website; the problem is that it's dynamically loaded. Basically, I want what I can see in the browser console (the live DOM), not what I see via right click > show source. I've tried some Selenium examples but I can't get what I need; the code below uses Selenium and gets only what you get via right click > show source. How can I get the content of the loaded page?

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from […]
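driver.page_source reflects the current DOM, not the original server response, so the fix is simply to wait until the dynamic content has arrived before reading it. A sketch, with a placeholder URL and selector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com/dynamic-page')  # placeholder URL

# Block until the JS-injected content exists (selector is assumed).
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#content .loaded')))

rendered_html = driver.page_source  # now includes the injected markup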

How to iterate through a supermarket website and get the product names and prices?

Submitted by 会有一股神秘感。 on 2021-01-29 02:16:07
Question: I'm trying to obtain all the product names and prices from all the categories of a supermarket website. All the tutorials I have found do it for just one const url; I need to iterate through all of them. So far I have got this:

const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const [el2] = await page.$x('//*[@id="product-nonfood-page"]/main/div/div/div[1]/div[1] […]
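The iteration itself is the easy part: launch the browser once, then loop the category URLs on a single page object instead of relaunching per product. Sketched with pyppeteer (the Python port of puppeteer) to keep one language across this page; every URL and selector below is a hypothetical placeholder:

import asyncio
from pyppeteer import launch

CATEGORY_URLS = [
    'https://example-supermarket.com/category/dairy',   # placeholders
    'https://example-supermarket.com/category/bakery',
]

async def scrape_all():
    browser = await launch()          # one browser for the whole run
    page = await browser.newPage()
    results = []
    for url in CATEGORY_URLS:
        await page.goto(url)
        nodes = await page.xpath('//h3[@class="product-name"]')  # assumed
        for el in nodes:
            text = await page.evaluate('(el) => el.textContent', el)
            results.append(text.strip())
    await browser.close()
    return results

print(asyncio.get_event_loop().run_until_complete(scrape_all()))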

Moving to the next page for scraping using BeautifulSoup

Submitted by 为君一笑 on 2021-01-29 00:49:26
Question: I am unable to automate the following code to go to the next page and scrape data from Indeed.com. Please let me know how to handle this issue.

import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

URL = "https://www.indeed.com/jobs?q=Amazon&l="

# Get the html info of the page
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")

# Get the job title
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs […]
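Indeed's result pages have commonly been paginated with a start query parameter that advances in steps of 10, so "next page" is just another GET. A sketch under that assumption (verify the step against the live site, and refine the tag filter to the real markup):

import requests
from bs4 import BeautifulSoup

BASE = 'https://www.indeed.com/jobs?q=Amazon&l='
titles = []

for start in range(0, 50, 10):          # first five pages, assumed step
    page = requests.get(f'{BASE}&start={start}')
    soup = BeautifulSoup(page.text, 'html.parser')
    for h2 in soup.find_all('h2'):      # generic filter; refine as needed
        titles.append(h2.get_text(strip=True))

print(len(titles))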