web-scraping

<tbody> tag displays in chrome but not source

亡梦爱人 submitted on 2021-02-05 08:08:11

Question: In doing some scraping work I keep encountering the <tbody> tag in the Chrome DevTools inspector, but it doesn't appear in the page source. For what I hope are obvious reasons, I find this super confusing. What's going on here? (I should also add that the HTML on this page is pretty malformed.) For example, DevTools shows:

    <table>
      <tbody>
        <tr valign="top">
          <td>...</td>

Page source shows:

    <table border="0">
      <tr valign="top">
        <td>

Answer 1: The start tag for <tbody> is optional. That is, you can leave it out.
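A minimal sketch of the behaviour the answer describes: Python's parsers keep the served markup as-is, so the <tbody> that Chrome inserts never appears, and a selector that doesn't depend on <tbody> matches either way. (The HTML snippet is taken from the question.)

```python
from bs4 import BeautifulSoup

# html.parser (and lxml) parse the source as served: no <tbody> in the
# markup means no <tbody> in the tree. Chrome adds it while building the DOM.
served = '<table border="0"><tr valign="top"><td>cell</td></tr></table>'
soup = BeautifulSoup(served, "html.parser")

print(soup.find("tbody"))        # None: nothing was inserted
rows = soup.select("table tr")   # descendant selector works with or without tbody
print(len(rows))                 # 1
```

If you need the browser's view exactly, the html5lib parser (`BeautifulSoup(served, "html5lib")`) inserts the implied <tbody> the same way Chrome does.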

How can I access this type of site using requests? [duplicate]

落爺英雄遲暮 submitted on 2021-02-05 08:07:44

Question: This question already has answers here: Scraper in Python gives "Access Denied" (3 answers). Closed 8 months ago. This is the first time I've encountered a site that wouldn't 'allow me access' to the webpage. I'm not sure why, and I can't figure out how to scrape this website. My attempt:

    import requests
    from bs4 import BeautifulSoup

    def html(url):
        return BeautifulSoup(requests.get(url).content, "lxml")

    url = "https://www.g2a.com/"
    soup = html(url)
    print(soup.prettify())

Output:
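A common cause of "Access Denied" here is that the default `requests` User-Agent identifies the client as a script. A hedged sketch of the usual first fix — sending browser-like headers (the header values below are illustrative, and some sites need more than this, e.g. cookies or JavaScript):

```python
import requests

# Build a session whose headers look like a real browser's.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
session = requests.Session()
session.headers.update(headers)

# response = session.get("https://www.g2a.com/")  # uncomment to actually fetch
print(session.headers["User-Agent"])
```

If headers alone don't help, the page is likely protected by a bot-detection service and a browser-driven approach (Selenium) may be needed.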

How to extract table from website using python

青春壹個敷衍的年華 submitted on 2021-02-05 08:02:48

Question: I have been trying to extract a table from a website but I am lost. Can anyone help me? My goal is to extract the table on the scope page: https://training.gov.au/Organisation/Details/31102

    import requests
    from bs4 import BeautifulSoup

    url = "https://training.gov.au/Organisation/Details/31102"
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, 'lxml')
    table = soup.find(id="ScopeQualification")
    [row.text.split() for row in table.find_all("tr")]

Answer 1: find OrganisationId
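Once the table element is located, handing it to pandas is usually cleaner than splitting row text by hand. A sketch using a literal stand-in for the page (the real page requires a request; the row values below are illustrative, not taken from the site):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for response.text: a table with the id used in the question.
html = """
<table id="ScopeQualification">
  <tr><th>Code</th><th>Title</th></tr>
  <tr><td>ABC123</td><td>Example qualification</td></tr>
</table>
"""
table = BeautifulSoup(html, "lxml").find(id="ScopeQualification")

# read_html returns a list of DataFrames; th cells become the header row.
df = pd.read_html(str(table))[0]
print(df)
```

This keeps multi-word cell values intact, which `row.text.split()` would break apart.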

PYTHON: How do I use BeautifulSoup to parse a table into a pandas dataframe

本小妞迷上赌 submitted on 2021-02-05 08:01:07

Question: I am trying to scrape the CDC website for the data on COVID-19 cases reported in the last 7 days: https://covid.cdc.gov/covid-data-tracker/#cases_casesinlast7days I've tried to find the table by name, id, and class, and it always comes back as None. When I print the scraped data, I can't manually locate the table in the HTML either. Not sure what I'm doing wrong here. Once the data is imported, I need to populate a pandas dataframe to later use for graphing purposes, and export the data table
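The symptom described — the table is invisible even in the raw scraped HTML — is the signature of a JavaScript-rendered page: `requests` receives only the static shell, and the table is injected client-side afterwards. A sketch of what BeautifulSoup actually sees (the shell below is an illustrative stand-in, not the CDC page's real markup):

```python
from bs4 import BeautifulSoup

# What a JS-heavy page typically serves to requests: an empty app container.
# The data table only exists after the browser runs the page's JavaScript.
static_html = '<html><body><div id="app"></div></body></html>'
soup = BeautifulSoup(static_html, "lxml")

print(soup.find("table"))  # None — which is why every find() attempt fails
```

The usual ways forward are to render the page with a browser driver (Selenium) and parse `driver.page_source`, or to open DevTools' Network tab and request the JSON endpoint the page itself calls, which feeds straight into a pandas dataframe.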

Python selenium to extract elements with xpath and for loop

自古美人都是妖i submitted on 2021-02-05 07:49:06

Question: I am using Python/Selenium to extract some text from a website to sort further in Google Sheets. There are 15 headers for which I need to extract text; the text is found under each header in an h5 tag. Here's an extract of one header:

    <tr class="dayHeader">
      <td colspan="7" style="padding:10px 0;">
        <hr>
        <h5> Tuesday - 02 February 2021</h5>
      </td>
    </tr>

What I have done is the following:

    headers = driver.find_elements_by_tag_name('h5')
    results = []
    for header in headers:
        result = header.text
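The loop above computes each header's text but never stores it. A completed version of the same collection pattern, run here against a literal copy of the question's markup via BeautifulSoup so it is self-contained (with Selenium, the equivalent is `results.append(header.text)` inside the loop):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source: two headers shaped like the question's HTML.
page = """
<table>
  <tr class="dayHeader"><td colspan="7"><hr><h5> Tuesday - 02 February 2021</h5></td></tr>
  <tr class="dayHeader"><td colspan="7"><hr><h5> Wednesday - 03 February 2021</h5></td></tr>
</table>
"""
soup = BeautifulSoup(page, "lxml")

results = []
for header in soup.find_all("h5"):        # mirrors find_elements_by_tag_name('h5')
    results.append(header.get_text(strip=True))  # append, don't just assign

print(results)
```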

Headless Chrome Driver not working for Selenium

核能气质少年 submitted on 2021-02-05 07:27:06

Question: I am currently having an issue with my scraper when I set options.add_argument("--headless"). However, it works perfectly fine when that argument is removed. Could anyone advise how I can achieve the same results in headless mode? Below is my Python code:

    from seleniumwire import webdriver as wireDriver
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver

How to extract img src from web page via lxml in beautifulsoup using python?

空扰寡人 submitted on 2021-02-05 06:44:45

Question: I am new to Python and I am working on a web-scraping project for Amazon. I have a problem with extracting the product image src from a product page via lxml using BeautifulSoup. I tried the following code, but it doesn't show the URL of the image. Here is my code:

    import requests
    from bs4 import BeautifulSoup
    import re

    url = 'https://www.amazon.com/crocs-Unisex-Classic-Black-Women/dp/B0014C0LSY/ref=sr_1_2?_encoding=UTF8&qid=1560091629&s=fashion-womens-intl-ship&sr=1-2&th=1&psc=1'
    r
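Once the right <img> element is found, the URL lives in its attributes rather than its text. A sketch against a literal stand-in for the product page (the `landingImage` id and the URL below are illustrative assumptions about the page's markup, which Amazon changes often and partly fills in with JavaScript):

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched product page: the main image with its src attribute.
html = '<img id="landingImage" src="https://m.media-amazon.com/images/I/example.jpg" alt="crocs">'
soup = BeautifulSoup(html, "lxml")

img = soup.find("img", id="landingImage")  # locate by id, then read attributes
print(img["src"])
```

If `src` holds only a placeholder, the full-size URLs are often in other attributes of the same tag (e.g. a JSON-valued attribute), so printing `img.attrs` is a useful next step.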

CSS selector QuerySelector alternative

醉酒当歌 submitted on 2021-02-04 21:41:04

Question: I have searched a lot for material on how to get metadata using XMLHTTP, and I think it is impossible with the early-binding approach. The only approach that works is late binding, via CreateObject("HTMLFile"), and dealing with that HTML document object. The disadvantage of this approach is that it doesn't support QuerySelector or QuerySelectorAll. Now I am trying to find an alternative to this CSS selector, without using the

python requests.get() returns an empty string

↘锁芯ラ submitted on 2021-02-04 21:34:21

Question: When I run the code below, it returns an empty string:

    url = 'http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=5&columns[0][data]=box_art&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=title&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=year&columns[2]
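The `draw=…&columns[…]` parameters mark this as the AJAX backend of a DataTables widget, and such endpoints often return an empty body unless the request resembles the browser's own XHR call. A hedged sketch of the usual fix — marking the request as XHR and sending a Referer (header values illustrative; the actual fetch is left commented out):

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",              # avoid the default python-requests UA
    "X-Requested-With": "XMLHttpRequest",     # how the browser's AJAX call is marked
    "Referer": "http://www.allflicks.net/",   # some endpoints check the origin page
})

# response = session.get(url)   # url as in the question
# data = response.json()        # DataTables endpoints usually answer in JSON
print(sorted(k for k in session.headers if k.startswith("X-")))
```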