web-scraping

Moving to next page for scraping using BeautifulSoup

三世轮回 submitted on 2021-01-29 00:40:49

Question: I am unable to automate the following code to go to the next page and scrape data from Indeed.com. Please let me know how to handle this issue.

    import requests
    import bs4
    from bs4 import BeautifulSoup
    import pandas as pd
    import time

    URL = "https://www.indeed.com/jobs?q=Amazon&l="

    # Get the html info of the page
    page = requests.get(URL)
    soup = BeautifulSoup(page.text, "html.parser")

    # Get the job title
    def extract_job_title_from_result(soup):
        jobs = []
        for div in soup.find_all(name="div", attrs […]
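Indeed result pages are commonly paged with a "start" query parameter that advances in steps of 10, so one way to reach the next page is to loop over that parameter. A minimal sketch, assuming that URL scheme still holds; the card and title selectors are assumptions to verify against the live page:

    import requests
    from bs4 import BeautifulSoup

    base_url = "https://www.indeed.com/jobs?q=Amazon&l="

    for start in range(0, 50, 10):  # first five result pages
        page = requests.get(f"{base_url}&start={start}")
        soup = BeautifulSoup(page.text, "html.parser")
        # Class name below is illustrative; inspect the page to confirm it
        for div in soup.find_all("div", attrs={"class": "jobsearch-SerpJobCard"}):
            title = div.find("a", attrs={"data-tn-element": "jobTitle"})
            if title:
                print(title.get("title"))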

Scrape a website that requires login with BeautifulSoup

走远了吗. submitted on 2021-01-28 21:53:41

Question: I want to scrape a website that requires login, using Python with the BeautifulSoup and requests libraries (no Selenium). This is my code:

    import requests
    from bs4 import BeautifulSoup

    auth = (username, password)
    headers = {
        'authority': 'signon.springer.com',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'origin': 'https://signon.springer.com',
        'content-type': 'application/x-www-form-urlencoded',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, […]
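For form-based logins the usual requests pattern is a Session, which persists the login cookies across subsequent requests. A minimal sketch; the login URL and form field names here are assumptions, so read the real ones off the site's login form in the browser's devtools:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()

    # Hypothetical endpoint and field names; inspect the actual login form
    login_url = "https://signon.springer.com/login"
    payload = {"username": "your_username", "password": "your_password"}

    response = session.post(login_url, data=payload)
    response.raise_for_status()

    # The session now carries the auth cookies into later requests
    protected = session.get("https://link.springer.com/")
    soup = BeautifulSoup(protected.text, "html.parser")
    print(soup.title)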

Scrapy: extract text with special characters

感情迁移 submitted on 2021-01-28 19:33:48

Question: I'm using Scrapy to extract text from some Spanish websites. The text is written in Spanish, and some words have special characters like 'ñ' or 'í'. My problem is that when I run scrapy crawl econoticia -o prueba.json on the command line to get the file with the scraped data, some characters are not shown properly. For example, this is the original text: "La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos". And this is the text scraped […]
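Scrapy's JSON feed exporter escapes non-ASCII characters by default, which is the usual reason 'ñ' comes out as \u00f1 in the output file. Setting the feed export encoding in the project's settings.py keeps the accented characters readable:

    # In the Scrapy project's settings.py
    FEED_EXPORT_ENCODING = "utf-8"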

Issue with scraping Understat chart data using Selenium

懵懂的女人 submitted on 2021-01-28 18:54:25

Question: I'm trying to scrape the chart data under the 'Timing Sheet' tab at https://understat.com/match/9457. My approach is to use BeautifulSoup and Selenium, but I can't seem to get it to work. Here is my Python script:

    from bs4 import BeautifulSoup
    import requests

    # Set the url we want
    xg_url = 'https://understat.com/match/9457'

    # Use requests to download the webpage
    xg_data = requests.get(xg_url)

    # Get the html code for the webpage
    xg_html = xg_data.content

    # Parse the html using bs4
    soup = BeautifulSoup […]
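The chart is rendered by JavaScript, so plain requests never sees it; the page has to be loaded in a real browser before parsing. A minimal Selenium sketch, where the tab locator is an assumption to verify in devtools:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://understat.com/match/9457")

    # Click the 'Timing Sheet' tab once it is clickable (locator is illustrative)
    wait = WebDriverWait(driver, 10)
    wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "TIMING SHEET"))).click()

    # Hand the fully rendered HTML to BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()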

Table Web Scraping Issues with Python

吃可爱长大的小学妹 submitted on 2021-01-28 18:24:34

Question: I am having issues scraping data from this website: https://fantasy.premierleague.com/player-list. I am interested in getting access to the players' names and points from the different tables. I'm relatively new to Python and completely new to web scraping. Here is what I have so far:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = 'https://fantasy.premierleague.com/player-list'
    html = urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    rows = soup.find_all('tr')
    print(rows)
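The player-list page is rendered client-side, so the raw HTML that urlopen downloads contains no table rows. The data usually comes from the site's JSON API instead; the endpoint and field names below are the commonly cited ones, but verify them in the browser's network tab:

    import requests

    url = "https://fantasy.premierleague.com/api/bootstrap-static/"
    data = requests.get(url).json()

    # Each entry in "elements" is one player record
    for player in data["elements"][:10]:
        print(player["web_name"], player["total_points"])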

How can I host a backend service powered by web scraping using selenium web driver?

时光怂恿深爱的人放手 submitted on 2021-01-28 18:15:54

Question: I am developing a project that scrapes a website and delivers data to users, using Selenium WebDriver with Python/Flask. I was originally going to use BeautifulSoup, but the website I am scraping requires some interaction on the page. I have the scraper working; I am just trying to figure out how to make it work if I host this service on a platform such as Heroku. Currently Selenium is opening a Chrome browser and […]
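Hosted platforms have no display attached, so the usual approach is to run Chrome headless. A minimal sketch; the environment variable for the Chrome binary is a Heroku-buildpack convention and an assumption here:

    import os
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    # Set by the Chrome buildpack on Heroku (assumption; adjust per platform)
    chrome_bin = os.environ.get("GOOGLE_CHROME_BIN")
    if chrome_bin:
        options.binary_location = chrome_bin

    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")
    print(driver.title)
    driver.quit()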

VBA: Extract HTML from new page (same url)

蓝咒 submitted on 2021-01-28 17:43:36

Question: I need to feed inputs to this web page http://kepler.sos.ca.gov/ and then collect information once I click the submit button. The first part (inputs + click submit) runs smoothly:

    Dim ie As InternetExplorer
    'to refer to the HTML document returned
    Dim html As HTMLDocument

    'open Internet Explorer in memory, and go to website
    Set ie = New InternetExplorer
    ie.Visible = True
    ie.navigate "http://kepler.sos.ca.gov/"

    'Wait until IE is done loading page
    Do While ie.readyState <> READYSTATE_COMPLETE […]

How to Specify different Process settings for two different spiders in CrawlerProcess Scrapy?

房东的猫 submitted on 2021-01-28 16:42:05

Question: I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Though two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider completes crawling before the second one, I get the desired […]
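One common way to give each spider its own feed is the per-spider custom_settings attribute, which overrides the project settings for that spider only; a single process.start() then blocks until every queued crawl has finished. A minimal sketch with illustrative spider names and output files:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SpiderOne(scrapy.Spider):
        name = "spider_one"
        start_urls = ["https://example.com/a"]
        custom_settings = {"FEED_URI": "output_one.json", "FEED_FORMAT": "json"}

        def parse(self, response):
            yield {"url": response.url}

    class SpiderTwo(scrapy.Spider):
        name = "spider_two"
        start_urls = ["https://example.com/b"]
        custom_settings = {"FEED_URI": "output_two.json", "FEED_FORMAT": "json"}

        def parse(self, response):
            yield {"url": response.url}

    process = CrawlerProcess()
    process.crawl(SpiderOne)
    process.crawl(SpiderTwo)
    process.start()  # blocks until both spiders have finished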

AttributeError: 'ResultSet' object has no attribute 'find_all' Beautifulsoup

那年仲夏 submitted on 2021-01-28 14:29:56

Question: I don't understand why I get this error. I have a fairly simple function:

    def scrape_a(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        news = soup.find_all("div", attrs={"class": "news"})
        for links in news:
            link = news.find_all("href")
        return link

Here is the structure of the webpage I am trying to scrape:

    <div class="news">
        <a href="www.link.com">
            <h2 class="heading"> heading </h2>
            <div class="teaserImg">
                <img alt="" border="0" height="124" src="/image">
            </div>
            <p> text </p>
        </a>
    < […]
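The error comes from calling find_all on news, which is a ResultSet (a list of tags), instead of on each individual tag in the loop; also, href is an attribute rather than a tag, so it has to be read off the <a> elements. A corrected sketch under those assumptions:

    import requests
    from bs4 import BeautifulSoup

    def scrape_a(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        links = []
        for div in soup.find_all("div", attrs={"class": "news"}):
            # href is an attribute of <a>, not a tag of its own
            for a in div.find_all("a", href=True):
                links.append(a["href"])
        return links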