beautifulsoup

How to select some URLs with BeautifulSoup?

扶醉桌前 submitted on 2020-01-01 18:48:25
Question: I want to scrape the following information, except the last row and the row with class="Region": ... <td>7</td> <td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td> <td bgcolor="" align="left">New York</td> <td bgcolor="" align="left" class="Region">N/A</td> <td bgcolor="" align="left">1,863</td> <td bgcolor="" align="left">565</td> <td bgcolor="" align="left">1,133</td> <td bgcolor="" align="left">$160,000</td> <td bgcolor=""
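The excerpt is truncated, but one common way to do this with BeautifulSoup is to filter cells by their class attribute. A minimal sketch, run against a reconstruction of the table row (the HTML string below is an assumption based on the excerpt, not the full page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the <td> structure quoted in the question.
html = """
<table>
  <tr>
    <td>7</td>
    <td><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
    <td>New York</td>
    <td class="Region">N/A</td>
    <td>1,863</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")
# Keep every cell except those carrying class="Region";
# td.get("class") returns a list of classes, or None if the attribute is absent.
cells = [td.get_text(strip=True) for td in row.find_all("td")
         if "Region" not in (td.get("class") or [])]
print(cells)  # ['7', 'White and Case', 'New York', '1,863']
```

Skipping the last row would then just be a matter of slicing the row list, e.g. `soup.find_all("tr")[:-1]`.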

Web scraping results in 403 Forbidden Error

半城伤御伤魂 submitted on 2020-01-01 15:36:32
Question: I'm trying to scrape the earnings for each company off SeekingAlpha using BeautifulSoup, but it seems the site is detecting that a web scraper is being used: I get "HTTP Error 403: Forbidden". The page I'm attempting to scrape is: https://seekingalpha.com/symbol/AMAT/earnings Does anyone know what can be done to bypass this? Answer 1: I was able to access the site contents by using a proxy, found here: https://free-proxy-list.net/ Then, creating a payload using the requests
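The answer is cut off, but the usual pattern it describes is to send a browser-like User-Agent header and optionally route the request through a proxy. A hedged sketch (the proxy address is a placeholder, not a real server, and no request is actually sent here):

```python
import requests

# Many sites return 403 to requests' default User-Agent string;
# a browser-like header is often enough on its own.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Placeholder proxy; substitute one from a list such as free-proxy-list.net.
proxies = {"http": "http://203.0.113.1:8080",
           "https": "http://203.0.113.1:8080"}

def fetch(url):
    # Drop proxies= if the custom header alone gets past the block.
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```

Note that scraping a site that actively blocks bots may violate its terms of service; check before relying on this.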

Python Web Scraping (Beautiful Soup, Selenium and PhantomJS): Only scraping part of full page

左心房为你撑大大i submitted on 2020-01-01 11:51:26
Question: Hello, I am having trouble trying to scrape data from a website for modeling purposes (fantsylabs dotcom). I'm just a hack, so forgive my ignorance of comp-sci lingo. What I'm trying to accomplish is: use Selenium to log in to the website and navigate to the page with the data. ## Initialize and load the web page url = "website url" driver = webdriver.Firefox() driver.get(url) time.sleep(3) ## Fill out forms and login to site username = driver.find_element_by_name('input') password = driver.find

Scrape Multiple URLs using Beautiful Soup

那年仲夏 submitted on 2019-12-31 17:59:31
Question: I'm trying to extract specific classes from multiple URLs. The tags and classes stay the same, but I need my Python program to scrape them all as I input each link. Here's a sample of my work: from bs4 import BeautifulSoup import requests import pprint import re import pyperclip url = input('insert URL here: ') #scrape elements response = requests.get(url) soup = BeautifulSoup(response.content, "html.parser") #print titles only h1 = soup.find("h1", class_= "class-headline") print(h1.get_text())
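The natural extension of the asker's code is to separate the parsing step into a function and loop over a list of URLs. A sketch along those lines (the URLs are placeholders; the network loop is left commented out so the parsing logic stands on its own):

```python
from bs4 import BeautifulSoup
import requests

def scrape_headline(html):
    # Pull the headline out of one page's HTML, reusing the
    # selector from the question; returns None if it is absent.
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1", class_="class-headline")
    return h1.get_text(strip=True) if h1 else None

urls = ["https://example.com/a", "https://example.com/b"]  # placeholders
# for url in urls:
#     print(scrape_headline(requests.get(url).text))
```

Reading the URL list from a file or from repeated `input()` calls works the same way; only the source of `urls` changes.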

How to get HTML from a Beautiful Soup object

∥☆過路亽.° submitted on 2019-12-31 08:26:46
Question: I have the following bs4 object listing: >>> listing <div class="listingHeader"> <h2> .... >>> type(listing) <class 'bs4.element.Tag'> I want to extract the raw HTML as a string. I've tried: >>> a = listing.contents >>> type(a) <type 'list'> So this does not work. How can I do this? Answer 1: Just get the string representation: html_content = str(listing) This is a non-prettified version. If you want a prettified one, use the prettify() method: html_content = listing.prettify() Source: https:/
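The answer in full, as a runnable example (the `<div class="listingHeader">` markup is a minimal stand-in for the asker's tag):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="listingHeader"><h2>Title</h2></div>',
                     "html.parser")
listing = soup.find("div")      # a bs4.element.Tag, as in the question

html_content = str(listing)     # compact raw HTML, all on one line
pretty = listing.prettify()     # indented, one tag per line
```

`listing.contents` fails for this purpose because it returns a list of child nodes, not markup; `str()` serializes the tag itself, children included.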

Signing in to Google using requests and going to YouTube

吃可爱长大的小学妹 submitted on 2019-12-31 06:02:07
Question: I'm trying to log in to Gmail using just requests and then proceed to watch YouTube, do some searches, etc. I don't want to use Selenium or any alternative to it, as I find it bulky and inconvenient. While researching how to do this I came across some answers here and based my code on them. However, those solutions are from a couple of years back, and I don't know whether they still apply or will work for my purpose. class SessionGoogle: def __init__(self, url

Web crawler to extract from list elements

我们两清 submitted on 2019-12-31 05:38:12
Question: I am trying to extract the dates from <li> tags and store them in an Excel file. <li>January 13, 1991: At least 40 people <a href ="......."> </a> </li> Code: import urllib2 import os from datetime import datetime import re os.environ["LANG"]="en_US.UTF-8" from bs4 import BeautifulSoup page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes") soup = BeautifulSoup(page1) li = soup.find_all("li") count = 0 while count < len(li): soup = BeautifulSoup(li[count]) date_string,
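The asker's loop re-parses each `<li>` tag with a fresh BeautifulSoup object, which is unnecessary: the tags returned by `find_all` can be read directly. A sketch of the extraction step, run against a reconstruction of the list markup (the Wikipedia page itself isn't fetched here, and the Excel-writing step is out of scope):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the <li> structure in the question.
html = """
<ul>
  <li>January 13, 1991: At least 40 people <a href="#">died</a></li>
  <li>See also</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# "Month DD, YYYY:" at the start of the item text.
date_re = re.compile(r"^([A-Z][a-z]+ \d{1,2}, \d{4}):")
dates = []
for li in soup.find_all("li"):        # no need to re-parse each tag
    m = date_re.match(li.get_text(strip=True))
    if m:
        dates.append(m.group(1))
print(dates)  # ['January 13, 1991']
```

The regular expression also filters out navigation items like "See also" that match the `<li>` selector but carry no date.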

Logging in to a website using requests

青春壹個敷衍的年華 submitted on 2019-12-31 04:06:11
Question: I've tried two completely different methods, but I still can't get the data that is only present after logging in. I tried one approach using requests, but the XPath returns null. import requests from lxml import html USERNAME = "xxx" PASSWORD = "xxx" LOGIN_URL = "http://www.reginaandrew.com/customer/account/loginPost/referer/aHR0cDovL3d3dy5yZWdpbmFhbmRyZXcuY29tLz9fX19TSUQ9VQ,,/" URL = "http://www.reginaandrew.com/gold-leaf-glass-top-table" def main(): FormKeyTxt = "" session_requests =
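The `loginPost` URL and `FormKeyTxt` variable suggest a Magento-style store, where the login form embeds a hidden `form_key` that must be POSTed back along with the credentials. A hedged sketch of that pattern (the URLs are placeholders and the function is never actually called against the live site):

```python
import requests
from lxml import html as lxml_html

# Placeholder endpoints standing in for the store's real login pages.
LOGIN_PAGE = "https://example.com/customer/account/login/"
LOGIN_POST = "https://example.com/customer/account/loginPost/"

def login(username, password):
    # A Session keeps cookies, so the login carries over to later GETs.
    session = requests.Session()
    # Fetch the login page first and pull the hidden form_key via XPath.
    page = session.get(LOGIN_PAGE)
    tree = lxml_html.fromstring(page.content)
    keys = tree.xpath('//input[@name="form_key"]/@value')
    payload = {"login[username]": username,
               "login[password]": password,
               "form_key": keys[0] if keys else ""}
    session.post(LOGIN_POST, data=payload)
    return session
```

If the XPath returns null on the asker's site, the likely causes are that the form fields are injected by JavaScript (invisible to requests) or that the element names differ; inspecting the raw `page.content` rather than the browser's DOM shows what is actually there.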

BeautifulSoup MemoryError When Opening Several Files in Directory

时光毁灭记忆、已成空白 submitted on 2019-12-31 01:51:17
Question: Context: Every week, I receive a list of lab results in the form of an HTML file. Each week there are about 3,000 results, with each set of results having between two and four tables associated with it. For each result/trial, I only care about some standard information that is stored in one of these tables. That table can be uniquely identified because the first cell of the first column always has the text "Lab Results". Problem: The following code works great when I do each file one at a time. That
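The question is truncated before the code, but a MemoryError when looping over many files usually means whole parse trees are accumulating. Two standard BeautifulSoup remedies are a `SoupStrainer`, which parses only the elements you care about, and `decompose()`, which frees the tree before the next file. A sketch under those assumptions (the HTML string stands in for one file's contents):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Stand-in for one weekly file; only the table matters.
html = '<table><tr><td>Lab Results</td></tr></table><p>lots of other markup</p>'

# Parse only <table> elements, so the rest of the file never
# becomes part of the tree; this keeps per-file memory low.
only_tables = SoupStrainer("table")
soup = BeautifulSoup(html, "html.parser", parse_only=only_tables)

# Identify the target table by its first cell, as the question describes.
is_target = soup.td is not None and soup.td.get_text() == "Lab Results"

soup.decompose()  # release the parse tree before moving to the next file
```

Extracting what you need into plain Python values before `decompose()`, as `is_target` does here, ensures nothing keeps a reference into the discarded tree.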

How to return the HTML of a page using RoboBrowser

陌路散爱 submitted on 2019-12-30 18:50:39
Question: I'm experimenting with http://robobrowser.readthedocs.org/en/latest/readme.html, a new Python library based on the Beautiful Soup library. I'm trying to test it out by opening an HTML page and returning it within a Django app, but I can't figure out how to do this most simple task. My Django app contains: def index(request): p=str(request.POST.get('p', False)) # p='https://www.yahoo.com/' browser = RoboBrowser(history=True) postedmessage = browser.open(p) return HttpResponse(postedmessage) How