beautifulsoup

Is there a way I can grab a column of a table that spans several pages using Python?

梦想的初衷 submitted on 2019-12-24 18:13:14
Question: I am trying to get the tickers of ETFs from a table that spans 46 pages: http://etfdb.com/type/region/north-america/us/#etfs&sort_name=assets_under_management&sort_order=desc&page=1 My code is:

import bs4 as bs
import pickle
import requests

def save_ETF_tickers():
    resp = requests.get('http://etfdb.com/type/region/north-america/us/#etfs&sort_name=assets_under_management&sort_order=desc&page=1')
    soup = bs.BeautifulSoup(resp.text, "lxml")
    table = soup.find('table',{'class': 'table mm-mobile
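The question is cut off above, but the usual approach to a table that continues over many pages is to loop over the page numbers and collect the same column from every page. A minimal sketch, assuming the page number can be passed as a ?page= query parameter and that the ticker sits in the first cell of each row (both assumptions, since the truncated question does not show the table markup):

import bs4 as bs
import requests

def save_etf_tickers(pages=46):
    base = 'http://etfdb.com/type/region/north-america/us/'
    tickers = []
    for page in range(1, pages + 1):
        # Assumption: the page number is accepted as a ?page= query parameter.
        resp = requests.get(base, params={'page': page})
        soup = bs.BeautifulSoup(resp.text, 'lxml')
        table = soup.find('table')
        if table is None:
            continue
        for row in table.find_all('tr')[1:]:  # skip the header row
            cells = row.find_all('td')
            if cells:
                # Assumption: the ticker is the text of the first column.
                tickers.append(cells[0].get_text(strip=True))
    return tickers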

Loading more content in a webpage and issues writing to a file

左心房为你撑大大i submitted on 2019-12-24 18:12:37
Question: I am working on a web scraping project which involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from these links and storing it in a text file. I am currently stuck with two issues. Only the first few links are scraped: I'm unable to extract links from the other pages (the website has a "load more" button), and I don't know how to use the XHR object in the code. The second half of the code reads only the
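A "load more" button is usually backed by an XHR request that can be called directly with requests once its URL and parameters have been copied from the browser's Network tab. A rough sketch under that assumption; the endpoint, parameter names, and link selector below are all hypothetical placeholders, not taken from the question:

import requests
from bs4 import BeautifulSoup

# Hypothetical XHR endpoint -- the real one has to be read from the
# browser's Network tab when the "load more" button is clicked.
XHR_URL = 'https://example.com/search/load_more'

def fetch_result_links(term, pages=5):
    links = []
    for page in range(1, pages + 1):
        # Hypothetical parameter names for the search term and page number.
        resp = requests.get(XHR_URL, params={'q': term, 'page': page})
        soup = BeautifulSoup(resp.text, 'lxml')
        for a in soup.select('a[href]'):
            links.append(a['href'])
    return links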

Pin down exact content location in HTML for web scraping with urllib2 and Beautiful Soup

梦想与她 submitted on 2019-12-24 17:25:50
Question: I'm new to web scraping, have little exposure to HTML document structure, and wanted to know if there is a better, more efficient way to search for the required content in the HTML version of a web page. Currently, I want to scrape reviews for a product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem For this, I have the following code:

url = http://www.walmart.com/ip/29701960? wmlspartner=wlpa
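For pinning down one specific piece of content, a CSS selector via requests and BeautifulSoup is usually enough. A small sketch under two assumptions: the class name below is hypothetical (it has to be read from the actual page source), and review text on retail sites is often injected by JavaScript, in which case it will not be present in the raw HTML at all:

import requests
from bs4 import BeautifulSoup

url = 'http://www.walmart.com/ip/29701960'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

# Hypothetical selector -- replace it with the class seen in the real markup.
for review in soup.select('div.customer-review-text'):
    print(review.get_text(strip=True))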

Kaggle word2vec competition, part 2

試著忘記壹切 submitted on 2019-12-24 17:15:48
Question: My code is from https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors. I read the data successfully; BeautifulSoup and nltk are used here to clean the text, removing non-letters but keeping numbers.

def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words. Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    #
    # 2. Remove non-letters
    review_text = re.sub
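A sketch of one way the cleaning function could look if digits are meant to survive alongside letters (an assumption based on "removing non-letters but keeping numbers"); it follows the same steps as the Kaggle tutorial's review_to_wordlist:

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

def review_to_wordlist(review, remove_stopwords=False):
    # 1. Strip HTML markup.
    review_text = BeautifulSoup(review, 'lxml').get_text()
    # 2. Keep letters and digits, replace everything else with a space
    #    (assumption: digits should survive the cleaning step).
    review_text = re.sub(r'[^a-zA-Z0-9]', ' ', review_text)
    # 3. Lower-case and split into tokens.
    words = review_text.lower().split()
    # 4. Optionally drop English stop words.
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        words = [w for w in words if w not in stops]
    return words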

Fix encoding error with loop in BeautifulSoup4?

孤者浪人 submitted on 2019-12-24 16:52:33
Question: This is a follow-up to "Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?" and "Using Python to Scrape Nested Divs and Spans in Twitter?". I'm not using the Twitter API because it doesn't look at the tweets by hashtag this far back. EDIT: The error described here only occurs on Windows 7. The code runs as intended on Linux, as reported by bernie (see comment below), and I am able to run it without encoding errors on OS X 10.10.2. The encoding error occurs when
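The usual workaround, assuming the failure is the Windows default codec (cp1252) raising UnicodeEncodeError on characters in the scraped tweets, is to write the output with an explicit UTF-8 encoding instead of relying on the platform default. A minimal sketch:

import io

def write_tweets(tweets, path='tweets.txt'):
    # Open the file with an explicit UTF-8 encoding instead of the platform
    # default (cp1252 on Windows 7), which is what typically raises
    # UnicodeEncodeError when looping over scraped text.
    with io.open(path, 'w', encoding='utf-8') as f:
        for tweet in tweets:
            f.write(tweet + u'\n')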

How to scrape latitude and longitude in Beautiful Soup

走远了吗. submitted on 2019-12-24 16:48:02
Question: I am fairly new to BeautifulSoup4 and am having trouble extracting latitude and longitude values out of an HTML response with the code below.

url = 'http://cinematreasures.org/theaters/united-states?page=1'
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.findAll("tr")
print links

This code prints out the following response multiple times: <tr class="even location theater" data="{id: 0, point: {lng: -94.1751038, lat: 36.0848965} Full tr response: <tr>\n <th id="theater_name"><a href="
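Since the coordinates sit inside the tr element's data attribute as JavaScript-style text rather than strict JSON, one common approach is to pull the numbers out with a regular expression. A sketch along those lines:

import re
import requests
from bs4 import BeautifulSoup

url = 'http://cinematreasures.org/theaters/united-states?page=1'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

coords = []
for tr in soup.find_all('tr', attrs={'data': True}):
    # The data attribute is not strict JSON, so extract the numbers
    # with a regular expression instead of json.loads.
    m = re.search(r'lng:\s*(-?\d+\.\d+),\s*lat:\s*(-?\d+\.\d+)', tr['data'])
    if m:
        lng, lat = float(m.group(1)), float(m.group(2))
        coords.append((lat, lng))

print(coords)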

BeautifulSoup error when saving a .txt file

不问归期 submitted on 2019-12-24 16:42:59
Question:

from bs4 import BeautifulSoup
import requests
import os

url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r = requests.get(url)
soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'))
data = soup.find_all("article", {"class": "article"})
with open("data1.txt", "wb") as file:
    content=‘utf-8’
    for item in data:
        content+='''{}\n{}\n\n{}\n{}'''.format(
            item.contents[0].find_all("time", {"datetime": "2016-03-16T09:50:30+0100"})[0].text,
            item
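A sketch of how the save step is often written, assuming Python 3 and UTF-8 text output; the fields extracted per article are hypothetical examples, since the original format() call is cut off above:

from bs4 import BeautifulSoup
import requests

url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r = requests.get(url)
soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'), 'lxml')
data = soup.find_all("article", {"class": "article"})

content = ''  # start from an empty string, not from the literal 'utf-8'
for item in data:
    # Hypothetical fields: the first <time> element and the article text.
    timestamp = item.find("time")
    content += '{}\n{}\n\n'.format(
        timestamp.text if timestamp else '',
        item.get_text(strip=True))

# Open in text mode with an explicit encoding instead of "wb",
# so a str can be written directly and non-ASCII characters survive.
with open("data1.txt", "w", encoding="utf-8") as f:
    f.write(content)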

Compute the average height and the average width of div tags

最后都变了- submitted on 2019-12-24 16:37:01
Question: I need to get the average div height and width of an HTML document. I have tried this solution, but it doesn't work:

import numpy as np

average_width = np.mean([div.attrs['width'] for div in my_doc.get_div() if 'width' in div.attrs])
average_height = np.mean([div.attrs['height'] for div in my_doc.get_div() if 'height' in div.attrs])
print average_height, average_width

The get_div method returns the list of all divs retrieved by BeautifulSoup's find_all method. Here is an example: print my_doc
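BeautifulSoup returns attribute values as strings (sometimes with units such as "px"), so they need to be converted to numbers before np.mean can average them, and only divs that declare width/height as HTML attributes rather than CSS are visible this way. A sketch under those assumptions:

import re
import numpy as np

def _to_number(value):
    # Attribute values are strings and may carry units such as "100px";
    # pull out the leading number so np.mean gets floats, not text.
    m = re.match(r'\s*([\d.]+)', str(value))
    return float(m.group(1)) if m else None

def average_div_size(soup):
    widths = [_to_number(d.attrs['width']) for d in soup.find_all('div') if 'width' in d.attrs]
    heights = [_to_number(d.attrs['height']) for d in soup.find_all('div') if 'height' in d.attrs]
    widths = [w for w in widths if w is not None]
    heights = [h for h in heights if h is not None]
    return (np.mean(heights) if heights else None,
            np.mean(widths) if widths else None)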

How do you extract data from JSON using BeautifulSoup in Django?

一笑奈何 submitted on 2019-12-24 16:28:44
Question: Good day. I'm facing an issue while trying to extract values from JSON. First of all, my BeautifulSoup code works fine in the shell, but not in Django; what I'm trying to achieve is extracting data from the received JSON, but with no success. Here's the class in my view doing it:

class FetchWeather(generic.TemplateView):
    template_name = 'forecastApp/pages/weather.html'

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        url = 'http://weather.news24.com
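If the response body is JSON, it can be parsed directly with response.json() rather than with BeautifulSoup, which is meant for HTML. A sketch of the view under that assumption; the endpoint path and the key pulled from the JSON are hypothetical, since the original view is truncated above:

import requests
from django.views import generic

class FetchWeather(generic.TemplateView):
    template_name = 'forecastApp/pages/weather.html'

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        # Hypothetical URL -- the full path is in the truncated view above.
        url = 'http://weather.news24.com'
        resp = requests.get(url)
        data = resp.json()  # parse the JSON directly; no HTML parsing needed
        context['forecast'] = data.get('forecast')  # hypothetical key
        return context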

How to make Selenium and BeautifulSoup work faster?

一曲冷凌霜 submitted on 2019-12-24 15:49:53
Question: My goal is to scrape as many profile links as possible on Khan Academy, and then scrape some specific data from each of these profiles. My problem here is simple: this script is taking way too much time, and I can't be sure that it is working the right way. Here is my script:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common
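A common way to speed this pattern up is to let Selenium render the page once, wait with an explicit condition instead of fixed sleeps, and hand driver.page_source to BeautifulSoup so the heavy parsing happens on a string rather than through many WebDriver calls. A sketch, with a hypothetical starting page and link pattern:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('https://www.khanacademy.org/')  # hypothetical starting page

# Wait for links to be present instead of sleeping a fixed number of seconds.
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, 'a')))

# Hand the rendered HTML to BeautifulSoup once; parsing a string is much
# cheaper than issuing one WebDriver call per element.
soup = BeautifulSoup(driver.page_source, 'lxml')
profile_links = [a['href'] for a in soup.select('a[href*="/profile/"]')]  # hypothetical pattern

driver.quit()
print(len(profile_links))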