beautifulsoup

Why does this code generate multiple files? I want 1 file with all entries in it

北慕城南 submitted on 2020-02-06 17:43:11
Question: I'm trying to work with both BeautifulSoup and XPath and was using the following code, but now I'm getting one file per URL instead of one file for all the URLs as before. I just moved the URL list over to being read from a CSV and added the parsing of the URL and response, but when I run this now I get a lot of individual files, and in some cases one file may actually contain the scraped data of two pages. So do I need to move my file saving out (indent)? import …
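
A minimal sketch of the usual fix, assuming the URLs sit one per row in a urls.csv (file names here are hypothetical): open the single output file once, outside the per-URL loop, so every iteration appends to it instead of creating a new file.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Sketch, not the asker's code: one output file opened once, outside
    # the loop; every scraped page is appended to it.
    with open("output.txt", "w", encoding="utf-8") as out, \
         open("urls.csv", newline="") as f:
        for row in csv.reader(f):
            url = row[0]
            response = requests.get(url)
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.title.string if soup.title else ""
            out.write(f"{url}\t{title}\n")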

Get div attribute value and div text body

允我心安 submitted on 2020-02-05 08:39:15
Question: Here is a small piece of code to get a div attribute value. All the divs have the same name, with the same attribute name.

    redditFile = urllib2.urlopen("http://www.bing.com/videos?q=owl")
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    productDivs = soup.findAll('div', attrs={'class': 'dg_u'})
    for div in productDivs:
        print div.find('div', {"class": "vthumb"})['smturl']
        #print div.find("div", {"class": "tl text-body"})

This prints None rather than the div text; the first print gives some URLs (sometimes 4, 6 …
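
A Python 3 sketch of the same idea with None-guards; the class names and the smturl attribute come from the question and may not exist on every div, hence the checks:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://www.bing.com/videos?q=owl").text
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", class_="dg_u"):
        thumb = div.find("div", class_="vthumb")
        # Guard both lookups: find() returns None when nothing matches,
        # and .get() returns None when the attribute is absent.
        if thumb is not None and thumb.get("smturl"):
            print(thumb["smturl"])
        body = div.find("div", class_="text-body")
        if body is not None:
            print(body.get_text(strip=True))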

Assign lines from a txt file to HTML files (regex, bs4)

試著忘記壹切 submitted on 2020-02-05 05:19:25
Question: I need to assign the lines from a text file containing 7 lines, each line holding a URL; these URLs should replace the href URL in 7 HTML files. The issue is that I ended up with the same value, the same URL, in all seven HTML files. Here is my code:

    regex = re.compile("https\:\/\/(.*)\n")
    with open("urls.txt") as f:
        for line in f:
            result = regex.search(line)
            for filename in glob.glob('/htmlz/*.html'):
                with open(filename, "r") as html_file:
                    soup = BeautifulSoup(html_file, 'html.parser')
                    for …
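
The nesting above applies every URL to every file, so the last URL wins everywhere. A sketch of one likely fix, assuming the intent is "line N goes into file N": pair each URL with exactly one file via zip().

    import glob
    from bs4 import BeautifulSoup

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # One URL per file; sorted() makes the pairing deterministic.
    for url, filename in zip(urls, sorted(glob.glob("/htmlz/*.html"))):
        with open(filename) as html_file:
            soup = BeautifulSoup(html_file, "html.parser")
        for a in soup.find_all("a", href=True):
            a["href"] = url
        with open(filename, "w") as html_file:
            html_file.write(str(soup))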

Beautiful Soup replaces < with &lt;

你。 submitted on 2020-02-04 05:06:07
Question: I've found the text I want to replace, but when I print soup the format gets changed: <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the data? I have tried print(soup.encode(formatter="none")), but that produces the same incorrect format.

    from bs4 import BeautifulSoup

    with open(index_file) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    found = soup.find("div", {"id": "content"})
    found.replace_with(data)

When I print found, I get the correct …
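
replace_with() escapes a plain string, which is what produces the &lt; entities. A sketch of the common fix, assuming data is a string of raw HTML (the path and replacement markup below are illustrative): parse the string into a soup fragment first and insert the fragment instead.

    from bs4 import BeautifulSoup

    index_file = "index.html"                    # hypothetical path
    data = '<div id="content">new stuff</div>'   # example replacement HTML

    with open(index_file) as fp:
        soup = BeautifulSoup(fp, "html.parser")

    # Parsing `data` turns it into real tags, so it is inserted as
    # markup rather than escaped text.
    fragment = BeautifulSoup(data, "html.parser")
    soup.find("div", {"id": "content"}).replace_with(fragment)
    print(soup)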

BeautifulSoup HTTPResponse has no attribute encode

守給你的承諾、 submitted on 2020-02-03 11:00:27
Question: I'm trying to get BeautifulSoup working with a URL, like the following:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://proxies.org")
    soup = BeautifulSoup(html.encode("utf-8"), "html.parser")
    print(soup.find_all('a'))

However, I am getting an error:

    File "c:\Python3\ProxyList.py", line 3, in <module>
        html = urlopen("http://proxies.org").encode("utf-8")
    AttributeError: 'HTTPResponse' object has no attribute 'encode'

Any idea why? Could it be to do with …
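
urlopen() returns an HTTPResponse object, which has no .encode() method; encoding is something you do to a string, not to a response. A minimal sketch of the usual fix: read the response's bytes (BeautifulSoup detects the encoding itself), or pass the file-like response object straight to BeautifulSoup.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    response = urlopen("http://proxies.org")
    html = response.read()            # bytes; bs4 sniffs the encoding
    soup = BeautifulSoup(html, "html.parser")
    print(soup.find_all("a"))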

Elegant way to try/except a series of BeautifulSoup commands?

Deadly submitted on 2020-02-03 05:10:33
Question: I'm parsing webpages on a site displaying item data. These items have about 20 fields which may or may not occur, say: price, quantity, last purchased, high, low, etc. I'm currently using a series of commands, about 20 lines of soup.find('div', {'class': SOME_FIELD_OF_INTEREST}), to look for each separate field of interest. (Some are in div, span, dd, and so on, so it's difficult to just do a soup.find_all('div') command.) My question: is there an elegant way to try and except everything …
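
One tidy pattern, sketched with illustrative field names: describe each field as a (name, tag, class) triple once, loop over the table, and let a missing field become None instead of raising.

    # Each entry: (result key, tag name, CSS class) — names are examples.
    FIELDS = [
        ("price", "div", "price"),
        ("quantity", "span", "quantity"),
        ("last_purchased", "dd", "last-purchased"),
        # ... the other ~17 fields
    ]

    def extract(soup):
        record = {}
        for name, tag, css_class in FIELDS:
            node = soup.find(tag, {"class": css_class})
            record[name] = node.get_text(strip=True) if node else None
        return record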

Where does data not in a website's source code come from and how do I get it using BeautifulSoup? [duplicate]

久未见 submitted on 2020-02-03 02:03:14
Question: This question already has answers here: Web-scraping JavaScript page with Python (13 answers); Beautiful Soup Can't Find Tags (2 answers). Closed last month. I am trying to pull data from a local government's website using BeautifulSoup with Python, but the source code it pulls down lacks the info I want. I know how to use BeautifulSoup, and I can pull any part of the source code down and use it in Python, but the data I want is not there. What happens is the page has all of the …
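
Such data is usually fetched by JavaScript after the initial HTML loads, so BeautifulSoup, which only sees the static source, never gets it. A sketch of one common workaround: watch the browser's Network tab for the XHR request that delivers the data and call that endpoint directly (the URL below is purely hypothetical).

    import requests

    # Hypothetical endpoint discovered via the browser's devtools.
    api_url = "https://example.gov/api/records?year=2020"
    data = requests.get(api_url).json()
    for item in data:
        print(item)

If no such endpoint exists, a JavaScript-capable tool such as Selenium (as in the linked duplicates) is the usual fallback.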

How can I get the first string from a div that has a div embedded? (beautifulsoup4)

巧了我就是萌 submitted on 2020-02-02 13:02:31
Question: I'm trying to extract prices from a website. The code I've written can do that, but when the website shows a price together with the old price, it returns None instead of a string of the price. This is an example of the markup without the old price (which my code returns as a string):

    <div class="xl-price rangePrice"> 535.000 € </div>

This is an example of the markup WITH the old price (which my code returns as None):

    <div class="xl-price rangePrice"> 487.000 € <span class="old-price"> …
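
A sketch of one way out: take only the div's own first text node, ignoring the nested old-price element, for instance via stripped_strings (the markup below extends the question's snippet with an assumed closing tag).

    from bs4 import BeautifulSoup

    html = '''<div class="xl-price rangePrice"> 487.000 €
                <span class="old-price"> 535.000 € </span></div>'''
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="xl-price")
    price = next(div.stripped_strings)   # first string only: "487.000 €"
    print(price)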

Extract JSON from HTML Script tag with BeautifulSoup in Python

最后都变了- submitted on 2020-02-02 07:04:04
Question: I have the following HTML; what should I do to extract the JSON from the variable window.__INITIAL_STATE__?

    <!DOCTYPE doctype html>
    <html lang="en">
    <script>
    window.sessConf = "-2912474957111138742";
    /* <sl:translate_json> */
    window.__INITIAL_STATE__ = { /* Target JSON here with 12 million characters */};
    /* </sl:translate_json> */
    </script>
    </html>

Answer 1: You can use the following Python code to extract the JavaScript code.

    soup = BeautifulSoup(html)
    s = soup.find('script')
    js = 'window = {}…
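
A sketch of another common approach, not the answer's exact code: slice the assignment out of the script text with a regular expression and hand it to json.loads. It assumes the payload itself contains no "};" sequence, which may not hold across all 12 million characters.

    import json
    import re
    from bs4 import BeautifulSoup

    html = '''<script>
    window.sessConf = "-2912474957111138742";
    window.__INITIAL_STATE__ = {"key": "value"};
    </script>'''

    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script").get_text()
    match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;",
                      script, re.DOTALL)
    if match:
        state = json.loads(match.group(1))
        print(state["key"])   # -> value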