beautifulsoup

Why does this code generate multiple files? I want 1 file with all entries in it

北慕城南 submitted on 2020-02-06 17:43:11
Question: I'm trying to work with both BeautifulSoup and XPath and was using the following code, but now I'm getting one file per URL instead of one file for all the URLs as before. I just moved the URL list over to being read from a CSV and added the parsing of the URL and response, but when I run this now I get a lot of individual files, and in some cases one file may actually contain the scraped data of two pages. So do I need to move my file saving out (indent)? import …
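
A minimal sketch of the usual fix, assuming the URLs sit one per row in a urls.csv (file names here are hypothetical): open the single output file once, outside the per-URL loop, so every iteration appends to it instead of creating a new file.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Sketch, not the asker's code: one output file opened once, outside
    # the loop; every scraped page is appended to it.
    with open("output.txt", "w", encoding="utf-8") as out, \
         open("urls.csv", newline="") as f:
        for row in csv.reader(f):
            url = row[0]
            response = requests.get(url)
            soup = BeautifulSoup(response.text, "html.parser")
            title = soup.title.string if soup.title else ""
            out.write(f"{url}\t{title}\n")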

Get div attribute value and div text body

允我心安 submitted on 2020-02-05 08:39:15
Question: Here is a small piece of code to get a div attribute value. All the divs have the same name, with the same attribute name.

    redditFile = urllib2.urlopen("http://www.bing.com/videos?q=owl")
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    productDivs = soup.findAll('div', attrs={'class': 'dg_u'})
    for div in productDivs:
        print div.find('div', {"class": "vthumb"})['smturl']
        #print div.find("div", {"class": "tl text-body"})

This prints None rather than the div text; the first print gives some URLs (sometimes 4, 6 …
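
A Python 3 sketch of the same idea with None-guards; the class names and the smturl attribute come from the question and may not exist on every div, hence the checks:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://www.bing.com/videos?q=owl").text
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", class_="dg_u"):
        thumb = div.find("div", class_="vthumb")
        # Guard both lookups: find() returns None when nothing matches,
        # and .get() returns None when the attribute is absent.
        if thumb is not None and thumb.get("smturl"):
            print(thumb["smturl"])
        body = div.find("div", class_="text-body")
        if body is not None:
            print(body.get_text(strip=True))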

Assign lines from a txt file to HTML files (regex, bs4)

試著忘記壹切 submitted on 2020-02-05 05:19:25
Question: I need to assign the lines from a text file containing 7 lines, each line holding a URL; these URLs should replace the href URL in 7 HTML files. The issue is that I ended up with the same value, the same URL, in all seven HTML files. Here is my code:

    regex = re.compile("https\:\/\/(.*)\n")
    with open("urls.txt") as f:
        for line in f:
            result = regex.search(line)
            for filename in glob.glob('/htmlz/*.html'):
                with open(filename, "r") as html_file:
                    soup = BeautifulSoup(html_file, 'html.parser')
                    for …
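
The nesting above applies every URL to every file, so the last URL wins everywhere. A sketch of one likely fix, assuming the intent is "line N goes into file N": pair each URL with exactly one file via zip().

    import glob
    from bs4 import BeautifulSoup

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # One URL per file; sorted() makes the pairing deterministic.
    for url, filename in zip(urls, sorted(glob.glob("/htmlz/*.html"))):
        with open(filename) as html_file:
            soup = BeautifulSoup(html_file, "html.parser")
        for a in soup.find_all("a", href=True):
            a["href"] = url
        with open(filename, "w") as html_file:
            html_file.write(str(soup))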

Beautiful Soup replaces < with &lt;

你。 submitted on 2020-02-04 05:06:07
Question: I've found the text I want to replace, but when I print soup the format gets changed: <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the data? I have tried print(soup.encode(formatter="none")), but that produces the same incorrect format.

    from bs4 import BeautifulSoup

    with open(index_file) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    found = soup.find("div", {"id": "content"})
    found.replace_with(data)

When I print found, I get the correct …
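
replace_with() escapes a plain string, which is what produces the &lt; entities. A sketch of the common fix, assuming data is a string of raw HTML (the path and replacement markup below are illustrative): parse the string into a soup fragment first and insert the fragment instead.

    from bs4 import BeautifulSoup

    index_file = "index.html"                    # hypothetical path
    data = '<div id="content">new stuff</div>'   # example replacement HTML

    with open(index_file) as fp:
        soup = BeautifulSoup(fp, "html.parser")

    # Parsing `data` turns it into real tags, so it is inserted as
    # markup rather than escaped text.
    fragment = BeautifulSoup(data, "html.parser")
    soup.find("div", {"id": "content"}).replace_with(fragment)
    print(soup)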

BeautifulSoup HTTPResponse has no attribute encode

守給你的承諾、 submitted on 2020-02-03 11:00:27
Question: I'm trying to get BeautifulSoup working with a URL, like the following:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://proxies.org")
    soup = BeautifulSoup(html.encode("utf-8"), "html.parser")
    print(soup.find_all('a'))

However, I am getting an error:

    File "c:\Python3\ProxyList.py", line 3, in <module>
        html = urlopen("http://proxies.org").encode("utf-8")
    AttributeError: 'HTTPResponse' object has no attribute 'encode'

Any idea why? Could it be to do with …
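
urlopen() returns an HTTPResponse object, which has no .encode() method; encoding is something you do to a string, not to a response. A minimal sketch of the usual fix: read the response's bytes (BeautifulSoup detects the encoding itself), or pass the file-like response object straight to BeautifulSoup.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    response = urlopen("http://proxies.org")
    html = response.read()            # bytes; bs4 sniffs the encoding
    soup = BeautifulSoup(html, "html.parser")
    print(soup.find_all("a"))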

Elegant way to try/except a series of BeautifulSoup commands?

Deadly submitted on 2020-02-03 05:10:33
Question: I'm parsing webpages on a site displaying item data. These items have about 20 fields which may or may not occur, say: price, quantity, last purchased, high, low, etc. I'm currently using a series of commands, about 20 lines of soup.find('div', {'class': SOME_FIELD_OF_INTEREST}), to look for each separate field of interest. (Some are in div, span, dd, and so on, so it's difficult to just do a soup.find_all('div') command.) My question: is there an elegant way to try and except everything …
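
One tidy pattern, sketched with illustrative field names: describe each field as a (name, tag, class) triple once, loop over the table, and let a missing field become None instead of raising.

    # Each entry: (result key, tag name, CSS class) — names are examples.
    FIELDS = [
        ("price", "div", "price"),
        ("quantity", "span", "quantity"),
        ("last_purchased", "dd", "last-purchased"),
        # ... the other ~17 fields
    ]

    def extract(soup):
        record = {}
        for name, tag, css_class in FIELDS:
            node = soup.find(tag, {"class": css_class})
            record[name] = node.get_text(strip=True) if node else None
        return record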

Where does data not in a website's source code come from and how do I get it using BeautifulSoup? [duplicate]

久未见 submitted on 2020-02-03 02:03:14
Question: This question already has answers here: Web-scraping JavaScript page with Python (13 answers); Beautiful Soup Can't Find Tags (2 answers). Closed last month. I am trying to pull data from a local government's website using BeautifulSoup with Python, but the source code it pulls down lacks the info I want. I know how to use BeautifulSoup, and I can pull any part of the source code down and use it in Python, but the data I want is not there. What happens is the page has all of the …
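
Such data is usually fetched by JavaScript after the initial HTML loads, so BeautifulSoup, which only sees the static source, never gets it. A sketch of one common workaround: watch the browser's Network tab for the XHR request that delivers the data and call that endpoint directly (the URL below is purely hypothetical).

    import requests

    # Hypothetical endpoint discovered via the browser's devtools.
    api_url = "https://example.gov/api/records?year=2020"
    data = requests.get(api_url).json()
    for item in data:
        print(item)

If no such endpoint exists, a JavaScript-capable tool such as Selenium (as in the linked duplicates) is the usual fallback.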

How can I get the first string from a div that has a div embedded? (beautifulsoup4)

巧了我就是萌 submitted on 2020-02-02 13:02:31
Question: I'm trying to extract prices from a website. The code I've written can do that, but when the website shows a price together with the old price, it returns None instead of a string of the price. This is an example of the markup without the old price (which my code returns as a string):

    <div class="xl-price rangePrice"> 535.000 € </div>

This is an example of the markup WITH the old price (which my code returns as None):

    <div class="xl-price rangePrice"> 487.000 € <span class="old-price"> …
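
A sketch of one way out: take only the div's own first text node, ignoring the nested old-price element, for instance via stripped_strings (the markup below extends the question's snippet with an assumed closing tag).

    from bs4 import BeautifulSoup

    html = '''<div class="xl-price rangePrice"> 487.000 €
                <span class="old-price"> 535.000 € </span></div>'''
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="xl-price")
    price = next(div.stripped_strings)   # first string only: "487.000 €"
    print(price)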

Extract JSON from HTML Script tag with BeautifulSoup in Python

最后都变了- submitted on 2020-02-02 07:04:04
Question: I have the following HTML; what should I do to extract the JSON from the variable window.__INITIAL_STATE__?

    <!DOCTYPE doctype html>
    <html lang="en">
    <script>
    window.sessConf = "-2912474957111138742";
    /* <sl:translate_json> */
    window.__INITIAL_STATE__ = { /* Target JSON here with 12 million characters */};
    /* </sl:translate_json> */
    </script>
    </html>

Answer 1: You can use the following Python code to extract the JavaScript code.

    soup = BeautifulSoup(html)
    s = soup.find('script')
    js = 'window = {}…
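
A sketch of another common approach, not the answer's exact code: slice the assignment out of the script text with a regular expression and hand it to json.loads. It assumes the payload itself contains no "};" sequence, which may not hold across all 12 million characters.

    import json
    import re
    from bs4 import BeautifulSoup

    html = '''<script>
    window.sessConf = "-2912474957111138742";
    window.__INITIAL_STATE__ = {"key": "value"};
    </script>'''

    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script").get_text()
    match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;",
                      script, re.DOTALL)
    if match:
        state = json.loads(match.group(1))
        print(state["key"])   # -> value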