beautifulsoup

How to Get Script Tag Variables From a Website using Python

谁说胖子不能爱 submitted on 2021-02-08 10:03:55
Question: I am trying to pull a variable called meta out of a script tag using Python. I have done this with Selenium before, but Selenium is too slow for what I am trying to accomplish. Is there another way to do this? I have tried BeautifulSoup, but I'm stuck; my code is below. Here is the script tag I'm trying to get the meta variable from:

    <script>window.ShopifyAnalytics = window.ShopifyAnalytics || {};
    window.ShopifyAnalytics.meta = window.ShopifyAnalytics.meta || {};
    window.ShopifyAnalytics ...
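
One Selenium-free approach is to locate the script tag with BeautifulSoup and parse the assigned value as JSON. This is only a minimal sketch: the script tag in the question is cut off, so the sample page, the `var meta = ...` assignment, and the JSON content below are assumptions made up for illustration, and `extract_meta` is a hypothetical helper name. It also assumes the assigned value is valid JSON.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical sample page: the real script tag in the question is truncated,
# so the value assigned to `meta` here is invented for illustration.
SAMPLE_HTML = """
<html><head>
<script>window.ShopifyAnalytics = window.ShopifyAnalytics || {};
window.ShopifyAnalytics.meta = window.ShopifyAnalytics.meta || {};
var meta = {"product": {"id": 123, "vendor": "Example"}, "page": {"pageType": "product"}};
</script>
</head><body></body></html>
"""


def extract_meta(html):
    """Return the object assigned to `var meta` as a dict, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile(r"var\s+meta\s*=\s*")
    decoder = json.JSONDecoder()
    for script in soup.find_all("script"):
        text = script.string or ""  # inline scripts expose their code via .string
        match = pattern.search(text)
        if match:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever JavaScript follows it.
            value, _end = decoder.raw_decode(text, match.end())
            return value
    return None


meta = extract_meta(SAMPLE_HTML)
print(meta["product"]["id"])
```

For a live page, the HTML would come from something like `requests.get(url).text`. If the value is a JavaScript object literal that is not valid JSON, `raw_decode` raises `json.JSONDecodeError` and a different parsing strategy is needed.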

Beautiful Soup conversion of Unicode characters to HTML entities

╄→гoц情女王★ submitted on 2021-02-08 09:16:00
Question: This problem occurs after loading the document into BeautifulSoup. The document contains entities like &ldquo;, which get converted to ΓÇ£. I want the output to contain the HTML entity &ldquo; instead.

Answer 1: Use the "html" output formatter, as described in the referenced documentation:

    from bs4 import BeautifulSoup
    # name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.prettify(formatter="html"))

Source: https://stackoverflow.com/questions/23191624/beautiful-soup-conversion-of-unicode-characters-to-html-entities
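
A small self-contained sketch of the difference between the default output and `formatter="html"` may help. The sample markup is an assumption. The last line also illustrates where ΓÇ£ likely comes from: it is what the UTF-8 bytes of the left curly quote look like when displayed in code page 437, which suggests a console encoding mismatch rather than a parsing error.

```python
from bs4 import BeautifulSoup

# Hypothetical input containing named entities.
html_doc = "<p>&ldquo;Hello&rdquo; caf&eacute;</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# By default, entities are decoded to Unicode characters on output.
default_output = str(soup)

# With formatter="html", characters that have a named entity are written back as entities.
entity_output = soup.prettify(formatter="html")
print(entity_output)

# The UTF-8 bytes of the left curly quote, misread as code page 437.
mojibake = "\u201c".encode("utf-8").decode("cp437")
print(mojibake)
```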

Scraping a website with python 3 that requires login

我是研究僧i submitted on 2021-02-08 09:13:38
Question: Just a question about authentication when scraping. Using BeautifulSoup:

    # import the requests library
    import requests
    from bs4 import BeautifulSoup
    # fetch the login page
    page = requests.get("http://localhost:8080/login?from=%2F")
    # parse the returned HTML
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup.prettify())

This part of the output seems to be the important bit: <table> <tr> <td> User: </td> <td> <input autocapitalize="off" autocorrect="off" id="j_username" name="j_username" ...
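
A common pattern for form-based logins is to read the form's fields with BeautifulSoup, fill in the credentials, and post them through a session that keeps the cookies. The sketch below is an assumption-heavy illustration: only the `j_username` input appears in the question, so the `j_password` field, the hidden `from` field, the form action, and the helper names `build_login_request` and `log_in` are all hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

LOGIN_URL = "http://localhost:8080/login?from=%2F"  # URL from the question

# Hypothetical login form: only j_username is confirmed by the question.
SAMPLE_FORM = """
<form action="j_security_check" method="post">
  <input id="j_username" name="j_username" type="text">
  <input name="j_password" type="password">
  <input name="from" type="hidden" value="/">
</form>
"""


def build_login_request(html, page_url, credentials):
    """Return (action_url, payload) for the first form found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    payload = {}
    for field in form.find_all("input"):
        name = field.get("name")
        if name:
            # Keep hidden fields (tokens, redirects) with their preset values.
            payload[name] = field.get("value", "")
    payload.update(credentials)
    action_url = urljoin(page_url, form.get("action", ""))
    return action_url, payload


def log_in(session, username, password):
    """Post the login form using a requests.Session-like object (not run here)."""
    page = session.get(LOGIN_URL)
    action_url, payload = build_login_request(
        page.text, page.url, {"j_username": username, "j_password": password}
    )
    session.post(action_url, data=payload)
    return session


action_url, payload = build_login_request(
    SAMPLE_FORM, LOGIN_URL, {"j_username": "alice", "j_password": "secret"}
)
print(action_url, payload)
```

With the live server, `log_in(requests.Session(), user, password)` would return a session whose later `get` calls carry the login cookies, assuming the site really uses a plain form post.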

Beautiful Soup: get picture size from html

丶灬走出姿态 submitted on 2021-02-08 08:14:04
Question: I want to extract the pictures' widths and heights using Beautiful Soup. All pictures have the same markup: <img src="http://somelink.com/somepic.jpg" width="200" height="100"> I can extract the links easily with

    for pic in soup.find_all('img'):
        print(pic['src'])

but

    for pic in soup.find_all('img'):
        print(pic['width'])

does not work for extracting the sizes. What am I missing? EDIT: One of the pictures on the page does not have the width and height in the HTML code. Did not notice this at ...
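
Given the EDIT, the likely failure is a `KeyError` raised by `pic['width']` on the one image without that attribute. A minimal sketch of two ways around it, using `Tag.get` (which returns `None` for a missing attribute) or filtering in `find_all`; the second image URL is a made-up example:

```python
from bs4 import BeautifulSoup

# The first tag is from the question; the second (no size attributes) is hypothetical.
html = (
    '<img src="http://somelink.com/somepic.jpg" width="200" height="100">'
    '<img src="http://somelink.com/nosize.jpg">'
)
soup = BeautifulSoup(html, "html.parser")

# Option 1: .get() returns None instead of raising KeyError.
sizes = [(pic["src"], pic.get("width"), pic.get("height")) for pic in soup.find_all("img")]

# Option 2: only select images that declare both attributes.
sized_only = soup.find_all("img", width=True, height=True)

for src, width, height in sizes:
    print(src, width, height)
```

Note that the attribute values are strings, so they need `int(...)` before any arithmetic.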

How to get full web address with BeautifulSoup

时光怂恿深爱的人放手 submitted on 2021-02-08 07:05:13
Question: I cannot find out how to get the full address of a link on a web site: I get, for example, "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I cannot simply prepend the url to the link, as that would give "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. My goal is to make it work for any website, so I am looking for a general solution. Here is the code:

    from bs4 import BeautifulSoup
    import requests
    url = "https://en.wikipedia.org/wiki/WKIK"
    r = requests.get(url)
    data = r ...
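
The standard library's `urllib.parse.urljoin` resolves a link against the URL of the page it was found on, which is one general way to handle this. A short sketch; apart from "/wiki/Main_Page", the sample href values are assumptions chosen to show the different cases:

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/WKIK"

# Example href values as they might appear in <a> tags on the page.
hrefs = [
    "/wiki/Main_Page",                         # root-relative path
    "#History",                                # fragment on the same page
    "//commons.wikimedia.org/wiki/Main_Page",  # protocol-relative URL
    "https://www.example.org/",                # already absolute, left unchanged
]

absolute = [urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
```

With BeautifulSoup, the same call would be applied to each `a["href"]` from `soup.find_all("a", href=True)`.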

How to strip line breaks from BeautifulSoup get text method

浪子不回头ぞ submitted on 2021-02-08 06:19:15
Question: I get the following output after scraping text from a web page:

    Out[50]: ['\nAbsolute FreeBSD, 2nd Edition\n', '\nAbsolute OpenBSD, 2nd Edition\n', '\nAndroid Security Internals\n', '\nApple Confidential 2.0\n', '\nArduino Playground\n', '\nArduino Project Handbook\n', '\nArduino Workshop\n', '\nArt of Assembly Language, 2nd Edition\n', '\nArt of Debugging\n', '\nArt of Interactive Design\n',]

I need to strip \n from the items in this list while iterating over it. Here is my code:

    text = []
    for name in web ...
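
Two simple options are calling `str.strip()` on each scraped string, or asking BeautifulSoup to strip whitespace with `get_text(strip=True)`. A minimal sketch; the question's loop is cut off, so the HTML structure below is an assumption:

```python
from bs4 import BeautifulSoup

# Option 1: clean strings that were already collected.
raw = [
    "\nAbsolute FreeBSD, 2nd Edition\n",
    "\nAbsolute OpenBSD, 2nd Edition\n",
    "\nAndroid Security Internals\n",
]
cleaned = [title.strip() for title in raw]

# Option 2: strip while extracting. The markup is hypothetical.
html = "<ul><li><a href='#'>\nArduino Playground\n</a></li><li><a href='#'>\nArduino Workshop\n</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")
titles = [link.get_text(strip=True) for link in soup.find_all("a")]

print(cleaned)
print(titles)
```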

Why is BeautifulSoup's findAll returning an empty list when I search by class?

元气小坏坏 submitted on 2021-02-08 03:16:13
Question: I am trying to scrape a web page by its h2 tag, but BeautifulSoup returns an empty list. The tag is:

    <h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">

and my code is:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    html = urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job")
    bs0bj = BeautifulSoup(html, "lxml")
    nameList = bs0bj.findAll("h2", {"class": "iCIMS_InfoMsg iCIMS_InfoField_Job"})
    print(nameList)

Answer 1: The content is inside an iframe and is updated via JavaScript, so it is not present in the initial response. You can use the same link the page is using ...
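
When content lives in an iframe, one way to proceed is to read the iframe's `src` from the outer page, resolve it to an absolute URL, and request that document separately. The sketch below runs offline on invented markup: the iframe `src`, the inner document, and the heading text are all hypothetical, and the answer's actual link is not reproduced because it is cut off in the excerpt.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job"

# Hypothetical outer page: the h2 is absent, only an iframe is present.
OUTER_HTML = '<html><body><iframe src="/hypothetical/frame.html"></iframe></body></html>'

# Hypothetical document that the iframe would load.
INNER_HTML = '<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">Example heading</h2>'


def iframe_urls(html, page_url):
    """Return the absolute URLs of all iframes that declare a src."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(page_url, frame["src"]) for frame in soup.find_all("iframe", src=True)]


def find_headings(html):
    """Return the text of h2 tags carrying the iCIMS_InfoMsg class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any single class in a multi-valued class attribute.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="iCIMS_InfoMsg")]


frames = iframe_urls(OUTER_HTML, PAGE_URL)
outer_headings = find_headings(OUTER_HTML)  # empty, as in the question
inner_headings = find_headings(INNER_HTML)  # found once the framed document is parsed
print(frames, outer_headings, inner_headings)
```

If the framed document itself fills in the heading with JavaScript, fetching it is still not enough, and the request that supplies the data has to be identified in the browser's network tools.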