bs4

BeautifulSoup.find_all() method not working with namespaced tags

我是研究僧i submitted on 2021-02-18 10:59:12
Question: I encountered some very strange behaviour while working with BeautifulSoup today. Consider a very simple HTML snippet: <html><body><ix:nonfraction>lele</ix:nonfraction></body></html> I am trying to get the content of the <ix:nonfraction> tag with BeautifulSoup. Everything works fine when using the find method: from bs4 import BeautifulSoup html = "<html><body><ix:nonfraction>lele</ix:nonfraction></body></html>" soup = BeautifulSoup(html, 'lxml') # The parser used here does not
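A minimal sketch of one parser-independent way to match a prefixed tag, offered as an assumption rather than the accepted answer to the question above: pass a function to find_all() instead of a string, so the match does not depend on how the parser stores the prefix. The helper name is_nonfraction is hypothetical, and the built-in html.parser is used here only to keep the example dependency-free (the question uses lxml).

```python
from bs4 import BeautifulSoup

html = "<html><body><ix:nonfraction>lele</ix:nonfraction></body></html>"
soup = BeautifulSoup(html, "html.parser")

def is_nonfraction(tag):
    # Hypothetical helper: compare only the local part of the tag name,
    # so both "ix:nonfraction" and "nonfraction" would match.
    return tag.name.split(":")[-1] == "nonfraction"

matches = soup.find_all(is_nonfraction)
texts = [m.get_text() for m in matches]
print(texts)
```

A function filter is slower than a plain string match, but it makes the matching rule explicit.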

Get web page content (Not from source code) [duplicate]

痞子三分冷 submitted on 2021-02-08 03:55:22
Question: This question already has answers here: Web-scraping JavaScript page with Python (15 answers). Closed 4 years ago. I want to get the rainfall data for each day from here. In inspect mode I can see the data, but when I view the page source I cannot find it. I am using urllib2 and BeautifulSoup from bs4. Here is my code: import urllib2 from bs4 import BeautifulSoup link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1" r = urllib2.urlopen(link) soup = BeautifulSoup(r)

Get web page content (Not from source code) [duplicate]

别来无恙 submitted on 2021-02-08 03:54:21
Question: This question already has answers here: Web-scraping JavaScript page with Python (15 answers). Closed 4 years ago. I want to get the rainfall data for each day from here. In inspect mode I can see the data, but when I view the page source I cannot find it. I am using urllib2 and BeautifulSoup from bs4. Here is my code: import urllib2 from bs4 import BeautifulSoup link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1" r = urllib2.urlopen(link) soup = BeautifulSoup(r)

Beautiful Soup 4 .string() 'NoneType' object is not callable

丶灬走出姿态 submitted on 2021-01-28 19:24:28
Question: from bs4 import BeautifulSoup import sys soup = BeautifulSoup(open(sys.argv[2]), 'html.parser') print(soup.prettify) if sys.argv[1] == "h": h2s = soup.find_all("h2") for h in h2s: print(h.string()) The first print statement (added as a test) works, so I know BS4 itself is working. The second print statement throws: File "sp2gd.py", line 40, in <module> print(h.string()) TypeError: 'NoneType' object is not callable Answer 1: BeautifulSoup's .string is a property, not a callable method, and
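To illustrate the point about .string being a property, here is a minimal sketch using made-up markup (the file from the question is not shown). It reads .string without parentheses and, as an alternative, calls get_text(), which is a method. Note that .string is None when a tag has more than one child, which would explain a 'NoneType' error when it is called.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the file passed via sys.argv
html = "<h2>First</h2><h2>Second <b>bold</b></h2>"
soup = BeautifulSoup(html, "html.parser")

headings = soup.find_all("h2")

# Property access: no parentheses. None when the tag has several children.
strings = [h.string for h in headings]

# Method call: concatenates all text inside the tag, nested tags included.
texts = [h.get_text() for h in headings]

print(strings)
print(texts)
```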

Maintaining the indentation of an XML file when parsed with BeautifulSoup

こ雲淡風輕ζ submitted on 2021-01-28 03:32:30
Question: I am using BS4 to parse an XML file and trying to write it back to a new XML file. Input file: <tag1> <tag2 attr1="a1"> example text </tag2> <tag3> <tag4 attr2="a2"> example text </tag4> <tag5> <tag6 attr3="a3"> example text </tag6> </tag5> </tag3> </tag1> Script: soup = BeautifulSoup(open("input.xml"), "xml") f = open("output.xml", "w") f.write(soup.encode(formatter='minimal')) f.close() Output: <tag1> <tag2 attr1="a1"> example text </tag2> <tag3> <tag4 attr2="a2"> example text </tag4> <tag5

Python Beautiful Soup - Getting input value

百般思念 submitted on 2020-02-08 05:12:47
Question: My plan is to grab the _AntiCsrfToken using Bs4. I have this HTML where my HTML comes from and what I have written in the code is token = soup.find('input', {'name':'_AntiCsrfToken'})['value'] print(token) but it gives me an error saying Traceback (most recent call last): File "C:\Users\HelloWorld.py", line 67, in <module> print(soup.find('input', {'name':'_AntiCsrfToken'})['value']) File "C:\Python\lib\site-packages\bs4\element.py", line 1292, in find l = self.find_all(name,
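A minimal sketch of reading an input's value defensively, assuming a hidden form field like the one described. The markup and the token value abc123 are invented for illustration, since the real page is not shown, and this does not diagnose the truncated traceback above. find() returns None when nothing matches, so the lookup is guarded before reading the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical form markup; the real page is not shown in the question
html = '<form><input type="hidden" name="_AntiCsrfToken" value="abc123"></form>'
soup = BeautifulSoup(html, "html.parser")

field = soup.find("input", {"name": "_AntiCsrfToken"})

# Guard against a missing element, and use .get() so a missing
# attribute yields None instead of raising KeyError.
token = field.get("value") if field is not None else None
print(token)
```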

Python - Extracting data between specific comment nodes with BeautifulSoup 4

旧城冷巷雨未停 submitted on 2020-01-07 03:36:25
Question: I am looking to pick out specific data from a website, such as prices, company info, etc. Luckily, the website designer has put in lots of tags such as <!-- Begin Services Table --> desired data <!-- End Services Table --> What kind of code would I need in order for BS4 to return the strings between the given tags? import requests from bs4 import BeautifulSoup url = "http://www.100ll.com/searchresults.phpclear_previous=true&searchfor="+'KPLN'+"&submit.x=0&submit.y=0" response = requests.get(url) soup
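One possible approach, sketched under the assumption that the begin and end comments are siblings inside the same parent element: locate the opening Comment node, then walk its next_siblings until the closing comment appears. The sample markup below is invented; only the two comment texts come from the question.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page fragment; only the comment texts are from the question
html = """
<div>
<!-- Begin Services Table -->
<table><tr><td>Fuel</td><td>5.10</td></tr></table>
<!-- End Services Table -->
<p>unrelated</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def is_begin(text):
    # Match only the Comment node that opens the section
    return isinstance(text, Comment) and text.strip() == "Begin Services Table"

begin = soup.find(string=is_begin)

collected = []
for node in begin.next_siblings:
    # Stop as soon as the closing comment is reached
    if isinstance(node, Comment) and node.strip() == "End Services Table":
        break
    collected.append(str(node))

between = "".join(collected).strip()
print(between)
```

If the two comments do not share a parent, sibling traversal will not reach the end marker, and a walk over next_elements would be needed instead.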

Cannot figure out what's wrong with beautifulsoup4 in my python 3 script

风格不统一 submitted on 2019-12-24 06:47:50
Question: Traceback (most recent call last): File "urlgrabber.py", line 1, in <module> from bs4 import BeautifulSoup File "/Users/asdf/Desktop/Scraper/bs4/__init__.py", line 29, in <module> from .builder import builder_registry File "/Users/asdf/Desktop/Scraper/bs4/builder/__init__.py", line 4, in <module> from bs4.element import ( File "/Users/asdf/Desktop/Scraper/bs4/element.py", line 5, in <module> from bs4.dammit import EntitySubstitution File "/Users/asdf/Desktop/Scraper/bs4/dammit.py", line 13,

Using findAll in BS4 to create list

天大地大妈咪最大 submitted on 2019-12-24 05:48:32
Question: I'll start by saying I'm fairly new to Python. I've been working on a Slack bot recently and here's where I'm at so far. source = requests.get(url).content soup = BeautifulSoup(source, 'html.parser') price = soup.findAll("a", {"class":"pricing"})["quantity"] Here is the HTML code I am trying to scrape. <a class="pricing" saleprice="240.00" quantity="1" added="2017-01-01"> S </a> <a class="pricing" saleprice="21.00" quantity="5" added="2017-03-14"> M </a> <a class="pricing" saleprice="139
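A minimal sketch of building the list, using only the two complete anchors from the snippet above. find_all() (the newer name for findAll) returns a list-like ResultSet, so the attribute has to be read from each tag rather than from the result as a whole.

```python
from bs4 import BeautifulSoup

html = """
<a class="pricing" saleprice="240.00" quantity="1" added="2017-01-01"> S </a>
<a class="pricing" saleprice="21.00" quantity="5" added="2017-03-14"> M </a>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a", {"class": "pricing"})

# Index each Tag for its attribute; the ResultSet itself has no "quantity"
quantities = [a["quantity"] for a in links]
print(quantities)
```

The values come back as strings, so convert with int() or float() if numbers are needed.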

Bs4 select_one vs find

偶尔善良 submitted on 2019-12-23 10:15:33
Question: I was wondering what the difference is between bs.find('div') and bs.select_one('div'). The same goes for find_all and select. Is there any difference performance-wise, or is either one better to use in specific cases? Answer 1: select() and select_one() give you a different way of navigating an HTML tree: CSS selectors, which have a rich and convenient syntax. The CSS selector support in BeautifulSoup is limited, though it covers most common cases.
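A small side-by-side sketch of the two styles on invented markup. For a bare tag name they find the same elements; the CSS form is mainly useful when the query involves nesting, ids, or classes in one expression. This says nothing about relative performance, which is not measured here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<div id="a"><p class="x">one</p></div><div id="b"><p>two</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Single match: both return the first <div>
first_find = soup.find("div")
first_select = soup.select_one("div")
same_first = first_find["id"] == first_select["id"]

# All matches: both return every <div>
all_find = soup.find_all("div")
all_select = soup.select("div")

# A nested query expressed as a single CSS selector
nested = soup.select_one("div#a > p.x")
print(same_first, len(all_find), len(all_select), nested.get_text())
```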