beautifulsoup

Unable to scrape the text from a certain LI element

Submitted by 喜欢而已 on 2020-01-17 04:01:32
Question: I am scraping this URL. I have to scrape the main content of the page, such as Room Features and Internet Access. Here is my code:

    for h3s in Column:  # suppose this is div.RightColumn
        for index, test in enumerate(h3s.select("h3")):
            print("Feature title: " + str(test.text))
            for v in h3s.select("ul")[index]:
                print(v.string.strip())

This code scrapes all of the <li> elements, but when it comes to scraping Internet Access I get AttributeError: 'NoneType' object has no attribute 'strip', because the <li> data under the …
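The usual cause of that AttributeError is iterating directly over a <ul>'s children, which yields whitespace NavigableString nodes (and <li> elements whose text sits inside nested tags), so .string can be None. Below is a minimal sketch of one way to guard against that; the div.RightColumn selector and the sample HTML are assumptions for illustration, not taken from the original page.

    # Hypothetical example: select <li> tags explicitly and use get_text(),
    # which works even when the text is wrapped in nested tags.
    from bs4 import BeautifulSoup

    html = """
    <div class="RightColumn">
      <h3>Room Features</h3>
      <ul><li>Air conditioning</li><li>Mini bar</li></ul>
      <h3>Internet Access</h3>
      <ul><li><span>Wireless</span> internet in rooms</li></ul>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    column = soup.select_one("div.RightColumn")

    for index, heading in enumerate(column.select("h3")):
        print("Feature title: " + heading.get_text(strip=True))
        # select("li") skips the whitespace text nodes between tags, and
        # get_text() never returns None, unlike .string on a tag with children
        for li in column.select("ul")[index].select("li"):
            print(li.get_text(strip=True))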

How to convert a BeautifulSoup tag to JSON?

Submitted by 廉价感情. on 2020-01-16 19:35:10
Question: I have an element of type bs4.element.Tag, the product of a web scrape. I usually do:

    json.loads(soup.find('script', type='application/ld+json').text)

but on this page the data only appears inside a bare <script> </script>, so I had to do scripts = soup.find_all('script') and step through them until I reached the one that interests me: script = scripts[18]. The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo'], but obviously, being an element of type bs4 …
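One common approach when the data sits inside a bare <script> block is to pull the object out of the JavaScript text with a regular expression and then feed it to json.loads. A minimal sketch, assuming the variable is literally named goodsInfo and contains valid JSON (the sample HTML is made up for illustration):

    # Hypothetical example: extract a JS object assignment and parse it as JSON
    import json
    import re

    from bs4 import BeautifulSoup

    html = '<script>var goodsInfo = {"name": "Widget", "price": 9.99};</script>'
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script')

    # script.string is the raw JavaScript source of the tag
    match = re.search(r'goodsInfo\s*=\s*(\{.*?\})\s*;', script.string, re.S)
    if match:
        data = json.loads(match.group(1))
        print(data['price'])  # 9.99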

Scraping data from a Wikipedia table

Submitted by 我怕爱的太早我们不能终老 on 2020-01-16 16:32:49
Question: I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame. I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".

    import requests
    from bs4 import BeautifulSoup

    website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
    soup = BeautifulSoup(website_url, 'xml')
    print(soup.prettify())

    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_table

    links = My_table.findAll('a')
    links

    Neighbourhood = []
    for link in …
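For a table like this one, pandas can usually parse the HTML directly, which avoids walking the links by hand. A minimal sketch, assuming the postal-code table is the one whose header contains "Borough" (the exact column names on the live page may have changed since the question was asked):

    # Sketch: let pandas parse the wikitable; requires lxml or html5lib installed
    import pandas as pd
    import requests

    url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

    # read_html returns one DataFrame per matching <table>
    tables = pd.read_html(html, match='Borough')
    df = tables[0]
    print(df.head())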

Web scraping with Python

Submitted by 好久不见. on 2020-01-16 12:02:31
Question: How do I parse the table from https://ege.hse.ru/rating/2019/81031971/all/?rlist=&ptype=0&vuz-abiturients-budget-order=ge&vuz-abiturients-budget-val=10 with BeautifulSoup and make a pandas DataFrame? My code:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://ege.hse.ru/rating/2019/81031971/all/?rlist=&ptype=0&vuz-abiturients-budget-order=ge&vuz-abiturients-budget-val=10'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.find_all("table")
    for each_table …
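Since find_all("table") only collects the table tags, the rows still have to be turned into columns. A minimal sketch of one way to do that with BeautifulSoup and pandas, assuming every row of the target table has the same number of cells as its header row:

    # Sketch: walk the first table's rows and build a DataFrame from the cells
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = ('https://ege.hse.ru/rating/2019/81031971/all/'
           '?rlist=&ptype=0&vuz-abiturients-budget-order=ge&vuz-abiturients-budget-val=10')
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    table = soup.find('table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)

    # First row as header, remaining rows as data
    df = pd.DataFrame(rows[1:], columns=rows[0])
    print(df.head())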

Get Info From Script Tag (WebScrap) [duplicate]

Submitted by 烂漫一生 on 2020-01-16 08:56:09
Question: This question already has answers here: How to extract a JSON object that was defined in a HTML page javascript block using Python? (3 answers). Closed 6 months ago.

    # Python code
    from bs4 import BeautifulSoup
    import urllib3

    url = 'https://www. SomeData .com'
    req = urllib3.PoolManager()
    res = req.request('GET', url)
    soup = BeautifulSoup(res.data, 'html.parser')
    res = soup.find_all('script')
    print(res)

Then I got something like this. Results below:

    [ <script> AAA.trackData.taxonomy = { a:"a", b …
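Because the object in that script block uses unquoted JavaScript keys ({ a:"a", ... }), json.loads will not accept it as-is. A minimal sketch of one workaround: extract the assignment with a regex and quote the bare keys before parsing (the sample HTML and the quoting fix-up are illustrative assumptions; a JavaScript-aware parser would be more robust on real pages):

    # Hypothetical example: turn AAA.trackData.taxonomy = {...} into a Python dict
    import json
    import re

    from bs4 import BeautifulSoup

    html = '<script> AAA.trackData.taxonomy = { a:"a", b:"b" }; </script>'
    soup = BeautifulSoup(html, 'html.parser')
    script_text = soup.find('script').string

    match = re.search(r'AAA\.trackData\.taxonomy\s*=\s*(\{.*?\})\s*;', script_text, re.S)
    raw = match.group(1)

    # JSON needs quoted keys, so wrap the bare identifiers in double quotes
    quoted = re.sub(r'([{,]\s*)(\w+)\s*:', r'\1"\2":', raw)
    data = json.loads(quoted)
    print(data['b'])  # b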

Python Memory Issue with BeautifulSoup

Submitted by 你离开我真会死。 on 2020-01-16 06:06:18
Question: I've resolved this issue, but I'm wondering why it was caused in the first place. I used BeautifulSoup to identify this span from a webpage:

    span = <span id="ctl00_ContentPlaceHolder1_RestInfoReskin_lblRestName">Ally's Sizzlers</span>

I then assign this variable:

    restaurant.name = span.contents

However, on each loop this takes up a full 1 MB, and there are about 20,000 loops. Through trial and error I came upon this solution:

    restaurant.name = str(span.contents)

Can you tell me why the former …
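A plausible explanation (hedged, since the full program is not shown): span.contents is a list of Tag/NavigableString objects that still hold references back into the whole parse tree, so keeping them alive keeps every soup from being garbage-collected, while str() or get_text() copies the text into a plain string and drops that reference. A small sketch of the difference:

    # Sketch: .contents keeps nodes tied to the tree, a plain str copy does not
    from bs4 import BeautifulSoup

    html = '<span id="name"><b>Ally\'s</b> Sizzlers</span>'
    soup = BeautifulSoup(html, 'html.parser')
    span = soup.find('span')

    kept = span.contents         # list of nodes from the parse tree
    print(kept[1].parent.name)   # 'span' -- still points back into the soup

    copied = span.get_text()     # plain Python str, no link to the tree
    print(copied)                # "Ally's Sizzlers"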

Reading <content:encoded> tags using BeautifulSoup 4

Submitted by 北战南征 on 2020-01-16 03:27:11
Question: I'm using BeautifulSoup 4 (bs4) to read an XML RSS feed and have come across the following entry. I'm trying to read the content enclosed in the <content:encoded><![CDATA[...]]></content:encoded> tag:

    <item>
      <title>Foobartitle</title>
      <link>http://www.acme.com/blah/blah.html</link>
      <category><![CDATA[mycategory]]></category>
      <description><![CDATA[The quick brown fox jumps over the lazy dog]]></description>
      <content:encoded>
        <![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg" …
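One way to read that element is to parse the feed with the XML parser (which needs lxml installed); the CDATA payload then comes back as the tag's text. Whether the content: prefix survives in the tag name depends on the parser and on the namespace declaration, hence the fallback lookup below. The feed snippet is a trimmed assumption based on the fragment above:

    # Sketch: read the CDATA body of <content:encoded> from an RSS item
    from bs4 import BeautifulSoup

    rss = """<?xml version="1.0"?>
    <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
      <channel>
        <item>
          <title>Foobartitle</title>
          <content:encoded><![CDATA[<p><img class="feature" src="http://www.acme.com/images/image.jpg"/></p>]]></content:encoded>
        </item>
      </channel>
    </rss>"""

    soup = BeautifulSoup(rss, "xml")
    for item in soup.find_all("item"):
        encoded = item.find("content:encoded") or item.find("encoded")
        print(encoded.text)   # the HTML that was wrapped in CDATA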

bs4 again from website and save to text file

Submitted by 久未见 on 2020-01-15 15:38:46
Question: I am learning how to extract data from websites and have managed to get a lot of information. However, on my next website I am failing for some unknown reason: nothing is saved to the text files, nor do I get any output from print. Here is my piece of code:

    import json
    import urllib.request
    from bs4 import BeautifulSoup
    import requests

    url = 'https://www.jaffari.org/'
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib.request.urlopen(request) …
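Since the snippet is cut off before anything is written, here is a hedged sketch of the overall pattern: fetch with a browser-like User-Agent, parse, then write what was found to a text file. Which elements the original script actually extracts is unknown, so collecting link texts here is only a stand-in:

    # Hypothetical example: fetch, parse, and save results so an empty file
    # immediately shows that the selector matched nothing
    import urllib.request

    from bs4 import BeautifulSoup

    url = 'https://www.jaffari.org/'
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request) as response:
        html = response.read()

    soup = BeautifulSoup(html, 'html.parser')
    texts = [a.get_text(strip=True) for a in soup.find_all('a') if a.get_text(strip=True)]

    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(texts))

    print('wrote', len(texts), 'lines')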