beautifulsoup

find specific text in beautifulsoup

半城伤御伤魂 submitted on 2020-01-25 17:08:25
Question: I have a specific piece of text I'm trying to get using BeautifulSoup and Python, but I am not sure how to get it using soup.find(). I am trying to obtain only "#1 in Beauty" from the following:
<ul> <li>...<li> <li>...<li> <li id="salesRank"> <b>Amazon Best Sellers Rank:</b> "#1 in Beauty (" <a href="http://www.amazon.com/gp/bestsellers/beauty/ref=pd_dp_ts_k_1"> See top 100</a> ")
Can anyone help me with this?
Answer 1: You need to use the find_all method of soup. Try the code below: import urllib,
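The answer above is cut off, so the following is only a minimal sketch of one possible approach, not necessarily the one the answer goes on to show. It assumes the rank is a bare text node that directly follows the <b> label inside the salesRank item, and it uses a tidied copy of the snippet from the question (closing tags added, dev-tools quote marks dropped).

```python
from bs4 import BeautifulSoup

# Tidied copy of the markup from the question (an assumption: the real
# page may be laid out differently).
html = """
<ul>
  <li>...</li>
  <li id="salesRank">
    <b>Amazon Best Sellers Rank:</b>
    #1 in Beauty (<a href="http://www.amazon.com/gp/bestsellers/beauty/ref=pd_dp_ts_k_1">See top 100</a>)
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the list item by its id, then step from the <b> label to the
# text node that follows it.
item = soup.find("li", id="salesRank")
raw = item.find("b").next_sibling

# Drop surrounding whitespace and the trailing "(" that opens the link.
rank = raw.strip().rstrip("(").strip()
print(rank)
```

If the label and the rank are separated by other tags on the live page, next_sibling would return something else, so it may be worth checking that the result is a string before stripping it.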

Webscrape Multiple Pages with python - output issue

孤者浪人 submitted on 2020-01-25 09:03:20
Question: Happy new year, Python community. I am trying to extract a table from a website using Python and BeautifulSoup4, but I am struggling to see the results in my output files. The code runs smoothly, but nothing is written to the file. My code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
base_url = 'http://www.creationdentreprise.sn/rechercher-une-societe?field_rc_societe_value=&field_ninea_societe_value=&denomination=&field_localite_nid=All&field_siege_societe_value=&field_forme

BeautifulSoup not fetching the Data

久未见 submitted on 2020-01-25 07:57:06
Question: I am trying to fetch data from a website, but the variable soup contains none of the fields I need, such as name, nature of business, telephone, and email. What should I add to the code below to get this data?
import requests
import pandas as pd
from bs4 import BeautifulSoup
page = "http://www.pmas.sg/page/members-directory"
pages = requests.get(page)
soup = BeautifulSoup(pages.content, 'html.parser')
print(soup)
The output I get from the above code is: <!DOCTYPE

Webscraping with Python, I can't see the actual names of classes when I say inspect page

无人久伴 submitted on 2020-01-25 06:41:09
Question: OK, so I am just learning Python and I want to do web scraping. I was following a tutorial in which the tutor's "inspect" page (or whatever it is called) looks completely different from mine. He sees class = "ProfileHeaderCard", whereas I see class = "css-1dbjc4n r-1iusvr4 r-16y2uox r-5f2r5o r-m611by". The important part is that BeautifulSoup does not work when I use my version of the class name, but it works when I use his. When I say print(soup.find('div', {"class

How to crawl for specific links inside a website?

我的梦境 submitted on 2020-01-25 04:14:08
Question: I have successfully crawled the headlines and the links. I would like to replace the Summary column with the main article from each link, since the title and summary are the same anyway.
link = "https://www.vanglaini.org" + article.a['href'] (eg. https://www.vanglaini.org/tualchhung/103834)
Please help me modify my code. My code is below:
import pandas as pd
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

using beautifulsoup 4 for xml causes strange behaviour (memory issues?)

断了今生、忘了曾经 submitted on 2020-01-25 03:37:30
Question: I'm getting strange behaviour with this:
>>> from bs4 import BeautifulSoup
>>> smallfile = 'small.xml' #approx 600bytes
>>> largerfile = 'larger.xml' #approx 2300 bytes
>>> len(BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml']))
1
>>> len(BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml']))
0
Contents of small.xml:
<?xml version="1.0" encoding="us-ascii"?> <Catalog> <CMoverMissile id="HunterSeekerMissile"> <MotionPhases index="1"> <Driver value="Guidance"/> <Acceleration value="3200"/>

How can I crawl web data that is not in tags

Deadly submitted on 2020-01-24 21:30:06
Question:
<div id="main-content" class="content"> <div class="metaline"> <span class="article-meta author">jorden</span> </div> " 1.name:jorden> 2.age:28 -- " <span class="D2"> from 111.111.111.111 </span> </div>
I only need "1.name:jorden" and "2.age:28". xxx.select('#main-content') returns everything, but I only need part of it. Because the text is not inside any tag, I don't know how to get it.
Answer 1: You want to find the tag before the text in question (in your case, <div class="metaline"> ) and then look at
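The answer above is truncated, so what follows is a hedged sketch of the idea it starts to describe: locate the tag that precedes the loose text, then read the sibling node after it. The markup is a simplified copy of the question's snippet (the dev-tools quote marks and the stray ">" are left out), and the "--" separator handling is an assumption.

```python
from bs4 import BeautifulSoup

# Simplified copy of the markup from the question.
html = """
<div id="main-content" class="content">
  <div class="metaline">
    <span class="article-meta author">jorden</span>
  </div>
  1.name:jorden
  2.age:28
  --
  <span class="D2">from 111.111.111.111</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The tag that sits immediately before the untagged text.
metaline = soup.select_one("#main-content div.metaline")

# next_sibling is the bare text node between </div> and <span class="D2">.
text = metaline.next_sibling

# Split into lines, drop blanks and the "--" separator.
lines = [ln.strip() for ln in text.splitlines()]
fields = [ln for ln in lines if ln and ln != "--"]
print(fields)
```

Because next_sibling can also be a tag or None on differently structured pages, checking its type before calling string methods would make this more robust.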

How to extract the strong elements which are in div tag

梦想的初衷 submitted on 2020-01-24 11:34:02
Question: I am new to web scraping and am using Python to scrape the data. Can someone help me extract data from:
<div class="dept"><strong>LENGTH:</strong> 15 credits</div>
My output should be:
LENGTH: 15 credits
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
length=bsObj.findAll("strong")
for leng in length: print(leng.text,leng.next_sibling)
Output:
DELIVERY: Campus
LENGTH: 2 years
OFFERED BY: Olin Business School
but I would like to have only LENGTH.
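One way to keep only the LENGTH entry, sketched under assumptions: the three div elements below are invented sample markup modelled on the single div shown in the question, and the label text is assumed to be exactly "LENGTH:". The sketch selects the one strong tag by its text instead of looping over all of them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the div from the question.
html = """
<div class="dept"><strong>DELIVERY:</strong> Campus</div>
<div class="dept"><strong>LENGTH:</strong> 15 credits</div>
<div class="dept"><strong>OFFERED BY:</strong> Olin Business School</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match only the <strong> tag whose text is exactly "LENGTH:".
label = soup.find("strong", string="LENGTH:")

# Join the label with the text node that follows it; None if not found.
result = f"{label.get_text()}{label.next_sibling}".strip() if label else None
print(result)
```

If the label on the real page carries extra whitespace or different capitalisation, a compiled regular expression can be passed as the string argument in place of the literal.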