beautifulsoup | 易学教程

Scrape a series of tables with BeautifulSoup

Read more about Scrape a series of tables with BeautifulSoup

Question: I am trying to learn about web scraping and Python (and programming, for that matter) and have found the BeautifulSoup library, which seems to offer a lot of possibilities. I am trying to find out how best to pull the pertinent information from this page: http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113 I can go into more detail on this, but basically the company name, its description, contact details, the various company details / statistics, etc. At this stage looking at how
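The excerpt is cut off before any markup is shown, so the following is only a minimal sketch of one common way to walk a series of tables with BeautifulSoup. The HTML string and the `tables_to_rows` helper are hypothetical stand-ins, not the structure of the page linked in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure is not shown in the excerpt.
html = """
<table>
  <tr><th>Company</th><td>Example Pty Ltd</td></tr>
  <tr><th>Phone</th><td>(02) 1234 5678</td></tr>
</table>
<table>
  <tr><th>Employees</th><td>25</td></tr>
</table>
"""


def tables_to_rows(markup):
    """Return one list per <table>, each holding the cell texts of its rows."""
    soup = BeautifulSoup(markup, "html.parser")
    result = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            # Collect header and data cells in document order.
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        result.append(rows)
    return result


tables = tables_to_rows(html)
for rows in tables:
    print(rows)
```

If the real page nests tables inside tables, `find_all("table")` would visit the inner ones twice, so the loop would need to be narrowed to the tables of interest.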

Beautiful Soup find elements having hidden style

Read more about Beautiful Soup find elements having hidden style

Question: My need is simple. How do I find elements that are not currently visible on the webpage? I am guessing style="visibility:hidden" or style="display:none" are simple ways to hide an element, but BeautifulSoup doesn't know whether it's hidden or not. For example, the HTML is: Textbox_Invisible1: <input id="tbi1" type="text" style="visibility:hidden"> Textbox_Invisible2: <input id="tbi2" type="text" class="hidden_elements"> Textbox1: <input id="tb1" type="text"> So my first concern is that BeautifulSoup
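BeautifulSoup only parses markup and does not evaluate stylesheets, so one possible approach is to check the inline `style` attribute and a list of class names that are known to hide elements. The sketch below uses the HTML from the question; the `HIDDEN_CLASSES` set and the `is_hidden` helper are assumptions made for illustration, since the parser cannot tell what the `hidden_elements` class actually does.

```python
import re

from bs4 import BeautifulSoup

html = """
Textbox_Invisible1: <input id="tbi1" type="text" style="visibility:hidden">
Textbox_Invisible2: <input id="tbi2" type="text" class="hidden_elements">
Textbox1: <input id="tb1" type="text">
"""

# Inline styles that hide an element, tolerating spaces around the colon.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Assumption: class names that the site's CSS is known to hide.
HIDDEN_CLASSES = {"hidden_elements"}


def is_hidden(tag):
    """Guess whether a tag is hidden from its own attributes only."""
    if HIDDEN_STYLE.search(tag.get("style", "")):
        return True
    classes = tag.get("class") or []
    return any(name in HIDDEN_CLASSES for name in classes)


soup = BeautifulSoup(html, "html.parser")
hidden_ids = [tag["id"] for tag in soup.find_all("input") if is_hidden(tag)]
print(hidden_ids)
```

This check looks at each element in isolation; an element hidden because an ancestor is hidden, or by a rule in an external stylesheet, would not be detected.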

Find specific link text with bs4

Read more about Find specific link text with bs4

Question: I am trying to scrape a website and find all the headings of a feed. I am having trouble getting just the text of the a tag that I need. Here is an example of the HTML. <td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a> <td class="m" id="b2"><a href="/zXHNvp" id="c2"
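A minimal sketch of one way to pick out only the heading links, using the first cell from the question. It assumes that heading links carry ids of the form `c1`, `c2`, and so on, while the counter links use `x1`, `x2`; that pattern is inferred from the two ids visible in the excerpt and may not hold for the full page.

```python
import re

from bs4 import BeautifulSoup

html = """<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a></td>"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: heading links have ids made of "c" followed by digits.
heading_links = soup.find_all("a", id=re.compile(r"^c\d+$"))
titles = [a.get_text(strip=True) for a in heading_links]
print(titles)
```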

Parsing unclosed `<br>` tags with BeautifulSoup

Read more about Parsing unclosed `<br>` tags with BeautifulSoup

Question: BeautifulSoup has logic for closing consecutive <br> tags that doesn't do quite what I want it to do. For example, >>> from bs4 import BeautifulSoup >>> bs = BeautifulSoup('one<br>two<br>three<br>four') The HTML would render as one two three four I'd like to parse it into a list of strings, ['one','two','three','four'] . BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the <br> elements. >>> bs('br') [<br>two<br>three<br>four</br></br></br>, <br>three<br>four<
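One way to sidestep the question of how `<br>` tags are nested is to ignore the tags and iterate over the text nodes instead. The sketch below assumes the built-in `html.parser` backend is named explicitly; other parsers may build a different tree, as the nested output in the question suggests, but `stripped_strings` visits the text nodes in document order either way.

```python
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which
# parser libraries happen to be installed.
bs = BeautifulSoup("one<br>two<br>three<br>four", "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed.
parts = list(bs.stripped_strings)
print(parts)
```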

Convert HTML to plain text and maintain structure/formatting, with ruby

Read more about Convert HTML to plain text and maintain structure/formatting, with ruby

Question: I'd like to convert HTML to plain text. I don't want to just strip the tags, though; I'd like to intelligently retain as much formatting as possible: inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc. The input is pretty simple, usually well-formatted HTML (not entire documents, just a bunch of content, usually with no anchors or images). I could put together a couple of regexes that get me 80% there but figured there might be some existing solutions with

How to find all text inside <p> elements in an HTML page using BeautifulSoup

Read more about How to find all text inside <p> elements in an HTML page using BeautifulSoup

Question: I need to find all the visible tags inside paragraph elements in an HTML file using BeautifulSoup in Python. For example, <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p> should return: Many hundreds of cultivars exist. P.S. Some files contain Unicode characters (Hindi) which need to be extracted. Any ideas how to do that? Answer 1: Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the
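The answer above is cut off, so its `VALID_TAGS` approach is not reproduced here. As a simpler alternative, the sketch below calls `get_text()` on each paragraph, which returns all of its text, including text inside nested tags such as the link. The Hindi paragraph is sample input added for illustration; Python 3 strings are Unicode, so no extra handling is needed once the file has been decoded correctly.

```python
from bs4 import BeautifulSoup

html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक फल है।</p>'  # sample Hindi paragraph, not from the question
)

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node inside the paragraph.
paragraphs = [p.get_text() for p in soup.find_all("p")]
for text in paragraphs:
    print(text)
```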

Selecting nested element with beautiful soup

Read more about Selecting nested element with beautiful soup

Question: I have the following HTML: <div class="leftColumn"> <div> <div class="static"> text1 <br> text2 <br> (222) 123 - 4567 <br> <div class="summary"> How can I select just the text lines using Beautiful Soup? I've tried a variety of things like: soup.select('.leftColumn div').text but so far no dice. Answer 1: BeautifulSoup's select returns a list, so you must specify the index: soup.select('.leftColumn div')[0].text.split() Answer 2: Mauro's answer is probably more what you wanted, but this is another way to do
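Splitting on whitespace, as in the first answer, also breaks the phone number into pieces. A sketch of an alternative is to select the inner `static` div and iterate over its text nodes. The closing tags in the sample below are an assumption, since the HTML in the question is not closed.

```python
from bs4 import BeautifulSoup

# Closing tags added by assumption; the question's markup is cut off.
html = """
<div class="leftColumn">
  <div>
    <div class="static"> text1 <br> text2 <br> (222) 123 - 4567 <br> </div>
    <div class="summary"></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) instead of a list.
static = soup.select_one(".leftColumn .static")

# One entry per text node, so each line stays in one piece.
lines = list(static.stripped_strings)
print(lines)
```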

How can I format every other line to be merged with the line before it? (In Python)

Read more about How can I format every other line to be merged with the line before it? (In Python)

Question: I have been working with Beautiful Soup to extract data from website APIs for use in a fan site I am building. I have extracted the data into text files; however, I am having trouble formatting it. Charles Dance Lord Tywin Lannister (S 02+) Natalie Dormer Queen Margaery Tyrell (S 02+) Harry Lloyd Viserys Targaryen (S 01) Mark Addy King Robert Baratheon (S 01) Alfie Allen Theon Greyjoy Sean Bean Lord Eddard Stark (S 01) I have several text files like this for shows. I would like to have both the
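The question is cut off, so the exact target format is unknown. Assuming the file alternates between an actor line and a character line, a minimal sketch for joining each pair could look like this; the `merge_pairs` helper and the `" - "` separator are hypothetical choices.

```python
def merge_pairs(lines, sep=" - "):
    """Join every second line onto the line before it."""
    # Drop blank lines and surrounding whitespace first.
    cleaned = [line.strip() for line in lines if line.strip()]
    # Step through the list two items at a time; a trailing unpaired
    # line is kept on its own.
    return [sep.join(cleaned[i:i + 2]) for i in range(0, len(cleaned), 2)]


sample = """Charles Dance
Lord Tywin Lannister (S 02+)
Natalie Dormer
Queen Margaery Tyrell (S 02+)
"""

merged = merge_pairs(sample.splitlines())
for line in merged:
    print(line)
```

For a file on disk, the same function can be given the open file object, since iterating over it yields one line at a time.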

BeautifulSoup - How to find a specific class name alone

Read more about BeautifulSoup - How to find a specific class name alone

Question: How do I find the li tags with a specific class name but no others? For example: ... <li> no wanted </li> <li class="a"> not his one </li> <li class="a z"> neither this one </li> <li class="b z"> neither this one </li> <li class="c z"> neither this one </li> ... <li class="z"> I WANT THIS ONLY ONE</li> ... The code bs4.find_all ('li', class_='z') returns several entries where there is a "z" and another class name. How do I find the entry with the class name "z" alone? Answer 1: You can use CSS
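The answer is cut off after "You can use CSS", so the selector it proposed is not reproduced here. A different route that relies only on how BeautifulSoup stores the `class` attribute (as a list of names) is to compare that list directly. The `<ul>` wrapper in the sample is added for illustration.

```python
from bs4 import BeautifulSoup

html = """<ul>
<li> no wanted </li>
<li class="a"> not his one </li>
<li class="a z"> neither this one </li>
<li class="b z"> neither this one </li>
<li class="c z"> neither this one </li>
<li class="z"> I WANT THIS ONLY ONE</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# The class attribute is parsed into a list, so an element whose only
# class is "z" has exactly the list ["z"]. Tags without a class give None.
only_z = [li for li in soup.find_all("li") if li.get("class") == ["z"]]
texts = [li.get_text(strip=True) for li in only_z]
print(texts)
```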