beautifulsoup

Scrape a series of tables with BeautifulSoup

我的梦境 submitted on 2020-01-02 07:03:06
Question: I am trying to learn about web scraping and Python (and programming, for that matter) and have found the BeautifulSoup library, which seems to offer a lot of possibilities. I am trying to find out how best to pull the pertinent information from this page: http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113 I can go into more detail on this, but basically I want the company name, its description, contact details, the various company details and statistics, etc. At this stage looking at how

Beautiful Soup find elements having hidden style

ε祈祈猫儿з submitted on 2020-01-02 06:58:32
Question: My need is simple. How do I find elements that are not currently visible on the web page? I am guessing style="visibility:hidden" or style="display:none" are simple ways to hide an element, but BeautifulSoup doesn't know whether an element is hidden or not. For example, the HTML is:

    Textbox_Invisible1: <input id="tbi1" type="text" style="visibility:hidden">
    Textbox_Invisible2: <input id="tbi2" type="text" class="hidden_elements">
    Textbox1: <input id="tb1" type="text">

So my first concern is that BeautifulSoup
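
One possible approach, sketched here as an illustration rather than a definitive answer: BeautifulSoup only sees the markup, so "hidden" has to be inferred from the `style` attribute or from class names you already know hide things. The `HIDDEN_CLASSES` set and the `is_hidden` helper below are assumptions made for this example; styles applied through external CSS or JavaScript would not be detected this way.

```python
import re
from bs4 import BeautifulSoup

html = """
Textbox_Invisible1: <input id="tbi1" type="text" style="visibility:hidden">
Textbox_Invisible2: <input id="tbi2" type="text" class="hidden_elements">
Textbox1: <input id="tb1" type="text">
"""

# Inline styles that hide an element; whitespace around ':' is tolerated.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

# Assumption: these are the CSS classes known (from the site's stylesheet) to hide elements.
HIDDEN_CLASSES = {"hidden_elements"}


def is_hidden(tag):
    """Return True if the tag is hidden by an inline style or a known class."""
    style = tag.get("style", "")
    classes = set(tag.get("class", []))
    return bool(HIDDEN_STYLE.search(style)) or bool(classes & HIDDEN_CLASSES)


soup = BeautifulSoup(html, "html.parser")
hidden_ids = [tag["id"] for tag in soup.find_all(is_hidden)]
print(hidden_ids)
```

Passing a function to `find_all` lets the filter look at several attributes at once, which a single keyword argument cannot do.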

Find specific link text with bs4

生来就可爱ヽ(ⅴ<●) submitted on 2020-01-02 05:23:46
Question: I am trying to scrape a website and find all the headings of a feed. I am having trouble getting just the text of the a tag that I need. Here is an example of the HTML:

    <td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a>
    <td class="m" id="b2"><a href="/zXHNvp" id="c2"
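
A minimal sketch of one way to pick out only the title links, assuming (from the sample above) that the wanted `<a>` tags have ids of the form `c1`, `c2`, ... while the counter links use `x1`, `x2`, .... The HTML here is a shortened, closed version of the first cell only; the `onClick` handlers are omitted for brevity.

```python
import re
from bs4 import BeautifulSoup

html = """
<table><tr>
<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank">TF4 - Oreos</a>
<a href="#" id="x1"><font class="bp">(0)</font></a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose id is "c" followed by digits (assumed naming scheme).
title_links = soup.find_all("a", id=re.compile(r"^c\d+$"))
titles = [a.get_text(strip=True) for a in title_links]
print(titles)
```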

Parsing unclosed `<br>` tags with BeautifulSoup

元气小坏坏 submitted on 2020-01-02 05:20:14
Question: BeautifulSoup has logic for closing consecutive <br> tags that doesn't do quite what I want it to do. For example,

    >>> from bs4 import BeautifulSoup
    >>> bs = BeautifulSoup('one<br>two<br>three<br>four')

The HTML would render as

    one
    two
    three
    four

I'd like to parse it into a list of strings, ['one','two','three','four']. BeautifulSoup's tag-closing logic means that I get nested tags when I ask for all the <br> elements.

    >>> bs('br')
    [<br>two<br>three<br>four</br></br></br>, <br>three<br>four<
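
As a hedged sketch of one way around the nesting: rather than asking for the `<br>` elements, iterate over the text nodes with `stripped_strings`, which yields each piece of text regardless of how the tags end up nested. The explicit `"html.parser"` argument is an assumption for this example; how `<br>` is closed depends on which parser BeautifulSoup is using.

```python
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup("one<br>two<br>three<br>four", "html.parser")

# stripped_strings walks every text node and strips surrounding whitespace.
parts = list(soup.stripped_strings)
print(parts)  # ['one', 'two', 'three', 'four']
```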

Convert HTML to plain text and maintain structure/formatting, with Ruby

放肆的年华 submitted on 2020-01-02 04:36:05
Question: I'd like to convert HTML to plain text. I don't want to just strip the tags, though; I'd like to intelligently retain as much formatting as possible: inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc. The input is pretty simple, usually well-formatted HTML (not entire documents, just a bunch of content, usually with no anchors or images). I could put together a couple of regexes that get me 80% there but figured there might be some existing solutions with

How to find all text inside <p> elements in an HTML page using BeautifulSoup

杀马特。学长 韩版系。学妹 submitted on 2020-01-01 19:38:34
Question: I need to find all the visible tags inside paragraph elements in an HTML file using BeautifulSoup in Python. For example,

    <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>

should return: Many hundreds of cultivars exist. P.S. Some files contain Unicode characters (Hindi) which need to be extracted. Any ideas how to do that?

Answer 1: Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the
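
A simpler alternative to the tag-filtering approach in the answer, shown only as a sketch: `get_text()` on each `<p>` returns all of its text with nested tags dropped. Note that for the sample paragraph this yields the full sentence, including "named mango". The Hindi paragraph below is a made-up example, included to show that Unicode text needs no special handling in Python 3.

```python
from bs4 import BeautifulSoup

html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक <b>फल</b> है।</p>'  # hypothetical Hindi paragraph for illustration
)

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node under the tag, discarding the markup.
paragraphs = [p.get_text() for p in soup.find_all("p")]
for text in paragraphs:
    print(text)
```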

How to find all text inside <p> elements in an HTML page using BeautifulSoup

感情迁移 submitted on 2020-01-01 19:38:06
Question: I need to find all the visible tags inside paragraph elements in an HTML file using BeautifulSoup in Python. For example,

    <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>

should return: Many hundreds of cultivars exist. P.S. Some files contain Unicode characters (Hindi) which need to be extracted. Any ideas how to do that?

Answer 1: Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the

Selecting nested element with beautiful soup

こ雲淡風輕ζ submitted on 2020-01-01 19:29:11
Question: I have the following HTML:

    <div class="leftColumn"> <div> <div class="static"> text1 <br> text2 <br> (222) 123 - 4567 <br> <div class="summary">

How can I select just the text lines using Beautiful Soup? I've tried a variety of things like soup.select('.leftColumn div').text but so far no dice.

Answer 1: BeautifulSoup's select returns a list, so you must specify an index: soup.select('.leftColumn div')[0].text.split()

Answer 2: Mauro's answer is probably more what you wanted, but this is another way to do
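
A further sketch, not taken from either answer: `select_one` returns a single element instead of a list, and restricting the search to direct text children keeps the lines intact while leaving out text from the nested `summary` div. The closing tags and the "summary text" content are assumptions added to make the fragment complete.

```python
from bs4 import BeautifulSoup

html = """
<div class="leftColumn">
  <div>
    <div class="static">
      text1 <br> text2 <br> (222) 123 - 4567 <br>
      <div class="summary">summary text</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None) rather than a list.
static = soup.select_one(".leftColumn .static")

# recursive=False limits the search to direct children, so div.summary is skipped.
lines = [s.strip() for s in static.find_all(string=True, recursive=False) if s.strip()]
print(lines)  # ['text1', 'text2', '(222) 123 - 4567']
```

Unlike `.text.split()`, this does not break the phone number into separate tokens.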

How can I format every other line to be merged with the line before it? (In Python)

你说的曾经没有我的故事 submitted on 2020-01-01 19:22:13
Question: I have been working with Beautiful Soup to extract data from website APIs for use in a fan site I am building. I have extracted the data into text files; however, I am having trouble formatting it.

    Charles Dance
    Lord Tywin Lannister (S 02+)
    Natalie Dormer
    Queen Margaery Tyrell (S 02+)
    Harry Lloyd
    Viserys Targaryen (S 01)
    Mark Addy
    King Robert Baratheon (S 01)
    Alfie Allen
    Theon Greyjoy
    Sean Bean
    Lord Eddard Stark (S 01)

I have several text files like this for shows. I would like to have both the
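
A minimal sketch of merging every other line with the one before it, assuming the file strictly alternates between an actor line and a role line. The `" - "` separator is an arbitrary choice for illustration, since the desired output format is cut off in the question.

```python
lines = [
    "Charles Dance",
    "Lord Tywin Lannister (S 02+)",
    "Natalie Dormer",
    "Queen Margaery Tyrell (S 02+)",
    "Alfie Allen",
    "Theon Greyjoy",
]

# lines[0::2] are the actor lines, lines[1::2] the role lines that follow them.
merged = [f"{actor} - {role}" for actor, role in zip(lines[0::2], lines[1::2])]

for row in merged:
    print(row)
```

When reading from a file, the same pairing works on `[line.strip() for line in f if line.strip()]`, which also drops blank lines first.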

BeautifulSoup - How to find a specific class name alone

荒凉一梦 submitted on 2020-01-01 18:56:14
Question: How do I find the li tags that have one specific class name and no others? For example:

    ...
    <li> no wanted </li>
    <li class="a"> not his one </li>
    <li class="a z"> neither this one </li>
    <li class="b z"> neither this one </li>
    <li class="c z"> neither this one </li>
    ...
    <li class="z"> I WANT THIS ONLY ONE</li>
    ...

The code bs4.find_all ('li', class_='z') returns several entries where there is a "z" and another class name. How do I find the entry whose only class name is "z"?

Answer 1: You can use CSS
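
Two ways this might be done, given as a sketch rather than as the answer's exact code: an attribute selector that compares the whole `class` attribute value, and a function filter that compares the parsed class list. The surrounding `<ul>` is an assumption added to make the fragment complete.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li> no wanted </li>
  <li class="a"> not his one </li>
  <li class="a z"> neither this one </li>
  <li class="b z"> neither this one </li>
  <li class="c z"> neither this one </li>
  <li class="z"> I WANT THIS ONLY ONE</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# class_="z" matches any <li> that has "z" among its classes.
loose = soup.find_all("li", class_="z")

# [class="z"] compares the entire attribute value, so "a z" does not match.
via_css = soup.select('li[class="z"]')

# BeautifulSoup stores class as a list, so an exact match is a list comparison.
via_filter = soup.find_all(lambda tag: tag.name == "li" and tag.get("class") == ["z"])

print(len(loose), len(via_css), len(via_filter))
print(via_css[0].get_text(strip=True))
```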