html-parsing

Get the html under a tag using htmlparser python

六月ゝ 毕业季﹏ submitted on 2019-12-19 09:08:16
Question: I want to get the whole HTML under a tag using HTMLParser. Currently I can only get the data between the tags. My code follows:

    class LinksParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.recording = 0
            self.data = ''

        def handle_starttag(self, tag, attributes):
            if tag != 'span':
                return
            if self.recording:
                self.recording += 1
                return
            for name, value in attributes:
                if name == 'itemprop' and value == 'description':
                    break
            else:
                return
            self.recording = 1

        def handle
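One possible way to keep the markup as well as the text is to re-emit each tag while recording. The sketch below assumes Python 3's `html.parser` and the same `<span itemprop="description">` target as the question; the names `InnerHTMLParser` and `inner_html` are hypothetical, and end tags are rebuilt in lowercase rather than copied from the input.

```python
from html.parser import HTMLParser


class InnerHTMLParser(HTMLParser):
    """Collect the markup nested inside <span itemprop="description">."""

    def __init__(self):
        # Keep character references as written so they are re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # nesting depth of <span> while recording
        self.chunks = []  # pieces of the captured markup

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already recording: keep the tag text and track nested spans.
            self.chunks.append(self.get_starttag_text())
            if tag == 'span':
                self.depth += 1
        elif tag == 'span' and ('itemprop', 'description') in attrs:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never get a matching end tag.
        if self.depth:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == 'span':
            self.depth -= 1
            if self.depth == 0:
                return  # closing tag of the target span itself
        self.chunks.append('</%s>' % tag)

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.chunks.append('&%s;' % name)

    def handle_charref(self, name):
        if self.depth:
            self.chunks.append('&#%s;' % name)


def inner_html(markup):
    """Return the markup found inside the target span, or '' if absent."""
    parser = InnerHTMLParser()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.chunks)
```

Calling `inner_html(page_source)` returns the nested tags and text as one string; the depth counter plays the same role as `self.recording` in the question's code.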

Python: Suppressing errors from going to commandline?

别来无恙 submitted on 2019-12-19 07:32:13
Question: When I try to execute a Python program from the command line, it gives the following error. These errors do not cause any problem in my output, and I don't want them displayed on the command line.

    Traceback (most recent call last):
      File "test.py", line 88, in <module>
        p.feed(ht)
      File "/usr/lib/python2.5/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.5/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.5/HTMLParser.py", line 226, in parse
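Since the traceback starts at `p.feed(ht)`, one way to keep it off the console is to catch the exception around that call and record it quietly. This is only a sketch for Python 3: `feed_quietly` and `ExplodingParser` are hypothetical names, and the broad `except Exception` is an assumption, since the excerpt does not show which exception was raised.

```python
import logging
from html.parser import HTMLParser

logger = logging.getLogger(__name__)


class ExplodingParser(HTMLParser):
    """Hypothetical parser that fails on <blink>, used only to show the error path."""

    def handle_starttag(self, tag, attrs):
        if tag == 'blink':
            raise ValueError('unsupported tag: %s' % tag)


def feed_quietly(parser, markup):
    """Feed markup to parser; return True on success, False if parsing raised."""
    try:
        parser.feed(markup)
        parser.close()
    except Exception as exc:  # deliberately broad: any parse failure is non-fatal here
        # Recorded at DEBUG level, so nothing is printed unless logging is configured.
        logger.debug('HTML parsing failed: %s', exc)
        return False
    return True
```

Catching the narrowest exception type that actually occurs is usually preferable, because a blanket handler can also hide real bugs.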

Correct HTML mark-up syntax? (to remove whitespace between inline-block elements) [duplicate]

我的未来我决定 submitted on 2019-12-19 06:33:47
Question: This question already has answers here: How do I remove the space between inline-block elements? (39 answers). Closed 10 months ago.

When HTML code is not 'beautified', it looks like

    <div><img src="img1.jpg"/><img src="img2.jpg"/></div>

and the pictures are rendered as

    |=||=|   // no gap between

But after running it through the beautifier at http://ctrlq.org/beautifier/

    <div>
        <img src="img1.jpg"/>
        <img src="img2.jpg"/>
    </div>

they are rendered like this:

    |=| |=|   // gap (space) between

So, same code rendered

Correct HTML mark-up syntax? (to remove whitespace between inline-block elements) [duplicate]

巧了我就是萌 submitted on 2019-12-19 06:33:46
Question: This question already has answers here: How do I remove the space between inline-block elements? (39 answers). Closed 10 months ago.

When HTML code is not 'beautified', it looks like

    <div><img src="img1.jpg"/><img src="img2.jpg"/></div>

and the pictures are rendered as

    |=||=|   // no gap between

But after running it through the beautifier at http://ctrlq.org/beautifier/

    <div>
        <img src="img1.jpg"/>
        <img src="img2.jpg"/>
    </div>

they are rendered like this:

    |=| |=|   // gap (space) between

So, same code rendered

Correct HTML mark-up syntax? (to remove whitespace between inline-block elements) [duplicate]

那年仲夏 submitted on 2019-12-19 06:33:05
Question: This question already has answers here: How do I remove the space between inline-block elements? (39 answers). Closed 10 months ago.

When HTML code is not 'beautified', it looks like

    <div><img src="img1.jpg"/><img src="img2.jpg"/></div>

and the pictures are rendered as

    |=||=|   // no gap between

But after running it through the beautifier at http://ctrlq.org/beautifier/

    <div>
        <img src="img1.jpg"/>
        <img src="img2.jpg"/>
    </div>

they are rendered like this:

    |=| |=|   // gap (space) between

So, same code rendered

PhotoSwipe: edit parseThumbnailElements function to parse additional markup element

被刻印的时光 ゝ submitted on 2019-12-19 05:16:32
Question: Using PhotoSwipe, the thumbnail gallery markup looks like this:

    <div class="wrap clearfix">
      <div class="my-gallery" itemscope itemtype="http://schema.org/ImageGallery">
        <ul class="gallery-grid">
          <li>
            <figure itemprop="associatedMedia" itemscope itemtype="http://schema.org/ImageObject">
              <a href="img/dektop/1.jpg" itemprop="contentUrl" data-size="1200x1200">
                <img src="img/thumb/1.jpg" itemprop="thumbnail" alt="Image description" />
              </a>
              <figcaption itemprop="caption description">Image caption 1<

Use BeautifulSoup to get a value after a specific tag

谁说我不能喝 submitted on 2019-12-19 03:13:12
Question: I'm having a very hard time getting BeautifulSoup to scrape some data for me. What's the best way to access the date (the actual number, 2008) from this code sample? It's my first time using BeautifulSoup. I've figured out how to scrape URLs off the page, but I can't quite narrow it down to select only the word Date and then return only the numeric date that follows (in the dd tags). Is what I'm asking even possible?

    <div class='dl_item_container clearfix detail_date'>
      <dt>Date<
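The label-then-value lookup can be sketched as pairing each `dt` with the `dd` that follows it. The example below uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The names `DefinitionParser` and `definition_pairs` are hypothetical, and the sample markup in the usage note is an assumption, because the excerpt is cut off after `<dt>Date<`.

```python
from html.parser import HTMLParser


class DefinitionParser(HTMLParser):
    """Map each <dt> label to the text of the <dd> that follows it."""

    def __init__(self):
        super().__init__()
        self.pairs = {}      # label -> value
        self.current = None  # 'dt' or 'dd' while inside one of them
        self.label = None    # most recent <dt> text still waiting for a <dd>
        self.buffer = []     # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        # Collapse runs of whitespace inside the element's text.
        text = ' '.join(''.join(self.buffer).split())
        if tag == 'dt':
            self.label = text
        elif self.label is not None:
            self.pairs[self.label] = text
            self.label = None
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)


def definition_pairs(markup):
    """Return a dict of dt text to dd text found in markup."""
    parser = DefinitionParser()
    parser.feed(markup)
    parser.close()
    return parser.pairs
```

With assumed input such as `<dt>Date</dt><dd>2008</dd>`, `definition_pairs(markup).get('Date')` gives the text of the matching `dd`.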

Remove attributes using HtmlAgilityPack

血红的双手。 submitted on 2019-12-18 18:48:28
Question: I'm trying to create a code snippet that removes all style attributes, regardless of tag, using HtmlAgilityPack. Here's my code:

    var elements = htmlDoc.DocumentNode.SelectNodes("//*");
    if (elements != null)
    {
        foreach (var element in elements)
        {
            element.Attributes.Remove("style");
        }
    }

However, I can't get it to stick. If I look at the element object immediately after Remove("style"), I can see that the style attribute has been removed, but it still appears in the DocumentNode object. I'm

Parsing nested HTML list with BeautifulSoup

心不动则不痛 submitted on 2019-12-18 13:36:30
Question: I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:

    <ul>
      <li>Operating System
        <ul>
          <li>Linux
            <ul>
              <li>Debian</li>
              <li>Fedora</li>
              <li>Ubuntu</li>
            </ul>
          </li>
          <li>Windows</li>
          <li>OS X</li>
        </ul>
      </li>
      <li>Programming Languages
        <ul>
          <li>Python</li>
          <li>C#</li>
          <li>Ruby</li>
        </ul>
      </li>
    </ul>

I want to convert it to a dict like this:

    {
        'Operating System': {
            'Linux': {
                'Debian': None,
                'Fedora': None,
                'Ubuntu': None,
            },
            'Windows': None,
            'OS X': None,
        },
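The conversion can be sketched with two stacks: one of dicts for the open `ul` elements and one of labels for the open `li` elements. This version uses the standard library's `html.parser` instead of BeautifulSoup, and `NestedListParser` and `parse_nested_list` are hypothetical names. It assumes each `li` starts with its label text and that every `li` is explicitly closed, as in the question's list.

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Turn nested <ul>/<li> markup into a parent-child dict."""

    def __init__(self):
        super().__init__()
        self.root = {}
        self.dicts = []  # one dict per open <ul>, innermost last
        self.keys = []   # label of each open <li>, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == 'ul':
            if self.keys and self.dicts:
                # A <ul> inside an <li>: that item becomes a parent.
                child = {}
                self.dicts[-1][self.keys[-1]] = child
            else:
                child = self.root
            self.dicts.append(child)
        elif tag == 'li':
            self.keys.append(None)  # label not seen yet

    def handle_endtag(self, tag):
        if tag == 'ul' and self.dicts:
            self.dicts.pop()
        elif tag == 'li' and self.keys:
            self.keys.pop()

    def handle_data(self, data):
        text = data.strip()
        # The first non-blank text inside an <li> is taken as its label.
        if text and self.dicts and self.keys and self.keys[-1] is None:
            self.keys[-1] = text
            self.dicts[-1][text] = None  # leaf until a nested <ul> appears


def parse_nested_list(markup):
    """Return the nested dict described by the <ul>/<li> markup."""
    parser = NestedListParser()
    parser.feed(markup)
    parser.close()
    return parser.root
```

Leaves are stored as `None` first and replaced by a dict only when a nested `ul` opens, which matches the shape of the dict the question asks for.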

BeautifulSoup - easy way to to obtain HTML-free contents

帅比萌擦擦* submitted on 2019-12-18 13:23:12
Question: I'm using this code to find all interesting links in a page:

    soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+'))

It does its job pretty well. Unfortunately, inside that a tag there are a lot of nested tags, like font, b and others. I'd like to get just the text content, without any other HTML tags. Example of a link:

    <A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:  <font color=green>CCS Ingegneria Elettronica
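Getting only the text amounts to keeping the character data and discarding every tag. The sketch below does this with the standard library's `html.parser` rather than BeautifulSoup; `TextExtractor` and `strip_tags` are hypothetical names, and collapsing runs of whitespace into single spaces is an added assumption about the desired output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep only the character data of a fragment, dropping every tag."""

    def __init__(self):
        # The default convert_charrefs=True decodes references such as &amp;.
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of markup with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ' '.join(''.join(parser.parts).split())
```

Nested `font` and `b` tags need no special handling here, because the parser reports their text through `handle_data` regardless of depth.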