html-parsing | 易学教程

Get the html under a tag using htmlparser python

Question: I want to get the whole HTML under a tag using HTMLParser. Currently I am able to get the data between the tags. The following is my code:

    class LinksParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.recording = 0
            self.data = ''

        def handle_starttag(self, tag, attributes):
            if tag != 'span':
                return
            if self.recording:
                self.recording += 1
                return
            for name, value in attributes:
                if name == 'itemprop' and value == 'description':
                    break
            else:
                return
            self.recording = 1

        def handle
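One way to keep the markup as well as the text is to echo each nested tag back while the depth counter is non-zero. The following is a minimal sketch in Python 3 (`html.parser`), not the asker's finished code: the class name `InnerHTMLParser`, the `parts` list, and the sample input are assumptions made for illustration. It relies on `get_starttag_text()`, which returns the most recently opened start tag as written in the source.

```python
from html.parser import HTMLParser


class InnerHTMLParser(HTMLParser):
    """Collect the markup nested inside <span itemprop="description">."""

    def __init__(self):
        # Leave character references (&amp; etc.) unconverted so they can be echoed.
        super().__init__(convert_charrefs=False)
        self.depth = 0     # nesting depth of <span> tags while recording
        self.parts = []    # pieces of captured markup

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Inside the target: echo the start tag exactly as written.
            self.parts.append(self.get_starttag_text())
            if tag == 'span':
                self.depth += 1
        elif tag == 'span' and ('itemprop', 'description') in attrs:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == 'span':
            self.depth -= 1
            if self.depth == 0:
                return  # closing tag of the target itself; do not echo it
        self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if self.depth:
            self.parts.append('&#%s;' % name)


# Hypothetical input, made up for this sketch.
parser = InnerHTMLParser()
parser.feed('<div><span itemprop="description">Hi <b>there</b> '
            '<span>x</span> &amp; <br/>bye</span> tail</div>')
inner_html = ''.join(parser.parts)
print(inner_html)
```

End tags are rebuilt from the tag name, so their original casing or spacing is not preserved; only start tags are copied verbatim.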

Python: Suppressing errors from going to commandline?

Question: When I try to execute a Python program from the command line, it gives the following error. These errors do not cause any problem with my output, and I don't want them to be displayed on the command line.

    Traceback (most recent call last):
      File "test.py", line 88, in <module>
        p.feed(ht)
      File "/usr/lib/python2.5/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.5/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.5/HTMLParser.py", line 226, in parse
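A traceback like this comes from an uncaught exception, so one option is to catch it around the call that raises and keep the details somewhere other than the terminal. The sketch below is only an illustration under assumptions: `feed_quietly` and `strict_feed` are hypothetical names, and `strict_feed` merely stands in for `p.feed(ht)` so that the example runs on its own. Redirecting standard error in the shell (for example `python test.py 2>/dev/null` on Unix-like systems) hides the message without touching the code, at the cost of hiding every other error too.

```python
import traceback


def feed_quietly(feed, markup, error_log):
    """Call feed(markup); on failure, record the traceback instead of printing it.

    Returns True if the call succeeded, False otherwise.
    """
    try:
        feed(markup)
        return True
    except Exception:
        # Keep the details for later inspection rather than discarding them.
        error_log.append(traceback.format_exc())
        return False


def strict_feed(markup):
    # Hypothetical stand-in for p.feed(ht): rejects an unterminated tag.
    if markup.count('<') != markup.count('>'):
        raise ValueError('malformed start tag')


errors = []
ok_good = feed_quietly(strict_feed, '<p>fine</p>', errors)
ok_bad = feed_quietly(strict_feed, '<p class="x"', errors)
print(ok_good, ok_bad, len(errors))
```

Catching a narrower exception type than `Exception` is usually preferable once the failing call and its error class are known.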

Correct HTML mark-up syntax? (to remove whitespace between inline-block elements) [duplicate]

Question: This question already has answers here: How do I remove the space between inline-block elements? (39 answers). Closed 10 months ago.

When HTML code is not 'beautified', it looks like

    <div><img src="img1.jpg"/><img src="img2.jpg"/></div>

and the pictures are rendered as

    |=||=|   // no gap between

But after running it through a beautifier (http://ctrlq.org/beautifier/)

    <div>
      <img src="img1.jpg"/>
      <img src="img2.jpg"/>
    </div>

they are rendered like this

    |=| |=|   // gap (space) between

So, same code rendered

PhotoSwipe: edit parseThumbnailElements function to parse additional markup element

Question: Using PhotoSwipe, the thumbnail gallery markup looks like this:

    <div class="wrap clearfix">
      <div class="my-gallery" itemscope itemtype="http://schema.org/ImageGallery">
        <ul class="gallery-grid">
          <li>
            <figure itemprop="associatedMedia" itemscope itemtype="http://schema.org/ImageObject">
              <a href="img/dektop/1.jpg" itemprop="contentUrl" data-size="1200x1200">
                <img src="img/thumb/1.jpg" itemprop="thumbnail" alt="Image description" />
              </a>
              <figcaption itemprop="caption description">Image caption 1<

Use BeautifulSoup to get a value after a specific tag

Question: I'm having a very hard time getting BeautifulSoup to scrape some data for me. What's the best way to access the date (the actual numbers, 2008) from this code sample? It's my first time using BeautifulSoup. I've figured out how to scrape URLs off the page, but I can't quite narrow it down to select only the word Date and then return only the numeric date that follows (in the dd tags). Is what I'm asking even possible?

    <div class='dl_item_container clearfix detail_date'> <dt>Date<
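The idea of "find the dt whose text is Date, then read the dd that follows" can be sketched with the standard library alone. Note that this swaps in `html.parser` for BeautifulSoup so the example has no third-party dependency; with BeautifulSoup 4 the same lookup is usually done by locating the `dt` element and calling `find_next_sibling('dd')` on it. The class name `DefinitionParser` and the sample markup are assumptions: the original sample is cut off after `<dt>Date<`, so the `dd` element below is made up.

```python
from html.parser import HTMLParser


class DefinitionParser(HTMLParser):
    """Map each <dt> label to the text of the <dd> that follows it."""

    def __init__(self):
        super().__init__()
        self.current = None   # 'dt' or 'dd' while inside one of those tags
        self.label = None     # text of the most recent <dt>
        self.pairs = {}       # label -> value
        self.buffer = []      # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return
        text = ''.join(self.buffer).strip()
        if tag == 'dt':
            self.label = text
        elif self.label is not None:
            self.pairs[self.label] = text
        self.current = None


# Hypothetical markup: only the opening <div> and "<dt>Date" come from the question.
sample = """
<div class='dl_item_container clearfix detail_date'>
  <dt>Date</dt>
  <dd>2008</dd>
</div>
"""
parser = DefinitionParser()
parser.feed(sample)
date_value = parser.pairs.get('Date')
print(date_value)
```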

Remove attributes using HtmlAgilityPack

Question: I'm trying to create a code snippet that removes all style attributes, regardless of tag, using HtmlAgilityPack. Here's my code:

    var elements = htmlDoc.DocumentNode.SelectNodes("//*");
    if (elements != null)
    {
        foreach (var element in elements)
        {
            element.Attributes.Remove("style");
        }
    }

However, the change does not stick. If I look at the element object immediately after Remove("style"), I can see that the style attribute has been removed, but it still appears in the DocumentNode object. :/ I'm

Parsing nested HTML list with BeautifulSoup

Question: I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:

    <ul>
      <li>Operating System
        <ul>
          <li>Linux
            <ul>
              <li>Debian</li>
              <li>Fedora</li>
              <li>Ubuntu</li>
            </ul>
          </li>
          <li>Windows</li>
          <li>OS X</li>
        </ul>
      </li>
      <li>Programming Languages
        <ul>
          <li>Python</li>
          <li>C#</li>
          <li>Ruby</li>
        </ul>
      </li>
    </ul>

I want to convert it to a dict like this:

    {
        'Operating System': {
            'Linux': {
                'Debian': None,
                'Fedora': None,
                'Ubuntu': None,
            },
            'Windows': None,
            'OS X': None,
        },
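The conversion can be sketched with a stack of dicts: each opening `ul` pushes a new level, each `li` label becomes a key, and a key receives a nested dict only if a `ul` opens before its `li` closes. This sketch uses the standard-library `html.parser` rather than BeautifulSoup, and the class name `NestedListParser` is an assumption. It also assumes that every `li` carries its label text before any nested list, as in the list quoted in the question.

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Turn nested <ul>/<li> markup into a dict of dicts (leaves map to None)."""

    def __init__(self):
        super().__init__()
        self.root = {}
        self.stack = []    # dicts for the <ul> elements currently open
        self.text = None   # text pieces of the <li> being read, or None

    def _flush_label(self):
        """Return the pending <li> label (or None) and stop collecting text."""
        if self.text is None:
            return None
        label = ''.join(self.text).strip()
        self.text = None
        return label

    def handle_starttag(self, tag, attrs):
        if tag == 'ul':
            label = self._flush_label()
            if label and self.stack:
                # A list nested under a labelled item: open a child level.
                child = {}
                self.stack[-1][label] = child
                self.stack.append(child)
            else:
                # Outermost list, or a list with no label to attach it to.
                self.stack.append(self.stack[-1] if self.stack else self.root)
        elif tag == 'li' and self.stack:
            self.text = []

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'li':
            label = self._flush_label()
            if label and self.stack:
                self.stack[-1][label] = None   # leaf item
        elif tag == 'ul' and self.stack:
            self.stack.pop()


markup = """
<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul><li>Debian</li><li>Fedora</li><li>Ubuntu</li></ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul><li>Python</li><li>C#</li><li>Ruby</li></ul>
  </li>
</ul>
"""
parser = NestedListParser()
parser.feed(markup)
tree = parser.root
print(tree)
```

Text that appears after a nested list inside the same `li` is ignored by this sketch.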

BeautifulSoup - easy way to obtain HTML-free contents

Question: I'm using this code to find all interesting links in a page:

    soup.findAll('a', href=re.compile('^notizia.php\?idn=\d+'))

It does its job pretty well. Unfortunately, inside that a tag there are a lot of nested tags, like font, b and other things. I'd like to get just the text content, without any other HTML tags. Example of a link:

    <A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009: <font color=green>CCS Ingegneria Elettronica
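Dropping the tags and keeping only the text amounts to collecting the character data and ignoring everything else. As a dependency-free illustration, the sketch below does this with the standard-library `html.parser` instead of BeautifulSoup; in BeautifulSoup 4 the `get_text()` method on a tag serves the same purpose. The names `TextExtractor` and `strip_tags` are assumptions, and the sample link is a shortened, made-up variant of the one quoted in the question, which is cut off.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Accumulate only the character data, dropping every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()   # flush any text still buffered at the end of input
    return ''.join(extractor.chunks)


# Hypothetical link: the title text is invented because the original is truncated.
link = ('<A HREF="notizia.php?idn=1134"><FONT CLASS="v12"><B>03-11-2009: '
        '<font color=green>Example title</font></B></FONT></A>')
text = strip_tags(link)
print(text)
```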