html-parsing

Add parent tags with beautiful soup

半城伤御伤魂 submitted on 2020-01-11 10:38:36
Question: I have many pages of HTML with various sections containing code snippets like this:

    <div class="footnote" id="footnote-1">
      <h3>Reference:</h3>
      <table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
        <tr>
          <td valign="top" width="20px">
            <a href="javascript:void(0);" onclick='javascript:toggleFootnote("footnote-1");' title="click to hide this reference">1.</a>
          </td>
          <td>
            <p> blah </p>
          </td>
        </tr>
      </table>
    </div>

I can parse the HTML successfully and extract

How can I extract structured text from an HTML list in PHP?

此生再无相见时 submitted on 2020-01-11 07:49:05
Question: I have this string:

    <ul>
      <li id="1">Page 1</li>
      <li id="2">Page 2
        <ul>
          <li id="3">Sub Page A</li>
          <li id="4">Sub Page B</li>
          <li id="5">Sub Page C
            <ul>
              <li id="6">Sub Sub Page I</li>
            </ul>
          </li>
        </ul>
      </li>
      <li id="7">Page 3
        <ul>
          <li id="8">Sub Page D</li>
        </ul>
      </li>
      <li id="9">Page 4</li>
    </ul>

and I want to break out the information for every item with PHP and lay it out like this:

    ----------------------------------
    | ID | ORDER | PARENT | CHILDREN |
    ----------------------------------
    | 1  | 1     | 0      | 0        |
    | 2  | 2     |
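The excerpt is cut off before the full table, so the following is only a minimal sketch of one way to derive such rows, written in Python with the standard-library `html.parser` rather than PHP. It assumes ORDER means the position among siblings, PARENT is the id of the enclosing `<li>` (0 at the top level), and CHILDREN counts direct child items. The names `ListOutline` and `outline` are hypothetical.

```python
from html.parser import HTMLParser


class ListOutline(HTMLParser):
    """Collects one row per <li>: id, order, parent, children."""

    def __init__(self):
        super().__init__()
        self.rows = {}       # id -> row dict, kept in document order
        self.open_items = []  # ids of the <li> elements currently open
        self.counters = []    # one sibling counter per open <ul>

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.counters.append(0)
        elif tag == "li":
            if not self.counters:  # tolerate an <li> outside any <ul>
                self.counters.append(0)
            item_id = dict(attrs).get("id")
            parent = self.open_items[-1] if self.open_items else "0"
            self.counters[-1] += 1
            self.rows[item_id] = {
                "id": item_id,
                "order": self.counters[-1],  # assumption: position among siblings
                "parent": parent,
                "children": 0,
            }
            if parent != "0":
                self.rows[parent]["children"] += 1  # assumption: direct children only
            self.open_items.append(item_id)

    def handle_endtag(self, tag):
        if tag == "ul" and self.counters:
            self.counters.pop()
        elif tag == "li" and self.open_items:
            self.open_items.pop()


def outline(html):
    """Return the rows for every <li> in document order."""
    parser = ListOutline()
    parser.feed(html)
    return list(parser.rows.values())


SAMPLE = (
    '<ul><li id="1">Page 1</li><li id="2">Page 2<ul>'
    '<li id="3">Sub Page A</li><li id="4">Sub Page B</li>'
    '<li id="5">Sub Page C<ul><li id="6">Sub Sub Page I</li></ul></li>'
    '</ul></li><li id="7">Page 3<ul><li id="8">Sub Page D</li></ul></li>'
    '<li id="9">Page 4</li></ul>'
)

for row in outline(SAMPLE):
    print(row["id"], row["order"], row["parent"], row["children"])
```

With these assumptions, the first row comes out as `1 1 0 0`, which matches the visible part of the table. If ORDER was meant as a global sequence number instead, a single running counter would replace the per-`<ul>` counters.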

parsing/extracting a HTML Table, Website in Java

自闭症网瘾萝莉.ら submitted on 2020-01-10 19:56:46
Question: I want to parse the contents of this HTML table. Here is the full page with source code: http://www.kantschule-falkensee.de/uploads/dmiadgspahw/klassen/A_Klasse_11.htm I want to parse the data in each cell, for example all 5 cells under "Montag" (Monday). I have tried several ways of parsing this page with Jsoup, but I haven't had any success. My main goal is to show the contents in a ListView in an Android app. For now I am trying to print the contents to a Java console. Both

Simple libxml2 HTML parsing example, using Objective-c, Xcode, and HTMLparser.h

不羁的心 submitted on 2020-01-09 03:19:45
Question: Can somebody please show me a simple example of parsing some HTML using libxml?

    #import <libxml2/libxml/HTMLparser.h>

    NSString *html = @"<ul>"
        "<li><input type=\"image\" name=\"input1\" value=\"string1value\" /></li>"
        "<li><input type=\"image\" name=\"input2\" value=\"string2value\" /></li>"
        "</ul>"
        "<span class=\"spantext\"><b>Hello World 1</b></span>"
        "<span class=\"spantext\"><b>Hello World 2</b></span>";

1) Say I want to parse the value of the input whose name = input2. Should output

Split string into smaller part with constrain [PHP RegEx HTML]

我的未来我决定 submitted on 2020-01-06 15:07:25
Question: I need to split a long string into an array under the following constraints:

- The input is an HTML string, either a full page or a fragment.
- Each part (new string) has a limited number of characters (e.g. no more than 8000 characters).
- Each part can contain multiple sentences (delimited by a full stop, ".") but never a partial sentence. The exception is the last part of the string, which may not end with a full stop.
- The string contains HTML tags. But the tag can not be divided as ( <a href='test.html'>
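The constraints visible above can be sketched roughly as follows. This is written in Python rather than PHP and is a sketch under stated assumptions, not a complete answer: a break is allowed only right after a "." that lies outside a `<...>` tag, every "." in text is treated as a sentence end (so decimals and abbreviations would also split), attribute values are assumed not to contain ">", and a single sentence longer than the limit is emitted whole. The name `split_html` is hypothetical.

```python
def split_html(html, limit=8000):
    """Split html into parts of at most `limit` characters where possible,
    breaking only after a full stop that is outside any tag."""
    # Collect candidate break positions (index just past each full stop in text).
    breaks = []
    in_tag = False
    for index, char in enumerate(html):
        if char == "<":
            in_tag = True
        elif char == ">":
            in_tag = False
        elif char == "." and not in_tag:
            breaks.append(index + 1)
    breaks.append(len(html))  # the tail may not end with a full stop

    parts = []
    start = 0
    while start < len(html):
        # Furthest break that still keeps this part within the limit.
        fitting = [b for b in breaks if start < b <= start + limit]
        if fitting:
            end = fitting[-1]
        else:
            # Assumption: an over-long sentence is kept whole rather than cut.
            end = next(b for b in breaks if b > start)
        parts.append(html[start:end])
        start = end
    return parts


sample = "One. <a href='test.html'>Two</a>. Three"
for part in split_html(sample, limit=30):
    print(repr(part))
```

Joining the parts always reproduces the input, and the "." inside `href='test.html'` is never used as a break because it sits inside a tag. A part can still start or end between an opening tag and its closing tag. Keeping whole elements together would need a real parser.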

How can I parse only part of an HTML file and ignore the rest?

两盒软妹~` submitted on 2020-01-06 08:38:11
Question: From each of 5,000 HTML files I need only one line of text, which is line 999. How can I tell HTML::Parser that I only need line 999?

    </p><h1>dataset 1:</h1> <table border="0" bgcolor="#EFEFEF" leftmargin="15" topmargin="5"><tr> <td><strong>name:</strong> </td> <td width=500> myname one </td></tr><tr> <td><strong>type:</strong> </td> <td width=500> type_one (04313488) </td></tr><tr> <td><strong>aresss:</strong> </td><td>Friedrichstr. 70, 73430 Madrid</td></tr><tr> <td><strong
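If the wanted data really does sit on a fixed physical line, one option is to pull out that line first and hand only that string to a parser. The sketch below shows the idea in Python rather than Perl. The helper names `nth_line` and `read_line` are hypothetical, and UTF-8 is an assumed encoding.

```python
from itertools import islice


def nth_line(lines, line_number):
    """Return the 1-based line `line_number` from an iterable of lines,
    or None if there are fewer lines than that."""
    # islice skips the earlier lines lazily instead of loading everything.
    return next(islice(lines, line_number - 1, line_number), None)


def read_line(path, line_number, encoding="utf-8"):
    """Open a file and return only the requested line (assumed encoding)."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return nth_line(handle, line_number)


demo = ["<html>\n", "<td>myname one</td>\n", "</html>\n"]
print(nth_line(demo, 2))
```

Reading stops as soon as the requested line has been returned, so the rest of each file is never read. This relies on every file having the same line layout. If that is not guaranteed, matching on the surrounding markup is safer.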

JSOUP parsing HTML get class inside class

亡梦爱人 submitted on 2020-01-06 08:12:43
Question: I am developing an Android application that uses Jsoup to parse HTML. I have the following HTML:

    <div class='wrapper'>
      <div style='margin:7px;'>
        <div class='box' style='height:595px'>
          <div class='boxtitlebox'>
            <div class='boxtitle'><h4>13 RECENT CHORDS</h4></div><div class='clear'></div>
          </div>
          <div class='listitem'><a href='http://www.chordfrenzy.com/chord/9742/ungu-apa-sih-maumu-kord-lirik-lagu'>
            <div class='subtitle'>Chord Ungu</div>
            <div class='title'>Apa Sih Maumu</div>
          </a></div>
          <div class=

JSOUP parsing HTML get class inside class

这一生的挚爱 submitted on 2020-01-06 08:12:03
Question: I am developing an Android application that uses Jsoup to parse HTML. I have the following HTML:

    <div class='wrapper'>
      <div style='margin:7px;'>
        <div class='box' style='height:595px'>
          <div class='boxtitlebox'>
            <div class='boxtitle'><h4>13 RECENT CHORDS</h4></div><div class='clear'></div>
          </div>
          <div class='listitem'><a href='http://www.chordfrenzy.com/chord/9742/ungu-apa-sih-maumu-kord-lirik-lagu'>
            <div class='subtitle'>Chord Ungu</div>
            <div class='title'>Apa Sih Maumu</div>
          </a></div>
          <div class=

File format for storing html parser rules

时光毁灭记忆、已成空白 submitted on 2020-01-06 04:22:04
Question: I'm using Jsoup to parse a page whose structure changes over time. For now the parsing config is written in Java, so I have to issue a new build each time the rules are modified. Is there some sort of JSON- or XML-based markup language I could use to store the parsing config in an external file?

Answer 1: You can try Clojure: it can represent your config data and call Jsoup to do the parsing.

Answer 2: Options include XPath and CSS selector syntax. The latter is supported by Jsoup.

Source: https:/
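To illustrate the idea of keeping selectors in an external file, here is a minimal Python sketch. It is a stand-in rather than Jsoup code: the rules are a JSON object mapping hypothetical field names to path expressions, and the expressions use the limited XPath subset of the standard-library `xml.etree.ElementTree`, which only works on well-formed markup. With Jsoup, the same JSON values would be CSS selector strings passed to `select`.

```python
import json
import xml.etree.ElementTree as ET

# In practice this text would be loaded from an external file; the field
# names and paths here are hypothetical examples.
RULES_JSON = """
{
  "title": ".//div[@class='title']",
  "subtitle": ".//div[@class='subtitle']"
}
"""


def extract(markup, rules):
    """Apply each named path expression and return the first match's text."""
    root = ET.fromstring(markup)  # requires well-formed markup
    result = {}
    for field, path in rules.items():
        node = root.find(path)
        has_text = node is not None and node.text
        result[field] = node.text.strip() if has_text else None
    return result


rules = json.loads(RULES_JSON)
page = (
    "<div class='listitem'><a href='#'>"
    "<div class='subtitle'>Chord Ungu</div>"
    "<div class='title'>Apa Sih Maumu</div>"
    "</a></div>"
)
print(extract(page, rules))
# {'title': 'Apa Sih Maumu', 'subtitle': 'Chord Ungu'}
```

When the page structure changes, only the JSON needs editing and the extraction code stays the same, which is the point of moving the rules out of the build.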

Why can't parse all div elements in the target.html with lxml.html?

不羁的心 submitted on 2020-01-06 02:46:13
Question: Please download the file target.html from Dropbox and save it as /tmp/target.html. Open it in Firefox with Firebug to inspect the HTML structure. It is clear that there are at least 10 divs in target.html. Now parse all div elements in target.html with lxml.html:

    python3
    >>> import lxml.html
    >>> doc=lxml.html.parse("/tmp/target.html")
    >>> divs=doc.xpath("//div")
    >>> len(divs)
    4

The result is 4. Why does the code above miss so many divs? There are at least 10 divs in target.html. Same