html-parsing | 易学教程

What method in jsoup can return the modified HTML?

Question: When I parse an HTML file (stored locally) with jsoup, I modify some of its elements, so I want to save the modified HTML and replace the old file. Does anybody know which method in jsoup can do the job? Thank you so much! Answer 1: You could write the contents of either document.toString() or document.outerHtml() to a file, where document is obtained from Document document = Jsoup.connect("http://...").get(); // any document modifications... like so: BufferedWriter htmlWriter = new

Auto-indent HTML Code and Display It

Question: When displaying HTML code in PHP, what is the best way to indent the tags? For example, <?php $html = '<div><p><span>some text</span></p></div>'; echo htmlspecialchars($html); ?> will give <div><p><span>some text</span></p></div>. Instead, I'd like to show each tag on its own indented line, something like <div> <p> <span>some text</span> </p> </div> Answer 1: You can use htmLawed (http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/). This would be your code: <?php require("htmLawed/htmLawed.php"); echo "
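The indentation the question asks for can be sketched in Python with only the standard library. This is not htmLawed and not the answer's PHP code (which is cut off above); the names `HtmlIndenter` and `indent_html` are hypothetical, and the sketch assumes reasonably well-balanced markup. Comments and doctype declarations are dropped, and runs of whitespace in text are collapsed.

```python
from html import escape
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not deepen the indentation.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlIndenter(HTMLParser):
    """Collects one output line per tag; text-only elements stay on one line."""

    def __init__(self, indent="  "):
        super().__init__()
        self.indent = indent
        self.depth = 0
        self.items = []  # each item is [depth, kind, text]

    def handle_starttag(self, tag, attrs):
        self.items.append([self.depth, "start:" + tag, self.get_starttag_text()])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> never changes the depth.
        self.items.append([self.depth, "closed", self.get_starttag_text()])

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.items.append([self.depth, "text", escape(text, quote=False)])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(self.depth - 1, 0)
        closing = f"</{tag}>"
        opener = "start:" + tag
        last = self.items[-1] if self.items else None
        if last and last[0] == self.depth and last[1] == opener:
            # Empty element: keep <tag></tag> on a single line.
            last[1], last[2] = "closed", last[2] + closing
        elif (len(self.items) >= 2 and last[1] == "text"
              and self.items[-2][0] == self.depth
              and self.items[-2][1] == opener):
            # Element containing only text: merge tag, text and closing tag.
            text = self.items.pop()[2]
            previous = self.items[-1]
            previous[1], previous[2] = "closed", previous[2] + text + closing
        else:
            self.items.append([self.depth, "end", closing])

    def render(self):
        return "\n".join(self.indent * depth + text
                         for depth, _kind, text in self.items)


def indent_html(html, indent="  "):
    parser = HtmlIndenter(indent)
    parser.feed(html)
    parser.close()
    return parser.render()


if __name__ == "__main__":
    pretty = indent_html("<div><p><span>some text</span></p></div>")
    print(pretty)
    # html.escape plays the role of htmlspecialchars when showing the code in a page.
    print(escape(pretty))
```

To display the result inside a web page, the indented string still needs escaping and a whitespace-preserving container such as a pre element.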

remove certain attributes from HTML tags

Question: How can I remove certain attributes such as id, style, class, etc. from HTML code? I thought I could use the lxml.html.clean module, but as it turned out I can only remove style attributes with Cleaner(style=True).clean_html(code). I'd prefer not to use regular expressions for this task (the attributes could change). What I would like to have: from lxml.html.clean import Cleaner code = '<tr id="ctl00_Content_AdManagementPreview_DetailView_divNova" class="Extended" style="display: none;">' cleaner
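One regex-free way to drop a configurable set of attributes is to re-serialize the markup with a parser and skip the unwanted names. The sketch below uses Python's standard `html.parser` rather than lxml, so it is a swapped-in technique and not the `Cleaner` API from the question; `AttributeStripper` and `strip_attributes` are hypothetical names. Note that attribute names and tag names come back lowercased and attribute values are re-quoted with double quotes.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emits the parsed markup, leaving out the attributes listed in `drop`."""

    def __init__(self, drop):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.drop = {name.lower() for name in drop}
        self.out = []

    def _open_tag(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            if name in self.drop:
                continue
            # Valueless attributes (e.g. "disabled") arrive with value None.
            parts.append(name if value is None
                         else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_attributes(html, drop=("id", "class", "style")):
    parser = AttributeStripper(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    code = ('<tr id="ctl00_Content_AdManagementPreview_DetailView_divNova" '
            'class="Extended" style="display: none;">')
    print(strip_attributes(code))
```

Because the set of names is a parameter, the list of attributes to remove can change without touching the parsing logic.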

Why does Array.to_s return brackets?

Question: For an array, when I type puts array[0] I get text, yet when I type puts array[0].to_s I get ["text"]. Why the brackets and quotes? What am I missing? ADDENDUM: My code looks like this: page = open(url) {|f| f.read } page_array = page.scan(/regex/) # pulls partial URLs into an array partial_url = page_array[0].to_s full_url = base_url + partial_url # adds each partial URL to a consistent base_url puts full_url What I'm getting looks like: http://www.stackoverflow/["questions"] Answer 1: This prints the

Using DOMDocument, is it possible to get all elements that exists within a certain DOM?

Question: Let's say I have an HTML file with a lot of different elements, each having different attributes, and I do not know beforehand what this HTML will look like. Using PHP's DOMDocument, how can I iterate over ALL elements and modify them? All I see is getElementsByTagName, getElementById, etc. I want to iterate through all elements. For instance, let's say the HTML looks like this (just an example; in reality I do not know the structure): $html = '<div class="potato"><span></span></div>';
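The general idea, walking every element of a tree without knowing its structure, can be sketched in Python with `xml.etree.ElementTree`, whose `iter()` visits the root and all of its descendants in document order. This is an analogue, not PHP's DOMDocument (there, passing '*' to getElementsByTagName plays a similar role). The function name `mark_all_elements` and the `data-visited` attribute are hypothetical, and the sketch assumes the markup is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET


def mark_all_elements(markup, attr="data-visited", value="1"):
    """Parse well-formed markup and set one attribute on every element."""
    root = ET.fromstring(markup)
    # iter() with no tag argument yields the root and every descendant element.
    for element in root.iter():
        element.set(attr, value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    html = '<div class="potato"><span></span></div>'
    print(mark_all_elements(html))
```

Any per-element modification (renaming attributes, removing them, inspecting `element.tag`) can replace the `element.set` call inside the loop.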

How to get data from HTML using regex

Question: I have the following HTML: <table class="profile-stats"> <tr> <td class="stat"> <div class="statnum">8</div> <div class="statlabel"> Tweets </div> </td> <td class="stat"> <a href="/THEDJMHA/following"> <div class="statnum">13</div> <div class="statlabel"> Following </div> </a> </td> <td class="stat stat-last"> <a href="/THEDJMHA/followers"> <div class="statnum">22</div> <div class="statlabel"> Followers </div> </a> </td> </tr> </table> I want to get the value from <td class="stat stat-last"> => <div
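The question is cut off, so the following is only a sketch under the assumption that the wanted value is the number in the statnum div inside the stat-last cell. It uses Python's `re` module on the exact markup quoted above; `SAMPLE_HTML`, `last_stat`, and `all_stats` are hypothetical names. A regex like this is tied to the attribute order and quoting of this snippet, and an HTML parser is usually the sturdier choice.

```python
import re

SAMPLE_HTML = """
<table class="profile-stats"> <tr>
<td class="stat"> <div class="statnum">8</div> <div class="statlabel"> Tweets </div> </td>
<td class="stat"> <a href="/THEDJMHA/following"> <div class="statnum">13</div>
<div class="statlabel"> Following </div> </a> </td>
<td class="stat stat-last"> <a href="/THEDJMHA/followers"> <div class="statnum">22</div>
<div class="statlabel"> Followers </div> </a> </td>
</tr> </table>
"""

# DOTALL lets ".*?" cross line breaks; the lazy quantifier stops at the first
# statnum div that follows the stat-last cell.
LAST_STAT_RE = re.compile(
    r'<td\s+class="stat stat-last">.*?<div\s+class="statnum">\s*(\d+)\s*</div>',
    re.DOTALL,
)
ANY_STAT_RE = re.compile(r'<div\s+class="statnum">\s*(\d+)\s*</div>')


def last_stat(html):
    """Return the number in the stat-last cell, or None if it is absent."""
    match = LAST_STAT_RE.search(html)
    return int(match.group(1)) if match else None


def all_stats(html):
    """Return every statnum value in document order."""
    return [int(value) for value in ANY_STAT_RE.findall(html)]


if __name__ == "__main__":
    print(last_stat(SAMPLE_HTML))
    print(all_stats(SAMPLE_HTML))
```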

How to strip entire HTML, CSS and JS code or tags from HTML page in python [duplicate]

Question: This question already has answers here (closed 6 years ago). Possible duplicates: BeautifulSoup Grab Visible Webpage Text; Web scraping with Python. Say I have a very complex HTML page consisting of the usual HTML tags, with CSS and JS in the middle. We might see all the worst cases. All I want is to strip all of the above tags/code and return the "text". In simple terms: <html><body>Text</body></html> This might contain JS, CSS, etc. I am trying to use BeautifulSoup, but it's not removing the JS from the code. Now, I am
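A minimal sketch of the goal described here, returning only the text while discarding tags, CSS, and JS, using Python's standard `html.parser` instead of BeautifulSoup (a swapped-in technique, since the question's own code is not shown). `VisibleTextParser` and `visible_text` are hypothetical names. The parser ignores everything between script or style tags and keeps the remaining text nodes.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ('<html><head><style>p { color: red; }</style>'
            '<script>alert("hi");</script></head>'
            '<body>Text <b>here</b></body></html>')
    print(visible_text(page))
```

This does not try to decide what a browser would actually render (for example, elements hidden via CSS are still included); it only removes markup and the two kinds of embedded code.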

Can I Convert this utf8 Character?

Question: I am using the DOM HTML document functions to scrape HTML and store it in MySQL, but I have noticed that for foreign languages like Chinese or Japanese, some weird characters are stored in MySQL, and I don't think anyone can read them: é—¨æˆ·,æ–°é—»,ãƒ¼ã‚¿ãƒ«,æ¤œç´¢ So my question is: can I convert this back into its original form with code? If not, I want to eliminate it from my table because it is of no use. How can I eliminate only these characters from the table? Answer 1: This should
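Garbled text of this shape is typical of UTF-8 bytes that were decoded as a single-byte encoding. The sketch below assumes that encoding was Windows-1252 (an assumption about this data, and not necessarily what the truncated answer proposes); under that assumption the first sample above reverses to 门户. The function name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text, wrong_encoding="cp1252"):
    """Undo one round of 'UTF-8 bytes read as wrong_encoding'.

    Returns the text unchanged when the round trip is not possible, for
    example when it is already correct or was damaged in some other way.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    original = "门户"
    # Reproduce the damage: UTF-8 bytes misread as Windows-1252.
    garbled = original.encode("utf-8").decode("cp1252")
    print(garbled)
    print(fix_mojibake(garbled))
```

The repair only works while every original byte survived storage; if some bytes were dropped or replaced along the way, those characters cannot be recovered this way, and the lasting fix is to use a consistent UTF-8 setting from scraping through to the database.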

Regular expression with multiple results

Question: What's wrong with my regex? "/Blabla$2$ :.*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uis" .... <tr> <td class="aaa">Blabla(1) :</td> <td> <table class="bbb"><tbody> <tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr> <tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr> <tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr> </tbody></table> </td> </tr> <tr> <td class="aaa">Blabla
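A single capture group yields one value per match, so collecting a variable number of cells usually takes two passes: first isolate the table that follows the label, then find every cell inside it. The Python sketch below illustrates that two-step approach on markup shaped like the sample above; it is not the PHP answer (which is not shown here), `cells_after_label` is a hypothetical name, and it assumes no nested tables inside the block.

```python
import re

CELL_RE = re.compile(r'<td class="generic">(.*?)</td>', re.DOTALL)


def cells_after_label(html, label):
    """Return the text of every 'generic' cell in the table following label."""
    # Step 1: lazily match from the label cell to the end of the next table.
    block_re = re.compile(
        re.escape(label) + r"\s*:\s*</td>.*?<table\b.*?</table>",
        re.DOTALL | re.IGNORECASE,
    )
    block = block_re.search(html)
    if block is None:
        return []
    # Step 2: collect all cells inside that isolated block only.
    return [cell.strip() for cell in CELL_RE.findall(block.group(0))]


if __name__ == "__main__":
    sample = (
        '<tr> <td class="aaa">Blabla(1) :</td> <td> <table class="bbb"><tbody> '
        '<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr> '
        '<tr><td class="generic">word1</td><td class="generic">word2 </td>'
        '<td class="generic">word3</td></tr> '
        '</tbody></table> </td> </tr>'
    )
    print(cells_after_label(sample, "Blabla(1)"))
```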

ASP.NET - Parse / Query HTML Before Transmission and Insert CSS Class References

Question: As a web developer I feel too much of my time is spent on CSS. I am trying to come up with a solution where I can write reusable CSS, i.e. classes, and reference these classes in the HTML without additional code in ASPX or ASCX files, etc., or in code-behind files. I want an intermediary which links up HTML elements with CSS classes. What I want to achieve: Modify HTML immediately before transmission. Select elements in the HTML. Based on rules defined elsewhere (e.g. in a text file relating to the