html-parsing

what method in jsoup can return the modified html?

随声附和 posted on 2019-12-12 09:49:46
Question: When I parse an HTML file (stored locally) with jsoup, I modify some elements in it, so I want to save the modified HTML and replace the old file. Does anybody know which method in jsoup can do the job? Thank you so much! Answer 1: You could write the contents of either document.toString() or document.outerHtml() to a file, where document is obtained from Document document = Jsoup.connect("http://...").get(); // any document modifications... like so: BufferedWriter htmlWriter = new

Auto-indent HTML Code and Display It

喜你入骨 posted on 2019-12-12 09:10:32
Question: When displaying HTML code in PHP, what is the best way to indent the tags? For example, <?php $html = '<div><p><span>some text</span></p></div>'; echo htmlspecialchars($html); ?> will give <div><p><span>some text</span></p></div> . Instead, I'd like to show something like <div> <p> <span>some text</span> </p> </div> Answer 1: You can use htmLawed http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/ This would be your code: <?php require("htmLawed/htmLawed.php"); echo "

remove certain attributes from HTML tags

天涯浪子 posted on 2019-12-12 08:08:06
Question: How can I remove certain attributes such as id, style, class, etc. from HTML code? I thought I could use the lxml.html.clean module, but as it turned out I can only remove style attributes with Cleaner(style=True).clean_html(code) . I'd prefer not to use regular expressions for this task (the attributes could change). What I would like to have: from lxml.html.clean import Cleaner code = '<tr id="ctl00_Content_AdManagementPreview_DetailView_divNova" class="Extended" style="display: none;">' cleaner
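
A minimal sketch of the attribute-stripping idea, offered as one possible approach rather than the answer from the original thread. It swaps lxml for Python's standard-library html.parser so that it runs without extra packages. The names AttributeStripper and strip_attributes are hypothetical, and the default list of attributes to drop is an assumption taken from the question.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-serialises HTML while dropping the attributes named in `drop`."""

    def __init__(self, drop):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.drop = {name.lower() for name in drop}
        self.parts = []

    def _render(self, tag, attrs, closing=""):
        kept = []
        for name, value in attrs:
            if name in self.drop:
                continue
            if value is None:  # bare attribute such as `disabled`
                kept.append(name)
            else:  # values arrive unescaped, so escape them again
                kept.append(f'{name}="{escape(value, quote=True)}"')
        joined = "".join(" " + item for item in kept)
        return f"<{tag}{joined}{closing}>"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_attributes(html, drop=("id", "class", "style")):
    """Hypothetical helper: return `html` without the listed attributes."""
    parser = AttributeStripper(drop)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


code = ('<tr id="ctl00_Content_AdManagementPreview_DetailView_divNova" '
        'class="Extended" style="display: none;">')
print(strip_attributes(code))
```

Because the attribute names are passed in as a parameter, the list can change without touching the parsing logic, which is the concern the question raises about regular expressions. Note that html.parser lowercases tag and attribute names, so the output is normalised rather than byte-identical to the input.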

Why does Array.to_s return brackets?

可紊 posted on 2019-12-12 07:47:29
Question: For an array, when I type puts array[0] I get ==> text Yet when I type puts array[0].to_s I get ==> ["text"] Why the brackets and quotes? What am I missing? ADDENDUM: my code looks like this: page = open(url) {|f| f.read } page_array = page.scan(/regex/) #pulls partial urls into an array partial_url = page_array[0].to_s full_url = base_url + partial_url #adds each partial url to a consistent base_url puts full_url What I'm getting looks like: http://www.stackoverflow/["questions"] Answer 1: This print the

Using DOMDocument, is it possible to get all elements that exists within a certain DOM?

孤者浪人 posted on 2019-12-12 07:43:39
Question: Let's say I have an HTML file with a lot of different elements, each having different attributes. Let's say I do not know beforehand what this HTML will look like. Using PHP's DOMDocument, how can I iterate over ALL elements and modify them? All I see is getElementsByTagName, getElementById, etc. I want to iterate through all elements. For instance, let's say the HTML looks like this (just an example; in reality I do not know the structure): $html = '<div class="potato"><span></span></div>';

How to get data from HTML using regex

一曲冷凌霜 posted on 2019-12-12 06:39:50
Question: I have the following HTML: <table class="profile-stats"> <tr> <td class="stat"> <div class="statnum">8</div> <div class="statlabel"> Tweets </div> </td> <td class="stat"> <a href="/THEDJMHA/following"> <div class="statnum">13</div> <div class="statlabel"> Following </div> </a> </td> <td class="stat stat-last"> <a href="/THEDJMHA/followers"> <div class="statnum">22</div> <div class="statlabel"> Followers </div> </a> </td> </tr> </table> I want to get the value from <td class="stat stat-last"> => <div
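
A rough sketch of how a regular expression could pull a number out of the markup above. The question is cut off, so it is an assumption here that the wanted value is the statnum div inside the stat-last cell. The names PROFILE_HTML, STAT_LAST and last_stat are hypothetical, and the sample markup is a shortened copy of the HTML in the question.

```python
import re

# Shortened copy of the markup from the question (assumed representative).
PROFILE_HTML = """
<table class="profile-stats"> <tr>
<td class="stat"> <div class="statnum">8</div>
<div class="statlabel"> Tweets </div> </td>
<td class="stat stat-last"> <a href="/THEDJMHA/followers">
<div class="statnum">22</div>
<div class="statlabel"> Followers </div> </a> </td>
</tr> </table>
"""

# re.S lets `.` cross line breaks; the lazy `.*?` stops at the first
# statnum div that follows the stat-last cell.
STAT_LAST = re.compile(
    r'<td class="stat stat-last">.*?<div class="statnum">(\d+)</div>',
    re.S,
)


def last_stat(html):
    """Return the number in the stat-last cell as a string, or None."""
    match = STAT_LAST.search(html)
    return match.group(1) if match else None


print(last_stat(PROFILE_HTML))
```

A pattern like this is tied to the exact attribute order and quoting of the markup, so it is likely to break if the page layout changes. An HTML parser is usually the more robust choice when the structure is not under your control.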

How to strip entire HTML, CSS and JS code or tags from HTML page in python [duplicate]

本小妞迷上赌 posted on 2019-12-12 06:24:13
Question: This question already has answers here. Closed 6 years ago. Possible Duplicate: BeautifulSoup Grab Visible Webpage Text; Web scraping with Python. Say I have a very complex HTML page consisting of the usual HTML tags, with CSS & JS in the middle. We might see all the worst cases. All I want is to strip all of the above tags/code and return the text. In simple terms: <html><body>Text</body></html> This might contain JS, CSS, etc. I am trying to use BeautifulSoup, but it's not removing JS from the code. Now, I am
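
One way to sketch the "text only" idea without BeautifulSoup is Python's standard-library html.parser: collect text nodes and ignore anything inside script and style elements. This is an illustration under those assumptions, not the answer from the original thread. The names VisibleTextExtractor and visible_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Hypothetical helper: return the page text with whitespace collapsed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = ('<html><head><style>p{color:red}</style>'
        '<script>var a = "<b>x</b>";</script></head>'
        '<body>Text <b>here</b></body></html>')
print(visible_text(page))
```

The sketch only drops script and style content. Text hidden through CSS, comments rendered by scripts, or attributes such as alt and title are outside its scope.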

Can I Convert this utf8 Character?

拜拜、爱过 posted on 2019-12-12 05:39:33
Question: I am using DOM HTML document functions to scrape HTML and store it in MySQL, but I have noticed that for foreign languages such as Chinese or Japanese, some weird characters are stored in MySQL, and I don't think anyone can read them: 门户,æ–°é—»,ータル,検索 So my question is: can I convert this back into its original form with any code? If not, I want to eliminate it from my table because there is no use for it. How can I eliminate only these characters from the table? Answer 1: This should
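
Text like this usually appears when UTF-8 bytes are decoded with a single-byte encoding. A small sketch of reversing that, assuming the wrong encoding was Windows-1252 (cp1252), which the question does not confirm. The function name repair_mojibake is hypothetical, and the sketch is in Python rather than the PHP and MySQL setup the question describes.

```python
def repair_mojibake(text, wrong_encoding="cp1252"):
    """Undo a UTF-8 -> wrong_encoding mis-decoding where that is possible.

    Returns the input unchanged when the round trip fails, for example
    because some of the original bytes were lost when the text was stored.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the damage: UTF-8 bytes of the original read as cp1252.
original = "门户"
garbled = original.encode("utf-8").decode("cp1252")

print(garbled)
print(repair_mojibake(garbled))
```

The repair only works when every byte of the original UTF-8 sequence survived. Some byte values have no character in cp1252, so text that passed through such a conversion may have lost data and cannot be fully restored this way.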

Regular expression with multiple results

僤鯓⒐⒋嵵緔 posted on 2019-12-12 04:47:56
Question: What's wrong with my regex? "/Blabla\(2\) :.*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uis" .... <tr> <td class="aaa">Blabla(1) :</td> <td> <table class="bbb"><tbody> <tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr> <tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr> <tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr> </tbody></table> </td> </tr> <tr> <td class="aaa">Blabla

ASP.NET - Parse / Query HTML Before Transmission and Insert CSS Class References

走远了吗. posted on 2019-12-12 04:11:09
Question: As a web developer I feel too much of my time is spent on CSS. I am trying to come up with a solution where I can write reusable CSS, i.e. classes, and reference these classes in the HTML without additional code in ASPX or ASCX files or in code-behind files. I want an intermediary that links up HTML elements with CSS classes. What I want to achieve: modify the HTML immediately before transmission; select elements in the HTML; based on rules defined elsewhere (e.g. in a text file relating to the