html-parsing

C# library to clean up html [closed]

风格不统一 submitted on 2019-12-21 20:11:41
Question: I was wondering if there is a .NET library to clean up an HTML document and remove unclosed tags?
Answer 1: HtmlTidy! See the URL below for more details: http://www.devx.com/dotnet/Article/20505/0/page/2 The source of the download/project is: http://tidy.sourceforge.net/ I gave the other link because it

get contents of <a> tags using python

橙三吉。 submitted on 2019-12-21 19:57:52
Question: Assuming I have html read into my program like this:
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver
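
One way to approach the question above, as a minimal sketch rather than the answer given in the original thread: Python's standard-library html.parser can collect the text inside each <a> element together with its href. The names AnchorTextParser and anchor_contents are made up for this illustration.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._href = None        # href of the <a> that is currently open
        self._chunks = []        # text pieces seen inside the open <a>
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._in_anchor:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._in_anchor = False


def anchor_contents(html):
    """Return a list of (href, text) tuples for the <a> tags in html."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling anchor_contents on the snippet from the question would yield one (href, text) tuple per listing. Text outside the anchor, such as the <font> location, is deliberately ignored in this sketch.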

How to search in a HTML file for some tags?

夙愿已清 submitted on 2019-12-21 19:57:18
Question: I'm having a little problem in Java. I want to search an HTML file for the href and src attributes and then get the URL associated with each of them. What is the best way to do it? Thanks for the help. Best regards.
Answer 1: This is the code I used to accomplish exactly what you'd like to do, but first let me give you a few tips. If you're in a Java Swing environment, make sure to use the methods in the javax.swing.text.html and javax.swing.text.html.parser packages.
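
The answer's Java code is not included in this excerpt. To illustrate the general idea only, here is a small sketch in Python (not Java, and not the answerer's code) that walks every start tag and records the values of its href and src attributes. UrlAttributeParser and find_urls are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class UrlAttributeParser(HTMLParser):
    """Records (tag, attribute, value) for every href or src attribute found."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append((tag, name, value))


def find_urls(html):
    """Return every href/src value in html, in document order."""
    parser = UrlAttributeParser()
    parser.feed(html)
    parser.close()
    return parser.urls
```

Self-closing tags such as <img src="..." /> are covered as well, because HTMLParser routes them through handle_starttag by default.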

JSOUP not downloading complete html if the webpage is big in size. Any alternatives to this or any workarounds?

拈花ヽ惹草 submitted on 2019-12-21 19:27:36
Question: I was trying to fetch an HTML page and parse information from it. I found that some pages were not downloaded completely by Jsoup. When I fetched the same page with curl on the command line, the complete page was downloaded. Initially I thought the problem was site-specific, but then I tried parsing several large web pages at random with Jsoup and found that it did not download the complete page for any of them. I tried specifying the user agent and timeout properties, but the download was still incomplete. Here is the code I

simple html dom: how get a tag without certain attribute

余生长醉 submitted on 2019-12-21 16:53:58
Question: I want to get the tags whose "class" attribute equals "someclass", but only those that do not define an "id" attribute. I tried the following (based on this answer), but it didn't work: $html->find('.someclass[id!=*]'); Note: I'm using the Simple HTML DOM class, and I didn't find what I need in its basic documentation.
Answer 1: The Simple HTML DOM class does not support CSS3 pseudo-classes, which are required for negative attribute matching. It is simple to work around the limitation
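
The workaround itself is cut off in this excerpt. The underlying idea of selecting by class and then discarding elements that carry an id can be sketched as follows. This is a Python illustration using the standard-library html.parser, not Simple HTML DOM code, and ClassWithoutIdParser and find_class_without_id are hypothetical names.

```python
from html.parser import HTMLParser


class ClassWithoutIdParser(HTMLParser):
    """Collects start tags that have a given class but no id attribute."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = []   # list of (tag, attribute dict)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if self.class_name in classes and "id" not in attr_map:
            self.matches.append((tag, attr_map))


def find_class_without_id(html, class_name):
    """Return (tag, attrs) for elements with class_name and no id."""
    parser = ClassWithoutIdParser(class_name)
    parser.feed(html)
    parser.close()
    return parser.matches
```

The filter is applied in ordinary code after the class match, which avoids relying on selector support for negation.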

Advantages of XSLT or Linq to XML

江枫思渺然 submitted on 2019-12-21 08:55:51
Question: What advantages are there to using either XSLT or LINQ to XML for HTML parsing in C#? This assumes the HTML has been cleaned so that it is valid XHTML. The values will eventually go into a C# object to be validated and processed. Please let me know if these points are valid and if there are other things to consider.
XSLT advantages:
- Easy to change quickly and deploy
- Fairly well known
XSLT disadvantages:
- Not compiled, so is slower to process
- String manipulation can be cumbersome

HTML to RTF string using Python

穿精又带淫゛_ submitted on 2019-12-21 07:21:53
Question: I am looking for a way to convert HTML text to an RTF string. Are there any libraries that do this? I get HTML content dynamically in my project and need it rendered in RTF format. I am using an HTML parser to convert the HTML text to a plain string and then trying to use PyRTF to convert it to RTF. Is there a better way to do this? Thanks in advance.
Answer 1: RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for
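
For very simple markup, a hand-rolled converter may be enough. The sketch below does not use PyRTF and is not the approach from the original answer. It assumes only a handful of tags matter (b, strong, i, em, u, br, p) and maps them to basic RTF control words. The tag table and the names rtf_escape, HtmlToRtf and html_to_rtf are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed mapping for this sketch only: a few inline tags to RTF control words.
_OPEN = {"b": "\\b ", "strong": "\\b ", "i": "\\i ", "em": "\\i ", "u": "\\ul "}
_CLOSE = {"b": "\\b0 ", "strong": "\\b0 ", "i": "\\i0 ", "em": "\\i0 ", "u": "\\ulnone "}


def rtf_escape(text):
    """Escape RTF special characters and encode non-ASCII text as \\uN? units."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN expects signed 16-bit UTF-16 code units; "?" is the fallback char.
            data = ch.encode("utf-16-le")
            for k in range(0, len(data), 2):
                unit = int.from_bytes(data[k:k + 2], "little", signed=True)
                out.append("\\u%d?" % unit)
    return "".join(out)


class HtmlToRtf(HTMLParser):
    """Translates a small subset of HTML into RTF fragments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _OPEN:
            self.parts.append(_OPEN[tag])
        elif tag == "br":
            self.parts.append("\\line ")

    def handle_endtag(self, tag):
        if tag in _CLOSE:
            self.parts.append(_CLOSE[tag])
        elif tag == "p":
            self.parts.append("\\par ")

    def handle_data(self, data):
        self.parts.append(rtf_escape(data))


def html_to_rtf(html):
    """Wrap the converted fragments in a minimal RTF document group."""
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()
    return "{\\rtf1\\ansi " + "".join(parser.parts) + "}"
```

Tags outside the table are dropped while their text is kept, so tables, lists, fonts and colors would need additional handling or a dedicated conversion tool.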