Converting HTML entities to Unicode Characters in C#

心在旅途 2020-12-15 15:23

I found similar questions and answers for Python and JavaScript, but not for C# or any other WinRT-compatible language.

The reason I think I need it is because I'

6 Answers
  • 2020-12-15 15:56

    HTML entities (named references) and HTML numbers (numeric references) are encoded and decoded differently in a Metro app and a WP8 app.

    With Windows Runtime Metro App

    {
        string inStr = "ó";
        string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
        // auxStr == &#243;
        string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
        // outStr == ó
        string outStr2 = System.Net.WebUtility.HtmlDecode("&oacute;");
        // outStr2 == ó
    }
    

    With Windows Phone 8.0

    {
        string inStr = "ó";
        string auxStr = System.Net.WebUtility.HtmlEncode(inStr);
        // auxStr == ó
        string outStr = System.Net.WebUtility.HtmlDecode(auxStr);
        // outStr == ó
        string outStr2 = System.Net.WebUtility.HtmlDecode("&oacute;");
        // outStr2 == ó
    }
    

    To work around this on WP8, I implemented the table from the HTML ISO-8859-1 Reference and apply it before calling System.Net.WebUtility.HtmlDecode().
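
    The answer does not show its table code, so the following is only a minimal sketch of one way such a pre-pass could look. The class name EntityFallback, the five-entry excerpt of the ISO-8859-1 table, and the choice to resolve both the named and the decimal form are assumptions made for illustration; a real table would cover code points 160 to 255.

    ```csharp
    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    public static class EntityFallback
    {
        // Hypothetical excerpt of the ISO-8859-1 table: entity name -> code point.
        private static readonly Dictionary<string, int> Latin1 = new Dictionary<string, int>
        {
            { "nbsp", 160 }, { "Agrave", 192 }, { "eacute", 233 }, { "oacute", 243 }, { "uuml", 252 }
        };

        // Matches decimal references (&#243;) and named references (&oacute;).
        private static readonly Regex Reference = new Regex("&(#[0-9]+|[A-Za-z]+);");

        public static string Decode(string html)
        {
            if (string.IsNullOrEmpty(html)) return html;

            // Resolve Latin-1 references ourselves and leave every other
            // match untouched for WebUtility.HtmlDecode to handle.
            string pre = Reference.Replace(html, m =>
            {
                string body = m.Groups[1].Value;
                int code;
                if (body[0] == '#')
                {
                    if (!int.TryParse(body.Substring(1), out code) || code < 160 || code > 255)
                        return m.Value;
                }
                else if (!Latin1.TryGetValue(body, out code))
                {
                    return m.Value;
                }
                return ((char)code).ToString();
            });

            return WebUtility.HtmlDecode(pre);
        }
    }
    ```

    Because the pre-pass only ever emits characters in the range 160 to 255, it cannot produce a new "&" and so does not cause text such as "&amp;oacute;" to be decoded twice.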

  • 2020-12-15 16:08

    This worked for me; it replaces both common (named) and Unicode (numeric) entities.

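    // Requires: using System.Text.RegularExpressions; using System.Web;
    // Declare both members inside a static class so the extension method compiles.
    private static readonly Regex HtmlEntityRegex = new Regex("&(#)?([a-zA-Z0-9]+);");
    
    public static string HtmlDecode(this string html)
    {
        if (string.IsNullOrEmpty(html)) return html;
        return HtmlEntityRegex.Replace(html, x =>
            // Decimal references that fit in a single char are converted directly;
            // named, hex, and larger references are left to HttpUtility.HtmlDecode.
            x.Groups[1].Value == "#"
                && int.TryParse(x.Groups[2].Value, out var code) && code <= char.MaxValue
                ? ((char)code).ToString()
                : HttpUtility.HtmlDecode(x.Groups[0].Value));
    }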
    
    [Test]
    [TestCase(null, null)]
    [TestCase("", "")]
    [TestCase("&#39;fark&#39;", "'fark'")]
    [TestCase("&quot;fark&quot;", "\"fark\"")]
    public void should_remove_html_entities(string html, string expected)
    {
        html.HtmlDecode().ShouldEqual(expected);
    }
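
    One caveat when converting numeric references by hand: a reference can name a code point above U+FFFF, which does not fit in a single char and needs a surrogate pair. The sketch below shows one way to handle that with char.ConvertFromUtf32. The class and method names are hypothetical, and returning null for an invalid reference is an assumption made for illustration.

    ```csharp
    using System.Globalization;

    public static class NumericReference
    {
        // Converts the digits of a numeric reference (without "&#", "x", or ";")
        // to text, or returns null when they do not name a valid code point.
        public static string ToText(string digits, bool hex)
        {
            NumberStyles style = hex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;
            int code;
            if (!int.TryParse(digits, style, CultureInfo.InvariantCulture, out code))
                return null;

            // char.ConvertFromUtf32 throws for values outside 0..0x10FFFF
            // and for surrogate code points, so reject those first.
            if (code < 0 || code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF))
                return null;

            return char.ConvertFromUtf32(code);
        }
    }
    ```

    For example, ToText("233", false) and ToText("E9", true) both give "é", while ToText("128512", false) gives a two-char string holding the surrogate pair for U+1F600.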
    
  • 2020-12-15 16:11

    This might be useful: it replaces all named entities (as far as my requirements go) with their numeric Unicode equivalents.

        public string EntityToUnicode(string html) {
            var replacements = new Dictionary<string, string>();
            var regex = new Regex("(&[a-z]{2,5};)");
            foreach (Match match in regex.Matches(html)) {
                if (!replacements.ContainsKey(match.Value)) { 
                    var unicode = HttpUtility.HtmlDecode(match.Value);
                    if (unicode.Length == 1) {
                        replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
                    }
                }
            }
            foreach (var replacement in replacements) {
                html = html.Replace(replacement.Key, replacement.Value);
            }
            return html;
        }
    
  • 2020-12-15 16:14

    An improved version of Zumey's method (I can't comment there). The longest entity name is &exclamation; (11 characters). Upper-case letters are also possible in entity names, e.g. &Agrave; (source: wiki).

        public string EntityToUnicode(string html) {
            var replacements = new Dictionary<string, string>();
            var regex = new Regex("(&[a-zA-Z]{2,11};)");
            foreach (Match match in regex.Matches(html)) {
                if (!replacements.ContainsKey(match.Value)) { 
                    var unicode = HttpUtility.HtmlDecode(match.Value);
                    if (unicode.Length == 1) {
                        replacements.Add(match.Value, string.Concat("&#", Convert.ToInt32(unicode[0]), ";"));
                    }
                }
            }
            foreach (var replacement in replacements) {
                html = html.Replace(replacement.Key, replacement.Value);
            }
            return html;
        }
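
    The same named-to-numeric rewrite can also be done in one pass with a Regex.Replace callback instead of a dictionary and a second loop. This is only a sketch: the names EntityRewriter and EntityToNumeric are hypothetical, and it swaps in System.Net.WebUtility.HtmlDecode for HttpUtility.HtmlDecode so that it does not need a System.Web reference.

    ```csharp
    using System.Net;
    using System.Text.RegularExpressions;

    public static class EntityRewriter
    {
        // Named references only: '&', a letter, further letters or digits, ';'.
        private static readonly Regex NamedEntity = new Regex("&[a-zA-Z][a-zA-Z0-9]*;");

        public static string EntityToNumeric(string html)
        {
            if (string.IsNullOrEmpty(html)) return html;

            return NamedEntity.Replace(html, m =>
            {
                // Unknown names come back unchanged (longer than one char),
                // so they are left as they were.
                string decoded = WebUtility.HtmlDecode(m.Value);
                return decoded.Length == 1
                    ? "&#" + (int)decoded[0] + ";"
                    : m.Value;
            });
        }
    }
    ```

    With this sketch, "caf&eacute; &amp; &Agrave; &bogus;" becomes "caf&#233; &#38; &#192; &bogus;".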
    
  • 2020-12-15 16:16

    Use HttpUtility.HtmlDecode(); see the MSDN documentation for details.

    string decodedString = HttpUtility.HtmlDecode(myEncodedString);
    
  • 2020-12-15 16:19

    I recommend using System.Net.WebUtility.HtmlDecode and NOT HttpUtility.HtmlDecode.

    The System.Web reference does not exist in WinForms/WPF/Console applications, and this class (which is already referenced in all those project types) gives exactly the same result.

    Usage:

    string s = System.Net.WebUtility.HtmlDecode("&eacute;"); // Returns é
    