Regex HTML Extraction C#

﹥>﹥吖頭↗ submitted on 2019-12-11 14:34:31

Question


I have searched and searched for a regex, but I can't seem to find one that will let me do this.

I need to get the 12.32, 2,300, 4.644 M and 12,444.12 from the following strings in C#:

<td class="c-ob-j1a" property="c-value">12.32</td>
<td class="c-ob-j1a" property="c-value">2,300</td>
<td class="c-ob-j1a" property="c-value">4.644 M</td>
<td class="c-ob-j1a" property="c-value">12,444.12 M</td>

I got up to this:

MatchCollection valueCollection = Regex.Matches(html, @"<td class=""c-ob-j1a"" property=""c-value"">(?<Value>P{</td>})</td>");

Thanks!


Answer 1:


You should not use regex to parse HTML. See the post "What is the best way to parse html in C#?" on how to parse HTML, or use HtmlAgilityPack: http://www.codeplex.com/htmlagilitypack

But if you really want to use regex, this should work:

<td[^>]*>(.+?)<\/td>



Answer 2:


c-value">(.*?)<\/td>

should do it for you. The value you require is held in the capturing group denoted by the parentheses.
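To make the capturing-group idea concrete, here is a minimal sketch of reading group 1 from each match. It assumes the pattern is anchored on the `c-value">` text that precedes each number in the question's markup; the class and method names (`CaptureGroupDemo`, `ExtractValues`) are hypothetical and only there for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureGroupDemo
{
    // Returns the text captured by group 1 for every match in the input.
    // (Hypothetical helper, not part of the original answer.)
    public static List<string> ExtractValues(string html)
    {
        var values = new List<string>();
        // Group 0 is the whole match; group 1 is the lazy (.*?) before </td>.
        foreach (Match m in Regex.Matches(html, @"c-value"">(.*?)</td>"))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }

    public static void Main()
    {
        string html = @"<td class=""c-ob-j1a"" property=""c-value"">12.32</td>
<td class=""c-ob-j1a"" property=""c-value"">2,300</td>";

        foreach (string v in ExtractValues(html))
            Console.WriteLine(v); // prints 12.32, then 2,300
    }
}
```

In .NET the `/` in `</td>` does not need escaping, so the backslash can be dropped inside a verbatim string.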




Answer 3:


Something like this should work:

/<td[^>]*?>(.+?)<\/td>/

Regarding your code sample, this would probably be more maintainable:

MatchCollection valueCollection = Regex.Matches(html, @"<td[^>]*?>(?<Value>.*?)</td>");

If your HTML contains other td elements that you don't want to extract data from, keep the stricter opening-tag match from your original regex.
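The difference between a greedy quantifier (`.+`) and a lazy one (`.+?` or `.*?`) matters as soon as several cells share one line. The sketch below illustrates this with the named group `Value`; the single-line input and the helper name `FirstValue` are assumptions made for the demonstration, not part of the question's data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Returns the named group "Value" from the first match, or "" if none.
    // (Hypothetical helper for illustration.)
    public static string FirstValue(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups["Value"].Value : string.Empty;
    }

    public static void Main()
    {
        // Two cells on ONE line, so '.' can run across both of them.
        string html = "<td>12.32</td><td>2,300</td>";

        // Greedy: .+ takes as much as possible, up to the LAST </td>.
        Console.WriteLine(FirstValue(html, @"<td[^>]*>(?<Value>.+)</td>"));
        // prints: 12.32</td><td>2,300

        // Lazy: .+? stops at the FIRST </td>.
        Console.WriteLine(FirstValue(html, @"<td[^>]*>(?<Value>.+?)</td>"));
        // prints: 12.32
    }
}
```

With the question's input each cell sits on its own line, and `.` does not match a newline by default, so the greedy form happens to work there; the lazy form is the safer default.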




Answer 4:


I'd probably start with a very strict match to avoid accidentally capturing other parts of the document:

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string html = @"<td class=""c-ob-j1a"" property=""c-value"">12.32</td>
    <td class=""c-ob-j1a"" property=""c-value"">2,300</td>
    <td class=""c-ob-j1a"" property=""c-value"">4.644 M</td>
    <td class=""c-ob-j1a"" property=""c-value"">12,444.12 M</td>";

            // Match only cells with exactly these attributes; capture the text up to the next '<'.
            var matches = Regex.Matches(html, @"<td class=""c-ob-j1a"" property=""c-value"">([^<]*)</td>");
            foreach (Match match in matches)
                Console.WriteLine(match.Groups[1].Value);
        }
    }

(And I would also like to take this opportunity to recommend the Html Agility Pack if you haven't tried it yet.)




Answer 5:


If all you need is to parse the td tag in the formats you presented, you might get away with a regex.

In general, parsing HTML with regex does not work reliably. You can find many questions here on SO explaining why.
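As a small illustration of why this is fragile, the sketch below runs a strict pattern for the question's exact format against two harmless variations of the same cell. The variations (swapped attributes, a nested `<b>` tag) are made-up inputs, and the names `StrictPatternLimits` and `CountMatches` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictPatternLimits
{
    // Strict pattern for the exact cell format shown in the question.
    const string Strict = @"<td class=""c-ob-j1a"" property=""c-value"">([^<]*)</td>";

    // Counts how many cells the strict pattern finds in the input.
    public static int CountMatches(string html)
    {
        return Regex.Matches(html, Strict).Count;
    }

    public static void Main()
    {
        // Exact format from the question: found.
        Console.WriteLine(CountMatches(
            @"<td class=""c-ob-j1a"" property=""c-value"">12.32</td>")); // prints 1

        // Same cell with the attributes swapped: missed.
        Console.WriteLine(CountMatches(
            @"<td property=""c-value"" class=""c-ob-j1a"">12.32</td>")); // prints 0

        // Same cell with the value wrapped in <b>: [^<]* cannot cross the inner tag.
        Console.WriteLine(CountMatches(
            @"<td class=""c-ob-j1a"" property=""c-value""><b>12.32</b></td>")); // prints 0
    }
}
```

An HTML parser treats all three inputs as the same cell, which is the main argument for using one when the markup is not under your control.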



Source: https://stackoverflow.com/questions/1894995/regex-html-extraction-c-sharp
