Java Regex or XML parser?

会有一股神秘感。 Submitted on 2019-12-04 04:24:24

Question


I want to remove any tags such as

<p>hello <namespace:tag : a>hello</namespace:tag></p>

to become

 <p> hello hello </p>

What is the best way to do this? If it is regex, the pattern below is not working for some reason. Can anyone help?

(<|</)[:]{1,2}[^</>]>

edit: added


Answer 1:


Definitely use an XML parser. Regex should not be used to parse *ML.
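
For well-formed input, the parser route can be sketched in Java with the JDK's built-in DOM parser. This is a minimal sketch rather than a definitive implementation: the names `XmlText` and `textOf` are hypothetical, and it assumes the markup is well-formed XML. The question's literal sample, with the stray `: a` inside the opening tag, would be rejected by a strict XML parser, so the example uses a cleaned-up variant of it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlText {
    // Hypothetical helper: returns the concatenated text of all nodes
    // under the root element, dropping every tag.
    static String textOf(String xml) {
        try {
            // The factory is not namespace-aware by default, so the
            // undeclared "namespace:" prefix is accepted as part of the name.
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Assumption: malformed input is reported as an unchecked exception.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>hello <namespace:tag>hello</namespace:tag></p>";
        System.out.println(textOf(xml)); // prints: hello hello
    }
}
```

Note that this yields the text only. Keeping the outer `<p>` element while removing the inner tag would mean editing the DOM tree and serializing it again, which this sketch does not attempt.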




Answer 2:


You should not use regex for this purpose; use a parser such as lxml or BeautifulSoup (both Python libraries):

>>> import lxml.html as lxht
>>> myString = '<p>hello <namespace:tag : a>hello</namespace:tag></p>'
>>> lxht.fromstring(myString).text_content()
'hello hello'

Here is a reason why you should not parse html/xml with regex.




Answer 3:


If you're just trying to pull the plain text out of some simple XML, the best approach (fastest, smallest memory footprint) is to run a for loop over the data:

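In Java (data is assumed to be a String holding the input; the original answer left the input source unspecified):

boolean inMarkup = false;
StringBuilder text = new StringBuilder();
for (char c : data.toCharArray())
{
    if (c == '<') inMarkup = true;          // entering a tag
    else if (c == '>') inMarkup = false;    // leaving a tag
    else if (!inMarkup) text.append(c);     // keep only characters outside tags
}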

Note: This will break if you encounter things like CDATA, JavaScript, or CSS in your parsing.

So, to sum up: if it's simple, do something like the above rather than a regular expression. If it isn't that simple, listen to the other guys and use an advanced parser.
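
To make the caveat about CDATA concrete, the loop above can be wrapped in a small method and run on two inputs. This is only a sketch: the class and method names (`TagStripper`, `stripTags`) are hypothetical and not part of the original answer. The first input is the sample from the question. The second contains a CDATA section with a `>` inside it, which the loop mistakes for the end of a tag.

```java
public class TagStripper {
    // Hypothetical helper wrapping the character loop from the answer.
    static String stripTags(String data) {
        boolean inMarkup = false;
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '<') inMarkup = true;          // entering a tag
            else if (c == '>') inMarkup = false;    // leaving a tag
            else if (!inMarkup) text.append(c);     // keep text outside tags
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Simple markup: the tags are dropped and the text survives.
        String simple = "<p>hello <namespace:tag : a>hello</namespace:tag></p>";
        System.out.println("[" + stripTags(simple) + "]"); // prints: [hello hello]

        // CDATA containing '>': the intended text is "a > b", but the loop
        // treats the '>' as a tag end and keeps the trailing "]]" instead.
        String cdata = "<p><![CDATA[a > b]]></p>";
        System.out.println("[" + stripTags(cdata) + "]"); // prints: [ b]]]
    }
}
```

The second result shows why the loop is only suitable for input that is known to contain nothing but plain tags and text.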




Answer 4:


This is a solution I personally used for a similar problem in Java. The library used for this is Jsoup: http://jsoup.org/.

In my particular case I had to unwrap tags that had an attribute with a particular value. You can see that reflected in this code. It is not the exact solution to this problem, but it could put you on your way.

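// Imports are assumed (the original post did not show them):
// jsoup plus Apache Commons Lang 3 for StringUtils and Validate.
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.Validate;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

  // Unwraps every <tagName> element (keeps its children, drops the tag itself).
  // If an attribute name is given, an element is unwrapped only when that
  // attribute's value is fully consumed by matchRegEx.
  public static String unWrapTag(String html, String tagName, String attribute, String matchRegEx) {
    Validate.notNull(html, "html must be non null");
    Validate.isTrue(StringUtils.isNotBlank(tagName), "tagName must be non blank");
    if (StringUtils.isNotBlank(attribute)) {
      Validate.notNull(matchRegEx, "matchRegEx must be non null when an attribute is provided");
    }
    Document doc = Jsoup.parse(html);
    // Disable pretty-printing so whitespace in the output is left untouched.
    OutputSettings outputSettings = doc.outputSettings();
    outputSettings.prettyPrint(false);
    Elements elements = doc.getElementsByTag(tagName);
    for (Element element : elements) {
      if (StringUtils.isBlank(attribute)) {
        // No attribute filter: unwrap every matching tag.
        element.unwrap();
      } else {
        String attr = element.attr(attribute);
        if (!StringUtils.isBlank(attr)) {
          // Unwrap only if nothing is left after removing the matched part.
          String newData = attr.replaceAll(matchRegEx, "");
          if (StringUtils.isBlank(newData)) {
            element.unwrap();
          }
        }
      }
    }
    // Jsoup.parse wraps fragments in <html><head><body>, so this returns the
    // whole document; doc.body().html() would return just the fragment.
    return doc.html();
  }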


Source: https://stackoverflow.com/questions/9120964/java-regex-or-xml-parser
