Regex: remove line breaks from parts of string (PHP)

雨燕双飞 submitted on 2019-12-23 16:18:29

Question


I want to remove all the line breaks and carriage returns from an XML file so all tags fit on one line each.

XML Source example:

<resources>
  <resource>
    <id>001</id>
    <name>Resource name 1</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>002</id>
    <name>Resource name 2</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.
</desc>
  </resource>
  <resource>
    <id>003</id>
    <name>Resource name 3</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor.
Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.
</desc>
  </resource>
</resources>

My take at it:

$pattern = "#(\t\t<[^>]*>[^<>]*)[\r\n]+([^<>]*</.*>)#";
$replacement = "$1$2";
$data = preg_replace($pattern, $replacement, $data);

This pattern corrects the 2nd resource and puts it back on one line. However, it doesn't correct both line breaks in the 3rd resource; it only removes one of them. The result is this:

<resources>
  <resource>
    <id>001</id>
    <name>Resource name 1</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>002</id>
    <name>Resource name 2</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor. Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
  <resource>
    <id>003</id>
    <name>Resource name 3</name>
    <desc>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas nibh magna, fermentum et pretium vel, malesuada sit amet dolor.
Morbi dictum, nunc sed interdum facilisis, ligula enim pharetra tortor, at egestas urna massa non nulla.</desc>
  </resource>
</resources>

What's wrong with my pattern?


Answer 1:


The first [^<>]* in your regex initially gobbles up all of the remaining text, and then has to backtrack a ways so the rest of the regex can match. It only backtracks as far as it has to, i.e., to the last line break in the text. The rest of the regex is able to match what's left, so that's that.
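The backtracking described above can be reproduced on a small input. This is only a sketch: the three-line string in $sample is a made-up stand-in for one <desc> element, and $count is just the optional fifth argument of preg_replace, used here to report how many replacements were made.

```php
<?php
// Hypothetical sample: one element, indented with two tabs, containing two line breaks.
$sample = "\t\t<desc>one\ntwo\nthree</desc>";

// The pattern from the question, unchanged.
$pattern = "#(\t\t<[^>]*>[^<>]*)[\r\n]+([^<>]*</.*>)#";

// The first [^<>]* grabs "one\ntwo\nthree", then backs off only as far as
// the LAST line break, so that is the only one the match removes.
$result = preg_replace($pattern, "$1$2", $sample, -1, $count);

echo $count, "\n";   // 1
echo $result, "\n";  // "one\ntwothree" inside the element: the first break survives
```

Because the single match covers the whole element, there is nothing left for a second match to start from.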

But your regex would only match one line break in any case, because it consumes the whole text. It should consume only the part you want to remove. Check this out:

preg_replace('#[\r\n]+(?=[^<>]*</desc>)#', ' ', $data);

After the line break is found, the lookahead confirms that it was found inside a <desc> element. But the lookahead doesn't consume anything, so the next line break (if there is one) is still there to be matched on the next pass.
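As a quick check of that behaviour, here is a minimal sketch run against a made-up fragment (the contents of $data are an assumption, not the asker's real file):

```php
<?php
// Hypothetical fragment: a <desc> whose text is spread over three lines.
$data = "<resource>\n  <desc>line one\nline two\n</desc>\n</resource>";

// Each run of line breaks is replaced only if nothing but non-tag text
// separates it from a closing </desc>. The lookahead consumes nothing,
// so both breaks inside the element are matched independently.
$result = preg_replace('#[\r\n]+(?=[^<>]*</desc>)#', ' ', $data);

echo $result, "\n";
// <resource>
//   <desc>line one line two </desc>
// </resource>
```

The breaks between elements are left alone. Note that the final break becomes a trailing space before </desc>, since every match is replaced with ' '.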

You can't have the lookahead match just any end tag (</\w+>) because that would let it match line breaks between elements as well as inside them. You can, however, enumerate the elements you want to work on:

</(?:desc|name|id)>
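Dropped into the lookahead, that alternation might look like the sketch below. The sample string is invented for illustration; only the element names come from the question.

```php
<?php
// Hypothetical fragment with a stray line break inside each listed element.
$data = "<resource>\n  <id>001\n</id>\n  <name>Resource\nname 1</name>\n"
      . "  <desc>first\nsecond</desc>\n</resource>";

// Same idea as before, but the lookahead accepts any of the listed end tags.
$pattern = '#[\r\n]+(?=[^<>]*</(?:desc|name|id)>)#';
$result  = preg_replace($pattern, ' ', $data);

echo $result, "\n";
// <resource>
//   <id>001 </id>
//   <name>Resource name 1</name>
//   <desc>first second</desc>
// </resource>
```

The break before </resource> is kept because resource is not in the list.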



Answer 2:


Unless there's a lot more to what you're trying to do than you describe, I think you're making it way too complicated. You don't need nearly as complex a regex as you have. Try just using /\r?\n/. This worked for me with your data:

$data = preg_replace("/\r?\n/", "", $data);
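To see what this does to the layout, here is a small sketch on an invented fragment with Windows-style line endings. Be aware that it strips every line break, including the ones between elements, so the whole fragment ends up on a single line.

```php
<?php
// Hypothetical fragment using CRLF line endings.
$data = "<resource>\r\n  <id>001</id>\r\n  <desc>a\r\nb</desc>\r\n</resource>";

// Remove every LF, together with the CR in front of it when there is one.
$result = preg_replace("/\r?\n/", "", $data);

echo $result, "\n";
// <resource>  <id>001</id>  <desc>ab</desc></resource>
```

Words that were separated only by a line break are joined ("a" and "b" become "ab"), because the replacement is an empty string.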



Answer 3:


What's wrong with my pattern?

It's a pattern, not an XML parser.

Try using the DOM, or one of the many, many real XML parsers available to PHP. It should be a simple matter of going through all of the text nodes and trimming them.
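A minimal sketch of that approach with PHP's bundled DOM extension follows. The sample XML, the XPath expression, and the choice to replace each break with a single space are my assumptions about what "trimming" should mean here, not something stated in the question.

```php
<?php
// Hypothetical input: one <desc> whose text spans several lines.
$xml = "<resources>\n  <resource>\n    <desc>first line\nsecond line\n</desc>\n"
     . "  </resource>\n</resources>";

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// Visit only text nodes that contain something other than whitespace,
// so the indentation between elements is left untouched.
foreach ($xpath->query('//text()[normalize-space()]') as $node) {
    // Replace each line break (plus surrounding whitespace) with one space,
    // then trim the ends of the text node.
    $node->nodeValue = trim(preg_replace('/\s*[\r\n]+\s*/', ' ', $node->nodeValue));
}

// Serialize the root element without the XML declaration.
$out = $doc->saveXML($doc->documentElement);
echo $out, "\n";
// <resources>
//   <resource>
//     <desc>first line second line</desc>
//   </resource>
// </resources>
```

Since the parser decides what counts as element text, this does not depend on how the tags are indented or named.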



Source: https://stackoverflow.com/questions/3340494/regex-remove-line-breaks-from-parts-of-string-php
