Why does LF and CRLF behave differently with /^\s*$/gm regex?

廉价感情. Submitted on 2021-01-24 09:45:08

Question


I've been seeing this issue on Windows. When I try to clear the whitespace from whitespace-only lines on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

i.e. if there were spaces on blank lines, they'd get removed. On the other hand, on Windows, the regex removes the blank lines ALTOGETHER. To illustrate:

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

(Template literals always produce only \n in JS, so I had to replace it with \r\n to emulate Windows; the ? after \r is there just to be sure, for those who don't believe it.) The result:

===
HELLO
WOLRD
===

The blank lines are gone entirely! But my regex has ^ and $ with the m flag set, so it's kind of /^-to-$/m. What's the difference between \n and \r\n then that makes it produce different results?

When I add some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

With \r\n I'm seeing

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with \n only

matched
matched
matched
===

HELLO

WOLRD

===

Answer 1:


TL;DR: a pattern that matches whitespace next to line breaks will also match the individual characters of a \r\n sequence, if you let it.

First of all, let's actually examine what characters are there and aren't there when you do a replacement. Starting with a string that only uses line feeds:

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

Each line ends with char code 10 which is the Line Feed (LF) character that is represented in a string literal with \n. Before and after the replacement, the two strings are the same - not only look the same but actually equal each other, so the replacement did nothing.
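Because line endings are invisible when printed, it can be easier to serialize the string than to inspect it index by index. A minimal sketch, where showInvisibles is a hypothetical helper name (not part of the snippets above) that relies on JSON.stringify to escape control characters:

```javascript
// Hypothetical helper: render control characters such as \r and \n as
// visible escape sequences by serializing the string as JSON.
function showInvisibles(str) {
  return JSON.stringify(str);
}

const lfText = "===\n\nHELLO";
const crlfText = lfText.replace(/\r?\n/g, "\r\n");

console.log(showInvisibles(lfText));   // "===\n\nHELLO"
console.log(showInvisibles(crlfText)); // "===\r\n\r\nHELLO"
```

The output includes the surrounding double quotes, which also makes trailing spaces easy to spot.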

Now let's examine the other case:

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")
console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex}
   charcode: ${str.charCodeAt(charIndex)}
   character: ${str.charAt(charIndex)}`
 );
}

This time each line ends with char code 13, the Carriage Return (CR) character, which is written in a string literal as \r, and then the LF follows. After the replacement, instead of the sequence =\r\n\r\nH we have just =\r\nH. Let's look at why.

Here is what MDN says about the meta character ^:

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And here is what MDN says about the meta character $

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

So they match after and before a line break character. By "line break character", MDN means either the LF or the CR on its own; a CRLF pair is not treated as a single unit. This can be seen if we test strings that contain different line breaks:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));
console.log(regexStart.exec(stringCRLF));

console.log(regexEnd.exec(stringLF));
console.log(regexEnd.exec(stringCRLF));

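If we try to match whitespace next to a line break, nothing matches when the break is a lone LF, but with CRLF there is a match: ^\s matches the LF (which sits right after the CR) and \s$ matches the CR (which sits right before the LF). So, in that case the patterns match here: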

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches
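To see where the anchors can match on their own, one option is to collect the index of every zero-width match. This is only a sketch; anchorPositions is a hypothetical helper name introduced for illustration:

```javascript
// Hypothetical helper: list every index at which a zero-width anchor
// matches. The regex passed in must carry the g flag for matchAll.
function anchorPositions(str, anchor) {
  return [...str.matchAll(anchor)].map((m) => m.index);
}

const crlfSample = "hello\r\nworld";

// ^ matches at the start, right after the \r, and right after the \n.
console.log(anchorPositions(crlfSample, /^/gm)); // [ 0, 6, 7 ]
// $ matches right before the \r, right before the \n, and at the end.
console.log(anchorPositions(crlfSample, /$/gm)); // [ 5, 6, 12 ]
```

Index 6, the gap between \r and \n, appears in both lists, which is why the engine sees an "empty line" inside every CRLF pair.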

So both ^ and $ recognise either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, it matches when you have a line that consists entirely of \r\n. But it does so for a reason that is not obvious:

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";


console.log(re.exec(stringLF));
console.log(re.exec(stringCRLF));

So, the regex doesn't match a \r\n but rather a \n\r (two whitespace characters) sitting between two other line break characters. That's because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity. In the trace, the leading \r inside the brackets is the line break that satisfies ^; it is not part of the match itself:

input = "hello\r\n\r\nworld"
regex = /^\s+$/

Step 1
hello[\r]\n\r\nworld
    matches `^`, symbol satisfied -> continue with next symbol in regex

Step 2
hello[\r\n]\r\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 3
hello[\r\n\r]\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 4
hello[\r\n\r\n]world
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 5
hello[\r\n\r\nw]orld
    does not match `\s` -> backtrack

Step 6
hello[\r\n\r\n]world
    matches `^\s+`, quantifier satisfied -> continue to next symbol in regex

Step 7
hello[\r\n\r\nw]orld
    does not match `$` in `^\s+$` -> backtrack

Step 8
hello[\r\n\r]\nworld
    `\s+` gives back the final `\n`, `$` matches before it -> matches `^\s+$`, last symbol satisfied -> finish
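The outcome of the trace can be checked directly by inspecting the match index and serializing the matched text. A short sketch (the variable names are illustrative only):

```javascript
const traceInput = "hello\r\n\r\nworld";
const traceMatch = /^\s+$/m.exec(traceInput);

// The match starts after the first \r and covers the LF plus the next CR.
console.log(traceMatch.index);              // 6
console.log(JSON.stringify(traceMatch[0])); // "\n\r"

// Removing that "\n\r" leaves the outer \r and \n to form a single CRLF,
// so the empty line disappears.
const traceReplaced = traceInput.replace(/^\s+$/gm, "");
console.log(JSON.stringify(traceReplaced)); // "hello\r\nworld"
```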

Lastly, there is something slightly hidden here - it matters that you're matching whitespace. This is because it will behave differently to most other symbols in that it explicitly matches a line break character, whereas . will not:

Matches any single character except line terminators

So, if you specify \s$ this will match the CR in \r\n because the regex engine is forced to look for a match for both \s and $, therefore it finds the \r before the \n. However, this will not happen for many other patterns, since $ will usually be satisfied when it's before CR (or at the end of the string).

The same goes for ^\s: it explicitly looks for a whitespace character after a line break, which is satisfied by the LF in CRLF. If you're not looking for whitespace, however, it will happily match after the LF:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;

console.log(stringLF.match(regexStartAll));
console.log(stringCRLF.match(regexStartAll));

console.log(stringLF.match(regexEndAll));
console.log(stringCRLF.match(regexEndAll));

So, all of this means that ^\s+$ has some unintuitive behaviour, yet it is perfectly coherent once you understand that the regex engine matches exactly what you tell it to.
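One possible way to act on this, offered as a sketch rather than the only fix, is to tell the engine not to treat line terminators as removable whitespace. The character class below is an assumption about what "whitespace on a blank line" should mean; it reads as "a whitespace character that is not CR, LF, or one of the Unicode line and paragraph separators":

```javascript
// Assumption: only horizontal whitespace should be stripped, so line
// terminators (\r, \n, \u2028, \u2029) are excluded from the class.
const blankLinePadding = /^[^\S\r\n\u2028\u2029]+$/gm;

const paddedCRLF = "===\r\n  \t\r\nHELLO\r\n\r\nWORLD";
const cleanedCRLF = paddedCRLF.replace(blankLinePadding, "");

// The spaces and the tab are removed; every CRLF pair survives intact.
console.log(JSON.stringify(cleanedCRLF)); // "===\r\n\r\nHELLO\r\n\r\nWORLD"
```

With this class the \n\r straddling two CRLF pairs can no longer be consumed, so LF and CRLF inputs end up behaving the same way.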



Source: https://stackoverflow.com/questions/60729065/why-does-lf-and-crlf-behave-differently-with-s-gm-regex
