How to parse a regular expression?

Submitted by 早过忘川 on 2021-02-20 05:29:26

Question


Disclaimer before this is auto-closed. This is NOT the same as this:

How do you access the matched groups in a JavaScript regular expression?

Let's say I have this regular expression:

const regex = /(\w+) count: (\d+)/

Is there a way I can extract the capture groups so that I have:

[ '\w+', '\d+' ]

Answer 1:


As others pointed out, you'd need a real parser, such as one generated with Lex & Yacc. You can, however, use regex and some recursion magic to parse nested structures. See details at https://twiki.org/cgi-bin/view/Blog/BlogEntry201109x3
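
For comparison, the simplest form of a "real parser" here is a hand-written scanner that walks the pattern once and tracks open parentheses on a stack. The sketch below is not from the original answer; `extractGroups` is a hypothetical helper, and it assumes that skipping escaped characters and character classes is enough for the patterns of interest. It treats every `(`, including non-capturing groups such as `(?:...)`, as a group.

```javascript
// Hypothetical helper (not from the original answer): collect the source text
// of every group in a regex pattern with one left-to-right scan and a stack.
function extractGroups(pattern) {
    const groups = [];   // group bodies, ordered by their opening parenthesis
    const stack = [];    // one entry per currently open group
    let inClass = false; // true while inside a character class such as [()]

    for (let i = 0; i < pattern.length; i++) {
        const ch = pattern[i];
        if (ch === '\\') {
            i++;                 // skip the escaped character, e.g. \( or \\
        } else if (inClass) {
            if (ch === ']') inClass = false;
        } else if (ch === '[') {
            inClass = true;
        } else if (ch === '(') {
            groups.push(null);   // reserve the slot so outer groups come first
            stack.push({ slot: groups.length - 1, start: i + 1 });
        } else if (ch === ')' && stack.length > 0) {
            const open = stack.pop();
            groups[open.slot] = pattern.slice(open.start, i);
        }
    }
    return groups.filter(g => g !== null); // drop groups that never closed
}

// A regex literal exposes its pattern text through the .source property.
const regex = /(\w+) count: (\d+)/;
console.log(extractGroups(regex.source)); // the two group sources: \w+ and \d+
```

Nested groups are reported after the group that contains them, so the outer group's text still includes the inner parentheses.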

Here is a JavaScript version of the regex-and-recursion approach that parses nested groups properly. The default test is (\w+) count: (\d+), number: (-?\d+(\/\d+)?), i.e. three groups at level 0 and one group nested at level 1 inside the third group:

    // configuration:
    const ctrlChar = '~'; // use non-printable, such as '\x01'
    const cleanRegex = new RegExp(ctrlChar + '\\d+' + ctrlChar, 'g');

    function parseRegex(str) {

        function _levelRegx(level) {
            return new RegExp('(' + ctrlChar + level + ctrlChar + ')\\((.*?)(' + ctrlChar + level + ctrlChar + ')\\)', 'g');
        }

        function _extractGroup(m, p1, p2, p3) {
            //console.log('m: ' + m + ', p1: ' + p1 + ', p2: ' + p2 + ', p3: ' + p3);
            groups.push(p2.replace(cleanRegex, ''));
            let nextLevel = parseInt(p1.replace(/\D/g, ''), 10) + 1;
            p2 = p2.replace(_levelRegx(nextLevel), _extractGroup);
            return '(' + p2 + ')';
        }

        // annotate parentheses with their nesting level:
        let level = 0;
        str = str.replace(/(?<!\\)[\(\)]/g, function(m) {
            if(m === '(') {
                return ctrlChar + (level++) + ctrlChar + m;
            } else {
                return ctrlChar + (--level) + ctrlChar + m;
            }
        });
        console.log('nesting: ' + str);

        // recursively extract groups:
        let groups = [];
        level = 0;
        str = str.replace(_levelRegx(level), _extractGroup);
        console.log('result: ' + str);
        console.log('groups: [ \'' + groups.join('\', \'') + '\' ]');
        $('#regexGroups').text(JSON.stringify(groups, null, ' '));
    }

    // $(document) is the documented form; $('document') is deprecated in jQuery 3
    $(document).ready(function() {
        let str = $('#regexInput').val();
        parseRegex(str);

        $('#regexInput').on('input', function() {
            let str = $(this).val();
            parseRegex(str);
        });
    });

CSS:

    div, input {
      font-family: monospace;
    }

HTML:

    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.0/jquery.min.js"></script>
    <div>
    <p>Regex: <input id="regexInput" value="(\w+) count: (\d+), number: (-?\d+(\/\d+)?)" size="60" /></p>
    <p>Groups: <span id="regexGroups"></span></p>
    <p>.<br />.<br />.</p>
    </div>

You can try it out with various nested patterns.

Explanation:

  • step 1: annotate opening and closing parentheses with their nesting level:
    • the annotation uses the control character ~
    • in real-life use, pick a non-printable character to avoid collisions with the pattern text
    • the result for (\w+) is ~0~(\w+~0~)
    • the result for the default input is ~0~(\w+~0~) count: ~0~(\d+~0~), number: ~0~(-?\d+~1~(\/\d+~1~)?~0~)
  • step 2: recursively extract groups:
    • we start with level 0 and extract all groups at that level
    • for each matched group, we recursively extract all groups at the next level
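
The two steps can also be run in isolation, without jQuery or a page, which makes the intermediate annotated string easy to inspect. The following is a minimal sketch under that assumption; the names `MARK`, `annotate` and `extractByLevel` are illustrative and do not appear in the answer's code. It keeps `~` as the marker so the output matches the examples above.

```javascript
// Marker wrapped around each nesting level; '~' matches the explanation,
// a non-printable character such as '\x01' is safer for real input.
const MARK = '~';

// Step 1: prefix every unescaped parenthesis with its nesting level.
function annotate(pattern) {
    let level = 0;
    return pattern.replace(/(?<!\\)[()]/g, m =>
        m === '('
            ? MARK + (level++) + MARK + m   // opening: tag, then go deeper
            : MARK + (--level) + MARK + m); // closing: come back up, then tag
}

// Step 2: collect groups level by level, recursing into each match.
function extractByLevel(annotated) {
    const groups = [];
    const strip = new RegExp(MARK + '\\d+' + MARK, 'g');
    const levelRegex = n =>
        new RegExp(MARK + n + MARK + '\\((.*?)' + MARK + n + MARK + '\\)', 'g');

    function walk(text, n) {
        for (const m of text.matchAll(levelRegex(n))) {
            groups.push(m[1].replace(strip, '')); // group text without markers
            walk(m[1], n + 1);                    // groups nested one level down
        }
    }
    walk(annotated, 0);
    return groups;
}

const input = String.raw`(\w+) count: (\d+), number: (-?\d+(\/\d+)?)`;
const annotated = annotate(input);
console.log(annotated);
// ~0~(\w+~0~) count: ~0~(\d+~0~), number: ~0~(-?\d+~1~(\/\d+~1~)?~0~)
console.log(extractByLevel(annotated));
```

Like the original, this sketch relies on a lookbehind for a single backslash, so a pattern such as `\\(` (an escaped backslash followed by a real group) or a parenthesis inside a character class would be annotated incorrectly.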


Source: https://stackoverflow.com/questions/66127686/how-to-parse-a-regular-expression
