How to parse heterogeneous markup with PHP?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-24 00:07:23

Question


I have a string with custom markup for saving songs with chords, tablatures, notes, etc. It contains

things in various brackets: \[.+?\], \[\[.+?\]\], \(.+?\)
arrows: <-{3,}>, \-{3,}>, <\-{3,}
and so on...

Sample text might be

Text Text [something]
--->
Text (something 021213)

Now I wish to parse the markup into an array of tokens, i.e. objects of the corresponding classes, which would look like this (matched parts in parentheses):

ParsedBlock_Text ("Text Text ")
ParsedBlock_Chord ("something")
ParsedBlock_Text (" ")
ParsedBlock_NewColumn
ParsedBlock_Text (" text ")
ParsedBlock_ChordDiagram ("something 021213")

I know how to match them, but either I match each pattern separately and save the offsets so that I can sort the array properly, or I match them all at once and then don't know which pattern matched.

Thanks, MK


Answer 1:


Assuming you do not try to nest these structures, this will tokenize your text:

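function ParseText($text) {
    // One alternation with a named group per token type. The more specific
    // pattern ([[...]]) comes first so it is not mistaken for [...].
    $re = '/\[\[(?P<DoubleBracket>.*?)]]|\[(?P<Bracket>.*?)]|\((?P<Paren>.*?)\)|(?<Arrow><---+>?|---+>)/s';
    $keys = array('DoubleBracket', 'Bracket', 'Paren', 'Arrow');
    $result = array();
    $lastStart = 0; // offset just past the previous match
    if (preg_match_all($re, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE)) {
        foreach ($matches as $match) {
            // Plain text between the previous match and this one.
            $start = $match[0][1];
            $prefix = substr($text, $lastStart, $start - $lastStart);
            $lastStart = $start + strlen($match[0][0]);
            // Whitespace-only runs are dropped; other text is trimmed.
            if ($prefix != '' && !ctype_space($prefix)) {
                $result[] = array('Text', trim($prefix));
            }
            // With PREG_OFFSET_CAPTURE, a group that did not take part in the
            // match is either missing or has offset -1, so the first group
            // with a valid offset tells us which alternative matched.
            foreach ($keys as $key) {
                if (isset($match[$key]) && $match[$key][1] >= 0) {
                    $result[] = array($key, $match[$key][0]);
                    break;
                }
            }
        }
    }
    // Trailing text after the last match.
    $prefix = substr($text, $lastStart);
    if ($prefix != '' && !ctype_space($prefix)) {
        $result[] = array('Text', trim($prefix));
    }
    return $result;
}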

Example:

$mytext = <<<'EOT'
Text Text [something]
--->
Text (something 021213)
More Text
EOT;

$parsed = ParseText($mytext);
foreach ($parsed as $item) {
    print_r($item);
}

Output:

Array
(
    [0] => Text
    [1] => Text Text
)
Array
(
    [0] => Bracket
    [1] => something
)
Array
(
    [0] => Arrow
    [1] => --->
)
Array
(
    [0] => Text
    [1] => Text
)
Array
(
    [0] => Paren
    [1] => something 021213
)
Array
(
    [0] => Text
    [1] => More Text
)

http://ideone.com/kJQrBw
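
The question asked for objects rather than arrays. A minimal sketch of that last step, assuming token pairs in the shape shown in the output above: the class names come from the question, but the base class, the `tokensToBlocks()` helper and the type-to-class mapping are hypothetical and only illustrate the idea. `DoubleBracket` is left unmapped because the question does not say what it stands for.

```php
<?php
// Illustrative base class that stores the captured text.
abstract class ParsedBlock {
    public $content;
    public function __construct($content = '') {
        $this->content = $content;
    }
}
class ParsedBlock_Text extends ParsedBlock {}
class ParsedBlock_Chord extends ParsedBlock {}
class ParsedBlock_ChordDiagram extends ParsedBlock {}
class ParsedBlock_NewColumn extends ParsedBlock {}

// Convert array(type, content) pairs into block objects.
function tokensToBlocks(array $tokens) {
    // Assumed mapping from token type to class; adjust to the real markup.
    $classMap = array(
        'Text'    => 'ParsedBlock_Text',
        'Bracket' => 'ParsedBlock_Chord',
        'Paren'   => 'ParsedBlock_ChordDiagram',
        'Arrow'   => 'ParsedBlock_NewColumn',
    );
    $blocks = array();
    foreach ($tokens as $token) {
        list($type, $content) = $token;
        if (!isset($classMap[$type])) {
            throw new InvalidArgumentException("Unknown token type: $type");
        }
        $class = $classMap[$type];
        $blocks[] = new $class($content);
    }
    return $blocks;
}

// Sample tokens in the same shape as the output above.
$tokens = array(
    array('Text', 'Text Text'),
    array('Bracket', 'something'),
    array('Arrow', '--->'),
    array('Paren', 'something 021213'),
);
$blocks = tokensToBlocks($tokens);
foreach ($blocks as $block) {
    echo get_class($block), ': ', $block->content, "\n";
}
```

Keeping the mapping in one array means a new token type only needs a new regex group and one extra entry here.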

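If you add more patterns to the regex, put the longer, more specific alternatives first (for example, \[\[...]] before \[...]). Alternatives are tried from left to right, so a shorter pattern listed first would otherwise match the text and label it with the wrong type.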
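
The effect of the ordering can be checked with a small experiment. This is only a sketch: `firstMatchedGroup()` is a hypothetical helper, and the two patterns are cut-down versions of the regex above that differ only in the order of their alternatives.

```php
<?php
// Return the name of the first named group that took part in the match.
function firstMatchedGroup($re, $text) {
    if (!preg_match($re, $text, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    foreach ($m as $key => $capture) {
        // Named groups have string keys; unmatched groups have offset -1.
        if (is_string($key) && $capture[1] >= 0) {
            return $key;
        }
    }
    return null;
}

$text = '[[intro]]';
// Specific alternative first, as in the answer.
$goodOrder = '/\[\[(?P<DoubleBracket>.*?)]]|\[(?P<Bracket>.*?)]/s';
// Same alternatives, but the shorter one is listed first.
$badOrder = '/\[(?P<Bracket>.*?)]|\[\[(?P<DoubleBracket>.*?)]]/s';

echo firstMatchedGroup($goodOrder, $text), "\n"; // DoubleBracket
echo firstMatchedGroup($badOrder, $text), "\n";  // Bracket
```

With the second pattern the single-bracket alternative succeeds at the first `[` and captures `[intro`, so the double-bracket alternative is never tried.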



Source: https://stackoverflow.com/questions/16358582/how-to-parse-heterogenous-markup-with-php
