Notepad++ and regex with removal of unmatching sections

Submitted by 做~自己de王妃 on 2020-01-22 03:05:24

Question


I'm using this regex string I've crafted...

['"]{1}\w+@\w+\.{1}\w\w\w?['"]

to hunt for email addresses enclosed in quotes in an old, badly formatted legacy file.

Example:

 ADF325@#%jkdaf-@#%j-afd(#$w52'leroyjenkins@myguild.edu'@#%kladfjkla-235dsaf-'thisemail@example.com'2l35jk2dz-dl1jkozf-afajelj'gooselick@somebodyspastries.co'l2#%Jk23l5jlafafljewo8972509357
j2k3l5jadfjeljwfoobar'foobar@barfoo.foo'jk23j-zv8902354jlfa
('352lj53k2ljkumquat'fakeemail@realemail.wtf')lajflsdf
etc.

The regex is working beautifully for me... except for one thing. I want to replace everything that -doesn't- match with whitespace so I can format the data for migration to the proper DB. How can I delete everything that doesn't match (and preferably, throw a newline between each match)?


Answer 1:


Use

['"](\w+@\w+\.\w{2,3})['"]|(?s).

and replace with (?{1}\1\n).

A slightly faster equivalent:

['"](\w+@\w+\.\w{2,3})['"]|[^'"]*(?:['"](?!\w+@\w+\.\w{2,3}['"])[^'"]*)*
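One way to see why the second pattern can be faster is to count matches: the `(?s).` branch discards junk one character at a time, while the unrolled branch swallows a whole run of junk in a single match. The sketch below is a Python re-creation, not Notepad++ itself; the names `PER_CHAR`, `UNROLLED` and `emails_and_match_count` are made up for illustration, and `re.DOTALL` stands in for the inline `(?s)`.

```python
import re

# Per-character variant: every discarded character is its own match.
PER_CHAR = re.compile(r"""['"](\w+@\w+\.\w{2,3})['"]|.""", re.DOTALL)

# Unrolled variant: one match swallows a whole run of non-email text.
UNROLLED = re.compile(
    r"""['"](\w+@\w+\.\w{2,3})['"]|[^'"]*(?:['"](?!\w+@\w+\.\w{2,3}['"])[^'"]*)*"""
)


def emails_and_match_count(pattern, text):
    """Return the captured addresses and how many matches the scan needed."""
    emails, count = [], 0
    for match in pattern.finditer(text):
        count += 1
        if match.group(1) is not None:
            emails.append(match.group(1))
    return emails, count


sample = "ADF325@#%jkdaf'leroyjenkins@myguild.edu'@#%kladfjkla-'thisemail@example.com'2l35jk"
per_char_emails, per_char_count = emails_and_match_count(PER_CHAR, sample)
unrolled_emails, unrolled_count = emails_and_match_count(UNROLLED, sample)
print(per_char_emails == unrolled_emails, per_char_count, unrolled_count)
```

Both scans capture the same addresses on this sample; the unrolled pattern just needs far fewer matches to get there. How much that matters in Notepad++ depends on the file.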

Details

  • ['"] - a quote
  • (\w+@\w+\.\w{2,3}) - Group 1: 1+ word chars, @, 1+ word chars, . and then 2 or 3 word chars
  • ['"] - a quote
  • | - or
  • (?s). - any single char.

If Group 1 matches, the conditional replacement (?{1}\1\n) replaces the match with the Group 1 contents followed by a newline (\1\n). If (?s). matches instead, the match is removed.
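The same keep-or-drop logic can be sketched outside Notepad++. The Python version below is only an illustration under a few assumptions: the function name `keep_quoted_emails` is hypothetical, a callback plays the role of the conditional replacement `(?{1}\1\n)`, and `re.DOTALL` replaces the inline `(?s)`, which recent Python versions only accept at the start of a pattern.

```python
import re

# Python re-creation of ['"](\w+@\w+\.\w{2,3})['"]|(?s).
PATTERN = re.compile(r"""['"](\w+@\w+\.\w{2,3})['"]|.""", re.DOTALL)


def keep_quoted_emails(text: str) -> str:
    """Keep each quoted address on its own line and drop everything else."""

    def replace(match):
        # Group 1 took part in the match: keep the address, add a newline.
        if match.group(1) is not None:
            return match.group(1) + "\n"
        # Otherwise the single-character branch matched: drop that character.
        return ""

    return PATTERN.sub(replace, text)


sample = (
    "j2k3l5jadfjeljwfoobar'foobar@barfoo.foo'jk23j-zv8902354jlfa\n"
    "('352lj53k2ljkumquat'fakeemail@realemail.wtf')lajflsdf\n"
)
print(keep_quoted_emails(sample), end="")
```

On the two sample lines from the question this prints `foobar@barfoo.foo` and `fakeemail@realemail.wtf`, one per line.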




Answer 2:


When you have to deal with large files, the way to process them is to avoid loading them entirely and to read them as a stream instead. You can't do that with Notepad++, but you can with a scripting language like PHP. If you want to change or extract something in particular while reading a file as a stream, you can write a user-defined stream filter:

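class EmailFilter extends php_user_filter
{
    // Keeps only the quoted email addresses found in each bucket, one per line.
    // Caveat: an address split across two buckets is not detected.
    public function filter($in, $out, &$consumed, $closing)
    {
        $passed = false;
        while ( $bucket = stream_bucket_make_writeable($in) ) {
            $consumed += $bucket->datalen;
            if ( preg_match_all('~\'\K\w+@\w+\.\w{2,3}(?=\')|"\K\w+@\w+\.\w{2,3}(?=")~S', $bucket->data, $matches) ) {
                // End with a newline so matches from consecutive buckets don't run together.
                $bucket->data = implode("\n", $matches[0]) . "\n";
                stream_bucket_append($out, $bucket);
                $passed = true;
            }
            // Buckets without a match are dropped instead of being passed through unchanged.
        }
        return $passed ? PSFS_PASS_ON : PSFS_FEED_ME;
    }
}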

stream_filter_register('email_filter', 'EmailFilter');
$handle = fopen('php://filter/read=email_filter/resource=yourfile.txt', 'rb');

while (feof($handle) !== true) {
    echo fgets($handle); 
}

fclose($handle);

With this kind of approach, nothing stops you from inserting the email addresses into your table in batches of five, ten, twenty, or any size you want. The goal is to avoid loading the whole file into memory.
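The batching idea can be sketched in a few lines. The version below is in Python rather than PHP and is only an illustration: `iter_emails` and `batched` are hypothetical helper names, the pattern is the one from Answer 1, and the file is assumed to have line breaks often enough that reading it line by line keeps memory use small. Because the pattern cannot match a newline, an address is never split across two lines.

```python
import re
from itertools import islice

# Pattern from Answer 1; group 1 is the address without its quotes.
QUOTED_EMAIL = re.compile(r"""['"](\w+@\w+\.\w{2,3})['"]""")


def iter_emails(lines):
    """Yield quoted addresses lazily from any iterable of lines (e.g. an open file)."""
    for line in lines:
        yield from QUOTED_EMAIL.findall(line)


def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical usage: the file object is iterated line by line, so the whole
# file is never held in memory at once.
#
# with open("yourfile.txt", encoding="utf-8", errors="replace") as handle:
#     for batch in batched(iter_emails(handle), 10):
#         ...  # insert this batch into the table
```

Each batch is a plain list, so it can be handed to whatever bulk-insert call the target database offers.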

(more examples in O'Reilly Modern PHP)



Source: https://stackoverflow.com/questions/47722896/notepad-and-regex-with-removal-of-unmatching-sections
