Question
I'm using this regex string I've crafted:
['"]{1}\w+@\w+\.{1}\w\w\w?['"]
to hunt for email addresses enclosed in quotes in an old, badly formatted legacy file.
Example:
ADF325@#%jkdaf-@#%j-afd(#$w52'leroyjenkins@myguild.edu'@#%kladfjkla-235dsaf-'thisemail@example.com'2l35jk2dz-dl1jkozf-afajelj'gooselick@somebodyspastries.co'l2#%Jk23l5jlafafljewo8972509357
j2k3l5jadfjeljwfoobar'foobar@barfoo.foo'jk23j-zv8902354jlfa
('352lj53k2ljkumquat'fakeemail@realemail.wtf')lajflsdf
etc.
The regex is working beautifully for me... except for one thing. I want to replace everything that -doesn't- match with whitespace so I can work on formatting this to migrate this to the proper DB. How can I delete everything that doesn't match (and preferably, throw a newline between each match)?
Answer 1:
Use
['"](\w+@\w+\.\w{2,3})['"]|(?s).
and replace with (?{1}\1\n).
A slightly faster equivalent:
['"](\w+@\w+\.\w{2,3})['"]|[^'"]*(?:['"](?!\w+@\w+\.\w{2,3}['"])[^'"]*)*
Details
['"] - a quote
(\w+@\w+\.\w{2,3}) - Group 1: 1+ word chars, @, 1+ word chars, . and then 2 or 3 word chars
['"] - a quote
| - or
(?s). - any single char.
If Group 1 matches ((?{1}), the match is replaced with the Group 1 contents followed by a newline (\1\n). If (?s). matches, the match gets removed.
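For anyone who wants to check the logic outside Notepad++, here is a minimal Python sketch of the same "keep group 1, drop everything else" idea. It is not the Notepad++ replacement syntax: Python has no (?{1}...) conditional, so a replacement function plays that role, and re.DOTALL stands in for the inline (?s). The name keep_emails is made up for this illustration.

```python
import re

# Same alternation as above: a quoted email (captured in group 1) or any single char.
# re.DOTALL replaces the inline (?s) so that "." also consumes newlines.
PATTERN = re.compile(r"""['"](\w+@\w+\.\w{2,3})['"]|.""", re.DOTALL)


def keep_emails(text: str) -> str:
    """Return only the quoted emails found in text, one per line."""
    def repl(m):
        # Group 1 is set only when the email alternative matched;
        # otherwise a single junk character matched and is dropped.
        return m.group(1) + "\n" if m.group(1) is not None else ""
    return PATTERN.sub(repl, text)


# Two of the sample lines from the question.
sample = (
    "j2k3l5jadfjeljwfoobar'foobar@barfoo.foo'jk23j-zv8902354jlfa\n"
    "('352lj53k2ljkumquat'fakeemail@realemail.wtf')lajflsdf"
)
print(keep_emails(sample), end="")
# prints:
# foobar@barfoo.foo
# fakeemail@realemail.wtf
```

Because the email alternative comes first, it is tried at every quote before the catch-all single character gets a chance, which is why a stray quote such as the one before kumquat does not hide the address that follows it.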
Answer 2:
When you have to deal with large files, the way to process them is to avoid loading them entirely into memory and to read them as a stream instead. You can't do that in Notepad++, but you can with a scripting language such as PHP. If you want to change or extract something in particular while reading a file as a stream, you can write a user-defined stream filter:
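class EmailFilter extends php_user_filter
{
    // Replace each chunk (bucket) of the stream with the quoted emails it contains.
    // Caveat: an address split across two buckets will not be matched.
    public function filter($in, $out, &$consumed, $closing): int
    {
        while ( $bucket = stream_bucket_make_writeable($in) ) {
            $consumed += $bucket->datalen;
            if ( preg_match_all('~\'\K\w+@\w+\.\w{2,3}(?=\')|"\K\w+@\w+\.\w{2,3}(?=")~S', $bucket->data, $matches) ) {
                // One email per line; the trailing newline keeps buckets from running together.
                $bucket->data = implode("\n", $matches[0]) . "\n";
            } else {
                // No email in this chunk: emit nothing rather than the raw data.
                $bucket->data = '';
            }
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}
stream_filter_register('email_filter', 'EmailFilter');
$handle = fopen('php://filter/read=email_filter/resource=yourfile.txt', 'rb');
// fgets() returns false at end of file, so no separate feof() check is needed.
while (($line = fgets($handle)) !== false) {
    echo $line;
}
fclose($handle);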
With this kind of approach, nothing prevents you from inserting the emails into your table five at a time, ten at a time, twenty at a time, or in batches of whatever size you want. The goal is to avoid loading the whole file into memory.
(More examples can be found in O'Reilly's Modern PHP.)
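The same streaming-plus-batching idea can be sketched in Python for comparison. This is only an illustration under a few assumptions: the file has line breaks (so reading line by line keeps memory use small), an address never spans two lines, and the opening and closing quotes are of the same kind. The names iter_emails and batched are made up for this sketch, and the database insert is left as a comment.

```python
import re
from itertools import islice
from typing import Iterable, Iterator, List

# A quote, the email (group 2), then the same kind of quote again (\1).
EMAIL = re.compile(r"""(['"])(\w+@\w+\.\w{2,3})\1""")


def iter_emails(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield every quoted email, reading one line at a time."""
    for line in lines:
        for m in EMAIL.finditer(line):
            yield m.group(2)


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a lazy stream into lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Demo on in-memory lines; with a real file, pass the open file object instead,
# e.g. `with open("yourfile.txt") as fh: ... iter_emails(fh) ...`.
demo_lines = [
    "j2k3l5jadfjeljwfoobar'foobar@barfoo.foo'jk23j-zv8902354jlfa\n",
    "('352lj53k2ljkumquat'fakeemail@realemail.wtf')lajflsdf\n",
]
for batch in batched(iter_emails(demo_lines), 10):
    # A real migration would insert `batch` into the table here.
    print(batch)
```

Since both functions are generators, only one line and one batch are held in memory at any moment, which is the point the answer makes about not loading the whole file.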
Source: https://stackoverflow.com/questions/47722896/notepad-and-regex-with-removal-of-unmatching-sections