Question
I can't seem to find decent documentation on Haskell's POSIX regex implementation, specifically the module Text.Regex.Posix.
Can anyone point me in the right direction for doing multiline matching on a string?
A snippet for the curious:
> extractToken body = body =~ "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>" :: String
I'm trying to extract the source of Wikipedia pages, but this method clearly fails when the content spans more than one line.
Answer 1:
You may need to import Text.Regex.Base.RegexLike for access to makeRegexOpts and friends.
extractToken body = match regex body where
    regex = makeRegexOpts (defaultCompOpt - compNewline) defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"
Since Text.Regex.Posix's defaultCompOpt = compExtended + compNewline, that works out equivalently as
extractToken body = match regex body where
    regex = makeRegexOpts compExtended defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"
To pull out just the first group, ask match for a different result type (these are instances of RegexContext). One possibility is
extractToken body = head groups where
    (preMatch, inMatch, postMatch, groups) =
        match regex body :: (String, String, String, [String])
    regex = makeRegexOpts compExtended defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"
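For reference, here is a self-contained sketch of the same idea with the imports spelled out. It is only one way to arrange it: the names textareaRegex and sampleBody are made up for illustration, and it returns a Maybe instead of calling head, since head would fail on an empty group list when nothing matches.

```haskell
import Text.Regex.Base.RegexLike
import Text.Regex.Posix

-- Compiled with compExtended only (no compNewline), so '.' and [^>]
-- are allowed to match '\n' as well.
textareaRegex :: Regex
textareaRegex =
  makeRegexOpts compExtended defaultExecOpt
    "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

-- First capture group, or Nothing when the pattern does not match.
extractToken :: String -> Maybe String
extractToken body =
  case match textareaRegex body :: (String, String, String, [String]) of
    (_, _, _, firstGroup : _) -> Just firstGroup
    _                         -> Nothing

-- Hypothetical two-line input, invented for this example.
sampleBody :: String
sampleBody =
  "<textarea id=\"wpTextbox1\" rows=\"2\">line one\nline two</textarea>"
```

Loading this in GHCi and evaluating extractToken sampleBody should give back both lines of the textarea content. Keep in mind that POSIX matching is leftmost-longest, so (.*) runs up to the last </textarea> in the input.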
Answer 2:
If you want anything more flexible, or with better performance, than POSIX regexes, you may need to use the PCRE backend instead.
pcre-light and regex-pcre are both fine.
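As a rough sketch of what that could look like with regex-pcre (assuming the package and the PCRE C library are installed): compDotAll lets '.' match newlines, and PCRE supports the non-greedy (.*?), which POSIX does not. The names pcreRegex and extractTokenPcre are made up for illustration.

```haskell
import Text.Regex.PCRE

-- compDotAll makes '.' match '\n'; (.*?) stops at the first </textarea>.
pcreRegex :: Regex
pcreRegex =
  makeRegexOpts compDotAll defaultExecOpt
    "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*?)</textarea>"

-- First capture group, or Nothing when the pattern does not match.
extractTokenPcre :: String -> Maybe String
extractTokenPcre body =
  case match pcreRegex body :: (String, String, String, [String]) of
    (_, _, _, firstGroup : _) -> Just firstGroup
    _                         -> Nothing
```

The interface is the same regex-base one as in the POSIX version, so switching backends mostly comes down to the import and the compile options.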
Answer 3:
I solved this particular case by matching
((.*)|\n*)*
This may not always work, depending on your expression. The above solution is probably the best way to go if you're able to use it.
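A sketch of this kind of workaround, using the simpler variant (.|\n)* rather than the exact expression above, and keeping the default options of =~ (so compNewline stays on and the newline has to be matched explicitly). The name extractTokenAlt is made up for illustration, and the extra inner group means the content ends up in the first of two groups.

```haskell
import Text.Regex.Posix

-- With the default options '.' does not match '\n', so the pattern
-- alternates between "any character" and a literal newline.
extractTokenAlt :: String -> Maybe String
extractTokenAlt body =
  case body =~ pat :: (String, String, String, [String]) of
    (_, _, _, outer : _) -> Just outer   -- outer group holds the content
    _                    -> Nothing
  where
    pat = "<textarea[^>]*id=\"wpTextbox1\"[^>]*>((.|\n)*)</textarea>" :: String
```

Note that with compNewline still in effect, [^>]* does not cross a newline either, so this variant assumes the opening tag sits on a single line.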
Source: https://stackoverflow.com/questions/1028764/multiline-matching-in-haskell-posix