Question
I need to parse a single select tag in a poorly formed HTML document (so XML-based parsers don't work).
I think I know how to use Parsec to parse the select tag once I get there, but how do I skip everything before and after that tag?
Example:
<html>
random content with lots of tags...
<select id=something title="whatever"><option value=1 selected>1. First<option value=2>2. Second</select>
more random content...
</html>
That's actually what the HTML looks like in the select tag. How would I do this with Parsec, or would you recommend I use a different library?
Answer 1:
Here's how I'd do it:
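-- `try` lets the first branch backtrack if it fails after consuming input,
-- so the second branch can still run from the same position.
solution = try (do {
  ; string "<tag-name"
  ; x <- ⟦insertOptionsParserHere⟧
  ; char '>'
  ; return x
  }) <|> (anyChar >> solution)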
This recursively consumes characters until it reaches the opening <tag-name of the tag, runs your parser there, and leaves the recursion once the closing > has been consumed.
Note that there may be whitespace before and after the tag. To handle that, we could do this, provided your parser consumes the tags itself:
solution = try ⟦insertHtmlParserHere⟧ <|> (anyChar >> solution)
To be clear, that would mean ⟦insertHtmlParserHere⟧ has this kind of structure:
⟦insertHtmlParserHere⟧ = do
  string "<tag-name"
  ⋯
  char '>'
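As a side note, if you want to capture every matching tag, you can use many. Wrap solution in try so that running out of input after the last tag ends the list instead of failing the whole parse:
everyTag = many (try solution)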
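To make the recursion above concrete, here is a minimal sketch for the select tag from the question. The names selectTag, skipTo and findSelect are hypothetical, and selectTag is only a stand-in that returns the raw text between "<select" and "</select>"; a real parser for the attributes and options would take its place.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-in for the tag parser: matches "<select", then
-- returns everything up to (but not including) the closing "</select>".
selectTag :: Parser String
selectTag = do
  _ <- string "<select"
  manyTill anyChar (try (string "</select>"))

-- Try the given parser at the current position; if it fails, drop one
-- character and try again. `try` rewinds the input when the parser
-- fails part-way through a match.
skipTo :: Parser a -> Parser a
skipTo p = try p <|> (anyChar >> skipTo p)

-- Run the skipping parser over a whole document.
findSelect :: String -> Either ParseError String
findSelect = parse (skipTo selectTag) "input"
```

Calling findSelect on the example document should give a Right holding the text between the opening "<select" and the closing "</select>". A document with no select tag gives a Left, because skipTo eventually reaches the end of input.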
Answer 2:
You can try using a regex to capture the select tag:
import Text.Regex.Posix

-- POSIX matching is leftmost-longest, so the match extends to the
-- last "</select>" it can reach.
getOptionTags :: String -> [[String]]
getOptionTags content = content =~ "(<select.*</select>)"

main :: IO ()
main = do
  s <- readFile "in"
  print . head . head $ getOptionTags s
Answer 3:
You can use Replace.Megaparsec.findAll to find the substrings in a document which match a parser.
import Data.Void (Void)
import Replace.Megaparsec
import Text.Megaparsec

let parseSelect :: Parsec Void String String
    parseSelect = do
      chunk "<select"
      manyTill anySingle $ chunk "</select>"
let input = "<html>\n random content with lots of tags...\n <select id=something title=\"whatever\"><option value=1 selected>1. First<option value=2>2. Second</select>\n more random content...\n</html>"
>>> parseTest (findAll parseSelect) input
[Left "<html>\n random content with lots of tags...\n "
,Right "<select id=something title=\"whatever\"><option value=1 selected>1. First<option value=2>2. Second</select>"
,Left "\n more random content...\n</html>"
]
Source: https://stackoverflow.com/questions/29546940/parsec-ignore-everything-except-one-fragment