Parsec ignore everything except one fragment

Question

I need to parse a single select tag in a poorly formed HTML document (so XML-based parsers don't work).

I think I know how to use parsec to parse the select tag once I get there, but how do I skip all the stuff before and after that tag?

Example:

<html>
   random content with lots of tags...
   <select id=something title="whatever"><option value=1 selected>1. First<option value=2>2. Second</select>
   more random content...
</html>

That's actually what the HTML looks like in the select tag. How would I do this with Parsec, or would you recommend I use a different library?

Answer 1:

Here's how I'd do it:

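solution = try (do {
  ; string "<tag-name"
  ; x <- ⟦insertOptionsParserHere⟧
  ; char '>'
  ; return x
  }) <|> (anyChar >> solution)

This recursively consumes characters until it reaches an opening <tag-name, at which point it runs your parser and leaves the recursion after consuming the closing '>'. The try is needed because string consumes input on a partial match (for example on "<table" when looking for "<tag-name"), and without it <|> would not fall through to the anyChar branch.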
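A minimal, self-contained sketch of the idea above using parsec. The names selectTag, skipTo, and sample are illustrative assumptions rather than part of the answer, and selectTag assumes that no attribute value contains '>'.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical fragment parser: skips the attributes of the opening
-- <select ...> tag and returns the raw text up to </select>.
-- Assumes no attribute value contains '>'.
selectTag :: Parser String
selectTag = do
  _ <- string "<select"
  _ <- manyTill anyChar (char '>')
  manyTill anyChar (try (string "</select>"))

-- Drop one character at a time until the given parser succeeds.
-- `try` rewinds the input when p fails after consuming something,
-- so the anyChar branch can take over.
skipTo :: Parser a -> Parser a
skipTo p = try p <|> (anyChar >> skipTo p)

-- The document from the question.
sample :: String
sample = unlines
  [ "<html>"
  , "   random content with lots of tags..."
  , "   <select id=something title=\"whatever\"><option value=1 selected>1. First<option value=2>2. Second</select>"
  , "   more random content..."
  , "</html>"
  ]
```

Loading this in GHCi and evaluating parse (skipTo selectTag) "sample" sample should give a Right holding the option markup between the opening and closing select tags. Everything after </select> is left unconsumed, which is fine as long as the top-level parser does not demand eof.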

Note that there may be whitespace before and after the tag. To handle that, we could do this, provided your parser consumes the tags itself:

solution = try ⟦insertHtmlParserHere⟧ <|> (anyChar >> solution)

To be clear that would mean that ⟦insertHtmlParserHere⟧ would have this kind of structure:

⟦insertHtmlParserHere⟧ = do
   string "<tag-name"
   ⋯
   char '>'

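As a side note, if you want to capture every matching tag, you can use many. Wrap solution in try so that the final attempt, which consumes the rest of the input before failing, does not make many fail:

everyTag = many (try solution)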
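A self-contained sketch of collecting every match with parsec. The helper names selectTag and skipTo are illustrative assumptions, and everyTag is written here as a function of the fragment parser. As before, selectTag assumes no attribute value contains '>'.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical fragment parser: the raw text between <select ...> and </select>.
selectTag :: Parser String
selectTag = do
  _ <- string "<select"
  _ <- manyTill anyChar (char '>')
  manyTill anyChar (try (string "</select>"))

-- Drop characters until the given parser succeeds.
skipTo :: Parser a -> Parser a
skipTo p = try p <|> (anyChar >> skipTo p)

-- Collect every fragment in the input. After the last match, skipTo
-- consumes the remaining characters and then fails at end of input;
-- the outer `try` rewinds that attempt so `many` stops with the
-- results gathered so far instead of reporting an error.
everyTag :: Parser a -> Parser [a]
everyTag p = many (try (skipTo p))
```

For example, parse (everyTag selectTag) "" "a<select>1</select>b<select id=x>2</select>c" should yield the two fragments "1" and "2".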

Answer 2:

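You can try a regex and capture the select tag:

import Text.Regex.Posix

-- Returns one list per match: the whole match followed by its capture groups.
-- The .* is greedy: if several select tags fall within one match, it runs
-- from the first <select to the last </select>.
getOptionTags :: String -> [[String]]
getOptionTags content = content =~ "(<select.*</select>)"

main :: IO ()
main = do
    s <- readFile "in"
    -- head is partial: this fails if the file contains no select tag.
    print . head . head $ getOptionTags s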

Answer 3:

You can use Replace.Megaparsec.findAll to find the substrings in a document which match a parser.

import Data.Void (Void)
import Replace.Megaparsec
import Text.Megaparsec

let parseSelect :: Parsec Void String String
    parseSelect = do
        chunk "<select"
        manyTill anySingle $ chunk "</select>"

let input = "<html>\n   random content with lots of tags...\n   <select id=something title=\"whatever\"><option value=1 selected>1. First<option value=2>2. Second</select>\n   more random content...\n</html>"

>>> parseTest (findAll parseSelect) input
[Left "<html>\n   random content with lots of tags...\n   "
,Right "<select id=something title=\"whatever\"><option value=1 selected>1. First<option value=2>2. Second</select>"
,Left "\n   more random content...\n</html>"
]

Source: https://stackoverflow.com/questions/29546940/parsec-ignore-everything-except-one-fragment

Tags

html

haskell

parsec