F# How to tokenise user input: separating numbers, units, words?

扶醉桌前 提交于 2019-12-21 20:26:40

问题


I am fairly new to F#, but have spent the last few weeks reading reference materials. I wish to process a user-supplied input string, identifying and separating the constituent elements. For example, for this input:

XYZ Hotel: 6 nights at 220EUR / night plus 17.5% tax

the output should resemble something like a list of tuples:

[ ("XYZ", Word); ("Hotel:", Word);
("6", Number); ("nights", Word);
("at", Operator); ("220", Number);
("EUR", CurrencyCode); ("/", Operator); ("night", Word);
("plus", Operator); ("17.5", Number); ("%", PerCent); ("tax", Word) ]

Since I'm dealing with user input, it could be anything. Thus, expecting users to comply with a grammar is out of the question. I want to identify the numbers (could be integers, floats, negative...), the units of measure (optional, but could include SI or Imperial physical units, currency codes, counts such as "night/s" in my example), mathematical operators (as math symbols or as words including "at" "per", "of", "discount", etc), and all other words.

I have the impression that I should use active pattern matching -- is that correct? -- but I'm not exactly sure how to start. Any pointers to appropriate reference material or similar examples would be great.


回答1:


I put together an example using the FParsec library. The example is not robust at all but it gives a pretty good picture of how to use FParsec.

type Element =
| Word of string
| Number of string
| Operator of string
| CurrencyCode of string
| PerCent  of string    

let parsePerCent state =
    (parse {
        let! r = pstring "%"
        return PerCent r
    }) state

let currencyCodes = [|
    pstring "EUR"
|]

let parseCurrencyCode state =
    (parse {
        let! r = choice currencyCodes
        return CurrencyCode r
    }) state

let operators = [|
    pstring "at"
    pstring "/"
|]

let parseOperator state =
    (parse {
        let! r = choice operators
        return Operator r
    }) state

let parseNumber state =
    (parse {
        let! e1 = many1Chars digit
        let! r = opt (pchar '.')
        let! e2 = manyChars digit
        return Number (e1 + (if r.IsSome then "." else "") + e2)
    }) state

let parseWord state =
    (parse {
        let! r = many1Chars (letter <|> pchar ':')
        return Word r
    }) state

let elements = [| 
    parseOperator
    parseCurrencyCode
    parseWord
    parseNumber 
    parsePerCent
|]

let parseElement state =
    (parse {
        do! spaces
        let! r = choice elements
        do! spaces
        return r
    }) state

let parseElements state =
    manyTill parseElement eof state

let parse (input:string) =
    let result = run parseElements input 
    match result with
    | Success (v, _, _) -> v
    | Failure (m, _, _) -> failwith m



回答2:


It sounds like what you really want is just a lexer. A good alternative to FSParsec would be FSLex. (Good intro tutorial, albiet somewhat dated, can be found on my old blog here.) Using FSLex you can take your input text:

XYZ Hotel: 6 nights at 220EUR / night plus 17.5% tax

And get it properly tokenized into something like:

 [ Word("XYZ"); Hotel; Int(6); Word("nights"); Word("at"); Int(220); EUR; ... ]

The next step, once you have an List of tokens, is to do some form of pattern matching / analysis to extract semantic information (which I assume is what you are really after). With the normalized token stream, it should be as simple as:

let rec processTokenList tokens = 
    match tokens with
    | Float(x) :: Keyword("EUR") :: rest  -> // Dollar amount x
    | Word(x) :: Keyword("Hotel") :: rest -> // Hotel x
    | hd :: rest -> // Couldn't find anything interesting...
                    processTokenList rest

That should at least get you started. But note that as your input gets more 'formal', so will the usefulness of your lexing. (And if you only accept a very specific input, then you can use a proper parser and be done with it!)



来源:https://stackoverflow.com/questions/4653820/f-how-to-tokenise-user-input-separating-numbers-units-words

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!