Grammatical inference of regular expressions for a given finite list of representative strings?

一生所求 2020-12-03 01:30

I'm working on analyzing a large public dataset with lots of verbose human-readable strings that were clearly generated by some regular (in the formal language theory sense

2 answers
  •  萌比男神i
    2020-12-03 01:55

    The only thing I can suggest is to play around with NLTK (the Natural Language Toolkit for Python) a bit and see if it can at least recognize recurring patterns.

    Another thing you may look into is MALLET (a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, etc.).

    Perl has something called LinkParser, but it seems to require you to provide a representation of the actual grammar (on the other hand, it comes with a large set of different models, so maybe it could be shoehorned into helping you sort your samples).

    GATE may allow you to create examples from a subset of records in your corpus and possibly reverse engineer the grammar from those.

    Finally, have a look at the CRAN repository for text-specific packages.
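
Before reaching for any of the toolkits above, a quick way to check whether the strings really share "recurring patterns" is to reduce each one to a coarse template and count how often each template occurs. The sketch below is only that, a heuristic using Python's standard `re` and `collections` modules. It does not use NLTK or any other package named in the answer, and it is not a grammatical inference algorithm. The names `shape` and `common_templates` and the sample strings are made up for illustration.

```python
import re
from collections import Counter

# One token is a run of ASCII digits, a run of ASCII letters,
# or any other single character (spaces, punctuation, ...).
TOKEN = re.compile(r"(?P<num>[0-9]+)|(?P<word>[A-Za-z]+)|(?P<other>.)", re.DOTALL)


def shape(text):
    """Return a regex template for text: digit and letter runs are
    generalised, every other character is kept as an escaped literal."""
    parts = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "num":
            parts.append("[0-9]+")
        elif match.lastgroup == "word":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(match.group()))
    return "".join(parts)


def common_templates(strings, top=5):
    """Count how many strings share each template, most frequent first."""
    counts = Counter(shape(s) for s in strings)
    return counts.most_common(top)


# Hypothetical sample data, not taken from the dataset in the question.
samples = [
    "user 42 logged in",
    "user 7 logged in",
    "error 500: timeout",
    "user 1138 logged out",
]

for template, count in common_templates(samples):
    print(count, template)
```

Each template is itself a regular expression that fully matches the strings it was derived from, so a handful of dominant templates suggests the data comes from a small number of generators. The abstraction is deliberately crude: it treats every word as interchangeable, so fixed keywords are lost and would have to be reintroduced by hand or by a proper inference tool.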
