lexical-analysis | 易学教程

What is a regular expression for control characters?

Question: I'm trying to match a control character in the form \^c, where c is any valid character for control characters. I have this regular expression, but it's not currently working: \\[^][@-z] I think the problem lies with the fact that the caret character (^) is part of the regular expression parsing engine. Answer 1: To match an ASCII text string of the form ^X, the pattern \^. is enough. To match an ASCII text string of the form \^X, use the pattern \\\^. instead. You may wish to constrain that dot to [?@_\[\]^\\], giving \\\^[A-Z?@_\[\]^\\]. The bracketed character class is easier to read as [?\x40-\x5F]
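
As a rough illustration of the answer's final pattern, here is a minimal Python sketch. The names CONTROL_ESCAPE, find_control_escapes and to_control_char are made up for this example, and the mapping to an actual control character (XOR with 0x40, the usual caret-notation rule) is an addition that the answer itself does not spell out.

```python
import re

# Literal backslash, literal caret, then one character from '?' or '@'..'_'
# (\x40-\x5F covers @, A-Z, [, \, ], ^ and _).
CONTROL_ESCAPE = re.compile(r"\\\^[?\x40-\x5F]")

def find_control_escapes(text):
    """Return every backslash-caret escape found in text, in order."""
    return CONTROL_ESCAPE.findall(text)

def to_control_char(escape):
    """Map an escape to its control character (assumed caret-notation rule)."""
    # Flipping bit 0x40 turns 'A' into 0x01 and '?' into 0x7F (DEL).
    return chr(ord(escape[-1]) ^ 0x40)
```

Lowercase letters fall outside \x40-\x5F, so an input such as \^a is not matched by this class.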

How to make a flex (lexical scanner) to read UTF-8 characters input?

Question: It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII char, it stops scanning as if it had reached EOF. Is there a way to force flex to eat my UTF-8 chars? I don't want it to actually match UTF-8 chars, just eat them when using the '.' pattern. Any suggestions? EDIT: The simplest solution would be to define ANY [\x00-\xff] and use 'ANY' instead of '.' in my rules. Answer 1: I have been looking into this myself and reading the Flex mailing list to see if anyone thought about
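
The reason a byte class like [\x00-\xff] can "eat" UTF-8 is that every non-ASCII character is encoded as a sequence of bytes in the range 0x80-0xFF. The Python sketch below only illustrates that property with a byte-oriented regex; it is not flex, and the names ANY and scan_bytes are chosen for this example.

```python
import re

# Byte-oriented equivalent of the ANY definition: one arbitrary byte.
ANY = re.compile(rb"[\x00-\xff]")

def scan_bytes(data):
    """Split encoded input into single-byte matches, as a byte-based scanner would."""
    return ANY.findall(data)

text = "a\u00e9\u20ac"            # 'a', e-acute, euro sign
encoded = text.encode("utf-8")    # 1 + 2 + 3 bytes
pieces = scan_bytes(encoded)
```

Each multi-byte character is consumed as several one-byte matches, so nothing is decoded; the scanner just moves past the bytes.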

What can create a lexical error in C?

Besides not closing a comment (/*...), what constitutes a lexical error in C? Here are some: "abc<EOF> where EOF is the end of the file. In fact, EOF in the middle of many lexemes should produce errors: 0x<EOF> I assume that using bad escapes in strings is illegal: "ab\qcd" Probably trouble with floating point exponents: 1e+% Arguably, you shouldn't have stuff at the end of a preprocessor directive: #if x % Basically, anything that does not conform to ISO/IEC 9899:1999, Annex A.1 "Lexical grammar" is a lexical fault if the compiler does its lexical analysis according to this grammar. Here are some

How would you go about implementing off-side rule?

Question: I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule. In short: the off-side rule means in this context that indentation is recognized as a syntactic element. Here is the off-side rule in pseudocode for making tokenizers that capture indentation in usable form; I don't want to limit answers by language: token NEWLINE matches r"\n\ *" increase line count pick up and store the indentation level remember to also record
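
One common way to turn stored indentation levels into syntax is a stack of open indentation widths that emits INDENT and DEDENT tokens, as Python's own tokenizer does. The sketch below is a minimal version under that assumption; it is not the asker's generator, and the token names and the offside_tokens function are illustrative only. Tabs and line continuations are ignored.

```python
def offside_tokens(source):
    """Yield (kind, value) pairs, turning indentation changes into INDENT/DEDENT."""
    stack = [0]  # enclosing indentation widths, outermost first
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # blank lines do not affect block structure
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            # Shallower: close blocks until a matching level is found.
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            raise SyntaxError(f"inconsistent dedent to column {width}")
        yield ("LINE", stripped)
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", 0)
```

A parser can then treat INDENT and DEDENT like opening and closing braces, which keeps the grammar itself free of whitespace handling.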

C#/.NET Lexer Generators

I'm looking for a decent lexical scanner generator for C#/.NET -- something that supports Unicode character categories, and generates somewhat readable & efficient code. Anyone know of one? EDIT: I need support for Unicode categories , not just Unicode characters. There are currently 1421 characters in just the Lu (Letter, Uppercase) category alone, and I need to match many different categories very specifically, and would rather not hand-write the character sets necessary for it. Also, actual code is a must -- this rules out things that generate a binary file that is then used with a driver
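
To give a sense of why hand-writing these character sets is impractical, the following Python sketch enumerates a Unicode general category using the standard unicodedata module. It is only an illustration in Python, not a C#/.NET lexer generator; chars_in_category is a name made up for this example, and the resulting count depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def chars_in_category(category):
    """Return every code point whose Unicode general category equals `category`."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == category]

# All characters classified as Lu (Letter, Uppercase).
uppercase = chars_in_category("Lu")
```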

How to implement Lexical Analysis in Javascript

Hey folks, thanks for reading. I am currently attempting to build a Google-style calculator: you input a string, it determines whether it can be calculated and returns the result. I began slowly with the basics: + - / * and parenthesis handling. I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns. This kind of work is easily done with tools such as Lex and Yacc, except I am developing a Javascript-only application. I tried to translate the idea into Javascript but
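
The "list of tokens and associated regular expression patterns" approach can be sketched as follows. The example is written in Python rather than Javascript, and the token names, patterns and the tokenize function are assumptions made for illustration, not the asker's actual list.

```python
import re

# Token names and patterns are illustrative, not taken from the question.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("OP",     r"[+\-*/]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),
]
# One alternation with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(expression):
    """Return a list of (kind, text) tokens, raising ValueError on unknown input."""
    tokens = []
    pos = 0
    while pos < len(expression):
        match = MASTER.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character {expression[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Matching always starts at the current position, so any character that no pattern accepts is reported immediately instead of being skipped silently.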

How can I find only 'interesting' words from a corpus?

I am parsing sentences. I want to know the relevant content of each sentence, defined loosely as "semi-unique words" in relation to the rest of the corpus. Something similar to Amazon's "statistically improbable phrases", which seem to (often) convey the character of a book through oddball strings of words. My first pass was to start making a common words list. This knocks out the easy ones like a , the , from , etc. Obviously, it turns out that this list gets quite long. One idea is to generate this list: Make a histogram of the corpus' word frequencies, and lop off the top 10% or something
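
The histogram idea could look roughly like this in Python. The function names, the simple word regex and the default 10% cutoff are assumptions for this sketch; a real corpus would likely need better tokenization.

```python
from collections import Counter
import re

WORD = re.compile(r"[a-z']+")

def common_words(corpus, fraction=0.10):
    """Return the most frequent `fraction` of distinct words in the corpus."""
    counts = Counter(WORD.findall(corpus.lower()))
    # Always drop at least one word so tiny corpora still produce a list.
    cutoff = max(1, int(len(counts) * fraction))
    return {word for word, _ in counts.most_common(cutoff)}

def interesting_words(sentence, stop_words):
    """Keep the words of a sentence that are not in the derived common-word list."""
    return [w for w in WORD.findall(sentence.lower()) if w not in stop_words]
```

The derived set then plays the role of the hand-maintained common-words list, without anyone having to grow it manually.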

Parsing Meaning from Text

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like: "Manny Ramirez makes his return for the Dodgers today against the Houston Astros", what's a light-weight/ easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun). To make this question even worse, what are the things I'm not