What characters are allowed in Perl identifiers?

99封情书 submitted on 2019-12-03 17:15:02

Question


I'm working on regular expressions homework where one question is:

Using language reference manuals online determine the regular expressions for integer numeric constants and identifiers for Java, Python, Perl, and C.

I don't need help on the regular expression, I just have no idea what identifiers look like in Perl. I found pages describing valid identifiers for C, Python and Java, but I can't find anything about Perl.

EDIT: To clarify, finding the documentation was meant to be easy (like doing a Google search for python identifiers). I'm not taking a class in "doing Google searches".


Answer 1:


Perl Integer Constants

Integer constants in Perl can be

  • in base 16 if they start with 0x
  • in base 2 if they start with 0b
  • in base 8 if they start with 0
  • otherwise they are in base 10.

Following that leader is any number of valid digits in that base and also optional underscores.

Note that digit does not mean \p{POSIX_Digit}; it means \p{Decimal_Number}, which is really quite different, you know.
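As a rough illustration of the rules above, here is a minimal sketch of a pattern for Perl integer constants. It makes two simplifying assumptions that are mine, not the answer's: it accepts only ASCII digits, and it allows an underscore only between two digits. The variable name $int_const is just a label for this example.

```perl
use strict;
use warnings;

# Sketch only: ASCII digits, and underscores allowed only between digits.
my $int_const = qr/
    \A (?:
        0[xX] [0-9a-fA-F] (?: _? [0-9a-fA-F] )*   # base 16
      | 0[bB] [01]        (?: _? [01]        )*   # base 2
      | 0                 (?: _? [0-7]       )*   # base 8, including a lone 0
      | [1-9]             (?: _? [0-9]       )*   # base 10
    ) \z
/x;

for my $literal (qw(0 42 1_000_000 0x1F 0b1010 0755 089 -3)) {
    my $verdict = $literal =~ $int_const ? 'integer constant' : 'no match';
    printf "%-10s %s\n", $literal, $verdict;
}
```

Here 089 is rejected because 8 and 9 are not octal digits, and -3 is rejected because the sign is not part of the constant.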

Please note that any leading minus sign is not part of the integer constant, which is easily proven by:

$ perl -MO=Concise,-exec -le '$x = -3**$y'
1  <0> enter 
2  <;> nextstate(main 1 -e:1) v:{
3  <$> const(IV 3) s
4  <$> gvsv(*y) s
5  <2> pow[t1] sK/2
6  <1> negate[t2] sK/1
7  <$> gvsv(*x) s
8  <2> sassign vKS/2
9  <@> leave[1 ref] vKP/REFC
-e syntax OK

See the 3 const, and much later on the negate op-code? That tells you a bunch, including a curiosity of precedence.
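The precedence point can be checked directly with a small sketch (the variable names are arbitrary): the exponentiation is applied first and the negation afterwards.

```perl
use strict;
use warnings;

my $y = 2;
my $x = -3 ** $y;      # parsed as -(3 ** $y)
my $z = (-3) ** $y;    # parentheses force the negation first

print "$x\n";          # -9
print "$z\n";          # 9
```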

Perl Identifiers

Identifiers specified via symbolic dereferencing have absolutely no restriction whatsoever on their names.

  • For example, 100->(200) calls the function named 100 with the argument (200).
  • For another, ${"What’s up, doc?"} refers to the scalar package variable by that name in the current package.
  • On the other hand, ${"What's up, doc?"} refers to the scalar package variable whose name is ${"s up, doc?"} and which is not in the current package, but rather in the What package. Well, unless the current package is the What package, of course. Similarly, $Who's is the $s variable in the Who package.
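A minimal sketch of symbolic dereferencing with unusual names follows. It assumes the code runs in package main with strict refs switched off for the block; the name "odd name, with spaces?" and the array @seen are made up for the demonstration.

```perl
use strict;
use warnings;

my @seen;
my $value;
{
    no strict 'refs';                          # symbolic references need this

    *{"100"} = sub { @seen = @_; return };     # install a sub whose name is 100
    "100"->(200);                              # call it through its name

    ${"odd name, with spaces?"} = 42;          # package scalar with an arbitrary name
    $value = ${"odd name, with spaces?"};
}

print "sub 100 received: @seen\n";
print "scalar holds: $value\n";
```

None of these names could be written as a plain identifier in source code, which is why they fall outside any regular expression for ordinary identifiers.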

One can also have identifiers of the form ${^identifier}; these are not considered symbolic dereferences into the symbol table.

A single-character identifier can be a punctuation character, as in $$ or %!.

Identifiers can also be of the form $^C, which is either a control character or a circumflex followed by a non-control character.
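To make these special forms concrete, here is a short sketch that reads a few built-in variables of each shape. The variables themselves are standard Perl; the lexical names on the left are only labels for this example.

```perl
use strict;
use warnings;

my $pid   = $$;            # punctuation name: the process ID
my $os    = $^O;           # caret form: the operating system name
my $taint = ${^TAINT};     # braced caret form: whether taint mode is on

print "pid:   $pid\n";
print "os:    $os\n";
print "taint: $taint\n";
```

These names are exempt from use strict, so no declaration is needed for them.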

If none of those things is true, a (non–fully qualified) identifier follows the Unicode rules related to characters with the properties ID_Start followed by those with the property ID_Continue. However, it overrules this in allowing all-digit identifiers and identifiers that start with (and perhaps have nothing else beyond) an underscore. You can generally pretend (but it’s really only pretending) that that is like saying \w+, where \w is as described in Annex C of UTS#18. That is, anything that has any of these:

  • the Alphabetic property — which includes far more than just Letters; it also contains various combining characters and the Letter_Number code points, plus the circled letters
  • the Decimal_Number property, which is rather more than merely [0-9]
  • Any and all characters with the Mark property, not just those marks that are deemed Other_Alphabetic
  • Any characters with the Connector_Punctuation property, of which underscore is just one such.

So either ^\d+$ or else

^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$

ought to do it for the really simple ones if you don’t care to explore the intricacies of the Unicode ID_Start and ID_Continue properties. That’s how it’s really done, but I bet your instructor doesn’t know that. Perhaps one shan’t tell him, eh?
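For comparison, here is a sketch that follows the ID_Start and ID_Continue description instead of the \w approximation. It is only an illustration of the rule as stated in this answer, not Perl's actual tokenizer logic, and the name $simple_ident and the test strings are made up.

```perl
use strict;
use warnings;

# Sketch: all-digit names, or ID_Start (or underscore) followed by ID_Continue.
my $simple_ident = qr/
    \A (?:
        [0-9]+                              # all-digit names such as 1 or 10
      | [\p{ID_Start}_] \p{ID_Continue}*    # ordinary names, underscore allowed first
    ) \z
/x;

my %cases = (
    'foo_bar'          => 'foo_bar',
    'lone underscore'  => '_',
    'all digits'       => '42',
    'Greek alpha beta' => "\x{3b1}\x{3b2}",
    'digit then word'  => '1abc',
    'with hyphen'      => 'foo-bar',
);

for my $label (sort keys %cases) {
    my $verdict = $cases{$label} =~ $simple_ident ? 'identifier' : 'no match';
    printf "%-18s %s\n", $label, $verdict;
}
```

Unlike the single character class, this version rejects a name such as 1abc, which starts with a digit but is not all digits.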

But you should cover the nonsimple ones I described earlier.

And we haven’t talked about packages yet.

Perl Packages in Identifiers

Beyond those simple rules, you must also consider that identifiers may be qualified with a package name, and package names themselves follow the rules of identifiers.

The package separator is either :: or ' at your whim.

You do not have to specify a package if it is the first component in a fully qualified identifier, in which case it means the package main. That means things like $::foo and $'foo are equivalent to $main::foo, and isn't_it() is equivalent to isn::t_it().

Finally, as a special case, a trailing double-colon (but not a single-quote) at the end of a hash is permitted, and this then refers to the symbol table of that name.

Thus %main:: is the main symbol table, and because you can omit main, so too is %::.

Meanwhile %foo:: is the foo symbol table, as is %main::foo:: and also %::foo:: just for perversity’s sake.
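The equivalences above can be checked by comparing references, as in this small sketch. It uses only the :: separator; the variables $foo and $bar and the package foo exist purely for the demonstration.

```perl
use strict;
use warnings;

our $foo = 'hello';
{ package foo; our $bar = 1; }    # make sure a package named foo exists

# Two references compare equal with == when they point at the same thing.
my $same_scalar = \$::foo == \$main::foo;                          # omitted package means main
my $same_main   = \%:: == \%main::;                                # both are the main symbol table
my $same_foo    = \%foo:: == \%main::foo:: && \%foo:: == \%::foo::;

print "scalar: ", ($same_scalar ? "same" : "different"), "\n";
print "main:   ", ($same_main   ? "same" : "different"), "\n";
print "foo:    ", ($same_foo    ? "same" : "different"), "\n";
```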

Summary

It’s nice to see instructors giving people non-trivial assignments. The question is whether the instructor realized it was non-trivial. Probably not.

And it’s hardly just Perl, either. Regarding the Java identifiers, did you figure out yet that the textbooks lie? Here’s the demo:

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: ^\033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

Yes, it’s true. It is also true for many other code points, especially if you use -encoding UTF-8 on the compile line. Your job is to find the pattern that describes these startlingly unforbidden Java identifiers. Hint: make sure to include code point U+0000.

There, aren’t you glad you asked? Hope this helps. Or something. ☺




Answer 2:


The homework requests that you use the reference manuals, so I'll answer in those terms.

The Perl documentation is available at http://perldoc.perl.org/. The section that deals with variables is perldata. That will easily give you a usable answer.

In reality, I doubt that the complete answer is available in the documentation. There are special variables (see perlvar), and "use utf8;" can greatly affect the definition of "letter" and "number".

$ perl -E'use utf8; $é=123; say $é'
123

[ I only covered the identifier part. I just noticed the question is larger than that ]




Answer 3:


The perlvar page of the Perl documentation has a section at the end roughly outlining the allowable syntax. In summary:

  1. Any combination of letters, digits, underscores, and the special sequence :: (or '), provided it starts with a letter or underscore.
  2. A sequence of digits.
  3. A single punctuation character.
  4. A single control character, which can also be written as caret-{letter}, e.g. ^W.
  5. An alphanumeric string starting with a control character.

Note that most of the identifiers other than the ones in set 1 are either given a special meaning by Perl, or are reserved and may gain a special meaning in later versions. But if you're just trying to work out what is a valid identifier, then that doesn't really matter in your case.
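The five sets in that summary can be turned into a rough classifier. This is a sketch restricted to ASCII, and the function name classify and its numeric return values are assumptions of the example, not anything defined by perlvar.

```perl
use strict;
use warnings;

# Returns the number of the matching set from the list above, or 0 for none.
sub classify {
    my ($name) = @_;
    return 1 if $name =~ /\A[A-Za-z_](?:[A-Za-z0-9_]|::|')*\z/;   # ordinary names
    return 2 if $name =~ /\A[0-9]+\z/;                            # digits only
    return 3 if $name =~ /\A[[:punct:]]\z/;                       # one punctuation character
    return 4 if $name =~ /\A(?:[[:cntrl:]]|\^[A-Z])\z/;           # control char or caret-letter
    return 5 if $name =~ /\A[[:cntrl:]][A-Za-z0-9_]+\z/;          # control char plus word
    return 0;
}

my @samples = (
    [ 'foo::bar'         => 'foo::bar' ],
    [ '123'              => '123' ],
    [ '!'                => '!' ],
    [ 'caret W'          => '^W' ],
    [ 'control-W + word' => "\cWARNING_BITS" ],
    [ '1abc'             => '1abc' ],
);

printf "%-18s set %d\n", $_->[0], classify($_->[1]) for @samples;
```

The order of the checks matters: a lone underscore is also punctuation, but it is reported as set 1 because that test runs first.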




Answer 4:


Because Perl has no official specification (Perl is whatever the perl interpreter can parse), these can be a little tricky to discern.

This page has examples of all the integer constant formats. The format of identifiers will need to be inferred from various pages in perldoc.



Source: https://stackoverflow.com/questions/4800275/what-characters-are-allowed-in-perl-identifiers
