Valid identifiers in FORTRAN 66, C, Java, and C++ (current and future)
What’s in a name? Identifiers are used in modern programming languages to refer to types, classes, variables and object instances. While the first programming languages were resource-constrained and ASCII-centered, modern languages are more flexible with regard to the forms identifiers can take.
This post compares the lexical conventions for identifiers (length and character sets) in FORTRAN 66, C, Java, and current and future C++.
The original FORTRAN 66 identifiers were defined based on digits and letters as follows:
A symbolic name consists of from one to six alphanumeric characters, the first of which must be alphabetic.
A digit is one of the ten characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
A letter is one of the twenty-six characters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z.
So we only have the 26 upper-case letters and the ten digits to build identifiers of at most six characters. There is no lower case, so the question of case sensitivity never arises. No underscores, no $ signs.
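The rule quoted above is small enough to be written as a regular expression. The following is a minimal sketch, not anything taken from the standard: the class name `Fortran66Names` and the method `isSymbolicName` are made up for illustration, and the check looks only at the name itself (it ignores, for instance, how a FORTRAN compiler treats blanks in source text).

```java
final class Fortran66Names {
    // One upper-case letter followed by at most five letters or digits,
    // i.e. one to six alphanumeric characters, the first alphabetic.
    private static final java.util.regex.Pattern SYMBOLIC_NAME =
            java.util.regex.Pattern.compile("[A-Z][A-Z0-9]{0,5}");

    // Hypothetical helper: true if s has the shape of a FORTRAN 66 symbolic name.
    static boolean isSymbolicName(String s) {
        return SYMBOLIC_NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"X", "ALPHA1", "COUNTER", "2PI", "MY_VAR", "alpha"};
        for (String s : samples) {
            System.out.println(s + " -> " + isSymbolicName(s));
        }
    }
}
```

Of the samples, only `X` and `ALPHA1` are accepted: `COUNTER` has seven characters, `2PI` starts with a digit, `MY_VAR` contains an underscore, and `alpha` uses lower-case letters that are not in the character set.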
ANSI C (or ISO C or C90) as defined by ISO/IEC 9899:1990 says:
An identifier is a sequence of nondigit characters (including the underscore _ and the lower-case and upper-case letters) and digits.
The first character shall be a nondigit character.
C is limited to ASCII letters, but it is case sensitive. Underscore OK, $ not OK.
ISO C lifted the length limitations set 15 years earlier in the C Reference Manual that came with 6th Edition Unix, where “no more than the first eight characters are significant, and only the first seven for external identifiers”. The practical length of identifiers in ISO C is constrained by the translation limits that a conforming compiler must meet: at least 31 significant initial characters for an internal identifier (and at least 6 for an external identifier).
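To make the notion of "significant characters" concrete, here is a small sketch, written in Java for convenience. The class `C90Identifiers` and its methods are hypothetical names chosen for this post; the check covers only the character rules quoted above and does not reject keywords such as `int`.

```java
final class C90Identifiers {
    // nondigit: underscore or an ASCII letter of either case
    static boolean isNondigit(char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // A sequence of nondigits and digits whose first character is a nondigit.
    static boolean isIdentifier(String s) {
        if (s.isEmpty() || !isNondigit(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isNondigit(c) && !isDigit(c)) {
                return false;
            }
        }
        return true;
    }

    // Two names are guaranteed to be distinct only if they already differ
    // within the first `significant` characters.
    static boolean distinctWithin(String a, String b, int significant) {
        String pa = a.substring(0, Math.min(a.length(), significant));
        String pb = b.substring(0, Math.min(b.length(), significant));
        return !pa.equals(pb);
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("_tmp1"));                         // valid shape
        System.out.println(isIdentifier("pay$"));                          // '$' is not allowed
        System.out.println(distinctWithin("counter_a", "counter_b", 8));   // 6th Edition limit
        System.out.println(distinctWithin("counter_a", "counter_b", 31));  // ISO C internal limit
    }
}
```

With a limit of eight, `counter_a` and `counter_b` share the same significant prefix `counter_` and may be treated as the same name; with 31 significant characters they are distinct.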
C++ current standard (2003)
The current C++ standard as implemented in currently available compilers has the same character set limitations as C:
identifier:
    nondigit
    identifier nondigit
    identifier digit
nondigit: one of _ a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
digit: one of 0 1 2 3 4 5 6 7 8 9
The recommended minimum for the number of characters an implementation should accept in an internal identifier, macro name or external identifier is increased to a grandiose 1024.
In the Java Language Specification, Third Edition, an identifier is defined as an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
The “Java digits” include the ASCII digits 0-9; decimal digits from other scripts (General Category Nd) are accepted as well.
A “Java letter” is defined with reference to the 30 Unicode General Categories which also match the “Java Constant Field” values, according to this table:
|Abbr||Long||Description||Java Constant Field Value|
|Cc||Control||a C0 or C1 control code||CONTROL|
|Cf||Format||a format control character||FORMAT|
|Cn||Unassigned||a reserved unassigned code point or a noncharacter||UNASSIGNED|
|Co||Private_Use||a private-use character||PRIVATE_USE|
|Cs||Surrogate||a surrogate code point||SURROGATE|
|Ll||Lowercase_Letter||a lowercase letter||LOWERCASE_LETTER|
|Lm||Modifier_Letter||a modifier letter||MODIFIER_LETTER|
|Lo||Other_Letter||other letters, including syllables and ideographs||OTHER_LETTER|
|Lt||Titlecase_Letter||a digraphic character, with first part uppercase||TITLECASE_LETTER|
|Lu||Uppercase_Letter||an uppercase letter||UPPERCASE_LETTER|
|Mc||Spacing_Mark||a spacing combining mark (positive advance width)||COMBINING_SPACING_MARK|
|Me||Enclosing_Mark||an enclosing combining mark||ENCLOSING_MARK|
|Mn||Nonspacing_Mark||a nonspacing combining mark (zero advance width)||NON_SPACING_MARK|
|Nd||Decimal_Number||a decimal digit||DECIMAL_DIGIT_NUMBER|
|Nl||Letter_Number||a letterlike numeric character||LETTER_NUMBER|
|No||Other_Number||a numeric character of other type||OTHER_NUMBER|
|Pc||Connector_Punctuation||a connecting punctuation mark, like a tie||CONNECTOR_PUNCTUATION|
|Pd||Dash_Punctuation||a dash or hyphen punctuation mark||DASH_PUNCTUATION|
|Pe||Close_Punctuation||a closing punctuation mark (of a pair)||END_PUNCTUATION|
|Pf||Final_Punctuation||a final quotation mark||FINAL_QUOTE_PUNCTUATION|
|Pi||Initial_Punctuation||an initial quotation mark||INITIAL_QUOTE_PUNCTUATION|
|Po||Other_Punctuation||a punctuation mark of other type||OTHER_PUNCTUATION|
|Ps||Open_Punctuation||an opening punctuation mark (of a pair)||START_PUNCTUATION|
|Sc||Currency_Symbol||a currency sign||CURRENCY_SYMBOL|
|Sk||Modifier_Symbol||a non-letterlike modifier symbol||MODIFIER_SYMBOL|
|Sm||Math_Symbol||a symbol of primarily mathematical use||MATH_SYMBOL|
|So||Other_Symbol||a symbol of other type||OTHER_SYMBOL|
|Zl||Line_Separator||U+2028 LINE SEPARATOR only||LINE_SEPARATOR|
|Zp||Paragraph_Separator||U+2029 PARAGRAPH SEPARATOR only||PARAGRAPH_SEPARATOR|
|Zs||Space_Separator||a space character (of various non-zero widths)||SPACE_SEPARATOR|
With the help of this table, we understand that a “Java letter” can be a currency symbol (such as “$”), a connecting punctuation character (such as “_”), or belong to one of the Unicode General Categories Lu, Ll, Lt, Lm, Lo or Nl.
It is not clear to me whether “currency symbol” and “connecting punctuation character” mean only the CURRENCY_SYMBOL (Sc) and CONNECTOR_PUNCTUATION (Pc) General Categories, or whether the other punctuation categories are included too: DASH_PUNCTUATION (Pd), END_PUNCTUATION (Pe), FINAL_QUOTE_PUNCTUATION (Pf), INITIAL_QUOTE_PUNCTUATION (Pi), OTHER_PUNCTUATION (Po) and START_PUNCTUATION (Ps). Maybe somebody with Java skills can fill this void.
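One way to probe the question above is to ask the JDK itself. The sketch below assumes that `Character.isJavaIdentifierStart(int)` is the method that decides what counts as a “Java letter”, and tries one sample character from several of the categories in doubt. The class `JavaLetterProbe` and its method `canStart` are names invented for this example.

```java
final class JavaLetterProbe {
    // Thin wrapper around the JDK check for "may start a Java identifier".
    static boolean canStart(int codePoint) {
        return Character.isJavaIdentifierStart(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {
            '$',     // Sc, DOLLAR SIGN
            0x20AC,  // Sc, EURO SIGN
            '_',     // Pc, LOW LINE
            0x203F,  // Pc, UNDERTIE
            '-',     // Pd, HYPHEN-MINUS
            '(',     // Ps, LEFT PARENTHESIS
            ')',     // Pe, RIGHT PARENTHESIS
            '!'      // Po, EXCLAMATION MARK
        };
        for (int cp : samples) {
            // getType returns the numeric value of the "Java Constant Field"
            System.out.printf("U+%04X type=%d canStart=%b%n",
                    cp, Character.getType(cp), canStart(cp));
        }
    }
}
```

On a standard JDK this should report that the dollar sign, the euro sign, the underscore and the undertie can start an identifier, while the hyphen, the parentheses and the exclamation mark cannot. That suggests only the Sc and Pc categories are meant, not the other punctuation categories.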
The Java programming language allows programmers to name identifiers with great liberty: most Unicode letters are accepted, so programmers can basically write identifiers in their native languages, and underscore and dollar sign ($) are both OK. An undesirable side effect is that two identifiers are different whenever their Unicode code points differ, even if the glyphs (what you see on the screen) are the same. For example A and Α are different identifiers in Java because they are respectively LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA, and a is different from а because they are respectively LATIN SMALL LETTER A and CYRILLIC SMALL LETTER A.
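The look-alike problem can be shown in a few lines. In this sketch the Greek alpha is spelled as a Unicode escape so that the two names can be told apart when reading the source; the class `Homoglyphs` and its methods are hypothetical names used only for illustration.

```java
final class Homoglyphs {
    // Declares two local variables whose names render alike
    // but are distinct identifiers for the compiler.
    static int sumOfLookalikes() {
        int A = 1;        // LATIN CAPITAL LETTER A (U+0041)
        int \u0391 = 2;   // GREEK CAPITAL LETTER ALPHA (U+0391), spelled as a Unicode escape
        return A + \u0391;
    }

    // Lists the code points of a name in U+XXXX notation.
    static String codePoints(String name) {
        StringBuilder sb = new StringBuilder();
        name.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(sumOfLookalikes());
        // Latin small a versus Cyrillic small a
        System.out.println(codePoints("a") + " vs " + codePoints("\u0430"));
    }
}
```

The method compiles because the two declarations are different identifiers, and it returns 3. Printing the code points, as `codePoints` does, is one simple way to tell such names apart when the glyphs cannot.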
C++ upcoming standard (C++0x)
C++0x, the planned new standard for the C++ programming language due to come out in 2011 or 2012, is more elastic than current C++ in its definition of an identifier:
An identifier is an arbitrarily long sequence of letters and digits, starting with a letter.
Upper- and lower-case letters are different. All characters are significant.
A “letter” is the usual a-z, A-Z and _ or a “universal-character-name” or “other implementation-defined characters”.
The “universal-character-names” allowed in an identifier are defined with reference to Annex A (Recommended extended repertoire for user-defined identifiers) of ISO/IEC TR 10176, Fourth edition (2003): Guidelines for the preparation of programming language standards.
According to TR 10176, Annex A, these are the characters which “collectively can be used to generate word-like identifiers for most natural languages of the world”, including “letters (combining or not), syllables, and ideographs together with the modifier letters and marks conventionally used as parts of words”. The acceptable Unicode code points are:
Latin: 0041-005A, 0061-007A, 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05D0-05EA, 05F0-05F2
Hebrew (C): 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2
Arabic: 0621-063A, 0640-064A, 0671-06B7, 06BA-06BE, 06C0-06CE, 06D0-06D3, 06D5, 06E5-06E6
Arabic (C): 064B-0652, 0670, 06D6-06DC, 06E7-06E8, 06EA-06ED
Devanagari: 0905-0939, 0950, 0958-0961
Devanagari (C): 0901-0903, 093E-094D, 0951-0952, 0962-0963
Bengali: 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2, 09B6-09B9, 09DC-09DD, 09DF-09E1, 09F0-09F1
Bengali (C): 0981-0983, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09E2-09E3
Gurmukhi: 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33, 0A35-0A36, 0A38-0A39, 0A59-0A5C, 0A5E, 0A74
Gurmukhi (C): 0A02, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D
Gujarati: 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD, 0AD0, 0AE0
Gujarati (C): 0A81-0A83, 0ABE-0AC5, 0AC7-0AC9, 0ACB-0ACD
Oriya: 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 0B32-0B33, 0B36-0B39, 0B5C-0B5D, 0B5F-0B61
Oriya (C): 0B01-0B03, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D
Tamil: 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9
Tamil (C): 0B82-0B83, 0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 0C35-0C39, 0C60-0C61
Telugu (C): 0C01-0C03, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D
Kannada: 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 0CB5-0CB9, 0CDE, 0CE0-0CE1
Kannada (C): 0C82-0C83, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD
Malayalam: 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 0D60-0D61
Malayalam (C): 0D02-0D03, 0D3E-0D43, 0D46-0D48, 0D4A-0D4D
Thai: 0E01-0E30, 0E32-0E33, 0E40-0E46, 0E50-0E59
Thai (C): 0E31, 0E34-0E3A, 0E47-0E4E
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0, 0EB2-0EB3, 0EBD, 0EC0-0EC4, 0EC6, 0EDC-0EDD
Lao (C): 0EB1, 0EB4-0EB9, 0EBB-0EBC, 0EC8-0ECD
Tibetan: 0F00, 0F40-0F47, 0F49-0F69, 0F88-0F8B
Tibetan (C): 0F18-0F19, 0F35, 0F37, 0F39, 0F71-0F84, 0F86-0F87, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Katakana: 30A1-30F6, 30FB-30FC
CJK Unified Ideographs: 4E00-9FA5
Digits: 0030-0039, 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 0E50-0E59, 0ED0-0ED9, 0F20-0F29
Special characters: 00B5, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029
The upcoming version of C++ will be subject to the same confusing same-glyph, different-code-point syndrome as Java: A != Α and a != а.
The good news is that since the “good” code points are listed, it is easier for implementations to check whether a character is acceptable, whereas a Java implementation needs access to the Unicode tables to know whether a character belongs to a given General Category.
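As an illustration of how simple such a check can be, here is a sketch of a range lookup. It is written in Java only to stay with one language for the examples in this post, and it is an assumption about how an implementation might do it, not a description of any real compiler. The table contains only an excerpt of the ranges listed above, so characters from the omitted ranges are reported as not listed; the class name `Tr10176Excerpt` is made up.

```java
final class Tr10176Excerpt {
    // {first, last} pairs copied from the list above (excerpt only),
    // sorted by first code point and non-overlapping.
    private static final int[][] RANGES = {
        {0x0041, 0x005A}, {0x0061, 0x007A},  // Latin (basic letters only)
        {0x038E, 0x03A1},                    // Greek (one range)
        {0x040E, 0x044F},                    // Cyrillic (one range)
        {0x0531, 0x0556}, {0x0561, 0x0587},  // Armenian
        {0x05D0, 0x05EA}, {0x05F0, 0x05F2},  // Hebrew
        {0x10A0, 0x10C5}, {0x10D0, 0x10F6},  // Georgian
        {0x30A1, 0x30F6}, {0x30FB, 0x30FC},  // Katakana
        {0x4E00, 0x9FA5}                     // CJK Unified Ideographs
    };

    // Binary search over the sorted ranges.
    static boolean isListed(int codePoint) {
        int lo = 0;
        int hi = RANGES.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < RANGES[mid][0]) {
                hi = mid - 1;
            } else if (codePoint > RANGES[mid][1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] samples = {'A', 0x0391, 0x0430, 0x4E2D, '$', 0x0557};
        for (int cp : samples) {
            System.out.printf("U+%04X listed=%b%n", cp, isListed(cp));
        }
    }
}
```

No Unicode database is needed, only a fixed table. Note that both LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA (U+0391) are found in the table, which is exactly why the look-alike problem carries over.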