Book: The Judo Language 0.9

Chapter 5. Basic Data Types and Expressions

By James Jianbo Huang

In Judo, all values are objects. Every object has a type, which equates to a set of properties and methods.

For convenience, Judo data types are categorized into primitive and non-primitive types. Primitive types are traditionally simple values; in Judo, primitive types include boolean, numbers, string, date, time and a special type, Secret. All other types are non-primitive, which includes Judo built-in data structures and objects, Java objects and anything that is mapped into Java objects, such as XML DOM objects and Windows ActiveX controls. In this section, we discuss primitive types and their usage.

Null and Undefined (to be done)

null

undefined

eof

nl

Boolean and Numbers (to be finished)

Boolean type has only two values, the literals true and false. Numerically, true is 1 and false is 0.

Numbers can be integers or floating point numbers. Judo supports all the integer and floating point number formats that Java supports.

String and Character (to be finished)

Judo strings are sequences of Unicode characters. There is no specific character type in Judo; characters are simply strings containing one character.

Judo strings are very rich in functionality. Strings are also used to represent file and directory names and URLs.

The simplest form of string literal is a piece of text quoted by either double quotes or single quotes; if double quotes are used, single quotes are legitimate characters, and vice versa. If both double quotes and single quotes appear in the text, one of them has to be escaped. The escape sequences are the same as in Java. Judo supports Unicode escape sequences as well. The following are some examples.

x = 'Hello, World!';
x = "It's so nice.";
x = '"Fine," he said.';
x = '"Yes," he said, "it\'s good!"';
x = '\u65E5\u672C\u8A9E';
x = 'a\tb\tc';

Multi-line text literals
Judo supports two forms of multi-line text literals, [[* *]] and [[[* *]]. Both quote a chunk of text that may include newlines. The [[* *]] form is used more often; it allows the chunk of text to be indented, so that the code stays nicely aligned. In other words, the leading whitespace characters at the beginning of each line are stripped. For this reason, it is better not to use tabs for indentation, to avoid potential confusion. The [[* *]] form also trims leading and trailing whitespace characters from the text as a whole. The [[[* *]] form simply quotes the chunk of text as-is. Let's look at an example.

Listing 5.1 multiline.judo
a = [[*
   aaaaaaaaaaaaaaa
   aaaaaaaaaaaaaaa
   aaaaaaaaaaaaaaa
*]];
println '----', a, '----';

b = [[[*
   bbbbbbbbbbbbbbb
   bbbbbbbbbbbbbbb
   bbbbbbbbbbbbbbb
*]];
println '====', b, '====';

The println command prints out all the textual values plus a newline. The result for this program is:

----aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaa----
====
   bbbbbbbbbbbbbbb
   bbbbbbbbbbbbbbb
   bbbbbbbbbbbbbbb
====

Embedded expressions
In multi-line text literals, expressions can be embedded with the (* *) syntax. The embedded expressions are evaluated to string values and concatenated with the rest of the text. Strictly speaking, the text is no longer a literal but a template. The following example sends emails to a mailing list stored in a database table. (We will explain the usages later in this book; for now, just focus on how the text template is used. Auxiliary parts, such as connecting to and disconnecting from servers, are omitted.)

executeQuery qry:
  SELECT last_name, salute, email FROM customers
;
while qry.next() {
  sendMail
     from: 'support@judoscript.com'
       to: qry.email
  subject: 'Daily digest for ' + Date().fmtDate('yyyy-MM-dd')
     body: [[*
               Dear (* qry.salute *) (* qry.last_name *),

               This is today's daily digest.
               Please don't reply to this mail.

               Thanks,
               -Judo support
           *]]
  ;
}

In the body clause of the sendMail statement, we used a text template and generated the message body for each customer, where the values are from the database query object.

Embedded variables and environment variables
The syntax for embedded expressions, (* *), applies only to multi-line text literals. However, variables, including environment variables, can be embedded in all forms of string literals via the ${} syntax, which is familiar to Unix shell programmers. The rule is that, if the named variable exists within the current Judo program, its value is used; otherwise, the environment variable of the same name is retrieved and used. What's more, ${} can be used on its own, outside a string, as a shortcut for the system function getenv(), which explicitly accesses environment variables. As usual, let's see an example.

Listing 5.2 envvar.judo
println 'Case I. \${CLASSPATH} --> ', ${CLASSPATH};
println "Case Ia. '\${CLASSPATH}' --> ${CLASSPATH}";
println "Case Ib. CLASSPATH -->", CLASSPATH;

// set it and see that it is:
println '... Set in-program variable CLASSPATH to ', CLASSPATH = 'hahaha';

println 'Case II. \${CLASSPATH} --> ', ${CLASSPATH};
println "Case IIa. getenv('CLASSPATH') --> ", getenv('CLASSPATH');
println "Case IIb. '\${CLASSPATH}' --> ${CLASSPATH}";
println 'Case III. CLASSPATH --> ', CLASSPATH;

This program consists of seven labeled test cases. Case I explicitly accesses the environment variable CLASSPATH. Case Ia yields the same result, but only because no in-program variable of that name exists yet. Case Ib references the plain variable CLASSPATH, which is still undefined, so nothing is printed. Prior to Case II, we set an in-program variable with the same name, CLASSPATH; Case II shows that a standalone ${CLASSPATH} ignores the in-program variable and still returns the environment variable. Case IIa shows how to use getenv('CLASSPATH') to accomplish the same. Case IIb contrasts with Case Ia: this time the variable CLASSPATH has been defined, so the in-program value is displayed. Lastly, Case III shows that a plain reference to CLASSPATH always refers to the in-program variable. The following is the result:

Case I.   ${CLASSPATH}        --> c:\jlib\judo.jar;c:\jlib\classes12.zip
Case Ia.  '${CLASSPATH}'      --> c:\jlib\judo.jar;c:\jlib\classes12.zip
Case Ib.  CLASSPATH           -->
... Set in-program variable CLASSPATH to hahaha
Case II.  ${CLASSPATH}        --> c:\jlib\judo.jar;c:\jlib\classes12.zip
Case IIa. getenv('CLASSPATH') --> c:\jlib\judo.jar;c:\jlib\classes12.zip
Case IIb. '${CLASSPATH}'      --> hahaha
Case III. CLASSPATH           --> hahaha

This environment variable access operator is familiar to Unix shell programmers, and will be discussed further in chapter . .

Within a string, both the ${} and (* *) syntaxes can embed references to variables. (* *) can enclose any expression, whereas ${} falls back to the environment variable of the same name if no such variable exists. To access a global variable, write ${::xyz}. If the global variable does not exist, Judo still tries to find the environment variable of the same name.
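The lookup rule for ${} inside a string can be sketched in Java, the language Judo runs on. This is only an illustration of the rule as described above, not Judo's actual implementation; the class name VariableLookup, the method resolve, and the use of a plain map for in-program variables are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

class VariableLookup {
    // Hypothetical helper: an in-program variable wins; otherwise fall back
    // to the environment variable of the same name (null if neither exists).
    static String resolve(String name, Map<String, String> programVars) {
        if (programVars.containsKey(name)) {
            return programVars.get(name);
        }
        return System.getenv(name);
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        // No in-program variable yet: the environment value (if any) is used.
        System.out.println(resolve("CLASSPATH", vars));
        // After the assignment, the in-program value shadows the environment.
        vars.put("CLASSPATH", "hahaha");
        System.out.println(resolve("CLASSPATH", vars));
    }
}
```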

File Names and URLs (to be done)

Regular Expression

Regular expressions (regex for short) are familiar to many scripting language programmers. As a mini-language for describing text patterns, regex brings tremendous power to text processing. After considerable effort by many people to bring this power to Java, JDK1.4 finally embraced it as part of the Java standard edition. Judo regex support is based on that of Java. Since it is available only in JDK1.4 and later, any use of regex with JDK1.3 will cause runtime errors.

If you are a Java programmer, you are probably aware of the JDK1.4 regex API. If you are not a Java programmer, you don't have to be concerned with that API; all you have to know is the regex constructs. Judo does not reinvent the regex constructs but simply uses Java's, so it is good to know how Java does it and what Java supports.

The java.util.regex package, available from JDK1.4 onwards, provides Java regex support. The key class in this API is Pattern. A regex must be "compiled" into a Pattern instance, which is then applied to strings. With a compiled regex pattern, you can do the following to a string:

Check if the regex pattern matches the whole or the leading part of the string.
Replace the first or all the occurrences of the regex matches.
Split the string into an array of strings separated by the regex pattern.
Match the regex pattern against a string and return detailed information about the matched segments in the string.

The match operation, in Java, returns a Matcher object, which has methods for going through the various pieces of the match. Each matched piece is called a group, which has a start and an end index in the original string. You can reset the matcher and match again. This object is treated as an intrinsic object in Judo and will be discussed in detail later.
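For Java programmers, the four operations listed above can be sketched with the java.util.regex API as follows. The class name RegexOperations, the sample pattern \d+ and the helper method names are chosen for this illustration only; the Judo string methods that wrap these operations are described later in this chapter.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexOperations {
    // Compile once; the Pattern instance can be reused for many inputs.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // True only if the whole input matches the pattern.
    static boolean matchesWhole(String input) {
        return DIGITS.matcher(input).matches();
    }

    // True if the pattern matches the leading part of the input.
    static boolean matchesStart(String input) {
        return DIGITS.matcher(input).lookingAt();
    }

    static String replaceFirst(String input, String replacement) {
        return DIGITS.matcher(input).replaceFirst(replacement);
    }

    static String replaceAll(String input, String replacement) {
        return DIGITS.matcher(input).replaceAll(replacement);
    }

    static String[] split(String input) {
        return DIGITS.split(input);
    }

    // Walk through every match and report its text with start/end indices.
    static String describeMatches(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            sb.append(m.group()).append('@')
              .append(m.start()).append('-').append(m.end()).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String input = "a1b22c";
        System.out.println(matchesWhole("2024"));        // true
        System.out.println(matchesStart("2024x"));       // true
        System.out.println(replaceFirst(input, "#"));    // a#b22c
        System.out.println(replaceAll(input, "#"));      // a#b#c
        System.out.println(Arrays.toString(split(input))); // [a, b, c]
        System.out.println(describeMatches(input));      // 1@1-2 22@3-5
    }
}
```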

The Java/Judo Regular Expressions

Regular expressions in Judo are the same as in Java; from this point on, we will just call them regular expressions, or simply regex's. In this section, we introduce the details of regex's, which are, in fact, the specification defined by the java.util.regex.Pattern class in JDK1.4. For general knowledge about regex, please refer to the relevant literature, such as any Perl book. Here, we assume that you are familiar with some form of regex and just discuss the details of the regex syntax.

The following table shows the regex constructs:

**Table 5.1 Regex Constructs**
Construct	Matches
Characters
`x`	The character `x`
`\\`	The backslash character
`\0n`	The character with octal value `0n` (0 <= n <= 7)
`\0nn`	The character with octal value `0nn` (0 <= n <= 7)
`\0mnn`	The character with octal value `0mnn` (0 <= m <= 3, 0 <= n <= 7)
`\xhh`	The character with hexadecimal value `0xhh`
`\uhhhh`	The character with hexadecimal value `0xhhhh`
`\t`	The tab character (`'\u0009'`)
`\n`	The newline (line feed) character (`'\u000A'`)
`\r`	The carriage-return character (`'\u000D'`)
`\f`	The form-feed character (`'\u000C'`)
`\a`	The alert (bell) character (`'\u0007'`)
`\e`	The escape character (`'\u001B'`)
`\cx`	The control character corresponding to `x`
Character classes
`[abc]`	`a`, `b`, or `c` (simple class)
`[^abc]`	Any character except `a`, `b`, or `c` (negation)
`[a-zA-Z]`	`a` through `z` or `A` through `Z`, inclusive (range)
`[a-d[m-p]]`	`a` through `d`, or `m` through `p`: `[a-dm-p]` (union)
`[a-z&&[def]]`	`d`, `e`, or `f` (intersection)
`[a-z&&[^bc]]`	`a` through `z`, except for `b` and `c`: `[ad-z]` (subtraction)
`[a-z&&[^m-p]]`	`a` through `z`, and not `m` through `p`: `[a-lq-z]` (subtraction)
Predefined character classes
`.`	Any character (may or may not match line terminators)
`\d`	A digit: `[0-9]`
`\D`	A non-digit: `[^0-9]`
`\s`	A whitespace character: `[ \t\n\x0B\f\r]`
`\S`	A non-whitespace character: `[^\s]`
`\w`	A word character: `[a-zA-Z_0-9]`
`\W`	A non-word character: `[^\w]`
POSIX character classes (US-ASCII only)
`\p{Lower}`	A lower-case alphabetic character: `[a-z]`
`\p{Upper}`	An upper-case alphabetic character: `[A-Z]`
`\p{ASCII}`	All ASCII: `[\x00-\x7F]`
`\p{Alpha}`	An alphabetic character: `[\p{Lower}\p{Upper}]`
`\p{Digit}`	A decimal digit: `[0-9]`
`\p{Alnum}`	An alphanumeric character: `[\p{Alpha}\p{Digit}]`
`\p{Punct}`	Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
`\p{Graph}`	A visible character: `[\p{Alnum}\p{Punct}]`
`\p{Print}`	A printable character: `[\p{Graph}]`
`\p{Blank}`	A space or a tab: `[ \t]`
`\p{Cntrl}`	A control character: `[\x00-\x1F\x7F]`
`\p{XDigit}`	A hexadecimal digit: `[0-9a-fA-F]`
`\p{Space}`	A whitespace character: `[ \t\n\x0B\f\r]`
Classes for Unicode blocks and categories
`\p{InGreek}`	A character in the Greek block (simple block)
`\p{Lu}`	An uppercase letter (simple category)
`\p{Sc}`	A currency symbol
`\P{InGreek}`	Any character except one in the Greek block (negation)
`[\p{L}&&[^\p{Lu}]]`	Any letter except an uppercase letter (subtraction)
Boundary matchers
`^`	The beginning of a line
`$`	The end of a line
`\b`	A word boundary
`\B`	A non-word boundary
`\A`	The beginning of the input
`\G`	The end of the previous match
`\Z`	The end of the input but for the final terminator, if any
`\z`	The end of the input
Greedy quantifiers
`X?`	`X`, once or not at all
`X*`	`X`, zero or more times
`X+`	`X`, one or more times
`X{n}`	`X`, exactly `n` times
`X{n,}`	`X`, at least `n` times
`X{n,m}`	`X`, at least `n` but not more than `m` times
Reluctant quantifiers
`X??`	`X`, once or not at all
`X*?`	`X`, zero or more times
`X+?`	`X`, one or more times
`X{n}?`	`X`, exactly `n` times
`X{n,}?`	`X`, at least `n` times
`X{n,m}?`	`X`, at least `n` but not more than `m` times
Possessive quantifiers
`X?+`	`X`, once or not at all
`X*+`	`X`, zero or more times
`X++`	`X`, one or more times
`X{n}+`	`X`, exactly `n` times
`X{n,}+`	`X`, at least `n` times
`X{n,m}+`	`X`, at least `n` but not more than `m` times
Logical operators
`XY`	`X` followed by `Y`
`X|Y`	Either `X` or `Y`
`(X)`	`X`, as a capturing group
Back references
`\n`	Whatever the `n`th capturing group matched
Quotation
`\`	Nothing, but quotes the following character
`\Q`	Nothing, but quotes all characters until `\E`
`\E`	Nothing, but ends quoting started by `\Q`
Special constructs (non-capturing)
`(?:X)`	`X`, as a non-capturing group
`(?idmsux-idmsux)`	Nothing, but turns match flags on - off
`(?idmsux-idmsux:X)`	`X`, as a non-capturing group with the given flags on - off
`(?=X)`	`X`, via zero-width positive lookahead
`(?!X)`	`X`, via zero-width negative lookahead
`(?<=X)`	`X`, via zero-width positive lookbehind
`(?<!X)`	`X`, via zero-width negative lookbehind
`(?>X)`	`X`, as an independent, non-capturing group

Backslashes, escapes, and quoting
The backslash character (\) serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.

It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
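The quoting rules above can be tried out from Java with the short sketch below. Note that in Java source code every regex backslash must itself be escaped inside the string literal, so the regex \\ is written "\\\\". The class and method names are made up for this example, and \y is used only as a sample of an alphabetic escape that, to our knowledge, the engine does not define.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexQuoting {
    // The regex \\ matches exactly one backslash character.
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    // The regex \{ matches a literal left brace.
    static boolean isLeftBrace(String s) {
        return Pattern.matches("\\{", s);
    }

    // \Q...\E quotes everything in between, so $ . ( ) lose their special meaning.
    static boolean isQuotedLiteral(String s) {
        return Pattern.matches("\\Q$1.50 (approx.)\\E", s);
    }

    // Reports whether the regex compiles at all.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSingleBackslash("\\"));
        System.out.println(isLeftBrace("{"));
        System.out.println(isQuotedLiteral("$1.50 (approx.)"));
        // A backslash before an alphabetic character with no assigned meaning is rejected.
        System.out.println(isValidRegex("\\y"));
    }
}
```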

Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character (\n),
A carriage-return character followed immediately by a newline character (\r\n),
A standalone carriage-return character (\r),
A next-line character (\u0085),
A line-separator character (\u2028), or
A paragraph-separator character (\u2029).

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.

Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

((A)(B(C)))
(A)
(B(C))
(C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification, then its previously captured value, if any, will be retained if the second evaluation fails. Matching the string aba against the expression (a(b)?)+, for example, leaves group two set to b. All captured input is discarded at the beginning of each match.

Groups beginning with (? are pure groups that do not capture text and do not count towards the group total.
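The group numbering rules can be checked with a small Java sketch. The helper groups() below is a made-up convenience for this illustration; it returns group 0 through the last group of a full match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Returns groups 0..groupCount() of a whole-string match, or null if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Four capturing groups, numbered by their opening parentheses.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, ABC, A, BC, C]
        // Group 2 keeps "b" from the first iteration of the quantified group.
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));    // [aba, a, b]
        // (?: ) does not capture and does not count towards the group total.
        System.out.println(Arrays.toString(groups("(?:a)(b)", "ab")));     // [ab, b]
        // A back reference reuses what group 1 captured.
        System.out.println(Arrays.toString(groups("(\\w)\\1", "aa")));     // [aa, a]
    }
}
```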

Regex modes
Regex patterns can be run in different modes. The following table lists all the modes, along with the mode symbols used in Judo regex.

**Table 5.2 Regex Modes**
Mode	Symbol	Meaning
CANON_EQ	c	Enable canonical equivalence, so that two characters will be considered to match if, and only if, their full canonical decompositions match. The expression `a\u030A`, for example, will match the string `å` when this flag is specified. By default, matching does not take canonical equivalence into account.
CASE_INSENSITIVE	i	Enables case-insensitive matching. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag. Case-insensitive matching can also be enabled via the embedded flag expression `(?i)`.
COMMENTS	x	Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with `#` are ignored until the end of a line. Comments mode can also be enabled via the embedded flag expression `(?x)`.
DOTALL	s	Enables dotall mode, where the expression `.` matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression `(?s)`.
MULTILINE	m	Enables multiline mode, where the expressions `^` and `$` match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence. Multiline mode can also be enabled via the embedded flag expression `(?m)`.
UNICODE_CASE	u	Enables Unicode-aware case folding. When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case folding can also be enabled via the embedded flag expression `(?u)`.
UNIX_LINES	l	Enables Unix lines mode, in which only the '\n' line terminator is recognized in the behavior of `.`, `^`, and `$`. Unix lines mode can also be enabled via the embedded flag expression `(?d)`.
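The effect of each mode can be seen from Java by passing the corresponding Pattern flag. The sketch below uses the Java flag constants rather than the Judo mode symbols; the class name and the two helpers are made up for this illustration.

```java
import java.util.regex.Pattern;

class RegexModes {
    // Whole-string match under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Match anywhere in the input under the given flags.
    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE, as a flag or as the embedded expression (?i).
        System.out.println(matches("foo", Pattern.CASE_INSENSITIVE, "FOO")); // true
        System.out.println(matches("(?i)foo", 0, "FOO"));                    // true
        // DOTALL lets . match a line terminator.
        System.out.println(matches("a.b", 0, "a\nb"));                       // false
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));          // true
        // MULTILINE lets ^ and $ match at line boundaries.
        System.out.println(finds("^b$", 0, "a\nb\nc"));                      // false
        System.out.println(finds("^b$", Pattern.MULTILINE, "a\nb\nc"));      // true
        // COMMENTS ignores whitespace and #-comments in the pattern.
        System.out.println(matches("\\d+  # digits", Pattern.COMMENTS, "42")); // true
        // UNIX_LINES treats only \n as a line terminator, so . matches \r.
        System.out.println(matches("a.b", 0, "a\rb"));                       // false
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb"));      // true
    }
}
```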

Use Regular Expressions in Judo

Regex support in Judo is very simple; there are no extra operators or special syntax. The string data type has these regex methods: matches(), matchesStart(), replaceAll(), replaceFirst(), split() and match(). All these methods take a pattern as their first parameter. The pattern can be a single string, or an array of two strings, where the first is the pattern and the second is the modes.

A regex is compiled by the regex engine before it can be used. This process can be expensive if repeated many times, so Judo caches all compiled patterns. The same regex in different modes counts as a different pattern and is cached separately. Let us see some examples.

Listing 5.3 regex1.judo
input = 'aAabFOOAABFooABfOOb';
println input.replaceAll(['a*b','i'], '-'); // result: -FOO-Foo-fOO-

input = 'zzdogzzdigzz';
println input.replaceFirst('d.g','cat');    // result: zzcatzzdigzz

input = 'boo:and:foo';
println input.split(':',2);  // result: [boo,and:foo]
println input.split(':',5);  // result: [boo,and,foo]
println input.split(':',-1); // result: [boo,and,foo]
println input.split('o',5);  // result: [b,,:and:f,,]
println input.split('o',-1); // result: [b,,:and:f,,]
println input.split('o',0);  // result: [b,,:and:f]
println input.split('o');    // result: [b,,:and:f]
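The caching described above, where the same regex in different modes is cached separately, could look roughly like the following Java sketch. This is not Judo's actual code: the class PatternCache, its key format and its methods are assumptions made purely to illustrate the idea of keying the cache on both the regex and its flags.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

class PatternCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Hypothetical cache lookup: the key combines flags and regex, so the same
    // regex compiled under different modes gets its own entry.
    static Pattern get(String regex, int flags) {
        String key = flags + ":" + regex;
        return CACHE.computeIfAbsent(key, k -> Pattern.compile(regex, flags));
    }

    // Replace all matches using a cached pattern.
    static String replaceAll(String input, String regex, int flags, String replacement) {
        return get(regex, flags).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Same operation as the first statement of Listing 5.3.
        System.out.println(
            replaceAll("aAabFOOAABFooABfOOb", "a*b", Pattern.CASE_INSENSITIVE, "-"));
        // The second lookup reuses the compiled instance.
        System.out.println(get("a*b", 0) == get("a*b", 0));
    }
}
```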

TODO: To be expanded with more examples, including various modes and case studies such as log analyzers and such.

Date and Time (to be finished)

Both date and time literals are specified by the same Date keyword. All parts of date/time can be specified in this sequence:

Date(year, month, day, hour, minute, second, milli-second)

where month is 1 through 12 and day is the day of the month. The time components (hour, minute, and so on) can be omitted; missing components default to 0. If no parameters are supplied, Date() represents the current time.

Secret

Judo is a great tool for creating network client programs. Security is one of the major concerns in any distributed environment. Passwords are the most commonly used mechanism, but leaving plain-text passwords in scripts or configuration files is always a huge security hole. Judo addresses this issue by introducing a special data type, Secret. Secret values are created with this constructor:

Secret( encrypted_password [ , decryptor ] )

The decryptor is any object that implements the method decrypt(), which takes a string and returns another. It does not matter whether it is implemented in Judo or Java, though most likely it is in Java. The encrypted value must be a text string. How to obtain it is up to the crypto package that your decryptor is part of. If no decryptor is specified, or the decryptor is not found (i.e., it evaluates to null), the password is returned as-is by default. But would this Secret value really protect the password? Judo is open source; what if an attacker plants a sniffer in the code that captures the password returned from the decryptor?

The idea for this Secret mechanism is to use different decryptor objects in different environments. Take a look at this example:

Suppose you have run some utility and encrypted your password to "XI,8aM4/", and the decryptor is a Java class.

decryptor = null;
{
  decryptor = new java::com.xxx.util.MyCrypto;
catch: ; // ignore any exceptions.
}

// Use a Secret value as password to connect to a database:
connect to dbUrl, 'dbuser', Secret('abcdef', decryptor);

......

This script is run both in a test environment and in the production environment. Each environment has its own database schemas, user names and passwords. In the test environment, the Java class com.xxx.util.MyCrypto is not in the classpath, so the decryptor ends up being null, which is passed to the Secret constructor; therefore, in the test environment, the password for the connect command is actually abcdef, which is ok. In production, the Java class com.xxx.util.MyCrypto is deployed in the classpath (the one that runs Judo), so decryptor holds an instance of that Java class; the class's decrypt() method is called and turns abcdef into THIS IS SOMETHING YOU'D NEVER EVER HAVE GUESSED, which is the password for the production database. Because only the production deployer has the decryptor Java class, security is not compromised by the script, which is checked in to the Configuration Management system that every developer has access to. The same script can easily be run as-is in various environments, including production.
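The behavior described above, where a missing decryptor means the value is used as-is, can be modeled in a few lines of Java. This is a simplified sketch and not Judo's implementation: the Decryptor interface, the Secret class and its reveal() method are invented for illustration, and the string reversal merely stands in for a real decryption routine.

```java
class SecretDemo {
    // Hypothetical contract: any object with a decrypt(String) method.
    interface Decryptor {
        String decrypt(String encrypted);
    }

    static final class Secret {
        private final String encrypted;
        private final Decryptor decryptor;

        Secret(String encrypted, Decryptor decryptor) {
            this.encrypted = encrypted;
            this.decryptor = decryptor;
        }

        // With no decryptor available, the stored text is returned unchanged.
        String reveal() {
            return decryptor == null ? encrypted : decryptor.decrypt(encrypted);
        }
    }

    public static void main(String[] args) {
        // Test environment: the decryptor class is absent, so it is null.
        System.out.println(new Secret("abcdef", null).reveal());

        // Production: a decryptor is present. Reversing the text is only a
        // placeholder for real decryption.
        Decryptor production = s -> new StringBuilder(s).reverse().toString();
        System.out.println(new Secret("abcdef", production).reveal());
    }
}
```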
