TITLE

Synopsis 5: Regexes and Rules

AUTHORS

Damian Conway <damian@conway.org> and Allison Randal <al@shadowed.net>

VERSION
   Maintainer: Patrick Michaud <pmichaud@pobox.com> and
               Larry Wall <larry@wall.org>
   Date: 24 Jun 2002
   Last Modified: 3 June 2006
   Number: 5
   Version: 25

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex because they haven't been regular expressions for a long time. When referring to their use in a grammar, the term rule is preferred.

New match state and capture variables

The underlying match state object is now available as the $/ variable, which is implicitly lexically scoped. All access to the current (or most recent) match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.

By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.

Unchanged syntactic features

The following regex features use the same syntax as in Perl 5:

Modifiers
Changed metacharacters
New metacharacters
Bracket rationalization
Variable (non-)interpolation
Extensible metasyntax (<...>)
Backslash reform
Regexes really are regexes now
Backtracking control
Named Regexes
Nothing is illegal
Return values from matches
Match objects
Subpattern captures
Accessing captured subpatterns
Nested subpattern captures
Quantified subpattern captures
Indirectly quantified subpattern captures
Subpattern numbering
Subrule captures
Accessing captured subrules
Repeated captures of the same subrule
Aliasing

Aliases can be named or numbered. They can be scalar-, array-, or hash-like. And they can be applied to either capturing or non-capturing constructs. The following sections highlight special features of the semantics of some of those combinations.

Named scalar aliasing to subpatterns
  • If a named scalar alias is applied to a set of capturing parens:
            #          ______/capturing parens\_____
            #         |                             |
            #         |                             |
         ms/ $<key>:=( (<[A..E]>) (\d**{3..6}) (X?) ) /;

    then the outer capturing parens no longer capture into the array of $/ (like unaliased parens would). Instead the aliased parens capture into the hash of $/; specifically into the hash element whose key is the alias name.

  • So, in the above example, a successful match sets $<key> (i.e. $/<key>), but not $0 (i.e. not $/[0]).
  • More specifically:
    • $/<key> will contain the Match object that would previously have been placed in $/[0].
    • $/<key>[0] will contain the A-E letter,
    • $/<key>[1] will contain the digits,
    • $/<key>[2] will contain the optional X.
  • Another way to think about this behavior is that aliased parens create a kind of lexically scoped named subrule; that the contents of the brackets are treated as if they were part of a separate subrule whose name is the alias.
Named scalar aliases applied to non-capturing brackets
  • If a named scalar alias is applied to a set of non-capturing brackets:
            #          ___/non-capturing brackets\__
            #         |                             |
            #         |                             |
         ms/ $<key>:=[ (<[A..E]>) (\d**{3..6}) (X?) ] /;

    then the corresponding $/<key> object contains only the string matched by the non-capturing brackets.

  • In particular, the array of the $/<key> entry is empty. That's because square brackets do not create a nested lexical scope, so the subpatterns are unnested and hence correspond to $0, $1, and $2, and not to $/<key>[0], $/<key>[1], and $/<key>[2].
  • In other words:
    • $/<key> will contain the complete substring matched by the square brackets (in a Match object, as described above),
    • $0 will contain the A-E letter,
    • $1 will contain the digits,
    • $2 will contain the optional X.
Named scalar aliasing to subrules
  • If a subrule is aliased, it assigns its Match object to the hash entry whose key is the name of the alias. And it no longer assigns anything to the hash entry whose key is the subrule name. That is:
         if m/ ID\: $<id>:=<ident> / {
             say "Identified as $/<id>";    # $/<ident> is undefined
         }
  • Hence aliasing a subrule changes the destination of the subrule's Match object. This is particularly useful for differentiating two or more calls to the same subrule in the same scope. For example:
         if ms/ mv <file>+ $<dir>:=<file> / {
             @from = @{$<file>};
             $to   = $<dir>;
         }
Numbered scalar aliasing
  • If a numbered alias is used instead of a named alias:
         m/ $1:=(<-[:]>*) \:  $0:=<ident> /

    the behavior is exactly the same as for a named alias (i.e. the various cases described above), except that the resulting Match object is assigned to the corresponding element of the appropriate array rather than to an element of the hash.

  • If any numbered alias is used, the numbering of subsequent unaliased subpatterns in the same scope automatically increments from that alias number (much like enum values increment from the last explicit value). That is:
          #  ---$1---    -$2-    ---$6---    -$7-
          # |        |  |    |  |        |  |    |
         m/ $1:=(food)  (bard)  $6:=(bazd)  (quxd) /;
  • This follow-on behavior is particularly useful for reinstituting Perl 5 semantics for consecutive subpattern numbering in alternations:
         $tune_up = rx/ (don't) (ray) (me) (for) (solar tea), (d'oh!)
                      | $6:=(every) (green) (BEM) (devours) (faces)
                      #             $7      $8    $9        $10
                      /;
  • It also provides an easy way in Perl 6 to reinstitute the unnested numbering semantics of nested Perl 5 subpatterns:
          # Perl 5...
          #               $1
          #  _____________/\______________
          # |    $2          $3       $4  |
          # |  __/\___   ____/\____   /\  |
          # | |       | |          | |  | |
         m/ ( (<[A..E]>) (\d**{3..6}) (X?) ) /;
    
    
          # Perl 6...
          #               $0
          #  _____________/\______________
          # |  $0[0]       $0[1]    $0[2] |
          # |  __/\___   ____/\____   /\  |
          # | |       | |          | |  | |
         m/ ( (<[A..E]>) (\d**{3..6}) (X?) ) /;
    
    
          # Perl 6 simulating Perl 5...
          #                 $1
          #  _______________/\________________
          # |        $2          $3       $4  |
          # |      __/\___   ____/\____   /\  |
          # |     |       | |          | |  | |
         m/ $1:=[ (<[A..E]>) (\d**{3..6}) (X?) ] /;

    The non-capturing brackets don't introduce a scope, so the subpatterns within them are at regex scope, and hence numbered at the top level. Aliasing the square brackets to $1 means that the next subpattern at the same level (i.e. the (<[A..E]>)) is numbered sequentially (i.e. $2), etc.
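
The contrast between the two numbering schemes can be modelled outside Perl. The following is a Python sketch, not Perl 6 code: Python's re module numbers groups by opening parenthesis, as Perl 5 does, so it shows the unnested scheme directly. The nested dictionary is only a hypothetical stand-in for a Perl 6 Match object that carries its own array of subcaptures.

```python
import re

# Python's re numbers groups by opening parenthesis, like Perl 5:
# group 1 is the outer pair, groups 2-4 are the pairs nested inside it.
m = re.match(r"(([A-E])(\d{3,6})(X?))", "B1234X")
flat = [m.group(i) for i in range(1, 5)]

# Hypothetical model of the Perl 6 default described above: the outer
# capture owns its nested captures instead of sharing one flat list.
nested = {
    "text": m.group(1),
    "subcaptures": [m.group(2), m.group(3), m.group(4)],
}
```

Here flat plays the role of Perl 5's $1..$4, while nested["subcaptures"] corresponds to $0[0], $0[1] and $0[2] in the Perl 6 numbering.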

Scalar aliases applied to quantified constructs
  • All of the above semantics apply equally to aliases which are bound to quantified structures.
  • The only difference is that, if the aliased construct is a subrule or subpattern, that quantified subrule or subpattern will have returned a list of Match objects (as described in "Quantified subpattern captures" and "Repeated captures of the same subrule"). So the corresponding array element or hash entry for the alias will contain an array, instead of a single Match object.
  • In other words, aliasing and quantification are completely orthogonal. For example:
         if m/ mv $0:=<file>+ / {
             # <file>+ returns a list of Match objects,
             # so $0 contains an array of Match objects,
             # one for each successful call to <file>
    
             # $/<file> does not exist (it's pre-empted by the alias)
         }
    
    
         if m/ mv $<from>:=(\S+ \s+)* / {
             # Quantified subpattern returns a list of Match objects,
             # so $/<from> contains an array of Match
             # objects, one for each successful match of the subpattern
    
             # $0 does not exist (it's pre-empted by the alias)
         }
  • Note, however, that a set of quantified non-capturing brackets always returns a single Match object which contains only the complete substring that was matched by the full set of repetitions of the brackets (as described in "Named scalar aliases applied to non-capturing brackets"). For example:
         "coffee fifo fumble" ~~ m/ $<effs>:=[f <-[f]>**{1..2} \s*]+ /;
    
         say $<effs>;    # prints "fee fifo fum"
Array aliasing
  • An alias can also be specified using an array as the alias instead of a scalar. For example:
         m/ mv @<from>:=[(\S+) \s+]* <dir> /;
  • Using the @<alias>:= notation instead of a $<alias>:= mandates that the corresponding hash entry or array element always receives an array of Match objects, even if the construct being aliased would normally return a single Match object. This is useful for creating consistent capture semantics across structurally different alternations (by enforcing array captures in all branches):
         ms/ Mr?s? @<names>:=<ident> W\. @<names>:=<ident>
            | Mr?s? @<names>:=<ident>
            /;
    
         # Aliasing to @<names> means $/<names> is always
         # an Array object, so...
    
         say @{$/<names>};
  • For convenience and consistency, @<key> can also be used outside a regex, as a shorthand for @{ $/<key> }. That is:
         ms/ Mr?s? @<names>:=<ident> W\. @<names>:=<ident>
            | Mr?s? @<names>:=<ident>
            /;
    
         say @<names>;
  • If an array alias is applied to a quantified pair of non-capturing brackets, it captures the substrings matched by each repetition of the brackets into separate elements of the corresponding array. That is:
         m/ mv $<files>:=[ f.. \s* ]* /; # $/<files> assigned a single
                                         # Match object containing the
                                         # complete substring matched by
                                         # the full set of repetitions
                                         # of the non-capturing brackets
    
         m/ mv @<files>:=[ f.. \s* ]* /; # $/<files> assigned an array,
                                         # each element of which is a
                                         # Match object containing
                                         # the substring matched by the
                                         # Nth repetition of the non-
                                         # capturing bracket match
  • If an array alias is applied to a quantified pair of capturing parens (i.e. to a subpattern), then the corresponding hash or array element is assigned a list constructed by concatenating the array values of each Match object returned by one repetition of the subpattern. That is, an array alias on a subpattern flattens and collects all nested subpattern captures within the aliased subpattern. For example:
         if ms/ $<pairs>:=( (\w+) \: (\N+) )+ / {
             # Scalar alias, so $/<pairs> is assigned an array
             # of Match objects, each of which has its own array
             # of two subcaptures...
    
             for @{$<pairs>} -> $pair {
                 say "Key: $pair[0]";
                 say "Val: $pair[1]";
             }
         }
    
    
         if ms/ @<pairs>:=( (\w+) \: (\N+) )+ / {
             # Array alias, so $/<pairs> is assigned an array
             # of Match objects, each of which is flattened out of
             # the two subcaptures within the subpattern
    
             for @{$<pairs>} -> $key, $val {
                 say "Key: $key";
                 say "Val: $val";
             }
         }
  • Likewise, if an array alias is applied to a quantified subrule, then the hash or array element corresponding to the alias is assigned a list containing the array values of each Match object returned by each repetition of the subrule, all flattened into a single array:
         rule pair { (\w+) \: (\N+) \n }
    
         if ms/ $<pairs>:=<pair>+ / {
             # Scalar alias, so $/<pairs> contains an array of
             # Match objects, each of which is the result of the
             # <pair> subrule call...
    
             for @{$<pairs>} -> $pair {
                 say "Key: $pair[0]";
                 say "Val: $pair[1]";
             }
         }
    
    
         if ms/ @<pairs>:=<pair>+ / {
             # Array alias, so $/<pairs> contains an array of
             # Match objects, all flattened down from the
             # nested arrays inside the Match objects returned
             # by each match of the <pair> subrule...
    
             for @{$<pairs>} -> $key, $val {
                 say "Key: $key";
                 say "Val: $val";
             }
         }
  • In other words, an array alias is useful to flatten into a single array any nested captures that might occur within a quantified subpattern or subrule, whereas a scalar alias is useful to preserve within a top-level array the internal structure of each repetition.
  • It is also possible to use a numbered variable as an array alias. The semantics are exactly as described above, with the sole difference being that the resulting array of Match objects is assigned into the appropriate element of the regex's match array rather than to a key of its match hash. For example:
         if m/ mv  \s+  @0:=((\w+) \s+)+  $1:=((\W+) (\s*))* / {
             #          |                 |
             #          |                 |
             #          |                  \_ Scalar alias, so $1 gets an
             #          |                     array, with each element
             #          |                     a Match object containing
             #          |                     the two nested captures
             #          |
             #           \___ Array alias, so $0 gets a flattened array of
             #                just the (\w+) captures from each repetition
    
             @from     = @{$0};      # Flattened list
    
             $to_str   = $1[0][0];   # Nested elems of
             $to_gap   = $1[0][1];   #    unflattened list
         }
  • Note again that, outside a regex, @0 is simply a shorthand for @{$0}, so the first assignment above could also have been written:
         @from = @0;
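
The difference in shape between the two kinds of alias can be sketched in Python. This is an illustration only, assuming a simple "key: value" line format similar to the pair rule above; the names per_repetition, flattened and pairs_again are hypothetical, and Python's re.findall stands in for the quantified subrule.

```python
import re

text = "alpha: one\nbeta: two\n"

# One (key, value) tuple per repetition: analogous to the scalar alias,
# which keeps the subcaptures of each repetition grouped together.
per_repetition = re.findall(r"(\w+): ([^\n]+)\n", text)

# Flattening those tuples gives the shape described for the array alias.
flattened = [capture for pair in per_repetition for capture in pair]

# Consume the flattened list two items at a time, like `-> $key, $val`.
pairs_again = list(zip(flattened[0::2], flattened[1::2]))
```

The flattened form is convenient when every repetition has the same number of subcaptures; the grouped form is safer when that number can vary.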
Hash aliasing
  • An alias can also be specified using a hash as the alias variable, instead of a scalar or an array. For example:
         m/ mv %<location>:=( (<ident>) \: (\N+) )+ /;
  • A hash alias causes the corresponding hash or array element in the current scope's Match object to be assigned a (nested) Hash object (rather than an Array object or a single Match object).
  • If a hash alias is applied to a subrule or subpattern then the first nested numeric capture becomes the key of each hash entry and any remaining numeric captures become the values (in an array if there is more than one).
  • As with array aliases it is also possible to use a numbered variable as a hash alias. Once again, the only difference is where the resulting Match object is stored:
         rule one_to_many {  (\w+) \: (\S+) (\S+) (\S+) }
    
         if ms/ %0:=<one_to_many>+ / {
             # $/[0] contains a hash, in which each key is provided by
             # the first subcapture within one_to_many, and each
             # value is an array containing the subrule's second,
             # third, fourth, etc. subcaptures...
    
             for %{$/[0]} -> $pair {
                 say "One:  $pair.key";
                 say "Many: { @{$pair.value} }";
             }
         }
  • Outside the regex, %0 is a shortcut for %{$0}:
         for %0 -> $pair {
             say "One:  $pair.key";
             say "Many: { @{$pair.value} }";
         }
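
The key and value rule for hash aliases can be modelled as follows. This is a Python sketch of the stated rule rather than Perl 6 code; the input string and the name alias are assumptions made for illustration.

```python
import re

text = "a: 1 2 3 b: 4 5 6"

alias = {}
for m in re.finditer(r"(\w+): (\S+) (\S+) (\S+)", text):
    # The first capture becomes the key; the rest become the value.
    key, *rest = m.groups()
    # A single remaining capture is stored bare; several become a list.
    alias[key] = rest[0] if len(rest) == 1 else rest
```

Each repetition of the pattern contributes one entry, so alias ends up with one key per match of the subrule.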
External aliasing
  • Instead of using internal aliases like:
         m/ mv  @<files>:=<ident>+  $<dir>:=<ident> /

    the name of an ordinary variable can be used as an external alias, like so:

         m/ mv  @files:=<ident>+  $dir:=<ident> /
  • In this case, the behavior of each alias is exactly as described in the previous sections, except that the resulting capture(s) are bound directly (but still hypothetically) to the variables of the specified name that exist in the scope in which the regex is declared.
Capturing from repeated matches
  • When an entire regex is successfully matched with repetitions (specified via the :x or :g flag) or overlaps (specified via the :ov or :ex flag), it will usually produce a series of distinct matches.
  • A successful match under any of these flags still returns a single Match object in $/. However, this object may represent a partial evaluation of the regex. Moreover, the values of this match object are slightly different from those provided by a non-repeated match:
    • The boolean value of $/ after such matches is true or false, depending on whether the pattern matched.
    • The string value is the substring from the start of the first match to the end of the last match (including any intervening parts of the string that the regex skipped over in order to find later matches).
    • Subcaptures are returned as a multidimensional list, which the user can choose to process in either of two ways. If you refer to @(), the multidimensionality is ignored and all the matches are returned flattened (but still lazily). If you refer to @@(), you can get each individual sublist as a Capture object. (That is, there is a @@() coercion operator that happens, like @(), to default to $/.) As with any multidimensional list, each sublist can be lazy separately.

    For example:

         if $text ~~ ms:g/ (\S+:) <rocks> / {
             say "Full match context is: [$/]";
         }

    But the list of individual match objects corresponding to each separate match is also available:

         if $text ~~ ms:g/ (\S+:) <rocks> / {
             say "Matched { +@@() } times";    # Note: forced eager here
    
             for @@() -> $m {
                 say "Match between $m.from() and $m.to()";
                 say 'Right on, dude!' if $m[0] eq 'Perl';
                 say "Rocks like $m<rocks>";
             }
         }
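
The relationship between the individual matches and the overall string value can be sketched in Python. This is a model of the behavior described above, not Perl 6 code: re.finditer stands in for the :g flag, a literal "rocks" stands in for the rocks subrule, and the sample text is an assumption.

```python
import re

text = "say Perl rocks and Ruby rocks, ok"
matches = list(re.finditer(r"(\S+) rocks", text))

# Positions of each individual match, like $m.from() and $m.to().
spans = [(m.start(), m.end()) for m in matches]

# Overall string value: from the start of the first match to the end
# of the last one, including the text skipped over in between.
overall = text[spans[0][0]:spans[-1][1]] if matches else ""

# First subcapture of each match, like $m[0] in the loop above.
firsts = [m.group(1) for m in matches]
```

Note that overall contains the intervening " and ", even though no individual match covers it.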
:keepall
  • All regexes remember everything if :keepall is in effect anywhere in the outer dynamic scope. In this case everything inside the angles is used as part of the key. Suppose the earlier example parsed whitespace:
         / <key> <?ws> <'=>'> <?ws> <value> { %hash{$<key>} = $<value> } /

    The two instances of <?ws> above would store an array of two values accessible as @<?ws>. It would also store the literal match into $<'=\>'>. Just to make sure nothing is forgotten, under :keepall any text or whitespace not otherwise remembered is attached as an extra property on the subsequent node. (The name of that property is "pretext".)

Grammars
  • Your private ident rule shouldn't clobber someone else's ident rule. So some mechanism is needed to confine rules to a namespace.
  • If subs are the model for rules, then modules/classes are the obvious model for aggregating them. Such collections of rules are generally known as grammars.
  • Just as a class can collect named actions together:
         class Identity {
             method name { "Name = $.name" }
             method age  { "Age  = $.age"  }
             method addr { "Addr = $.addr" }
    
             method desc {
                 print &.name(), "\n",
                       &.age(),  "\n",
                       &.addr(), "\n";
             }
    
             # etc.
         }

    so too a grammar can collect a set of named rules together:

         grammar Identity {
             rule name { Name = (\N+) }
             rule age  { Age  = (\d+) }
             rule addr { Addr = (\N+) }
             rule desc {
                 <name> \n
                 <age>  \n
                 <addr> \n
             }
    
             # etc.
         }
  • Like classes, grammars can inherit:
         grammar Letter {
             rule text     { <greet> <body> <close> }
    
             rule greet { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}
    
             rule body     { <line>+? }   # note: backtracks forwards via +?
    
             rule close { Later dude, $<from>:=(.+) }
    
             # etc.
         }
    
         grammar FormalLetter is Letter {
    
             rule greet { Dear $<to>:=(\S+?) , $$}
    
             rule close { Yours sincerely, $<from>:=(.+) }
    
         }
  • Just like the methods of a class, the rule definitions of a grammar are inherited (and polymorphic!). So there's no need to respecify body, line, etc.
  • Perl 6 will come with at least one grammar predefined:
         grammar Perl {    # Perl's own grammar
    
             rule prog { <statement>* }
    
             rule statement {
                       | <decl>
                       | <loop>
                       | <label> [<cond>|<sideff>|;]
             }
    
             rule decl { <sub> | <class> | <use> }
    
             # etc. etc. etc.
         }
  • Hence:
         given $source_code {
             $parsetree = m:keepall/<Perl.prog>/;
         }

Syntactic categories

For writing your own backslash and assertion subrules or macros, you may use the following syntactic categories:

     token rule_backslash:<w> { ... }     # define your own \w and \W
     token rule_assertion:<*> { ... }     # define your own <*stuff>
     macro rule_metachar:<,> { ... }      # define a new metacharacter
     macro rule_mod_internal:<x> { ... }  # define your own /:x() stuff/
     macro rule_mod_external:<x> { ... }  # define your own m:x()/stuff/

As with any such syntactic shenanigans, the declaration must be visible in the lexical scope to have any effect. It's possible the internal/external distinction is just a trait, and that some of those things are subs or methods rather than subrules or macros. (The numeric regex modifiers are recognized by fallback macros defined with an empty operator name.)

Pragmas

Various pragmas may be used to control aspects of regex compilation and usage not otherwise provided for. These are tied to the particular declarator in question:

    use s :foo;         # control s defaults
    use m :foo;         # control m defaults
    use rx :foo;        # control rx defaults
    use regex :foo;     # control regex defaults
    use token :foo;     # control token defaults
    use rule :foo;      # control rule defaults

(It is a general policy in Perl 6 that any pragma designed to influence the surface behavior of a keyword is identical to the keyword itself, unless there is good reason to do otherwise. On the other hand, pragmas designed to influence deep semantics should not be named identically, though of course some similarity is good.)

Transliteration
  • The tr/// quote-like operator now also has a method form called trans(). Its argument is a list of pairs. You can use anything that produces a pair list:
         $str.trans( %mapping.pairs.sort );

    Use the .= form to do a translation in place:

         $str.=trans( %mapping.pairs.sort );
  • The two sides of any pair can be strings interpreted as tr/// would:
         $str.=trans( 'A..C' => 'a..c', 'XYZ' => 'xyz' );

    As a degenerate case, each side can be individual characters:

         $str.=trans( 'A'=>'a', 'B'=>'b', 'C'=>'c' );
  • The two sides of each pair may also be Array objects:
         $str.=trans( ['A'..'C'] => ['a'..'c'], <X Y Z> => <x y z> );
  • The array version can map one-or-more characters to one-or-more characters:
         $str.=trans( [' ',      '<',    '>',    '&'    ] =>
                      ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);

    In the case that more than one sequence of input characters matches, the longest one wins. In the case of two identical sequences the first in order wins.
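
The tie-breaking rule can be made concrete with a small Python model. This is a sketch of the stated rule only, not the Perl 6 implementation: the function name trans and its list-of-tuples argument are assumptions, and range strings such as 'A..C' are not handled.

```python
def trans(text, pairs):
    """Translate text using (source, replacement) pairs.

    At each position the longest matching source wins; among
    identical sources the first pair in order wins.
    """
    out = []
    i = 0
    while i < len(text):
        best = None
        for src, dst in pairs:
            if src and text.startswith(src, i):
                # Strict > keeps the earliest pair when lengths tie.
                if best is None or len(src) > len(best[0]):
                    best = (src, dst)
        if best is None:
            out.append(text[i])       # no source matches: copy as-is
            i += 1
        else:
            out.append(best[1])
            i += len(best[0])
    return "".join(out)


escaped = trans("a < b & c", [(" ", "&nbsp;"), ("<", "&lt;"),
                              (">", "&gt;"), ("&", "&amp;")])
```

With the pairs ("a", "1") and ("ab", "2"), the input "abc" translates through the longer source "ab", which is the behavior the longest-match rule calls for.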

    There are also method forms of m// and s///:

         $str.match(//);
         $str.subst(//, "replacement");
         $str.subst(//, {"replacement"});
         $str.=subst(//, "replacement");
         $str.=subst(//, {"replacement"});
Positional matching, fixed width types
  • To anchor to a particular position in the general case you can use the <at($pos)> assertion to say that the current position is the same as the position object you supply. You may set the current match position via the :c and :p modifiers.

    However, please remember that in Perl 6 string positions are generally not integers, but objects that point to a particular place in the string regardless of whether you count by bytes or codepoints or graphemes. If used with an integer, the at assertion will assume you mean the current lexically scoped Unicode level, on the assumption that this integer was somehow generated in this same lexical scope. If this is outside the current string's allowed abstraction levels, an exception is thrown. See S02 for more discussion of string positions.

  • Buf types are based on fixed-width cells and can therefore handle integer positions just fine, and treat them as array indices. In particular, buf8 AKA buf is just an old-school byte string. Matches against Buf types are restricted to ASCII semantics in the absence of an explicit modifier asking for the array's values to be treated as some particular encoding such as UTF-32. (This is also true for those compact arrays that are considered isomorphic to Buf types.) Positions within Buf types are always integers, counting one per unit cell of the underlying array. Be aware that "from" and "to" positions are reported as being between elements. If matching against a compact array @foo, a final position of 42 indicates that @foo[42] was the first element not included.
Matching against non-strings
  • Anything that can be tied to a string can be matched against a regex. This feature is particularly useful with input streams:
         my $stream is from($fh);       # tie scalar to filehandle
    
         # and later...
    
         $stream ~~ m/pattern/;         # match from stream
  • Any non-compact array of mixed strings or objects can be matched against a regex:
        @array ~~ / foo <,> bar <elem>* /;

    The special <,> subrule matches the boundary between elements. The <elem> assertion matches any individual array element. It is the equivalent of "dot" for the whole element.

    If the array elements are strings, they are concatenated virtually into a single logical string. If the array elements are tokens or other such objects, the objects must provide appropriate methods for the kinds of subrules to match against. It is an assertion error to match a string-matching assertion against an object that doesn't provide a string view. However, pure object lists can be parsed as long as the match (including any subrules) restricts itself to assertions like:

         <.isa(Dog)>
         <.does(Bark)>
         <.can('scratch')>

    It is permissible to mix objects and strings in an array as long as they're in different elements. You may not embed objects in strings, however. Any object may, of course, pretend to be a string element if it likes.

    Please be aware that the warnings on .from and .to returning opaque objects go double for matching against an array, where a particular position reflects both a position within the array and (potentially) a position within a string of that array. Do not expect to do math with such values. Nor should you expect to be able to extract a substr that crosses element boundaries.

  • To match against each element of an array, use a hyper operator:
         @array».match($regex)