Synopsis 5: Regexes and Rules
Damian Conway <damian@conway.org> and Allison Randal <al@shadowed.net>
Maintainer: Patrick Michaud <pmichaud@pobox.com> and
Larry Wall <larry@wall.org>
Date: 24 Jun 2002
Last Modified: 3 June 2006
Number: 5
Version: 25
This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex because they haven't been regular expressions for a long time. When referring to their use in a grammar, the term rule is preferred.
The underlying match state object is now available as the $/ variable, which is implicitly lexically scoped. All access to the current (or most recent) match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.
By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.
The following regex features use the same syntax as in Perl 5:
The extended syntax (/x) is no longer required...it's the default. (In fact, it's pretty much mandatory--the only way to get back to the old syntax is with the :Perl5/:P5 modifier.)
There are no /s or /m modifiers (changes to the meta-characters replace them - see below).
There is no /e evaluation modifier on substitutions; instead use:
s/pattern/{ doit() }/
Instead of /ee say:
s/pattern/{ eval doit() }/
Modifiers are now placed as adverbs at the start of a match or substitution:
m:g:i/\s* (\w*) \s* ,?/;
Every modifier must start with its own colon. The delimiter must be separated from the final modifier by whitespace if it would be taken as an argument to the preceding modifier (which is true for any bracketing character).
The single-character modifiers also have longer versions:
:i :ignorecase
:g :global
The :c (or :continue) modifier causes the pattern to continue scanning from the string's current .pos:
m:c/ pattern / # start at end of previous match on $_
Note that this does not automatically anchor the pattern to the starting location. (Use :p for that.) The pattern you supply to split has an implicit :c modifier.
The :continue modifier takes an optional argument of type StrPos which specifies the point at which to start scanning for a match. This should not be used unless you know what you're doing, or just happen to like hard-to-debug infinite loops.
The :p (or :pos) modifier causes the pattern to try to match only at the string's current .pos:
m:p/ pattern / # match at end of previous match on $_
Since this is implicitly anchored to the position, it's suitable for building parsers and lexers. The pattern you supply to a Perl macro's is parsed trait has an implicit :p modifier.
Note that
m:c/pattern/
is roughly equivalent to
m:p/.*? <( pattern )> /
Also note that any regex called as a subrule is implicitly anchored to the current position anyway.
The :pos modifier takes an optional argument of type StrPos which specifies the point at which to attempt a match. This should not be used lightly. Put it in the category of a "goto".
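The difference between :c (scan onward from a position) and :p (match only at that position) has a close analogue in Python's re module, where compiled patterns accept a start position. The following is a Python sketch of the distinction, not Perl 6 code; the sample text and variable names are illustrative assumptions.

```python
import re

digits = re.compile(r"\d+")
text = "ab12cd34"

# Like m:c/ \d+ /: start scanning at position 4, but do not anchor there.
scan = digits.search(text, 4)

# Like m:p/ \d+ /: the match must begin exactly at the given position.
anchored_miss = digits.match(text, 4)   # position 4 is "c", so no match
anchored_hit = digits.match(text, 6)    # position 6 starts "34"

print(scan.group(), anchored_miss, anchored_hit.group())
```

The unanchored scan skips over "cd" to find "34", while the anchored attempt at position 4 fails outright, which is the behavior a lexer usually wants.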
The :s (:sigspace) modifier causes whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <?ws>. That is,
m:s/ next cmd = <condition>/
is the same as:
m/ <?ws> next <?ws> cmd <?ws> = <?ws> <condition>/
which is effectively the same as:
m/ \s* next \s+ cmd \s* = \s* <condition>/
But in the case of
m:s {(a|\*) (b|\+)}
or equivalently,
m { (a|\*) <?ws> (b|\+) }
<?ws> can't decide what to do until it sees the data. It still does the right thing. If not, define your own <?ws> and :sigspace will use that.
In general you don't need to use :sigspace within grammars because the parser rules automatically handle whitespace policy for you. In this context, whitespace often includes comments, depending on how the grammar chooses to define its whitespace rule. Although the default <?ws> subrule recognizes no comment construct, any grammar is free to override the rule. The <?ws> rule is not intended to mean the same thing everywhere.
It's also possible to pass an argument to :sigspace specifying a completely different subrule to apply. This can be any rule, it doesn't have to match whitespace. When discussing this modifier, it is important to distinguish the significant whitespace in the pattern from the "whitespace" being matched, so we'll call the pattern's whitespace sigspace, and generally reserve whitespace to indicate whatever <?ws> matches in the current grammar. The correspondence between sigspace and whitespace is primarily metaphorical, which is why the correspondence is both useful and (potentially) confusing.
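To make the default whitespace policy concrete, here is a small Python sketch (not Perl 6) of a rule that behaves like the default <?ws> as this document describes it: whitespace is mandatory between two word characters and optional elsewhere. The function names ws and match_sigspace, and the restriction to literal tokens, are assumptions made for illustration.

```python
import re

def ws(text, pos):
    """Skip whitespace at pos. Return the new position, or None when
    whitespace was required (between two word characters) but absent."""
    end = pos
    while end < len(text) and text[end].isspace():
        end += 1
    before_is_word = pos > 0 and re.match(r"\w", text[pos - 1]) is not None
    after_is_word = end < len(text) and re.match(r"\w", text[end]) is not None
    if before_is_word and after_is_word and end == pos:
        return None
    return end

def match_sigspace(literals, text):
    """Match the literals in order at the start of text, applying ws()
    wherever the pattern would have contained sigspace."""
    pos = ws(text, 0)
    for literal in literals:
        if pos is None or not text.startswith(literal, pos):
            return False
        pos = ws(text, pos + len(literal))
    return pos is not None

print(match_sigspace(["next", "cmd", "="], "next cmd = x"))  # True
print(match_sigspace(["next", "cmd", "="], "nextcmd = x"))   # False
```

A grammar that overrides its whitespace rule would, in this model, simply swap in a different ws function.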
The :s modifier is considered sufficiently important that match variants are defined for it:
ms/match some words/ # same as m:sigspace
ss/match some words/replace those words/ # same as s:sigspace
Conjecture: This might become sufficiently idiomatic that ms// would be better as a "stuttered" mm// instead, much as qq// became idiomatic. It would also match ss/// that way.
New modifiers specify the Unicode level:
m:bytes / .**{2} / # match two bytes
m:codes / .**{2} / # match two codepoints
m:graphs/ .**{2} / # match two graphemes
m:langs / .**{2} / # match two language dependent chars
There are corresponding pragmas to default to these levels.
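The :Perl5 modifier allows Perl 5 regex syntax to be used instead:
m:Perl5/(?mi)^[a-z]{1,2}(?=\s)/
(It does not go so far as to allow you to put your modifiers at the end.)
Any integer modifier specifies a count. What kind of count is determined by the characters that follow it.
If an integer modifier is followed by an x, it means repetition. Use :x(4) for the general form. So
s:4x [ (<?ident>) = (\N+) $$] [$0 => $1];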
is the same as:
s:x(4) [ (<?ident>) = (\N+) $$] [$0 => $1];
which is almost the same as:
$_.pos = 0;
s:c [ (<?ident>) = (\N+) $$] [$0 => $1] for 1..4;
except that the string is unchanged unless all four matches are found. However, ranges are allowed, so you can say :x(1..4) to change anywhere from one to four matches.
If an integer modifier is followed by st, nd, rd, or th, it means find the Nth occurrence. Use :nth(3) for the general form. So
s:3rd/(\d+)/@data[$0]/;
is the same as
s:nth(3)/(\d+)/@data[$0]/;
which is the same as:
m/(\d+)/ && m:c/(\d+)/ && s:c/(\d+)/@data[$0]/;
Lists and junctions are allowed: :nth(1|2|3|5|8|13|21|34|55|89).
So are closures: :nth{.is_fibonacci}
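The two counting modifiers can be emulated in Python to show how they differ: :nth replaces only one particular occurrence, while :x replaces the first N occurrences but leaves the string untouched unless all N are found. This is a hedged sketch in Python, not Perl 6; sub_nth and sub_times are hypothetical helper names, and ranges, junctions, and closures as arguments are not modeled.

```python
import re

def sub_nth(pattern, repl, text, nth):
    """Replace only the nth (1-based) non-overlapping match.
    repl is a callable that receives the match object."""
    for count, m in enumerate(re.finditer(pattern, text), start=1):
        if count == nth:
            return text[:m.start()] + repl(m) + text[m.end():]
    return text  # fewer than nth matches: leave the string alone

def sub_times(pattern, repl, text, times):
    """Replace the first `times` matches, but only if that many exist."""
    matches = list(re.finditer(pattern, text))
    if len(matches) < times:
        return text
    return re.sub(pattern, repl, text, count=times)

print(sub_nth(r"\d+", lambda m: "<" + m.group() + ">", "a1 b22 c333 d4", 3))
print(sub_times(r"\d+", lambda m: "#", "a1 b22 c333", 2))
```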
With the :ov (:overlap) modifier, the current regex will match at all possible character positions (including overlapping) and return all matches in a list context, or a disjunction of matches in a scalar context. The first match at any position is returned.
$str = "abracadabra";
if $str ~~ m:overlap/ a (.*) a / {
    @substrings = @@(); # bracadabr cadabr dabr br
}
With the :ex (:exhaustive) modifier, the current regex will match every possible way (including overlapping) and return all matches in a list context, or a disjunction of matches in a scalar context.
$str = "abracadabra";
if $str ~~ m:exhaustive/ a (.*) a / {
    say "@()"; # br brac bracad bracadabr c cad cadabr d dabr br
}
Note that the ~~ above can return as soon as the first match is found, and the rest of the matches may be performed lazily by @().
[Conjecture: the :exhaustive modifier should have an optional argument specifying how many seconds to run before giving up, since it's trivially easy to ask for the heat death of the universe to happen first.]
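The difference between :overlap and :exhaustive can be checked by brute force. The Python sketch below (not Perl 6) is hard-wired to the pattern a (.*) a used in the examples above and reproduces the capture lists given in their comments; the function names are illustrative.

```python
def overlap_captures(s):
    """One capture per starting 'a': the first (greedy) match there."""
    captures = []
    for start, ch in enumerate(s):
        if ch == "a":
            last = s.rfind("a", start + 1)  # greedy: furthest closing 'a'
            if last != -1:
                captures.append(s[start + 1:last])
    return captures

def exhaustive_captures(s):
    """Every (start, end) pair of 'a' characters yields a capture."""
    captures = []
    for start, ch in enumerate(s):
        if ch != "a":
            continue
        for end in range(start + 1, len(s)):
            if s[end] == "a":
                captures.append(s[start + 1:end])
    return captures

print(" ".join(overlap_captures("abracadabra")))
print(" ".join(exhaustive_captures("abracadabra")))
```

With five occurrences of "a" there are only ten pairs here, but the count grows quadratically, which is the motivation for the time-limit conjecture.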
The :rw modifier causes this regex to claim the current string for modification rather than assuming copy-on-write semantics. All the bindings in $/ become lvalues into the string, such that if you modify, say, $1, the original string is modified in that location, and the positions of all the other fields modified accordingly (whatever that means). In the absence of this modifier (especially if it isn't implemented yet, or is never implemented), all pieces of $/ are considered copy-on-write, if not read-only.
The :keepall modifier causes this regex and all invoked subrules to remember everything, even if the rules themselves don't ask for their subrules to be remembered. This is for forcing a grammar that throws away whitespace and comments to keep them instead.
The :ratchet modifier causes this regex to not backtrack by default. (Generally you do not use this modifier directly, since it's implied by token and rule declarations.) The effect of this modifier is to imply a : after every construct that could backtrack, including bare *, +, and ? quantifiers, as well as alternations.
The :panic modifier causes this regex and all invoked subrules to try to backtrack on any rules that would otherwise default to not backtracking because they have :ratchet set. Never panic unless you're desperate and want the pattern matcher to do a lot of unnecessary work. If you have an error in your grammar, it's almost certainly a bad idea to fix it by backtracking.
The :i, :s, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped):
m/:s alignment = [:i left|right|cent[er|re]] /
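User-defined modifiers will be possible:
m:fuzzy/pattern/;
User-defined modifiers can also take arguments:
m:fuzzy('bare')/pattern/;
To use parens or brackets for your delimiters you have to separate them from the modifier with whitespace:
m:fuzzy (pattern);
or you'll end up with:
m:fuzzy(fuzzyargs); pattern ;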
The dot (.) now matches any character including newline. (The /s modifier is gone.)
^ and $ now always match the start/end of a string, like the old \A and \z. (The /m modifier is gone.)
A $ no longer matches an optional preceding \n so it's necessary to say \n?$ if that's what you mean.
\n now matches a logical (platform independent) newline, not just \x0a.
The \A, \Z, and \z metacharacters are gone.
Because /x is default:
# now always introduces a comment. If followed by an opening bracket character (and if not in the first column), it introduces an embedded comment that terminates with the closing bracket. Otherwise the comment terminates at the newline.
Whitespace is now always metasyntactic, i.e. used only for layout and not matched literally (but see the :sigspace modifier described above).
^^ and $$ match line beginnings and endings. (The /m modifier is gone.) They are both zero-width assertions. $$ matches before any \n (logical newline), and also at the end of the string if the final character was not a \n. ^^ always matches the beginning of the string and after any \n that is not the final character in the string.
. matches anything, while \N matches anything except newline. (The /s modifier is gone.) In particular, \N matches neither carriage return nor line feed.
The & metacharacter separates conjunctive terms. The patterns on either side must match with the same beginning and end point. The operator is list associative like |, has higher precedence than |, and backtracking makes the right argument vary faster than the left.
(...) still delimits a capturing group. However the ordering of these groups is hierarchical rather than linear. See "Nested subpattern captures".
[...] is no longer a character class. It now delimits a non-capturing group.
{...} is no longer a repetition quantifier. It now delimits an embedded closure.
An embedded closure does not usually affect the match; it is only used for side effects:
/ (\S+) { print "string not blank\n"; $text = $0; }
\s+ { print "but does contain whitespace\n" }
/
It can affect the match if it calls fail:
/ (\d+) { $0 < 256 or fail } /
Closures are guaranteed to be called at the canonical time even if the optimizer could prove that something after them can't match. (Anything before is fair game, however.)
The general repetition specifier is now **{...} for maximal matching, with a corresponding **{...}? for minimal matching. Space is allowed on either side of the asterisks. The curlies are taken to be a closure returning an Int or a Range object.
/ value was (\d ** {1..6}?) with ([\w]**{$m..$n}) /
It is illegal to return a list, so this easy mistake fails:
/ [foo]**{1,3} /
(At least, it fails in the absence of use rx :listquantifier, which is likely to be unimplemented in Perl 6.0.0 anyway.)
The optimizer will likely optimize away things like **{1..*} so that the closure is never actually run in that case. But it's a closure that must be run in the general case, so you can use it to generate a range on the fly based on the earlier matching. (Of course, bear in mind the closure must be run before attempting to match whatever it quantifies.)
Angle brackets <...> are now extensible metasyntax delimiters or assertions (i.e. they replace Perl 5's crufty (?...) syntax).
In Perl 6 regexes, variables don't interpolate. By default the engine matches a scalar variable as a <'...'> literal (i.e. it does not treat the interpolated string as a subpattern). In other words, a Perl 6:
/ $var /
is like a Perl 5:
/ \Q$var\E /
However, if $var contains a Regex object, instead of attempting to convert it to a string, it is called as a subrule, as if you said <$var>. (See assertions below.) This form does not capture, and it fails if $var is tainted.
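For readers more familiar with other regex engines, the literal-by-default treatment of a scalar corresponds to escaping the value before compiling it. A minimal Python sketch of that idea follows (not Perl 6); literal_pattern is a hypothetical helper, and the Regex-object and taint cases described above are not modeled.

```python
import re

def literal_pattern(value, ignorecase=False):
    """Compile value so that every character matches itself,
    much as \\Q...\\E does in Perl 5."""
    flags = re.IGNORECASE if ignorecase else 0
    return re.compile(re.escape(value), flags)

var = "a.b*"
as_literal = literal_pattern(var)
as_subpattern = re.compile(var)

print(bool(as_literal.search("xa.b*y")))    # True: the exact text is present
print(bool(as_literal.search("aXbbb")))     # False: metacharacters are inert
print(bool(as_subpattern.search("aXbbb")))  # True: "." and "*" are live
```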
An interpolated array:
/ @cmds /
is matched as if it were an alternation of its elements:
/ [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /
As with a scalar variable, each element is matched as a literal unless it happens to be a Regex object, in which case it is matched as a subrule. As with scalar subrules, a tainted subrule always fails. All values pay attention to the current :ignorecase setting.
An interpolated hash matches the longest possible key of the hash as a literal, or fails if no key matches. (A "" key will match anywhere, provided no longer key matches.)
If the corresponding value is "", nothing special happens except that the key match succeeds.
If the value is a Regex object, it is executed as a subrule, with an initial position after the matched key. (This is further described below under the <%hash> notation.) As with scalar subrules, a tainted subrule always fails, and no capture is attempted.
All hash keys, and values that are strings, pay attention to the :ignorecase setting. (Subrules maintain their own case settings.)
The first character after < determines the behavior of the assertion.
A leading alphabetic character means it's a capturing grammatical assertion (i.e. a subrule or a named character class):
/ <sign>? <mantissa> <exponent>? /
The special named assertions include:
/ <before pattern> / # was /(?=pattern)/
/ <after pattern> / # was /(?<=pattern)/
/ <sp> / # match the SPACE character (U+0020)
/ <ws> / # match "whitespace":
# \s+ if it's between two \w characters,
# \s* otherwise
/ <at($pos)> / # match only at a particular StrPos
# short for <?{ .pos == $pos }>
The after assertion implements lookbehind by reversing the syntax tree and looking for things in the opposite order going to the left. It is illegal to do lookbehind on a pattern that cannot be reversed.
Note: the effect of a forward-scanning lookbehind at the top level can be achieved with:
/ .*? prestuff <( mainpat )> /
A leading ? causes the assertion not to capture what it matches (see "Subrule captures"). For example:
/ <ident> <ws> / # $/<ident> and $/<ws> both captured
/ <?ident> <ws> / # only $/<ws> captured
/ <?ident> <?ws> / # nothing captured
The non-capturing behavior may be overridden with a :keepall.
A leading $ indicates an indirect subrule. The variable must contain either a Regex object, or a string to be compiled as the regex. The string is never matched literally. By default <$foo> is captured into $<foo>, but you can use the <?$foo> form to suppress capture, and you can always say $<$foo> := <$foo> if you prefer to include the sigil in the key.
A leading :: indicates a symbolic indirect subrule:
/ <::($somename)> /
The variable must contain the name of a subrule. By the rules of single method dispatch this is first searched for in the current grammar and its ancestors. If this search fails an attempt is made to dispatch via MMD, in which case it can find subrules defined as multis rather than methods. This form is not captured by default.
A leading @ matches like a bare array except that each element is treated as a subrule (string or Regex object) rather than as a literal. That is, a string is forced to be compiled as a subrule instead of being matched literally. (There is no difference for a Regex object.) By default <@foo> is captured into $<foo>, but you can use the <?@foo> form to suppress capture, and you can always say $<@foo> := <@foo> if you prefer to include the sigil in the key.
A leading % matches like a bare hash except that a string value is always treated as a subrule, even if it is a string that must be compiled to a regex at match time. (Numeric values may still indicate "false match", and a closure may do whatever it likes.) By default <%foo> is captured into $<foo>, but you can use the <?%foo> form to suppress capture, and you can always say $<%foo> := <%foo> if you prefer to include the sigil in the key.
With both bare hash and hash in angles, the key is counted as "matched" immediately; that is, the current match position is set to after the key token before calling any subrule in the value. That subrule may, however, magically access the key anyway as if the subrule had started before the key and matched with a <KEY> assertion. That is, $<KEY> will contain the keyword or token that this subrule was looked up under, and that value will be returned by the current match object even if you do nothing special with it within the match. (This also works for the name of a macro as seen from an is parsed regex, since internally that turns into a hash lookup.)
As with bare hash, the longest key matches according to the venerable longest token rule, but in addition, you may combine multiple hashes under the same longest-token consideration like this:
<%statement|%prefix|%term>
This means that, despite being in a later hash, %term<food> will be selected in preference to %prefix<foo> because it's the longer token. However, if there is a tie, the earlier hash wins, so %statement<if> hides any %prefix<if> or %term<if>.
In contrast, if you say
[ <%prefix> | <%term> ]
a %prefix<foo> would be selected in preference to a %term<food>. (Which is not what you usually want if your language is to do longest-token consistently.)
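The contrast between a combined longest-token lookup and an ordered alternation of separate hashes can be sketched in Python with plain dictionaries (this is not Perl 6). The helper names longest_token and first_table_match and the sample tables are assumptions for illustration; subrule invocation on the values is not modeled.

```python
def longest_token(text, pos, *tables):
    """Like <%a|%b|%c>: the longest key across all tables wins;
    on a tie in length, the earlier table wins."""
    best = None
    for table in tables:
        for key, value in table.items():
            if text.startswith(key, pos) and (best is None or len(key) > len(best[0])):
                best = (key, value)
    return best

def first_table_match(text, pos, *tables):
    """Like [ <%a> | <%b> ]: the first table with any matching key wins,
    using that table's own longest key."""
    for table in tables:
        found = longest_token(text, pos, table)
        if found is not None:
            return found
    return None

statement = {"if": "statement-if"}
prefix = {"foo": "prefix-foo", "if": "prefix-if"}
term = {"food": "term-food", "if": "term-if"}

print(longest_token("food fight", 0, statement, prefix, term))
print(longest_token("if x", 0, statement, prefix, term))
print(first_table_match("food fight", 0, prefix, term))
```

The strict greater-than comparison is what makes the earlier hash win ties.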
A leading { indicates code that produces a regex to be interpolated into the pattern at that point as a subrule:
/ (<?ident>) <{ %cache{$0} //= get_body($0) }> /
The closure is guaranteed to be run at the canonical time.
As with an ordinary embedded closure, an explicit return from a regex closure binds the result object for this match, ignores the rest of the current regex, and reports success:
/ (\d) <{ return $0.sqrt }> NotReached /;
This has the effect of capturing the square root of the numified string, instead of the string. The NotReached part is not reached.
These closures are invoked as anonymous methods on the Match object. See "Match objects" below for more about result objects.
A leading & interpolates the return value of a subroutine call as a regex. Hence
<&foo()>
is short for
<{ foo() }>
In any case of regex interpolation, if the value is already a Regex object, it is not recompiled. If it is a string, the compiled form is cached with the string so that it is not recompiled next time you use it unless the string changes. (Any external lexical variable names must be rebound each time though.) Subrules may not be interpolated with unbalanced bracketing. An interpolated subrule keeps its own inner $/, so its parentheses never count toward the outer regex's groupings. (In other words, parenthesis numbering is always lexically scoped.)
A leading ?{ or !{ indicates a code assertion:
/ (\d**{1..3}) <?{ $0 < 256 }> /
/ (\d**{1..3}) <!{ $0 < 256 }> /
Similar to:
/ (\d**{1..3}) { $0 < 256 or fail } /
/ (\d**{1..3}) { $0 < 256 and fail } /
Unlike closures, code assertions are not guaranteed to be run at the canonical time if the optimizer can prove something later can't match. So you can sneak in a call to a non-canonical closure that way:
/^foo .* <?{ do { say "Got here!" } or 1 }> .* bar$/
The do block is unlikely to run unless the string ends with "bar".
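A code assertion participates in backtracking: if the predicate rejects what was captured, the engine may retry the preceding quantifier with a shorter match. The Python sketch below (not Perl 6) emulates that for the one-to-three-digit example by trying the longest digit run first and shrinking it; it assumes an ordinary backtracking regex rather than a token or rule, and match_with_assertion is a hypothetical name.

```python
def match_with_assertion(text, pos=0, max_digits=3, ok=lambda n: n < 256):
    """Emulate / (\\d**{1..3}) <?{ $0 < 256 }> / at a fixed position.
    Returns the captured digits, or None if no length satisfies `ok`."""
    run = 0
    while (pos + run < len(text) and run < max_digits
           and text[pos + run] in "0123456789"):
        run += 1
    # Greedy first, then back off one digit at a time, as backtracking would.
    for length in range(run, 0, -1):
        candidate = text[pos:pos + length]
        if ok(int(candidate)):
            return candidate
    return None

print(match_with_assertion("255.0.0.1"))  # 255
print(match_with_assertion("999"))        # 99
print(match_with_assertion("abc"))        # None
```

The second call shows why such assertions are often paired with a following anchor or delimiter: "999" still matches, just with fewer digits.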
A leading ( indicates the start of a result capture:
/ foo <( \d+ )> bar /
is equivalent to:
/ <after foo> \d+ <before bar> /
except that the scan for "foo" can be done in the forward direction, while a lookbehind assertion would presumably scan for \d+ and then match "foo" backwards. The use of <(...)> affects only the meaning of the result object and the positions of the beginning and ending of the match. That is, after the match above, $() contains only the digits matched, and .pos is pointing to after the digits. Other captures (named or numbered) are unaffected and may be accessed through $/.
It is a syntax error to use an unbalanced <( or )>.
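The equivalence between a result capture and a lookbehind/lookahead pair can be illustrated with Python's re module (this is Python, not Perl 6). Python has no direct <(...)> construct, so the second pattern uses an ordinary capturing group and reads the group's span, which is an approximation chosen for this sketch.

```python
import re

text = "foo123bar"

# Lookaround form: only the digits are part of the match itself.
lookaround = re.search(r"(?<=foo)\d+(?=bar)", text)

# Forward-scanning form: match everything, then report only the group.
grouped = re.search(r"foo(\d+)bar", text)

print(lookaround.group(), lookaround.span())
print(grouped.group(1), grouped.span(1))
```

Both report the digits at offsets 3 to 6; the grouped form finds "foo" by scanning forward rather than by looking behind.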
A leading [ or + indicates an enumerated character class. Ranges in enumerated character classes are indicated with "..":
/ <[a..z_]>* /
/ <+[a..z_]>* /
A leading - indicates a complemented character class:
/ <-[a..z_]> <-alpha> /
Character classes can be combined (additively or subtractively) within a single set of angle brackets:
/ <[a..z]-[aeiou]+xdigit> / # consonant or hex digit
If such a combination starts with a named character class, a leading + is required:
/ <+alpha-[Jj]> / # J-less alpha
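Character class combination is ordinary set arithmetic applied left to right. A Python sketch using sets (not Perl 6) makes the two examples above concrete; it assumes ASCII-only definitions of alpha and xdigit, and to_regex_class is a hypothetical helper.

```python
import re
import string

lower = set(string.ascii_lowercase)
vowels = set("aeiou")
xdigit = set(string.hexdigits)
alpha = set(string.ascii_letters)

# <[a..z]-[aeiou]+xdigit>: subtract the vowels, then add the hex digits.
consonant_or_hex = (lower - vowels) | xdigit

# <+alpha-[Jj]>: start from a named class, then subtract.
j_less_alpha = alpha - set("Jj")

def to_regex_class(chars):
    """Build a Python regex character class from a set of characters."""
    return re.compile("[" + "".join(re.escape(c) for c in sorted(chars)) + "]")

print("e" in consonant_or_hex)  # True: a vowel, but also a hex digit
print("i" in consonant_or_hex)  # False: a vowel and not a hex digit
print("J" in j_less_alpha)      # False
```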
A leading ' indicates a literal match (including whitespace):
/ <'match this exactly (whitespace matters)'> /
A leading " indicates a literal match after interpolation:
/ <"match $THIS exactly (whitespace still matters)"> /
The special assertion <.> matches any logical grapheme (including Unicode combining character sequences):
/ seekto = <.> / # Maybe a combined char
Same as:
/ seekto = [:graphs .] /
A leading ! indicates a negated meaning (always a zero-width assertion):
/ <!before _ > / # We aren't before an _
Note that <!alpha> is different from <-alpha> because the latter matches /./ when it is not an alpha.
<<<Ccode: a >> 1>>>
The \p and \P properties become intrinsic grammar rules (<prop ...> and <!prop ...>).
The \L...\E, \U...\E, and \Q...\E sequences are gone. In the rare cases that need them you can use <{ lc $regex }> etc.
The \G sequence is gone. Use :p instead. (Note, however, that it makes no sense to use :p within a pattern, since every internal pattern is implicitly anchored to the current position.) See the at assertion above.
Backreferences (e.g. \1, \2, etc.) are gone; $0, $1, etc. can be used instead, because variables are no longer interpolated.
New backslash sequences, \h and \v, match horizontal and vertical whitespace respectively, including Unicode.
\s now matches any Unicode whitespace character.
The new backslash sequence \N matches anything except a logical newline; it is the negation of \n.
Other new capital backslash sequences are likewise the negation of their lower-case counterparts:
\H matches anything but horizontal whitespace.
\V matches anything but vertical whitespace.
\T matches anything but a tab.
\R matches anything but a return.
\F matches anything but a formfeed.
\E matches anything but an escape.
\X... matches anything but the specified character (specified in hexadecimal).
The Perl 5 qr/pattern/ regex constructor is gone. The Perl 6 equivalents are:
regex { pattern } # always takes {...} as delimiters
rx / pattern / # can take (almost any) chars as delimiters
You may not use whitespace or alphanumerics for delimiters. Space is optional unless needed to distinguish from modifier arguments or function parens. So you may use parens as your rx delimiters, but only if you interpose whitespace:
rx ( pattern ) # okay
rx( 1,2,3 ) # tries to call rx function
(This is true of all quotelike constructs in Perl 6.)
If either form needs modifiers, they go before the opening delimiter:
$regex = regex :g:s:i { my name is (.*) };
$regex = rx:g:s:i / my name is (.*) /; # same thing
Space is necessary after the final modifier if you use any bracketing character for the delimiter. (Otherwise it would be taken as an argument to the modifier.)
$regex = rx :g :s :i / my name is (.*) /;
The name of the constructor was changed from qr because it's no longer an interpolating quote-like operator. rx is short for regex (not to be confused with regular expressions).
As the syntax indicates, it is now more closely analogous to a sub {...} constructor. In fact, that analogy will run very deep in Perl 6.
Just as a raw {...} is now always a closure (which may still execute immediately in certain contexts and be passed as an object in others), so too a raw /.../ is now always a Regex object (which may still match immediately in certain contexts and be passed as an object in others).
Specifically, a /.../ matches immediately in a value context (void, Boolean, string, or numeric), or when it is an explicit argument of a ~~. Otherwise it's a Regex constructor identical to the explicit regex form. So this:
$var = /pattern/;
no longer does the match and sets $var to the result. Instead it assigns a Regex object to $var.
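The two cases can always be distinguished using m{...} or rx{...}:
$var = m{pattern}; # Match regex immediately, assign result
$var = rx{pattern}; # Assign regex expression itself
Note that this means that former magically lazy usages like:
@list = split /pattern/, $str;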
are now just consequences of the normal semantics.
It's now also possible to set up a user-defined subroutine that acts like grep:
sub my_grep($selector, *@list) {
    given $selector {
        when Regex { ... }
        when Code { ... }
        when Hash { ... }
        # etc.
    }
}
Using {...} or /.../ in the scalar context of the first argument causes it to produce a Code or Regex object, which the switch statement then selects upon.
Just as rx has variants, so does the regex declarator. In particular, there are two special variants for use in grammars: token and rule. A token declaration:
token ident { [ <alpha> | _ ] \w+ }
never backtracks by default. That is, it likes to commit to whatever it has scanned so far. The above is equivalent to
regex ident { [ <alpha>: | _: ]: \w+: }
but rather easier to read. The bare *, + and ? quantifiers never backtrack in a token unless some outer regex has specified a :panic option that applies. If you want to prevent even that, use *:, +: or ?: to prevent any backtracking into the quantifier. If you want to explicitly backtrack, append either a ? or a + to the quantifier. The ? forces minimal matching as usual, while the + forces greedy matching. The token declarator is really just short for
regex :ratchet { ... }
The other is the rule declarator, for declaring non-terminal productions in a grammar. Like a token, it also does not backtrack by default. In addition, a rule regex also assumes :sigspace. A rule is really short for:
regex :ratchet :sigspace { ... }
The Perl 5 ?...? syntax (succeed once) was rarely used and can now be emulated more cleanly with a state variable:
$result = do { state $x ||= m/ pattern /; } # only matches first time
To reset the pattern, simply say $x = 0. Though if you want $x visible you'd have to avoid using a block:
$result = state $x ||= m/ pattern /;
...
$x = 0;
Backtracking is greedy by default in rx, m, s, and the like. It's also greedy in ordinary regex declarations. In rule and token declarations, backtracking must be explicit.
To force the preceding atom to do frugal backtracking, append a :? or ? to the atom. If the preceding token is a quantifier, the : may be omitted, so *? works just as in Perl 5.
To force the preceding atom to do greedy backtracking in a spot that would default otherwise, append a :+ or + to the atom. If the preceding token is a quantifier, the : may be omitted. (Perl 5 has no corresponding construct because backtracking always defaults to greedy in Perl 5.)
To force the preceding atom to do no backtracking, use a single : without a subsequent ? or +. Backtracking over a single colon causes the regex engine not to retry the preceding atom:
ms/ \( <expr> [ , <expr> ]*: \) /
(i.e. there's no point trying fewer <expr> matches, if there's no closing parenthesis on the horizon)
To force all the atoms in an expression not to backtrack by default, use :ratchet or rule or token.
Backtracking over a double colon causes the surrounding group of alternations to fail immediately:
ms/ [ if :: <expr> <block>
    | for :: <list> <block>
    | loop :: <loop_controls>? <block>
    ]
/
(i.e. there's no point trying to match a different keyword if one was already found but failed). Note that you can still back into such an alternation, so you may also need to put : after it if you also want to disable that. If an explicit or implicit :ratchet has disabled backtracking by supplying an implicit :, you need to put an explicit :+ after the alternation to enable backing into another alternative if the first pick fails.
Backtracking over a triple colon causes the current regex to fail outright (no matter where in the regex it occurs):
regex ident {
    ( [<alpha>|_] \w* ) ::: { fail if %reserved{$0} }
    | " [<alpha>|_] \w* "
}
ms/ get <ident>? /
(i.e. using an unquoted reserved word as an identifier is not permitted)
Backtracking over a <commit> assertion causes the entire match to fail outright, no matter how many subrules down it happens:
regex subname {
    ([<alpha>|_] \w*) <commit> { fail if %reserved{$0} }
}
ms/ sub <subname>? <block> /
(i.e. using a reserved word as a subroutine name is instantly fatal to the surrounding match as well)
The <cut> assertion always matches successfully, and has the side effect of deleting the parts of the string already matched.
Attempting to backtrack past a <cut> causes the complete match to fail (like backtracking past a <commit>). This is because there's now no preceding text to backtrack into.
The analogy between sub and regex extends much further. Just as subs can be named or anonymous, so can regexes (and tokens, and rules):
token ident { [<alpha>|_] \w* }
# and later...
@ids = grep /<ident>/, @strings;
As the above example indicates, it's possible to refer to named regexes, such as:
regex serial_number { <[A..Z]> \d**{8} }
token type { alpha | beta | production | deprecated | legacy }
in other regexes as named assertions:
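rule identification { [soft|hard]ware <type> <serial_number> }
The null pattern is now illegal.
To match whatever the prior successful regex matched, use:
/<prior>/
To match the zero-width string, use:
/<null>/
For example:
split /<?null>/, $string
splits between characters.
To match a null alternative, use:
/a|b|c|<?null>/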
This makes it easier to catch errors like this:
/a|b|c|/
As a special case, however, the first null alternative in a match like
ms/ [
| if :: <expr> <block>
| for :: <list> <block>
| loop :: <loop_controls>? <block>
]
/
is simply ignored. Only the first alternative is special that way. If you write:
ms/ [
if :: <expr> <block> |
for :: <list> <block> |
loop :: <loop_controls>? <block> |
]
/
it's still an error.
However, it's okay for a non-null syntactic construct to have a degenerate case matching the null string:
$something = "";
/a|b|c|$something/;
A match always returns a Match object, which is also available as $/, which is an environmental lexical declared in the outer subroutine that is calling the regex. (A closure lexically embedded in a regex does not redeclare $/, so $/ always refers to the current match, not any prior submatch done within the closure.)
In boolean context a Match object evaluates as true or false:
if /pattern/ {...}
# or:
/pattern/; if $/ {...}
With :global or :overlap or :exhaustive the boolean is allowed to return true on the first match. The Match object can produce the rest of the results lazily if evaluated in list context.
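In string context it evaluates to the stringified value of its result object, which is usually the entire matched string:
print %hash{ "{$text ~~ /<?ident>/}" };
# or equivalently:
$text ~~ /<?ident>/ && print %hash{~$/};
But generally you should say ~$/ if you mean ~$/.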
In numeric context it evaluates to the numeric value of its result object:
$sum += /\d+/;
# or equivalently:
/\d+/; $sum = $sum + $/;
When used as a scalar, a Match object evaluates to its underlying result object. Usually this is just the entire match string, but you can override that by calling return inside a regex:
my $moose = $(m:{
    <antler> <body>
    { return Moose.new( body => $<body>().attach($<antler>) ) }
    # match succeeds -- ignore the rest of the regex
});
$() is a shorthand for $($/). The result object may be of any type, not just a string.
You may also capture a subset of the match as the result object using the <(...)> construct:
"foo123bar" ~~ / foo <( \d+ )> bar /
say $(); # says 123
In this case the result object is always a string when doing string matching, and a list of one or more elements when doing array matching.
Additionally, the Match object delegates its coerce calls (such as +$match and ~$match) to its underlying result object. The only exception is that Match handles boolean coercion itself, which returns whether the match had succeeded at least once.
This means that these two work the same:
/ <moose> { return $$<moose> as Moose } /
/ <moose> { return $<moose> as Moose } /
When used as an array, a Match object pretends to be an array of all its positional captures. Hence
($key, $val) = ms/ (\S+) => (\S+)/;
can also be written:
$result = ms/ (\S+) => (\S+)/;
($key, $val) = @$result;
To get a single capture into a string, use a subscript:
$mystring = "{ ms/ (\S+) => (\S+)/[0] }";
To get all the captures into a string, use a zen slice:
$mystring = "{ ms/ (\S+) => (\S+)/[] }";
Or cast it into an array:
$mystring = "@( ms/ (\S+) => (\S+)/ )";
Note that, as a scalar variable, $/ doesn't automatically flatten in list context. Use @() as a shorthand for @($/) to flatten the positional captures under list context. Note that a Match object is allowed to evaluate its match lazily in list context. Use **@() to force an eager match.
When used as a hash, a Match object pretends to be a hash of all its named captures. The keys do not include any sigils, so if you capture to variable @<foo> its real name is $/{'foo'} or $/<foo>. However, you may still refer to it as @<foo> anywhere $/ is visible. (But it is erroneous to use the same name for two different capture datatypes.) Note that, as a scalar variable, $/ doesn't automatically flatten in list context. Use %() as a shorthand for %($/) to flatten as a hash, or bind it to a variable of the appropriate type. As with @(), it's possible for %() to produce its pairs lazily in list context.
The numbered captures may be treated as named, so $<0 1 2> is equivalent to $/[0,1,2]. This allows you to write slices of intermixed named and numbered captures.
The numeric variables $0, $1, etc. are just aliases into $/[0], $/[1], etc. Hence they will all be undefined if the last match failed (unless they were explicitly bound in a closure without using the let keyword).
Match objects have methods that provide additional information about the match. For example:
if m/ def <ident> <codeblock> / {
    say "Found sub def from index $/.from() to index $/.to()";
}
Warning: these methods usually return values of type StrPos, which you should not treat as integers. The interpolation of these values in the example above is slightly naughty, and likely to print out the positions not as numbers but as "Graphs(42)" or some such.
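The way a single Match object answers differently in boolean, string, numeric, array, and hash contexts can be modeled with a small Python class. This is a toy illustration in Python, not Perl 6 and not any real implementation; the class layout and field names are assumptions, and laziness, StrPos values, and nested matches are left out.

```python
class Match:
    """Toy stand-in for a match object; field names are illustrative."""

    def __init__(self, succeeded, result="", positional=(), named=None):
        self.succeeded = succeeded          # boolean success value
        self.result = result                # scalar result object
        self.positional = list(positional)  # ordered submatches
        self.named = dict(named or {})      # named submatches

    def __bool__(self):          # boolean context is handled by the match itself
        return self.succeeded

    def __str__(self):           # string context delegates to the result object
        return str(self.result)

    def __int__(self):           # numeric context delegates to the result object
        return int(self.result)

    def __getitem__(self, key):  # integer index -> positional, string key -> named
        if isinstance(key, int):
            return self.positional[key]
        return self.named[key]

    def __iter__(self):          # list context yields the positional captures
        return iter(self.positional)

m = Match(True, result="42", positional=["4", "2"], named={"ident": "x"})
print(bool(m), str(m), int(m) + 1, m[0], m["ident"], list(m))
```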
Match. That is: $match_obj = $str ~~ /pattern/;
say "Matched" if $match_obj;$/ variable, unless the match statement is inside another regex. That is: $str ~~ /pattern/;
say "Matched" if $/;$/ variable holds the current regex's incomplete Match object (which can be modified via the internal $/. For example: $str ~~ / foo # Match 'foo'
{ $/ = 'bar' } # But pretend we matched 'bar'
/;
say $/; # says 'bar'This is slightly dangerous, insofar as you might return something that does not behave like a Match object to some context that requires one. Fortunately, you normally just want to return a result object instead:
$str ~~ / foo                 # Match 'foo'
          { return 'bar' }    # But pretend we matched 'bar'
        /;
say $();                      # says 'bar'
#                  subpattern
#    __________________/\___________________
#   |                                       |
#   |        subpattern subpattern          |
#   |          __/\__    __/\__             |
#   |         |      |  |      |            |
ms/ (I am the (walrus), ( khoo )**{2} kachoo) /;
Match object if it is successfully matched.
Match object is pushed onto the array inside the outer Match object belonging to the surrounding scope (known as its parent Match object). The surrounding scope may be either the innermost surrounding subpattern (if the subpattern is nested) or else the entire regex itself.
#                   subpat-A
#    __________________/\___________________
#   |                                       |
#   |         subpat-B  subpat-C            |
#   |          __/\__    __/\__             |
#   |         |      |  |      |            |
ms/ (I am the (walrus), ( khoo )**{2} kachoo) /;
then the Match objects representing the matches made by subpat-B and subpat-C would be successively pushed onto the array inside subpat-A's Match object. Then subpat-A's Match object would itself be pushed onto the array inside the Match object for the entire regex (i.e. onto $/'s array).
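The parent/child arrangement described above can be modelled in Python, whose re module numbers groups flatly as Perl 5 does. This is a sketch under stated assumptions, not the Perl 6 mechanism itself: the helper nest_captures and its dictionary layout are hypothetical, nesting is inferred from span containment (a heuristic that can misplace empty matches), groups that did not participate are skipped, and the pattern is a simplified variant of the walrus example with the quantifier dropped.

```python
import re


def nest_captures(match):
    """Arrange a flat, Perl 5-style group list into a tree of captures.

    Heuristic: a group is nested inside the nearest earlier group whose
    span contains it. Groups that did not participate are skipped.
    """
    root = {"text": match.group(0), "span": match.span(0), "subs": []}
    stack = [root]
    for i in range(1, match.re.groups + 1):
        start, end = match.span(i)
        if start == -1:  # this group took no part in the match
            continue
        node = {"text": match.group(i), "span": (start, end), "subs": []}
        # Pop until the top of the stack encloses this group's span.
        while len(stack) > 1:
            top_start, top_end = stack[-1]["span"]
            if top_start <= start and end <= top_end:
                break
            stack.pop()
        stack[-1]["subs"].append(node)  # push onto the parent's array
        stack.append(node)
    return root


pattern = re.compile(r"(I am the (walrus), (khoo) kachoo)")
tree = nest_captures(pattern.match("I am the walrus, khoo kachoo"))

subpat_a = tree["subs"][0]               # plays the role of $/[0]
print(subpat_a["subs"][0]["text"])       # plays the role of $/[0][0]
print(subpat_a["subs"][1]["text"])       # plays the role of $/[0][1]
```

Only the outermost group appears in the root's array; the two inner groups are reachable solely through their parent, which mirrors the pushing order described above.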
Match object are referred to using either the standard array access notation (e.g. $/[0], $/[1], $/[2], etc.) or else via the corresponding lexically scoped numeric aliases (i.e. $0, $1, $2, etc.). So:
say "$/[1] was found between $/[0] and $/[2]";
is the same as:
say "$1 was found between $0 and $2";
$/.
Match object (i.e. $/) store individual Match objects representing the substrings that were matched and captured by the first, second, third, etc. outermost (i.e. unnested) subpatterns. So these elements can be treated like fully fledged match results. For example:
if m/ (\d\d\d\d)-(\d\d)-(\d\d) (BCE?|AD|CE)?/ {
    ($yr, $mon, $day) = $/[0..2];
    $era = "$3" if $3;                    # stringify/boolify
    @datepos = ( $0.from() .. $2.to() );  # Call Match methods
}
Match surrounding subpattern, not to the array of $/.
# Perl 5...
#
#  $1--------------------- $4---------- $5-------------------
#  |   $2--------------- | |          | | $6----- $7------- |
#  |   |         $3--- | | |          | | |     | |       | |
#  |   |         |   | | | |          | | |     | |       | |
m/ ( A (guy|gal|g(\S+) ) ) (sees|calls) ( (the|a) (gal|guy) ) /x;
# Perl 6...
#
#  $0--------------------- $1---------- $2-------------------
#  |   $0[0]------------ | |          | | $2[0]-- $2[1]---- |
#  |   |        $0[0][0] | |          | | |     | |       | |
#  |   |         |   | | | |          | | |     | |       | |
m/ ( A (guy|gal|g(\S+) ) ) (sees|calls) ( (the|a) (gal|guy) ) /;
Match object. Instead, it produces a list of Match objects corresponding to the sequence of individual matches made by the repeated subpattern.
Match objects, the corresponding array element for the quantified capture will store a (nested) array rather than a single Match object. For example:
if m/ (\w+) \: (\w+ \s+)* / {
say "Key: $0"; # Unquantified --> single Match
say "Values: { @{$1} }"; # Quantified --> array of Match
}
#        non-capturing      quantifier
#   __________/\____________  __/\__
#  |                        ||      |
#  |  $0         $1         ||      |
#  |  _^_      ___^___      ||      |
#  | |   |    |       |     ||      |
m/ [ (\w+) \: (\w+ \h*)* \n ]**{2..*} /
Non-capturing brackets don't create a separate nested lexical scope, so the two subpatterns inside them are actually still in the regex's top-level scope. Hence their top-level designations: $0 and $1.
$0 and $1 will each contain an array. The elements of that array will be the submatches returned by the corresponding subpattern on each iteration of the non-capturing brackets. For example:
my $text = "foo:food fool\nbar:bard barb";
#             $0---    $1-------
#             |   |    |       |
$text ~~ m/ [ (\w+) \: (\w+ \h*)* \n ]**{2..*} /;
# Because they're in a quantified non-capturing block...
# $0 contains the equivalent of:
#
# [ Match.new(str=>'foo'), Match.new(str=>'bar') ]
#
# and $1 contains the equivalent of:
#
# [ Match.new(str=>'food '),
# Match.new(str=>'fool' ),
# Match.new(str=>'bard '),
# Match.new(str=>'barb' ),
# ]
Match objects representing the captures of the inner parens for every iteration (as described above). That is:
my $text = "foo:food fool\nbar:bard barb";
#           $0------------------------
#           |                        |
#           | $0[0]    $0[1]----     |
#           | |   |    |       |     |
$text ~~ m/ ( (\w+) \: (\w+ \h*)* \n )**{2..*} /;
# Because it's in a quantified capturing block,
# $0 contains the equivalent of:
#
# [ Match.new( str=>"foo:food fool\n",
# arr=>[ Match.new(str=>'foo'),
# [
# Match.new(str=>'food '),
# Match.new(str=>'fool'),
# ]
# ],
# ),
# Match.new( str=>'bar:bard barb',
# arr=>[ Match.new(str=>'bar'),
# [
# Match.new(str=>'bard '),
# Match.new(str=>'barb'),
# ]
# ],
# ),
# ]
#
# and there is no $1
|. Hence:
#                $0     $1    $2   $3       $4         $5
$tune_up = rx/ (don't) (ray) (me) (for) (solar tea), (d'oh!)
#                $0      $1     $2      $3       $4
             | (every) (green) (BEM) (devours) (faces)
             /;
This means that if the second alternation matches, the @$/ array will contain ('every', 'green', 'BEM', 'devours', 'faces') rather than (undef, undef, undef, undef, undef, undef, 'every', 'green', 'BEM', 'devours', 'faces') (as the same regex would in Perl 5).
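The Perl 5 behaviour mentioned above can be observed in any engine that numbers groups across the whole pattern. Python's re module is one such engine, so the following sketch uses it as a stand-in. The filtering step is only a crude emulation of per-branch numbering, assumed for illustration: it would also drop an optional group that legitimately failed to participate inside the matching branch.

```python
import re

# Python's re numbers groups across the whole pattern, as Perl 5 does.
TUNE_UP = re.compile(
    r"(don't) (ray) (me) (for) (solar tea), (d'oh!)"
    r"|(every) (green) (BEM) (devours) (faces)"
)

m = TUNE_UP.match("every green BEM devours faces")

# Flat numbering: six empty slots for the branch that did not match.
perl5_style = m.groups()

# Crude emulation of numbering that restarts in each branch:
# keep only the groups that took part in the match.
perl6_style = [g for g in perl5_style if g is not None]

print(perl5_style)
print(perl6_style)
```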
<regex> within a pattern is known as a subrule, whether that regex is actually defined as a regex or token or rule or even an ordinary method or multi. # subrule subrule subrule
# __^__ _______^______ __^__
# | | | | | |
m/ <ident> $<spaces>:=(\s*) <digit>+ /Match object. But, unlike subpatterns, that Match object is not assigned to the array inside its parent Match object. Instead, it is assigned to an entry of the hash inside its parent Match object. For example: # .... $/ .....................................
# : :
# : .... $/[0] .................. :
# : : : :
# : $/<ident> : $/[0]<ident> : :
# : __^__ : __^__ : :
# : | | : | | : :
ms/ <ident> \: ( known as <ident> previously ) /
Match object can be referred to using any of the standard hash access notations ($/{'foo'}, $/<bar>, $/«baz», etc.), or else via corresponding lexically scoped aliases ($<foo>, $«bar», $<baz>, etc.). So the previous example also implies:
#  $<ident>              $0<ident>
#    __^__                 __^__
#   |     |               |     |
ms/ <ident> \: ( known as <ident> previously ) /
<ident>) or aliased ($<ident> := (<alpha>\w*)). The name's the thing.
Match objects rather than a single Match object.
Match objects to this array. For example:
if ms/ mv <file> <file> / {
$from = $<file>[0];
$to = $<file>[1];
}
Likewise, with a quantified subrule:
if ms/ mv <file>**{2} / {
    $from = $<file>[0];
    $to   = $<file>[1];
}
Likewise, with a mixture of both:
if ms/ mv <file>+ <file> / {
    $to   = pop @{$<file>};
    @from = @{$<file>};
}
if ms/ mv <file> $<dir>:=<file> / {
    $from = $<file>;  # Only one subrule named <file>, so scalar
    $to   = $<dir>;   # The Capture Formerly Known As <file>
}
Likewise, neither of the following constructions causes <file> to produce an array of Match objects, since neither of them has two or more <file> subrules in the same lexical scope:
if ms/ (keep) <file> | (toss) <file> / {
# Each <file> is in a separate alternation, therefore <file>
# is not repeated in any one scope, hence $<file> is
# not an Array object...
$action = $0;
$target = $<file>;
}
if ms/ <file> \: (<file>|none) / {
# Second <file> nested in subpattern which confers a
# different scope...
$actual = $/<file>;
$virtual = $/[0]<file> if $/[0]<file>;
}
Match object). So:
if ms/ <file> \: [<file>|none] / {    # Two <file>s in same scope
    $actual  = $/<file>[0];
    $virtual = $/<file>[1] if $/<file>[1];
}
Aliases can be named or numbered. They can be scalar-, array-, or hash-like. And they can be applied to either capturing or non-capturing constructs. The following sections highlight special features of the semantics of some of those combinations.
# ______/capturing parens\_____
# | |
# | |
ms/ $<key>:=( (<[A..E]>) (\d**{3..6}) (X?) ) /;then the outer capturing parens no longer capture into the array of $/ (like unaliased parens would). Instead the aliased parens capture into the hash of $/; specifically into the hash element whose key is the alias name.
$<key> (i.e. $/<key>), but not $0 (i.e. not $/[0]).$/<key> will contain the Match object that would previously have been placed in $/[0].$/<key>[0] will contain the A-E letter,$/<key>[1] will contain the digits,$/<key>[2] will contain the optional X. # ___/non-capturing brackets\__
# | |
# | |
ms/ $<key>:=[ (<[A..E]>) (\d**{3..6}) (X?) ] /;then the corresponding $/<key> object contains only the string matched by the non-capturing brackets.
$/<key> entry is empty. That's because square brackets do not create a nested lexical scope, so the subpatterns are unnested and hence correspond to $0, $1, and $2, and not to $/<key>[0], $/<key>[1], and $/<key>[2].
$/<key> will contain the complete substring matched by the square brackets (in a Match object, as described above),
$0 will contain the A-E letter,
$1 will contain the digits,
$2 will contain the optional X.
Match object to the hash entry whose key is the name of the alias. And it no longer assigns anything to the hash entry whose key is the subrule name. That is:
if m/ ID\: $<id>:=<ident> / {
    say "Identified as $/<id>";    # $/<ident> is undefined
}
Match object. This is particularly useful for differentiating two or more calls to the same subrule in the same scope. For example:
if ms/ mv <file>+ $<dir>:=<file> / {
    @from = @{$<file>};
    $to   = $<dir>;
}
m/ $1:=(<-[:]>*) \: $0:=<ident> /
the behavior is exactly the same as for a named alias (i.e. the various cases described above), except that the resulting Match object is assigned to the corresponding element of the appropriate array rather than to an element of the hash.
# ---$1--- -$2- ---$6--- -$7-
# | | | | | | | |
m/ $1:=(food) (bard) $6:=(bazd) (quxd) /; $tune_up = rx/ (don't) (ray) (me) (for) (solar tea), (d'oh!)
| $6:=(every) (green) (BEM) (devours) (faces)
# $7 $8 $9 $10
/; # Perl 5...
# $1
# _____________/\______________
# | $2 $3 $4 |
# | __/\___ ____/\____ /\ |
# | | | | | | | |
m/ ( (<[A..E]>) (\d**{3..6}) (X?) ) /;
# Perl 6...
# $0
# _____________/\______________
# | $0[0] $0[1] $0[2] |
# | __/\___ ____/\____ /\ |
# | | | | | | | |
m/ ( (<[A..E]>) (\d**{3..6}) (X?) ) /;
# Perl 6 simulating Perl 5...
# $1
# _______________/\________________
# | $2 $3 $4 |
# | __/\___ ____/\____ /\ |
# | | | | | | | |
m/ $1:=[ (<[A..E]>) (\d**{3..6}) (X?) ] /;The non-capturing brackets don't introduce a scope, so the subpatterns within them are at regex scope, and hence numbered at the top level. Aliasing the square brackets to $1 means that the next subpattern at the same level (i.e. the (<[A..E]>)) is numbered sequentially (i.e. $2), etc.
Match objects (as described in "Quantified subpattern captures" and "Repeated captures of the same subrule"). So the corresponding array element or hash entry for the alias will contain an array, instead of a single Match object. if m/ mv $0:=<file>+ / {
# <file>+ returns a list of Match objects,
# so $0 contains an array of Match objects,
# one for each successful call to <file>
# $/<file> does not exist (it's pre-empted by the alias)
}
if m/ mv $<from>:=(\S+ \s+)* / {
# Quantified subpattern returns a list of Match objects,
# so $/<from> contains an array of Match
# objects, one for each successful match of the subpattern
# $0 does not exist (it's pre-empted by the alias)
}Match object which contains only the complete substring that was matched by the full set of repetitions of the brackets (as described in "Named scalar aliases applied to non-capturing brackets"). For example: "coffee fifo fumble" ~~ m/ $<effs>:=[f <-[f]>**{1..2} \s*]+ /;
say $<effs>; # prints "fee fifo fum"m/ mv @<from>:=[(\S+) \s+]* <dir> /;
@<alias>:= notation instead of a $<alias>:= mandates that the corresponding hash entry or array element always receives an array of Match objects, even if the construct being aliased would normally return a single Match object. This is useful for creating consistent capture semantics across structurally different alternations (by enforcing array captures in all branches): ms/ Mr?s? @<names>:=<ident> W\. @<names>:=<ident>
| Mr?s? @<names>:=<ident>
/;
# Aliasing to @<names> means $/<names> is always
# an Array object, so...
say @{$/<names>};@<key> can also be used outside a regex, as a shorthand for @{ $/<key> }. That is: ms/ Mr?s? @<names>:=<ident> W\. @<names>:=<ident>
| Mr?s? @<names>:=<ident>
/;
say @<names>; m/ mv $<files>:=[ f.. \s* ]* /; # $/<files> assigned a single
# Match object containing the
# complete substring matched by
# the full set of repetitions
# of the non-capturing brackets
m/ mv @<files>:=[ f.. \s* ]* /; # $/<files> assigned an array,
# each element of which is a
# C<Match> object containing
# the substring matched by Nth
# repetition of the non-
# capturing bracket matchMatch object returned by one repetition of the subpattern. That is, an array alias on a subpattern flattens and collects all nested subpattern captures within the aliased subpattern. For example: if ms/ $<pairs>:=( (\w+) \: (\N+) )+ / {
# Scalar alias, so $/<pairs> is assigned an array
# of Match objects, each of which has its own array
# of two subcaptures...
for @{$<pairs>} -> $pair {
say "Key: $pair[0]";
say "Val: $pair[1]";
}
}
if ms/ @<pairs>:=( (\w+) \: (\N+) )+ / {
# Array alias, so $/<pairs> is assigned an array
# of Match objects, each of which is flattened out of
# the two subcaptures within the subpattern
for @{$<pairs>} -> $key, $val {
say "Key: $key";
say "Val: $val";
}
}Match object returned by each repetition of the subrule, all flattened into a single array: rule pair { (\w+) \: (\N+) \n }
if ms/ $<pairs>:=<pair>+ / {
# Scalar alias, so $/<pairs> contains an array of
# Match objects, each of which is the result of the
# <pair> subrule call...
for @{$<pairs>} -> $pair {
say "Key: $pair[0]";
say "Val: $pair[1]";
}
}
if ms/ mv @<pairs>:=<pair>+ / {
# Array alias, so $/<pairs> contains an array of
# Match objects, all flattened down from the
# nested arrays inside the Match objects returned
# by each match of the <pair> subrule...
for @{$<pairs>} -> $key, $val {
say "Key: $key";
say "Val: $val";
}
}Match objects is assigned into the appropriate element of the regex's match array rather than to a key of its match hash. For example: if m/ mv \s+ @0:=((\w+) \s+)+ $1:=((\W+) (\s*))* / {
# | |
# | |
# | \_ Scalar alias, so $1 gets an
# | array, with each element
# | a Match object containing
# | the two nested captures
# |
# \___ Array alias, so $0 gets a flattened array of
# just the (\w+) captures from each repetition
@from = @{$0}; # Flattened list
$to_str = $1[0][0]; # Nested elems of
$to_gap = $1[0][1]; # unflattened list
}
@0 is simply a shorthand for @{$0}, so the first assignment above could also have been written:
@from = @0;
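The difference between the nested and the flattened result shapes described above can be pictured with ordinary lists. The following Python sketch is an analogy only, under the assumption that each repetition yields exactly two subcaptures; the sample text is invented, and re.findall merely stands in for a quantified subpattern.

```python
import re
from itertools import chain

# Hypothetical input: one "key:value" pair per line.
text = "host:example\nport:8080\n"

# Analogue of the scalar alias: one entry per repetition, each entry
# keeping its own two subcaptures.
pairs = re.findall(r"(\w+):([^\n]+)", text)

# Analogue of the array alias: the subcaptures of every repetition
# flattened into a single list.
flat = list(chain.from_iterable(pairs))

# Nested shape: index into each entry.
for pair in pairs:
    print("Key:", pair[0], "Val:", pair[1])

# Flattened shape: consume the list two items at a time.
for key, val in zip(flat[0::2], flat[1::2]):
    print("Key:", key, "Val:", val)
```

Both loops print the same pairs; only the shape of the collection they walk differs.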
m/ mv %<location>:=( (<ident>) \: (\N+) )+ /;
Match object to be assigned a (nested) Hash object (rather than an Array object or a single Match object).Match object is stored: rule one_to_many { (\w+) \: (\S+) (\S+) (\S+) }
if ms/ %0:=<one_to_many>+ / {
# $/[0] contains a hash, in which each key is provided by
# the first subcapture within C<one_to_many>, and each
# value is an array containing the
# subrule's second, third, and fourth, etc. subcaptures...
for %{$/[0]} -> $pair {
say "One: $pair.key";
say "Many: { @{$pair.value} }";
}
}%0 is a shortcut for %{$0}: for %0 -> $pair {
say "One: $pair.key";
say "Many: { @{$pair.value} }";
}m/ mv @<files>:=<ident>+ $<dir>:=<ident> /
the name of an ordinary variable can be used as an external alias, like so:
m/ mv @files:=<ident>+ $dir:=<ident> /
:x or :g flag) or overlaps (specified via the :ov or :ex flag), it will usually produce a series of distinct matches.
Match object in $/. However, this object may represent a partial evaluation of the regex. Moreover, the values of this match object are slightly different from those provided by a non-repeated match:
$/ after such matches is true or false, depending on whether the pattern matched.
@(), the multidimensionality is ignored and all the matches are returned flattened (but still lazily). If you refer to @@(), you can get each individual sublist as a Capture object. (That is, there is a @@() coercion operator that happens, like @(), to default to $/.) As with any multidimensional list, each sublist can be lazy separately.
For example:
if $text ~~ ms:g/ (\S+:) <rocks> / {
    say "Full match context is: [$/]";
}
But the list of individual match objects corresponding to each separate match is also available:
if $text ~~ ms:g/ (\S+:) <rocks> / {
say "Matched { +@@() } times"; # Note: forced eager here
for @@() -> $m {
say "Match between $m.from() and $m.to()";
say 'Right on, dude!' if $m[0] eq 'Perl';
say "Rocks like $m<rocks>";
}
}
:keepall is in effect anywhere in the outer dynamic scope. In this case everything inside the angles is used as part of the key. Suppose the earlier example parsed whitespace:
/ <key> <?ws> <'=>'> <?ws> <value> { %hash{$<key>} = $<value> } /
The two instances of <?ws> above would store an array of two values accessible as @<?ws>. It would also store the literal match into $<'=\>'>. Just to make sure nothing is forgotten, under :keepall any text or whitespace not otherwise remembered is attached as an extra property on the subsequent node. (The name of that property is "pretext".)
ident rule shouldn't clobber someone else's ident rule. So some mechanism is needed to confine rules to a namespace.
class Identity {
    method name { "Name = $.name" }
    method age { "Age = $.age" }
    method addr { "Addr = $.addr" }
    method desc {
        print &.name(), "\n",
              &.age(), "\n",
              &.addr(), "\n";
    }
    # etc.
}
so too a grammar can collect a set of named rules together:
grammar Identity {
    rule name { Name = (\N+) }
    rule age { Age = (\d+) }
    rule addr { Addr = (\N+) }
    rule desc {
        <name> \n
        <age> \n
        <addr> \n
    }
    # etc.
}
grammar Letter {
    rule text { <greet> <body> <close> }
    rule greet { [Hi|Hey|Yo] $<to>:=(\S+?) , $$}
    rule body { <line>+? } # note: backtracks forwards via +?
    rule close { Later dude, $<from>:=(.+) }
    # etc.
}
grammar FormalLetter is Letter {
    rule greet { Dear $<to>:=(\S+?) , $$}
    rule close { Yours sincerely, $<from>:=(.+) }
}
body, line, etc.
grammar Perl { # Perl's own grammar
    rule prog { <statement>* }
    rule statement {
        | <decl>
        | <loop>
        | <label> [<cond>|<sideff>|;]
    }
    rule decl { <sub> | <class> | <use> }
    # etc. etc. etc.
}
given $source_code {
    $parsetree = m:keepall/<Perl.prog>/;
}
For writing your own backslash and assertion subrules or macros, you may use the following syntactic categories:
token rule_backslash:<w> { ... }     # define your own \w and \W
token rule_assertion:<*> { ... }     # define your own <*stuff>
macro rule_metachar:<,> { ... }      # define a new metacharacter
macro rule_mod_internal:<x> { ... }  # define your own /:x() stuff/
macro rule_mod_external:<x> { ... }  # define your own m:x()/stuff/
As with any such syntactic shenanigans, the declaration must be visible in the lexical scope to have any effect. It's possible the internal/external distinction is just a trait, and that some of those things are subs or methods rather than subrules or macros. (The numeric regex modifiers are recognized by fallback macros defined with an empty operator name.)
Various pragmas may be used to control various aspects of regex compilation and usage not otherwise provided for. These are tied to the particular declarator in question:
use s :foo;      # control s defaults
use m :foo;      # control m defaults
use rx :foo;     # control rx defaults
use regex :foo;  # control regex defaults
use token :foo;  # control token defaults
use rule :foo;   # control rule defaults
(It is a general policy in Perl 6 that any pragma designed to influence the surface behavior of a keyword is identical to the keyword itself, unless there is good reason to do otherwise. On the other hand, pragmas designed to influence deep semantics should not be named identically, though of course some similarity is good.)
The tr/// quote-like operator now also has a method form called trans(). Its argument is a list of pairs. You can use anything that produces a pair list:
$str.trans( %mapping.pairs.sort );
Use the .= form to do a translation in place:
$str.=trans( %mapping.pairs.sort );
tr/// would:
$str.=trans( 'A..C' => 'a..c', 'XYZ' => 'xyz' );
As a degenerate case, each side can be individual characters:
$str.=trans( 'A'=>'a', 'B'=>'b', 'C'=>'c' );
$str.=trans( ['A'..'C'] => ['a'..'c'], <X Y Z> => <x y z> );
$str.=trans( [' ',      '<',    '>',    '&'    ] =>
             ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);
In the case that more than one sequence of input characters matches, the longest one wins. In the case of two identical sequences the first in order wins.
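The two tie-breaking rules above (longest sequence wins, then first in order) can be sketched in Python. This is a minimal model under stated assumptions rather than the trans() implementation: the function name trans and its argument shape are borrowed for illustration, range syntax such as 'A..C' is not handled, and empty input sequences are ignored.

```python
import re


def trans(text, pairs):
    """Replace each input sequence in text with its paired output.

    The longest matching input sequence wins; among identical input
    sequences, the first one listed wins.
    """
    mapping = {}
    for src, dst in pairs:
        if src:  # ignore empty input sequences
            mapping.setdefault(src, dst)  # first of identical keys wins
    if not mapping:
        return text
    # Longest keys first, so the alternation prefers longer sequences.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Replacement text is not rescanned, so '&' inside '&lt;' is left alone.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


escaped = trans("a < b & c", [("<", "&lt;"), (">", "&gt;"), ("&", "&amp;")])
longest = trans("abc", [("a", "1"), ("ab", "2")])
first = trans("aa", [("a", "x"), ("a", "y")])

print(escaped)
print(longest)
print(first)
```

Sorting the keys by length before building the alternation is what gives the longest sequence priority; a plain left-to-right alternation would otherwise stop at the first listed alternative that matches.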
There are also method forms of m// and s///:
$str.match(//);
$str.subst(//, "replacement")
$str.subst(//, {"replacement"})
$str.=subst(//, "replacement")
$str.=subst(//, {"replacement"})
<at($pos)> assertion to say that the current position is the same as the position object you supply. You may set the current match position via the :c and :p modifiers. However, please remember that in Perl 6 string positions are generally not integers, but objects that point to a particular place in the string regardless of whether you count by bytes or codepoints or graphemes. If used with an integer, the at assertion will assume you mean the current lexically scoped Unicode level, on the assumption that this integer was somehow generated in this same lexical scope. If this is outside the current string's allowed abstraction levels, an exception is thrown. See S02 for more discussion of string positions.
Buf types are based on fixed-width cells and can therefore handle integer positions just fine, and treat them as array indices. In particular, buf8 AKA buf is just an old-school byte string. Matches against Buf types are restricted to ASCII semantics in the absence of an explicit modifier asking for the array's values to be treated as some particular encoding such as UTF-32. (This is also true for those compact arrays that are considered isomorphic to Buf types.) Positions within Buf types are always integers, counting one per unit cell of the underlying array. Be aware that "from" and "to" positions are reported as being between elements. If matching against a compact array @foo, a final position of 42 indicates that @foo[42] was the first element not included.
my $stream is from($fh);       # tie scalar to filehandle
# and later...
$stream ~~ m/pattern/;         # match from stream
@array ~~ / foo <,> bar <elem>* /;
The special <,> subrule matches the boundary between elements. The <elem> assertion matches any individual array element. It is the equivalent of "dot" for the whole element.
If the array elements are strings, they are concatenated virtually into a single logical string. If the array elements are tokens or other such objects, the objects must provide appropriate methods for the kinds of subrules to match against. It is an assertion error to match a string-matching assertion against an object that doesn't provide a string view. However, pure object lists can be parsed as long as the match (including any subrules) restricts itself to assertions like:
<.isa(Dog)>
<.does(Bark)>
<.can('scratch')>
It is permissible to mix objects and strings in an array as long as they're in different elements. You may not embed objects in strings, however. Any object may, of course, pretend to be a string element if it likes.
Please be aware that the warnings on .from and .to returning opaque objects go double for matching against an array, where a particular position reflects both a position within the array and (potentially) a position within a string of that array. Do not expect to do math with such values. Nor should you expect to be able to extract a substr that crosses element boundaries.
@array».match($regex)