![]() |
Home | Libraries | People | FAQ | More |
One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++ code and data from within the regex. This enables programming idioms that are not possible with other regular expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you to build grammars out of regular expressions. This section describes how to embed one regex in another by value and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results after a successful parse.
The
object has value semantics. When a regex object appears on the right-hand
side in the definition of another regex, it is as if the regex were embedded
by value; that is, a copy of the nested regex is stored by the enclosing
regex. The inner regex is invoked by the outer regex during pattern matching.
The inner regex participates fully in the match, back-tracking as needed
to make the match succeed.
basic_regex<>
Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with xpressive as follows:
find_dialog dlg; if( dialog_ok == dlg.do_modal() ) { std::string pattern = dlg.get_text(); // the pattern the user entered bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option? sregex re = sregex::compile( pattern ); // try to compile the pattern if( whole_word ) { // wrap the regex in begin-word / end-word assertions re = bow >> re >> eow; } // ... use re ... }
Look closely at this line:
// wrap the regex in begin-word / end-word assertions re = bow >> re >> eow;
This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions.
![]() |
Note |
---|---|
Note that |
If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex by value is not enough. You need to be able to make your regular expressions self-referential. Most regular expression engines don't give you that power, but xpressive does.
![]() |
Tip |
---|---|
The theoretical computer scientists out there will correctly point out that a self-referential regular expression is not "regular", so in the strict sense, xpressive isn't really a regular expression engine at all. But as Larry Wall once said, "the term [regular expression] has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here." |
Consider the following code, which uses the by_ref()
helper to define a recursive regular expression
that matches balanced, nested parentheses:
sregex parentheses; parentheses // A balanced set of parentheses ... = '(' // is an opening parenthesis ... >> // followed by ... *( // zero or more ... keep( +~(set='(',')') ) // of a bunch of things that are not parentheses ... | // or ... by_ref(parentheses) // a balanced set of parentheses ) // (ooh, recursion!) ... >> // followed by ... ')' // a closing parenthesis ;
Matching balanced, nested tags is an important text processing task, and
it is one that "classic" regular expressions cannot do. The by_ref()
helper makes it possible. It allows one regex object to be embedded in another
by reference. Since the right-hand side holds parentheses
by reference, assigning the
right-hand side back to parentheses
creates a cycle, which will execute recursively.
Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have a look at the text-book grammar example: the humble calculator.
sregex group, factor, term, expression; group = '(' >> by_ref(expression) >> ')'; factor = +_d | group; term = factor >> *(('*' >> factor) | ('/' >> factor)); expression = term >> *(('+' >> term) | ('-' >> term));
The regex expression
defined
above does something rather remarkable for a regular expression: it matches
mathematical expressions. For example, if the input string were "foo 9*(10+3) bar"
, this pattern
would match "9*(10+3)"
.
It only matches well-formed mathematical expressions, where the parentheses
are balanced and the infix operators have two arguments each. Don't try this
with just any regular expression engine!
Let's take a closer look at this regular expression grammar. Notice that
it is cyclic: expression
is implemented in terms of term
,
which is implemented in terms of factor
,
which is implemented in terms of group
,
which is implemented in terms of expression
,
closing the loop. In general, the way to define a cyclic grammar is to forward-declare
the regex objects and embed by reference those regular expressions that have
not yet been initialized. In the above grammar, there is only one place where
we need to reference a regex object that has not yet been initialized: the
definition of group
. In that
place, we use by_ref()
to embed expression
by reference.
In all other places, it is sufficient to embed the other regex objects by
value, since they have already been initialized and their values will not
change.
![]() |
Tip |
---|---|
Embed by value if possible |
Using
,
you can also build grammars out of dynamic regular expressions. You do that
by creating named regexes, and referring to other regexes by name. Each
regex_compiler<>
instance keeps a mapping from names to regexes that have been created with
it.
regex_compiler<>
You can create a named dynamic regex by prefacing your regex with "(?$name=)"
, where name
is the name of the regex. You can refer to a named regex from another regex
with "(?$name)"
. The
named regex does not need to exist yet at the time it is referenced in another
regex, but it must exist by the time you use the regex.
Below is a code fragment that uses dynamic regex grammars to implement the calculator example from above.
using namespace boost::xpressive; using namespace regex_constants; sregex expr; { sregex_compiler compiler; syntax_option_type x = ignore_white_space; compiler.compile("(? $group = ) \\( (? $expr ) \\) ", x); compiler.compile("(? $factor = ) \\d+ | (? $group ) ", x); compiler.compile("(? $term = ) (? $factor )" " ( \\* (? $factor ) | / (? $factor ) )* ", x); expr = compiler.compile("(? $expr = ) (? $term )" " ( \\+ (? $term ) | - (? $term ) )* ", x); } std::string str("foo 9*(10+3) bar"); smatch what; if(regex_search(str, what, expr)) { // This prints "9*(10+3)": std::cout << what[0] << std::endl; }
As with static regex grammars, nested regex invocations create nested match results (see Nested Results below). The result is a complete parse tree for string that matched. Unlike static regexes, dynamic regexes are always embedded by reference, not by value.
The calculator examples above raises a number of very complicated memory-management issues. Each of the four regex objects refer to each other, some directly and some indirectly, some by value and some by reference. What if we were to return one of them from a function and let the others go out of scope? What becomes of the references? The answer is that the regex objects are internally reference counted, such that they keep their referenced regex objects alive as long as they need them. So passing a regex object by value is never a problem, even if it refers to other regex objects that have gone out of scope.
Those of you who have dealt with reference counting are probably familiar
with its Achilles Heel: cyclic references. If regex objects are reference
counted, what happens to cycles like the one created in the calculator examples?
Are they leaked? The answer is no, they are not leaked. The
object has some tricky reference tracking code that ensures that even cyclic
regex grammars are cleaned up when the last external reference goes away.
So don't worry about it. Create cyclic grammars, pass your regex objects
around and copy them all you want. It is fast and efficient and guaranteed
not to leak or result in dangling references.
basic_regex<>
Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the sub-matches written by the outer regex. For example, what does this do?
sregex inner = sregex::compile( "(.)\\1" ); sregex outer = (s1= _) >> inner >> s1;
The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer regex. The problem is particularly acute when the inner regex is accepted from the user as input. The author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is clearly not acceptable.
Instead, what actually happens is that each invocation of a nested regex
gets its own scope. Sub-matches belong to that scope. That is, each nested
regex invocation gets its own copy of the sub-match vector to play with,
so there is no way for an inner regex to stomp on the sub-matches of an outer
regex. So, for example, the regex outer
defined above would match "ABBA"
,
as it should.
If nested regexes have their own sub-matches, there should be a way to access
them after a successful match. In fact, there is. After a
or regex_match()
,
the regex_search()
struct behaves like the head of a tree of nested results. The match_results<>
class provides a match_results<>
nested_results()
member function that returns an ordered
sequence of
structures, representing the results of the nested regexes. The order of
the nested results is the same as the order in which the nested regex objects
matched.
match_results<>
Take as an example the regex for balanced, nested parentheses we saw earlier:
sregex parentheses; parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')'; smatch what; std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" ); if( regex_search( str, what, parentheses ) ) { // display the whole match std::cout << what[0] << '\n'; // display the nested results std::for_each( what.nested_results().begin(), what.nested_results().end(), output_nested_results() ); }
This program displays the following:
( a(b)c (c(e)f (g)h )i (j)6 ) (b) (c(e)f (g)h ) (e) (g) (j)
Here you can see how the results are nested and that they are stored in the order in which they are found.
![]() |
Tip |
---|---|
See the definition of output_nested_results in the Examples section. |
Sometimes a regex will have several nested regex objects, and you want to
know which result corresponds to which regex object. That's where basic_regex<>::regex_id()
and match_results<>::regex_id()
come in handy. When iterating over the nested results, you can compare the
regex id from the results to the id of the regex object you're interested
in.
To make this a bit easier, xpressive provides a predicate to make it simple
to iterate over just the results that correspond to a certain nested regex.
It is called regex_id_filter_predicate
,
and it is intended to be used with Boost.Iterator.
You can use it as follows:
sregex name = +alpha; sregex integer = +_d; sregex re = *( *_s >> ( name | integer ) ); smatch what; std::string str( "marsha 123 jan 456 cindy 789" ); if( regex_match( str, what, re ) ) { smatch::nested_results_type::const_iterator begin = what.nested_results().begin(); smatch::nested_results_type::const_iterator end = what.nested_results().end(); // declare filter predicates to select just the names or the integers sregex_id_filter_predicate name_id( name.regex_id() ); sregex_id_filter_predicate integer_id( integer.regex_id() ); // iterate over only the results from the name regex std::for_each( boost::make_filter_iterator( name_id, begin, end ), boost::make_filter_iterator( name_id, end, end ), output_result ); std::cout << '\n'; // iterate over only the results from the integer regex std::for_each( boost::make_filter_iterator( integer_id, begin, end ), boost::make_filter_iterator( integer_id, end, end ), output_result ); }
where output_results
is a
simple function that takes a smatch
and displays the full match. Notice how we use the regex_id_filter_predicate
together with basic_regex<>::regex_id()
and boost::make_filter_iterator()
from the Boost.Iterator
to select only those results corresponding to a particular nested regex.
This program displays the following:
marsha jan cindy 123 456 789