Found at:
http://publish.ez.no/article/articleprint/11/
Regular Expressions explained
This article will give you an introduction to the world of regular
expressions. I'll start off by explaining what regular expressions are
and introducing their syntax, then give some examples of varying complexity
and finally a list of tools which use regular expressions.
[Note: Puppy has a regular expression
evaluation and learning tool in the Utilities menu]
Concept
A regular expression is a text pattern consisting of a combination of
alphanumeric characters and special characters known as metacharacters. A
close relative is in fact the wildcard expression, which is often used
in file management. The pattern is used to match against text strings. The
result of a match is either successful or not. However, when a match is
successful, not all of the pattern has to take part in it; this is explained
later in the article.
You'll find that regular expressions are used in three different ways:
regular text matching, search and replace, and splitting. The latter is
basically the reverse of a match, i.e. everything the regular expression
did not match.
Regular expressions are often simply called regexps or RE, but for
consistency I'll be referring to them by their full name.
Due to their versatility, regular expressions are widely used in
text processing and parsing. UNIX users are probably familiar with them
through the programs grep, sed, awk and ed. Text editors such as (X)Emacs
and vi also use them heavily. Probably the best-known use of regular
expressions is in the programming language Perl; you'll find that Perl
sports the most advanced regular expression implementation to this day.
Usage
Now you're probably wondering why you should bother to learn regular
expressions. Well, if you're a normal computer user the benefits of
using them are somewhat small. However, if you're a developer or a
system administrator you'll find that knowing regular expressions will
make your (professional) life so much better.
Developers can use them to parse text files, fix up code and work other
wonders. System administrators can use them to search through logs, automate
boring tasks and sniff the network traffic for unauthorized activity.
Actually I would go so far as to say it's a crime for a system administrator
not to have any knowledge of regular expressions.
Quantifiers
Before I start explaining the syntax you might want to jump to the last page
to learn which programs you can use to test out the examples in this
article.
The contents of an expression are, as explained earlier, a combination of
alphanumeric characters and metacharacters. An alphanumeric character is
either a letter from the alphabet or a number.
Actually, in the world of regular expressions any character which is not a
metacharacter will match itself (these are often called literal characters),
but most of the time you're concerned with the alphanumeric characters. A
very special character is the backslash, \. It turns any metacharacter into
a literal character, and some alphanumeric characters into a sort of
metacharacter or sequence. The metacharacters are:
\ | ( ) [ { ^ $ * + ? . < >
With that said, normal characters don't sound too interesting, so let's jump
to our very first metacharacters.
The punctuation mark, or dot, . needs explaining first since it often
leads to confusion. This character will not, as many might think, match the
punctuation in a line; it is instead a special metacharacter which matches
any character. Using it where you wanted to find the end of the line or the
decimal point in a floating point number will lead to strange results. As
explained above, you need to backslashify it to get the literal meaning. For
instance, the expression 1.23 will match the number 1.23 in a text, as you
might have guessed, but it will also match any text where some other
character stands in place of the dot, such as 1x23 or 1-23. To make the
expression only match the floating point number we change it to 1\.23.
Remember this, it's very important. Now with that said we can get the show
going.
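A minimal sketch of the difference, using Python's re module (my choice of language, since the article itself is tool-neutral). The sample strings are made up for illustration.

```python
import re

samples = ["1.23", "1x23", "1 23", "1223"]

# Unescaped dot: any single character may sit between "1" and "23".
loose = [s for s in samples if re.search(r"1.23", s)]

# Escaped dot: only a literal "." may sit between "1" and "23".
strict = [s for s in samples if re.search(r"1\.23", s)]

print(loose)   # every sample matches
print(strict)  # only the real floating point number matches
```

The raw string prefix r"..." keeps Python itself from interpreting the backslash before the pattern reaches the regular expression engine.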
Two heavily recurring metacharacters are * and +.
They are called quantifiers and tell the engine to look for several
occurrences of a character. The quantifier always follows the character it
applies to. The * character matches zero or more occurrences of the
character in a row; the + character is similar but matches one or more.
So what if you decided to find words which had the character c in them?
You might be tempted to write c*.
What might come as a surprise to you is that you will find an enormous amount
of matches; even words with no c in them will match. How so, you ask? Well,
the answer is simple. Recall that the * character matches zero or more
characters. Well, that's exactly what you matched: zero characters.
You see, in regular expressions you have the possibility to match what
is called the empty string, which is simply a string with zero size.
This empty string can actually be found in all texts. For instance, the
word go contains three empty strings. They are located at the position right
before the g, in between the g and the o, and after the o. An empty string
contains exactly one empty string. At first this might seem like a really
silly thing to do, but you'll learn later on how this is used in more complex
expressions.
So with this knowledge we might want to change our expression to c+
and voila, we get only words with c in them.
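The effect can be checked with a short Python sketch. It assumes the two expressions in question are c* and c+; the word list is invented for the example.

```python
import re

words = ["cow", "dog", "accent"]

# c* may match zero characters, so it succeeds on every word.
star = [w for w in words if re.search(r"c*", w)]

# c+ needs at least one "c", so "dog" is rejected.
plus = [w for w in words if re.search(r"c+", w)]

# The empty pattern matches at every position of "go":
# before the g, between the g and the o, and after the o.
empty_positions = [m.start() for m in re.finditer(r"", "go")]

print(star, plus, empty_positions)
```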
The next metacharacter you'll learn is ?.
This simply tells the engine to either match the preceding character or not (zero or
one). For instance the expression:
will match any of these lines:
These three metacharacters are simply special cases of the more
general quantifier {n,m}, where n and m are respectively the minimum and
maximum number of repetitions. For instance {1,5} means match at least one
and at most five characters. You can also skip m to allow for an unbounded
match: {1,} matches one or more characters. This is exactly what the +
character does. So now you see the connection: * is equal to {0,}, + is equal
to {1,} and ? is equal to {0,1}.
The last thing you can do with the quantifier is to also skip the comma:
{5} means match exactly 5 characters, no more, no less.
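As a rough check of these equivalences, the following Python sketch compares each shorthand quantifier with its brace form on an invented test string. It relies only on Python's re module supporting the same {n,m} notation.

```python
import re

text = "aaaaaaa"  # seven a's

# Each shorthand quantifier next to its general {n,m} spelling.
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
same = [re.match(p, text).group() == re.match(q, text).group()
        for p, q in pairs]

# {1,5} takes at most five characters even though seven are available.
up_to_five = re.match(r"a{1,5}", text).group()

# {5} demands exactly five, so a string of four a's does not match.
exactly_five = re.fullmatch(r"a{5}", "aaaa")

print(same, up_to_five, exactly_five)
```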
Assertions
The next type of metacharacters are assertions; these will match if a given
assertion is true. The first pair of assertions are ^ and $,
which match the beginning of the line and the end of the line. Note that some
regular expression implementations allow you to change their behavior
so that they will instead match the beginning of the text and the end of the
text. These assertions always match a zero length string, or in other words
they match a position. For instance, if you wrote the expression ^The
it would match any line which began with the word The.
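Python is one of the implementations where this behavior is switchable, although its default is the reverse of the line-based tools: ^ and $ anchor to the whole text unless re.MULTILINE is given. A small sketch with an invented three-line text:

```python
import re

text = "The cow jumped.\nOver The moon.\nThe end"

# Default: ^ only matches at the very start of the text.
whole_text = re.findall(r"^The", text)

# re.MULTILINE: ^ and $ match at the start and end of every line.
per_line = re.findall(r"^The", text, flags=re.MULTILINE)

# Whole lines that end in a literal full stop.
ends_with_dot = re.findall(r"^.*\.$", text, flags=re.MULTILINE)

print(whole_text, per_line, ends_with_dot)
```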
The next assertion characters match at the beginning and end of a word. They
are \< and \>.
They come in handy when you want to match a word precisely. For instance \<cow
would match any of the following words
cow
coward
cowage
cowboy
cowl
A small change to the expression, \<cow\>,
and you'll only match the word cow in the text.
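Not every implementation spells word assertions the same way. Python's re module, for one, has no separate start-of-word and end-of-word forms; it offers a single \b that matches at either edge of a word. A hedged sketch of the cow example in that dialect (the word "scow" is added by me to show that the boundary really is checked):

```python
import re

text = "cow coward cowage cowboy cowl scow"

# \bcow\w* : a word boundary, then "cow", then the rest of the word.
starts_with_cow = re.findall(r"\bcow\w*", text)

# \bcow\b : boundaries on both sides, so only the whole word "cow".
only_cow = re.findall(r"\bcow\b", text)

print(starts_with_cow, only_cow)
```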
One last thing to be said is that all literal characters are in fact
assertions themselves; the difference between them and the ones above is that
the literal ones have a size. So for cleanliness' sake we only use the word
assertions for those that are zero-width.
Groups and alternation
One thing you might have noticed when we explained quantifiers is that they
only worked on the character to the left. Since this pretty much limits our
expressions, I'll explain other uses for quantifiers. Quantifiers can also be
used on metacharacters. Using them on assertions is silly, since they are
zero-width and matching one, two, three or more of them doesn't do any good.
However, the grouping and sequence metacharacters are perfect for being
quantified. Let's start with grouping.
You can form groups, or subexpressions as they are frequently called, by
using the opening and closing parenthesis characters, ( and ).
The ( starts the subexpression and the ) ends it. It is also
possible to have one or more subexpressions inside a subexpression. The
subexpression will match if its contents match. So mixing this with
quantifiers and assertions you can do:
which matches all of the following lines
Another use for subexpressions is to extract a portion of the match if
it matches; this is often used in conjunction with sequences, which are
discussed later.
You can also use the result of a subexpression for what is called a back
reference. A back reference is written as a backslashified digit, only a
single non-zero digit, which leaves you with nine back references.
The back reference matches whatever the corresponding subexpression actually
matched (except that \0 matches a null character). To find
the number of a subexpression, count the left parentheses from the left.
The use for back references is somewhat limited, especially since you only
have nine of them, but on some rare occasions you might need one. Note that
some regular expression implementations can use multi-digit numbers as long
as they don't start with a 0.
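One occasion where a back reference is handy is spotting accidentally doubled words. The sketch below is my own illustration, not taken from the article, and uses Python's \b and \w notation.

```python
import re

text = "This is is a test of the the back reference."

# (\w+) captures a word as subexpression 1; \1 must then repeat
# exactly the text that subexpression 1 matched.
doubled = re.findall(r"\b(\w+) \1\b", text)

print(doubled)
```

Because the pattern contains one subexpression, findall returns what that subexpression matched rather than the whole match.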
Next is alternation, which allows you to match one of many words. The
alternation character is |. A sample usage is Bill|Linus|Steve|Larry, which
would match either Bill, Linus, Steve or Larry. Mixing this with
subexpressions and quantifiers we can do cow(ard|age|boy|l)?,
which matches any of the following words but none other
cow
coward
cowage
cowboy
cowl
I mentioned earlier in the article that not all of the expression must match
for the match to be successful. This can happen when you're using
subexpressions together with alternation. For instance
((Donald|Dolly) Duck)|(Scrooge McDuck)
As you see, only the left or the right top-level subexpression will match,
not both. This is sometimes handy when you want to run a complex pattern in
one subexpression and, if it fails, try another one.
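To see which subexpressions took part in a match, here is a small Python sketch using the Duck expression from above. The input strings are invented; in Python a subexpression that did not participate is reported as None.

```python
import re

pattern = re.compile(r"((Donald|Dolly) Duck)|(Scrooge McDuck)")

# Subexpressions are numbered by their left parenthesis:
#   1 = (Donald|Dolly) Duck, 2 = Donald|Dolly, 3 = Scrooge McDuck
scrooge_groups = pattern.search("Uncle Scrooge McDuck").groups()
donald_groups = pattern.search("Donald Duck").groups()

print(scrooge_groups)
print(donald_groups)
```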
Sequences
Last we have sequences, which define sets of characters which can match.
Sometimes you don't want to match a word directly but rather something that
resembles one. The sequence characters are [ and ].
Any characters put inside the sequence brackets are treated as literal
characters, even metacharacters. The only special characters are the -,
which denotes character ranges, and the ^, which is used to negate a
sequence when it is placed first. A sequence is somewhat similar to
alternation; the similarity is that only one of the items listed will match.
For instance [a-z]
will match any lowercase character in the English alphabet (a to z).
Another common sequence is [a-zA-Z0-9],
which matches any lowercase or capital character in the English alphabet as
well as digits. Sequences are also mixed with quantifiers and assertions to
produce more elaborate searches. For instance
matches all whole words. This will match
cow
Linus
regular
expression
but will not match
Now what if you wanted to find anything but words? The expression
[^a-zA-Z0-9]+
would find any sequence of characters which does not contain letters of the
English alphabet or any digits.
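A sequence and its negation split a text into two complementary sets of pieces, which is also how the splitting use mentioned at the start of the article works. The following Python sketch assumes the sequences [a-zA-Z0-9]+ and [^a-zA-Z0-9]+ and an invented sample text.

```python
import re

text = "cow, Linus; regular expression!"

# Runs of letters and digits: the words.
words = re.findall(r"[a-zA-Z0-9]+", text)

# Runs of everything else: the gaps between the words.
gaps = re.findall(r"[^a-zA-Z0-9]+", text)

# Splitting on the gaps gives the words back, plus an empty string
# at the end because the text finishes with a gap ("!").
pieces = re.split(r"[^a-zA-Z0-9]+", text)

print(words, gaps, pieces)
```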
Some implementations of regular expressions allow you to use
shorthand versions for commonly used sequences. They are:
\d, a digit [0-9]
\D, a non-digit [^0-9]
\w, a word character (alphanumeric or underscore) [a-zA-Z0-9_]
\W, a non-word character [^a-zA-Z0-9_]
\s, a whitespace [ \t\n\r\f]
\S, a non-whitespace [^ \t\n\r\f]
Wildcards
For people who have some knowledge of wildcards I'll give a brief
explanation of how to convert them to regular expressions. After
reading this article you have probably seen the similarities with wildcards.
For instance *.jpg
matches any text which ends with .jpg. You can also specify brackets with
characters, for instance *.[ch]pp
matches any text which ends in .cpp or .hpp. Altogether very similar to
regular expressions.
Converting the * operator
The * means match zero or more of anything in wildcards. As we learned, we do
this in regular expressions with the punctuation mark and the * quantifier.
This gives .*
Also remember to backslashify any punctuation marks in the wildcard.
Converting the ? operator
The ? means match any character, but do match something. This is
exactly what the punctuation mark does.
Converting the [] operator
The square brackets can be used untouched since they have the same meaning
going from wildcards to regular expressions.
This leaves us with:
Replace any * characters with .*
Replace any ? characters with .
Leave square brackets as they are.
Replace any characters which are metacharacters with a backslashified
variant.
Examples
*.jpg would be converted to .*\.jpg
*.[ch]pp would be converted to .*\.[ch]pp
or alternatively .*\.(c|h)pp
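The four conversion rules can be sketched as a small Python function. The name wildcard_to_regex is hypothetical, and the sketch deliberately ignores wildcard features the article does not cover, such as negated brackets. Python's standard library has its own fnmatch.translate for real use.

```python
import re

def wildcard_to_regex(wildcard):
    """Convert a simple wildcard to a regular expression (sketch)."""
    out = []
    in_brackets = False
    for ch in wildcard:
        if in_brackets:
            out.append(ch)            # rule 3: brackets stay as they are
            if ch == "]":
                in_brackets = False
        elif ch == "[":
            in_brackets = True
            out.append(ch)
        elif ch == "*":
            out.append(".*")          # rule 1
        elif ch == "?":
            out.append(".")           # rule 2
        else:
            out.append(re.escape(ch)) # rule 4: backslashify metacharacters
    return "".join(out)

jpg = wildcard_to_regex("*.jpg")
cpp = wildcard_to_regex("*.[ch]pp")
print(jpg, cpp)
```

A wildcard always has to match the whole file name, so the converted expression should be applied with re.fullmatch, or wrapped in ^ and $.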
Examples
To really get to know regular expressions I've left some commonly used
expressions on this page. Study them, experiment and try to understand
exactly what they are doing.
Email validity, will only match email addresses which are valid, for instance
user@host.com
[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+
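One way to experiment with this expression is to run it over a handful of candidates. The Python sketch below uses fullmatch so the whole string has to fit the pattern, which is my assumption about how the expression is meant to be applied; the candidate addresses are invented.

```python
import re

EMAIL = re.compile(r"[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+")

candidates = [
    "user@host.com",
    "first.last@mail.example.org",
    "user@localhost",      # no dot after the @, so the last group fails
    "no-at-sign.com",
    "User@Host.com",       # the sequences only list lowercase letters
]
accepted = [c for c in candidates if EMAIL.fullmatch(c)]

print(accepted)
```

As the last candidate shows, the expression is case-sensitive as written; it is a teaching example rather than a complete address validator.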
Email validity #2, matches email addresses with a name in front, for instance
"Joe Doe
"
("?[a-zA-Z]+"?[ \t]*)+\<[a-z0-9_-]+(\.[a-z0-9_-]+)*@[a-z0-9_-]+(\.[a-z0-9_-]+)+\>
Protocol validity, matches web based protocols such as http://, ftp:// or
https://
C/C++ includes, matches valid include statements in C/C++ files.
^#include[ \t]+[<"][^>"]+[">]
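A quick way to study the include expression is to feed it a few lines and see which survive. This Python sketch uses invented sample lines; note in the last case that the expression does not check that the opening and closing delimiters belong together.

```python
import re

INCLUDE = re.compile(r'^#include[ \t]+[<"][^>"]+[">]')

lines = [
    '#include <stdio.h>',
    '#include "util.h"',
    '  #include <late.h>',   # leading spaces: ^ demands column one
    '#include <>',           # nothing between the delimiters
    'int main(void) {',
]
matching = [line for line in lines if INCLUDE.search(line)]

# Mismatched delimiters are accepted by the expression as written.
mismatched = INCLUDE.search('#include <odd.h"') is not None

print(matching, mismatched)
```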
C++ end of line comments
C/C++ span line comments, it has one flaw, can you spot it?
Floating point numbers, matches simple floating point numbers of the kind 1.2
and 0.5
Hexadecimal numbers, matches C/C++ style hex numbers, 0xcafebabe
Utilities
There exist several utilities which use regular expressions. I'll leave a
list of them with a short description:
grep
Grep searches named input files for lines containing a match to the given
pattern. It can also be used to find files which contain a specific pattern,
for instance:
grep -E "cow|vache" * >/dev/null && echo "Found a cow"
This utility is rather common on Linux distributions, but if you don't
have it you can grab a version on the GNU page.
A small tip is to enable extended regular expressions with the option -E; if
not, a lot of the metacharacters explained in this article won't work.
sed
Sed is a stream editor. A stream editor is used to perform basic text
transformations on an input stream.
This utility is rather common on Linux distributions, but if you don't
have it you can grab a version on the GNU page.
gawk
Gawk is the GNU Project's implementation of the AWK programming language.
It conforms to the definition of the language in the POSIX 1003.2 Command
Language And Utilities Standard.
This utility is rather common on Linux distributions, but if you don't
have it you can grab a version on the GNU page.
[document edited
here]
Regular expression related links:
Regular Expressions and NP-Completeness
Equivalence of Regular Expressions and Finite Automata
Perl Regular Expression Tutorial