Regular expressions are used when you want to search for specific lines of text containing a particular pattern. Most of the UNIX utilities operate on ASCII files a line at a time. Regular expressions search for patterns on a single line, and not for patterns that start on one line and end on another.
It is simple to search for a specific word or string of characters. Almost every editor on every computer system can do this. Regular expressions are more powerful and flexible. You can search for words of a certain size. You can search for a word with four or more vowels that ends with an "s." Numbers, punctuation characters, you name it, a regular expression can find it. What happens once the program you are using finds it is another matter. Some just search for the pattern. Others print out the line containing the pattern. Editors can replace the string with a new pattern. It all depends on the utility.
Regular expressions confuse people because they look a lot like the file matching patterns the shell uses. They even act the same way--almost. The square brackets are similar, and the shell's asterisk acts similarly, but not identically, to the asterisk in a regular expression. In particular, the Bourne shell, C shell, find, and cpio use file name matching patterns and not regular expressions.
Remember that shell meta-characters are expanded before the shell passes the arguments to the program. To prevent this expansion, the special characters in a regular expression must be quoted when passed as an option from the shell. You already know how to do this because I covered this topic in last month's tutorial.
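Here is a small sketch of why the quoting matters. The scratch directory, the file names, and the sample text are all made up for the demonstration; the point is only that an unquoted "ab*" is rewritten by the shell before the program ever sees it.

```shell
# Build a throwaway directory with two files whose names match the glob ab*.
dir=$(mktemp -d)
printf 'abc\nab*\n' > "$dir/sample.txt"
touch "$dir/abc" "$dir/abd"

# Unquoted: the shell expands ab* into the matching file names.
unquoted=$(cd "$dir" && echo ab*)
# Quoted: the characters reach the program unchanged.
quoted=$(cd "$dir" && echo 'ab*')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# grep sees the regular expression "a followed by zero or more b"
# only because the single quotes protect the asterisk from the shell.
matches=$(grep -c 'ab*' "$dir/sample.txt")
echo "lines matched: $matches"

rm -rf "$dir"
```

Single quotes are the safest habit, since they also protect "$" and the back slash from the shell.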
There are also two types of regular expressions: the "basic" regular expression and the "extended" regular expression. A few utilities like awk and egrep use the extended expression. Most use the basic, or "regular," regular expression. From now on, if I talk about a "regular expression," it describes a feature in both types.
Here is a table of the Solaris (around 1991) commands that allow you to specify regular expressions:
Utility | Regular Expression Type |
vi | Basic |
sed | Basic |
grep | Basic |
csplit | Basic |
dbx | Basic |
dbxtool | Basic |
more | Basic |
ed | Basic |
expr | Basic |
lex | Basic |
pg | Basic |
nl | Basic |
rdist | Basic |
awk | Extended |
nawk | Extended |
egrep | Extended |
EMACS | EMACS Regular Expressions |
PERL | PERL Regular Expressions |
Pattern | Matches |
^A | "A" at the beginning of a line |
A$ | "A" at the end of a line |
A^ | "A^" anywhere on a line |
$A | "$A" anywhere on a line |
^\^ | "^" at the beginning of a line |
\$$ | "$" at the end of a line |
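The rows of this table can be checked with grep. What follows is only a sketch: the sample lines and the count helper are invented for the illustration, and it assumes a grep that follows the basic regular expression rules described here.

```shell
# Made-up sample input, one case per line.
sample=$(printf 'Apple\nbananA\nx A^ y\ncost $A today\n^caret\nprice$\n')

# count PATTERN: print how many sample lines match the basic regular expression.
count() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

echo "^A   matches $(count '^A') line(s)"    # "A" at the beginning of a line
echo "A\$   matches $(count 'A$') line(s)"    # "A" at the end of a line
echo "A^   matches $(count 'A^') line(s)"    # "^" is ordinary when it is not first
echo "\$A   matches $(count '$A') line(s)"    # "$" is ordinary when it is not last
echo "^\\^  matches $(count '^\^') line(s)"   # a literal "^" at the beginning
echo "\\\$\$  matches $(count '\$$') line(s)"   # a literal "$" at the end
```

Each pattern picks out exactly one of the six sample lines.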
The use of "^" and "$" as indicators of the beginning or end of a line is a convention other utilities use. The vi editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses "!^" to specify the first argument of the previous line, and "!$" is the last argument on the previous line.
It is one of those choices that other utilities go along with to maintain consistency. For instance, "$" can refer to the last line of a file when using ed and sed. The command cat -e marks the end of each line with a "$." You might see it in other programs as well.
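To search for a three-letter word at the beginning of a line that starts with a capital "T," has a lower case letter second, and ends with a vowel followed by a space, the regular expression would be "^T[a-z][aeiou] ".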
Like the anchors in places that can't be considered an anchor, the characters "]" and "-" do not have a special meaning if they directly follow "[." Here are some examples:
Regular Expression | Matches |
[] | The characters "[]" |
[0] | The character "0" |
[0-9] | Any number |
[^0-9] | Any character other than a number |
[-0-9] | Any number or a "-" |
[0-9-] | Any number or a "-" |
[^-0-9] | Any character except a number or a "-" |
[]0-9] | Any number or a "]" |
[0-9]] | Any number followed by a "]" |
[0-9-z] | Any number, or any character between "9" and "z" |
[0-9\-a\]] | Any number, or a "-", an "a", or a "]" |
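A few of these rows can be tried out with grep. This is a sketch under my own assumptions: the five one-line samples and the count helper are invented, and each pattern is wrapped in "^" and "$" so that it has to match the whole line. LC_ALL=C is set so that "0-9" means exactly the ten ASCII digits.

```shell
# Made-up input: a digit, a hyphen, a right bracket, a letter, and a digit plus "]".
sample=$(printf '7\n-\n]\nx\n5]\n')

# count PATTERN: print how many sample lines match the basic regular expression.
count() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

echo "[0-9]    $(count '^[0-9]$')"     # 1: only "7"
echo "[^0-9]   $(count '^[^0-9]$')"    # 3: "-", "]" and "x"
echo "[-0-9]   $(count '^[-0-9]$')"    # 2: "7" and "-"
echo "[0-9-]   $(count '^[0-9-]$')"    # 2: the same two lines
echo "[^-0-9]  $(count '^[^-0-9]$')"   # 2: "]" and "x"
echo "[]0-9]   $(count '^[]0-9]$')"    # 2: "7" and "]"
echo "[0-9]]   $(count '^[0-9]]$')"    # 1: a digit followed by "]"
```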
This explains why the pattern "^#*" is useless, as it matches any number of "#'s" at the beginning of the line, including zero. Therefore this will match every line, because every line starts with zero or more "#'s."
At first glance, it might seem that starting the count at zero is stupid. Not so. Looking for an unknown number of characters is very important. Suppose you wanted to look for a number at the beginning of a line, and there may or may not be spaces before the number. Just use "^ *" to match zero or more spaces at the beginning of the line. If you need to match one or more, just repeat the character set. That is, "[0-9]*" matches zero or more numbers, and "[0-9][0-9]*" matches one or more numbers.
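To see the difference between "zero or more" and "one or more," here is a short sketch. The five sample lines (one of them empty) and the count helper are made up for the illustration.

```shell
# Made-up input; the fourth line is empty.
sample=$(printf '# comment\n42 apples\n   7 pears\n\nno digits\n')

# count PATTERN: print how many sample lines match the basic regular expression.
count() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

# Zero or more "#" at the beginning: true of every line, even the empty one.
echo "every line:        $(count '^#*')"

# Optional spaces, then zero or more digits: again true of every line.
echo "zero or more:      $(count '^ *[0-9]*')"

# Optional spaces, then at least one digit: only the two numbered lines.
echo "one or more:       $(count '^ *[0-9][0-9]*')"
```

The first two counts are 5 and the last is 2, which is why repeating the character set is the usual way to insist on at least one match.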
If a back slash is placed before a "<," ">," "{," "}," "(," ")," or before a digit, the back slash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of "{" would have broken old expressions. This is a horrible crime punishable by a year of hard labor writing COBOL programs. Instead, adding a back slash added functionality without breaking old programs. Rather than complain about the asymmetry, view it as evolution.
Having convinced you that "\{" isn't a plot to confuse you, an example is in order. The regular expression to match 4, 5, 6, 7 or 8 lower case letters is "[a-z]\{4,8\}".
You must remember that modifiers like "*" and "\{1,5\}" only act as modifiers if they follow a character set. If they were at the beginning of a pattern, they would not be modifiers. Here is a list of examples, and the exceptions:
Regular Expression | Matches |
* | Any line with an asterisk |
\* | Any line with an asterisk |
\\ | Any line with a back slash |
^* | Any line starting with an asterisk |
^A* | Any line |
^A\* | Any line starting with an "A*" |
^AA* | Any line if it starts with one "A" |
^AA*B | Any line starting with one or more "A"'s followed by a "B" |
^A\{4,8\}B | Any line starting with 4, 5, 6, 7 or 8 "A"'s followed by a "B" |
^A\{4,\}B | Any line starting with 4 or more "A"'s followed by a "B" |
^A\{4\}B | Any line starting with "AAAAB" |
\{4,8\} | Any line with "{4,8}" |
A{4,8} | Any line with "A{4,8}" |
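Several rows of this table can be checked with grep. This is a sketch with invented sample lines and a made-up count helper. I left out the row that begins with "\{4,8\}" because some grep implementations may reject a repeat count with nothing in front of it instead of treating it as ordinary text.

```shell
# Made-up input: 3, 4, 8 and 9 "A"'s followed by "B", then two literal cases.
sample=$(printf 'AAAB\nAAAAB\nAAAAAAAAB\nAAAAAAAAAB\nA{4,8}\nA*x\n')

# count PATTERN: print how many sample lines match the basic regular expression.
count() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

echo "4 to 8 A's then B:   $(count '^A\{4,8\}B')"   # 2: the lines with 4 and 8
echo "4 or more A's, B:    $(count '^A\{4,\}B')"    # 3: the lines with 4, 8 and 9
echo "exactly 4 A's, B:    $(count '^A\{4\}B')"     # 1: only "AAAAB"
echo "literal A{4,8}:      $(count 'A{4,8}')"       # 1: braces without back slashes are ordinary
echo "literal A*:          $(count '^A\*')"         # 1: the back slash turns the modifier off
```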
There is an easy solution. The characters "\<" and "\>" are similar to the "^" and "$" anchors, as they don't occupy a position of a character. They do "anchor" the expression between them so that it matches only on a word boundary. The pattern to search for the word "the" would be "\<[tT]he\>." The character before the "t" must be either a new line character, or anything except a letter, number, or underscore. The character after the "e" must also be a character other than a number, letter, or underscore, or it could be the end of line character.
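The effect of the word anchors is easy to see with a few sample lines. This sketch assumes a grep that understands "\<" and "\>" (GNU grep does; they are an extension, so other versions may not), and the sample text and count helper are made up.

```shell
# Made-up input: three lines contain the word "the", three only contain the letters.
sample=$(printf 'the cat\nThe end\nother things\nbathe daily\nsee the_thing\nat the.\n')

# count PATTERN: print how many sample lines match the basic regular expression.
count() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

# Without anchors the letters are found inside "other", "bathe" and "the_thing".
echo "without anchors: $(count '[tT]he')"

# With anchors only the stand-alone word counts. The underscore in "the_thing"
# is treated as part of the word, so that line is rejected.
echo "with anchors:    $(count '\<[tT]he\>')"
```

The first count is 6 and the second is 3.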
The "\<" and "\>" characters were introduced in the vi editor. The other programs didn't have this ability at that time. Also the "\{min,max\}" modifier is new and earlier utilities didn't have this ability. This made it difficult for the novice user of regular expressions, because it seemed each utility had a different convention. Sun has retrofitted the newest regular expression library to all of their programs, so they all have the same ability. If you try to use these newer features on other vendors' machines, you might find they don't work the same way.
The other potential point of confusion is the extent of the pattern matches. Regular expressions match the longest possible pattern. That is, the regular expression
The character "?" matches 0 or 1 instances of the character set before, and the character "+" matches one or more copies of the character set. You can't use the \{ and \} in the extended regular expressions, but if you could, you might consider the "?" to be the same as "\{0,1\}" and the "+" to be the same as "\{1,\}."
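Here is a sketch comparing the two extended modifiers with their basic look-alikes. It uses grep -E, which is the current spelling of egrep, and plain grep for the basic versions. The sample words and the two helper functions are made up for the illustration.

```shell
# Made-up input.
sample=$(printf 'color\ncolour\ncolouur\nac\nabc\nabbbc\n')

# count_ere PATTERN: count matching lines using extended regular expressions.
count_ere() {
    printf '%s\n' "$sample" | LC_ALL=C grep -E -c -e "$1"
}

# count_bre PATTERN: count matching lines using basic regular expressions.
count_bre() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

# "?" is zero or one "u": matches "color" and "colour" but not "colouur".
echo "extended u?:      $(count_ere '^colou?r$')"
echo "basic u\\{0,1\\}:   $(count_bre '^colou\{0,1\}r$')"

# "+" is one or more "b": matches "abc" and "abbbc" but not "ac".
echo "extended b+:      $(count_ere '^ab+c$')"
echo "basic b\\{1,\\}:    $(count_bre '^ab\{1,\}c$')"
```

All four counts come out as 2, so each pair selects the same lines.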
By now, you are wondering why the extended regular expressions are even worth using. Except for two abbreviations, there are no advantages, and a lot of disadvantages. Therefore, examples would be useful.
The three important characters in the extended regular expressions are "(," "|," and ")." Together, they let you match a choice of patterns. As an example, you can use egrep to print all From: and Subject: lines from your incoming mail.
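One way to write such a command is sketched here. It uses grep -E, the current spelling of egrep, and a made-up mailbox held in a shell variable instead of a real mail file, so the addresses and the variable names are only for illustration.

```shell
# A made-up message standing in for a real mailbox file.
mailbox=$(printf 'From: alice@example.com\nTo: bob@example.com\nSubject: Lunch\n\nFrom here on it is body text.\n')

# The parentheses group the two choices and "|" separates them.
# The "^" anchor and the trailing ": " keep the body line from matching.
headers=$(printf '%s\n' "$mailbox" | grep -E '^(From|Subject): ')

printf '%s\n' "$headers"
```

With a real mailbox you would name the file as the last argument to grep instead of piping the text in.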
I promised to explain why the back slash characters don't work in extended regular expressions. Well, perhaps the "\{...\}" and "\<...\>" could be added to the extended expressions. These are the newest additions to the regular expression family. They could be added, but this might confuse people if those characters are added and the "\(...\)" are not. And there is no way to add that functionality to the extended expressions without changing the current usage. Do you see why? It's quite simple. If "(" has a special meaning, then "\(" must be the ordinary character. This is the opposite of the Basic regular expressions, where "(" is ordinary, and "\(" is special. The usage of the parentheses is incompatible, and any change could break old programs.
If the extended expression used "(...|...)" as regular characters, and "\(...\|...\)" for specifying alternate patterns, then it is possible to have one set of regular expressions that has full functionality. This is exactly what GNU emacs does, by the way.
The rest of this is random notes.
Regular Expression | Class | Type | Meaning |
. | all | Character Set | A single character (except newline) |
^ | all | Anchor | Beginning of line |
$ | all | Anchor | End of line |
[...] | all | Character Set | Range of characters |
* | all | Modifier | zero or more duplicates |
\< | Basic | Anchor | Beginning of word |
\> | Basic | Anchor | End of word |
\(..\) | Basic | Backreference | Remembers pattern |
\1..\9 | Basic | Reference | Recalls pattern |
+ | Extended | Modifier | One or more duplicates |
? | Extended | Modifier | Zero or one duplicate |
\{M,N\} | Basic | Modifier | M to N duplicates |
(...|...) | Extended | Anchor | Shows alternation |
\(...\|...\) | EMACS | Anchor | Shows alternation |
\w | EMACS | Character set | Matches a letter in a word |
\W | EMACS | Character set | Opposite of \w |
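The "\(..\)" and "\1" rows are easier to understand with an example. This is a sketch using grep and sed with basic regular expressions. The sample words and the variable names are made up.

```shell
# "\(...\)" remembers what matched, and "\1" must match the same text again.
# Here that finds words containing a doubled lower case letter.
doubled=$(printf 'hello\nworld\nbook\n' | LC_ALL=C grep -c '\([a-z]\)\1')
echo "words with a doubled letter: $doubled"

# In sed the remembered pieces can be recalled in the replacement.
# "\1" is the text before the hyphen and "\2" is the text after it.
swapped=$(printf 'left-right\n' | sed 's/\([a-z]*\)-\([a-z]*\)/\2-\1/')
echo "swapped: $swapped"
```

The first command counts "hello" and "book," and the second prints "right-left."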
Character Group | Meaning |
[:alnum:] | Alphanumeric |
[:cntrl:] | Control Character |
[:lower:] | Lower case character |
[:space:] | Whitespace |
[:alpha:] | Alphabetic |
[:digit:] | Digit |
[:print:] | Printable character |
[:upper:] | Upper Case Character |
[:blank:] | Space or tab |
[:graph:] | Visible character (printable, but not a space) |
[:punct:] | Punctuation |
[:xdigit:] | Hexadecimal digit |
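These character groups only work inside a bracket expression, so in practice you write "[[:digit:]]" and not "[:digit:]" by itself. Here is a sketch with grep. The sample lines and the count helper are made up, and LC_ALL=C keeps the groups limited to ASCII.

```shell
# Made-up input.
sample=$(printf 'abc\nABC\n123\na b\n')

# count PATTERN: print how many sample lines match the basic regular expression.
count() {
    printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

echo "all digits:        $(count '^[[:digit:]][[:digit:]]*$')"   # 1: "123"
echo "all upper case:    $(count '^[[:upper:]][[:upper:]]*$')"   # 1: "ABC"
echo "all alphanumeric:  $(count '^[[:alnum:]][[:alnum:]]*$')"   # 3: every line except "a b"
echo "contains a space:  $(count '[[:space:]]')"                 # 1: "a b"
```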
Regular Expression | Type | Meaning |
\t | Character Set | tab |
\n | Character Set | newline |
\r | Character Set | return |
\f | Character Set | form feed |
\a | Character Set | alarm (bell) |
\e | Character Set | escape |
\033 | Character Set | octal character |
\x1B | Character Set | hex character |
\c[ | Character Set | control character |
\l | Character Set | lowercase next character |
\u | Character Set | uppercase next character |
\L | Character Set | lowercase until \E |
\U | Character Set | uppercase until \E |
\E | Character Set | end case modification |
\Q | Character Set | quote (disable) pattern metacharacters until \E |
\w | Character Set | Match a "word" character |
\W | Character Set | Match a non-word character |
\s | Character Set | Match a whitespace character |
\S | Character Set | Match a non-whitespace character |
\d | Character Set | Match a digit character |
\D | Character Set | Match a non-digit character |
\b | Anchor | Match a word boundary |
\B | Anchor | Match a non-(word boundary) |
\A | Anchor | Match only at beginning of string |
\Z | Anchor | Match only at EOS, or before newline |
\z | Anchor | Match only at end of string |
\G | Anchor | Match only where previous m//g left off |
Example of PERL Extended, multi-line regular expression
    m{
        \(
          (                 # Start group
            [^()]+          # anything but '(' or ')'
          |                 # or
            \( [^()]* \)
          )+                # end group
        \)
    }x