CHAPTER 7 Text Processing

7.1 Regular Expression Syntax

Some text processing programs, such as grep, egrep, sed, awk and vi, let you search on patterns instead of fixed strings. These text patterns are known as regular expressions. You form a regular expression by combining normal characters with special characters, also known as meta-characters, according to the rules below. With these regular expressions you can do pattern matching on text data. A regular expression is built from three kinds of components:

Anchors which tie the pattern to a location on the line
Character sets which match a character at a single position
Modifiers which specify how many times to repeat the previous expression

Regular expression syntax is as follows. Some programs will accept all of these, others may only accept some.

. match any single character except <newline>

* match zero or more instances of the single character (or meta-character) immediately preceding it

[abc] match any of the characters enclosed

[a-d] match any character in the enclosed range

[^exp] match any single character not in the enclosed set

^abc the regular expression must start at the beginning of the line (Anchor)

abc$ the regular expression must end at the end of the line (Anchor)

\ treat the next character literally. This is normally used to escape the meaning of special characters such as "." and "*".

\{n,m\} match the regular expression preceding this a minimum number of n times and a maximum of m times (0 through 255 are allowed for n and m). The \{ and \} sets should be thought of as single operators. In this case the \ preceding the bracket does not escape its special meaning, but rather turns on a new one.

\<abc\> will match the enclosed regular expression only where it forms a separate word. A word begins at the start of the line or after any character other than a letter, digit or underscore (_), and ends before any such character or at the end of the line. Again the \< and \> sets should be thought of as single operators.

\(abc\) saves the enclosed pattern in a buffer. Up to nine patterns can be saved for each line. You can reference these later with the \n character set. Again the \( and \) sets should be thought of as single operators.

\n where n is between 1 and 9. This matches the nth expression previously saved for this line. Expressions are numbered starting from the left. The \n should be thought of as a single operator.

& stands for the text matched by the search pattern (used in the replacement string)
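
The operators above can be tried out with grep and sed. The following is a minimal sketch, assuming a POSIX shell and a grep that supports the \< and \> word operators (GNU grep does; not every grep will). The sample text and the variable names are made up for illustration.

```shell
# Sample input, one item per line (hypothetical data for illustration).
text='cat
concat
cats and dogs
a.b
aXb
baaad
bad'

# Anchors: ^ ties the match to the start of the line, $ to the end.
starts=$(printf '%s\n' "$text" | grep '^cat')
ends=$(printf '%s\n' "$text" | grep 'cat$')

# \. is a literal period; an unescaped . would also match aXb.
literal=$(printf '%s\n' "$text" | grep 'a\.b')

# \{2,3\}: two or three a's between the b and the d.
interval=$(printf '%s\n' "$text" | grep 'ba\{2,3\}d')

# \<cat\>: cat only as a separate word (word operators assumed available).
word=$(printf '%s\n' "$text" | grep '\<cat\>')

# \( \) saves a pattern and \1, \2 recall it: swap two words.
swapped=$(echo 'hello world' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

# & in the replacement stands for the text that was matched.
quoted=$(echo 'hello world' | sed 's/w[a-z]*/"&"/')

printf '%s\n' "$starts" "$ends" "$literal" "$interval" "$word" "$swapped" "$quoted"
```

Here `starts` holds the lines cat and cats and dogs, `ends` holds cat and concat, and `word` holds only cat, since concat and cats contain the letters but not as a separate word.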

There are a few meta-characters used only by awk and egrep. These are:

+ match one or more of the preceding expression

? match zero or one of the preceding expression

| separator. Match either the preceding or following expression.

( ) group the regular expressions within and apply the match to the set.
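
These extended meta-characters can be sketched with grep -E, which is the POSIX spelling of egrep. The word list and variable names below are assumptions made up for illustration.

```shell
# Sample input, one word per line (hypothetical data for illustration).
words='color
colour
colouur
ct
cat
caat
dog
bird'

# +: one or more a's between the c and the t (ct does not qualify).
plus=$(printf '%s\n' "$words" | grep -E '^ca+t$')

# ?: the u may appear once or not at all (colouur does not qualify).
optional=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# | with ( ): either alternative, with the anchors applied to the group.
either=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')

printf '%s\n' "$plus" "$optional" "$either"
```

Without the parentheses, `^cat|dog$` would mean "cat at the start of the line, or dog at the end of the line", because | separates everything to its left from everything to its right.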

Some examples of the more commonly used regular expressions are:

regular expression    matches

cat the string cat

.at any occurrence of a single character followed by at, such as cat, rat, mat, bat, fat, hat

xy*z any occurrence of an x, followed by zero or more y's, followed by a z.

^cat cat at the beginning of the line

cat$ cat at the end of the line

\* any occurrence of an asterisk

[cC]at cat or Cat

[^a-zA-Z] any occurrence of a non-alphabetic character

[0-9]$ any line ending with a digit

[A-Z][A-Z]* one or more upper case letters

[A-Z]* zero or more upper case letters (In other words, anything, since the empty string also matches.)
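
A few of these examples can be checked with grep. This is a small sketch, assuming a POSIX shell; the sample lines and variable names are invented for illustration, and the C locale is selected so that [A-Z] means exactly the ASCII upper case letters.

```shell
# Use the C locale so that ranges such as [A-Z] are ASCII-only.
export LC_ALL=C

# Sample input (hypothetical data for illustration).
lines='xz
xyyz
Cat
room 42
plain words'

# xy*z: an x, zero or more y's, then a z.
xyz=$(printf '%s\n' "$lines" | grep 'xy*z')

# [cC]at: cat or Cat.
cats=$(printf '%s\n' "$lines" | grep '[cC]at')

# [0-9]$: lines whose last character is a digit.
digit_end=$(printf '%s\n' "$lines" | grep '[0-9]$')

# [A-Z][A-Z]* needs at least one upper case letter, whereas
# [A-Z]* can match nothing at all, so it selects every line.
# grep -c prints the number of matching lines.
some_upper=$(printf '%s\n' "$lines" | grep -c '[A-Z][A-Z]*')
any_line=$(printf '%s\n' "$lines" | grep -c '[A-Z]*')

printf '%s\n' "$xyz" "$cats" "$digit_end" "$some_upper" "$any_line"
```

The two counts differ: only one of the five sample lines contains an upper case letter, but all five are selected by [A-Z]*.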

Introduction to Unix - 14 AUG 1996
