Copyright 2001 Bruce Barnett and General Electric Company
All rights reserved
You are allowed to print copies of this tutorial for your personal use, and link to this page, but you are not allowed to make electronic copies, or redistribute this tutorial in any form without permission.
Awk is an extremely versatile programming language for working on files. We'll teach you just enough to understand the examples in this book, plus a smidgen.
In the past I have covered grep and sed. This section discusses AWK, another cornerstone of UNIX shell programming. There are three variations of AWK:

AWK - the original from AT&T
NAWK - a newer, improved version from AT&T
GAWK - the Free Software Foundation's GNU version
Why is AWK so important? It is an excellent filter and report writer. Many UNIX utilities generate rows and columns of information. AWK is an excellent tool for processing these rows and columns, and it is easier to use AWK than most conventional programming languages. It can be considered a pseudo-C interpreter, as it understands the same arithmetic operators as C. AWK also has string manipulation functions, so it can search for particular strings and modify the output. AWK also has associative arrays, which are incredibly useful, and are a feature most computing languages lack. Associative arrays can make a complex problem a trivial exercise.
I won't exhaustively cover AWK. That is, I will cover the essential parts, and avoid the many variants of AWK. It might be too confusing to discuss three different versions of AWK. I won't cover the GNU version of AWK called "gawk." Similarly, I will not discuss the new AT&T AWK called "nawk." The new AWK comes on the Sun system, and you may find it superior to the old AWK in many ways. In particular, it has better diagnostics, and won't print out the infamous "bailing out near line ..." message the original AWK is prone to print. Instead, "nawk" prints out the line it didn't understand, and highlights the bad parts with arrows. If you find yourself needing a feature that is very difficult or impossible to do in AWK, I suggest you either use NAWK, or convert your AWK script into PERL using the "a2p" conversion program which comes with PERL. PERL is a marvelous language, and I use it all the time, but I do not plan to cover PERL in these tutorials. Having made my intention clear, I can continue with a clear conscience.
Many UNIX utilities have strange names. AWK is one of those utilities. It is not an abbreviation for awkward. In fact, it is an elegant and simple language. The word "AWK" is derived from the initials of the language's three developers: A. Aho, B. W. Kernighan and P. Weinberger.
The essential organization of an AWK program follows the form:
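The example program itself isn't reproduced here, but a complete program of this three-part form looks something like the following sketch (the "START" and "STOP" strings are my own choices, not necessarily the original's):

```shell
# Hedged sketch of the BEGIN / main-action / END form: BEGIN runs before
# any input, the middle action runs once per input line, END runs after
# the last line.
printf 'one\ntwo\n' | awk '
BEGIN { print "START" }
      { print         }
END   { print "STOP"  }
'
```

Because BEGIN runs before any input is read and END after the last line, this prints one extra line before the input and one after it.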
adds one line before and one line after the input file. This isn't very useful, but with a simple change, we can make this into a typical AWK program:
The characters "\t" indicate a tab character so the output lines up on even boundaries. The "$8" and "$3" have a meaning similar to a shell script. Instead of the eighth and third argument, they mean the eighth and third field of the input line. You can think of a field as a column, and the action you specify operates on each line or row read in.
There are two differences between AWK and a shell processing the characters within double quotes. AWK understands special characters that follow the "\" character. The UNIX shells do not. Also, unlike the shell (and PERL), AWK does not evaluate variables within strings. The second line could not be written like this:
This might generate the following if there were only two files in the current directory:
a.file barnett
another.file barnett
- DONE -
The script itself can be written in many ways.
The C shell version would look like this:
#!/bin/csh -f
awk '\
BEGIN { print "File\tOwner" }\
{ print $8, "\t", $3}\
END { print " - DONE -" }\
'
Click here to get file: awk_example1.csh
As you can see in the above script, each line of the AWK script must
have a backslash if it is not the last line of the script.
This is necessary as the C shell doesn't, by default, allow strings
to span more than one line.
The Bourne shell allows quoted strings
to span several lines:
#!/bin/sh
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", $3}
END { print " - DONE -" }
'
Click here to get file: awk_example1.sh
The third form is to store the commands in a file, and execute the file itself, beginning it with the line "#!/bin/awk -f".
Change the permission with the chmod command, and the script becomes a new command. Notice the "-f" option of AWK above, which is also used in the third format. It specifies the file containing the instructions. As you can see, AWK considers lines that start with a "#" to be a comment. Well, anything from the "#" to the end of the line is a comment. However, I always comment my AWK scripts with the "#" at the start of the line, for reasons I'll discuss later.
Which format should you use? I prefer the last format when possible. It's shorter and simpler. It's also easier to debug problems. If you need to use a shell, and want to avoid using too many files, you can combine them as we did in the first and second example.
The format of AWK is not free-form.
You cannot put new line breaks just anywhere.
They must go in particular locations.
To be precise, in the original AWK
you can insert a new line character
after the curly braces, and at the end of a command,
but not elsewhere.
If you wanted to break a long line into two lines at any other place,
you must use
a backslash:
#!/bin/awk -f
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print " - DONE -" }
Click here to get file: awk_example2.awk
The Bourne shell version would be
#!/bin/sh
awk '
BEGIN { print "File\tOwner" }
{ print $8, "\t", \
$3}
END { print "done"}
'
Click here to get file: awk_example2.sh
while the C shell would be
#!/bin/csh -f
awk '\
BEGIN { print "File\tOwner" }\
{ print $8, "\t", \\
$3}\
END { print "done"}\
'
Click here to get file: awk_example2.csh
As you can see, this demonstrates how awkward the C shell is when enclosing an AWK script. Not only are backslashes needed for every line, some lines need two. Many people will warn you about the C shell. Some of the problems are subtle, and you may never see them. Try to include an AWK or sed script within a C shell script, and the backslashes will drive you crazy. This is what convinced me to learn the Bourne shell years ago, when I was starting out. I strongly recommend you use the Bourne shell for any AWK or sed script. If you don't use the Bourne shell, then you should learn it. As a minimum, learn how to set variables, which by some strange coincidence is the subject of the next section.
A suggested use is:
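The suggested script itself isn't shown here; presumably it resembled this sketch (the name "Column.sh" and the exact wording are my assumptions):

```shell
# Hypothetical first attempt: pass the column number as a shell argument.
# As explained next, this does NOT work: the single quotes keep the shell
# from expanding $column, so AWK sees an uninitialized variable (value 0),
# and $0 is the whole line.
cat > /tmp/Column.sh <<'EOF'
#!/bin/sh
column=$1
awk '{print $column}'
EOF
chmod +x /tmp/Column.sh
echo "a b c" | /tmp/Column.sh 3   # prints the whole line, not field 3
```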
This would print the third column from the ls command, which would be the owner of the file. You can change this into a utility that counts how many files are owned by each user by adding a sort and a duplicate count, such as piping the output through "sort | uniq -c".
Only one problem: the script doesn't work.
The value of the
"column" variable is not seen by AWK. Change
"awk" to
"echo" to check. You need to turn off the quoting
when the variable is seen. This can be done by
ending the quoting, and restarting it after the variable:
#!/bin/sh
column=$1
awk '{print $'$column'}'
Click here to get file: Column2.sh
This is a very important concept, and throws experienced programmers a curve ball. In many computer languages, a string has a start quote, an end quote, and the contents in between. If you want to include a special character inside the quote, you must prevent the character from having the typical meaning. In C this is done by putting a backslash before the character. In other languages, there is a special combination of characters to do this. In the C and Bourne shell, the quote is just a switch. It turns the interpretation mode on or off. There is really no such concept as "start of string" and "end of string." The quotes toggle a switch inside the interpreter. The quote character is not passed on to the application. This is why there are two pairs of quotes above. Notice there are two dollar signs. The first one is quoted, and is seen by AWK. The second one is not quoted, so the shell evaluates the variable, and replaces "$column" by the value. If you don't understand, either change "awk" to "echo," or change the first line to read "#!/bin/sh -x."
Some improvements are needed, however. The Bourne shell has a mechanism to provide a value for a variable if the value isn't set, or is set and the value is an empty string. This is done by using the format "${variable:-defaultvalue}".
We can save a line by combining these two steps:
#!/bin/sh
awk '{print $'${1:-1}'}'
Click here to get file: Column4.sh
It is hard to read, but it is compact. There is one other method that can be used. If you execute an AWK command and include an assignment of the form "variable=value" on the command line,
this variable will be set when the AWK script starts.
An example of this use would be:
#!/bin/sh
awk '{print $c}' c=${1:-1}
Click here to get file: Column5.sh
This last variation does not have the problems with quoting the previous example had. You should master the earlier example, however, because you can use it with any script or command. The second method is special to AWK.
+--------------------------------------------+
|                AWK Table 1                 |
|              Binary Operators              |
|Operator   Type         Meaning             |
+--------------------------------------------+
|+          Arithmetic   Addition            |
|-          Arithmetic   Subtraction         |
|*          Arithmetic   Multiplication      |
|/          Arithmetic   Division            |
|%          Arithmetic   Modulo              |
|<space>    String       Concatenation       |
+--------------------------------------------+

Using variables with the value of "7" and "3," AWK returns the following results for each operator when using the print command:
+---------------------+
|Expression   Result  |
+---------------------+
|7+3          10      |
|7-3          4       |
|7*3          21      |
|7/3          2.33333 |
|7%3          1       |
|7 3          73      |
+---------------------+
There are a few points to make. The modulus operator finds the remainder after an integer divide. The print command outputs a floating point number on the divide, but an integer for the rest. The string concatenation operator is confusing, since it isn't even visible. Place a space between two variables and the strings are concatenated together. This also shows that numbers are converted automatically into strings when needed. Unlike C, AWK doesn't have "types" of variables. There is one type only, and it can be a string or number. The conversion rules are simple. A number can easily be converted into a string. When a string is converted into a number, AWK does its best. The string "123" will be converted into the number 123. However, the string "123X" will be converted into the number 0. (NAWK behaves differently, and converts the string into the number 123, which is found at the beginning of the string.)
AWK also supports the "++" and "--" operators of C. Both increment or decrement the variables by one. The operator can only be used with a single variable, and can be before or after the variable. The prefix form modifies the value, and then uses the result, while the postfix form gets the results of the variable, and afterwards modifies the variable. As an example, if X has the value of 3, then the AWK statement
would print the numbers 3 and 5. These operators are also assignment operators, and can be used by themselves on a line:
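The statements themselves aren't reproduced here; something like this sketch matches the described behavior:

```shell
# Postfix x++ uses the old value (3) and then increments; by the time the
# prefix ++x runs, x is 4, so ++x makes it 5 and uses 5.
awk 'BEGIN { x = 3; print x++; print ++x }' </dev/null

# Stand-alone statement form: the increment by itself on a line.
awk 'BEGIN { x = 3; x++; print x }' </dev/null
```

The first command prints 3 and then 5; the second prints 4.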
Certain operators have precedence over others; parenthesis can be used to control grouping. The statement
is the same as
Both print out "74."
Notice spaces can be added for readability. AWK, like C, has special assignment operators, which combine a calculation with an assignment. Instead of saying

x=x+2;

you can more concisely say:

x+=2;
The complete list follows:
+-----------------------------------------+
|               AWK Table 2               |
|          Assignment Operators           |
|Operator   Meaning                       |
+-----------------------------------------+
|+=         Add result to variable        |
|-=         Subtract result from variable |
|*=         Multiply variable by result   |
|/=         Divide variable by result     |
|%=         Apply modulo to variable      |
+-----------------------------------------+
Arithmetic values can also be converted into boolean conditions by using relational operators:
+---------------------------------------+
|              AWK Table 3              |
|         Relational Operators          |
|Operator   Meaning                     |
+---------------------------------------+
|==         Is equal                    |
|!=         Is not equal to             |
|>          Is greater than             |
|>=         Is greater than or equal to |
|<          Is less than                |
|<=         Is less than or equal to    |
+---------------------------------------+
These operators are the same as the C operators. They can be used to compare numbers or strings. With respect to strings, lower case letters are greater than upper case letters.
+-----------------------------+
|         AWK Table 4         |
|Regular Expression Operators |
|Operator   Meaning           |
+-----------------------------+
|~          Matches           |
|!~         Doesn't match     |
+-----------------------------+
The order in this case is particular. The regular expression must be enclosed by slashes, and comes after the operator. AWK supports extended regular expressions, so the following are examples of valid tests:
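The original's examples aren't shown here, but tests of this shape are valid (the patterns below are illustrative, not the original's):

```shell
# ~ succeeds when the regular expression matches the value on its left;
# !~ succeeds when it does not match.
echo "word123" | awk '
$0 ~ /[0-9]+$/  { print "ends in digits" }
$0 !~ /^[0-9]/  { print "does not start with a digit" }
'
```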
At this point, you can use AWK as a language for simple calculations.
If you wanted to calculate something, and not read any lines for
input,
you could use the
BEGIN keyword discussed earlier, combined with a
exit command:
#!/bin/awk -f
BEGIN {
# Print the squares from 1 to 10 the first way
i=1;
while (i <= 10) {
printf "The square of %d is %d\n", i, i*i;
i = i+1;
}
# do it again, using more concise code
for (i=1; i <= 10; i++) {
printf "The square of %d is %d\n", i, i*i;
}
# now end
exit;
}
Click here to get file: awk_print_squares.awk
The following asks for a number, and then squares it:
#!/bin/awk -f
BEGIN {
print "type a number";
}
{
print "The square of ", $1, " is ", $1*$1;
print "type another number";
}
END {
print "Done"
}
Click here to get file: awk_ask_for_square.awk
The above isn't a good filter, because it asks for input each time. If you pipe the output of another program into it, you would generate a lot of meaningless prompts.
Here is a filter that you should find useful. It counts lines, totals
up the numbers in the first column, and calculates the average.
Pipe
"wc -w *" into it, and it will count files, and tell you the average number of
words per file, as well as the total words and the number of files.
#!/bin/awk -f
BEGIN {
# How many lines
lines=0;
total=0;
}
{
# this code is executed once for each line
# increase the number of files
lines++;
# increase the total size, which is field #1
total+=$1;
}
END {
# end, now output the total
print lines " lines read";
print "total is ", total;
if (lines > 0 ) {
print "average is ", total/lines;
} else {
print "average is 0";
}
}
Click here to get file: average.awk
You can pipe the output of "ls -s" into this filter to count the number of files, the total size, and the average size. There is a slight problem with this script, as it includes the line where "ls" reports the grand total. This causes the number of files to be off by one. Skipping that summary line will fix the problem. Note the code which prevents a divide by zero. This is common in well-written scripts. I also initialize the variables to zero. This is not necessary, but it is a good habit.
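The exact edit isn't shown here; one hedged way to skip the summary line (whose first field is the word "total" in "ls -s" output) is to put a pattern in front of the action:

```shell
# Sketch: only count lines whose first field isn't the literal "total".
printf 'total 10\n4 one.c\n6 two.c\n' | awk '
$1 != "total" { lines++; total += $1 }
END { print lines " lines read"; print "total is", total }
'
```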
I have mentioned two kinds of variables: positional and user defined. A user defined variable is one you create. A positional variable is not a special variable, but a function triggered by the dollar sign. Therefore
and
do the same thing: print the first field on the line. There are two more points about positional variables that are very useful. The variable "$0" refers to the entire line that AWK reads in. That is, if you had eight fields in a line,

print $0;

is similar to

print $1, $2, $3, $4, $5, $6, $7, $8;
This will change the spacing between the fields; otherwise, they behave the same. You can modify positional variables. The following commands

$2="";
print;

delete the second field. If you had four fields, and wanted to print out the second and fourth field, there are two ways. This is the first:
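The first way isn't reproduced here; presumably it blanks the unwanted fields and prints the modified line, something like:

```shell
# Sketch of the "first way": empty fields 1 and 3, then print the whole
# rebuilt line. Note the extra spaces where the emptied fields remain.
echo "one two three four" | awk '{ $1=""; $3=""; print }'
```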
and the second
#!/bin/awk -f
{
print $2, $4;
}
These perform similarly, but not identically. The number of spaces between the values varies. There are two reasons for this. The actual number of fields does not change: setting a positional variable to an empty string does not delete the variable. It's still there, but the contents have been deleted. The other reason is the way AWK outputs the entire line. There is an output field separator that specifies what character to put between the fields on output. The first example outputs four fields, while the second outputs two. In between each field is a space. This would be easier to explain if the characters between fields could be made more visible. Well, they can. AWK provides special variables for just that purpose.
There is a way to do this without the command line option.
The variable
"FS" can be set like any variable, and has the same function
as the
"-F" command line option. The following is a
script that has the same function as the one above.
#!/bin/awk -f
BEGIN {
FS=":";
}
{
if ( $2 == "" ) {
print $1 ": no password!";
}
}
Click here to get file: awk_nopasswd.awk
The second form can be used to create a UNIX utility, which I will name "chkpasswd," and executed like this:
The command "chkpasswd -F:" cannot be used, because AWK will never see this argument. All interpreter scripts accept one and only one argument, which is immediately after the "#!/bin/awk" string. In this case, the single argument is "-f." Another difference between the command line option and the internal variable is the ability to set the input field separator to be more than one character. If you specify
then AWK will split a line into fields wherever it sees those two characters, in that exact order. You cannot do this on the command line.
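The two-character separator itself isn't shown here; as an illustration (a colon followed by a space is my assumed choice), AWK treats the multi-character value as a unit:

```shell
# With FS=": ", only a colon FOLLOWED BY a space splits fields;
# a bare colon does not, so "b:c" stays together as one field.
printf 'a: b:c: d\n' | awk 'BEGIN { FS=": " } { print $2 }'
```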
There is a third advantage the internal variable has over the command line option: you can change the field separator character as many times as you want while reading a file. Well, at most once for each line. You can even change it depending on the line you read. Suppose you had the following file which contains the numbers 1 through 7 in three different formats. Lines 4 through 6 have colon separated fields, while the others are separated by spaces.
The AWK program can easily switch between these formats:
#!/bin/awk -f
{
if ($1 == "#START") {
FS=":";
} else if ($1 == "#STOP") {
FS=" ";
} else {
#print the Roman number in column 3
print $3
}
}
Click here to get file: awk_example3.awk
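The sample input file isn't reproduced here; a reconstruction matching the description (numbers 1 through 7, with lines 4 through 6 colon-separated and bracketed by marker lines) drives the same switching logic like this:

```shell
# Reconstructed input plus the switching logic; prints the Roman numeral
# column regardless of which separator each line uses. Note that each FS
# change takes effect on the NEXT line read, which is exactly what the
# marker lines rely on.
printf '%s\n' 'ONE 1 I' 'TWO 2 II' '#START' 'THREE:3:III' \
    'FOUR:4:IV' 'FIVE:5:V' '#STOP' 'SIX 6 VI' 'SEVEN 7 VII' |
awk '{
    if ($1 == "#START")     { FS=":" }
    else if ($1 == "#STOP") { FS=" " }
    else                    { print $3 }
}'
```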
Note the field separator variable retains its value until it is explicitly changed. You don't have to reset it for each line. Sounds simple, right? However, I have a trick question for you. What happens if you change the field separator while reading a line? That is, suppose you had the following line
and you executed the following script:
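Neither the line nor the script survives here; a reconstruction consistent with the answer that follows (the actual words are my assumptions) is:

```shell
# Change FS between two prints of $2: the line was already split when it
# was read, so both prints show the same space-split second field.
echo 'One Two:Three:4 Five' | awk '{ print $2; FS=":"; print $2 }'
```

(The other behavior described below, where deleting the first print yields "Three", depends on the original AWK's field-splitting order; modern implementations may print the same value in both cases.)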
What would be printed? "Three" or "Two:Three:4?" Well, the script would print out "Two:Three:4" twice. However, if you deleted the first print statement, it would print out "Three" once! I thought this was very strange at first, but after pulling out some hair, kicking the desk, and yelling at myself and everyone who had anything to do with the development of UNIX, it is intuitively obvious. You just have to be thinking like a professional programmer to realize it is intuitive. I shall explain, and prevent you from causing yourself physical harm.
If you change the field separator before you read the line, the change affects what you read. If you change it after you read the line, it will not redefine the variables. You wouldn't want a variable to change on you as a side-effect of another action. A programming language with hidden side effects is broken, and should not be trusted. AWK allows you to redefine the field separator either before or after you read the line, and does the right thing each time. Once you read the variable, the variable will not change unless you change it. Bravo!
To illustrate this further, here is another version of the
previous code that changes the field separator dynamically.
In this case, AWK does it by examining field
"$0," which is the entire line.
When the line contains a colon, the field separator is a colon,
otherwise, it is a space:
#!/bin/awk -f
{
if ( $0 ~ /:/ ) {
FS=":";
} else {
FS=" ";
}
#print the third field, whatever format
print $3
}
Click here to get file: awk_example4.awk
This example eliminates the need to have the special "#START" and "#STOP" lines in the input.
print $2 $3

and

print $2, $3
The first example prints out one field, and the second prints out two fields. In the first case, the two positional parameters are concatenated together and output without a space. In the second case, AWK prints two fields, and places the output field separator between them. Normally this is a space, but you can change this by modifying the variable "OFS."
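The contrast, and the effect of changing "OFS," can be sketched like this:

```shell
# Concatenation (no comma) joins the fields into one output field; the
# comma inserts the output field separator OFS, a space by default.
echo 'A B' | awk '{ print $1 $2; print $1, $2; OFS="-"; print $1, $2 }'
```

This prints "AB", then "A B", then "A-B".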
If you wanted to copy the password file, but delete the
encrypted password, you could use AWK:
#!/bin/awk -f
BEGIN {
FS=":";
OFS=":";
}
{
$2="";
print
}
Click here to get file: delete_passwd.awk
Give this script the password file, and it will delete the password, but leave everything else the same. You can make the output field separator any number of characters. You are not limited to a single character.
It is useful to know how many fields are on a line; the variable "NF" is set to the total number of fields on the current line. You may want to have your script change its operation based on the number of fields.
As an example, the command
"ls -l" may generate eight or nine fields, depending on which version you are
executing. The System V version,
"/usr/bin/ls -l" generates nine fields, which is equivalent to the Berkeley
"/usr/ucb/ls -lg" command. If you wanted to print the owner and filename
then the following AWK script would work with either version of
"ls:"
#!/bin/awk -f
# parse the output of "ls -l"
# print owner and filename
# remember - Berkeley ls -l has 8 fields, System V has 9
{
if (NF == 8) {
print $3, $8;
} else if (NF == 9) {
print $3, $9;
}
}
Click here to get file: owner_group.awk
Don't forget the variable can be prepended with a "$." This allows you to print the last field of any line:
#!/bin/awk -f
{ print $NF; }
Click here to get file: print_last_field.awk
One warning about AWK. There is a limit of 99 fields in a single line. PERL does not have any such limitations.
Another useful variable is
"NR." This tells you the number of records, or the line number.
You can use AWK to only examine certain lines.
This example skips the first 99 lines, and puts a line number before each line from line 100 on:
#!/bin/awk -f
{ if (NR >= 100) {
print NR, $0;
  }
}
Click here to get file: awk_example5.awk
Normally, AWK reads one line at a time, and breaks up the line into
fields. You can set the
"RS" variable to change AWK's definition of a
"line." If you set it to an empty string, then AWK
will read the entire file into memory.
You can combine this with changing the
"FS" variable. This example treats each line as a field,
and prints out the second and third line:
#!/bin/awk -f
BEGIN {
# change the record separator from newline to nothing
RS=""
# change the field separator from whitespace to newline
FS="\n"
}
{
# print the second and third line of the file
print $2, $3;
}
The two lines are printed with a space between.
Also this will only work if the input file is less than 100 lines,
therefore this technique is limited. You can use it to
break words up, one word per line, using this:
#!/bin/awk -f
BEGIN {
RS=" ";
}
{
print ;
}
Click here to get file: oneword_per_line.awk
but this only works if all of the words are separated by a space. If there is a tab or punctuation inside, it would not work.
The default output record separator is a newline, like the input.
This can be set to be a newline and carriage return, if you need
to generate a text file for a non-UNIX system.
#!/bin/awk -f
# this filter adds a carriage return to all lines
# before the newline character
BEGIN {
ORS="\r\n"
}
{ print }
Click here to get file: add_cr.awk
The last variable known to regular AWK is
"FILENAME," which tells you the name of the file being read.
#!/bin/awk -f
# reports which file is being read
BEGIN {
f="";
}
{ if (f != FILENAME) {
print "reading", FILENAME;
f=FILENAME;
}
print;
}
Click here to get file: awk_example6.awk
This can be used if several files need to be parsed by AWK. Normally you use standard input to provide AWK with information. You can also specify the filenames on the command line. If the above script was called "testfilter," and if you executed it with
In this case, the second file will be called
"-," which is the conventional name for standard input.
I have used this when I want to put some information
before and after a filter operation. The prefix and postfix files
provide special data before and after the real data.
By checking the filename, you can parse the information
differently. This is also useful to report syntax errors
in particular files:
#!/bin/awk -f
{
if (NF == 6) {
# do the right thing
} else {
if (FILENAME == "-" ) {
print "SYNTAX ERROR, Wrong number of fields,",
"in STDIN, line #:", NR, "line: ", $0;
} else {
print "SYNTAX ERROR, Wrong number of fields,",
"Filename: ", FILENAME, "line # ", NR,"line: ", $0;
}
}
}
Click here to get file: awk_example7.awk
I have used dozens of different programming languages over the last 20 years, and AWK is the first language I found that has associative arrays. This term may be meaningless to you, but believe me, these arrays are invaluable, and simplify programming enormously. Let me describe a problem, and show you how associative arrays can be used to reduce coding time, giving you more time to explore another stupid problem you don't want to deal with in the first place.
Let's suppose you have a directory overflowing with files, and you want to find out how many files are owned by each user, and perhaps how much disk space each user owns. You really want someone to blame; it's hard to tell who owns what file. A filter that processes the output of ls would work:
But this doesn't tell you how much space each user is using. It also doesn't work for a large directory tree. This requires find and xargs:
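The find/xargs pipeline isn't shown here; a sketch of the idea (the paths below are placeholders of my own) is:

```shell
# Walk a directory tree and feed long listings of every plain file to a
# filter; "ls -l" output is what the AWK scripts that follow will parse.
rm -rf /tmp/awkdemo
mkdir -p /tmp/awkdemo/sub
touch /tmp/awkdemo/a /tmp/awkdemo/sub/b
find /tmp/awkdemo -type f -print | xargs ls -l | awk '{ print $NF }' | sort
```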
The third column of
"ls" is the username. The filter has to count
how many times it sees each user.
The typical program would have an array of usernames
and another array that counts how many times each
username has been seen. The index to both arrays
are the same; you use one array to find the index, and the second
to keep track of the count.
I'll show you one way to do it in AWK--the wrong way:
#!/bin/awk -f
# bad example of AWK programming
# this counts how many files each user owns.
BEGIN {
number_of_users=0;
}
{
# must make sure you only examine lines with 8 or more fields
if (NF>7) {
user=0;
# look for the user in our list of users
for (i=1; i<=number_of_users; i++) {
# is the user known?
if (username[i] == $3) {
# found it - remember where the user is
user=i;
}
}
if (user == 0) {
# found a new user
username[++number_of_users]=$3;
user=number_of_users;
}
# increase number of counts
count[user]++;
}
}
END {
for (i=1; i<=number_of_users; i++) {
print count[i], username[i]
}
}
Click here to get file: awk_example8.awk
I don't want you to
read this script. I told you it's the wrong way
to do it.
If you were a C programmer, and didn't know AWK,
you would probably use a technique like the one above.
Here is the same program, except this example that uses AWK's
associative arrays. The important point is to notice the difference in
size
between these two versions:
#!/bin/awk -f
{
username[$3]++;
}
END {
for (i in username) {
print username[i], i;
}
}
Click here to get file: count_users0.awk
This is shorter, simpler, and much easier to understand, once you understand exactly what an associative array is. The concept is simple. Instead of using a number to find an entry in an array, use anything you want. An associative array is an array whose index is a string. All arrays in AWK are associative. In this case, the index into the array is the third field of the "ls" command, which is the username. If the user is "bin," the main loop increments the count per user by effectively executing

username["bin"]++;
UNIX gurus may gleefully report that the 8 line AWK script can be replaced by:
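The replacement one-liner isn't reproduced here; the classic shell idiom (an assumption on my part) is to let sort and uniq do the counting:

```shell
# Count occurrences of the owner column with sort | uniq -c instead of an
# associative array; shown here with canned "ls -l" style input.
printf '%s\n' \
    '-rw-r--r-- 1 root staff 10 Jan 1 00:00 a' \
    '-rw-r--r-- 1 bin staff 10 Jan 1 00:00 b' \
    '-rw-r--r-- 1 root staff 10 Jan 1 00:00 c' |
awk '{ print $3 }' | sort | uniq -c | sort -rn
```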
This happens to fix the other problem. Apply this technique and you will make your AWK programs more robust and easier for others to use.
This is invalid in AWK. However, the following is perfectly fine:
I often find myself using certain techniques repeatedly in AWK. This example will demonstrate these techniques, and illustrate the power and elegance of AWK. The program is simple and common. The disk is full. Who's gonna be blamed? I just hope you use this power wisely. Remember, you may be the one who filled up the disk.
Having resolved my moral dilemma, by placing the burden squarely on your shoulders, I will describe the program in detail. I will also discuss several tips you will find useful in large AWK programs. First, initialize all arrays used in a for loop. There will be four arrays for this purpose. Initialization is easy:
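The initialization itself isn't shown here; one plausible form (the array names are assumptions matching the arrays described below) gives each of the four count arrays an entry up front, so that later "for (i in array)" loops are well-defined even when no input matched:

```shell
awk 'BEGIN {
    # create each array with a dummy entry; the "" index is never printed
    u_count[""]=0; g_count[""]=0; ug_count[""]=0; all_count[""]=0;
    n=0; for (i in u_count) n++;
    print "u_count entries:", n
}' </dev/null
```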
Another suggestion is to make sure your input is in the correct form. It's generally a good idea to be pessimistic, but I will add a simple but sufficient test in this example.
I placed the test and error clause up front, so the rest of the code won't be cluttered. AWK doesn't have user defined functions. NAWK, GAWK and PERL do.
The next piece of advice for complex AWK scripts is to define a name for each field used. In this case, we want the user, group and size in disk blocks. We could use the file size in bytes, but the block size corresponds to the blocks on the disk, a more accurate measurement of space. Disk blocks can be found by using "ls -s." This adds a column, so the username becomes the fourth column, etc. Therefore the script will contain:
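The assignments aren't reproduced here; assuming the output of "ls -sl" (block count first, then permissions, link count, owner, group), they would resemble:

```shell
# Name the fields once; later code says "user" instead of "$4", so a
# change in column layout means editing one place only.
echo '8 -rw-r--r-- 1 root staff 100 Jan 1 00:00 file' |
awk '{ size=$1; user=$4; group=$5; print user, group, size }'
```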
This will allow us to easily adapt to changes in input. We could use "$1" throughout the script, but if we changed the number of fields, which the "-s" option does, we'd have to change each field reference. You don't want to go through an AWK script, and change all the "$1" to "$2," and also change the "$2" to "$3" because those are really the "$1" that you just changed to "$2." Of course this is confusing. That's why it's a good idea to assign names to the fields. I've been there too.
Next the AWK script will count how many times each combination of users and groups occur. That is, I am going to construct a two-part index that contains the username and groupname. This will let me count up the number of times each user/group combination occurs, and how much disk space is used.
Consider this: how would you calculate the total for just a user, or for just a group? You could rewrite the script. Or you could take the user/group totals, and total them with a second script.
You could do it, but it's not the AWK way to do it. If you had to examine a bazillion files, and it takes a long time to run that script, it would be a waste to repeat this task. It's also inefficient to require two scripts when one can do everything. The proper way to solve this problem is to extract as much information as possible in one pass through the files. Therefore this script will find the number and size for each category:
This index will be used for all arrays. There is a space between the two values. This covers the total for the user/group combination. What about the other three arrays? I will use a "*" to indicate the total for all users or groups. Therefore the index for all files would be "* *" while the index for all of the files owned by user daemon would be "daemon *." The heart of the script totals up the number and size of each file, putting the information into the right category. I will use 8 arrays; 4 for file sizes, and 4 for counts:
u_count[user " *"]++;
g_count["* " group]++;
ug_count[user " " group]++;
all_count["* *"]++;
u_size[user " *"]+=size;
g_size["* " group]+=size;
ug_size[user " " group]+=size;
all_size["* *"]+=size;
This particular universal index will make sorting easier, as you will see. Also important is to sort the information in an order that is useful. You can try to force a particular output order in AWK, but why work at this, when it's a one line command for sort? The difficult part is finding the right way to sort the information. This script will sort information using the size of the category as the first sort field. The largest total will be the one for all files, so this will be one of the first lines output. However, there may be several ties for the largest number, and care must be used. The second field will be the number of files. This will help break a tie. Still, I want the totals and sub-totals to be listed before the individual user/group combinations. The third and fourth fields will be generated by the index of the array. This is the tricky part I warned you about. The script will output one string, but the sort utility will not know this. Instead, it will treat it as two fields. This will unify the results, and information from all 4 arrays will look like one array. The sort of the third and fourth fields will be dictionary order, and not numeric, unlike the first two fields. The "*" was used so these sub-total fields will be listed before the individual user/group combination.
The arrays will be printed using the following format:
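The elided loop presumably matched the END block of the full script below; here is a sketch for one of the four arrays, with a single hand-filled entry so the output shape is visible (the others follow the same pattern):

```shell
# sketch: print size, count, and the two-part index for one array;
# the BEGIN assignments stand in for a real pass over ls output
awk 'BEGIN { u_size["root *"] = 3173; u_count["root *"] = 81; }
END {
	for (i in u_count) {
		if (i != "") {
			print u_size[i], u_count[i], i;
		}
	}
}' /dev/null
```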
I only showed you one array, but all four are printed the same way. That's the essence of the script. The results are sorted, and I converted the space into a tab for cosmetic reasons.
size count user group
3173  81   *    *
3173  81   root *
2973  75   *    staff
2973  75   root staff
  88   3   *    daemon
  88   3   root daemon
  64   2   *    kmem
  64   2   root kmem
  48   1   *    tty
  48   1   root tty
This says there are 81 files in this directory, which take up 3173 disk blocks. All of the files are owned by root. 2973 disk blocks belong to group staff. There are 3 files with group daemon, which take up 88 disk blocks.
As you can see, the first line of information is the total for all users and groups. The second line is the sub-total for the user "root." The third line is the sub-total for the group "staff." Therefore the order of the sort is useful, with the sub-totals before the individual entries. You could write a simple AWK or grep script to obtain information from just one user or one group, and the information will be easy to sort.
There is only one problem. The /usr/ucb directory on my system only uses 1849 blocks; at least that's what du reports. Where's the discrepancy? The script does not understand hard links. This may not be a problem on most disks, because many users do not use hard links. Still, it does generate inaccurate results. In this case, the program vi also goes by the names e, ex, edit, view, and two others. The program only exists once, but has 7 names. You can tell because the link count (field 2) reports 7. This causes the file to be counted 7 times, which causes an inaccurate total. The fix is to only count multiple links once. Examining the link count will determine if a file has multiple links. However, how can you prevent counting a link twice? There is an easy solution: all of these files have the same inode number. You can find this number with the -i option to ls. To save memory, we only have to remember the inodes of files that have multiple links. This means we have to add another column to the input, and have to renumber all of the field references. It's a good thing there are only three. Adding a new field will be easy, because I followed my own advice.
The final script should be easy to follow. I have used variations of this hundreds of times and find it demonstrates the power of AWK as well as provides insight into a powerful programming paradigm. AWK solves these types of problems more easily than most languages. But you have to use AWK the right way.
A fully working version of the program, which accurately counts disk space, appears below:
#!/bin/sh
find . -type f -print | xargs /usr/bin/ls -islg |
awk '
BEGIN {
# initialize all arrays used in for loop
u_count[""]=0;
g_count[""]=0;
ug_count[""]=0;
all_count[""]=0;
}
{
# validate your input
if (NF != 11) {
# ignore
} else {
# assign field names
inode=$1;
size=$2;
linkcount=$4;
user=$5;
group=$6;
# should I count this file?
doit=0;
if (linkcount == 1) {
# only one copy - count it
doit++;
} else {
# a hard link - only count first one
seen[inode]++;
if (seen[inode] == 1) {
doit++;
}
}
# if doit is true, then count the file
if (doit) {
# total up counts in one pass
# use description array names
# use array index that unifies the arrays
# first the counts for the number of files
u_count[user " *"]++;
g_count["* " group]++;
ug_count[user " " group]++;
all_count["* *"]++;
# then the total disk space used
u_size[user " *"]+=size;
g_size["* " group]+=size;
ug_size[user " " group]+=size;
all_size["* *"]+=size;
}
}
}
END {
# output in a form that can be sorted
for (i in u_count) {
if (i != "") {
print u_size[i], u_count[i], i;
}
}
for (i in g_count) {
if (i != "") {
print g_size[i], g_count[i], i;
}
}
for (i in ug_count) {
if (i != "") {
print ug_size[i], ug_count[i], i;
}
}
for (i in all_count) {
if (i != "") {
print all_size[i], all_count[i], i;
}
}
} ' |
# numeric sort - biggest numbers first
# sort fields 0 and 1 first (sort starts with 0)
# followed by dictionary sort on fields 2 + 3
sort +0nr -2 +2d |
# add header
(echo "size count user group";cat -) |
# convert space to tab - makes it nice output
# the second set of quotes contains a single tab character
tr ' ' '	'
# done - I hope you like it
Click here to get file: count_users3.awk
Remember when I said I didn't need to use 4 different arrays? I can use just one. This is more confusing, but more concise:
#!/bin/sh
find . -type f -print | xargs /usr/bin/ls -islg |
awk '
BEGIN {
# initialize all arrays used in for loop
count[""]=0;
}
{
# validate your input
if (NF != 11) {
# ignore
} else {
# assign field names
inode=$1;
size=$2;
linkcount=$4;
user=$5;
group=$6;
# should I count this file?
doit=0;
if (linkcount == 1) {
# only one copy - count it
doit++;
} else {
# a hard link - only count first one
seen[inode]++;
if (seen[inode] == 1) {
doit++;
}
}
# if doit is true, then count the file
if (doit) {
# total up counts in one pass
# use description array names
# use array index that unifies the arrays
# first the counts for the number of files
count[user " *"]++;
count["* " group]++;
count[user " " group]++;
count["* *"]++;
# then the total disk space used
# NOTE: "size" is already used above as a scalar, so the
# totals need a different array name; I use "totalsize"
totalsize[user " *"]+=size;
totalsize["* " group]+=size;
totalsize[user " " group]+=size;
totalsize["* *"]+=size;
}
}
}
END {
# output in a form that can be sorted
for (i in count) {
if (i != "") {
print totalsize[i], count[i], i;
}
}
} ' |
# numeric sort - biggest numbers first
# sort fields 0 and 1 first (sort starts with 0)
# followed by dictionary sort on fields 2 + 3
sort +0nr -2 +2d |
# add header
(echo "size count user group";cat -) |
# convert space to tab - makes it nice output
# the second set of quotes contains a single tab character
tr ' ' '	'
# done - I hope you like it
Click here to get file: count_users.awk
So far, I described several simple scripts that provide useful information, in a somewhat ugly output format. Columns might not line up properly, and it is often hard to find patterns or trends without this unity. As you use AWK more, you will be desirous of crisp, clean formatting. To achieve this, you must master the printf function.
Printf has one of these syntactical forms:
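The elided forms were presumably a bare format string, or a format string followed by arguments; a sketch:

```shell
# printf with a format alone, and with a format plus arguments
awk 'BEGIN {
	printf("hello\n");
	printf("%s is %d\n", "x", 42);
}'
```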
The first argument to the printf function is the format. This is a string, or variable whose value is a string. This string, like all strings, can contain special escape sequences to print control characters.
+-------------------------------------------------------+
|                     AWK Table 5                       |
|                  Escape Sequences                     |
|Sequence  Description                                  |
+-------------------------------------------------------+
|\a        ASCII bell (NAWK only)                       |
|\b        Backspace                                    |
|\f        Formfeed                                     |
|\n        Newline                                      |
|\r        Carriage Return                              |
|\t        Horizontal tab                               |
|\v        Vertical tab (NAWK only)                     |
|\ddd      Character (1 to 3 octal digits) (NAWK only)  |
|\xdd      Character (hexadecimal) (NAWK only)          |
|\c        Any character c                              |
+-------------------------------------------------------+
It's difficult to explain the differences without being wordy. Hopefully I'll provide enough examples to demonstrate the differences.
With NAWK, you can print three tab characters using these three different representations:
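A sketch of the three representations (NAWK in the original; any modern awk handles these): the \t escape, the octal code \011, and a %c conversion of the decimal value 9 all emit a tab.

```shell
# three ways to print a tab: escape sequence, octal, and %c
awk 'BEGIN { printf("a\tb\011c%cd\n", 9); }'
```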
You should notice a difference between the printf function and the print function. Print terminates the line with the ORS character, and divides each field with the OFS separator. Printf does nothing unless you specify the action. Therefore you will frequently end each line with the newline character "\n", and you must specify the separating characters explicitly.
+----------------------------------------+
|              AWK Table 6               |
|           Format Specifiers            |
|Specifier  Meaning                      |
+----------------------------------------+
|%c         ASCII Character              |
|%d         Decimal integer              |
|%e         Floating Point number        |
|           (engineering format)         |
|%f         Floating Point number        |
|           (fixed point format)         |
|%g         The shorter of e or f,       |
|           with trailing zeros removed  |
|%o         Octal                        |
|%s         String                       |
|%x         Hexadecimal                  |
|%%         Literal %                    |
+----------------------------------------+
Again, I'll cover the differences quickly. Table 7 illustrates the differences. The first line states that printf("%c\n", 100.0) prints a "d."
+--------------------------------+
|          AWK Table 7           |
| Example of format conversions  |
|Format  Value     Results       |
+--------------------------------+
|%c      100.0     d             |
|%c      "100.0"   1 (NAWK?)     |
|%c      42        "             |
|%d      100.0     100           |
|%e      100.0     1.000000e+02  |
|%f      100.0     100.000000    |
|%g      100.0     100           |
|%o      100.0     144           |
|%s      100.0     100.0         |
|%s      "13f"     13f           |
|%d      "13f"     0 (AWK)       |
|%d      "13f"     13 (NAWK)     |
|%x      100.0     64            |
+--------------------------------+
This table reveals some differences between AWK and NAWK. When a string with numbers and letters is converted into an integer, AWK will return a zero, while NAWK will convert as much as possible. The second example, marked with "NAWK?" will return "d" on some earlier versions of NAWK, while later versions will return "1."
Using format specifiers, there is another way to print a double quote with NAWK. This demonstrates Octal, Decimal and Hexadecimal conversion. As you can see, it isn't symmetrical. Decimal conversions are done differently.
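For example (a sketch, not necessarily the original's exact code), %c with the decimal value 34 emits a double quote:

```shell
# ASCII 34 (decimal) is the double quote character
awk 'BEGIN { printf("%c\n", 34); }'
```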
I'll discuss each one separately.
If there is a number after the "%," this specifies the minimum number of characters to print. This is the width field. Spaces are added so the number of printed characters equals this number. Note that this is the minimum field size. If the field becomes too large, it will grow, so information will not be lost. Spaces are added to the left.
This format allows you to line up columns perfectly. Consider the following format:
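The elided format was presumably a fixed-width string followed immediately by a number, something like:

```shell
# %20s pads the string on the left to 20 characters, so the
# number begins at column 21
echo "filename 1234" | awk '{ printf("%20s%d\n", $1, $2); }'
```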
As long as the string is less than 20 characters, the number will start on the 21st column. If the string is too long, then the two fields will run together, making it hard to read. You may want to consider placing a single space between the fields, to make sure you will always have one space between the fields. This is very important if you want to pipe the output to another program.
Adding informational headers makes the output more readable.
Be aware that changing the format of the data may make it difficult to get the columns aligned perfectly. Consider the following script:
#!/usr/bin/awk -f
BEGIN {
printf("String Number\n");
}
{
printf("%10s %6d\n", $1, $2);
}
Click here to get file: awk_example9.awk
It would be awkward (forgive the choice of words) to add a new column and retain the same alignment. More complicated formats would require a lot of trial and error. You have to adjust the first printf to agree with the second printf statement. I suggest
#!/usr/bin/awk -f
BEGIN {
printf("%10s %6s\n", "String", "Number");
}
{
printf("%10s %6d\n", $1, $2);
}
Click here to get file: awk_example10.awk
or even better
#!/usr/bin/awk -f
BEGIN {
format1 ="%10s %6s\n";
format2 ="%10s %6d\n";
printf(format1, "String", "Number");
}
{
printf(format2, $1, $2);
}
Click here to get file: awk_example11.awk
The last example, by using string variables for formatting, allows you to keep all of the formats together.
This will move the printing characters to the left, with spaces added to the right.
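The modifier being described is the "-" flag, as in this sketch:

```shell
# %-20s left-justifies the string, padding with spaces on the right
echo "filename 1234" | awk '{ printf("%-20s%d\n", $1, $2); }'
```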
The precision field, which is the number between the decimal and the format character, is more complex. Most people use it with the floating point format (%f), but surprisingly, it can be used with any format character. With the octal, decimal or hexadecimal format, it specifies the minimum number of characters. Zeros are added to meet this requirement. With the %e and %f formats, it specifies the number of digits after the decimal point. The %e "e+00" is not included in the precision. The %g format combines the characteristics of the %d and %f formats. The precision specifies the number of digits displayed, before and after the decimal point. The precision field has no effect on the %c field. The %s format has an unusual, but useful effect: it specifies the maximum number of significant characters to print.
If the first number after the "%," or after the "%-," is a zero, then the system adds zeros when padding. This includes all format types, including strings and the %c character format. This means "%010d" and "%.10d" both add leading zeros, giving a minimum of 10 digits. The format "%10.10d" is therefore redundant. Table 8 gives some examples:
+--------------------------------------------+
|                AWK Table 8                 |
|      Examples of complex formatting        |
|Format   Variable        Results            |
+--------------------------------------------+
|%c       100             "d"                |
|%10c     100             "         d"       |
|%010c    100             "000000000d"       |
+--------------------------------------------+
|%d       10              "10"               |
|%10d     10              "        10"       |
|%10.4d   10.123456789    "      0010"       |
|%10.8d   10.123456789    "  00000010"       |
|%.8d     10.123456789    "00000010"         |
|%010d    10.123456789    "0000000010"       |
+--------------------------------------------+
|%e       987.1234567890  "9.871235e+02"     |
|%10.4e   987.1234567890  "9.8712e+02"       |
|%10.8e   987.1234567890  "9.87123457e+02"   |
+--------------------------------------------+
|%f       987.1234567890  "987.123457"       |
|%10.4f   987.1234567890  "  987.1235"       |
|%010.4f  987.1234567890  "00987.1235"       |
|%10.8f   987.1234567890  "987.12345679"     |
+--------------------------------------------+
|%g       987.1234567890  "987.123"          |
|%10g     987.1234567890  "   987.123"       |
|%10.4g   987.1234567890  "     987.1"       |
|%010.4g  987.1234567890  "00000987.1"       |
|%.8g     987.1234567890  "987.12346"        |
+--------------------------------------------+
|%o       987.1234567890  "1733"             |
|%10o     987.1234567890  "      1733"       |
|%010o    987.1234567890  "0000001733"       |
|%.8o     987.1234567890  "00001733"         |
+--------------------------------------------+
|%s       987.123         "987.123"          |
|%10s     987.123         "   987.123"       |
|%10.4s   987.123         "      987."       |
|%010.8s  987.123         "000987.123"       |
+--------------------------------------------+
|%x       987.1234567890  "3db"              |
|%10x     987.1234567890  "       3db"       |
|%010x    987.1234567890  "00000003db"       |
|%.8x     987.1234567890  "000003db"         |
+--------------------------------------------+
There is one more topic needed to complete this lesson on printf.
You can append to an existing file by using ">>":
Consider the shell program:
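A hedged reconstruction of the elided program (the original SunOS "line" command read one line from standard input; "read" is the portable equivalent, and the printf feeding the loop is sample input):

```shell
# start fresh, then copy each input line to both files
rm -f /tmp/a /tmp/b
printf 'one\ntwo\n' | while read x
do
	echo got $x
	echo $x >>/tmp/a
	echo $x >/tmp/b
done
```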
This will read standard input, and copy the standard input to files "/tmp/a" and "/tmp/b." File "/tmp/a" will grow larger, as information is always appended to the file. File "/tmp/b," however, will only contain one line. This happens because each time the shell sees the ">" or ">>" characters, it opens the file for writing, choosing the truncate/create or appending option at that time.
Now consider the equivalent AWK program:
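A sketch of the AWK version (with the same sample input):

```shell
# AWK picks truncate-vs-append only when a file is first opened,
# then keeps appending; so both lines end up in each file
rm -f /tmp/a /tmp/b
printf 'one\ntwo\n' | awk '{
	print $0 >> "/tmp/a";
	print $0 > "/tmp/b";
}'
```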
This behaves differently. AWK chooses the create/append option the first time a file is opened for writing. Afterwards, the use of ">" or ">>" is ignored. Unlike the shell, AWK copies all of standard input to file "/tmp/b."
Instead of a string, some versions of AWK allow you to specify an expression:
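A sketch (the /tmp/out. prefix is illustrative); note the parentheses around the expression, which some versions require:

```shell
# the output file name is an expression, not a fixed string
rm -f /tmp/out.alpha
echo "alpha data" | awk '{ print $0 > ("/tmp/out." $1); }'
```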
I hope this gives you the skill to make your AWK output picture perfect.
There are three types of functions: numeric, string and whatever's left. Table 9 lists all of the numeric functions:
+----------------------------------+
|           AWK Table 9            |
|        Numeric Functions         |
|Name    Function       Variant    |
+----------------------------------+
|cos     cosine         AWK        |
|exp     Exponent       AWK        |
|int     Integer        AWK        |
|log     Logarithm      AWK        |
|sin     Sine           AWK        |
|sqrt    Square Root    AWK        |
|atan2   Arctangent     NAWK       |
|rand    Random         NAWK       |
|srand   Seed Random    NAWK       |
+----------------------------------+
Sorry about that. I don't know what came over me. I don't usually resort to puns. I'll write a note to myself, and after I sine the note, I'll have my boss cosine it.
Now stop that! I hate arguing with myself. I always lose. Thinking about math I learned in the year 2 B.C. (Before Computers) seems to cause flashbacks of high school, pimples, and (shudder) times best left forgotten. The stress of remembering those days must have made me forget the standards I normally set for myself. Besides, no-one appreciates obtuse humor anyway, even if I find acute way to say it.
I better change the subject fast. Combining humor and computers is a very serious matter.
Here is a NAWK script that calculates the trigonometric functions for all degrees between 0 and 360. It also shows why there is no tangent, secant or cosecant function. (They aren't necessary.) If you read the script, you will learn of some subtle differences between AWK and NAWK. All this in a thin veneer of demonstrating why we learned trigonometry in the first place. What more can you ask for? Oh, in case you are wondering, I wrote this in the month of December.
#!/usr/bin/nawk -f
#
# A smattering of trigonometry...
#
# This AWK script plots the values from 0 to 360
# for the basic trigonometry functions
# but first - a review:
#
# (Note to the editor - the following diagram assumes
# a fixed width font, like Courier.
# otherwise, the diagram looks very stupid, instead of slightly stupid)
#
# Assume the following right triangle
#
# Angle Y
#
# |
# |
# |
# a | c
# |
# |
# +------- Angle X
# b
#
# since the triangle is a right angle, then
# X+Y=90
#
# Basic Trigonometric Functions. If you know the length
# of 2 sides, and the angles, you can find the length of the third side.
# Also - if you know the length of the sides, you can calculate
# the angles.
#
# The formulas are
#
# sine(X) = a/c
# cosine(X) = b/c
# tangent(X) = a/b
#
# reciprocal functions
# cotangent(X) = b/a
# secant(X) = c/b
# cosecant(X) = c/a
#
# Example 1)
# if an angle is 30, and the hypotenuse (c) is 10, then
# a = sine(30) * 10 = 5
# b = cosine(30) * 10 = 8.66
#
# The second example will be more realistic:
#
# Suppose you are looking for a Christmas tree, and
# while talking to your family, you smack into a tree
# because your head was turned, and your kids were arguing over who
# was going to put the first ornament on the tree.
#
# As you come to, you realize your feet are touching the trunk of the tree,
# and your eyes are 6 feet from the bottom of your frostbitten toes.
# While counting the stars that spin around your head, you also realize
# the top of the tree is located at a 65 degree angle, relative to your eyes.
# You suddenly realize the tree is 12.84 feet high! After all,
# tangent(65 degrees) * 6 feet = 12.84 feet
# All right, it isn't realistic. Not many people memorize the
# tangent table, or can estimate angles that accurately.
# I was telling the truth about the stars spinning around the head, however.
#
BEGIN {
# assign a value for pi.
PI=3.14159;
# select an "Ed Sullivan" number - really really big
BIG=999999;
# pick two formats
# Keep them close together, so when one column is made larger
# the other column can be adjusted to be the same width
fmt1="%7s %8s %8s %8s %10s %10s %10s %10s\n";
# print out the title of each column
fmt2="%7d %8.2f %8.2f %8.2f %10.2f %10.2f %10.2f %10.2f\n";
# old AWK wants a backslash at the end of the next line
# to continue the print statement
# new AWK allows you to break the line into two, after a comma
printf(fmt1,"Degrees","Radians","Cosine","Sine",
"Tangent","Cotangent","Secant", "Cosecant");
for (i=0;i<=360;i++) {
# convert degrees to radians
r = i * (PI / 180 );
# in new AWK, the backslashes are optional
# in OLD AWK, they are required
printf(fmt2, i, r,
# cosine of r
cos(r),
# sine of r
sin(r),
#
# I ran into a problem when dividing by zero.
# So I had to test for this case.
#
# old AWK finds the next line too complicated
# I don't mind adding a backslash, but rewriting the
# next three lines seems pointless for a simple lesson.
# This script will only work with new AWK, now - sigh...
# On the plus side,
# I don't need to add those back slashes anymore
#
# tangent of r
(cos(r) == 0) ? BIG : sin(r)/cos(r),
# cotangent of r
(sin(r) == 0) ? BIG : cos(r)/sin(r),
# secant of r
(cos(r) == 0) ? BIG : 1/cos(r),
# cosecant of r
(sin(r) == 0) ? BIG : 1/sin(r));
}
# put an exit here, so that standard input isn't needed.
exit;
}
Click here to get file: trigonometry.awk
NAWK also has the arctangent function. This is useful for some graphics work, as
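A minimal sketch of the random-number script the next paragraph describes; as noted there, the srand() call is commented out, so repeated runs print the same values:

```shell
# print five pseudo-random numbers; uncomment srand() to seed
awk 'BEGIN {
	# srand();
	for (i = 0; i < 5; i++)
		print rand();
	exit;
}'
```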
If you execute this script several times, you will get the exact same results. Experienced programmers know random number generators aren't really random, unless they use special hardware. These numbers are pseudo-random, and calculated using some algorithm. Since the algorithm is fixed, the numbers are repeatable unless the numbers are seeded with a unique value. This is done using the srand function above, which is commented out. Typically the random number generator is not given a special seed until the bugs have been worked out of the program. There's nothing more frustrating than a bug that occurs randomly. The srand function may be given an argument. If not, it uses the current time and day to generate a seed for the random number generator.
Besides numeric functions, there are two other types of function: strings and the whatchamacallits. First, a list of the string functions:
+-------------------------------------------------+
|                  AWK Table 10                   |
|                String Functions                 |
|Name                            Variant          |
+-------------------------------------------------+
|index(string,search)            AWK, NAWK, GAWK  |
|length(string)                  AWK, NAWK, GAWK  |
|split(string,array,separator)   AWK, NAWK, GAWK  |
|substr(string,position)         AWK, NAWK, GAWK  |
|substr(string,position,max)     AWK, NAWK, GAWK  |
|sub(regex,replacement)          NAWK, GAWK       |
|sub(regex,replacement,string)   NAWK, GAWK       |
|gsub(regex,replacement)         NAWK, GAWK       |
|gsub(regex,replacement,string)  NAWK, GAWK       |
|match(string,regex)             NAWK, GAWK       |
|tolower(string)                 GAWK             |
|toupper(string)                 GAWK             |
+-------------------------------------------------+
Most people first use AWK to perform simple calculations. Associative arrays and trigonometric functions are somewhat esoteric features that new users embrace with the eagerness of a chain smoker in a fireworks factory. I suspect most users add some simple string functions to their repertoire once they want to add a little more sophistication to their AWK scripts. I hope this column gives you enough information to inspire your next effort.
There are four string functions in the original AWK: index(), length(), split(), and substr(). These functions are quite versatile.
If you want to search for a special character, the index() function will search for specific characters inside a string. To find a comma, the code might look like this:
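A sketch (the sample input is illustrative); index() returns the position of the match, or 0 if the substring is absent:

```shell
echo "one,two" | awk '{
	if (index($0, ",") > 0) {
		print "comma at position", index($0, ",");
	}
}'
```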
If the substring consists of 2 or more characters, all of these characters must be found, in the same order, for a non-zero return value. Like the length() function, this is useful for checking for proper input conditions.
The substr function can be used in many non-obvious ways.
As an example, it can be used to convert upper case letters to lower case.
#!/usr/bin/awk -f
# convert upper case letters to lower case
BEGIN {
LC="abcdefghijklmnopqrstuvwxyz";
UC="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
}
{
out="";
# look at each character
for(i=1;i<=length($0);i++) {
# get the character to be checked
char=substr($0,i,1);
# is it an upper case letter?
j=index(UC,char);
if (j > 0 ) {
# found it
out = out substr(LC,j,1);
} else {
out = out char;
}
}
printf("%s\n", out);
}
Click here to get file: upper_to_lower.awk
GAWK has the toupper() and tolower() functions, for convenient conversions of case. These functions take strings, so you can reduce the above script to a single line:
#!/usr/local/bin/gawk -f
{
print tolower($0);
}
Click here to get file: upper_to_lower.gawk
The third argument is typically a single character. If a longer string is used, only the first letter is used as a separator.
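A sketch of a typical split() call (the colon separator and array name are illustrative); split() carves the string into array elements and returns the count:

```shell
echo "one:two:three" | awk '{
	n = split($0, word, ":");
	print n, word[2];
}'
```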
Sub() performs a string substitution, like sed. To replace "old" with "new" in a string, use
print the following when executed:
As you can see, the pattern can be a regular expression.
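The elided substitution probably resembled this sketch; note that sub() replaces only the first match, and the first argument is a regular expression:

```shell
awk 'BEGIN {
	string = "old old";
	sub(/o+ld/, "new", string);   # the pattern is a regex
	print string;
}'
```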
Lastly, there are the whatchamacallit functions. I could use the word "miscellaneous," but it's too hard to spell. Darn it, I had to look it up anyway.
+-----------------------------------------------+
|                 AWK Table 11                  |
|            Miscellaneous Functions            |
|Name                          Variant          |
+-----------------------------------------------+
|getline                       AWK, NAWK, GAWK  |
|getline <file                 NAWK, GAWK       |
|getline variable              NAWK, GAWK       |
|getline variable <file        NAWK, GAWK       |
|"command" | getline           NAWK, GAWK       |
|"command" | getline variable  NAWK, GAWK       |
|system(command)               NAWK, GAWK       |
|close(command)                NAWK, GAWK       |
|systime()                     GAWK             |
|strftime(string)              GAWK             |
|strftime(string, timestamp)   GAWK             |
+-----------------------------------------------+
Instead of reading into the standard variables, you can specify the variable to set:
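A sketch: getline with a variable name reads the next line into that variable, leaving $0 and NF untouched.

```shell
printf 'first\nsecond\n' | awk 'NR == 1 {
	getline line;
	print "current:", $0, "next:", line;
}'
```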
NAWK's getline can also read from a pipe. If you have a program that generates a single line, you can use
or
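Both forms, sketched with the date command (the close() calls let the pipe be reopened):

```shell
awk 'BEGIN {
	"date" | getline;       # first form: sets $0 and NF
	close("date");
	"date" | getline now;   # second form: sets only the variable
	close("date");
	print (NF > 0), (now != "");
}'
```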
If you have more than one line, you can loop through the results:
for (i in cmd) {
printf("%s=%s\n", i, cmd[i]);
}
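Putting it together, a hedged sketch (the env command and the array name cmd are illustrative): read each line of a command's output into an array, then loop over it as above.

```shell
awk 'BEGIN {
	# save each NAME=VALUE line of env in the array "cmd"
	while ("env" | getline line) {
		eq = index(line, "=");
		cmd[substr(line, 1, eq - 1)] = substr(line, eq + 1);
	}
	close("env");
	for (i in cmd) {
		printf("%s=%s\n", i, cmd[i]);
	}
}'
```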
Only one pipe can be open at a time. If you want to open another pipe, you must execute
This is necessary even if the end of file is reached.
The function takes one or two arguments. The first argument is a string that specifies the format. This string contains regular characters and special characters. Special characters start with a backslash or the percent character. The characters with the backslash prefix are the same ones I covered earlier. In addition, the strftime() function defines dozens of combinations, all of which start with "%." The following table lists these special sequences:
+---------------------------------------------------------------+
|                         AWK Table 12                          |
|                   GAWK's strftime formats                     |
+---------------------------------------------------------------+
|%a  The locale's abbreviated weekday name                      |
|%A  The locale's full weekday name                             |
|%b  The locale's abbreviated month name                        |
|%B  The locale's full month name                               |
|%c  The locale's "appropriate" date and time representation    |
|%d  The day of the month as a decimal number (01--31)          |
|%H  The hour (24-hour clock) as a decimal number (00--23)      |
|%I  The hour (12-hour clock) as a decimal number (01--12)      |
|%j  The day of the year as a decimal number (001--366)         |
|%m  The month as a decimal number (01--12)                     |
|%M  The minute as a decimal number (00--59)                    |
|%p  The locale's equivalent of the AM/PM                       |
|%S  The second as a decimal number (00--61).                   |
|%U  The week number of the year (Sunday is first day of week)  |
|%w  The weekday as a decimal number (0--6). Sunday is day 0    |
|%W  The week number of the year (Monday is first day of week)  |
|%x  The locale's "appropriate" date representation             |
|%X  The locale's "appropriate" time representation             |
|%y  The year without century as a decimal number (00--99)      |
|%Y  The year with century as a decimal number                  |
|%Z  The time zone name or abbreviation                         |
|%%  A literal %.                                               |
+---------------------------------------------------------------+
Depending on your operating system, and installation, you may also have the following formats:
+-----------------------------------------------------------------------+
|                             AWK Table 13                              |
|                    Optional GAWK strftime formats                     |
+-----------------------------------------------------------------------+
|%D  Equivalent to specifying %m/%d/%y                                  |
|%e  The day of the month, padded with a blank if it is only one digit  |
|%h  Equivalent to %b, above                                            |
|%n  A newline character (ASCII LF)                                     |
|%r  Equivalent to specifying %I:%M:%S %p                               |
|%R  Equivalent to specifying %H:%M                                     |
|%T  Equivalent to specifying %H:%M:%S                                  |
|%t  A TAB character                                                    |
|%k  The hour as a decimal number (0-23)                                |
|%l  The hour (12-hour clock) as a decimal number (1-12)                |
|%C  The century, as a number between 00 and 99                         |
|%u  is replaced by the weekday as a decimal number [Monday == 1]       |
|%V  is replaced by the week number of the year (using ISO 8601)       |
|%v  The date in VMS format (e.g. 20-JUN-1991)                          |
+-----------------------------------------------------------------------+
One useful format is
This constructs a string that contains the year, month, day, hour, minute and second in a format that allows convenient sorting. If you ran this at noon on Christmas, 1994, it would generate the string
Here is the GAWK equivalent of the date command:
#! /usr/local/bin/gawk -f
#
BEGIN {
format = "%a %b %e %H:%M:%S %Z %Y";
print strftime(format);
}
Click here to get file: date.gawk
You will note that there is no exit command in the BEGIN statement. If I were using AWK, an exit statement would be necessary. Otherwise, the script would never terminate. If there is no action defined for each line read, NAWK and GAWK do not need an exit statement.
If you provide a second argument to the strftime() function, it uses that argument as the timestamp, instead of the current system's time. This is useful for calculating future times. The following script calculates the time one week after the current time:
#!/usr/local/bin/gawk -f
BEGIN {
# get current time
ts = systime();
# the time is in seconds, so
one_day = 24 * 60 * 60;
next_week = ts + (7 * one_day);
format = "%a %b %e %H:%M:%S %Z %Y";
print strftime(format, next_week);
exit;
}
Click here to get file: one_week_later.gawk
In my first tutorial on AWK, I described the AWK statement as having the form
I have only used two patterns so far: the special words BEGIN and END. Other patterns are possible, yet I haven't used any. There are several reasons for this. The first is that these patterns aren't necessary. You can duplicate them using an if statement. Therefore this is an "advanced feature." Patterns, or perhaps the better word is conditions, tend to make an AWK program obscure to a beginner. You can think of them as an advanced topic, one that should be attempted after becoming familiar with the basics.
A pattern or condition is simply an abbreviated test. If the condition is true, the action is performed. All relational tests can be used as a pattern. The "head -10" command, which prints the first 10 lines and stops, can be duplicated with
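The elided script was presumably the NR test wrapped in an if; a sketch (the 15-line sample input is illustrative):

```shell
# print only while the record number is at most ten
awk 'BEGIN { for (i = 1; i <= 15; i++) print "line" i }' |
awk '{ if (NR <= 10) print; }'
```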
Besides relational tests, you can also use containment tests, i.e., does a string contain a regular expression? Printing all lines that contain the word "special" can be written as
or more briefly
This type of test is so common, the authors of AWK allow a third, shorter format:
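The three equivalent forms, sketched from most to least verbose (the sample input is illustrative):

```shell
printf 'a special line\nplain\n' | awk '{ if ($0 ~ /special/) print; }'
printf 'a special line\nplain\n' | awk '$0 ~ /special/ { print; }'
printf 'a special line\nplain\n' | awk '/special/ { print; }'
```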
These tests can be combined with the AND (&&) and OR (||) operators, as well as the NOT (!) operator. Parentheses can also be added if you are in doubt, or to make your intention clear.
The following condition prints the line if it contains the word "whole" or columns 1 and 2 contain "part1" and "part2" respectively.
This can be shortened to
There is one case where adding parentheses hurts. The condition
works, but
does not. If parentheses are used, it is necessary to explicitly specify the test:
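Sketches of the conditions just described (sample inputs are illustrative): the whole/part1/part2 test as an if and as a pattern, then the parenthesized form with the containment test spelled out:

```shell
echo "part1 part2" | awk '{
	if ($0 ~ /whole/ || ($1 ~ /part1/ && $2 ~ /part2/)) print;
}'
echo "part1 part2" | awk '$0 ~ /whole/ || $1 ~ /part1/ && $2 ~ /part2/ { print; }'
# with parentheses, spell the containment test out explicitly
echo "the whole thing" | awk '($0 ~ /whole/) { print; }'
```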
A murky situation arises when a simple variable is used as a condition. Since the variable NF specifies the number of fields on a line, one might think the statement
I expected NAWK to work, but on some SunOS systems it refused to print any lines at all. On newer Solaris systems it did behave properly. Again, changing it to the longer form worked for all variations. GAWK, like the newer version of NAWK, worked properly. After this experience, I decided to leave other, exotic variations alone. Clearly this is unexplored territory. I could write a script that prints the first 20 lines, except if there were exactly three fields, unless it was line 10, by using
But I won't. Obscurity, like puns, is often unappreciated.
There is one more common and useful pattern I have not yet described. It is the comma separated pattern. A common example has the form:
This form defines, in one line, the condition to turn the action on, and the condition to turn the action off. That is, when a line containing "start" is seen, it is printed. Every line afterwards is also printed, until a line containing "stop" is seen. This one is also printed, but the line after, and all following lines, are not printed. This triggering on and off can be repeated many times. The equivalent code, using the if command, is:
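A sketch of both forms (the sample input is illustrative): the comma-separated pattern, and its equivalent using a flag variable and the if command:

```shell
printf 'a\nstart\nb\nstop\nc\n' | awk '/start/,/stop/ { print; }'
printf 'a\nstart\nb\nstop\nc\n' | awk '{
	if ($0 ~ /start/) { triggered = 1; }
	if (triggered)    { print; }
	if ($0 ~ /stop/)  { triggered = 0; }
}'
```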
The conditions do not have to be regular expressions. Relational tests can also be used. The following prints all lines between 20 and 40:
You can mix relational and containment tests. The following prints every line until a "stop" is seen:
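Sketches of both range forms (the generated input is illustrative):

```shell
# a relational range: lines 20 through 40 of a fifty-line stream
awk 'BEGIN { for (i = 1; i <= 50; i++) print i }' |
awk '(NR == 20), (NR == 40) { print; }'
# a mixed range: every line until a "stop" is seen
printf 'a\nb\nstop\nc\n' | awk '(NR == 1), /stop/ { print; }'
```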
There is one more area of confusion about patterns: each one is independent of the others. You can have several patterns in a script; none influence the other patterns. If the following script is executed:
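A sketch of such a script (the generated ten-line input, whose tenth line is "xxx," is illustrative):

```shell
# four independent conditions; line 10 satisfies all four,
# so it is printed four times
awk 'BEGIN { for (i = 1; i < 10; i++) print "line" i; print "xxx"; }' |
awk '
NR == 10   { print; }
/xxx/      { print; }
$1 ~ /xxx/ { print; }
           { print; }'
```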
and the input file's line 10 contains "xxx," it would be printed 4 times, as each condition is true. You can think of each condition as cumulative. The exception is the special BEGIN and END conditions. In the original AWK, you can only have one of each. In NAWK and GAWK, you can have several BEGIN or END actions.
when I could use
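a compact form along these lines (an illustrative sketch):

```sh
echo hello | awk '{print} # the semicolon is omitted and the comment trails the code'
```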
After all, they reason, the semicolon is unnecessary, and comments do not have to start on the first column. This is true. Still, I avoid this. Years ago, when I started writing AWK programs, I would find myself confused when the nesting of conditions was too deep. If I moved a complex if statement inside another if statement, my alignment of braces became incorrect. It can be very difficult to repair this misalignment, especially with large scripts. Nowadays I use emacs to do the formatting for me, but 10 years ago I didn't have this option. My solution was to use the program cb, which is a "C beautifier." By including optional semicolons, and starting comments on the first column of the line, I could send my AWK script through this filter, and properly align all of the code.
+--------------------------------+
|          AWK Table 14          |
|Variable      AWK   NAWK  GAWK  |
+--------------------------------+
|FS            Yes   Yes   Yes   |
|NF            Yes   Yes   Yes   |
|RS            Yes   Yes   Yes   |
|NR            Yes   Yes   Yes   |
|FILENAME      Yes   Yes   Yes   |
|OFS           Yes   Yes   Yes   |
|ORS           Yes   Yes   Yes   |
+--------------------------------+
|ARGC                Yes   Yes   |
|ARGV                Yes   Yes   |
|ARGIND                    Yes   |
|FNR                 Yes   Yes   |
|OFMT                Yes   Yes   |
|RSTART              Yes   Yes   |
|RLENGTH             Yes   Yes   |
|SUBSEP              Yes   Yes   |
|ENVIRON                   Yes   |
|IGNORECASE                Yes   |
|CONVFMT                   Yes   |
|ERRNO                     Yes   |
|FIELDWIDTHS               Yes   |
+--------------------------------+
Since I've already discussed many of these, I'll only cover those that I missed earlier.
GAWK does, but that is because GAWK requires the "-v" option before each assignment:
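For example (the values are made up; modern POSIX awks also accept this form):

```sh
gawk -v a=1 -v b=2 'BEGIN { print a + b }'   # each assignment gets its own -v
```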
the first program would print the numbers 1 through 20, while the second would print the numbers 1 through 10 twice, once for each file.
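Concretely, the pair of programs was presumably along these lines (a sketch, assuming two passes over one 10-line file; "file1" is a throwaway name):

```sh
seq 1 10 > file1                  # a 10-line file
awk '{print NR}'  file1 file1     # 1 through 20: NR keeps counting across files
awk '{print FNR}' file1 file1     # 1 through 10, twice: FNR restarts with each file
```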
Variables b and c are both strings, but the first one will have the value "12.00" while the second will have the value "12."
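A sketch that reproduces those values in a modern awk (the original example isn't shown; I use a non-integral number deliberately, because exact integer values are converted with "%d" and bypass CONVFMT):

```sh
awk 'BEGIN {
    a = 12.001                   # non-integral on purpose
    CONVFMT = "%.2f"; b = a ""   # number-to-string conversion uses CONVFMT: "12.00"
    CONVFMT = "%.0f"; c = a ""   # now the same value converts to "12"
    print b
    print c
}'
```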
I found myself going into more depth than I planned, and I hope you found this useful. I found out a lot myself, especially when I discussed topics not covered in other books. Which reminds me of some closing advice: if you don't understand how something works, experiment and see what you can discover. Good luck, and happy AWKing...
This document was translated by troff2html v0.21 on September 22, 2001.