TweetFollow Us on Twitter

awk for Data Processing - Part 2

Volume Number: 22 (2006)
Issue Number: 04
Column Tag: Programming

Mac In The Shell

awk for Data Processing - Part 2

by Edward Marczak

Revving up the engine.

Last month, I introduced you to awk, the 'pattern processor'. That laid the foundation, and merely scratched the surface of awk's power. This month, we're going to dive back into flow control, as we've seen with bash, sed, math routines, and other cool awk features. Of course, awk only becomes more powerful when combined with shell scripting and sed.

Pattern Matching

Now that, over the last few months, we've covered regexp, sed, shell globbing, and now, awk, here's a word on anytime you want to use a utility that does pattern matching. Sometimes you're not in control of the data you're going to need to sift through. However, there are times where, you are the one generating the data. This could be in the form of a report, or even from another command-line tool. In any case, try to make your life easier: don't spew out unnecessary data! For example: you may want to find out ip addresses assigned to a particular interface, so you decide to use ifconfig, and write a sed script to parse the output. The sed script can pattern match the interface and then loop through the results looking for "inet". Not bad, but you decrease your work if you specify the interface you're looking for to ifconfig. If you're using nireport, make sure you output fields in the order that you want them, rather than use awk to swap them around. If you need a file listing, look at all of the switches that will sort and add symbols to the output that will make matching easier. Always make sure you read man page for any program that you're using: you may find some surprising switches that reduce the work you do further down the chain.

Back to Basics

Part 1 of this article gave us some real awk basics - print, match a pattern, field operators and some built-in variables. The built-in variables that we covered were NF, the number of fields in a record, and FS, the field separator. Of course, there are some more built-ins that we should know about. Let's do that before proceeding.

FS separates fields during the input stage. By default, awk separates output with a space. You can define that to be anything you want, using OFS. The output field separator is generated by the comma in a print statement. So, to rewrite an example from last month, we can make the output look better:

$ ls -l | awk 'BEGIN {OFS="\t\t"} {print $5,$9,$1}'
182468   20050629-local.jpg   -rw-r--r--
51986   iChats   drwxrwx---
68   images   drwxr-xr-x
1345   jamlog.txt   -rw-r--r--
61440   lads.exe   -rw-r--r--
271103   mount-1.260-3.wbm.gz    -rw-r--r--
352457   mr.spx    -rw-r--r--
Figure 1 - Output Field Separator in action

Better, but still a little ragged. Don't worry! We'll fix that in a bit.

When you generate multi-line output, awk separates each record with ORS, the record separator. ORS is a newline by default, so each record starts on a new line. You can change this! Why would you want to? You can even define RS, the input record separator. Sometimes, small examples are worth 1,000 words. If you are processing data that comes in a 'block' - spread out over several lines - setting FS to "\n", the newline character, will allow awk access via the field variables. Set RS to "", and awk will split these correctly when you have multiple input records. Practical example: you suspect a problem with user records, and want to search for particular users that (may) have the same home directory assigned. Here's the script:

01. #!/bin/bash
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk 'BEGIN {FS="\n"; RS=""} $0 ~ /\/Users\/marczak/ {print $0}'
06. done
Figure 2 - User search script

First, you can see it's a shell script. We'll use bash to feed awk multi-line records using dscl. Line 3 sets up a loop using all of the usernames that we have access to. Line 5 uses dscl again to get the detail for the username provided and feed that record to awk. Using a BEGIN construct, we first set FS and RS. Then, we look for "/Users/marczak" anywhere in the record using $0. If we match, we print the entire record. This way, we'll print all records that have that path as a home directory. It's a fairly specific example, but actually came in handy once. Plus, it illustrates handling multi-line records!

Finally, in our built-in round up, NR and FNR, keep the current line number available for you. NR is cumulative, and FNR gives you the number of the current record with respect to the input file. Useful if you're processing multiple files.

Low-level Format

awk is a fantastic tool for generating reports, however, reports are only really useful if they look good. The data can be good, but if it's hard to read, the brain just switches off. As you've seen, OFS and print only get you so far. awk supports a formatted print statement, printf, that you may have seen in some other languages, notably C. printf is more flexible than standard print, but requires a little more hand holding. Want an example? Here you go:

awk 'BEGIN {printf ("This is a test.\n")}'

Easy, right? So, what's different? First, you'll notice that you have to supply the newline - just like C! What I left out here, are the optional format specifiers, which again, match their counterpart in C. man printf will get you the list, if you forget them. Let's learn by example. The file listing code can be re-written with printf like this:

ls -l | awk '{printf "%s\t\t%s\t\t%s\n",$5, $9, $1}'

This means, print a string ("%s"), two tabs ("\t"), a string, two tabs and a string, followed up by a newline character. Each format specifier needs a corresponding value after the format string to fill in the place-holder with. We're substituting each %s with a field - $5, $9 and $1, respectively. However, this really is the equivalent of the earlier code - it's still ragged! printf also allows you to supply the width and alignment of the output. So, to clean up our listing, we can use this:

$ ls -l | awk 'NR > 1 {printf "%-20s%-20s%-20s\n",$5, $9, $1}'
306                 dist                   drwxr-xr-x          
42364               httpd.conf             -rw-r--r--          
37417               httpd.conf.bak   -rw-r--r--          
38334               httpd.conf.default   -rw-r--r--          
38334               httpd.conf.dist        -rw-r--r--          
12965               magic                  -rw-r--r--          
12965               magic.default          -rw-r--r--          
15201               mime.types             -rw-r--r--          
15201               mime.types.default     -rw-r--r--          
204                 users                  drwxr-xr-x
Figure 3 - Width specifiers

That's much nicer! Explanation: instead of "%s", we can specify a width using "%20s" - "20" being the width. By default, the output is right justified in the space allotted. I added the hyphen - "%-20s" - to our example, to left justify the text.

Flow Control...Again

Depending on how long you've been reading this column, you've seen this before: we've covered looping and decision-making in bash and in sed. Well, flow is important! So, let's see how awk handles these constructs.

The most basic of tests is an if/then test. The pattern matching we've seen is essentially an if/then test that is applied to all input. However, if you've matched something basic, and then need to make further decisions, you can use if/then. Let's combine this with an example using a loop.

As you may have seen in other languages, awk has a while loop that conditionally executes a block of code. Here's the idea:

while (condition is true) action

Like other languages, you can have a line feed between the condition and action, and if the action is multiple lines, they must be contained in curly-braces. I'm actually going to throw a few new things in, and then explain. Let's re-write our user search script from earlier (Figure 2):

01. #!/bin/bash
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk '
06. BEGIN {FS="\n"; RS=""}
07. $0 ~ /\/Users\/marczak/ {
08. i=1
09. while (i<=NF) {
10.         if ($i ~ /Dir/ || $i ~ /ID/ || $i ~ /Shell/) print $i
11.         i++
12. }
13. }
14. '
15. done
Figure 4 - A while loop in action

Once again, this is a shell script that feeds full blocks of data into awk. The dscl statements on lines 3 and 5 are identical to the ones from the first script. Look how we break up the awk script across multiple lines from there. Line 5 ends with a single quote, which allows bash to treat everything up until the next single quote as continuous code. Note the closing single quote on line 14! Again, we're going to look for a match on the entire input ($0) - looking for "/Users/Marczak" again. If and when we find it, that's where our adventure begins.

Line 8 initializes the variable "i" to 1. Not zero. We're going to index through fields, and don't need to check $0 again! Line 9 shows off our while loop. As long as i doesn't exceed the number of fields on the input record, we execute the loop. First time through, i=1, and we can use it to reference the first field of input ($1). Line 10 - an if statement! Lovely! If we find that the field we're currently looking at contains "Dir" or ("||") "ID" or "Shell", we print that field. Then, we increment i on line 11 so we don't loop around forever - and, we reference the next field in the next iteration of the loop.

Really cool stuff here: using the built-in NF variable as a comparison in our while loop, using a variable for the field reference, using classic Unix utilities with OS X specific CLI programs....nice. In addition to a while loop, awk supports the familiar "for" and "do" loop constructs. And as you may have guessed, you may recognize them already.

A "do" loop is a variant of the "while" loop. Its main difference is that the action is always executed at least once. It looks like this:

do {
} while (condition is true)
Need to see it in action?  Here you go:
        numMice = 5
        catTime = 3
        do {
                theMicePlay = numMice * theCatIsAway
                if (theCatIsAway > catTime) theCatIsAway = 0
        } while (theCatIsAway)
        print "The mice played " theMicePlay " days."
Figure 5 - An example do loop

Yes, it's a completely contrived example so I could use "while the cat is away"...I needed to bring a little levity to this column. This example does illustrate a little math, though, which I haven't explicitly covered.

The for loop borrows its syntax from C and should be pretty recognizable:

for (initialize; test conditions; increment) {

Rewriting the previous loop using for would look like this:

01. BEGIN { 
02.         numMice = 5
03.         catTime = 3
04.         for (theCatIsAway = 1; theCatIsAway > 0; theCatIsAway++) {
05.                 theMicePlay = numMice * theCatIsAway
06.                 if (theCatIsAway > catTime) theCatIsAway = -1
07.         }
08.         print "The mice played " theMicePlay " days"
09. }
Figure 6 - an example for loop

Look at that for loop! It's a thing of beauty! No, really...I'm serious! (Outside of the fact that it ruins my play on words). It lets you take care of everything you need for a loop. Note, however, that the increment happens at the bottom of the loop. This is important, and is the reason, we set theCatIsAway to -1 rather than 0 on line 6. Otherwise, our test would never be true, and we'd get caught in an infinite loop.

Once again, like other languages, awk lets us skip an iteration of a loop, or break out altogether. Inside of a loop, the break keyword breaks out of the loop, and ends it:

do {
   if (leaveLoopNow) break
} while (x < currentThreshhold)

In this example, if leaveLoopNow is true, execute the break statement and bail out of the loop - never to execute the remainder of the loop, picking up execution following the loop.

A less drastic version of break, is continue. A short example will make it clear:

do {
   if (notThisTime) continue
} while (x < currentThreshhold)

Here, if notThisTime is true, we just go back to the top of the loop. But the loop will continue, as long as our condition is true.

There are also two flow-altering statements that affect awk's entire flow - next and exit. The simpler of the two is exit. When awk encounters the exit statement, it jumps to the END rule. Of course, you don't even have to have and END rule defined. In that case, the script just terminates. Note that exit can supply a value to use as awk's exit code. Nice way to test success or failure in a shell script. exit without a value defaults to "0". next transfers control back to the top of the script where awk will read the next record of input. This is useful in a few different situations. If you only want to process records in a file that has 5 fields, simply sue this rule:

NF != 5 {next}

That's also useful for error checking, whereby if the target input doesn't 'look' right, you can just crank through the file. Perhaps even keeping count of how many records you skipped for use in an exception report.


Here's something that I can't tell you I've covered before. Certainly not in sed, nor in bash. Of course there are other languages that support arrays, so some of this may look familiar. But, if this column has been your introduction to anything remotely related to programming or scripting, this will be slightly new. An array is simply a variable that lets you hold a series of values. Being a loosely typed language, all arrays in awk are associative arrays - arrays that map keys to values. Associative arrays do not need to use integers as the key, or subscript, nor does every value need to be of equal type and size. If you have a PHP background, you'll understand this innately. Naturally, examples are forthcoming. Like other variables in awk, arrays do not need to be declared, so you can just use them:

array[key] = value

Often, you'll see simple numeric keys (subscripts) - useful when loading data in from a file, and you want to track something from every record, or mark certain records based on a value. Just as often, though, you'll see a key; a string that maps to a value. We can use this feature like this:

BEGIN { color["red"] = "0xF00"
color["green"] = "0x0F0"
color["blue"] = "0x00F"
print color["red"]

Nice, right? Arrays let us keep related values together. You can also use a variable as the key. Here's a totally trivial example that illustrates a few new concepts:

01. #!/bin/bash
02. /System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport -I | awk '
03. BEGIN { FS=":" }
04. {
05. gsub(" ","",$1)
06. recordlist[$1] = $2
07. }
08. END {
09. for (key in recordlist)
10.         print "The " key " is equal to" recordlist[key]
11. }
12. '
Figure 7 - Many new concepts!

Once again, I wrap awk in a bash shell script. First new thing, may be the airport command. With the "-I" switch, it gives you information about your current airport status. Next new thing is on line 5: gsub. Sometimes, exposure to many languages is a bit of a curse. When I see this command, I always think back to BASIC's "gosub" (go to subroutine) command. In awk's case, however, it stands for global substitue. I'm using it here just to clean up the output a bit. It's really powerful, though, and works like this:

gsub(regexp, substitution, string)

Now it's apparent; I'm just removing the spaces from $1: a space (" ") is being replaced with nothing ("") in the string $1. Now, look what's happening on line 6 - the value of $1 is being used as the key in the array "recordlist". It's also being assigned the value of $2, the second field. Then on line 9 in the END pattern, there's a new flow control statement. A variant on a for loop, we have some special syntax that accesses each element of an array in turn. "key" is a made-up variable. Right there, on the spot. It could really be whatever we like, but as with all variables, it should be something somewhat meaningful. This variable will contain the current key name in each iteration of the loop.

While there's a lot more to arrays, I'd be remiss if I didn't mention two functions: split and delete. split splits a string into an array based on a separator. This is just like awk's main loop function that breaks input into fields based on FS. If you have awk reading a CSV file, you could use split thusly:

x = split($0, myFields, ",")

What this does is create an array - myFields - that contains each 'field' of $0, fields being separated with a comma. split returns the number of fields in the string, in our case, putting the result into "x". If the input looked like this:

Mike Jones, 555-1234,
Bill Smith, 555-0984,
Sally Foster, 555-3456,

...then during the first pass, myFields would contain:

myFields[1] = "Mike Jones"
myFields[2] = "555-1234"
myFields[3] = ""

split is also a useful way to load up an array:


That's a lot easier than writing out "month[1] = "...

delete is simple: it lets you remove an element of an array. Simply:

delete myFields[2]

...would get rid of the phone number in our previous example. Of course, you can always just ignore a field, using delete will make sure it's gone if you choose to use

Grab Bag

There are some final things I feel the need to mention about awk, but realized that they're each pretty short and belonged altogether in a 'grab bag' section. Without further ado, here they are.

In addition to the other built-in variables that we've covered, awk presents two built-ins as arrays: ARGV and ENVIRON. ARGV is an array that contains each command line argument, including the script name itself in ARGV[0]. If you ran your script like this:

$ awk -f argvtest.awk var1 15 "Iolo" "Shamino"

...ARGV would contain:

ARGV[0] = awk
ARGV[1] = var1
ARGV[2] = 15
ARGV[3] = "Iolo"
ARGV[4] = "Shamino"

ENVIRON maps current environment variables to their values. For example:


on my machine would yield "darwin8.0".

In addition to split and gsub, awk contains some really useful (and common to other languages) string manipulation functions, such as substr, toupper, length and tolower. Consider that homework.

Finally, awk provides a way to get input outside of the main input loop. getline gets a new line of input, and can be used in two different ways. First, when used by itself, it will get the next line from input that the main loop would have gotten. This is similar to next (covered above), however, getline does not bring flow back to the top of the script. Secondarily, you can pipe input into awk and read it with getline. While more in-depth work still requires a shell script, this is my favorite way to write a quick-and-dirty awk script. One example will get you going:

$ awk 'BEGIN {"top -l 1" | getline; print $2 " processes running"}'
121 processes running

The output of top is piped into awk - directly from inside awk! You can, of course, even use that trick conditionally, and go look something up on the fly if needed.


To show everything that awk is capable of would take a book. I believe I've shown things that are immediately understandable and practical. Between this column and the last, you should have a good foundation to build on. Of course, there's a lot more to explore. I didn't get to multi-dimensional arrays, trig math, user-defined functions, piping to output...and more. If this has whetted your appetite, there are many resources that teach awk in-depth. Just dropping into Google and trying "learning awk" brings back an incredible number of resources.

awk is a fantastic utility that has proved its worth over decades of classic Unix use. For OS X administrators, it dovetails perfectly with the powerful command-line utilities at our disposal.

Recommended reading for the month: Cuckoo's Egg, by Cliff Stoll. Released in 1989, this was one of the first track-a-hacker books I ever read. Of course, I could suggest some technical reference for you to dig into, but this is just good reading. I saw a copy at a bookstore not too long ago, and that made me break out my old version. Still a good read today, if not a great way to compare and contrast the technical environment of the late 1980s to today. Also, a good reminder that social engineering is timeless.

Ed Marczak owns and operates Radiotope, a technology consulting company. More tech tips at the blog:


Community Search:
MacTech Search:

Software Updates via MacUpdate

Planet Diver guide - How to survive long...
Planet Diver is an endless arcade game about diving through planets while dodging lava, killing bats, and collecting Starstuff. Here are some tips to help you go the distance. [Read more] | Read more »
KORG iDS-10 (Music)
KORG iDS-10 1.0.0 Device: iOS iPhone Category: Music Price: $9.99, Version: 1.0.0 (iTunes) Description: ** Debut Discount: 50% OFF! Sale Price US$9.99 (Regular price US$19.99). Other all Korg apps are also 50% OFF until Dec 28! **... | Read more »
World of Tanks Generals guide - Tips and...
World of Tanks Generals is a brand new card game by the developer behind the World of Tanks shooter franchise. It plays like a cross between chess and your typical card game. You have to keep in consideration where you place your tanks on the board... | Read more »
TruckSimulation 16 guide: How to succeed...
Remember those strangely enjoyable truck missions in Grand Theft Auto V whereit was a disturbing amount of fun to deliver cargo? TruckSimulation 16 is reminiscent of that, and has you play the role of a truck driver who has to deliver various... | Read more »
The best GIF making apps
Animated GIFs have exploded in popularity recently which is likely thanks to a combination of Tumblr, our shorter attention spans, and the simple fact they’re a lot of fun. [Read more] | Read more »
The best remote desktop apps for iOS
We've been sifting through the App Store to find the best ways to do computer tasks on a tablet. That gave us a thought - what if we could just do computer tasks from our tablets? Here's a list of the best remote desktop apps to help you use your... | Read more »
Warhammer 40,000: Freeblade guide - How...
Warhammer 40,000: Freebladejust launched in the App Store and it lets you live your childhood dream of blowing up and slashing a bunch of enemies as a massive, hulking Space Marine. It's not easy being a Space Marine though - and particularly if... | Read more »
Gopogo guide - How to bounce like the be...
Nitrome just launched a new game and, as to be expected, it's a lot of addictive fun. It's called Gopogo, and it challenges you to hoparound a bunch of platforms, avoiding enemies and picking up shiny stuff. It's not easy though - just like the... | Read more »
Sago Mini Superhero (Education)
Sago Mini Superhero 1.0 Device: iOS Universal Category: Education Price: $2.99, Version: 1.0 (iTunes) Description: KAPOW! Jack the rabbit bursts into the sky as the Sago Mini Superhero! Fly with Jack as he lifts impossible weights,... | Read more »
Star Wars: Galaxy of Heroes guide - How...
Star Wars: Galaxy of Heroes is all about collecting heroes, powering them up, and using them together to defeat your foes. It's pretty straightforward stuff for the most part, but increasing your characters' stats can be a bit confusing because it... | Read more »

Price Scanner via

World’s First USB-C Adapter For MacBook Suppo...
Innergie, a brand of Delta Electronics, has announced its official release of the world’s first USB-C adapter supporting four DC output voltages, the PowerGear USB-C 45. This true Type C adapter... Read more
13-inch and 11-inch MacBook Airs on sale for...
B&H Photo has 13″ and 11″ MacBook Airs on sale for up to $120 off MSRP as part of their Holiday sale including free shipping plus NY sales tax only: - 11″ 1.6GHz/128GB MacBook Air: $819 $90 off... Read more
13-inch MacBook Pros on sale for up to $150 o...
Take up to $150 off MSRP on the price of a new 13″ MacBook Pro at B&H Photo today as part of their Holiday sale. Shipping is free, and B&H charges NY tax only. These prices are currently the... Read more
13-inch 128GB MacBook Air now on sale for $79...
Best Buy has just lowered their price on the 2015 13″ 1.6GHz/128GB MacBook Air to $799.99 on their online store for Cyber Monday. Choose free shipping or free local store pickup (if available). Sale... Read more
Best Buy lowers 13-inch MacBook Pro prices, n...
Best Buy has lowered prices on select 13″ MacBook Pros this afternoon. Now save up to $200 off MSRP for Cyber Monday on the following models. Choose free shipping or free local store pickup (if... Read more
Cyber Monday: Apple MacBooks on sale for up t...
Apple resellers have MacBook Pros, MacBook Airs, and MacBooks on sale for up to $250 off MSRP for Cyber Monday 2015. The following is a roundup of the lowest prices available for new models from any... Read more
Cyber Monday: Apple Watch on sale for up to $...
B&H Photo has the Apple Watch on sale for Cyber Monday for $50-$100 off MSRP. Shipping is free, and B&H charges NY sales tax only: - Apple Watch Sport: $50 off - Apple Watch: $50-$100 off B... Read more
Cyber Monday: 15% off Apple products, and sto...
Use code CYBER15 on Cyber Monday only to take 15% on Apple products at Target, and store-wide. Choose free shipping or free local store pickup (if available). Sale prices for online orders only, in-... Read more
iPad Air 2 And iPad mini Among Top Five Black...
Adobe has released its 2015 online shopping data for Black Friday and Thanksgiving Day. The five best selling electronic products on Black Friday were Samsung 4K TVs, Apple iPad Air 2, Microsoft Xbox... Read more
All-in-one PC Shipments Projected To Drop Ove...
Digitimes’ Aaron Lee and Joseph Tsai report that all-in-one (AIO) PC shipments may drop a double-digit percentage on-year in 2015 due to weaker-than-expected demand, although second-largest AIO make... Read more

Jobs Board

*Apple* New Products Tester Needed - Apple (...
…we therefore look forward to put out products to quality test for durability. Apple leads the digital music revolution with its iPods and iTunes online store, continues Read more
Software Engineer, *Apple* Watch - Apple (U...
# Software Engineer, Apple Watch Job Number: 33362459 Santa Clara Valley, Califo ia, United States Posted: Jul. 28, 2015 Weekly Hours: 40.00 **Job Summary** Join the Read more
SW Engineer - *Apple* Music - Apple (United...
# SW Engineer - Apple Music Job Number: 40899104 San Francisco, Califo ia, United States Posted: Aug. 18, 2015 Weekly Hours: 40.00 **Job Summary** Join the Android Read more
Sr Software Engineer *Apple* Pay - Apple (U...
# Sr Software Engineer Apple Pay Job Number: 44003019 Santa Clara Valley, Califo ia, United States Posted: Nov. 13, 2015 Weekly Hours: 40.00 **Job Summary** Apple Read more
*Apple* Site Security Manager - Apple (Unite...
# Apple Site Security Manager Job Number: 42975010 Culver City, Califo ia, United States Posted: Oct. 2, 2015 Weekly Hours: 40.00 **Job Summary** The Apple Site Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.