awk for Data Processing - Part 2

Volume Number: 22 (2006)
Issue Number: 04
Column Tag: Programming

Mac In The Shell

by Edward Marczak

Revving up the engine.

Last month, I introduced you to awk, the 'pattern processor'. That laid the foundation, but merely scratched the surface of awk's power. This month, we're going to dive back into flow control, as we've seen with bash and sed, and then move on to math routines and other cool awk features. Of course, awk only becomes more powerful when combined with shell scripting and sed.

Pattern Matching

Over the last few months, we've covered regexp, sed, shell globbing, and now awk, so here's a word of advice for any time you use a utility that does pattern matching. Sometimes you're not in control of the data you need to sift through. However, there are times when you are the one generating the data. This could be in the form of a report, or even output from another command-line tool. In any case, try to make your life easier: don't spew out unnecessary data! For example: you may want to find out which IP addresses are assigned to a particular interface, so you decide to use ifconfig and write a sed script to parse the output. The sed script can pattern match the interface and then loop through the results looking for "inet". Not bad, but you decrease your work if you tell ifconfig which interface you're looking for. If you're using nireport, make sure you output fields in the order that you want them, rather than using awk to swap them around. If you need a file listing, look at all of the switches that sort the output and add symbols to it, making matching easier. Always make sure you read the man page for any program that you're using: you may find some surprising switches that reduce the work you do further down the chain.

Back to Basics

Part 1 of this article gave us some real awk basics - print, match a pattern, field operators and some built-in variables. The built-in variables that we covered were NF, the number of fields in a record, and FS, the field separator. Of course, there are some more built-ins that we should know about. Let's do that before proceeding.

FS separates fields during the input stage. By default, awk separates output with a space. You can define that to be anything you want, using OFS. The output field separator is generated by the comma in a print statement. So, to rewrite an example from last month, we can make the output look better:

$ ls -l | awk 'BEGIN {OFS="\t\t"} {print $5,$9,$1}'
182468   20050629-local.jpg   -rw-r--r--
51986   iChats   drwxrwx---
68   images   drwxr-xr-x
1345   jamlog.txt   -rw-r--r--
61440   lads.exe   -rw-r--r--
271103   mount-1.260-3.wbm.gz    -rw-r--r--
352457   mr.spx    -rw-r--r--
Figure 1 - Output Field Separator in action

Better, but still a little ragged. Don't worry! We'll fix that in a bit.
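To see exactly where OFS kicks in, here's a minimal sketch using a made-up input line (the words alpha, beta and gamma are just placeholders). The comma in print is what inserts OFS; leave the comma out and awk simply glues the fields together.

```shell
# Hypothetical sample line; the three words stand in for real fields.
with_comma=$(printf 'alpha beta gamma\n' | awk 'BEGIN { OFS = " | " } { print $3, $1 }')
no_comma=$(printf 'alpha beta gamma\n' | awk 'BEGIN { OFS = " | " } { print $3 $1 }')

echo "$with_comma"   # fields joined by OFS
echo "$no_comma"     # fields concatenated, OFS not used
```

The first line comes out as "gamma | alpha", the second as "gammaalpha".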

When you generate multi-line output, awk separates each record with ORS, the output record separator. ORS is a newline by default, so each record starts on a new line. You can change this! Why would you want to? Well, you can also define RS, the input record separator, and sometimes a small example is worth 1,000 words. If you are processing data that comes in a 'block' - spread out over several lines - setting FS to "\n", the newline character, gives awk access to each line via the field variables. Set RS to "" (the empty string), and awk treats blank lines as the record separator, so multiple multi-line input records are split correctly. Practical example: you suspect a problem with user records, and want to search for users that (may) have the same home directory assigned. Here's the script:

01. #!/bin/bash
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk 'BEGIN {FS="\n"; RS=""} $0 ~ /\/Users\/marczak/ {print $0}'
06. done
Figure 2 - User search script

First, you can see it's a shell script. We'll use bash to feed awk multi-line records using dscl. Line 3 sets up a loop using all of the usernames that we have access to. Line 5 uses dscl again to get the detail for the username provided and feed that record to awk. Using a BEGIN construct, we first set FS and RS. Then, we look for "/Users/marczak" anywhere in the record using $0. If we match, we print the entire record. This way, we'll print all records that have that path as a home directory. It's a fairly specific example, but actually came in handy once. Plus, it illustrates handling multi-line records!
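If you don't have a directory service handy to experiment with, the same multi-line technique can be tried on any text. This is a rough sketch that assumes two invented records separated by a blank line; the "name" and "home" labels are made up for illustration and are not dscl output.

```shell
# Two made-up records separated by a blank line; field names are illustrative.
records='name: alice
home: /Users/alice

name: bob
home: /Users/shared'

# RS="" makes blank lines the record separator; FS="\n" makes each line a field.
para_demo=$(printf '%s\n' "$records" |
    awk 'BEGIN { FS = "\n"; RS = "" } /\/Users\/shared/ { print NF " fields, first is: " $1 }')
echo "$para_demo"
```

Only the second record matches, and awk reports it as having two fields, one per line.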

Finally, in our built-in round up, NR and FNR keep the current record number (by default, the line number) available for you. NR is cumulative across all input, while FNR gives you the number of the current record within the current input file. That's useful if you're processing multiple files.
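The difference between the two counters only shows up with more than one input file. Here's a small sketch that assumes two scratch files with invented contents, created in a temporary directory purely for the demonstration:

```shell
# Scratch files with invented contents, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# Print the cumulative count, the per-file count, and the record itself.
nr_demo=$(awk '{ print NR, FNR, $0 }' "$tmpdir/first.txt" "$tmpdir/second.txt")
echo "$nr_demo"
rm -r "$tmpdir"
```

NR keeps climbing from 1 to 5, while FNR drops back to 1 when awk starts on the second file.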

Low-level Format

awk is a fantastic tool for generating reports. However, reports are only really useful if they look good. The data can be good, but if it's hard to read, the brain just switches off. As you've seen, OFS and print only get you so far. awk supports a formatted print statement, printf, that you may have seen in some other languages, notably C. printf is more flexible than standard print, but requires a little more hand holding. Want an example? Here you go:

awk 'BEGIN {printf ("This is a test.\n")}'

Easy, right? So, what's different? First, you'll notice that you have to supply the newline - just like C! What I left out here are the optional format specifiers, which, again, match their counterparts in C. man printf will get you the list if you forget them. Let's learn by example. The file listing code can be rewritten with printf like this:

ls -l | awk '{printf "%s\t\t%s\t\t%s\n",$5, $9, $1}'

This means: print a string ("%s"), two tabs ("\t\t"), a string, two tabs and a string, followed by a newline character. Each format specifier needs a corresponding value after the format string to fill in its placeholder. We're substituting each %s with a field - $5, $9 and $1, respectively. However, this really is the equivalent of the earlier code - it's still ragged! printf also allows you to supply the width and alignment of the output. So, to clean up our listing, we can use this:

$ ls -l | awk 'NR > 1 {printf "%-20s%-20s%-20s\n",$5, $9, $1}'
306                 dist                drwxr-xr-x
42364               httpd.conf          -rw-r--r--
37417               httpd.conf.bak      -rw-r--r--
38334               httpd.conf.default  -rw-r--r--
38334               httpd.conf.dist     -rw-r--r--
12965               magic               -rw-r--r--
12965               magic.default       -rw-r--r--
15201               mime.types          -rw-r--r--
15201               mime.types.default  -rw-r--r--
204                 users               drwxr-xr-x
Figure 3 - Width specifiers

That's much nicer! Explanation: instead of "%s", we can specify a width using "%20s" - "20" being the width. By default, the output is right justified in the space allotted. I added the hyphen - "%-20s" - to our example, to left justify the text.
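Here's a quick way to see the padding for yourself. This sketch wraps each value in square brackets, which are there only to make the spaces visible. It also sneaks in two specifiers the listing didn't need, "%d" for integers and "%.2f" for a number with two decimal places; both come straight from C's printf.

```shell
# Brackets are added only to make the padding visible.
width_demo=$(awk 'BEGIN { printf "[%8s][%-8s][%5d][%.2f]\n", "right", "left", 42, 3.14159 }')
echo "$width_demo"
```

The result is "[   right][left    ][   42][3.14]": right justified by default, left justified with the hyphen.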

Flow Control...Again

Depending on how long you've been reading this column, you've seen this before: we've covered looping and decision-making in bash and in sed. Well, flow is important! So, let's see how awk handles these constructs.

The most basic of tests is an if test (awk follows C here, so there's no "then" keyword). The pattern matching we've seen is essentially an if test that is applied to all input. However, if you've matched something basic, and then need to make further decisions, you can use if. Let's combine this with an example using a loop.

As you may have seen in other languages, awk has a while loop that conditionally executes a block of code. Here's the idea:

while (condition is true) action

Like other languages, you can have a line feed between the condition and action, and if the action is multiple lines, they must be contained in curly-braces. I'm actually going to throw a few new things in, and then explain. Let's re-write our user search script from earlier (Figure 2):

01. #!/bin/bash
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk '
06. BEGIN {FS="\n"; RS=""}
07. $0 ~ /\/Users\/marczak/ {
08. i=1
09. while (i<=NF) {
10.         if ($i ~ /Dir/ || $i ~ /ID/ || $i ~ /Shell/) print $i
11.         i++
12. }
13. }
14. '
15. done
Figure 4 - A while loop in action

Once again, this is a shell script that feeds full blocks of data into awk. The dscl statements on lines 3 and 5 are identical to the ones from the first script. Look how we break up the awk script across multiple lines from there. Line 5 ends with a single quote, which makes bash treat everything up until the next single quote as one continuous argument. Note the closing single quote on line 14! Again, we're going to look for a match on the entire input ($0) - looking for "/Users/marczak" again. If and when we find it, that's where our adventure begins.

Line 8 initializes the variable "i" to 1. Not zero. We're going to index through fields, and don't need to check $0 again! Line 9 shows off our while loop. As long as i doesn't exceed the number of fields on the input record, we execute the loop. First time through, i=1, and we can use it to reference the first field of input ($1). Line 10 - an if statement! Lovely! If we find that the field we're currently looking at contains "Dir" or ("||") "ID" or "Shell", we print that field. Then, we increment i on line 11 so we don't loop around forever - and, we reference the next field in the next iteration of the loop.

Really cool stuff here: using the built-in NF variable as a comparison in our while loop, using a variable for the field reference, using classic Unix utilities with OS X specific CLI programs....nice. In addition to a while loop, awk supports the familiar "for" and "do" loop constructs. As you may have guessed, you'll probably recognize them already.

A "do" loop is a variant of the "while" loop. Its main difference is that the action is always executed at least once. It looks like this:

do {
        action
} while (condition is true)

Need to see it in action?  Here you go:

BEGIN {
        numMice = 5
        catTime = 3
        do {
                theCatIsAway++    # one more day away; an unset variable starts at 0
                theMicePlay = numMice * theCatIsAway
                if (theCatIsAway > catTime) theCatIsAway = 0
        } while (theCatIsAway)
        print "The mice played " theMicePlay " days."
}
Figure 5 - An example do loop

Yes, it's a completely contrived example so I could use "while the cat is away"...I needed to bring a little levity to this column. This example does illustrate a little math, though, which I haven't explicitly covered.

The for loop borrows its syntax from C and should be pretty recognizable:

for (initialize; test condition; increment) action

Rewriting the previous loop using for would look like this:

01. BEGIN { 
02.         numMice = 5
03.         catTime = 3
04.         for (theCatIsAway = 1; theCatIsAway > 0; theCatIsAway++) {
05.                 theMicePlay = numMice * theCatIsAway
06.                 if (theCatIsAway > catTime) theCatIsAway = -1
07.         }
08.         print "The mice played " theMicePlay " days"
09. }
Figure 6 - an example for loop

Look at that for loop! It's a thing of beauty! No, really...I'm serious! (Outside of the fact that it ruins my play on words). It lets you take care of everything you need for a loop. Note, however, that the increment happens at the bottom of the loop. This is important, and is the reason we set theCatIsAway to -1 rather than 0 on line 6. Otherwise, the increment would bump it straight back up to 1, our test would never be false, and we'd get caught in an infinite loop.
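A more everyday use of the for loop is walking across the fields of a record. The following is only a sketch, assuming made-up input where each line is a label followed by some numbers to total:

```shell
# Made-up input: each line is a label followed by numbers to total.
sum_demo=$(printf 'mon 3 4 5\ntue 10 20\n' |
    awk '{
        total = 0
        for (i = 2; i <= NF; i++) total += $i   # start at 2 to skip the label
        print $1, total
    }')
echo "$sum_demo"
```

Starting the index at 2 skips the label in the first field, and NF keeps the loop from running off the end of the record, whatever its length.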

Once again, like other languages, awk lets us skip an iteration of a loop, or break out altogether. Inside of a loop, the break keyword breaks out of the loop, and ends it:

do {
   if (leaveLoopNow) break
} while (x < currentThreshhold)

In this example, if leaveLoopNow is true, execute the break statement and bail out of the loop - never to execute the remainder of the loop, picking up execution following the loop.

A less drastic version of break, is continue. A short example will make it clear:

do {
   if (notThisTime) continue
} while (x < currentThreshhold)

Here, if notThisTime is true, we skip the rest of the loop body and go straight to the next test of the condition. The loop itself carries on, as long as our condition is true.
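To watch both statements at work in one place, here's a small sketch with invented limits (10 and 7 mean nothing in particular). Note that in a for loop, continue still runs the increment before the condition is tested again.

```shell
loop_demo=$(awk 'BEGIN {
    for (i = 1; i <= 10; i++) {
        if (i % 2 == 0) continue   # skip even numbers, move on to the next i
        if (i > 7) break           # leave the loop entirely once past 7
        seen = seen sep i
        sep = " "
    }
    print seen
}')
echo "$loop_demo"
```

The even numbers never make it into the list, and the loop gives up at 9, leaving "1 3 5 7".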

There are also two flow-altering statements that affect awk's entire flow - next and exit. The simpler of the two is exit. When awk encounters the exit statement, it stops reading input and jumps to the END rule. Of course, you don't even have to have an END rule defined. In that case, the script just terminates. Note that exit can supply a value to use as awk's exit code - a nice way to test success or failure in a shell script. exit without a value defaults to "0". next transfers control back to the top of the script, where awk will read the next record of input. This is useful in a few different situations. If you only want to process records that have 5 fields, simply use this rule:

NF != 5 {next}

That's also useful for error checking, whereby if the target input doesn't 'look' right, you can just crank through the file. Perhaps even keeping count of how many records you skipped for use in an exception report.
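Here's how that exception count might look in practice, along with exit handing a status back to the shell. Treat it as a sketch: the input lines are invented, and the exit value of 2 is an arbitrary choice for the example.

```shell
# Invented input: only the first and third lines have exactly 5 fields.
next_demo=$(printf 'a b c d e\nshort line\n1 2 3 4 5\n' |
    awk 'NF != 5 { skipped++; next }
         { print "ok: " $0 }
         END { print "skipped " skipped " record(s)" }')
echo "$next_demo"

# exit with a value sets the exit code of awk, which the shell can test.
exit_status=0
printf 'a b\n' | awk 'NF != 5 { exit 2 }' || exit_status=$?
echo "awk exited with $exit_status"
```

The two good records are printed, one record is counted as skipped, and the second command reports an exit status of 2.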


Arrays

Here's something that I can't tell you I've covered before. Certainly not in sed, nor in bash. Of course there are other languages that support arrays, so some of this may look familiar. But, if this column has been your introduction to anything remotely related to programming or scripting, this will be slightly new. An array is simply a variable that lets you hold a series of values. awk being a loosely typed language, all of its arrays are associative arrays - arrays that map keys to values. Associative arrays do not need to use integers as the key, or subscript, nor does every value need to be of equal type and size. If you have a PHP background, you'll understand this innately. Naturally, examples are forthcoming. Like other variables in awk, arrays do not need to be declared, so you can just use them:

array[key] = value

Often, you'll see simple numeric keys (subscripts) - useful when loading data in from a file, and you want to track something from every record, or mark certain records based on a value. Just as often, though, you'll see a key; a string that maps to a value. We can use this feature like this:

BEGIN {
        color["red"] = "0xF00"
        color["green"] = "0x0F0"
        color["blue"] = "0x00F"
        print color["red"]
}

Nice, right? Arrays let us keep related values together. You can also use a variable as the key. Here's a totally trivial example that illustrates a few new concepts:

01. #!/bin/bash
02. /System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport -I | awk '
03. BEGIN { FS=":" }
04. {
05. gsub(" ","",$1)
06. recordlist[$1] = $2
07. }
08. END {
09. for (key in recordlist)
10.         print "The " key " is equal to" recordlist[key]
11. }
12. '
Figure 7 - Many new concepts!

Once again, I wrap awk in a bash shell script. The first new thing may be the airport command. With the "-I" switch, it gives you information about your current AirPort status. The next new thing is on line 5: gsub. Sometimes, exposure to many languages is a bit of a curse. When I see this command, I always think back to BASIC's "gosub" (go to subroutine) command. In awk's case, however, it stands for global substitute. I'm using it here just to clean up the output a bit. It's really powerful, though, and works like this:

gsub(regexp, substitution, string)

Now it's apparent; I'm just removing the spaces from $1: a space (" ") is being replaced with nothing ("") in the string $1. Now, look what's happening on line 6 - the value of $1 is being used as the key in the array "recordlist", and that element is assigned the value of $2, the second field. Then on line 9, in the END rule, there's a new flow control statement. A variant on the for loop, it has some special syntax that accesses each element of an array in turn. "key" is a made-up variable. Right there, on the spot. It could really be whatever we like, but as with all variables, it should be something somewhat meaningful. This variable will contain the current key name in each iteration of the loop.
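If you're not sitting at a Mac with the airport tool, the same pattern can be tried on any "key: value" text. This is a sketch with made-up input lines rather than real airport output. One thing to keep in mind: awk makes no promise about the order in which a for...in loop visits the keys, so the output is piped through sort to keep it predictable.

```shell
# Made-up "key: value" lines standing in for real command output.
array_demo=$(printf 'link auth: wpa2\nchannel: 11\n' |
    awk 'BEGIN { FS = ":" }
         {
             gsub(" ", "", $1)      # strip spaces from the key
             gsub(/^ +/, "", $2)    # strip leading spaces from the value
             settings[$1] = $2
         }
         END {
             for (key in settings) print key "=" settings[key]
         }' | sort)
echo "$array_demo"
```

After sorting, the output is "channel=11" followed by "linkauth=wpa2".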

While there's a lot more to arrays, I'd be remiss if I didn't mention two functions: split and delete. split splits a string into an array based on a separator. This is just like awk's main loop function that breaks input into fields based on FS. If you have awk reading a CSV file, you could use split thusly:

x = split($0, myFields, ",")

What this does is create an array - myFields - that contains each 'field' of $0, fields being separated with a comma. split returns the number of fields in the string, in our case, putting the result into "x". If the input looked like this:

Mike Jones, 555-1234,
Bill Smith, 555-0984,
Sally Foster, 555-3456,

...then during the first pass, myFields would contain:

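myFields[1] = "Mike Jones"
myFields[2] = " 555-1234"
myFields[3] = ""

Note that myFields[2] keeps the space that followed the comma, and that the trailing comma produces an empty third element.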
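You can check this from the command line. The sketch below feeds the first sample line to awk, trims the leading space left behind by the comma using sub (gsub's one-shot sibling), and prints the count that split returns. The brackets are only there to show where each element starts and stops.

```shell
split_demo=$(printf 'Mike Jones, 555-1234,\n' |
    awk '{
        n = split($0, parts, ",")
        sub(/^ +/, "", parts[2])          # trim the space that followed the comma
        print n ": [" parts[1] "] [" parts[2] "] [" parts[3] "]"
    }')
echo "$split_demo"
```

This prints "3: [Mike Jones] [555-1234] []", confirming the empty third element.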

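split is also a useful way to load up an array:

split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec", month, ",")

That's a lot easier than writing out "month[1] = "Jan"", "month[2] = "Feb"", and so on.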

delete is simple: it lets you remove an element of an array. Simply:

delete myFields[2]

...would get rid of the phone number in our previous example. Of course, you can always just ignore an element, but using delete will make sure it's gone.
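A short sketch shows the difference between ignoring an element and deleting it. The array contents here are invented, and the "in" test used to check for a key is standard awk, though it hasn't come up in this column yet.

```shell
delete_demo=$(awk 'BEGIN {
    n = split("name,phone,notes", fields, ",")
    delete fields[2]
    if (!(2 in fields)) print "element 2 is gone"
    count = 0
    for (k in fields) count++     # only the surviving keys are visited
    print count " of " n " elements remain"
}')
echo "$delete_demo"
```

Once deleted, the key no longer turns up when you loop over the array, so only 2 of the 3 elements remain.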

Grab Bag

There are some final things I feel the need to mention about awk, but realized that they're each pretty short and belonged altogether in a 'grab bag' section. Without further ado, here they are.

In addition to the other built-in variables that we've covered, awk presents two built-ins as arrays: ARGV and ENVIRON. ARGV is an array that contains each command line argument, with the name of the command itself (awk) in ARGV[0]. Options such as -f, and the script file they name, are not included. If you ran your script like this:

$ awk -f argvtest.awk var1 15 "Iolo" "Shamino"

...ARGV would contain:

ARGV[0] = awk
ARGV[1] = var1
ARGV[2] = 15
ARGV[3] = Iolo
ARGV[4] = Shamino
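Here's a sketch you can run without writing a script file. It leans on ARGC, a companion built-in that holds the number of entries in ARGV. The loop starts at 1 because the exact text of ARGV[0] can vary between awk implementations. Since the program consists only of a BEGIN rule, awk never tries to open the arguments as input files.

```shell
argv_demo=$(awk 'BEGIN {
    for (i = 1; i < ARGC; i++) print i ": " ARGV[i]
}' var1 15 "Iolo" "Shamino")
echo "$argv_demo"
```

The shell strips the quotes before awk ever sees the arguments, so the last two come through as plain Iolo and Shamino.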

ENVIRON maps current environment variables to their values. For example:

print ENVIRON["OSTYPE"]

on my machine would yield "darwin8.0".
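ENVIRON only sees variables that are actually in awk's environment, so a dependable way to experiment is to set one on the command line for that single run. In this sketch, GREETING is a made-up variable name:

```shell
# GREETING is a hypothetical variable, set only for this one awk invocation.
environ_demo=$(GREETING="hello from the shell" awk 'BEGIN { print ENVIRON["GREETING"] }')
echo "$environ_demo"
```

This is a handy way to pass a value from a shell script into awk without fighting with quoting.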

In addition to split and gsub, awk contains some really useful (and common to other languages) string manipulation functions, such as substr, toupper, length and tolower. Consider that homework.
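To get the homework started, here's a small sketch that runs all four on one sample string. Remember that awk counts string positions from 1, so substr(s, 8, 3) means "three characters starting at the eighth".

```shell
string_demo=$(awk 'BEGIN {
    s = "Mac In The Shell"
    print length(s)
    print toupper(s)
    print tolower(s)
    print substr(s, 8, 3)
}')
echo "$string_demo"
```

The four lines of output are 16, "MAC IN THE SHELL", "mac in the shell" and "The".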

Finally, awk provides a way to get input outside of the main input loop. getline gets a new line of input, and can be used in two different ways. First, when used by itself, it will get the next line from input that the main loop would have gotten. This is similar to next (covered above); however, getline does not bring flow back to the top of the script. Second, you can pipe the output of a command into getline from inside awk. While more in-depth work still requires a shell script, this is my favorite way to write a quick-and-dirty awk script. One example will get you going:

$ awk 'BEGIN {"top -l 1" | getline; print $2 " processes running"}'
121 processes running

The output of top is piped into awk - directly from inside awk! You can, of course, even use that trick conditionally, and go look something up on the fly if needed.
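Since top's output differs from machine to machine, here's a sketch of the same trick with a command whose output is predictable. The echo command is just a stand-in. Closing the command with close() when you're done is a good habit, especially if you plan to run it again later in the script.

```shell
getline_demo=$(awk 'BEGIN {
    cmd = "echo 3 blind mice"
    cmd | getline          # fills $0 and the fields from the first line of output
    close(cmd)
    print $2 " is field two of " NF
}')
echo "$getline_demo"
```

After the getline, $0, NF and the field variables all describe the line the command produced, even though we're still inside BEGIN.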


To show everything that awk is capable of would take a book. I believe I've shown things that are immediately understandable and practical. Between this column and the last, you should have a good foundation to build on. Of course, there's a lot more to explore. I didn't get to multi-dimensional arrays, trig math, user-defined functions, piping to output...and more. If this has whetted your appetite, there are many resources that teach awk in-depth. Just dropping into Google and trying "learning awk" brings back an incredible number of resources.

awk is a fantastic utility that has proved its worth over decades of classic Unix use. For OS X administrators, it dovetails perfectly with the powerful command-line utilities at our disposal.

Recommended reading for the month: The Cuckoo's Egg, by Cliff Stoll. Released in 1989, this was one of the first track-a-hacker books I ever read. Of course, I could suggest some technical reference for you to dig into, but this is just good reading. I saw a copy at a bookstore not too long ago, and that made me break out my old version. Still a good read today, if not a great way to compare and contrast the technical environment of the late 1980s to today. Also, a good reminder that social engineering is timeless.

Ed Marczak owns and operates Radiotope, a technology consulting company. More tech tips at the blog:


All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.