TweetFollow Us on Twitter

awk for Data Processing - Part 2

Volume Number: 22 (2006)
Issue Number: 04
Column Tag: Programming

Mac In The Shell

awk for Data Processing - Part 2

by Edward Marczak

Revving up the engine.

Last month, I introduced you to awk, the 'pattern processor'. That laid the foundation, and merely scratched the surface of awk's power. This month, we're going to dive back into flow control, as we've seen with bash, sed, math routines, and other cool awk features. Of course, awk only becomes more powerful when combined with shell scripting and sed.

Pattern Matching

Now that, over the last few months, we've covered regexp, sed, shell globbing, and now, awk, here's a word on anytime you want to use a utility that does pattern matching. Sometimes you're not in control of the data you're going to need to sift through. However, there are times where, you are the one generating the data. This could be in the form of a report, or even from another command-line tool. In any case, try to make your life easier: don't spew out unnecessary data! For example: you may want to find out ip addresses assigned to a particular interface, so you decide to use ifconfig, and write a sed script to parse the output. The sed script can pattern match the interface and then loop through the results looking for "inet". Not bad, but you decrease your work if you specify the interface you're looking for to ifconfig. If you're using nireport, make sure you output fields in the order that you want them, rather than use awk to swap them around. If you need a file listing, look at all of the switches that will sort and add symbols to the output that will make matching easier. Always make sure you read man page for any program that you're using: you may find some surprising switches that reduce the work you do further down the chain.

Back to Basics

Part 1 of this article gave us some real awk basics - print, match a pattern, field operators and some built-in variables. The built-in variables that we covered were NF, the number of fields in a record, and FS, the field separator. Of course, there are some more built-ins that we should know about. Let's do that before proceeding.

FS separates fields during the input stage. By default, awk separates output with a space. You can define that to be anything you want, using OFS. The output field separator is generated by the comma in a print statement. So, to rewrite an example from last month, we can make the output look better:

$ ls -l | awk 'BEGIN {OFS="\t\t"} {print $5,$9,$1}'
total
182468   20050629-local.jpg   -rw-r--r--
51986   iChats   drwxrwx---
68   images   drwxr-xr-x
1345   jamlog.txt   -rw-r--r--
61440   lads.exe   -rw-r--r--
271103   mount-1.260-3.wbm.gz    -rw-r--r--
352457   mr.spx    -rw-r--r--
Figure 1 - Output Field Separator in action

Better, but still a little ragged. Don't worry! We'll fix that in a bit.

When you generate multi-line output, awk separates each record with ORS, the record separator. ORS is a newline by default, so each record starts on a new line. You can change this! Why would you want to? You can even define RS, the input record separator. Sometimes, small examples are worth 1,000 words. If you are processing data that comes in a 'block' - spread out over several lines - setting FS to "\n", the newline character, will allow awk access via the field variables. Set RS to "", and awk will split these correctly when you have multiple input records. Practical example: you suspect a problem with user records, and want to search for particular users that (may) have the same home directory assigned. Here's the script:

01. #!/bin/bash
02.
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk 'BEGIN {FS="\n"; RS=""} $0 ~ /\/Users\/marczak/ {print $0}'
06. done
Figure 2 - User search script

First, you can see it's a shell script. We'll use bash to feed awk multi-line records using dscl. Line 3 sets up a loop using all of the usernames that we have access to. Line 5 uses dscl again to get the detail for the username provided and feed that record to awk. Using a BEGIN construct, we first set FS and RS. Then, we look for "/Users/marczak" anywhere in the record using $0. If we match, we print the entire record. This way, we'll print all records that have that path as a home directory. It's a fairly specific example, but actually came in handy once. Plus, it illustrates handling multi-line records!

Finally, in our built-in round up, NR and FNR, keep the current line number available for you. NR is cumulative, and FNR gives you the number of the current record with respect to the input file. Useful if you're processing multiple files.

Low-level Format

awk is a fantastic tool for generating reports, however, reports are only really useful if they look good. The data can be good, but if it's hard to read, the brain just switches off. As you've seen, OFS and print only get you so far. awk supports a formatted print statement, printf, that you may have seen in some other languages, notably C. printf is more flexible than standard print, but requires a little more hand holding. Want an example? Here you go:

awk 'BEGIN {printf ("This is a test.\n")}'

Easy, right? So, what's different? First, you'll notice that you have to supply the newline - just like C! What I left out here, are the optional format specifiers, which again, match their counterpart in C. man printf will get you the list, if you forget them. Let's learn by example. The file listing code can be re-written with printf like this:

ls -l | awk '{printf "%s\t\t%s\t\t%s\n",$5, $9, $1}'

This means, print a string ("%s"), two tabs ("\t"), a string, two tabs and a string, followed up by a newline character. Each format specifier needs a corresponding value after the format string to fill in the place-holder with. We're substituting each %s with a field - $5, $9 and $1, respectively. However, this really is the equivalent of the earlier code - it's still ragged! printf also allows you to supply the width and alignment of the output. So, to clean up our listing, we can use this:

$ ls -l | awk 'NR > 1 {printf "%-20s%-20s%-20s\n",$5, $9, $1}'
306                 dist                   drwxr-xr-x          
42364               httpd.conf             -rw-r--r--          
37417               httpd.conf.bak   -rw-r--r--          
38334               httpd.conf.default   -rw-r--r--          
38334               httpd.conf.dist        -rw-r--r--          
12965               magic                  -rw-r--r--          
12965               magic.default          -rw-r--r--          
15201               mime.types             -rw-r--r--          
15201               mime.types.default     -rw-r--r--          
204                 users                  drwxr-xr-x
Figure 3 - Width specifiers

That's much nicer! Explanation: instead of "%s", we can specify a width using "%20s" - "20" being the width. By default, the output is right justified in the space allotted. I added the hyphen - "%-20s" - to our example, to left justify the text.

Flow Control...Again

Depending on how long you've been reading this column, you've seen this before: we've covered looping and decision-making in bash and in sed. Well, flow is important! So, let's see how awk handles these constructs.

The most basic of tests is an if/then test. The pattern matching we've seen is essentially an if/then test that is applied to all input. However, if you've matched something basic, and then need to make further decisions, you can use if/then. Let's combine this with an example using a loop.

As you may have seen in other languages, awk has a while loop that conditionally executes a block of code. Here's the idea:

while (condition is true) action

Like other languages, you can have a line feed between the condition and action, and if the action is multiple lines, they must be contained in curly-braces. I'm actually going to throw a few new things in, and then explain. Let's re-write our user search script from earlier (Figure 2):

01. #!/bin/bash
02.
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk '
06. BEGIN {FS="\n"; RS=""}
07. $0 ~ /\/Users\/marczak/ {
08. i=1
09. while (i<=NF) {
10.         if ($i ~ /Dir/ || $i ~ /ID/ || $i ~ /Shell/) print $i
11.         i++
12. }
13. }
14. '
15. done
Figure 4 - A while loop in action

Once again, this is a shell script that feeds full blocks of data into awk. The dscl statements on lines 3 and 5 are identical to the ones from the first script. Look how we break up the awk script across multiple lines from there. Line 5 ends with a single quote, which allows bash to treat everything up until the next single quote as continuous code. Note the closing single quote on line 14! Again, we're going to look for a match on the entire input ($0) - looking for "/Users/Marczak" again. If and when we find it, that's where our adventure begins.

Line 8 initializes the variable "i" to 1. Not zero. We're going to index through fields, and don't need to check $0 again! Line 9 shows off our while loop. As long as i doesn't exceed the number of fields on the input record, we execute the loop. First time through, i=1, and we can use it to reference the first field of input ($1). Line 10 - an if statement! Lovely! If we find that the field we're currently looking at contains "Dir" or ("||") "ID" or "Shell", we print that field. Then, we increment i on line 11 so we don't loop around forever - and, we reference the next field in the next iteration of the loop.

Really cool stuff here: using the built-in NF variable as a comparison in our while loop, using a variable for the field reference, using classic Unix utilities with OS X specific CLI programs....nice. In addition to a while loop, awk supports the familiar "for" and "do" loop constructs. And as you may have guessed, you may recognize them already.

A "do" loop is a variant of the "while" loop. Its main difference is that the action is always executed at least once. It looks like this:

do {
   action
} while (condition is true)
Need to see it in action?  Here you go:
BEGIN {
        numMice = 5
        catTime = 3
        do {
                theCatIsAway++
                theMicePlay = numMice * theCatIsAway
                if (theCatIsAway > catTime) theCatIsAway = 0
        } while (theCatIsAway)
        print "The mice played " theMicePlay " days."
}
Figure 5 - An example do loop

Yes, it's a completely contrived example so I could use "while the cat is away"...I needed to bring a little levity to this column. This example does illustrate a little math, though, which I haven't explicitly covered.

The for loop borrows its syntax from C and should be pretty recognizable:

for (initialize; test conditions; increment) {
   actions
}

Rewriting the previous loop using for would look like this:

01. BEGIN { 
02.         numMice = 5
03.         catTime = 3
04.         for (theCatIsAway = 1; theCatIsAway > 0; theCatIsAway++) {
05.                 theMicePlay = numMice * theCatIsAway
06.                 if (theCatIsAway > catTime) theCatIsAway = -1
07.         }
08.         print "The mice played " theMicePlay " days"
09. }
Figure 6 - an example for loop

Look at that for loop! It's a thing of beauty! No, really...I'm serious! (Outside of the fact that it ruins my play on words). It lets you take care of everything you need for a loop. Note, however, that the increment happens at the bottom of the loop. This is important, and is the reason, we set theCatIsAway to -1 rather than 0 on line 6. Otherwise, our test would never be true, and we'd get caught in an infinite loop.

Once again, like other languages, awk lets us skip an iteration of a loop, or break out altogether. Inside of a loop, the break keyword breaks out of the loop, and ends it:

do {
   if (leaveLoopNow) break
   procedure_one(x,y,z)
   procedure_two(x,y)
   transform_one(x,y)
} while (x < currentThreshhold)

In this example, if leaveLoopNow is true, execute the break statement and bail out of the loop - never to execute the remainder of the loop, picking up execution following the loop.

A less drastic version of break, is continue. A short example will make it clear:

do {
   if (notThisTime) continue
   checkVars(x,y)
   transform_one(x,y)
} while (x < currentThreshhold)

Here, if notThisTime is true, we just go back to the top of the loop. But the loop will continue, as long as our condition is true.

There are also two flow-altering statements that affect awk's entire flow - next and exit. The simpler of the two is exit. When awk encounters the exit statement, it jumps to the END rule. Of course, you don't even have to have and END rule defined. In that case, the script just terminates. Note that exit can supply a value to use as awk's exit code. Nice way to test success or failure in a shell script. exit without a value defaults to "0". next transfers control back to the top of the script where awk will read the next record of input. This is useful in a few different situations. If you only want to process records in a file that has 5 fields, simply sue this rule:

NF != 5 {next}

That's also useful for error checking, whereby if the target input doesn't 'look' right, you can just crank through the file. Perhaps even keeping count of how many records you skipped for use in an exception report.

Arrays

Here's something that I can't tell you I've covered before. Certainly not in sed, nor in bash. Of course there are other languages that support arrays, so some of this may look familiar. But, if this column has been your introduction to anything remotely related to programming or scripting, this will be slightly new. An array is simply a variable that lets you hold a series of values. Being a loosely typed language, all arrays in awk are associative arrays - arrays that map keys to values. Associative arrays do not need to use integers as the key, or subscript, nor does every value need to be of equal type and size. If you have a PHP background, you'll understand this innately. Naturally, examples are forthcoming. Like other variables in awk, arrays do not need to be declared, so you can just use them:

array[key] = value

Often, you'll see simple numeric keys (subscripts) - useful when loading data in from a file, and you want to track something from every record, or mark certain records based on a value. Just as often, though, you'll see a key; a string that maps to a value. We can use this feature like this:

BEGIN { color["red"] = "0xF00"
color["green"] = "0x0F0"
color["blue"] = "0x00F"
print color["red"]
}

Nice, right? Arrays let us keep related values together. You can also use a variable as the key. Here's a totally trivial example that illustrates a few new concepts:

01. #!/bin/bash
02. /System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport -I | awk '
03. BEGIN { FS=":" }
04. {
05. gsub(" ","",$1)
06. recordlist[$1] = $2
07. }
08. END {
09. for (key in recordlist)
10.         print "The " key " is equal to" recordlist[key]
11. }
12. '
Figure 7 - Many new concepts!

Once again, I wrap awk in a bash shell script. First new thing, may be the airport command. With the "-I" switch, it gives you information about your current airport status. Next new thing is on line 5: gsub. Sometimes, exposure to many languages is a bit of a curse. When I see this command, I always think back to BASIC's "gosub" (go to subroutine) command. In awk's case, however, it stands for global substitue. I'm using it here just to clean up the output a bit. It's really powerful, though, and works like this:

gsub(regexp, substitution, string)

Now it's apparent; I'm just removing the spaces from $1: a space (" ") is being replaced with nothing ("") in the string $1. Now, look what's happening on line 6 - the value of $1 is being used as the key in the array "recordlist". It's also being assigned the value of $2, the second field. Then on line 9 in the END pattern, there's a new flow control statement. A variant on a for loop, we have some special syntax that accesses each element of an array in turn. "key" is a made-up variable. Right there, on the spot. It could really be whatever we like, but as with all variables, it should be something somewhat meaningful. This variable will contain the current key name in each iteration of the loop.

While there's a lot more to arrays, I'd be remiss if I didn't mention two functions: split and delete. split splits a string into an array based on a separator. This is just like awk's main loop function that breaks input into fields based on FS. If you have awk reading a CSV file, you could use split thusly:

x = split($0, myFields, ",")

What this does is create an array - myFields - that contains each 'field' of $0, fields being separated with a comma. split returns the number of fields in the string, in our case, putting the result into "x". If the input looked like this:

Mike Jones, 555-1234, mikej@example.com
Bill Smith, 555-0984, bills@example.com
Sally Foster, 555-3456, sallyf@example.com

...then during the first pass, myFields would contain:

myFields[1] = "Mike Jones"
myFields[2] = "555-1234"
myFields[3] = "mikej@example.com"

split is also a useful way to load up an array:

split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec",months,",")

That's a lot easier than writing out "month[1] = "...

delete is simple: it lets you remove an element of an array. Simply:

delete myFields[2]

...would get rid of the phone number in our previous example. Of course, you can always just ignore a field, using delete will make sure it's gone if you choose to use for...in.

Grab Bag

There are some final things I feel the need to mention about awk, but realized that they're each pretty short and belonged altogether in a 'grab bag' section. Without further ado, here they are.

In addition to the other built-in variables that we've covered, awk presents two built-ins as arrays: ARGV and ENVIRON. ARGV is an array that contains each command line argument, including the script name itself in ARGV[0]. If you ran your script like this:

$ awk -f argvtest.awk var1 15 "Iolo" "Shamino"

...ARGV would contain:

ARGV[0] = awk
ARGV[1] = var1
ARGV[2] = 15
ARGV[3] = "Iolo"
ARGV[4] = "Shamino"

ENVIRON maps current environment variables to their values. For example:

print ENVIRON[OSTYPE]

on my machine would yield "darwin8.0".

In addition to split and gsub, awk contains some really useful (and common to other languages) string manipulation functions, such as substr, toupper, length and tolower. Consider that homework.

Finally, awk provides a way to get input outside of the main input loop. getline gets a new line of input, and can be used in two different ways. First, when used by itself, it will get the next line from input that the main loop would have gotten. This is similar to next (covered above), however, getline does not bring flow back to the top of the script. Secondarily, you can pipe input into awk and read it with getline. While more in-depth work still requires a shell script, this is my favorite way to write a quick-and-dirty awk script. One example will get you going:

$ awk 'BEGIN {"top -l 1" | getline; print $2 " processes running"}'
121 processes running

The output of top is piped into awk - directly from inside awk! You can, of course, even use that trick conditionally, and go look something up on the fly if needed.

Conclusion

To show everything that awk is capable of would take a book. I believe I've shown things that are immediately understandable and practical. Between this column and the last, you should have a good foundation to build on. Of course, there's a lot more to explore. I didn't get to multi-dimensional arrays, trig math, user-defined functions, piping to output...and more. If this has whetted your appetite, there are many resources that teach awk in-depth. Just dropping into Google and trying "learning awk" brings back an incredible number of resources.

awk is a fantastic utility that has proved its worth over decades of classic Unix use. For OS X administrators, it dovetails perfectly with the powerful command-line utilities at our disposal.

Recommended reading for the month: Cuckoo's Egg, by Cliff Stoll. Released in 1989, this was one of the first track-a-hacker books I ever read. Of course, I could suggest some technical reference for you to dig into, but this is just good reading. I saw a copy at a bookstore not too long ago, and that made me break out my old version. Still a good read today, if not a great way to compare and contrast the technical environment of the late 1980s to today. Also, a good reminder that social engineering is timeless.


Ed Marczak owns and operates Radiotope, a technology consulting company. More tech tips at the blog:

http://www.radiotope.com/writing

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

How to build a successful civilisation i...
GodFinger 2 grants you godlike powers, leaving you to raise a civilization of followers. In the spirit of games like Black & White, the GodFinger games will see you building bigger and better villages, developing more advanced technology and... | Read more »
How to get all the crabs in Mr Crab 2
Mr. Crab 2 may look like a cutesy platformer for kids, but if you're the kind of person who likes to complete a game 100%, you'll soon realise that it's a tougher than a crustacean's shell. [Read more] | Read more »
How to be a star in Britney Spears: Amer...
If you've ever wanted to be a star, baby, then you've probably already checked out Britney Spears: American Dream and are happily making your way up the charts. But fame doesn't come easy, and everyone needs a helping hand sometimes. So we've got... | Read more »
AppSpy is hiring a part time Staff Write...
| Read more »
How to save lives in ER Surgery Simulato...
A serious earthquake has struck a nearby town in ER Surgery Simulator - Emergency Doctor, and it’s up to you to save the victims. [Read more] | Read more »
Tips and tricks to get a high score in G...
Ketchapp Games loves the endless runner genre. And its newest game, Gravity Switch, is no exception. Gravity Switch takes a fresh approach, though, as you move a block, suspended in zero gravity, safely through a maze of shifting pillars. If the... | Read more »
Tips and tricks to get a high score in S...
Smash Fu is a high-paced tile-tapping game that requires quick reflexes and some practice. You’ll have to smash bricks with the skill of a seasoned black belt to get a high score. To raise the stakes a bit, you’ll also have to avoid tapping any... | Read more »
How to keep the ball rolling in Dropple
If you're new to the minimalist puzzler Dropple, you may find yourself struggling to make it beyond the first couple of steps before your ball falls into the endless abyss below. [Read more] | Read more »
Game Craft releases new Legend of War ti...
Set for release at the end of this month, real time strategy title Legend of War seems sure to delight with a veritable feast of sweet features to get stuck into. Developed by Game Craft, the game is due for release through both the App Store and... | Read more »
How not to die in Traffic Rider
Traffic Rider, an Out Run-esque game in which your ride a motorcycle recklessly into trffic, might not seem particularly complicated. [Read more] | Read more »

Price Scanner via MacPrices.net

Apple refurbished iMacs available for up to $...
Apple has Certified Refurbished 2015 21″ & 27″ iMacs available for up to $350 off MSRP. Apple’s one-year warranty is standard, and shipping is free. The following models are available: - 21″ 3.... Read more
Textkraft Professional Becomes A Mobile Produ...
The new update 4.1 of Textkraft Professional for the iPad comes with many new and updated features that will be particularly of interest to self-publishers of e-books. Highlights include import and... Read more
SnipNotes 2.0 – Intelligent note-taking for i...
Indie software developer Felix Lisczyk has announced the release and immediate availability of SnipNotes 2.0, the next major version of his productivity app for iOS devices and Apple Watch.... Read more
Pitch Clock – The Entrepreneur’s Wingman Laun...
Grand Rapids, Michigan based Skunk Tank has announced the release and immediate availability of Pitch Clock – The Entrepreneur’s Wingman 1.1, the company’s new business app available exclusively on... Read more
13-inch 2.9GHz Retina MacBook Pro on sale for...
B&H Photo has the 13″ 2.9GHz Retina MacBook Pro (model #MF841LL/A) on sale for $1599 including free shipping plus NY tax only. Their price is $200 off MSRP. Amazon also has the 13″ 3.9GHz Retina... Read more
Apple price trackers, updated continuously
Scan our Apple Price Trackers for the latest information on sales, bundles, and availability on systems from Apple’s authorized internet/catalog resellers. We update the trackers continuously: - 15″... Read more
Clearance 12-inch Retina MacBooks available s...
B&H Photo has dropped prices on leftover 2015 12″ Retina MacBooks with models now available starting at $999. Shipping is free, and B&H charges NY tax only: - 12″ 1.1GHz Gray Retina MacBook... Read more
Check Apple prices on any device with the iTr...
MacPrices is proud to offer readers a free iOS app (iPhones, iPads, & iPod touch) and Android app (Google Play and Amazon App Store) called iTracx, which allows you to glance at today’s lowest... Read more
New 2016 13-inch 256GB MacBook Air on sale fo...
B&H Photo has the new 13″ 1.6GHz/256GB MacBook Air (model MMGG2LL/A) on sale for $1149 including free shipping plus NY sales tax only. Their price is $50 off MSRP. Amazon has the 13″ 1.6GHz/256GB... Read more
Apple refurbished iPad Air 2s available start...
Apple has Certified Refurbished iPad Air 2 available starting at $339. Apple’s one-year warranty is included with each model, and shipping is free: - 128GB Wi-Fi iPad Air 2: $499 - 64GB Wi-Fi iPad... Read more

Jobs Board

*Apple* Nissan Service Technicians - Apple A...
Apple Automotive is one of the fastest growing dealer...and it shows. Consider making the switch to the Apple Automotive Group today! At Apple Automotive , Read more
ISCS *Apple* ID Site Support Engineer - APP...
…position, we are looking for an individual who has experience supporting customers with Apple ID issues and enjoys this area of support. This person should be Read more
Automotive Sales Consultant - Apple Ford Linc...
…you. The best candidates are smart, technologically savvy and are customer focused. Apple Ford Lincoln Apple Valley is different, because: $30,000 annual salary Read more
*Apple* Support Technician II - Worldventure...
…global, fast growing member based travel company, is currently sourcing for an Apple Support Technician II to be based in our Plano headquarters. WorldVentures is Read more
Restaurant Manager (Neighborhood Captain) - A...
…in every aspect of daily operation. WHY YOU'LL LIKE IT: You'll be the Big Apple . You'll solve problems. You'll get to show your ability to handle the stress and Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.