TweetFollow Us on Twitter

awk for Data Processing - Part 2

Volume Number: 22 (2006)
Issue Number: 04
Column Tag: Programming

Mac In The Shell

awk for Data Processing - Part 2

by Edward Marczak

Revving up the engine.

Last month, I introduced you to awk, the 'pattern processor'. That laid the foundation, and merely scratched the surface of awk's power. This month, we're going to dive back into flow control, as we've seen with bash, sed, math routines, and other cool awk features. Of course, awk only becomes more powerful when combined with shell scripting and sed.

Pattern Matching

Now that, over the last few months, we've covered regexp, sed, shell globbing, and now, awk, here's a word on anytime you want to use a utility that does pattern matching. Sometimes you're not in control of the data you're going to need to sift through. However, there are times where, you are the one generating the data. This could be in the form of a report, or even from another command-line tool. In any case, try to make your life easier: don't spew out unnecessary data! For example: you may want to find out ip addresses assigned to a particular interface, so you decide to use ifconfig, and write a sed script to parse the output. The sed script can pattern match the interface and then loop through the results looking for "inet". Not bad, but you decrease your work if you specify the interface you're looking for to ifconfig. If you're using nireport, make sure you output fields in the order that you want them, rather than use awk to swap them around. If you need a file listing, look at all of the switches that will sort and add symbols to the output that will make matching easier. Always make sure you read man page for any program that you're using: you may find some surprising switches that reduce the work you do further down the chain.

Back to Basics

Part 1 of this article gave us some real awk basics - print, match a pattern, field operators and some built-in variables. The built-in variables that we covered were NF, the number of fields in a record, and FS, the field separator. Of course, there are some more built-ins that we should know about. Let's do that before proceeding.

FS separates fields during the input stage. By default, awk separates output with a space. You can define that to be anything you want, using OFS. The output field separator is generated by the comma in a print statement. So, to rewrite an example from last month, we can make the output look better:

$ ls -l | awk 'BEGIN {OFS="\t\t"} {print $5,$9,$1}'
total
182468   20050629-local.jpg   -rw-r--r--
51986   iChats   drwxrwx---
68   images   drwxr-xr-x
1345   jamlog.txt   -rw-r--r--
61440   lads.exe   -rw-r--r--
271103   mount-1.260-3.wbm.gz    -rw-r--r--
352457   mr.spx    -rw-r--r--
Figure 1 - Output Field Separator in action

Better, but still a little ragged. Don't worry! We'll fix that in a bit.

When you generate multi-line output, awk separates each record with ORS, the record separator. ORS is a newline by default, so each record starts on a new line. You can change this! Why would you want to? You can even define RS, the input record separator. Sometimes, small examples are worth 1,000 words. If you are processing data that comes in a 'block' - spread out over several lines - setting FS to "\n", the newline character, will allow awk access via the field variables. Set RS to "", and awk will split these correctly when you have multiple input records. Practical example: you suspect a problem with user records, and want to search for particular users that (may) have the same home directory assigned. Here's the script:

01. #!/bin/bash
02.
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk 'BEGIN {FS="\n"; RS=""} $0 ~ /\/Users\/marczak/ {print $0}'
06. done
Figure 2 - User search script

First, you can see it's a shell script. We'll use bash to feed awk multi-line records using dscl. Line 3 sets up a loop using all of the usernames that we have access to. Line 5 uses dscl again to get the detail for the username provided and feed that record to awk. Using a BEGIN construct, we first set FS and RS. Then, we look for "/Users/marczak" anywhere in the record using $0. If we match, we print the entire record. This way, we'll print all records that have that path as a home directory. It's a fairly specific example, but actually came in handy once. Plus, it illustrates handling multi-line records!

Finally, in our built-in round up, NR and FNR, keep the current line number available for you. NR is cumulative, and FNR gives you the number of the current record with respect to the input file. Useful if you're processing multiple files.

Low-level Format

awk is a fantastic tool for generating reports, however, reports are only really useful if they look good. The data can be good, but if it's hard to read, the brain just switches off. As you've seen, OFS and print only get you so far. awk supports a formatted print statement, printf, that you may have seen in some other languages, notably C. printf is more flexible than standard print, but requires a little more hand holding. Want an example? Here you go:

awk 'BEGIN {printf ("This is a test.\n")}'

Easy, right? So, what's different? First, you'll notice that you have to supply the newline - just like C! What I left out here, are the optional format specifiers, which again, match their counterpart in C. man printf will get you the list, if you forget them. Let's learn by example. The file listing code can be re-written with printf like this:

ls -l | awk '{printf "%s\t\t%s\t\t%s\n",$5, $9, $1}'

This means, print a string ("%s"), two tabs ("\t"), a string, two tabs and a string, followed up by a newline character. Each format specifier needs a corresponding value after the format string to fill in the place-holder with. We're substituting each %s with a field - $5, $9 and $1, respectively. However, this really is the equivalent of the earlier code - it's still ragged! printf also allows you to supply the width and alignment of the output. So, to clean up our listing, we can use this:

$ ls -l | awk 'NR > 1 {printf "%-20s%-20s%-20s\n",$5, $9, $1}'
306                 dist                   drwxr-xr-x          
42364               httpd.conf             -rw-r--r--          
37417               httpd.conf.bak   -rw-r--r--          
38334               httpd.conf.default   -rw-r--r--          
38334               httpd.conf.dist        -rw-r--r--          
12965               magic                  -rw-r--r--          
12965               magic.default          -rw-r--r--          
15201               mime.types             -rw-r--r--          
15201               mime.types.default     -rw-r--r--          
204                 users                  drwxr-xr-x
Figure 3 - Width specifiers

That's much nicer! Explanation: instead of "%s", we can specify a width using "%20s" - "20" being the width. By default, the output is right justified in the space allotted. I added the hyphen - "%-20s" - to our example, to left justify the text.

Flow Control...Again

Depending on how long you've been reading this column, you've seen this before: we've covered looping and decision-making in bash and in sed. Well, flow is important! So, let's see how awk handles these constructs.

The most basic of tests is an if/then test. The pattern matching we've seen is essentially an if/then test that is applied to all input. However, if you've matched something basic, and then need to make further decisions, you can use if/then. Let's combine this with an example using a loop.

As you may have seen in other languages, awk has a while loop that conditionally executes a block of code. Here's the idea:

while (condition is true) action

Like other languages, you can have a line feed between the condition and action, and if the action is multiple lines, they must be contained in curly-braces. I'm actually going to throw a few new things in, and then explain. Let's re-write our user search script from earlier (Figure 2):

01. #!/bin/bash
02.
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk '
06. BEGIN {FS="\n"; RS=""}
07. $0 ~ /\/Users\/marczak/ {
08. i=1
09. while (i<=NF) {
10.         if ($i ~ /Dir/ || $i ~ /ID/ || $i ~ /Shell/) print $i
11.         i++
12. }
13. }
14. '
15. done
Figure 4 - A while loop in action

Once again, this is a shell script that feeds full blocks of data into awk. The dscl statements on lines 3 and 5 are identical to the ones from the first script. Look how we break up the awk script across multiple lines from there. Line 5 ends with a single quote, which allows bash to treat everything up until the next single quote as continuous code. Note the closing single quote on line 14! Again, we're going to look for a match on the entire input ($0) - looking for "/Users/Marczak" again. If and when we find it, that's where our adventure begins.

Line 8 initializes the variable "i" to 1. Not zero. We're going to index through fields, and don't need to check $0 again! Line 9 shows off our while loop. As long as i doesn't exceed the number of fields on the input record, we execute the loop. First time through, i=1, and we can use it to reference the first field of input ($1). Line 10 - an if statement! Lovely! If we find that the field we're currently looking at contains "Dir" or ("||") "ID" or "Shell", we print that field. Then, we increment i on line 11 so we don't loop around forever - and, we reference the next field in the next iteration of the loop.

Really cool stuff here: using the built-in NF variable as a comparison in our while loop, using a variable for the field reference, using classic Unix utilities with OS X specific CLI programs....nice. In addition to a while loop, awk supports the familiar "for" and "do" loop constructs. And as you may have guessed, you may recognize them already.

A "do" loop is a variant of the "while" loop. Its main difference is that the action is always executed at least once. It looks like this:

do {
   action
} while (condition is true)
Need to see it in action?  Here you go:
BEGIN {
        numMice = 5
        catTime = 3
        do {
                theCatIsAway++
                theMicePlay = numMice * theCatIsAway
                if (theCatIsAway > catTime) theCatIsAway = 0
        } while (theCatIsAway)
        print "The mice played " theMicePlay " days."
}
Figure 5 - An example do loop

Yes, it's a completely contrived example so I could use "while the cat is away"...I needed to bring a little levity to this column. This example does illustrate a little math, though, which I haven't explicitly covered.

The for loop borrows its syntax from C and should be pretty recognizable:

for (initialize; test conditions; increment) {
   actions
}

Rewriting the previous loop using for would look like this:

01. BEGIN { 
02.         numMice = 5
03.         catTime = 3
04.         for (theCatIsAway = 1; theCatIsAway > 0; theCatIsAway++) {
05.                 theMicePlay = numMice * theCatIsAway
06.                 if (theCatIsAway > catTime) theCatIsAway = -1
07.         }
08.         print "The mice played " theMicePlay " days"
09. }
Figure 6 - an example for loop

Look at that for loop! It's a thing of beauty! No, really...I'm serious! (Outside of the fact that it ruins my play on words). It lets you take care of everything you need for a loop. Note, however, that the increment happens at the bottom of the loop. This is important, and is the reason, we set theCatIsAway to -1 rather than 0 on line 6. Otherwise, our test would never be true, and we'd get caught in an infinite loop.

Once again, like other languages, awk lets us skip an iteration of a loop, or break out altogether. Inside of a loop, the break keyword breaks out of the loop, and ends it:

do {
   if (leaveLoopNow) break
   procedure_one(x,y,z)
   procedure_two(x,y)
   transform_one(x,y)
} while (x < currentThreshhold)

In this example, if leaveLoopNow is true, execute the break statement and bail out of the loop - never to execute the remainder of the loop, picking up execution following the loop.

A less drastic version of break, is continue. A short example will make it clear:

do {
   if (notThisTime) continue
   checkVars(x,y)
   transform_one(x,y)
} while (x < currentThreshhold)

Here, if notThisTime is true, we just go back to the top of the loop. But the loop will continue, as long as our condition is true.

There are also two flow-altering statements that affect awk's entire flow - next and exit. The simpler of the two is exit. When awk encounters the exit statement, it jumps to the END rule. Of course, you don't even have to have and END rule defined. In that case, the script just terminates. Note that exit can supply a value to use as awk's exit code. Nice way to test success or failure in a shell script. exit without a value defaults to "0". next transfers control back to the top of the script where awk will read the next record of input. This is useful in a few different situations. If you only want to process records in a file that has 5 fields, simply sue this rule:

NF != 5 {next}

That's also useful for error checking, whereby if the target input doesn't 'look' right, you can just crank through the file. Perhaps even keeping count of how many records you skipped for use in an exception report.

Arrays

Here's something that I can't tell you I've covered before. Certainly not in sed, nor in bash. Of course there are other languages that support arrays, so some of this may look familiar. But, if this column has been your introduction to anything remotely related to programming or scripting, this will be slightly new. An array is simply a variable that lets you hold a series of values. Being a loosely typed language, all arrays in awk are associative arrays - arrays that map keys to values. Associative arrays do not need to use integers as the key, or subscript, nor does every value need to be of equal type and size. If you have a PHP background, you'll understand this innately. Naturally, examples are forthcoming. Like other variables in awk, arrays do not need to be declared, so you can just use them:

array[key] = value

Often, you'll see simple numeric keys (subscripts) - useful when loading data in from a file, and you want to track something from every record, or mark certain records based on a value. Just as often, though, you'll see a key; a string that maps to a value. We can use this feature like this:

BEGIN { color["red"] = "0xF00"
color["green"] = "0x0F0"
color["blue"] = "0x00F"
print color["red"]
}

Nice, right? Arrays let us keep related values together. You can also use a variable as the key. Here's a totally trivial example that illustrates a few new concepts:

01. #!/bin/bash
02. /System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport -I | awk '
03. BEGIN { FS=":" }
04. {
05. gsub(" ","",$1)
06. recordlist[$1] = $2
07. }
08. END {
09. for (key in recordlist)
10.         print "The " key " is equal to" recordlist[key]
11. }
12. '
Figure 7 - Many new concepts!

Once again, I wrap awk in a bash shell script. First new thing, may be the airport command. With the "-I" switch, it gives you information about your current airport status. Next new thing is on line 5: gsub. Sometimes, exposure to many languages is a bit of a curse. When I see this command, I always think back to BASIC's "gosub" (go to subroutine) command. In awk's case, however, it stands for global substitue. I'm using it here just to clean up the output a bit. It's really powerful, though, and works like this:

gsub(regexp, substitution, string)

Now it's apparent; I'm just removing the spaces from $1: a space (" ") is being replaced with nothing ("") in the string $1. Now, look what's happening on line 6 - the value of $1 is being used as the key in the array "recordlist". It's also being assigned the value of $2, the second field. Then on line 9 in the END pattern, there's a new flow control statement. A variant on a for loop, we have some special syntax that accesses each element of an array in turn. "key" is a made-up variable. Right there, on the spot. It could really be whatever we like, but as with all variables, it should be something somewhat meaningful. This variable will contain the current key name in each iteration of the loop.

While there's a lot more to arrays, I'd be remiss if I didn't mention two functions: split and delete. split splits a string into an array based on a separator. This is just like awk's main loop function that breaks input into fields based on FS. If you have awk reading a CSV file, you could use split thusly:

x = split($0, myFields, ",")

What this does is create an array - myFields - that contains each 'field' of $0, fields being separated with a comma. split returns the number of fields in the string, in our case, putting the result into "x". If the input looked like this:

Mike Jones, 555-1234, mikej@example.com
Bill Smith, 555-0984, bills@example.com
Sally Foster, 555-3456, sallyf@example.com

...then during the first pass, myFields would contain:

myFields[1] = "Mike Jones"
myFields[2] = "555-1234"
myFields[3] = "mikej@example.com"

split is also a useful way to load up an array:

split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec",months,",")

That's a lot easier than writing out "month[1] = "...

delete is simple: it lets you remove an element of an array. Simply:

delete myFields[2]

...would get rid of the phone number in our previous example. Of course, you can always just ignore a field, using delete will make sure it's gone if you choose to use for...in.

Grab Bag

There are some final things I feel the need to mention about awk, but realized that they're each pretty short and belonged altogether in a 'grab bag' section. Without further ado, here they are.

In addition to the other built-in variables that we've covered, awk presents two built-ins as arrays: ARGV and ENVIRON. ARGV is an array that contains each command line argument, including the script name itself in ARGV[0]. If you ran your script like this:

$ awk -f argvtest.awk var1 15 "Iolo" "Shamino"

...ARGV would contain:

ARGV[0] = awk
ARGV[1] = var1
ARGV[2] = 15
ARGV[3] = "Iolo"
ARGV[4] = "Shamino"

ENVIRON maps current environment variables to their values. For example:

print ENVIRON[OSTYPE]

on my machine would yield "darwin8.0".

In addition to split and gsub, awk contains some really useful (and common to other languages) string manipulation functions, such as substr, toupper, length and tolower. Consider that homework.

Finally, awk provides a way to get input outside of the main input loop. getline gets a new line of input, and can be used in two different ways. First, when used by itself, it will get the next line from input that the main loop would have gotten. This is similar to next (covered above), however, getline does not bring flow back to the top of the script. Secondarily, you can pipe input into awk and read it with getline. While more in-depth work still requires a shell script, this is my favorite way to write a quick-and-dirty awk script. One example will get you going:

$ awk 'BEGIN {"top -l 1" | getline; print $2 " processes running"}'
121 processes running

The output of top is piped into awk - directly from inside awk! You can, of course, even use that trick conditionally, and go look something up on the fly if needed.

Conclusion

To show everything that awk is capable of would take a book. I believe I've shown things that are immediately understandable and practical. Between this column and the last, you should have a good foundation to build on. Of course, there's a lot more to explore. I didn't get to multi-dimensional arrays, trig math, user-defined functions, piping to output...and more. If this has whetted your appetite, there are many resources that teach awk in-depth. Just dropping into Google and trying "learning awk" brings back an incredible number of resources.

awk is a fantastic utility that has proved its worth over decades of classic Unix use. For OS X administrators, it dovetails perfectly with the powerful command-line utilities at our disposal.

Recommended reading for the month: Cuckoo's Egg, by Cliff Stoll. Released in 1989, this was one of the first track-a-hacker books I ever read. Of course, I could suggest some technical reference for you to dig into, but this is just good reading. I saw a copy at a bookstore not too long ago, and that made me break out my old version. Still a good read today, if not a great way to compare and contrast the technical environment of the late 1980s to today. Also, a good reminder that social engineering is timeless.


Ed Marczak owns and operates Radiotope, a technology consulting company. More tech tips at the blog:

http://www.radiotope.com/writing

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Is there cross-platform play in slither....
So you've sunken plenty of hours into crawling around in slither.io on your iPhone or iPad. You've got your stories of tragedy and triumph, the times you coiled four snakes at one time balanced out by the others when you had a length of more than... | Read more »
Rodeo Stampede guide to running a better...
In Rodeo Stampede, honing your skills so you can jump from animal to animal and outrun the herd as long as possible is only half the fun. Once you've tamed a few animals, you can bring them home with you. [Read more] | Read more »
VoxSyn (Music)
VoxSyn 1.0 Device: iOS Universal Category: Music Price: $6.99, Version: 1.0 (iTunes) Description: VoxSyn turns your voice into the most flexible vocal sound generator ever. Instantly following even subtle modulations of pitch and... | Read more »
Catch Battleplans on Google Play from Ju...
Real-time strategy title Battleplans is due for release on Google Play on June 30th, following its release for iOS systems last month. With its simple interface and pretty graphics, the crowd-pleaser brings a formerly overlooked genre out for the... | Read more »
iDoyle: The interactive Adventures of Sh...
iDoyle: The interactive Adventures of Sherlock Holmes - A Scandal in Bohemia 1.0 Device: iOS Universal Category: Books Price: $1.99, Version: 1.0 (iTunes) Description: Special Release Price $1.99 (Normally $3.99) | Read more »
Five popular free apps to help you slim...
Thanks to retail and advertising, we're used to thinking one season ahead. Here we are just a week into the summer and we're conditioned to start thinking about the fall. [Read more] | Read more »
How to ride longer and tame more animals...
It's hard to accurately describe Rodeo Stampede to people who haven't seen it yet. It's like if someone took Crossy Roadand Disco Zoo and put them in a blender, yet with a unique game mechanic that's still simple and fun for anyone. [Read more] | Read more »
Teeny Titans - A Teen Titans Go! Figure...
Teeny Titans - A Teen Titans Go! Figure Battling Game 1.0.0 Device: iOS Universal Category: Games Price: $3.99, Version: 1.0.0 (iTunes) Description: Teeny Titans, GO! Join Robin for a figure battling RPG of epic proportions! TEENY... | Read more »
NinjAwesome: Tips and tricks to be a mor...
Sorry about that headline, but I'm going to go ahead and assume that GameResort would not have named its game NinjAwesome without expecting some of that. It is, in fact, pretty awesome the way it combines an endless runner and old school arcade... | Read more »
Into Mirror (Games)
Into Mirror 1.0 Device: iOS Universal Category: Games Price: $1.99, Version: 1.0 (iTunes) Description: "Is all that we see or seem, but a dream within a dream?"- Edgar Allan Poe New game by Lemon Jam Studio, the team behind Pursuit... | Read more »

Price Scanner via MacPrices.net

15-inch Retina MacBook Pros on sale for $200-...
B&H Photo has 15″ Retina MacBook Pros on sale for up to $210 off MSRP. Shipping is free, and B&H charges NY tax only: - 15″ 2.2GHz Retina MacBook Pro: $1799.99 $200 off MSRP - 15″ 2.5GHz... Read more
Mac minis on sale for up to $100 off MSRP
B&H Photo has Mac minis on sale for up to $100 off MSRP including free shipping plus NY sales tax only: - 1.4GHz Mac mini: $449 $50 off MSRP - 2.6GHz Mac mini: $649 $50 off MSRP - 2.8GHz Mac mini... Read more
Clearance 2015 13-inch MacBook Airs available...
B&H Photo has clearance 2015 13″ MacBook Airs available for $300 off original MSRP. Shipping is free, and B&H charges NY sales tax only: - 13″ 1.6GHz/4GB/128GB MacBook Air (MJVE2LL/A): $799.... Read more
Apple refurbished Mac minis available for up...
Apple has Certified Refurbished Mac minis available starting at $419. Apple’s one-year warranty is included with each mini, and shipping is free: - 1.4GHz Mac mini: $419 $80 off MSRP - 2.6GHz Mac... Read more
ABBYY TextGrabber: 1,000,000 Installs in 5 Da...
ABBYY, an international OCR technologies provider, has announced that their image-to-text application TextGrabber, got installed 1,000,000 times in just five days while being featured by the App... Read more
New SkinIt Waterproof Case For iPhone 6
With its impact and waterproof design, the Skinit Waterproof case provides security and protection to guarantee your phone will get you through even the most demanding outdoor conditions. The impact-... Read more
iMacs on sale for up to $150 off MSRP
B&H Photo has 21″ and 27″ iMacs on sale for up to $150 off MSRP including free shipping plus NY sales tax only: - 27″ 3.3GHz iMac 5K: $2181.11 $118 off MSRP - 27″ 3.2GHz/1TB Fusion iMac 5K: $1949... Read more
12-inch 1.1GHz Retina MacBooks on sale for $5...
B&H Photo has 2016 12″ 1.1GHz/256GB Retina MacBooks on sale for up to $50 off MSRP. Shipping is free, and B&H charges NY tax only: - 12″ 1.1GHz Space Gray Retina MacBook: $1249 $50 off MSRP... Read more
WWDC Announcements Revisited Still Underwhelm...
I was disappointed that no new MacBook hardware was announced at this year’s all-software World Wide Developer’s Conference. Not even a hint about what’s in the development pipeline. Of course, we... Read more
Twelve South Compass 2 iPad Stand Now Availab...
Twelve South has updated its most popular iPad stand, Compass 2, with the introduction of two new colors — Gold and Rose Gold. These new color options n perfectly complement the new Rose Gold iPad... Read more

Jobs Board

*Apple* iPhone 6s and New Products Tester Ne...
…we therefore look forward to put out products to quality test for durability. Apple leads the digital music revolution with its iPods and iTunes online store, Read more
Music Marketing Lead, iTunes & *Apple*...
…Music Marketing Lead is responsible for developing robust marketing campaigns and programs for Apple Music and iTunes across the whole of Apple ecosystem. This Read more
*Apple* Valley Medical Clinic is Hiring - AP...
Apple Valley Medical Clinic is Hiring! Apple Valley Medical Clinic is an independently owned practice operating a Family Medicine Clinic, a 24/7 Urgent Care, Read more
*Apple* New Products Testers Needed - Apple...
…we therefore look forward to put out products to quality test for durability. Apple leads the digital music revolution with its iPods and iTunes online store, Read more
*Apple* Solutions Consultant - APPLE (United...
Job Summary As an Apple Solutions Consultant, you'll be the link between our future customers and our products. You'll showcase your entrepreneurial spirit as you Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.