awk for Data Processing - Part 2

Volume Number: 22 (2006)
Issue Number: 04
Column Tag: Programming

Mac In The Shell

awk for Data Processing - Part 2

by Edward Marczak

Revving up the engine.

Last month, I introduced you to awk, the 'pattern processor'. That laid the foundation, and merely scratched the surface of awk's power. This month, we're going to dive back into flow control, as we've seen with bash, sed, math routines, and other cool awk features. Of course, awk only becomes more powerful when combined with shell scripting and sed.

Pattern Matching

Now that, over the last few months, we've covered regexp, sed, shell globbing, and now, awk, here's a word on anytime you want to use a utility that does pattern matching. Sometimes you're not in control of the data you're going to need to sift through. However, there are times where, you are the one generating the data. This could be in the form of a report, or even from another command-line tool. In any case, try to make your life easier: don't spew out unnecessary data! For example: you may want to find out ip addresses assigned to a particular interface, so you decide to use ifconfig, and write a sed script to parse the output. The sed script can pattern match the interface and then loop through the results looking for "inet". Not bad, but you decrease your work if you specify the interface you're looking for to ifconfig. If you're using nireport, make sure you output fields in the order that you want them, rather than use awk to swap them around. If you need a file listing, look at all of the switches that will sort and add symbols to the output that will make matching easier. Always make sure you read man page for any program that you're using: you may find some surprising switches that reduce the work you do further down the chain.

Back to Basics

Part 1 of this article gave us some real awk basics - print, match a pattern, field operators and some built-in variables. The built-in variables that we covered were NF, the number of fields in a record, and FS, the field separator. Of course, there are some more built-ins that we should know about. Let's do that before proceeding.

FS separates fields during the input stage. By default, awk separates output with a space. You can define that to be anything you want, using OFS. The output field separator is generated by the comma in a print statement. So, to rewrite an example from last month, we can make the output look better:

$ ls -l | awk 'BEGIN {OFS="\t\t"} {print $5,$9,$1}'
total
182468   20050629-local.jpg   -rw-r--r--
51986   iChats   drwxrwx---
68   images   drwxr-xr-x
1345   jamlog.txt   -rw-r--r--
61440   lads.exe   -rw-r--r--
271103   mount-1.260-3.wbm.gz    -rw-r--r--
352457   mr.spx    -rw-r--r--

Figure 1 - Output Field Separator in action

Better, but still a little ragged. Don't worry! We'll fix that in a bit.

When you generate multi-line output, awk separates each record with ORS, the record separator. ORS is a newline by default, so each record starts on a new line. You can change this! Why would you want to? You can even define RS, the input record separator. Sometimes, small examples are worth 1,000 words. If you are processing data that comes in a 'block' - spread out over several lines - setting FS to "\n", the newline character, will allow awk access via the field variables. Set RS to "", and awk will split these correctly when you have multiple input records. Practical example: you suspect a problem with user records, and want to search for particular users that (may) have the same home directory assigned. Here's the script:

01. #!/bin/bash
02.
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk 'BEGIN {FS="\n"; RS=""} $0 ~ /\/Users\/marczak/ {print $0}'
06. done

Figure 2 - User search script

First, you can see it's a shell script. We'll use bash to feed awk multi-line records using dscl. Line 3 sets up a loop using all of the usernames that we have access to. Line 5 uses dscl again to get the detail for the username provided and feed that record to awk. Using a BEGIN construct, we first set FS and RS. Then, we look for "/Users/marczak" anywhere in the record using $0. If we match, we print the entire record. This way, we'll print all records that have that path as a home directory. It's a fairly specific example, but actually came in handy once. Plus, it illustrates handling multi-line records!

Finally, in our built-in round up, NR and FNR, keep the current line number available for you. NR is cumulative, and FNR gives you the number of the current record with respect to the input file. Useful if you're processing multiple files.

Low-level Format

awk is a fantastic tool for generating reports, however, reports are only really useful if they look good. The data can be good, but if it's hard to read, the brain just switches off. As you've seen, OFS and print only get you so far. awk supports a formatted print statement, printf, that you may have seen in some other languages, notably C. printf is more flexible than standard print, but requires a little more hand holding. Want an example? Here you go:

awk 'BEGIN {printf ("This is a test.\n")}'

Easy, right? So, what's different? First, you'll notice that you have to supply the newline - just like C! What I left out here, are the optional format specifiers, which again, match their counterpart in C. man printf will get you the list, if you forget them. Let's learn by example. The file listing code can be re-written with printf like this:

ls -l | awk '{printf "%s\t\t%s\t\t%s\n",$5, $9, $1}'

This means, print a string ("%s"), two tabs ("\t"), a string, two tabs and a string, followed up by a newline character. Each format specifier needs a corresponding value after the format string to fill in the place-holder with. We're substituting each %s with a field - $5, $9 and $1, respectively. However, this really is the equivalent of the earlier code - it's still ragged! printf also allows you to supply the width and alignment of the output. So, to clean up our listing, we can use this:

$ ls -l | awk 'NR > 1 {printf "%-20s%-20s%-20s\n",$5, $9, $1}'
306                 dist                   drwxr-xr-x          
42364               httpd.conf             -rw-r--r--          
37417               httpd.conf.bak   -rw-r--r--          
38334               httpd.conf.default   -rw-r--r--          
38334               httpd.conf.dist        -rw-r--r--          
12965               magic                  -rw-r--r--          
12965               magic.default          -rw-r--r--          
15201               mime.types             -rw-r--r--          
15201               mime.types.default     -rw-r--r--          
204                 users                  drwxr-xr-x

Figure 3 - Width specifiers

That's much nicer! Explanation: instead of "%s", we can specify a width using "%20s" - "20" being the width. By default, the output is right justified in the space allotted. I added the hyphen - "%-20s" - to our example, to left justify the text.

Flow Control...Again

Depending on how long you've been reading this column, you've seen this before: we've covered looping and decision-making in bash and in sed. Well, flow is important! So, let's see how awk handles these constructs.

The most basic of tests is an if/then test. The pattern matching we've seen is essentially an if/then test that is applied to all input. However, if you've matched something basic, and then need to make further decisions, you can use if/then. Let's combine this with an example using a loop.

As you may have seen in other languages, awk has a while loop that conditionally executes a block of code. Here's the idea:

while (condition is true) action

Like other languages, you can have a line feed between the condition and action, and if the action is multiple lines, they must be contained in curly-braces. I'm actually going to throw a few new things in, and then explain. Let's re-write our user search script from earlier (Figure 2):

01. #!/bin/bash
02.
03. for name in `dscl localhost -list /Search/Users`
04. do
05. dscl localhost -read /Search/Users/${name} | awk '
06. BEGIN {FS="\n"; RS=""}
07. $0 ~ /\/Users\/marczak/ {
08. i=1
09. while (i<=NF) {
10.         if ($i ~ /Dir/ || $i ~ /ID/ || $i ~ /Shell/) print $i
11.         i++
12. }
13. }
14. '
15. done

Figure 4 - A while loop in action

Once again, this is a shell script that feeds full blocks of data into awk. The dscl statements on lines 3 and 5 are identical to the ones from the first script. Look how we break up the awk script across multiple lines from there. Line 5 ends with a single quote, which allows bash to treat everything up until the next single quote as continuous code. Note the closing single quote on line 14! Again, we're going to look for a match on the entire input ($0) - looking for "/Users/Marczak" again. If and when we find it, that's where our adventure begins.

Line 8 initializes the variable "i" to 1. Not zero. We're going to index through fields, and don't need to check $0 again! Line 9 shows off our while loop. As long as i doesn't exceed the number of fields on the input record, we execute the loop. First time through, i=1, and we can use it to reference the first field of input ($1). Line 10 - an if statement! Lovely! If we find that the field we're currently looking at contains "Dir" or ("||") "ID" or "Shell", we print that field. Then, we increment i on line 11 so we don't loop around forever - and, we reference the next field in the next iteration of the loop.

Really cool stuff here: using the built-in NF variable as a comparison in our while loop, using a variable for the field reference, using classic Unix utilities with OS X specific CLI programs....nice. In addition to a while loop, awk supports the familiar "for" and "do" loop constructs. And as you may have guessed, you may recognize them already.

A "do" loop is a variant of the "while" loop. Its main difference is that the action is always executed at least once. It looks like this:

do {
   action
} while (condition is true)
Need to see it in action?  Here you go:
BEGIN {
        numMice = 5
        catTime = 3
        do {
                theCatIsAway++
                theMicePlay = numMice * theCatIsAway
                if (theCatIsAway > catTime) theCatIsAway = 0
        } while (theCatIsAway)
        print "The mice played " theMicePlay " days."
}

Figure 5 - An example do loop

Yes, it's a completely contrived example so I could use "while the cat is away"...I needed to bring a little levity to this column. This example does illustrate a little math, though, which I haven't explicitly covered.

The for loop borrows its syntax from C and should be pretty recognizable:

for (initialize; test conditions; increment) {
   actions
}

Rewriting the previous loop using for would look like this:

01. BEGIN { 
02.         numMice = 5
03.         catTime = 3
04.         for (theCatIsAway = 1; theCatIsAway > 0; theCatIsAway++) {
05.                 theMicePlay = numMice * theCatIsAway
06.                 if (theCatIsAway > catTime) theCatIsAway = -1
07.         }
08.         print "The mice played " theMicePlay " days"
09. }

Figure 6 - an example for loop

Look at that for loop! It's a thing of beauty! No, really...I'm serious! (Outside of the fact that it ruins my play on words). It lets you take care of everything you need for a loop. Note, however, that the increment happens at the bottom of the loop. This is important, and is the reason, we set theCatIsAway to -1 rather than 0 on line 6. Otherwise, our test would never be true, and we'd get caught in an infinite loop.

Once again, like other languages, awk lets us skip an iteration of a loop, or break out altogether. Inside of a loop, the break keyword breaks out of the loop, and ends it:

do {
   if (leaveLoopNow) break
   procedure_one(x,y,z)
   procedure_two(x,y)
   transform_one(x,y)
} while (x < currentThreshhold)

In this example, if leaveLoopNow is true, execute the break statement and bail out of the loop - never to execute the remainder of the loop, picking up execution following the loop.

A less drastic version of break, is continue. A short example will make it clear:

do {
   if (notThisTime) continue
   checkVars(x,y)
   transform_one(x,y)
} while (x < currentThreshhold)

Here, if notThisTime is true, we just go back to the top of the loop. But the loop will continue, as long as our condition is true.

There are also two flow-altering statements that affect awk's entire flow - next and exit. The simpler of the two is exit. When awk encounters the exit statement, it jumps to the END rule. Of course, you don't even have to have and END rule defined. In that case, the script just terminates. Note that exit can supply a value to use as awk's exit code. Nice way to test success or failure in a shell script. exit without a value defaults to "0". next transfers control back to the top of the script where awk will read the next record of input. This is useful in a few different situations. If you only want to process records in a file that has 5 fields, simply sue this rule:

NF != 5 {next}

That's also useful for error checking, whereby if the target input doesn't 'look' right, you can just crank through the file. Perhaps even keeping count of how many records you skipped for use in an exception report.

Arrays

Here's something that I can't tell you I've covered before. Certainly not in sed, nor in bash. Of course there are other languages that support arrays, so some of this may look familiar. But, if this column has been your introduction to anything remotely related to programming or scripting, this will be slightly new. An array is simply a variable that lets you hold a series of values. Being a loosely typed language, all arrays in awk are associative arrays - arrays that map keys to values. Associative arrays do not need to use integers as the key, or subscript, nor does every value need to be of equal type and size. If you have a PHP background, you'll understand this innately. Naturally, examples are forthcoming. Like other variables in awk, arrays do not need to be declared, so you can just use them:

array[key] = value

Often, you'll see simple numeric keys (subscripts) - useful when loading data in from a file, and you want to track something from every record, or mark certain records based on a value. Just as often, though, you'll see a key; a string that maps to a value. We can use this feature like this:

BEGIN { color["red"] = "0xF00"
color["green"] = "0x0F0"
color["blue"] = "0x00F"
print color["red"]
}

Nice, right? Arrays let us keep related values together. You can also use a variable as the key. Here's a totally trivial example that illustrates a few new concepts:

01. #!/bin/bash
02. /System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport -I | awk '
03. BEGIN { FS=":" }
04. {
05. gsub(" ","",$1)
06. recordlist[$1] = $2
07. }
08. END {
09. for (key in recordlist)
10.         print "The " key " is equal to" recordlist[key]
11. }
12. '

Figure 7 - Many new concepts!

Once again, I wrap awk in a bash shell script. First new thing, may be the airport command. With the "-I" switch, it gives you information about your current airport status. Next new thing is on line 5: gsub. Sometimes, exposure to many languages is a bit of a curse. When I see this command, I always think back to BASIC's "gosub" (go to subroutine) command. In awk's case, however, it stands for global substitue. I'm using it here just to clean up the output a bit. It's really powerful, though, and works like this:

gsub(regexp, substitution, string)

Now it's apparent; I'm just removing the spaces from $1: a space (" ") is being replaced with nothing ("") in the string $1. Now, look what's happening on line 6 - the value of $1 is being used as the key in the array "recordlist". It's also being assigned the value of $2, the second field. Then on line 9 in the END pattern, there's a new flow control statement. A variant on a for loop, we have some special syntax that accesses each element of an array in turn. "key" is a made-up variable. Right there, on the spot. It could really be whatever we like, but as with all variables, it should be something somewhat meaningful. This variable will contain the current key name in each iteration of the loop.

While there's a lot more to arrays, I'd be remiss if I didn't mention two functions: split and delete. split splits a string into an array based on a separator. This is just like awk's main loop function that breaks input into fields based on FS. If you have awk reading a CSV file, you could use split thusly:

x = split($0, myFields, ",")

What this does is create an array - myFields - that contains each 'field' of $0, fields being separated with a comma. split returns the number of fields in the string, in our case, putting the result into "x". If the input looked like this:

Mike Jones, 555-1234, mikej@example.com
Bill Smith, 555-0984, bills@example.com
Sally Foster, 555-3456, sallyf@example.com

...then during the first pass, myFields would contain:

myFields[1] = "Mike Jones"
myFields[2] = "555-1234"
myFields[3] = "mikej@example.com"

split is also a useful way to load up an array:

split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec",months,",")

That's a lot easier than writing out "month[1] = "...

delete is simple: it lets you remove an element of an array. Simply:

delete myFields[2]

...would get rid of the phone number in our previous example. Of course, you can always just ignore a field, using delete will make sure it's gone if you choose to use for...in.

Grab Bag

There are some final things I feel the need to mention about awk, but realized that they're each pretty short and belonged altogether in a 'grab bag' section. Without further ado, here they are.

In addition to the other built-in variables that we've covered, awk presents two built-ins as arrays: ARGV and ENVIRON. ARGV is an array that contains each command line argument, including the script name itself in ARGV[0]. If you ran your script like this:

$ awk -f argvtest.awk var1 15 "Iolo" "Shamino"

...ARGV would contain:

ARGV[0] = awk
ARGV[1] = var1
ARGV[2] = 15
ARGV[3] = "Iolo"
ARGV[4] = "Shamino"

ENVIRON maps current environment variables to their values. For example:

print ENVIRON[OSTYPE]

on my machine would yield "darwin8.0".

In addition to split and gsub, awk contains some really useful (and common to other languages) string manipulation functions, such as substr, toupper, length and tolower. Consider that homework.

Finally, awk provides a way to get input outside of the main input loop. getline gets a new line of input, and can be used in two different ways. First, when used by itself, it will get the next line from input that the main loop would have gotten. This is similar to next (covered above), however, getline does not bring flow back to the top of the script. Secondarily, you can pipe input into awk and read it with getline. While more in-depth work still requires a shell script, this is my favorite way to write a quick-and-dirty awk script. One example will get you going:

$ awk 'BEGIN {"top -l 1" | getline; print $2 " processes running"}'
121 processes running

The output of top is piped into awk - directly from inside awk! You can, of course, even use that trick conditionally, and go look something up on the fly if needed.

Conclusion

To show everything that awk is capable of would take a book. I believe I've shown things that are immediately understandable and practical. Between this column and the last, you should have a good foundation to build on. Of course, there's a lot more to explore. I didn't get to multi-dimensional arrays, trig math, user-defined functions, piping to output...and more. If this has whetted your appetite, there are many resources that teach awk in-depth. Just dropping into Google and trying "learning awk" brings back an incredible number of resources.

awk is a fantastic utility that has proved its worth over decades of classic Unix use. For OS X administrators, it dovetails perfectly with the powerful command-line utilities at our disposal.

Recommended reading for the month: Cuckoo's Egg, by Cliff Stoll. Released in 1989, this was one of the first track-a-hacker books I ever read. Of course, I could suggest some technical reference for you to dig into, but this is just good reading. I saw a copy at a bookstore not too long ago, and that made me break out my old version. Still a good read today, if not a great way to compare and contrast the technical environment of the late 1980s to today. Also, a good reminder that social engineering is timeless.

Ed Marczak owns and operates Radiotope, a technology consulting company. More tech tips at the blog:

http://www.radiotope.com/writing

Software Updates via MacUpdate

Latest Forum Discussions

Make the passage of time your plaything...

While some of us are still waiting for a chance to get our hands on Ash Prime - yes, don’t remind me I could currently buy him this month I’m barely hanging on - Digital Extremes has announced its next anticipated Prime Form for Warframe. Starting... | Read more »

If you can find it and fit through the d...

The holy trinity of amazing company names have come together, to release their equally amazing and adorable mobile game, Hamster Inn. Published by HyperBeard Games, and co-developed by Mum Not Proud and Little Sasquatch Studios, it's time to... | Read more »

Amikin Survival opens for pre-orders on...

Join me on the wonderful trip down the inspiration rabbit hole; much as Palworld seemingly “borrowed” many aspects from the hit Pokemon franchise, it is time for the heavily armed animal survival to also spawn some illegitimate children as Helio... | Read more »

PUBG Mobile teams up with global phenome...

Since launching in 2019, SpyxFamily has exploded to damn near catastrophic popularity, so it was only a matter of time before a mobile game snapped up a collaboration. Enter PUBG Mobile. Until May 12th, players will be able to collect a host of... | Read more »

Embark into the frozen tundra of certain...

Chucklefish, developers of hit action-adventure sandbox game Starbound and owner of one of the cutest logos in gaming, has released their roguelike deck-builder Wildfrost. Created alongside developers Gaziter and Deadpan Games, Wildfrost will... | Read more »

MoreFun Studios has announced Season 4,...

Tension has escalated in the ever-volatile world of Arena Breakout, as your old pal Randall Fisher and bosses Fred and Perrero continue to lob insults and explosives at each other, bringing us to a new phase of warfare. Season 4, Into The Fog of... | Read more »

Top Mobile Game Discounts

Every day, we pick out a curated list of the best mobile discounts on the App Store and post them here. This list won't be comprehensive, but it every game on it is recommended. Feel free to check out the coverage we did on them in the links below... | Read more »

Marvel Future Fight celebrates nine year...

Announced alongside an advertising image I can only assume was aimed squarely at myself with the prominent Deadpool and Odin featured on it, Netmarble has revealed their celebrations for the 9th anniversary of Marvel Future Fight. The Countdown... | Read more »

HoYoFair 2024 prepares to showcase over...

To say Genshin Impact took the world by storm when it was released would be an understatement. However, I think the most surprising part of the launch was just how much further it went than gaming. There have been concerts, art shows, massive... | Read more »

Explore some of BBCs' most iconic s...

Despite your personal opinion on the BBC at a managerial level, it is undeniable that it has overseen some fantastic British shows in the past, and now thanks to a partnership with Roblox, players will be able to interact with some of these... | Read more »

Price Scanner via MacPrices.net

You can save $300-$480 on a 14-inch M3 Pro/Ma...

Apple has 14″ M3 Pro and M3 Max MacBook Pros in stock today and available, Certified Refurbished, starting at $1699 and ranging up to $480 off MSRP. Each model features a new outer case, shipping is... Read more

24-inch M1 iMacs available at Apple starting...

Apple has clearance M1 iMacs available in their Certified Refurbished store starting at $1049 and ranging up to $300 off original MSRP. Each iMac is in like-new condition and comes with Apple’s... Read more

Walmart continues to offer $699 13-inch M1 Ma...

Walmart continues to offer new Apple 13″ M1 MacBook Airs (8GB RAM, 256GB SSD) online for $699, $300 off original MSRP, in Space Gray, Silver, and Gold colors. These are new MacBook for sale by... Read more

B&H has 13-inch M2 MacBook Airs with 16GB...

B&H Photo has 13″ MacBook Airs with M2 CPUs, 16GB of memory, and 256GB of storage in stock and on sale for $1099, $100 off Apple’s MSRP for this configuration. Free 1-2 day delivery is available... Read more

14-inch M3 MacBook Pro with 16GB of RAM avail...

Apple has the 14″ M3 MacBook Pro with 16GB of RAM and 1TB of storage, Certified Refurbished, available for $300 off MSRP. Each MacBook Pro features a new outer case, shipping is free, and an Apple 1-... Read more

Apple M2 Mac minis on sale for up to $150 off...

Amazon has Apple’s M2-powered Mac minis in stock and on sale for $100-$150 off MSRP, each including free delivery: – Mac mini M2/256GB SSD: $499, save $100 – Mac mini M2/512GB SSD: $699, save $100 –... Read more

Amazon is offering a $200 discount on 14-inch...

Amazon has 14-inch M3 MacBook Pros in stock and on sale for $200 off MSRP. Shipping is free. Note that Amazon’s stock tends to come and go: – 14″ M3 MacBook Pro (8GB RAM/512GB SSD): $1399.99, $200... Read more

Sunday Sale: 13-inch M3 MacBook Air for $999,...

Several Apple retailers have the new 13″ MacBook Air with an M3 CPU in stock and on sale today for only $999 in Midnight. These are the lowest prices currently available for new 13″ M3 MacBook Airs... Read more

Multiple Apple retailers are offering 13-inch...

Several Apple retailers have 13″ MacBook Airs with M2 CPUs in stock and on sale this weekend starting at only $849 in Space Gray, Silver, Starlight, and Midnight colors. These are the lowest prices... Read more

Roundup of Verizon’s April Apple iPhone Promo...

Verizon is offering a number of iPhone deals for the month of April. Switch, and open a new of service, and you can qualify for a free iPhone 15 or heavy monthly discounts on other models: – 128GB... Read more

Jobs Board

Relationship Banker - *Apple* Valley Financ...

Relationship Banker - Apple Valley Financial Center APPLE VALLEY, Minnesota **Job Description:** At Bank of America, we are guided by a common purpose to help Read more

IN6728 Optometrist- *Apple* Valley, CA- Tar...

Date: Apr 9, 2024 Brand: Target Optical Location: Apple Valley, CA, US, 92308 **Requisition ID:** 824398 At Target Optical, we help people see and look great - and Read more

Medical Assistant - Orthopedics *Apple* Hil...

Medical Assistant - Orthopedics Apple Hill York Location: WellSpan Medical Group, York, PA Schedule: Full Time Sign-On Bonus Eligible Remote/Hybrid Regular Apply Now Read more

*Apple* Systems Administrator - JAMF - Activ...

…**Public Trust/Other Required:** None **Job Family:** Systems Administration **Skills:** Apple Platforms,Computer Servers,Jamf Pro **Experience:** 3 + years of Read more

Liquor Stock Clerk - S. *Apple* St. - Idaho...

Liquor Stock Clerk - S. Apple St. Boise Posting Begin Date: 2023/10/10 Posting End Date: 2024/10/14 Category: Retail Sub Category: Customer Service Work Type: Part Read more

SPREAD THE WORD:
Slashdot
Digg
Del.icio.us
Reddit
Newsvine

MacTech

Mac In The Shell

awk for Data Processing - Part 2

Pattern Matching

Back to Basics

Low-level Format

Flow Control...Again

Arrays

Grab Bag

Conclusion

Software Updates via MacUpdate

Latest Forum Discussions

Price Scanner via MacPrices.net

Jobs Board