
File-Based Dataflow

Volume Number: 19 (2003)
Issue Number: 3
Column Tag: Mac OS X

building robust systems without explicit file locking

by Rich Morin

Last month, I discussed "File Change Watcher", a Perl daemon that wakes up periodically, checks for file-system changes, copies files, and gathers metadata. To save space, I ignored a critical issue: how can another process know that a metadata file is "ready" to be read?

Race Conditions

This question comes up in the design of any system where one process is writing a file and another is accessing it. In this type of "race condition", the reading process can overtake the writing process, arriving prematurely at the end of the data (and ignoring any data the other process may write thereafter).

Alternatively, if two processes try to write the same file at the same time, assorted damage can ensue. Depending on the circumstances, changes might get lost, output could get intermingled, etc.

One way to deal with the problem is to use a "lock file". By common agreement, if the lock file is in place, the file it "locks" isn't available for access. An inquiring process can test for this by trying to create the lock file. If the attempt fails, the process just waits for a while before trying again.

The reason that this test works is that the kernel won't let two processes create the same file when open(2) is called with both O_CREAT and O_EXCL. So, the first attempt succeeds and all others fail (until the first process removes the lock file). The technique works quite well, as long as everyone plays nicely; the BSD side of Mac OS X contains a number of lock files (e.g., /var/spool/lock).
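The lock-file test described above can be sketched in a few lines of Python. This is only an illustration, not code from File Change Watcher; the function names, the retry count, and the delay are assumptions made for the example.

```python
import os
import time


def acquire_lock(lock_path, retries=10, delay=0.5):
    """Try to create lock_path exclusively; return True on success."""
    for _ in range(retries):
        try:
            # O_CREAT | O_EXCL: the create fails if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            time.sleep(delay)  # someone else holds the lock; wait and retry
            continue
        # Recording the owner's process ID is a common (hypothetical here) convention.
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True
    return False


def release_lock(lock_path):
    """Remove the lock file, making the lock available again."""
    os.unlink(lock_path)
```

Note that closing the descriptor does not release the lock; only removing the file does. That is why a crash can leave a stale lock file behind.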

In the general case, lock files (or some equivalent technique) are necessary. If, for instance, two users want to edit a file, you really don't want them doing it at the same time. So, some Unix editors (e.g., the BSD version of vi) implement file locking.

Unfortunately, lock files add complexity and room for error. All of the processes have to honor the lock files; what if an existing program doesn't want to play? Also, if the system crashes, the lock file has to be explicitly removed when things start up again. And, for all of this pain, they only solve one part (simultaneous access) of the file-based dataflow problem set.

Atomic Actions

Fortunately, there are a number of alternatives to lock files. Most of them are based on some kind of "atomic action"; that is, something that can only be done completely or not at all. BSD provides several atomic actions as system calls, including chmod(2), chown(2), link(2), open(2), and rename(2).

These same facilities are available from Perl, but its chmod and chown accept lists of files, so they are only atomic when applied to a single file node. Corresponding shell commands are available, albeit with some caveats:

  • chgrp(1), chmod(1), chown(1) of a single file system node

  • link(1), but only for hard links

  • mv(1), to another name on the same file system

So, although you can't assume that a process will finish writing before some other process starts to read the output file, you can safely rename(2) an existing file (or mv(1) it to another name on the same file system) without worrying about race conditions.
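As a minimal sketch of this idea, the Python function below writes its data under a temporary name and then renames the file into place. The function name publish and the ".tmp." prefix are assumptions chosen for the example; os.rename is Python's interface to rename(2).

```python
import os
import tempfile


def publish(data, final_path):
    """Write data to a temporary file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    # Create the temporary file in the destination directory, so that the
    # rename never crosses a file system boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp.")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before publishing
        os.rename(tmp_path, final_path)  # atomic on the same file system
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # never leave a partial file behind
        raise
```

A reader that opens final_path sees either no file at all or the complete file, never a partially written one.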

Because any hard link to a file is simply a directory entry that points to a common inode(5), both the data and file system metadata are shared among a set of hard links. This lets a single atomic action (e.g., chmod) change the status of any number of links.
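A short Python experiment shows the effect. The file names are made up for the example, and the temporary directory is assumed to live on a file system that supports hard links.

```python
import os
import stat
import tempfile

workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "report.a")
second = os.path.join(workdir, "report.b")

with open(first, "w") as f:
    f.write("data\n")
os.link(first, second)  # a second directory entry for the same inode

os.chmod(first, 0o400)  # one call, made through one name...

# ...is visible through every name, because the mode lives in the inode.
mode_first = stat.S_IMODE(os.stat(first).st_mode)
mode_second = stat.S_IMODE(os.stat(second).st_mode)
same_inode = os.stat(first).st_ino == os.stat(second).st_ino
link_count = os.stat(first).st_nlink
```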

Finally, Cocoa's Application Kit Framework's NSFileWrapper class includes methods such as writeToFile:atomically:updateFilenames:. In short, a wealth of solutions is at hand.

Emitters and Consumers

The system I'm building is based on a dataflow model, similar to Unix pipelines, but it uses files instead of pipes. The method is far from new; mail transfer agents and print spoolers use variations of it in Mac OS X. I am simply generalizing the idea into a framework for creating sets of file- and time-based tasks.

By specifying that only one program will ever write to a file, I can eliminate the issue of simultaneous writing. This means that I only need to prevent processes from opening or removing files prematurely and make sure that every file gets processed to completion.

The rules below allow the safe (i.e., no race conditions) use of files by any number of "emitters" and "consumers", without the need for lock files.

  • Files are created and written by emitters, read and deleted by consumers. No modification of files is allowed, including appending or read/write usage.

  • Files are created under temporary names (on the destination file system), then "published" (e.g., renamed) for use by consumers.

  • A file can only be used by one consumer, which removes it just before exiting.

    Note: This restriction applies only to consumers. Other programs (e.g., more) may read published files at any time. Also, the consumer is not required to read the file, just remove it.

  • If an emitter is creating files for multiple consumers, a separate link must be created for each consumer. All of the links must then be published in a single, atomic operation. One method uses a common directory:

    • Create a temporary directory.

    • For each file with N consumers, make N-1 links in the directory.

    • Rename the temporary file(s) into the directory, as the Nth file link(s).

    • Rename the temporary directory, publishing all of the links.

    Another method, which can only "protect" a single output file, has the advantage that the output links do not have to be in the same directory:

    • Turn off read access (using chmod) on the temporary file.

    • For a file with N consumers, make N-1 links.

    • Rename the temporary file as the Nth link.

    • Make any (and thereby, every) link readable.
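The two multi-consumer methods can be sketched in Python as follows. The function names, the ".staging." prefix, and the permission modes are assumptions made for the example. Both functions assume that the temporary files have already been written and closed, and that every path is on the same file system.

```python
import os
import tempfile


def publish_in_directory(tmp_files, parent, final_name):
    """Directory method. tmp_files maps each temporary file path to the
    list of link names (one per consumer) it should appear under."""
    staging = tempfile.mkdtemp(dir=parent, prefix=".staging.")
    for tmp_path, link_names in tmp_files.items():
        *extra, last = link_names
        for name in extra:  # the first N-1 links
            os.link(tmp_path, os.path.join(staging, name))
        os.rename(tmp_path, os.path.join(staging, last))  # the Nth link
    os.chmod(staging, 0o755)  # mkdtemp creates the directory as private
    final_dir = os.path.join(parent, final_name)
    os.rename(staging, final_dir)  # publishes every link at once
    return final_dir


def publish_with_chmod(tmp_path, link_paths, mode=0o444):
    """chmod method, for a single output file whose links may live in
    different directories."""
    os.chmod(tmp_path, 0o000)  # turn off read access
    *extra, last = link_paths
    for path in extra:  # the first N-1 links
        os.link(tmp_path, path)
    os.rename(tmp_path, last)  # the Nth link
    os.chmod(last, mode)  # one chmod makes every link readable
```

One caveat on the second method: a process running as root ignores the permission bits, so the chmod "gate" only holds back unprivileged consumers.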

Although these rules might complicate the life of programmers, the resulting programs aren't any more complicated. In fact, they tend to be quite simple: discover an input file, process it (writing any output to temporary files); when you're done, publish (e.g., rename) the output files and remove the input file.
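Here is a minimal Python sketch of such a program. The name consume, the transform callback, and the choice to reuse the input's base name for the output are all assumptions made for the example.

```python
import glob
import os
import tempfile


def consume(pattern, out_dir, transform):
    """Process every published file matching pattern; return output paths."""
    published = []
    for in_path in sorted(glob.glob(pattern)):
        with open(in_path, "r") as src:
            result = transform(src.read())
        # Write the output under a temporary name in the destination directory.
        fd, tmp_path = tempfile.mkstemp(dir=out_dir, prefix=".tmp.")
        with os.fdopen(fd, "w") as dst:
            dst.write(result)
        out_path = os.path.join(out_dir, os.path.basename(in_path))
        os.rename(tmp_path, out_path)  # publish the output...
        os.unlink(in_path)  # ...and only then remove the input
        published.append(out_path)
    return published
```

Because a "*" pattern does not match names that begin with a dot, temporary files named ".tmp.something" are never mistaken for published input. If the program dies before the rename, the input file is still in place and can be processed again.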

Better yet, the file-discovery and -management details can be handed off to a scheduling daemon, allowing most of the "tasks" (working code) to be written as "filters": read from standard input; write to standard output.
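As a rough illustration of that division of labor, the Python function below plays the part of the scheduler for a single input file: it connects the file to a filter's standard input, captures standard output in a temporary file, and publishes the result. The name run_filter is hypothetical, and a real scheduling daemon would do considerably more.

```python
import os
import subprocess
import tempfile


def run_filter(command, in_path, out_path):
    """Run command with in_path on stdin; publish its stdout as out_path."""
    out_dir = os.path.dirname(out_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, prefix=".tmp.")
    try:
        with open(in_path, "rb") as src, os.fdopen(fd, "wb") as dst:
            # check=True raises CalledProcessError if the filter fails.
            subprocess.run(command, stdin=src, stdout=dst, check=True)
    except BaseException:
        os.unlink(tmp_path)  # never publish partial output
        raise
    os.rename(tmp_path, out_path)  # publish
    os.unlink(in_path)  # consume the input
```

The filter itself needs no knowledge of temporary names, renames, or input removal.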

A Data Collection System

Let's apply this to a data collection system and see how it plays out. A ps(1)-monitoring task is supposed to run once a minute, writing a report. Another task reads the report, writing a YAML (www.yaml.org) version. Other tasks produce hourly and daily summaries.

This is a fairly complex set of tasks, but it's only a small fraction of the workload for a full-scale operating system monitor. So, the specification for each task should be as small and simple as possible. Here's a first cut at a configuration file:

# Every minute, collect raw ps(1) data.
{ ps_raw, type: cron, min: every,
  out: 'ps/raw/$time'
}
# Process the raw ps(1) data.
# Create two output links.
{ ps_rare, type: file, patt: 'ps/raw/*',
  out: ['ps/rare/1.$time',
        'ps/rare/2.$time' ]
}
# Every hour, create a summary.
{ ps_hour, type: cron, hour: every,
  out: 'ps/hour/$time'
}
# Every day, create a summary.
{ ps_day, type: cron, day: every,
  out: 'ps/day/$time'
}

This is fairly concise, but there's a lot of repetition. We could "boil it down" by taking advantage of the fact that it describes a tree of processes:

{ ps_raw, type: cron, min: every,
  out: 'ps/raw/$time',
  { ps_rare, type: file, patt: 'ps/raw/*',
    out: ['ps/rare/1.$time',
          'ps/rare/2.$time' ],
    { ps_hour, type: cron, hour: every,
      out: 'ps/hour/$time'
    },
    { ps_day, type: cron, day: every,
      out: 'ps/day/$time'
} } }

But that only kills off two lines (excluding comments), so it's not a huge win. Also, I'm not convinced that it's as easy to read, modify, etc. If we are willing to let the scheduling infrastructure generate file names for us, however, we can get away with "idioms" like:

{ ps_raw,      type: cron, min:  every,
  { ps_rare,   type: file,
    { ps_hour, type: cron, hour: every },
    { ps_day,  type: cron, day:  every }
} }

That brings the overhead back under control. And, if we need to access a file, we can follow standardized naming rules to find it. The output of ps_hour, for instance, would have a name of the form "ps_hour/1.<time>".
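A naming rule of this kind is easy to capture in code. The Python helper below is only a guess at what such a rule might look like; the function name and the idea of numbering links from 1 are assumptions, and the format of the time stamp is left to the caller because the text does not specify it.

```python
def output_names(task, consumers, timestamp):
    """Return one generated name per consumer, of the form
    '<task>/<link number>.<time>'."""
    return ["{}/{}.{}".format(task, link, timestamp)
            for link in range(1, consumers + 1)]
```

For example, output_names("ps_hour", 1, "<time>") yields the single name "ps_hour/1.<time>".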

So much for theoretical hand-waving and speculative descriptions. Next month, I'll discuss some daemons that can actually make all of this work.


Rich Morin has been using computers since 1970, Unix since 1983, and Mac-based Unix since 1986 (when he helped Apple create A/UX 1.0). When he isn't writing this column, Rich runs Prime Time Freeware (www.ptf.com), a publisher of books and CD-ROMs for the Free and Open Source software community. Feel free to write to Rich at rdm@ptf.com.

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.