TweetFollow Us on Twitter

File-Based Dataflow

Volume Number: 19 (2003)
Issue Number: 3
Column Tag: Mac OS X

File-Based Dataflow

building robust systems without explicit file locking

by Rich Morin

Last month, I discussed "File Change Watcher", a Perl daemon that wakes up periodically, checks for file-system changes, copies files, and gathers metadata. To save space, I ignored a critical issue: how can another process know that a metadata file is "ready" to be read?

Race Conditions

This question comes up in the design of any system where one process is writing a file and another is accessing it. In this type of "race condition", the reading process can overtake the writing process, arriving prematurely at the end of the data (and ignoring any data the other process may write thereafter).

Alternatively, if two processes try to write the same file at the same time, assorted damage can ensue. Depending on the circumstances, changes might get lost, output could get intermingled, etc.

One way to deal with the problem is to use a "lock file". By common agreement, if the lock file is in place, the file it "locks" isn't available for access. An inquiring process can test for this by trying to create the lock file. If the process fails, it just waits for a while before trying again.

The reason that this test works is that the kernel won't let two processes open the same file with exclusive access (O_EXCL). So, the first attempt succeeds and all others fail (until the first process closes the file). The technique works quite well, as long as everyone plays nicely; the BSD side of OSX contains a number of lock files (e.g., /var/spool/lock).

In the general case, lock files (or some equivalent technique) are necessary. If, for instance, two users want to edit a file, you really don't want them doing it at the same time. So, some Unix editors (e.g., the BSD version of vi) implement file locking.

Unfortunately, lock files add complexity and room for error. All of the processes have to honor the lock files; what if an existing program doesn't want to play? Also, if the system crashes, the lock file has to be explicitly removed when things start up again. And, for all of this pain, they only solve one part (simultaneous access) of the file-based dataflow problem set.

Atomic Actions

Fortunately, there are a number of alternatives to lock files. Most of them are based on some kind of "atomic action"; that is, something that can only be done completely or not at all. BSD provides several atomic actions as system calls, including chmod(2), chown(2), link(2), open(2), and rename(2).

These same facilities are available from Perl, but its chmod and chown are only atomic for single file nodes. Corresponding shell commands are available, albeit with some caveats:

  • chgrp(1), chmod(1), chown(1) of a single file system node

  • link(1), but only for hard links

  • mv(1), to another name on the same file system

So, although you can't assume that a process will finish writing before some other process starts to read the output file, you can safely rename(2) an existing file (or mv(1) it to another name on the same file system) without worrying about race conditions.

Because any hard link to a file is simply a directory entry that points to a common inode(5), both the data and file system metadata are shared among a set of hard links. This lets a single atomic action (e.g., chmod) change the status of any number of links.

Finally, Cocoa's Application Kit Framework's NSFileWrapper class includes methods such as writeToFile:atomically:updateFilenames:. In short, a wealth of solutions is at hand.

Emitters and Consumers

The system I'm building is based on a data-flow model, similar to Unix pipelines, but it uses files instead of pipes. The method is far from new; mail transfer agents and print spoolers use variations of it in OSX. I am simply generalizing the idea into a framework for creating sets of file- and time-based tasks.

By specifying that only one program will ever write to a file, I can eliminate the issue of simultaneous writing. This means that I only need to prevent processes from opening or removing files prematurely and make sure that every file gets processed to completion.

The rules below allow the safe (i.e., no race conditions) use of files by any number of "emitters" and "consumers", without the need for lock files.

  • Files are created and written by emitters, read and deleted by consumers. No modification of files is allowed, including appending or read/write usage.

  • Files are created under temporary names (on the destination file system), then "published" (e.g., renamed) for use by consumers.

  • A file can only be used by one consumer, which removes it just before exiting.

    Note: This restriction applies only to consumers. Other programs (e.g., more) may read published files at any time. Also, the consumer is not required to read the file, just remove it.

  • If an emitter is creating files for multiple consumers, a separate link must be created for each consumer. All of the links must then be published in a single, atomic operation. One method uses a common directory:

    • Create a temporary directory.

    • For each file with N consumers, make N-1 links in the directory.

    • Rename the temporary file(s) into the directory, as the Nth file link(s).

    • Rename the temporary directory, publishing all of the links.

      Another method, which can only "protect" a single output file, has the advantage that the output links do not have to be in the same directory:

    • Turn off read access (using chmod) on the temporary file.

    • For a file with N consumers, make N-1 links.

    • Rename the temporary file as the Nth link.

    • Make any (and thereby, every) link readable.

Although these rules might complicate the life of programmers, the resulting programs aren't any more complicated. In fact, they tend to be quite simple: discover an input file, process it (writing any output to temporary files); when you're done, publish (e.g., rename) the output files and remove the input file.

Better yet, the file-discovery and -management details can be handed off to a scheduling daemon, allowing most of the "tasks" (working code) to be written as "filters": read from standard input; write to standard output.

A Data Collection System

Let's apply this to a data collection system and see how it plays out. A ps(1)-monitoring task is supposed to run once a minute, writing a report. Another task reads the report, writing a YAML ( version. Other tasks produce hourly and daily summaries.

This is a fairly complex set of tasks, but it's only a small fraction of the workload for a full-scale operating system monitor. So, the amount of specification for each task should be as simple as possible. Here's a first cut at a configuration file:

# Every minute, collect raw ps(1) data.
{ ps_raw, type: cron, min: every,
  out: 'ps/raw/$time'
# Process the raw ps(1) data.
# Create two output links.
{ ps_rare, type: file, patt: 'ps/raw/*',
  out: ['ps/rare/1.$time',
        'ps/rare/2.$time' ]
# Every hour, create a summary.
{ ps_hour, type: cron, hour: every,
  out: 'ps/hour/$time'
# Every day, create a summary.
{ ps_day, type: cron, day: every,
  out: 'ps/day/$time'

This is fairly concise, but there's a lot of repetition. We could "boil it down" by taking advantage of the fact that it describes a tree of processes:

{ ps_raw, type: cron, min: every,
  out: 'ps/raw/$time',
  { ps_rare, type: file, patt: 'ps/raw/*',
    out: ['ps/rare/1.$time',
          'ps/rare/2.$time' ]
    { ps_hour, type: cron, hour: every,
      out: 'ps/hour/$time'
    { ps_day, type: cron, day: every,
      out: 'ps/day/$time'
} } }

But that only kills off two lines (excluding comments), so it's not a huge win. Also, I'm not convinced that it's as easy to read, modify, etc. If we are willing to let the scheduling infrastructure generate file names for us, however, we can get away with "idioms" like:

{ ps_raw,      type: cron, min:  every,
  { ps_rare,   type: file,
    { ps_hour, type: cron, hour: every },
    { ps_day,  type: cron, day:  every }
} }

That brings the overhead back under control. And, if we need to access a file, we can follow standardized naming rules to find it. The output of ps_hour, for instance, would have a name of the form "ps_hour/1.<time>".

So much for theoretical hand-waving and speculative descriptions. Next month, I'll discuss some daemons that can actually make all of this work.

Rich Morin has been using computers since 1970, Unix since 1983, and Mac-based Unix since 1986 (when he helped Apple create A/UX 1.0). When he isn't writing this column, Rich runs Prime Time Freeware (, a publisher of books and CD-ROMs for the Free and Open Source software community. Feel free to write to Rich at


Community Search:
MacTech Search:

Software Updates via MacUpdate

Mellel 4.1.0 - The word processor for sc...
Mellel is the leading word processor for OS X and has been widely considered the industry standard for long form documents since its inception. Mellel focuses on writers and scholars for technical... Read more
ScreenFlow 7.3 - Create screen recording...
ScreenFlow is powerful, easy-to-use screencasting software for the Mac. With ScreenFlow you can record the contents of your entire monitor while also capturing your video camera, microphone and your... Read more
Dashlane 5.9.0 - Password manager and se...
Dashlane is an award-winning service that revolutionizes the online experience by replacing the drudgery of everyday transactional processes with convenient, automated simplicity - in other words,... Read more
ForkLift 3.2 - Powerful file manager: FT...
ForkLift is a powerful file manager and ferociously fast FTP client clothed in a clean and versatile UI that offers the combination of absolute simplicity and raw power expected from a well-executed... Read more
Cocktail 11.5 - General maintenance and...
Cocktail is a general purpose utility for macOS that lets you clean, repair and optimize your Mac. It is a powerful digital toolset that helps hundreds of thousands of Mac users around the world get... Read more
Hazel 4.2.4 - Create rules for organizin...
Hazel is your personal housekeeper, organizing and cleaning folders based on rules you define. Hazel can also manage your trash and uninstall your applications. Organize your files using a familiar... Read more
Skype - Voice-over-internet pho...
Skype allows you to talk to friends, family, and co-workers across the Internet without the inconvenience of long distance telephone charges. Using peer-to-peer data transmission technology, Skype... Read more
Backup and Sync 3.40.8921.5350 - File ba...
Backup and Sync (was Google Drive) is a place where you can create, share, collaborate, and keep all of your stuff. Whether you're working with a friend on a joint research project, planning a... Read more
Dashlane 5.9.0 - Password manager and se...
Dashlane is an award-winning service that revolutionizes the online experience by replacing the drudgery of everyday transactional processes with convenient, automated simplicity - in other words,... Read more
Cocktail 11.5 - General maintenance and...
Cocktail is a general purpose utility for macOS that lets you clean, repair and optimize your Mac. It is a powerful digital toolset that helps hundreds of thousands of Mac users around the world get... Read more

Latest Forum Discussions

See All

Destiny meets its mobile match - Everyth...
Shadowgun Legends is the latest game in the Shadowgun series, and it's taking the franchise in some interesting new directions. Which is good news. The even better news is that it's coming out tomorrow, so if you didn't make it into the beta you... | Read more »
How PUBG, Fortnite, and the battle royal...
The history of the battle royale genre isn't a long one. While the nascent parts of the experience have existed ever since players first started killing one another online, it's really only in the past six years that the genre has coalesced into... | Read more »
Around the Empire: What have you missed...
Oh hi nice reader, and thanks for popping in to check out our weekly round-up of all the stuff that you might have missed across the Steel Media network. Yeah, that's right, it's a big ol' network. Obviously 148Apps is the best, but there are some... | Read more »
All the best games on sale for iPhone an...
It might not have been the greatest week for new releases on the App Store, but don't let that get you down, because there are some truly incredible games on sale for iPhone and iPad right now. Seriously, you could buy anything on this list and I... | Read more »
Everything You Need to Know About The Fo...
In just over a week, Epic Games has made a flurry of announcements. First, they revealed that Fortnite—their ultra-popular PUBG competitor—is coming to mobile. This was followed by brief sign-up period for interested beta testers before sending out... | Read more »
The best games that came out for iPhone...
It's not been the best week for games on the App Store. There are a few decent ones here and there, but nothing that's really going to make you throw down what you're doing and run to the nearest WiFi hotspot in order to download it. That's not to... | Read more »
Death Coming (Games)
Death Coming Device: iOS Universal Category: Games Price: $1.99, Version: (iTunes) Description: --- Background Story ---You Died. Pure and simple, but death was not the end. You have become an agent of Death: a... | Read more »
Hints, tips, and tricks for Empires and...
Empires and Puzzles is a slick match-stuff RPG that mixes in a bunch of city-building aspects to keep things fresh. And it's currently the Game of the Day over on the App Store. So, if you're picking it up for the first time today, we thought it'd... | Read more »
What You Need to Know About Sam Barlow’s...
Sam Barlow’s follow up to Her Story is #WarGames, an interactive video series that reimagines the 1983 film WarGames in a more present day context. It’s not exactly a game, but it’s definitely still interesting. Here are the top things you should... | Read more »
Pixel Plex Guide - How to Build Better T...
Pixel Plex is the latest city builder that has come to the App Store, and it takes a pretty different tact than the ones that came before it. Instead of being in charge of your own city by yourself, you have to work together with other players to... | Read more »

Price Scanner via

Back in stock! Apple’s full line of Certified...
Save $300-$300 on the purchase of a 2017 13″ MacBook Pro today with Certified Refurbished models at Apple. Apple’s refurbished prices are the lowest available for each model from any reseller. A... Read more
Back in stock: 13-inch 2.5GHz MacBook Pro (Ce...
Apple has Certified Refurbished 13″ 2.5GHz MacBook Pros (MD101LL/A) available for $829, or $270 off original MSRP. Apple’s one-year warranty is standard, and shipping is free: – 13″ 2.5GHz MacBook... Read more
Apple restocks Certified Refurbished 2017 13″...
Apple has Certified Refurbished 2017 13″ MacBook Airs available starting at $849. An Apple one-year warranty is included with each MacBook, and shipping is free: – 13″ 1.8GHz/8GB/128GB MacBook Air (... Read more
8-Core iMac Pro on sale for $4699, save $300
Amazon has the 8-core iMac Pro on sale for $4699 including free shipping. Their price is $300 off MSRP, and it’s the currently lowest price available for an iMac Pro. For the latest up-to-date prices... Read more
10″ 512GB WiFi iPad Pros on sale for $849, sa...
B&H Photo has Space Gray and Rose Gold 10.5″ 512GB WiFi iPad Pros on sale for $849. Their price is $150 off MSRP, and it’s the lowest price available for these models, new, from any Apple... Read more
MacBook Pro sale! B&H drops prices on new...
B&H Photo has dropped prices on new 2017 13″ MacBook Pros, with models now on sale for up to $200 off MSRP. Shipping is free, and B&H charges sales tax for NY & NJ residents only. Their... Read more
13″ MacBook Airs on sale for $100-$150 off MS...
B&H Photo has 13″ MacBook Airs on sale for $100-$150 off MSRP. Shipping is free, and B&H charges sales tax for NY & NJ residents only: – 13″ 1.8GHz/128GB MacBook Air (MQD32LL/A): $899, $... Read more
Huge iMac sale! Apple reseller now offering 2...
B&H Photo has new 2017 21″ & 27″ iMacs on sale today for up to $300 off MSRP. Shipping is free, and B&H charges sales tax for NY & NJ residents only: – 27″ 3.8GHz iMac (MNED2LL/A): $... Read more
Sale! 1.4GHz Mac mini for $399, $100 off MSRP
B&H Photo has the 1.4GHz Mac mini on sale for $399 for a limited time. Their price is $100 off MSRP, and it’s the lowest price available for a mini from any Apple reseller: – 1.4GHz Mac mini (... Read more
Sale of the year continues as Apple resellers...
Adorama has new 2017 15″ MacBook Pros on sale for $250-$300 off MSRP. Shipping is free, and Adorama charges sales tax in NJ and NY only: – 15″ 2.8GHz Touch Bar MacBook Pro Space Gray (MPTR2LL/A): $... Read more

Jobs Board

*Apple* Retail Sales Associate / Service Wri...
Job DescriptioniStore is the premier retailer of Apple products and solutions. We're looking for dedicated individuals with a passion to simplify and enhance the Read more
Integration Technician, *Apple* - Zones, In...
…adapted, and grown for over 30 years. Position Overview The Apple Integration Technician will be responsible for performing customer specific configuration Read more
*Apple* Retail - Multiple Positions - Apple,...
Sales Specialist - Retail Customer Service and Sales Transform Apple Store visitors into loyal Apple customers. When customers enter the store, you're also the Read more
*Apple* Part Time Reseller Specialist - *Ap...
…in a reseller store, you help create the energy and excitement around Apple products, providing the right solutions and getting products into customers' hands. You Read more
*Apple* Technical Specialist - Apple, Inc. (...
…customers purchase our products, you're the one who helps them get more out of their new Apple technology. Your day in the Apple Store is filled with a range of Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.