Text-based File Formats

Volume Number: 19 (2003)
Issue Number: 2
Column Tag: Section 7

CSV, OML, XML, YAML...

by Rich Morin

BSD and OSX inherit a long tradition (stretching back into the earliest days of Unix) of using text files for data storage. Although there are some exceptions, most control, log, and other system data files are written in ASCII. This makes them easier to inspect, post-process, and even edit.

Apple, whose historic bent has been more toward binary file formats (e.g., the resource fork), seems to have adopted this idea wholeheartedly. In fact, they have gone a bit further, adopting XML (rather than line-oriented files) and Unicode (rather than ASCII). As a result, many OSX files are well structured, language-independent, and quite accessible to both humans and programs.

Many vendors (e.g., Microsoft) are also joining the XML caravan. Assuming that they document both the syntax and semantics of their interchange formats, we could see a dramatic change in possibilities for file interchange.

It is not clear, however, that XML is the Right Answer for all problems. Let's look at some of the alternatives, examining their strengths and weaknesses. Don't expect a comprehensive list; there are zillions of data formats in use. Here, in any event, are some that I would recommend.

CSV

Although CSV stands for "comma-separated values", commas are by no means essential to the idea. In fact, another term for this is "flat file format". Basically, the idea is that each line is a record and that some delimiter (e.g., colons, commas, white space) separates the fields. Quotes or other devices are sometimes used to protect instances of the delimiter in the body of a field. Here are some examples:

/etc/crontab
  15 3 * * * root periodic daily
/etc/gettytab
  a|std.110|110-baud:\
        :np:nd#1:cd#1:uc:sp#110:
/var/log/netinfo.log
  Dec 21 17:59:33 cerberus netinfod ...
An_Excel_File.csv
  1,2,3
  "1,2,3","3,4,5"
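
As a rough illustration of why the quoting matters, here is a minimal sketch (in Python, rather than the Perl used elsewhere in this column) that reads the two An_Excel_File.csv records with the standard csv module and compares the result with a naive split on commas. The variable names are made up for the example.

```python
import csv
import io

# The two records from the An_Excel_File.csv example.
sample = '1,2,3\n"1,2,3","3,4,5"\n'

# csv.reader honors the quotes, so commas inside a quoted field are data.
rows = list(csv.reader(io.StringIO(sample)))

# A naive split treats every comma as a separator and keeps the quotes.
naive = [line.split(",") for line in sample.splitlines()]

print(rows)   # [['1', '2', '3'], ['1,2,3', '3,4,5']]
print(naive)  # second record comes back as six fragments, quotes included
```

The second record has two fields, not six; only a reader that understands the quoting convention can tell.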

Some files get a bit complex, adding syntax to support block structure, comments, line breaks, shell commands, etc. A highly ornamented CSV file can begin to look like a "little language". Here is a fairly complex format, drawn from /var/named/named.local:

$TTL 86400
@ IN SOA localhost. root.localhost. (
    1997022700 ; Serial
    28800      ; Refresh
    14400      ; Retry
    3600000    ; Expire
    86400 )    ; Minimum
  IN NS  localhost.
1 IN PTR localhost.

Many BSD control files, logs, and reports use white space to delineate fields, making the assumption that the included data will not contain spaces. This was never a safe assumption for path names, but the advent of OSX has made the problem all too real.

Try running "ps -axww" and see how many path names (for both commands and arguments) contain spaces. Then, consider how you would code up a way to determine which spaces are field separators and which are not. Not simple...
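
To make the difficulty concrete, here is a small sketch in Python. The sample line is hypothetical (the PID, times, and application path are invented), and it assumes the default BSD column layout of PID, TT, STAT, TIME, and COMMAND.

```python
# Hypothetical line in the style of "ps -axww" output; the process and
# path are made up for illustration.
line = "  123  ??  Ss     0:01.23 /Applications/My App.app/Contents/MacOS/My App -psn_0_1"

# Splitting on every run of white space breaks the command into pieces.
naive = line.split()

# If the number of leading fields is known (four here: PID, TT, STAT, TIME),
# limiting the split keeps the rest of the line together as one field.
pid, tt, stat, time_used, command = line.split(None, 4)

print(len(naive))  # 8 fragments instead of 5 fields
print(command)     # /Applications/My App.app/Contents/MacOS/My App -psn_0_1
```

Limiting the split only rescues the last column. Nothing in the line says where the command path ends and its arguments begin.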

Of course, ps has other problems. Run "vi 1 '2 3' 4" in one window and "ps" in another. In the ps output, the COMMAND field looks like "vi 1 2 3 4", dooming any effort to parse it into a command path and distinct command-line arguments. Some sort of quoting convention is desperately needed here.
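
One possible quoting convention is the shell's own. The following Python sketch (not anything ps actually does) uses the standard shlex module to show how the vi example loses its argument boundaries when joined with plain spaces, and how shell-style quoting preserves them.

```python
import shlex

# The argument vector from the vi example.
argv = ["vi", "1", "2 3", "4"]

# Joining with plain spaces, as the ps display effectively does,
# throws away the argument boundaries.
flat = " ".join(argv)

# Quoting each argument shell-style keeps the boundaries recoverable.
quoted = " ".join(shlex.quote(arg) for arg in argv)
recovered = shlex.split(quoted)

print(flat)       # vi 1 2 3 4
print(quoted)     # vi 1 '2 3' 4
print(recovered)  # ['vi', '1', '2 3', '4']
```

Parsing the flat form yields five arguments; parsing the quoted form yields the original four.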

Blanks in path names make it impossible to parse certain log files, have already broken one Apple installer script, and could well foul up many BSD control files. It's probably too late to get Apple to back off from their use of embedded blanks in system path names, so you can expect to see some problems of this nature coming along...

OML

Although CSV is attractively simple, it may be too simple for your needs. For instance, you may need to support optional data, hierarchical structures, etc. On the other hand, you may not be ready for the formality of XML (Extensible Markup Language) and unwilling to design your own markup language (and parser).

OML (Ostensible Markup Language :-) is a powerful, simple, and convenient solution to this dilemma:

# This is a sample of file-system metadata.
<snap>
  <file>ASCII text</>
  <flags>f,avbstclinmed,,</>
  <lstat>303507,33200,1,1000,20,914,8</>
  <md5>2e1240f444fc3f984186fc5a4fd28eb0</>
  <times>1040087752,230,1740993,10890145</>
</snap>

This looks quite a bit like XML, but there are some small peculiarities. That comment, for instance, isn't legal XML. Nor is "</>" a legal termination for a tag. Is that really CSV syntax in the middle of some fields? Finally, where are the header lines?

Though OML is seemingly designed to give hives to XML purists, it is also designed to work smoothly with existing XML tools. A couple of lines of Perl will strip out the comments and fill out the terminations. A quick pass through an XML parser extracts all of the named fields and attributes. Perl's split() operator, if need be, can break up the CSV data.
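
Here is a minimal sketch of that pipeline, written in Python rather than Perl. It assumes that comments occupy whole lines starting with "#" and that tags carry no attributes; the function name oml_to_xml is invented for the example.

```python
import re
import xml.etree.ElementTree as ET

OML = """\
# This is a sample of file-system metadata.
<snap>
  <file>ASCII text</>
  <flags>f,avbstclinmed,,</>
  <lstat>303507,33200,1,1000,20,914,8</>
  <md5>2e1240f444fc3f984186fc5a4fd28eb0</>
  <times>1040087752,230,1740993,10890145</>
</snap>
"""

def oml_to_xml(text):
    """Drop '#' comment lines and expand each '</>' to the innermost open tag."""
    body = "\n".join(line for line in text.splitlines()
                     if not line.lstrip().startswith("#"))
    open_tags = []

    def fill(match):
        closing, name = match.group(1), match.group(2)
        if not closing:               # <name>: remember it and leave it alone
            open_tags.append(name)
            return match.group(0)
        return f"</{open_tags.pop()}>"  # </> or </name>: close the innermost tag

    return re.sub(r"<(/?)([\w.-]*)>", fill, body)

# A standard XML parser can now extract the named fields.
root = ET.fromstring(oml_to_xml(OML))
fields = {child.tag: child.text for child in root}

# The CSV data inside a field still needs its own split.
lstat = fields["lstat"].split(",")
print(fields["file"], lstat[3])  # ASCII text 1000
```

Supporting attributes or inline comments would take a somewhat smarter pattern, but the approach stays the same: normalize to real XML, then hand the result to existing tools.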

Since OML is pretty much a "roll your own" kind of thing, there isn't any real documentation. I'd suggest a look at "Doing it Simpler" (Leigh Dodds; www.xml.com) for some ideas, however.

XML

Despite any appearance to the contrary, I am quite a fan of XML. An enormous amount of meticulous and thoughtful effort (and some rather fancy computer science!) is going into creating "industrial strength" data formats and processing tools. In addition, the W3C (World Wide Web Consortium; www.w3.org) is being very careful to make sure that the official standards are open to all players.

For some projects, you really need to bring in power tools. The host of translators, validators, and other tools that XML provides can make otherwise impossible projects feasible, if not necessarily reasonable. Using XML also increases the chance that someone else's program will be able to parse your data. Given all of that, complaining about a bit of formality seems rather petty.

I won't try to cover XML here; there are shelves of books on the subject, with more coming out on a weekly basis. O'Reilly and Addison-Wesley have the broadest coverage; O'Reilly's XML web site (www.xml.com) is a good place to start your journey...

YAML

CSV has low overhead, is simple to read and edit, and handles lists and arrays well. OML and XML are a bit more bulky, but handle optional data and hierarchies smoothly. XML has an impressive suite of documentation, standards activities, and support software. Sometimes, however, you want low overhead, simplicity, and support for arbitrary data structures.

YAML (YAML Ain't Markup Language) fills this niche quite admirably. The syntax is simple and clean. The basic data structures are sequences (i.e., Perl arrays) and mappings (i.e., Perl hashes). YAML handles lists, arrays, and hierarchies easily; with a bit of extra work, it can handle arbitrary Perl data structures (e.g., cyclic graphs).

Here is the previous example, transliterated into YAML:

# This is a sample of file-system metadata.
snap:
  file:  'ASCII text'
  flags: 'f,avbstclinmed,,'
  lstat: '303507,33200,1,1000,20,914,8'
  md5:   '2e1240f444fc3f984186fc5a4fd28eb0'
  times: '1040087752,230,1740993,10890145'

A more idiomatic rendering, however, would look like:

# This is a sample of file-system metadata.
snap:
  file:  ASCII text
  flags: [ f, avbstclinmed, , ]
  lstat: [ 303507, 33200, 1, 1000, 20, 914, 8 ]
  md5:   2e1240f444fc3f984186fc5a4fd28eb0
  times: [ 1040087752, 230, 1740993, 10890145 ]

Aside from the fact that some spaces have been added after commas and the quotes have been eliminated (some turned into brackets), the second version looks very similar to the first. The resulting data structure is quite different, however; the bracketed lists have been turned into YAML sequences. This means that they don't have to be parsed in a follow-on step. Here is some access code, in Perl:

# Assumes the parsed document has been loaded into the hash %yaml.
$file = $yaml{snap}{file};
$uid  = $yaml{snap}{lstat}[3];
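
For comparison, here is the same access sketched in Python. No YAML library is used; the two dictionaries are written out by hand and merely stand in for what a loader would be expected to return for the quoted and the idiomatic renderings.

```python
# Hand-written stand-ins for the loaded data (an assumption; nothing is
# actually parsed here).
quoted = {"snap": {"file": "ASCII text",
                   "lstat": "303507,33200,1,1000,20,914,8"}}
idiomatic = {"snap": {"file": "ASCII text",
                      "lstat": [303507, 33200, 1, 1000, 20, 914, 8]}}

# Quoted rendering: the CSV string still needs a follow-on split.
uid_quoted = int(quoted["snap"]["lstat"].split(",")[3])

# Idiomatic rendering: the sequence is indexed directly.
uid_idiomatic = idiomatic["snap"]["lstat"][3]

print(uid_quoted, uid_idiomatic)  # 1000 1000
```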

YAML has several ways to write textual data. Here are some examples:

- a simple text item
- "double-quoted text\n "
- 'single-quoted text'
- >
    This text
    is freeform.
- |
    This text
    isn't.
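
The last two items differ in how line breaks are treated. As a rough approximation (plain Python, no YAML library, covering only the simple case of non-blank lines at a single indentation level with default handling of the final newline), the ">" style folds line breaks into spaces, while the "|" style keeps them. The helper names are invented for the example.

```python
def folded(lines):
    """Approximate the '>' style: line breaks inside the text become spaces."""
    return " ".join(lines) + "\n"

def literal(lines):
    """Approximate the '|' style: line breaks are kept as written."""
    return "\n".join(lines) + "\n"

print(repr(folded(["This text", "is freeform."])))  # 'This text is freeform.\n'
print(repr(literal(["This text", "isn't."])))       # "This text\nisn't.\n"
```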

Although YAML has nowhere near the amount of documentation that XML has, there are some useful resources to recommend. The YAML web site (www.yaml.org) is the logical place to start; be sure to visit the YAML wiki. I'd also recommend a look at "Look Ma, No Tags" (Kendall Grant Clark; www.xml.com) for an informal introduction.


Rich Morin has been using computers since 1970, Unix since 1983, and Mac-based Unix since 1986 (when he helped Apple create A/UX 1.0). When he isn't writing this column, Rich runs Prime Time Freeware (www.ptf.com), a publisher of books and CD-ROMs for the Free and Open Source software community. Feel free to write to Rich at rdm@ptf.com.

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.