Text-based File Formats

Volume Number: 19 (2003)
Issue Number: 2
Column Tag: Section 7

CSV, OML, XML, YAML...

by Rich Morin

BSD and OSX inherit a long tradition (stretching back into the earliest days of Unix) of using text files for data storage. Although there are some exceptions, most control, log, and other system data files are written in ASCII. This makes them easier to inspect, post-process, and even edit.

Apple, whose historic bent has been more toward binary file formats (e.g., the resource fork), seems to have adopted this idea wholeheartedly. In fact, they have gone a bit further, adopting XML (rather than line-oriented files) and Unicode (rather than ASCII). As a result, many OSX files are well structured, language-independent, and quite accessible to both humans and programs.

Many vendors (e.g., Microsoft) are also joining the XML caravan. Assuming that they document both the syntax and semantics of their interchange formats, we could see a dramatic change in possibilities for file interchange.

It is not clear, however, that XML is the Right Answer for all problems. Let's look at some of the alternatives, examining their strengths and weaknesses. Don't expect a comprehensive list; there are zillions of data formats in use. Here, in any event, are some that I would recommend.

CSV

Although CSV stands for "comma-separated values", commas are by no means essential to the idea. In fact, another term for this is "flat file format". Basically, the idea is that each line is a record and that some other delimiter (e.g., colons, commas, white space) is used to separate fields. Quotes or other devices are sometimes used to protect instances of the delimiter in the body of a field. Here are some examples:

/etc/crontab
  15 3 * * * root periodic daily
/etc/gettytab
  a|std.110|110-baud:\
        :np:nd#1:cd#1:uc:sp#110:
/var/log/netinfo.log
  Dec 21 17:59:33 cerberus netinfod ...
An_Excel_File.csv
  1,2,3
  "1,2,3","3,4,5"
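
As a rough illustration of why the quoting matters, here is a minimal sketch in Python (my choice of language; nothing above depends on it). It reads the two Excel-style lines with the standard csv module, compares that with a naive split on commas, and then splits the crontab line on white space with a bounded field count so that the command stays in one piece. The variable names are made up for the example.

```python
import csv
import io

# The two lines from An_Excel_File.csv above.
excel_text = '1,2,3\n"1,2,3","3,4,5"\n'

# csv.reader honors the double quotes, so commas inside a quoted field
# are kept as data rather than treated as delimiters.
rows = list(csv.reader(io.StringIO(excel_text)))

# A naive split on commas ignores the quotes and breaks the second line
# into six pieces, with stray quote characters attached.
naive = '"1,2,3","3,4,5"'.split(",")

# The crontab line from above: six leading fields (five time fields and
# the user), then the command, which may itself contain white space.
cron_line = "15 3 * * * root periodic daily"
cron_fields = cron_line.split(None, 6)

print(rows)
print(naive)
print(cron_fields)
```

Limiting the number of splits works only because the free-form field comes last; a format with two such fields would need real quoting.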

Some files get a bit complex, adding syntax to support block structure, comments, line breaks, shell commands, etc. A highly-ornamented CSV file can begin to look like a "little language". Here is a fairly complex format, drawn from /var/named/named.local:

$TTL 86400
@ IN SOA localhost. root.localhost. (
    1997022700 ; Serial
    28800      ; Refresh
    14400      ; Retry
    3600000    ; Expire
    86400 )    ; Minimum
  IN NS  localhost.
1 IN PTR localhost.

Many BSD control files, logs, and reports use white space to delineate fields, making the assumption that the included data will not contain spaces. This was never a safe assumption for path names, but the advent of OSX has made the problem all too real.

Try running "ps -axww" and see how many path names (for both commands and arguments) contain spaces. Then, consider how you would code up a way to determine which spaces are field separators and which are not. Not simple...

Of course, ps has other problems. Run "vi 1 '2 3' 4" in one window and "ps" in another. In the ps output, the COMMAND field looks like "vi 1 2 3 4", dooming any effort to parse it into a command path and distinct command-line arguments. Some sort of quoting convention is desperately needed here.
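
To make the loss of information concrete, here is a small sketch in Python (again my choice, not anything ps itself does). It joins the argument list with plain spaces, which is how the COMMAND field appears to be built, and then with shell-style quoting via the standard shlex module, which is one possible quoting convention among many.

```python
import shlex

# The argument vector for the command: vi 1 '2 3' 4
argv = ["vi", "1", "2 3", "4"]

# Joining with plain spaces loses the boundary between arguments.
flat = " ".join(argv)

# shlex.join (Python 3.8+) applies shell-style quoting while joining,
# and shlex.split reverses it.
quoted = shlex.join(argv)
recovered = shlex.split(quoted)

print(flat)
print(quoted)
print(recovered == argv)
```

Splitting the flat form on white space yields five words rather than four arguments, and nothing in the string says which grouping was intended.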

Blanks in path names make it impossible to parse certain log files, have already broken one Apple installer script, and could well foul up many BSD control files. It's probably too late to get Apple to back off from their use of embedded blanks in system path names, so you can expect to see some problems of this nature coming along...

OML

Although CSV is attractively simple, it may be too simple for your needs. For instance, you may need to support optional data, hierarchical structures, etc. On the other hand, you may not be ready for the formality of XML (Extensible Markup Language) and unwilling to design your own markup language (and parser).

OML (Ostensible Markup Language :-) is a powerful, simple, and convenient solution to this dilemma:

# This is a sample of file-system metadata.
<snap>
  <file>ASCII text</>
  <flags>f,avbstclinmed,,</>
  <lstat>303507,33200,1,1000,20,914,8</>
  <md5>2e1240f444fc3f984186fc5a4fd28eb0</>
  <times>1040087752,230,1740993,10890145</>
</snap>

This looks quite a bit like XML, but there are some small peculiarities. That comment, for instance, isn't legal XML. Nor is "</>" a legal termination for a tag. Is that really CSV syntax in the middle of some fields? Finally, where are the header lines?

Though OML is seemingly designed to give hives to XML purists, it is also designed to work smoothly with existing XML tools. A couple of lines of Perl will strip out the comments and fill out the terminations. A quick pass through an XML parser extracts all of the named fields and attributes. Perl's split() operator, if need be, can break up the CSV data.
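
The clean-up described above can be sketched as follows. This is in Python rather than Perl, and it is only one way to do it: the helper name oml_to_xml is made up, and the code assumes that a comment is a whole line starting with "#", that tags are properly nested, and that the data contains no "<" characters.

```python
import re
import xml.etree.ElementTree as ET

oml = """\
# This is a sample of file-system metadata.
<snap>
  <file>ASCII text</>
  <flags>f,avbstclinmed,,</>
  <lstat>303507,33200,1,1000,20,914,8</>
  <md5>2e1240f444fc3f984186fc5a4fd28eb0</>
  <times>1040087752,230,1740993,10890145</>
</snap>
"""

def oml_to_xml(text):
    """Strip '#' comment lines and expand each '</>' to a full end tag."""
    # Assumption: a comment is a whole line whose first non-blank
    # character is '#'.
    body = "\n".join(line for line in text.splitlines()
                     if not line.lstrip().startswith("#"))
    open_tags = []  # names of the elements that are currently open

    def fix(match):
        closing, name = match.group(1), match.group(2)
        if closing:
            # '</>' or '</name>': close the most recently opened element.
            return "</%s>" % open_tags.pop()
        if not match.group(0).endswith("/>"):
            open_tags.append(name)  # ordinary start tag
        return match.group(0)

    return re.sub(r"<(/?)([\w.-]*)[^<>]*>", fix, body)

# A standard XML parser can now extract the named fields.
root = ET.fromstring(oml_to_xml(oml))
record = {child.tag: child.text for child in root}

# The embedded CSV data still needs a follow-on split.
lstat = record["lstat"].split(",")

print(record["file"])
print(lstat[3])
```

Keeping a stack of open tag names is what lets "</>" be filled in without looking ahead; the price is that a mismatched tag in the input goes unnoticed.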

Since OML is pretty much a "roll your own" kind of thing, there isn't any real documentation. I'd suggest a look at "Doing it Simpler" (Leigh Dodds; www.xml.com) for some ideas, however.

XML

Despite any appearance to the contrary, I am quite a fan of XML. An enormous amount of meticulous and thoughtful effort (and some rather fancy computer science!) is going into creating "industrial strength" data formats and processing tools. In addition, the W3C (World Wide Web Consortium; www.w3.org) is being very careful to make sure that the official standards are open to all players.

For some projects, you really need to bring in power tools. The host of translators, validators, and other tools that XML provides can make otherwise impossible projects feasible, if not necessarily reasonable. Using XML also increases the chance that someone else's program will be able to parse your data. Given all of that, complaining about a bit of formality seems rather petty.

I won't try to cover XML here; there are shelves of books on the subject, with more coming out on a weekly basis. O'Reilly and Addison-Wesley have the broadest coverage; O'Reilly's XML web site (www.xml.com) is a good place to start your journey...

YAML

CSV has low overhead, is simple to read and edit, and handles lists and arrays well. OML and XML are a bit more bulky, but handle optional data and hierarchies smoothly. XML has an impressive suite of documentation, standards activities, and support software. Sometimes, however, you want low overhead, simplicity, and support for arbitrary data structures.

YAML (YAML Ain't Markup Language) fills this niche quite admirably. The syntax is simple and clean. The basic data structures are sequences (i.e., Perl arrays) and mappings (i.e., Perl hashes). YAML handles lists, arrays, and hierarchies easily; with a bit of extra work, it can handle arbitrary Perl data structures (e.g., cyclic graphs).

Here is the previous example, transliterated into YAML:

# This is a sample of file-system metadata.
snap:
  file:  'ASCII text'
  flags: 'f,avbstclinmed,,'
  lstat: '303507,33200,1,1000,20,914,8'
  md5:   '2e1240f444fc3f984186fc5a4fd28eb0'
  times: '1040087752,230,1740993,10890145'

A more idiomatic rendering, however, would look like:

# This is a sample of file-system metadata.
snap:
  file:  ASCII text
  flags: [ f, avbstclinmed, , ]
  lstat: [ 303507, 33200, 1, 1000, 20, 914, 8 ]
  md5:   2e1240f444fc3f984186fc5a4fd28eb0
  times: [ 1040087752, 230, 1740993, 10890145 ]

Aside from the fact that some spaces have been added after commas and the quotes have been eliminated (some turned into brackets), the second version looks very similar to the first. The resulting data structure is quite different, however; the bracketed lists have been turned into YAML sequences. This means that they don't have to be parsed in a follow-on step. Here is some access code, in Perl:

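# Assumes %yaml holds the parsed form of the second rendering.
$file = $yaml{snap}{file};      # 'ASCII text'
$uid  = $yaml{snap}{lstat}[3];  # fourth lstat entry (1000)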
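
For comparison, here is roughly what a loader would be expected to hand back for each rendering, written out by hand as Python literals. No YAML library is invoked, only two of the fields are shown, and the treatment of unquoted numbers as integers is my assumption about typical loader behavior.

```python
# First rendering: every value is a quoted string.
quoted = {"snap": {"file": "ASCII text",
                   "lstat": "303507,33200,1,1000,20,914,8"}}

# Second rendering: the bracketed list arrives as a sequence.
idiomatic = {"snap": {"file": "ASCII text",
                      "lstat": [303507, 33200, 1, 1000, 20, 914, 8]}}

# The string form needs a follow-on split and conversion...
uid_from_quoted = int(quoted["snap"]["lstat"].split(",")[3])

# ...while the sequence form can be indexed directly.
uid_from_idiomatic = idiomatic["snap"]["lstat"][3]

print(uid_from_quoted, uid_from_idiomatic)
```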

YAML has several ways to write textual data. Here are some examples:

  - a simple text item
  - "double-quoted text\n "
  - 'single-quoted text'
  - >
    This text
    is freeform.
  - |
    This text
    isn't.
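
The last two items differ in how line breaks are treated. As a simplified sketch in Python, written by hand rather than by calling a YAML parser, the folded (">") form joins adjacent lines with spaces, while the literal ("|") form keeps each line break. The functions fold and literal are made up for this illustration, and they ignore blank lines, extra indentation, and the other refinements in the YAML rules.

```python
def fold(lines):
    # '>' style: breaks between adjacent text lines become spaces;
    # a single trailing newline is kept.
    return " ".join(lines) + "\n"

def literal(lines):
    # '|' style: every line break is preserved.
    return "\n".join(lines) + "\n"

folded_text = fold(["This text", "is freeform."])
literal_text = literal(["This text", "isn't."])

print(repr(folded_text))
print(repr(literal_text))
```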

Although YAML has nowhere near the amount of documentation that XML has, there are some useful resources to recommend. The YAML web site (www.yaml.org) is the logical place to start; be sure to visit the YAML wiki. I'd also recommend a look at "Look Ma, No Tags" (Kendall Grant Clark; www.xml.com) for an informal introduction.


Rich Morin has been using computers since 1970, Unix since 1983, and Mac-based Unix since 1986 (when he helped Apple create A/UX 1.0). When he isn't writing this column, Rich runs Prime Time Freeware (www.ptf.com), a publisher of books and CD-ROMs for the Free and Open Source software community. Feel free to write to Rich at rdm@ptf.com.

 
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.