Text-based File Formats

Volume Number: 19 (2003)
Issue Number: 2
Column Tag: Section 7

CSV, OML, XML, YAML...

by Rich Morin

BSD and OSX inherit a long tradition (stretching back into the earliest days of Unix) of using text files for data storage. Although there are some exceptions, most control, log, and other system data files are written in ASCII. This makes them easier to inspect, post-process, and even edit.

Apple, whose historic bent has been more toward binary file formats (e.g., the resource fork), seems to have adopted this idea wholeheartedly. In fact, they have gone a bit further, adopting XML (rather than line-oriented files) and Unicode (rather than ASCII). As a result, many OSX files are well structured, language-independent, and quite accessible to both humans and programs.

Many vendors (e.g., Microsoft) are also joining the XML caravan. Assuming that they document both the syntax and semantics of their interchange formats, we could see a dramatic change in possibilities for file interchange.

It is not clear, however, that XML is the Right Answer for all problems. Let's look at some of the alternatives, examining their strengths and weaknesses. Don't expect a comprehensive list; there are zillions of data formats in use. Here, in any event, are some that I would recommend.

CSV

Although CSV stands for "comma-separated values", commas are by no means essential to the idea. In fact, another term for this is "flat file format". Basically, the idea is that each line is a record and that some other delimiter (e.g., colons, commas, white space) is used to separate fields. Quotes or other devices are sometimes used to protect instances of the delimiter in the body of a field. Here are some examples:

/etc/crontab
  15 3 * * * root periodic daily
/etc/gettytab
  a|std.110|110-baud:\
        :np:nd#1:cd#1:uc:sp#110:
/var/log/netinfo.log
  Dec 21 17:59:33 cerberus netinfod ...
An_Excel_File.csv
  1,2,3
  "1,2,3","3,4,5"

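As a rough illustration of how a program might read such records, here is a minimal Python sketch. It assumes Python's standard csv module for the quoted Excel-style lines and plain white-space splitting for the crontab line; the variable names are made up for this example.

```python
import csv
import io

# The two lines from An_Excel_File.csv, as they would appear on disk.
excel_text = '1,2,3\n"1,2,3","3,4,5"\n'

# csv.reader honors the quotes, so commas inside a quoted field survive.
rows = list(csv.reader(io.StringIO(excel_text)))
print(rows)

# The crontab line is white-space delimited; only the last field (the
# command) may contain spaces, so the number of splits is capped at six.
crontab_line = "15 3 * * * root periodic daily"
fields = crontab_line.split(None, 6)
print(fields)
```

The first CSV record yields three fields and the second yields two, because the quotes protect the embedded commas. The crontab trick works only because the free-form field comes last.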
Some files get a bit complex, adding syntax to support block structure, comments, line breaks, shell commands, etc. A highly ornamented CSV file can begin to look like a "little language". Here is a fairly complex format, drawn from /var/named/named.local:

$TTL 86400
@ IN SOA localhost. root.localhost. (
    1997022700 ; Serial
    28800      ; Refresh
    14400      ; Retry
    3600000    ; Expire
    86400 )    ; Minimum
  IN NS  localhost.
1 IN PTR localhost.
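To give a feel for what the extra syntax costs a parser, here is a small Python sketch that reduces the file above to logical records. It is only an illustration: the function name is hypothetical, and it deliberately ignores real zone-file rules such as quoted strings and the meaning of a leading blank (which stands for "same owner as the previous record").

```python
zone_text = """\
$TTL 86400
@ IN SOA localhost. root.localhost. (
    1997022700 ; Serial
    28800      ; Refresh
    14400      ; Retry
    3600000    ; Expire
    86400 )    ; Minimum
  IN NS  localhost.
1 IN PTR localhost.
"""

def logical_records(text):
    """Return one token list per record (hypothetical helper)."""
    records, current, depth = [], [], 0
    for line in text.splitlines():
        line = line.split(";", 1)[0]          # drop any ';' comment
        spaced = line.replace("(", " ( ").replace(")", " ) ")
        for token in spaced.split():
            if token == "(":
                depth += 1                    # record continues on later lines
            elif token == ")":
                depth -= 1
            else:
                current.append(token)
        if depth == 0 and current:            # record is complete
            records.append(current)
            current = []
    return records

records = logical_records(zone_text)
for record in records:
    print(record)
```

Even this toy version needs comment stripping, a nesting counter, and a notion of records that span lines, none of which a plain delimiter-splitting reader would provide.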

Many BSD control files, logs, and reports use white space to delineate fields, making the assumption that the included data will not contain spaces. This was never a safe assumption for path names, but the advent of OSX has made the problem all too real.

Try running "ps -axww" and see how many path names (for both commands and arguments) contain spaces. Then, consider how you would code up a way to determine which spaces are field separators and which are not. Not simple...

Of course, ps has other problems. Run "vi 1 '2 3' 4" in one window and "ps" in another. In the ps output, the COMMAND field looks like "vi 1 2 3 4", dooming any effort to parse it into a command path and distinct command-line arguments. Some sort of quoting convention is desperately needed here.
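The loss of information is easy to reproduce. The Python sketch below joins an argument list with blanks, as the ps output appears to do, and then shows one possible quoting convention (the Bourne shell's, via the standard shlex module) that would let a reader recover the original arguments. Nothing here reflects how ps is actually implemented.

```python
import shlex

# The argument vector for: vi 1 '2 3' 4
argv = ["vi", "1", "2 3", "4"]

# Joined with blanks, the argument boundaries are gone.
flat = " ".join(argv)
print(flat)
print(flat.split())      # five words, not four arguments

# Shell-style quoting keeps the boundaries recoverable.
quoted = " ".join(shlex.quote(arg) for arg in argv)
print(quoted)
recovered = shlex.split(quoted)
print(recovered == argv)
```

Any unambiguous convention would do; shell quoting is used here only because a parser for it is readily available.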

Blanks in path names make it impossible to parse certain log files, have already broken one Apple installer script, and could well foul up many BSD control files. It's probably too late to get Apple to back off from their use of embedded blanks in system path names, so you can expect to see some problems of this nature coming along...

OML

Although CSV is attractively simple, it may be too simple for your needs. For instance, you may need to support optional data, hierarchical structures, etc. On the other hand, you may not be ready for the formality of XML (Extensible Markup Language) and unwilling to design your own markup language (and parser).

OML (Ostensible Markup Language :-) is a powerful, simple, and convenient solution to this dilemma:

# This is a sample of file-system metadata.
<snap>
  <file>ASCII text</>
  <flags>f,avbstclinmed,,</>
  <lstat>303507,33200,1,1000,20,914,8</>
  <md5>2e1240f444fc3f984186fc5a4fd28eb0</>
  <times>1040087752,230,1740993,10890145</>
</snap>

This looks quite a bit like XML, but there are some small peculiarities. That comment, for instance, isn't legal XML. Nor is "</>" a legal termination for a tag. Is that really CSV syntax in the middle of some fields? Finally, where are the header lines?

Though OML is seemingly designed to give hives to XML purists, it is also designed to work smoothly with existing XML tools. A couple of lines of Perl will strip out the comments and fill out the terminations. A quick pass through an XML parser extracts all of the named fields and attributes. Perl's split() operator, if need be, can break up the CSV data.
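As a sketch of that pipeline (in Python rather than Perl, using only the standard library), the following code strips the comment line, fills in each "</>" from a stack of open tags, and hands the result to an XML parser. It assumes that comments occupy whole lines, that data never contains "<", and that there are no self-closing tags; the function name oml_to_xml is invented for this example.

```python
import re
import xml.etree.ElementTree as ET

oml_text = """\
# This is a sample of file-system metadata.
<snap>
  <file>ASCII text</>
  <flags>f,avbstclinmed,,</>
  <lstat>303507,33200,1,1000,20,914,8</>
  <md5>2e1240f444fc3f984186fc5a4fd28eb0</>
  <times>1040087752,230,1740993,10890145</>
</snap>
"""

def oml_to_xml(text):
    """Turn the OML sample into well-formed XML (hypothetical helper)."""
    # Drop whole-line '#' comments.
    lines = [ln for ln in text.splitlines()
             if not ln.lstrip().startswith("#")]
    open_tags = []

    def fill(match):
        slash, name = match.group(1), match.group(2)
        if not slash:                 # opening tag: remember its name
            open_tags.append(name)
            return match.group(0)
        opened = open_tags.pop()      # closing tag: supply a missing name
        return "</%s>" % (name or opened)

    return re.sub(r"<(/?)([\w.-]*)[^>]*>", fill, "\n".join(lines))

root = ET.fromstring(oml_to_xml(oml_text))
record = {child.tag: child.text for child in root}

# The CSV data inside a field is broken up in a follow-on step.
lstat = record["lstat"].split(",")
print(root.tag, record["file"], lstat[3])
```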

Since OML is pretty much a "roll your own" kind of thing, there isn't any real documentation. I'd suggest a look at "Doing it Simpler" (Leigh Dodds; www.xml.com) for some ideas, however.

XML

Despite any appearance to the contrary, I am quite a fan of XML. An enormous amount of meticulous and thoughtful effort (and some rather fancy computer science!) is going into creating "industrial strength" data formats and processing tools. In addition, the W3C (World Wide Web Consortium; www.w3.org) is being very careful to make sure that the official standards are open to all players.

For some projects, you really need to bring in power tools. The host of translators, validators, and other tools that XML provides can make otherwise impossible projects feasible, if not necessarily reasonable. Using XML also increases the chance that someone else's program will be able to parse your data. Given all of that, complaining about a bit of formality seems rather petty.

I won't try to cover XML here; there are shelves of books on the subject, with more coming out on a weekly basis. O'Reilly and Addison-Wesley have the broadest coverage; O'Reilly's XML web site (www.xml.com) is a good place to start your journey...

YAML

CSV has low overhead, is simple to read and edit, and handles lists and arrays well. OML and XML are a bit more bulky, but handle optional data and hierarchies smoothly. XML has an impressive suite of documentation, standards activities, and support software. Sometimes, however, you want low overhead, simplicity, and support for arbitrary data structures.

YAML (YAML Ain't Markup Language) fills this niche quite admirably. The syntax is simple and clean. The basic data structures are sequences (i.e., Perl arrays) and mappings (i.e., Perl hashes). YAML handles lists, arrays, and hierarchies easily; with a bit of extra work, it can handle arbitrary Perl data structures (e.g., cyclic graphs).

Here is the previous example, transliterated into YAML:

# This is a sample of file-system metadata.
snap:
  file:  'ASCII text'
  flags: 'f,avbstclinmed,,'
  lstat: '303507,33200,1,1000,20,914,8'
  md5:   '2e1240f444fc3f984186fc5a4fd28eb0'
  times: '1040087752,230,1740993,10890145'

A more idiomatic rendering, however, would look like:

# This is a sample of file-system metadata.
snap:
  file:  ASCII text
  flags: [ f, avbstclinmed, '', '' ]
  lstat: [ 303507, 33200, 1, 1000, 20, 914, 8 ]
  md5:   2e1240f444fc3f984186fc5a4fd28eb0
  times: [ 1040087752, 230, 1740993, 10890145 ]

Aside from the fact that some spaces have been added after commas and the quotes have been eliminated (some turned into brackets), the second version looks very similar to the first. The resulting data structure is quite different, however; the bracketed lists have been turned into YAML sequences. This means that they don't have to be parsed in a follow-on step. Here is some access code, in Perl:

# %yaml holds the parsed document shown above.
$file = $yaml{snap}{file};      # 'ASCII text'
$uid  = $yaml{snap}{lstat}[3];  # 1000 (fourth lstat value)
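For comparison, here is the same access written in Python. The two dictionaries are hand-written stand-ins for what a YAML loader would be expected to return for the two renderings above (no YAML library is used, and only the fields needed are shown); the assumption is that unquoted numbers load as integers.

```python
# Stand-in for the first rendering: every value is a quoted string.
quoted = {"snap": {"file": "ASCII text",
                   "lstat": "303507,33200,1,1000,20,914,8"}}

# Stand-in for the idiomatic rendering: the bracketed list is a sequence.
idiomatic = {"snap": {"file": "ASCII text",
                      "lstat": [303507, 33200, 1, 1000, 20, 914, 8]}}

# With the string form, the list has to be split and converted by hand.
uid_from_string = int(quoted["snap"]["lstat"].split(",")[3])

# With the sequence form, the element can be indexed directly.
uid_from_sequence = idiomatic["snap"]["lstat"][3]

print(uid_from_string, uid_from_sequence)
```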

YAML has several ways to write textual data. Here are some examples:

  - a simple text item
  - "double-quoted text\n "
  - 'single-quoted text'
  - >
      This text
      is freeform.
  - |
      This text
      isn't.
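The last two items differ in how line breaks are treated: ">" introduces a folded scalar, in which single line breaks are read as spaces, while "|" introduces a literal scalar, in which they are preserved. The Python sketch below mimics that difference for the simple case shown here; the helper names are hypothetical, and real YAML folding has additional rules for blank and more-indented lines.

```python
def folded(lines):
    """Mimic '>' for simple input: join the lines with single spaces."""
    return " ".join(lines) + "\n"

def literal(lines):
    """Mimic '|': keep every line break as written."""
    return "\n".join(lines) + "\n"

folded_text = folded(["This text", "is freeform."])
literal_text = literal(["This text", "isn't."])

print(repr(folded_text))
print(repr(literal_text))
```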

Although YAML has nowhere near the amount of documentation that XML has, there are some useful resources to recommend. The YAML web site (www.yaml.org) is the logical place to start; be sure to visit the YAML wiki. I'd also recommend a look at "Look Ma, No Tags" (Kendall Grant Clark; www.xml.com) for an informal introduction.


Rich Morin has been using computers since 1970, Unix since 1983, and Mac-based Unix since 1986 (when he helped Apple create A/UX 1.0). When he isn't writing this column, Rich runs Prime Time Freeware (www.ptf.com), a publisher of books and CD-ROMs for the Free and Open Source software community. Feel free to write to Rich at rdm@ptf.com.

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.