TweetFollow Us on Twitter

Text-based File Formats

Volume Number: 19 (2003)
Issue Number: 2
Column Tag: Section 7

Text-based File Formats

CSV, OML, XML, YAML...

by Rich Morin

BSD and OSX inherit a long tradition (stretching back into the earliest days of Unix) of using text files for data storage. Although there are some exceptions, most control, log, and other system data files are written in ASCII. This makes them easier to inspect, post-process, and even edit.

Apple, whose historic bent has been more toward binary file formats (e.g., the resource fork), seems to have adopted this idea wholeheartedly. In fact, they have gone a bit further, adopting XML (rather than line-oriented files) and Unicode (rather than ASCII). As a result, many OSX files are well structured, language-independent, and quite accessible to both humans and programs.

Many vendors (e.g., Microsoft) are also joining the XML caravan. Assuming that they document both the syntax and semantics of their interchange formats, we could see a dramatic change in possibilities for file interchange.

It is not clear, however, that XML is the Right Answer for all problems. Let's look at some of the alternatives, examining their strengths and weaknesses. Don't expect a comprehensive list; there are zillions of data formats in use. Here, in any event, are some that I would recommend.

CSV

Although CSV stands for "comma-separated values", commas are by no means essential to the idea. In fact, another term for this is "flat file format". Basically, the idea is that each line is a record and that some other delimiter (e.g., colons, commas, white space) is used to separate fields. Quotes or other devices are sometimes used to protect instances of the delimiter in the body of a field. Here are some examples:

/etc/crontab
  15 3 * * * root periodic daily
/etc/gettytab
  a|std.110|110-baud:\
        :np:nd#1:cd#1:uc:sp#110:
/var/log/netinfo.log
  Dec 21 17:59:33 cerberus netinfod ...
An_Excel_File.csv
  1,2,3
  "1,2,3","3,4,5"

Some files get a bit complex, adding syntax to support block structure, comments, line breaks, shell commands, etc. A highly-ornamented CSV file can begin to look like a "little language". Here is a fairly complex format, drawn from /var/named/named.local:

$TTL 86400
@ IN SOA localhost. root.localhost. (
    1997022700 ; Serial
    28800      ; Refresh
    14400      ; Retry
    3600000    ; Expire
    86400 )    ; Minimum
  IN NS  localhost.
1 IN PTR localhost.

Many BSD control files, logs, and reports use white space to delineate fields, making the assumption that the included data will not contain spaces. This was never a safe assumption for path names, but the advent of OSX has made the problem all too real.

Try running "ps -axww" and see how man path names (for both commands and arguments) contain spaces. Then, consider how you would code up a way to determine which spaces are field separators and which are not. Not simple...

Of course, ps has other problems. Run "vi 1 '2 3' 4" in one window and "ps" in another. In the ps output, the COMMAND field looks like "vi 1 2 3 4", dooming any effort to parse it into a command path and distinct command-line arguments. Some sort of quoting convention is desperately needed here.

Blanks in path names make it impossible to parse certain log files, have already broken one Apple installer script, and could well foul up many BSD control files. It's probably too late to get Apple to back off from their use of embedded blanks in system path names, so you can expect to see some problems of this nature coming along...

OML

Although CSV is attractively simple, it may be too simple for your needs. For instance, you may need to support optional data, hierarchical structures, etc. On the other hand, you may not be ready for the formality of XML (Extensible Markup Language) and unwilling to design your own markup language (and parser).

OML (Ostensible Markup Language :-) is a powerful, simple, and convenient solution to this dilemma:

# This is a sample of file-system metadata.
<snap>
  <file>ASCII text</>
  <flags>f,avbstclinmed,,</>
  <lstat>303507,33200,1,1000,20,914,8</>
  <md5>2e1240f444fc3f984186fc5a4fd28eb0</>
  <times>1040087752,230,1740993,10890145</>
</snap>

This looks quite a bit like XML, but there are some small peculiarities. That comment, for instance, isn't legal XML. Nor is "</>" a legal termination for a tag. Is that really CSV syntax in the middle of some fields? Finally, where are the header lines?

Though OML is seemingly designed to give hives to XML purists, it is also designed to work smoothly with existing XML tools. A couple of lines of Perl will strip out the comments and fill out the terminations. A quick pass through an XML parser extracts all of the named fields and attributes. Perl's split() operator, if need be, can break up the CSV data.

Since OML is pretty much a "roll your own" kind of thing, there isn't any real documentation. I'd suggest a look at "Doing it Simpler" (Leigh Dodds; www.xml.com) for some ideas, however.

XML

Despite any appearance to the contrary, I am quite a fan of XML. An enormous amount of meticulous and thoughtful effort (and some rather fancy computer science!) is going into creating "industrial strength" data formats and processing tools. In addition, the W3C (World Wide Web Consortium; www.w3.org) is being very careful to make sure that the official standards are open to all players.

For some projects, you really need to bring in power tools. The host of translators, validators, and other tools that XML provides can make otherwise impossible projects feasible, if not necessarily reasonable. Using XML also increases the chance that someone else's program will be able to parse your data. Given all of that, complaining about a bit of formality seems rather petty.

I won't try to cover XML here; there are shelves of books on the subject, with more coming out on a weekly basis. O'Reilly and Addison-Wesley have the broadest coverage; O'Reilly's XML web site (www.xml.com) is a good place to start your journey...

YAML

CSV has low overhead, is simple to read and edit, and handles lists and arrays well. OML and XML are a bit more bulky, but handle optional data and hierarchies smoothly. XML has an impressive suite of documentation, standards activities, and support software. Sometimes, however, you want low overhead, simplicity, and support for arbitrary data structures.

YAML (YAML Ain't Markup Language) fills this niche quite admirably. The syntax is simple and clean. The basic data structures are sequences (i.e., Perl arrays) and mappings (i.e., Perl hashes). YAML handles lists, arrays, and hierarchies easily; with a bit of extra work, it can handle arbitrary Perl data structures (e.g., cyclic graphs).

Here is the previous example, transliterated into YAML:

# This is a sample of file-system metadata.
snap:
  file:  'ASCII text'
  flags: 'f,avbstclinmed,,'
  lstat: '303507,33200,1,1000,20,914,8'
  md5:   '2e1240f444fc3f984186fc5a4fd28eb0'
  times: '1040087752,230,1740993,10890145'

A more idiomatic rendering, however, would look like:

# This is a sample of file-system metadata.
snap:
  file:  ASCII text
  flags: [ f, avbstclinmed, , ]
  lstat: [ 303507, 33200, 1, 1000, 20, 914, 8 ]
  md5:   2e1240f444fc3f984186fc5a4fd28eb0
  times: [ 1040087752, 230, 1740993, 10890145 ]

Aside from the fact that some spaces have been added after commas and the quotes have been eliminated (some turned into brackets), the second version looks very similar to the first. The resulting data structure is quite different, however; the bracketed lists have been turned into YAML sequences. This means that they don't have to be parsed in a follow-on step. Here is some access code, in Perl:

$file = $yaml{snap}{file};
$uid  = $yaml{snap}{lstat}[3];

YAML has several ways to write textual data. Here are some examples:

  - a simple text item
  - "double-quoted text\n "
  - 'single-quoted text'
- >
    This text
    is freeform.
- |
    This text
    isn't.

Although YAML has nowhere near the amount of documentation that XML has, there are some useful resources to recommend. The YAML web site (www.yaml.org) is the logical place to start; be sure to visit the YAML wiki. I'd also recommend a look at "Look Ma, No Tags" (Kendall Clark Grant; www.xml.com) for an informal introduction.


Rich Morin has been using computers since 1970, Unix since 1983, and Mac-based Unix since 1986 (when he helped Apple create A/UX 1.0). When he isn't writing this column, Rich runs Prime Time Freeware (www.ptf.com), a publisher of books and CD-ROMs for the Free and Open Source software community. Feel free to write to Rich at rdm@ptf.com.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Apple iTunes 12.2 - Play Apple Music...
Apple iTunes lets you organize and stream Apple Music, download and watch video and listen to Podcasts. It can automatically download new music, app, and book purchases across all your devices and... Read more
Apple Security Update 2015-005 - For OS...
Apple Security Update 2015-005 is recommended for all users and improves the security of OS X. For detailed information about the security content of this update, please visit: http://support.apple.... Read more
Apple HP Printer Drivers 3.1 - For OS X...
Apple HP Printer Drivers includes the latest HP printing and scanning software for OS X Lion or later. For information about supported printer models, see this page. Version 3.1: The latest printing... Read more
Epson Printer Drivers 3.1 - For OS X 10....
Epson Printer Drivers installs the latest software for your EPSON printer or scanner for OS X Yosemite, OS X Mavericks, OS X Mountain Lion, and OS X Lion. For more information about printing and... Read more
Xcode 6.4 - Integrated development envir...
Xcode provides everything developers need to create great applications for Mac, iPhone, and iPad. Xcode brings user interface design, coding, testing, and debugging into a united workflow. The Xcode... Read more
OS X Yosemite 10.10.4 - Apple's lat...
OS X Yosemite is Apple's newest operating system for Mac. An elegant design that feels entirely fresh, yet inherently familiar. The apps you use every day, enhanced with new features. And a... Read more
Dash 3.0.2 - Instant search and offline...
Dash is an API Documentation Browser and Code Snippet Manager. Dash helps you store snippets of code, as well as instantly search and browse documentation for almost any API you might use (for a full... Read more
FontExplorer X Pro 5.0 - Font management...
FontExplorer X Pro is optimized for professional use; it's the solution that gives you the power you need to manage all your fonts. Now you can more easily manage, activate and organize your... Read more
Typinator 6.6 - Speedy and reliable text...
Typinator turbo-charges your typing productivity. Type a little. Typinator does the rest. We've all faced projects that require repetitive typing tasks. With Typinator, you can store commonly used... Read more
Arq 4.12.1 - Online backup to Google Dri...
Arq is super-easy online backup for the Mac. Back up to your own Google Drive storage (15GB free storage), your own Amazon Glacier ($.01/GB per month storage) or S3, or any SFTP server. Arq backs up... Read more

Hands-On With Raceline CC
Set for release soon, Rebellion’s motorbike racing game, Raceline CC certainly looks stylish. But how does it play? I got my hands on a preview build to answer exactly that. | Read more »
Siegefall - Tips, Tricks, and Strategies...
So, you fancy establishing a base and ruling the world again. Siegefall is a convenient place to do that, but how about some great tips and tricks on how best to go about it? Here are a few ideas on how to get ahead as a beginner to this medieval... | Read more »
The WWE Comes to Racing Rivals - Because...
Racing Rivals is a racing game that's all about, well, rivalry. And who knows rivalry better than WWE superstars (shhhh, that was rhetorical)? [Read more] | Read more »
Hey, Who Put Apple Music in My SoundHoun...
One of the App Store's popular music discovery sources - SoundHound - has already been updated to include Apple's own music discovery source - Apple Music. That was fast! [Read more] | Read more »
Arcane Legends has a New Expansion Calle...
Arcane Legends has been going strong since it debuted at the tail end of 2012. So well, in fact, that it's already up to its sixth expansion. [Read more] | Read more »
Vector 2 is Officially a Thing and it...
Vector is a pretty cool parkour-driven runner that's gotten a pretty decent following since it first came out - although personally I think more people could stand to show it some love. Anyway, Nekki has announced that a sequel isofficially on its... | Read more »
Get Ready to Trucksform and Roll Out (an...
It looks like NuOxygen is bringing the truck-transforming racer Trucksform (get it?) to iOS in a couple of weeks. Although really it's more of an auto-driver than a racer. But still, transforming trucks! [Read more] | Read more »
This Week at 148Apps:June 22-26, 2015
June's Summer Journey Continues With 148Apps How do you know what apps are worth your time and money? Just look to the review team at 148Apps. We sort through the chaos and find the apps you're looking for. The ones we love become Editor’s Choice,... | Read more »
LEGO® Minifigures Online (Games)
LEGO® Minifigures Online 1.0.1 Device: iOS iPhone Category: Games Price: $4.99, Version: 1.0.1 (iTunes) Description: | Read more »
World of Tanks Blitz celebrates its firs...
Today marks the first anniversary of the launch of World of Tanks Blitz, the mobile version of the PC tank battler, World of Tanks. World of Tanks Blitz launched on iOS and Android on June 26th last year and to celebrate, Wargaming is giving all... | Read more »

Price Scanner via MacPrices.net

Apple Releases OS X 10.10.4 With WIFi Fix, iO...
On Tuesday, Apple released final versions of OS X 10.10.4 and iOS 8.4, as well as updates for the Safari browser for OS X Yosemite, Mavericks, and Mountain Lion. The OS X 10.10.4 update focuses on... Read more
Dual-Band High-Gain Antennas for Home Wi-Fi N...
Linksys has announced what it claims are the first dual-band, omni-directional high-gain antennas for the consumer market. The new Linksys high-gain antennas available in a 2- and 4-pack (WRT004ANT... Read more
Apple refurbished 2014 15-inch Retina MacBook...
The Apple Store has Apple Certified Refurbished 2014 15″ 2.2GHz Retina MacBook Pros available for $1609, $390 off original MSRP. Apple’s one-year warranty is included, and shipping is free. They have... Read more
Clearance 2014 MacBook Airs available for up...
Adorama has 2014 MacBook Airs on sale for up to $301 off original MSRP including NY + NJ sales tax and free shipping: - 11″ 256GB MacBook Air: $798 $301 off original MSRP - 13″ 128GB MacBook Air: $... Read more
5K iMacs on sale for $100 off MSRP, free ship...
B&H Photo has the new 27″ 3.3GHz 5K iMac on sale for $1899.99 including free shipping plus NY tax only. Their price is $100 off MSRP. They have the 27″ 3.5GHz 5K iMac on sale for $2199, also $100... Read more
27-inch 3.2GHz iMac on sale for $1679, save $...
B&H Photo has the 27″ 3.2GHz iMac on sale for $1679.99 including free shipping plus NY sales tax only. Their price is $120 off MSRP. Read more
12-inch 1.2GHz Gray MacBook on sale for $1487...
Amazon.com has the new 12″ 1.2GHz Gray MacBook in stock and on sale for $1487 including free shipping. Their price is $102 off MSRP, and it’s the lowest price available for this model. We expect... Read more
15-inch 2.2GHz Retina MacBook Pro on sale for...
Amazon.com has the 15″ 2.2GHz Retina MacBook Pro on sale for $1819 including free shipping. Their price is $180 off MSRP, and it’s the lowest price available for this model. Read more
OtterBox Releases New Symmetry Series Metalli...
Otterbox’s new Symmetry Series of smartphone cases flaunts the best of both both street style and street smarts with their new metallic finishes and trusted OtterBox protection for iPhone 6 and... Read more
Eliminate Cable Chaos with New GE Branded Wra...
GE licensee Jasco Products has introduced a new line of GE branded Wrap-n-Charge USB wall chargers with built-in cable management. “We are always working to combine great technology with innovative... Read more

Jobs Board

*Apple* TV Live Streaming Frameworks Test En...
**Job Summary** Work and contribute towards the engineering of Apple 's state-of-the-art products involving video, audio, and graphics in Interactive Media Group (IMG) at Read more
Project Manager, WW *Apple* Fulfillment Ope...
…a senior project manager / business analyst to work within our Worldwide Apple Fulfillment Operations and the Business Process Re-engineering team. This role will work Read more
Senior Data Scientist, *Apple* Retail - Onl...
**Job Summary** Apple Retail - Online sells Apple products to customers around the world. In addition to selling Apple products with unique services such as iPad Read more
*Apple* Music Producer - Apple (United State...
**Job Summary** Apple Music seeks a Producer to help shepherd some of the most important content and editorial initiatives within the music app, with a particular focus Read more
Sr. Technical Services Consultant, *Apple*...
**Job Summary** Apple Professional Services (APS) has an opening for a senior technical position that contributes to Apple 's efforts for strategic and transactional Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.