
Mac in the Shell: Python Text Parsing

Volume Number: 25
Issue Number: 07
Column Tag: Mac in the Shell

Automating entries through keyword searching

by Edward Marczak

Introduction

We've been covering Python basics over the last several columns. This month, we'll hit something with a little practicality: text processing. While computers are really good with numbers, people are really good with words. More often than not, input from people comes as text. Turns out that Python is pretty good at dealing with text processing and manipulation. Let's have a closer look, shall we?

More To The Story

OK, there's a little bit more to the story. I've dealt with e-mail systems and e-mail processing for a very long time. (Let's just say that I started with sendmail before it used m4, mmmmkay?) Oftentimes, though, we want a program to deal with incoming mail. This may be for the purposes of a mailing list, for auto-response, or to parse the e-mail and then put the relevant bits into a database.

E-mail is either really complex or really simple, depending on how you look at it. It's complex because it's got headers and encoding and parts. But it's simple, because it's all text. No matter what all of the pieces are, they're all just human-readable text. Fortunately, there are many pre-built libraries that help deal with the complexity, allowing you, the script writer, to focus on the task at hand: processing the parts of the message body that you're interested in. Python's "batteries included" philosophy ensures that a good mail processing library ships as part of the core package.

How is any of this Mac-specific? Well, it isn't. Not directly. However, I just mentioned that Python has, by default and with no extra installation required, a good e-mail processing library. Python ships standard with OS X. That's part of the equation solved. Then, there's the issue of receiving the mail in the first place.

Just about every contemporary mail system has a method of taking incoming mail and feeding it to a script. Postfix, which ships with OS X, is no exception. By default, an SMTP (Simple Mail Transfer Protocol) server wants to receive mail, decide if the mail is for a valid user on its system, and then drop that mail in the user's mailbox. That's it. But what about a list server? Well, you take the same SMTP server, but instead of delivering any mail to an end user's mailbox, you hand all mail off to the list server program. The list server program will determine who to deliver this mail to.

This is also similar to server-side anti-spam. All incoming mail is handed off to an anti-spam program. The mail is analyzed, potentially acted upon (read: dropped), and mail is then fed back into the SMTP server for final delivery.

We're not going to do anything so grand here today, but after finishing up, you'll have the groundwork. If you have an OS X machine acting as a mail relay and really want to test/use this, you're going to need to modify some postfix config files directly.

In /etc/postfix/transport, you'll need to first define a transport. Let's say your main mail server is called mail.example.com. If you want to divert mail to a script, have the mail sent to mproc.example.com, and add the following line to /etc/postfix/transport:

mproc.example.com mproc:

This says, "send all mail that arrives for mproc.example.com to the transport named mproc." Once a transport is defined, we also need to tell postfix how to connect the dots between the transport named mproc and our script. That happens in /etc/postfix/master.cf. Add the following entry to the end of the file:

mproc  unix  -       n       n       -       -       pipe
  flags=DRhu user=mproc argv=/usr/bin/mproc.py

This tells postfix that any mail arriving on the mproc transport should be piped to the mproc.py script. This is, of course, assuming that we store our script in /usr/bin as "mproc.py". Adjust as needed.

Of course, we're going to keep it simple: since the text will be piped into the script, it's easy to simulate. The pipe simply delivers the entire message on stdin.

A Text Processing Script

Again, we said that we're really focusing on processing e-mail as it arrives, so, we're going to look for input via stdin (which the pipe above does for us). Other text processing scripts may want to deal with text already in a file or elsewhere. I'll make sure to cover that in a future column, but that's not the goal of today's exercise. Despite 'keeping it simple,' we'll be covering a few new-to-us concepts.

Here's the assignment: currently, stock information arrives via e-mail where a dedicated person reads the mail and inputs the entries into a database. This person could clearly be doing better things, as this can be automated without changing the backend system that is sending the e-mail message (whether that's a person or a machine is immaterial for this article). These messages will have a strict format: category and value, separated by a colon. The body of a message would look like this:

Company: Cartier

Product: Watch

Model: Original Tank

Number: 12324A332

Price: $4,500

Available: Yes

However, there's a problem when parsing an e-mail message: it's never just the body that you receive. It's headers. And MIME parts. Oy. Fortunately, Python's email library has functions to deal with this.

I say, let's dive right in. Here's the code I'm using, which will be followed by an explanation of the program.

Listing 1: e-mail parsing program, epp.py

#!/usr/bin/env python
import email
import re
import sys
from email.Parser import Parser
# The keywords we're looking for
keys = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']
# Compile each keyword into a regular expression
keysre = {}
for i in keys:
  keysre[i] = re.compile(i)
# Read stdin into a single string
mystdin = sys.stdin.read()
# Create a parser object and parse the input
p = Parser()
ps = p.parsestr(mystdin)
# Examine each message part for an appropriate plain body
for i in ps.walk():
  if i.get_content_subtype() != "plain":
    continue
  plainbody = i.as_string()
# Break message into lines, based on newline char
plainbody = plainbody.splitlines()
for i in plainbody:
  # Look at each key for a match.
  for k in keys:
    if keysre[k].match(i):
      print i
sys.exit(0)

The first thing to notice about the code is its relative brevity: 37 lines in total. As usual, the first few lines simply get us set up: she-bang line and relevant imports, including the Python-supplied email module.

#!/usr/bin/env python
import email
import re
import sys
from email.Parser import Parser

There have been a few times in this column that I've mentioned the importance of regular expressions (RE). Python has good support for RE from the re module:

# The keywords we're looking for
keys = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']
# Compile each keyword into a regular expression
keysre = {}
for i in keys:
  keysre[i] = re.compile(i)

What is happening here is that we define a list of the keywords we're going to be looking for in the message body. Python regular expressions are compiled into pattern objects before use, and we keep ours in the keysre dictionary. Of course, we could define these objects one at a time, but that's really inelegant and doesn't scale. In the loop, the dictionary is filled with keys that correspond to the words we're going to match, each with a value of the compiled RE object.
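One detail worth seeing in action: a pattern object's match() method only matches at the beginning of the string, so a bare keyword pattern also matches a line such as "Companywide memo". The following is a minimal sketch, written for Python 3, of a slightly stricter variant. The colon-and-capture pattern and the match_line helper are illustrative assumptions, not part of Listing 1.

```python
import re

# The same keyword list used in Listing 1.
keys = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']

# Build the keyword -> compiled pattern dictionary in one expression.
# Assumption: a keyword must be followed by a colon, and the rest of the
# line is captured as the value.
keysre = {k: re.compile(r'%s:\s*(.*)' % re.escape(k)) for k in keys}


def match_line(line):
    """Return (keyword, value) if the line starts with 'keyword:', else None."""
    for k in keys:
        m = keysre[k].match(line)  # match() anchors at the start of the line
        if m:
            return k, m.group(1).strip()
    return None


print(match_line('Company: Cartier'))   # ('Company', 'Cartier')
print(match_line('Companywide memo'))   # None
```

Requiring the colon keeps prose that merely begins with a keyword from being mistaken for an entry.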

# Read stdin into a single string
mystdin = sys.stdin.read()
# Create a parser object and parse the input
p = Parser()
ps = p.parsestr(mystdin)

The first part of this section is pretty simple: assign all of stdin to the variable mystdin. Part of the email library is the email parser object. This object allows an e-mail message (headers, MIME parts and all) to be parsed, iterated over and picked apart. We're defining a new parser object and then loading the variable ps with a parsed version of the message that's arriving on stdin.
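To get a feel for what the parsed object offers, here is a small sketch for Python 3, where the module is spelled email.parser in lower case. The message text and addresses are made up for illustration only.

```python
from email.parser import Parser

# A hypothetical, minimal message used only for illustration.
raw = (
    "From: stock@example.com\n"
    "To: mproc@mproc.example.com\n"
    "Subject: Stock update\n"
    "\n"
    "Company: Cartier\n"
    "Product: Watch\n"
)

msg = Parser().parsestr(raw)

# Header lookups are case-insensitive and return None when a header is absent.
print(msg['subject'])          # Stock update
print(msg['X-Missing'])        # None

# A message without MIME parts is not multipart; its payload is the body text.
print(msg.is_multipart())      # False
print(msg.get_payload())
```

The headers and the body are already separated at this point, which is most of what the rest of the script relies on.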

# Examine each message part for an appropriate plain body
for i in ps.walk():
  if i.get_content_subtype() != "plain":
    continue
  plainbody = i.as_string()

This section of the code hands us back the plain part of the message. MIME types are described in two parts, such as "text/html", and get_content_subtype() returns the second part. If the message contains several parts, we're only interested in the plain one. The conditional tests whether the subpart is not plain. If it is not plain, we continue and go back to the top of the loop. If it is plain, we fall through and assign the entire subpart, as a string, to the variable plainbody.
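Note that as_string() returns the whole subpart, including that part's own headers (such as its Content-Type line). If only the body text is wanted, one possible variation is to ask for the decoded payload instead. The sketch below assumes Python 3; the two-part test message and the first_plain_body helper are hypothetical names used for illustration.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.parser import Parser

# Build a hypothetical two-part message (plain + HTML) to stand in for real mail.
outer = MIMEMultipart('alternative')
outer['Subject'] = 'Stock update'
outer.attach(MIMEText('Company: Cartier\nProduct: Watch\n', 'plain'))
outer.attach(MIMEText('<p>Company: Cartier</p>', 'html'))

ps = Parser().parsestr(outer.as_string())


def first_plain_body(message):
    """Return the decoded text of the first text/plain part, or None."""
    for part in message.walk():
        if part.get_content_type() != 'text/plain':
            continue
        # decode=True undoes any base64/quoted-printable encoding, giving bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or 'us-ascii'
        return payload.decode(charset, errors='replace')
    return None


body = first_plain_body(ps)
print(body)
```

Returning None when no plain part exists also gives the caller something to test, rather than leaving the variable undefined.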

# Break message into lines, based on newline char
plainbody = plainbody.splitlines()

The splitlines() string method returns a list in which each element is one line of the string, split at line boundaries such as the newline character. Now, we can examine each line in turn:
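A quick illustration of the difference between splitlines() and a plain split on the newline character, using a made-up body with mixed line endings (mail often arrives with carriage return plus newline):

```python
# Hypothetical body text with one CRLF line ending and two LF line endings.
body = "Company: Cartier\r\nProduct: Watch\nModel: Original Tank\n"

lines = body.splitlines()
print(lines)
# ['Company: Cartier', 'Product: Watch', 'Model: Original Tank']

# For comparison, split('\n') keeps the carriage return and adds a trailing
# empty string for the final newline.
print(body.split('\n'))
# ['Company: Cartier\r', 'Product: Watch', 'Model: Original Tank', '']
```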

for i in plainbody:
  # Look at each key for a match.
  for k in keys:
    if keysre[k].match(i):
      print i

As we examine each line, an if statement tests for a match of our regular expressions by looping through the keysre dictionary. If there's a match, we print it out. Naturally, we can take other action here besides printing it out, such as storing it internally, comparing it to some known value or even inserting it into a database. One thing you will likely want to do is to split the matching lines into key/value pairs. The string's split method does this very nicely. For example:

key, value = i.split(':')

The argument to split is the separator to split on. In our case, we know the lines are split by the colon character and that we're expecting back two values. The split method will happily split as many times as needed. In the case where you don't know how many values to expect, you may just want to assign to a list, like so:

values = i.split(':')

From there you can work out how many values were split and returned to you, and what to do with them.
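One case to watch for is a value that itself contains a colon (a time, for instance): the two-name assignment then fails because split returns more than two pieces. A small sketch of one way around that, splitting at the first colon only, follows. The split_pair helper and the "Note" line are illustrative, not part of the message format described above.

```python
def split_pair(line):
    """Split 'Key: value' at the first colon only; return (key, value) or None."""
    key, sep, value = line.partition(':')
    if not sep:
        return None  # no colon on this line
    return key.strip(), value.strip()


print(split_pair('Price: $4,500'))         # ('Price', '$4,500')
print(split_pair('Note: ships at 10:30'))  # ('Note', 'ships at 10:30')
print(split_pair('no separator here'))     # None

# split() with a maximum of one split behaves similarly, minus the stripping.
print('Note: ships at 10:30'.split(':', 1))  # ['Note', ' ships at 10:30']
```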

Finally, we exit the program with a 'clean' exit code:

sys.exit(0)

Running the Program

If you don't happen to have any test e-mail sitting around, I've placed one on the MacTech ftp site, under this month's directory (ftp.mactech.com/src/mactech/volume25_2009/25.07.sit). If you run your own mail server, you can actually just go grab a raw message from the mail spool (your own mail, mind you!).

Since the instructions I gave in the first part allow postfix to send incoming mail through a pipe and to the application, we need a more convenient way to test. The command line makes this easy: just pipe it yourself. Don't forget to mark the program as executable:

chmod 770 epp.py

and then pipe away:

cat /path/to/mits_test_mail | ./epp.py

(or, substitute the ./ with the full path to the program, if needed). If you're using the test mail from the MacTech ftp site, you should see the output you expect: the values that we're matching on, with no headers, MIME clutter, etc. Take a look at the original test mail file to see just how much cruft is being left out.
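If the sample from the ftp site isn't at hand, a stand-in test message can be generated with the email library itself. This is only a sketch for Python 3: the addresses, the HTML alternative and the choice of the temporary directory are assumptions, and the file name simply mirrors the one used in the command above.

```python
import os
import tempfile
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# The sample entries from the message body shown earlier in the article.
FIELDS = [
    ('Company', 'Cartier'),
    ('Product', 'Watch'),
    ('Model', 'Original Tank'),
    ('Number', '12324A332'),
    ('Price', '$4,500'),
    ('Available', 'Yes'),
]


def build_test_mail():
    """Return a raw multipart message with a plain and an HTML version."""
    plain = ''.join('%s: %s\n' % pair for pair in FIELDS)
    html = '<html><body>%s</body></html>' % ''.join(
        '<p>%s: %s</p>' % pair for pair in FIELDS)
    msg = MIMEMultipart('alternative')
    msg['From'] = 'stock@example.com'        # hypothetical sender
    msg['To'] = 'mproc@mproc.example.com'    # hypothetical recipient
    msg['Subject'] = 'Stock update'
    msg.attach(MIMEText(plain, 'plain'))
    msg.attach(MIMEText(html, 'html'))
    return msg.as_string()


# Write the message somewhere harmless; adjust the path as needed.
path = os.path.join(tempfile.gettempdir(), 'mits_test_mail')
with open(path, 'w') as f:
    f.write(build_test_mail())
print(path)
```

The resulting file can then be piped into the script with cat exactly as shown above.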

Conclusion

This was a bit of a whirlwind tour of several concepts. I'd encourage you to bulk up an application like this by checking for error conditions and then taking appropriate action. Outside of that, though, it's pretty impressive how few dedicated commands are needed to process a well-formed e-mail message. The rest are really just 'nuts and bolts' features of the language.
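As a starting point for that kind of bulking up, here is one possible sketch, assuming Python 3, that collects the entries into a dictionary and complains when the message has no plain part or lacks an expected keyword. The extract_fields name and the choice of ValueError are assumptions made for illustration.

```python
from email.parser import Parser

REQUIRED = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']


def extract_fields(raw):
    """Parse a raw message and return {keyword: value}; raise ValueError on problems."""
    msg = Parser().parsestr(raw)
    body = None
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload(decode=True)  # bytes
            charset = part.get_content_charset() or 'us-ascii'
            body = payload.decode(charset, errors='replace')
            break
    if body is None:
        raise ValueError('no text/plain part found')

    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition(':')
        if sep and key.strip() in REQUIRED:
            fields[key.strip()] = value.strip()

    missing = [k for k in REQUIRED if k not in fields]
    if missing:
        raise ValueError('missing fields: ' + ', '.join(missing))
    return fields


# A hypothetical well-formed message, followed by an incomplete one.
good = (
    "Subject: Stock update\n\n"
    "Company: Cartier\nProduct: Watch\nModel: Original Tank\n"
    "Number: 12324A332\nPrice: $4,500\nAvailable: Yes\n"
)
print(extract_fields(good)['Price'])   # $4,500

try:
    extract_fields("Subject: Stock update\n\nCompany: Cartier\n")
except ValueError as err:
    print(err)   # missing fields: Product, Model, Number, Price, Available
```

In the piped script, a failure like this could be logged and turned into a non-zero exit status rather than always exiting with 0.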

Media of the month: I'd like to think that everyone has some kind of music that they like. Something that reached them, or that reminds them of some period of time. Well, growing up in New York certainly left a musical stamp on me. I just finished "No Wave" by Marc Masters, and I just loved every second of it. I remember the NY scene around that time, but was certainly too young to fully appreciate it. I don't expect everyone to fully enjoy or 'get' No Wave. But sometimes, the best way to enjoy music is by reading about it. So think of the music that inspires you and find the reading material that points out its inspiration. Thanks to Bruce Gerson for inspiring the topic this month.

Next month, we'll expand on some of the concepts covered here and dig deeper into the well that Python has to offer.


Ed Marczak is the Executive Editor of MacTech Magazine. He has written for MacTech since 2004.

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.