
- Home
- Magazine
- Conference & Seminars
- News
- Archives
- Forums
- Store
- Directory
- Editorial
- Advertising
- User/Login
- Contact



Volume Number: 17 (2001)
Issue Number: 11
Column Tag: Database Programming
by Paul Shields
Databases have become a fundamental requirement of managing almost any business. Everyone needs to store some type of data and a well designed and managed database can make tracking and manipulating this data trivial. Building simple databases is easy, especially with the GUI tools provided by most database vendors. As our databases get more complex however, we need to develop a better understanding of database design principles. Poor database design can lead to the storage of large amount of duplicate data, data integrity problems caused by the accidental deletion of critical information, and difficulties in inserting new data into the database.
Once we have a complete database design, we need tools to manage and manipulate the data and database structure. SQL (Structured Query Language) is just such a language. The advantage of SQL is that it is designed to be platform and database independent, allowing you to easily move data between different vendor products. SQL is designed from the ground up to be optimized for manipulating relational databases, thus much of its power comes from being combined with a truly relational database. Not all relational databases include SQL support though, so don't assume that its absence is an indication of a databases internal capabilities.
To better understand database designs and eventually SQL command structure, we should first consider the principles of relational database design. There is a lot of disagreement over all of the exact definition and feature set of a relational database. Here we define some of the most basic requirements and features that are common to most relational databases.
In over-simplified terms, a relational database is a database where all data visible to the user is organized strictly as tables of data values and all database operations work on those tables. This definition serves as an introduction to several terms common in the database world.
Column/Entity
A column is a collection of values with a common data type and definition. The range of values associated with a single column is known as the domain. There are two components to the domain, the physical definition and logical constraints. The physical definition states the generic type of data such as an integer, or string. The logical constraints are arbitrary constraints imposed by the database designer such as a valid integer between 1 and 100.
If we were to create a table to store the state component of a person's address, the domain would be a string that matches any of the 50 states.
Tables
A table is a collection of columns brought together based on some common grouping. A table can contain one or more columns. Each table in a database is assigned a unique name so that it can be referenced using SQL or other scripting tools. The rules associated with a table are as follows:
Listing 1: Simple Table Definition
FirstName Lastname Mary Jones John Smith Larry Stone
Primary key
Since the rows have no specific order, you cannot refer to them by location (first, last, third, etc.) This creates a problem if we want to reference a specific row of the table. In a well-designed database, every table has a column or columns marked as the primary key. The primary key is a special designation that ensures that no two rows have the same set of values. We can use the value stored in the primary key to find and extract the data from a specific row of the table
Primary keys come in many forms. When working with a table that describes people (such as an employee or customer table), we most often assign a unique employee or customer number. Our table would now look like the one in Listing 2:
Listing 2: Simple Table with a Primary Key
EmployeeID FirstName Lastname 12 Mary Jones 2 John Smith 34 Larry Stone
Relationships
In a relational database there is no ability to define explicit pointers, creating a parent/child relationship. Instead, relationships are defined by the sharing of common data. The data values in one column of Table 1 are used as the domain for the values in a column in Table 2. The value in Table 2 can now be used to reference information from Table 1.
Looking at the tables in listing 3 we can see that in the students table we store a list of students and the activities in which they participate. We can use the value in the Activity field of the student table to look up information on the activity, such as cost, in the Activities table. The data is still stored in two locations, we have simply defined a relation between the tables that says the Activity column in the students table relates to the Activity column in the Activities table.
Listing 3: Two tables with a defined relationship
Here we have two tables; one holds student information and the other information on activities at school. Table 1: Students StudentID FirstName LastName Activity Table 2: Activities Activity Cost Location
Foreign key
The definition of a foreign key is a combination of relations and primary keys. If the values in a column from one table match the values of a primary key column in another table, the first is called a foreign key reference. If we were to modify the table definition in listing 3 to note that the Activity field of the Activities table is a primary key, then the reference from the Students table would be known as a foreign key reference.
A foreign key reference creates a parent/child relationship between tables. A table can have more than one foreign key to define relationships with multiple tables. Foreign keys are used to reduce the amount of data stored in a database. We can now use the values stored in one column to look up information in another table. This reduces duplication of data across tables. Duplication can lead to issues in database management, most significant of which is data integrity, the process of ensuring that the data is valid. These issues are discussed further in the section on data normalization.
Codd's twelve rules
In a 1985 article in the magazine ComputerWorld, Ted Codd presented 12 rules that a database must obey if it is to be considered truly relational. These rules have become the basis for most definitions of relational databases. Few databases comply with every rule in the list, but they represent a good starting point for understanding the core features of a relational database.
There is an implied rule in this definition. "For any system that is claimed to be a relational database management system, that system must be able to manage data entirely through its relational capabilities." No currently shipping commercial database claims to meet all 12 rules, yet we can safely assume they are relational and can satisfy our needs.
The discussion of relations and foreign keys leads directly to the next major topic in database design, normalization. Normalization is the process of eliminating duplicates, validating data integrity, and defining relationships between tables. There are several levels to database normalization; each one designed to solve a different set of problems. It is absolutely critical that you normalize your database design before creating any tables as it is very difficult to change a database structure once filled with data.
When designing a database, the first pass design is most likely a collection of columns/entities based on existing forms or processes. This leaves us with a collection of tables and columns whose relationships may be poorly defined. Poorly defined relationships can lead to modification anomalies and data integrity issues. The process of normalization is designed to eliminate such anomalies.
For an example of a relationship with modification anomalies, refer to Listing 4. Here we define a student-activity table designed to track student participation in various school activities along with the fees associated with the activity. The problem with this table is that when we insert data into the table we must update two different relations, the student-activity relation and the activity-cost relation. If we delete a student, we not only delete the activity associated with the student, but potentially the information on how much the activity cost. This is known as a deletion anomaly.
Listing 4: Student activity information
A table to track student/activity participation, along with the fee associated with each activity.
Primary key: (StudentID, Activity)
Table 1: Student-Activity
StudentID Activity Fee
Jogging $100
Theatre $150
300 Skiing $300
Suppose we want to add in the cost of a new activity such as canoeing that costs $75. We cannot add this information unless we also have a student who wants to participate in the activity. This is called an insertion anomaly.
Normalization is the process of redefining these relationships to eliminate the modification anomalies. There are several levels of normalization that we will cover in detail, along with a few derivative techniques based on the mathematics of relations (well beyond the scope of this article). Each level of normalization eliminates a specific problem. One final issue to consider before diving into normalization is that every time we break a relationship into multiple components, we may have to introduce relational integrity constraints. Such constraints may impose limitations on our ability to insert data into a specific table. These restraints can be part of the underlying data definition or managed at the application level.
Another term to be familiar with is functional dependency. Functional dependency is a relationship between or among entities (columns) of a database. If we have two entities, x and y, y is functionally dependent on x, if the value of x determines the value of y. The common notation for a functional dependency is:
x->y
In this case, x is known as the determinant, because x determines the value of y. Functional dependencies can exist between group of entities. For example,
(x,y)->z z->(y,z)
In the first example, we cannot subdivide the functional dependency because neither x nor y can determine the value of z on their own. In the latter example we can subdivide the functional dependency into x->y and x->z since x determines the value of both y and z.
First Normal Form (1NF)
Any table of data that meets the definition of a relation is considered to be in 1NF. Remember the definition of a relation is:
This definition covers just about any grouping of data columns that includes a primary key to ensure uniqueness of the rows.
Second Normal Form (2NF)
A relation is in second normal form if all non-key entities are dependent on all key entities. If a table has only a single column as the primary key, it is by definition already in 2NF. This is because all of the other columns are dependent upon the single key column. For those tables with composite keys, we need to ensure that there are no partial dependencies.
Consider the table in listing 4. The column that contains the fee information is dependent upon the composite key of StudentID and Activity. The reality though is that the fee is only partially dependent on the key. The activity is what determines the fee each student pays. There is no dependency between the fee and the StudentID. This is called a partial dependency and 2NF eliminates such anomalies.
The easiest way to solve a problem like this is to create a second relation. In this case, we create a second table to store the Activity-Fee information (listing 5).
Listing 5: Student activity information as two relations
Table 1: Student-Activity StudentID(key) Activity 100 Jogging 200 Theatre 300 Skiing Table 2: Activity Fee Activity(key) Fee Jogging $100 Theatre $150 Skiing $300
Third Normal Form (3NF)
To understand 3NF, consider the example in listing 6, a table for storing student housing information. In this relation we have a functional dependency of StudentID->Building because a student can only live in one place at a time. We also have a functional dependency of Building->Cost since each dorm room has a fixed cost based on the building. This creates a transitive dependency of StudentID->Building->Cost, whereby StudentID determine the value in the Cost column.
Listing 6: Student housing information
A table containing student housing information. Primary key: StudentID Table 1: Student-Housing StudentID Building Cost 100 Crockett Hall $2100 200 Davis Hall $2150 300 Hall Hall $1300
The table in listing 6 is in 2NF, but still has problems. If we were to delete an entry, we not only delete the information for that student, but information on how much it cost to live in a particular dorm. For example if we delete the first row, we lose the information on how much Crockett Hall cost because there is no other student in the table living in that dorm. This is a case of the deletion anomaly mentioned earlier.
A relation is in third normal form if it is in second normal form and has no transitive dependencies. Again, we can address this issue by breaking the relation into two relations as shown in listing 7.
Listing 7: Student activity information as two relations
Table 1: Student-Building StudentID(key) Building 100 Crockett Hall 200 Davis Hall 300 Hall Hall Table 2: Building-Cost Building(key) Cost Crockett Hall $2100 Davis Hall $2150 Hall Hall $1300
Boyce-Codd Normal Form (BCNF)
While 3NF eliminate most anomalies, it does address them all. Consider the example in listing 8. Assume that a student can have more than one major, a major can have one or more advisors, and that an advisor can advise on only one major. StudentID cannot be the sole key because a student can have multiple majors and thus multiple advisors.
We can use a combination of either StudentID and Major to determine the advisor or a combination of StudentID and Advisor to determine the major. Thus, we have two sets of candidate keys, one of which we will select as the primary key. Since the advisor name aslo determines the major (a functional dependency) it is known as a determinant. While this table meets all the requirements of 1NF, 2NF, and 3NF it still has a deletion anomaly. If we delete a student, we risk losing information on which advisors cover specific majors. Again, we can solve this problem by breaking the relation into two, separating out the Student-Major information from the Major-Advisor information.
Listing 8: Student advisor information
A table that defines a relation between students, majors, and advisors. Primary key: (StudentID, Major) Candidate Key: (StudentID, Advisor) Table 1: Advisors StudentID Major Advisor 100 Chemistry Smith 200 Art Jones 300 History Johnson
Like in the other cases, we can solve this problem by breaking the relation into two components. We could create one table to hold the student-advisor relation and another to hold the advisor-major relation.
Fourth normal form (4NF)
4NF addresses issues associated with tracking two one-to-many relations within a single table as shown in listing 9. The table defines a relation that stores StudentID, major, and activity. A student can have one or more major and participate in one or more activities. Here, we store duplicate information on student 100 because he participates in two activities but has only one major. Again, the solution is to break the relation into two as show in listing 10.
Listing 9: Student information
A table to store student major and activity information. Primary key: StudentID Table 1: Students StudentID Major Activity 100 Chemistry Skiing 100 Chemistry Theatre 200 Art Swimming 300 History Bowling
Listing 10: Student information
A table to store student major and activity information. Table 1: Majors Primary key: StudentID StudentID Major 100 Chemistry 200 Art 300 History Table 1: Majors Primary key: StudentID StudentID Activity 100 Skiing 100 Theatre 200 Swimming 300 Bowling
Once in 4NF, most database designers can feel comfortable in stopping, knowing that they have resolved all deletion and insertion anomalies. Additional normal form definition exist, but most are based on mathematical theories and do little to address additional known anomalies. The most prevalent of these is Domain Key Normal Form (DKNF), proposed by R. Fagin in 1981. By definition any relation in DKNF has no anomalies and a relation with no anomalies is in DKNF.
Now that we have a fundamental understanding of relational databases, let's take a look at SQL (Structured Query Language), the standard language used for building and accessing relational databases. While there are alternatives to SQL and implementing SQL is not a core requirement of a relational database, it is a very common feature.
SQL is a tool for organizing, managing, and retrieving data from a computer database. The SQL language is essentially a programming language for relational databases. SQL is independent of the underlying database structure.
The flow of information between the user and the database using SQL is similar to that illustrated in Figure 1.

The first concept to understand is the role of the database management system (DBMS). The DBMS handles the management of the underlying database, handling all communications between the user and the database. Requests for data are sent to the DBMS (as are all database management calls) where they are processed and turned into actual read and writes to the database engine. This makes the DBMS moderately independent of the underlying database engine and often they are run on separate machines for performance reasons.
The process for processing an SQL command is:
How SQL manages to accomplish this is based on two principles. First, the SQL command language is broad in scope. Second, SQL draws much of its power from the underlying definition of relational databases.
Remember that one of the fundamental rules of a relational database is the Information rule. The Information rule stipulates that all information regarding the structure of the database must be represented as values in tables. This means that just like the regular information in a database, we can query and change the values associated with the database structure. SQL is indifferent to the fact that we are changing the values in the tables that control the structure of the database or those that represent end-user data. This allows us to use the same commands for database management as we use to query end-user data.
SQL is not, however, a complete programming language. First, it lacks conditional tests (IF) and flow control (GOTO, DO, and FOR) statements. Some database vendors may offer extensions to the SQL language to accomplish these functions, but they are not part of the SQL92 standard. SQL can however be integrated into other programming languages. Finally, despite the name, SQL lacks the explicit structure of languages like C or Pascal
The primary advantages of SQL are:
Apple provides a version of MySQL with OS X Server (available through the Software Update utility), but users running OS X client are on their own. Before jumping into deploying a whole database server, many users prefer to install a development version on a desktop machine. There is nothing inherent in MySQL or OS X client that keep the two from working together.
MySQL version 3.23.40 is the latest version of MySQL commonly available for OS X. There are a number of different installers available on the web, some are just a basic install of MySQL, while others include GUI management tools. For now, we will stick with the out-of-the-box, non-GUI installs, saving the GUI tools for a later date. This keeps the installation simple, but does require that you use the Mac OS X command-line for administration.
Before installing MySQL on an OS X client machine, I highly recommend first enabling the ‘root' user. The MySQL installer that creates the GRANT tables (the GRANT tables hold all the security information), seems ot detect whether or not the root user is enabled. If the system root user is not enabled, the MySQL ‘root' seems to be established in a state that prcents you from using the account properly. Without a fully enabled ‘root' user, you cannot administer the database. To enable the root user under OS X follow these steps:
One of the best installers is put together by Marc Liyanage and is available at http://www.entropy.ch/software/macosx/. This is a basic port of the MySQL code, packaged in an easy to use Mac OS X installer.
The first step is to create a user account for the MySQL install. This account is used so that the MySQL server application is run independent of any active user or the root account, thus making it more secure. To create the account:
You are now completely ready to install MySQL.
At this point you have MySQL installed and running. If you restart your machine, you will need to rerun the command from step 5. Alternatively, you can download an alternate installer that automatically configures MySQL to run at startup.
The default database is called ‘test' and is completely unsecured. Security of the internal database structure is separate from the OS-level security. There are a few things that you should do if your database is hosted on a server with public Internet access.
The mysql_install_db script initializes the grant tables to contain the following set of privileges:
Other privileges are denied.
Because your installation is initially wide open, one of the first things you should do is specify a password for the MySQL root user. You can do this using:
[localhost:/usr/local/bin] pshields% mysql -u root mysql mysql> UPDATE user SET Password=PASSWORD(‘new_password') WHERE user='root'; mysql> FLUSH PRIVILEGES;
You can also use the SET PASSWORD statement:
[localhost:/usr/local/bin] pshields% mysql -u root mysql mysql> SET PASSWORD FOR root=PASSWORD(‘new_password');
Only users with write/update access to the mysql database can change the password for others users. All normal users (not anonymous ones) can only change their own password.
If you update the password in the user table directly using the first method, you must tell the server to re-read the grant tables (with FLUSH PRIVILEGES).
You may wish to leave the root password blank so that you don't need to specify it while you perform additional setup or testing. However, be sure to set it before using your installation for any real production work.
If you run into problems, you can recreate the GRANT tables from scratch. To re-create the grant tables, remove all the `.frm', `.MYI', and `.MYD' files in the directory ‘/usr/local/var/mysql'. Then run the mysql_install_db script referenced in the installation steps.
Refer to the MySQL documentation http://www.mysql.com/documentation/mysql/ for more detailed information on securing MySQL and creating users.
This section provides a very cursory look at creating tables and inserting data. Once you have gone through the process of building your tables and normalizing them on paper, you can use SQL to create them in the database. All these examples are built using the test database. The MySQL documentation can provide more assistance with creating a new database that is customized to meet your needs.
To create a simple table of user names:
mysql> create table users ( firstname varchar(15), lastname varchar(15));
Notice a few things about the structure of the command. First, every command must be terminated with a semicolon (;). This means that a command can span multiple lines of input. For example you might enter the previous command like this:
mysql> create table users (
->firstname varchar(15),
->lastname varchar(15) );
In this command we create a table with the name "users". Table and column names are case-sensitive. Table names must be unique while column names within a table are unique but can be re-used between tables. Finally, the datatype we used to store the name information is a varchar of length 15. This is equivalent to specifying an ASCII string of maximum length 15 characters. There are several datatypes in the SQL language including integer, float, char, varchar, double, date, and money among others.
Once the table is created, we can begin to insert data. To insert data into the users table, use the command:
mysql> insert into users (firstname, lastname) values (‘John', ‘Smith') mysql> insert into users values (‘John', ‘Smith')
When inserting data, we specify the table name, along with a parenthetical list of columns for which data will be provided. For each column in the first part of the command, we must specify a value to store. Since our ‘users' tables uses varchars (strings), we need to put the values in quotes. The second version of the command has the column names omitted. We can do this when we will be providing data for every column in the table as part of the insert. The values for the insert must be listed in the order the columns were created originally if you don't specify the column names.
Finally, if you want to see the a listing of the column names in a table, you can use the command:
mysql> desc users; +— — — -+— — — — — — -+— — — +— — -+— — — — -+— — — -+ | Field | Type | Null | Key | Default | Extra | +— — — -+— — — — — — -+— — — +— — -+— — — — -+— — — -+ | first | varchar(15) | YES | | NULL | | | last | varchar(15) | YES | | NULL | | +— — — -+— — — — — — -+— — — +— — -+— — — — -+— — — -+ 2 rows in set (0.03 sec)
Listing 11 demonstrates a more complex table creation command that includes defining a primary key and setting the properties of the SID column so that a user cannot insert a row without providing a value for SID. One idiosyncrasy of MySQL is that fields defined as ‘NOT NULL' will have a default value assigned unless otherwise specified. Thus the table defined in listing 11 will actually allow for one row entry without a SID specified, assigning the row a value of 0 for the SID, when a second row is inserted without specifying a SID, the user gets an error message indicating there is a duplicate entry. There is no easy way around this within MySQL because of the internal logic used to store the table definition. It is a minor issue though that you will need to be aware of and is an example of how different databases interpret the same SQL code and definitions.
Listing 11: Example Create with NOT NULL and PRIMARY KEY set
mysql> create table students (
-> SID integer not null,
-> firstname varchar(20),
-> lastname varchar(20),
-> primary key (SID));
Query OK, 0 rows affected (0.01 sec)
mysql> desc students;
+— — — — — -+— — — — — — -+— — — + — — +— — — — -+— — — -+
| Field | Type | Null | Key | Default | Extra |
+— — — — — -+— — — — — — -+— — — + — — +— — — — -+— — — -+
| SID | int(11) | | PRI | 0 | |
| firstname | varchar(20) | YES | | NULL | |
| lastname | varchar(20) | YES | | NULL | |
+— — — — — -+— — — — — — -+— — — + — — +— — — — -+— — — -+
3 rows in set (0.02 sec)
mysql> insert into students (firstname, lastname) values (‘john', ‘smith');
Query OK, 1 row affected (0.03 sec)
mysql> insert into students (firstname, lastname) values (‘john', ‘smith');
ERROR 1062: Duplicate entry ‘0' for key 1
Every time you run a command, MySQL provides a summary of how long the command took to execute. This is useful later on when building more complex queries. One thing you have to be careful with when writing SQL, is building queries that don't consumer significant amounts of CPU time. If you build such a query and make it available to many users, their simultaneous use of the command could bring your database to its knees.
Designing relational database can be a complex undertaking. The process of ensuring that your data is properly normalized while at times tedious is critical to the long-term success of your database implementation. The best practice is to proceed in small steps, making one change at a time, validating the change in structure and moving on to the next item.
Once the database design is done, start looking at how to implement the structure within the tools you selected. While most modern relational databases offer SQL as a primary programming tool, other offer proprietary languages and GUI tools both designed to offer enhanced functionality or ease of use.
MySQL is just one of many Relational Database Management Systems (RDBMS) for the Mac platform. It has the primary advantage of being free, but has some limitations when compared to the commercial offerings. Other products in the RDBMS field are 4D from ACI, FileMaker Pro, and FrontBase. Unfortunately, high-end systems like Oracle are not currently available for the Mac OS X platform, but you can use tools like WebObjects to connect to an Oracle database running on another platform.
Paul Shields is currently an Advisor and Project Manager for a major telecommunications firm in Dallas, TX. In his role Paul selects and implements the critical technologies that comprise the backbone of the computing infrastructure. Paul is also the Senior Editor for AppleLinks
http://www.applelinks.com/ and The Business Mac
http://www.thebusinessmac.com/. Feel free to forward any questions or comments to him at pshields@applelinks.com.




