TweetFollow Us on Twitter

The R statistics package

Volume Number: 23 (2007)
Issue Number: 01
Column Tag: Statistics

The R statistics package

What is R and how you can use it.

By Mihalis Tsoukalos

Introduction

The R language of statistical computing is a free implementation of the S language, also used for statistical computing. Rick Becker, John Chambers and Allan Wilks developed S at the famous AT&T Bell Labs. The commercial version of S is called S-PLUS and the problem with S-PLUS is that it is very expensive.

Version 1.00 of R was first released on February 2000 and the latest version of R, at the time of writing this article, is 2.4.0. Version 2.3.1 is used for the purposes of this article.

R and S-PLUS can be used for statistical analysis and graphics. Put simply, you insert datasets that you want to analyze and visualize in creative ways.

What, a statistics package in MacTech?

Well, you may wonder what is a statistics package doing in MacTech so I think that I have to explain some things to you. First, let me tell you that statistics are not that difficult in all of their aspects -you can use a small division of statistics that are incredibly simple. Second, I should add that statistics can be very useful for systems administration purposes including Mac OS X. Last, you should know that statistics are particularly useful when you want to generate a report for your boss that usually does not understand technical information very well.

If you still feel uncomfortable with statistics, please have in mind that this article is not going to use higher-level statistics. What will be used are straightforward statistical methods and some R commands that generate a lot of handy and impressive graphical images.

Introducing R

The good news is that there is a Macintosh version of R that can either run as a console or a graphical application. R can also run on Windows as well as other UNIX machines.

Figure 1 shows the console version of R whereas figure 2 shows the graphical version of R. In order to run the console version you just have to type R provided that the directory of the R command is included in your PATH variable.


Figure 1: Running R from the console

As you can see both versions of R have a similar text window that you enter your commands. Nevertheless, the GUI version is more elegant and offers more options. By typing q() -this works for both versions- you can quit R. Additionally, you can quit the GUI version from its usual Mac-related menus.


Figure 2: The R GUI

R can also be used as a simple calculator. The following examples illustrate it:

> 1 + 5
[1] 6
> abs(-1.4)
[1] 1.4

R can also be used in a batch mode -only the command line version of it- where you store the desired commands in a file and execute them from the command line or a cron job. The following commands demonstrate it:

$ cat example.R
1 + 1;
5 - 8;
$ R CMD BATCH example.R
$ cat example.Rout 
R : Copyright 2006, The R Foundation for Statistical Computing
Version 2.3.1 (2006-06-01)
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
  Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[Previously saved workspace restored]
> invisible(options(echo = TRUE))
> 1 + 1;
[1] 2
> 5 - 8;
[1] -3
> 
> proc.time()
[1] 0.634 0.169 0.818 0.000 0.000
>
$

As you can see, the output for the example.R batch file is stored in a file called example.Rout. If the batch executed commands generate any graphics files, those graphics files should have been created as well.

Learning more about R and Statistics

One of the most important things that you have to learn is how to insert external data inside R. This can be made using the read.table() command. The following example shows how to use it:

$ cat TEST.data 
Name       Salary  Age
Mike       100000  25
Eugenia    200000  22
John       125000  26
PIK        250000  38
Antonis    180000  30
$ R
...
> SAL <- read.table("TEST.data", header=TRUE)
> SAL
     Name    Salary Age
1    Mike    100000  25
2 Eugenia    200000  22
3    John    125000  26
4     PIK    250000  38
5 Antonis    180000  30
>

In this example, a table was saved in a text file called TEST.data and loaded into R. Notice the header=TRUE parameter that tells that the first line of the TEST.data file is the header row of the table and therefore should be treated in a different way. Also notice that the SAL variable holds the whole table.

Imagine that you want to learn some information about your SAL data. The summary() command can be used as follows:

> summary(SAL)
      Name       Salary            Age      
 Antonis:1   Min.   :100000   Min.   :22.0  
 Eugenia:1   1st Qu.:125000   1st Qu.:25.0  
 John   :1   Median :180000   Median :26.0  
 Mike   :1   Mean   :171000   Mean   :28.2  
 PIK    :1   3rd Qu.:200000   3rd Qu.:30.0  
             Max.   :250000   Max.   :38.0  
>

As you will understand, this is a great way to summarize your data. Now, let us explain the output.

The Name column does not contain numbers so R sums the occurrences (considering each value as a string) of each "string" and prints the top numbers. As far as Salary and Age columns are concerned, which are both numeric, R calculates and displays the following six values:

- Min.: This is the minimum value of the whole data set.

- Median: It is an element that divides the data set into two subsets (left and right subsets) with the same number of elements. If the data set has an odd number of elements, then the Median is part of the data set. If the data set has an even number of elements, then the Median is the mean value of the two center elements of the data set.

- 1st Qu.: The 1st Quartile (q1) is a value, that does not necessarily belong to the data set, with the property that at most 25% of the data set values are smaller than q1 and at most 75% of the data set values are bigger than q1. Simplistically, you can consider it as the Median value of the left half subset of the sorted data set.

In the case that the number of elements of the data set is such that q1 does not belong to the data set, it is produced by interpolating the two values at the left (v) and the right (w) of its position to the sorted data set as:

q1 =  0.75 * v + 0.25 * w

- Mean: This is the mean value of the data set (the sum of all values divided by the number of the items in the data set).

- 3rd Qu.: The 3rd Quartile (q3) is a value, not necessarily belonging to the data set, with the property that at most 75% of the data set values are smaller than q3 and at most 25% of the data set values are bigger than q3. Put simply, you can consider the 3rd Quartile as the Median of the right half subset of the sorted data set.

In the case that the number of elements of the data set is such that q3 does not belong to the data set, it is produced by interpolation of the two values at the left (v) and the right (w) of its position to the sorted data set as:

q3 =  0.25 * v + 0.75 * w

- Max.: This is the maximum value found in the data set.

Please note that there exist many practices for finding Quartiles. In you try another statistical package, you may get slightly different results.

Creating Graphics with R

In the main part of the article I am going to tell you how to generate creative graphics with R. R has amazing graphical capabilities. Please look at the Bibliography and References section for more information. Also, the presented examples are real examples, using real data.

WWW server example

For this example, I used some old web server log data from a real web server. The duration of the logs is one week. Let me explain all the required steps.

First, let me show you some things about the log files using the wc command:

$ wc *.log
416041  6656584   119534721 day1.log
429039  6864552   123800090 day2.log
1185958 18975185  338653060 day3.log
1162803 18604776  330550972 day4.log
1157444 18519068  329710792 day5.log
1209289 19348537  342242234 day6.log
1078902 17262326  307343799 day7.log
6639476 106231028 1891835668 total

The wc command provides us counts of lines, words and bytes of each file. As you can see, the web log files are big as this is a very popular web server.

The log file format is the "standard" Apache "combined" format as follows:

#Fields: date time c-ip cs-username s-ip ¬
cs-method cs-uri-stem cs-uri-query sc-status sc-bytes ¬
cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)

I have now to decide which fields to use and extract from the log files. I will use the following fields:

time: the time of the request

sc-bytes: a number that shows the server to client bytes

cs-bytes: a number that shows the client to server bytes

time-taken: the time -in milliseconds- it took the web server to process the request. Please note that a 0 value may declare that the requested resource was stored in a cache memory and therefore the web server did not have to process it.

The following UNIX shell script does what we want:

$ cat WWW.sh 
#!/bin/bash
grep -v '^#' day1.log | awk '{print $2, $10, $11, $12}'¬
| sed 's/:/ /g' | awk '{print $1 ":" $2, $4, $5, $6}'

Its output, for the day1.log file, begins as follows:

00:00 137   465 0
00:00 142   471 0
00:00 13449 338 0
00:00 140   471 0
00:00 142   476 0
00:00 141   468 15
00:00 142   474 0
00:00 466   228 0
00:00 139   465 0
00:00 140   464 0

Of course, you have to change the day1.log string to fit your own filename. I did so for the rest of the web server log files. The files created are as follows (again using the output of the handy wc command):

$ wc *.data
 416033 1664132 6816604 day1.data
 429031 1716124 7026785 day2.data
1185942 4743768 19385103 day3.data
1162795 4651180 19041770 day4.data
1157440 4629760 18933110 day5.data
1209281 4837124 19748074 day6.data
1078894 4315576 17627914 day7.data
6639416 26557664 108579360 total

If you want to have header data in your files, you can do it by manually editing the output files. I put the "Time sc cs timeTaken" line at the beginning of each of the daily data files.

Now, we are finally ready to use R to process some of our data. I used the Misc, Change Working Directory (or Command-D) option to change my working directory so that I will not have to use full paths for my data files.

First, I am going to use the summary() command to overview my day1 data.

> day1 <- read.table("day1.data", header=TRUE)
> summary(day1)
      Time              sc                cs           timeTaken        
 18:05  :   775   Min.   :      0   Min.   :   0.0   Min.   :      0.0  
 17:32  :   708   1st Qu.:    141   1st Qu.: 378.0   1st Qu.:      0.0  
 12:21  :   697   Median :    142   Median : 431.0   Median :      0.0  
 17:07  :   696   Mean   :   2997   Mean   : 428.9   Mean   :    253.3  
 10:15  :   693   3rd Qu.:    842   3rd Qu.: 464.0   3rd Qu.:      0.0  
 18:15  :   687   Max.   :5686096   Max.   :2340.0   Max.   :1908734.0  
 (Other):411777                                                         
>

You can easily see the moments that were very busy: 18:05, 17:32, 12:21, 17:07, 10:15 and 18:15. You can also understand from the timeTaken column output that your web server was serving requests pretty fast (because the 3rd Qu. value is 0).

There is also a very quick way to represent a data set graphically. It can be done with the pairs(<dataset_name>) command which plots pairs of the columns in a data set. The output of the

> pairs(day1)

can be seen in figure 3. Isn't it worth every statistical definition you have read in this article?


Figure 3: the output of pairs(day1) command

The attach() command takes a data set as its argument, and lets you use the columns of the data set separately. In this example, I will use the day2 data. Also, check the objects() and search() commands that help you discover existing objects.

> day2 <- read.table("day2.data", header=TRUE)
> attach(day2)
> objects()
[1] "day2"
> search()
 [1] ".GlobalEnv"        "day2"              "tools:RGUI"       
 [4] "package:methods"   "package:stats"     "package:graphics" 
 [7] "package:grDevices" "package:utils"     "package:datasets" 
[10] "Autoloads"         "package:base"     
> objects(2)
[1] "Time"      "cs"        "sc"        "timeTaken"
>

The plot(Time) command will produce figure 4. This figure shows the total number of connections per time. It makes sense that after midnight there are less connections than the rest of the day. On the other hand, unreasonable outputs may be the cause of a network attack.


Figure 4: The output of the plot(Time) command

Last, imagine that you want to limit the output values for both x and y variables. You can do that by using the xlim and ylim parameters of the plot command. The following example shows this (the output can be seen in figure 5):

> plot(cs, sc, xlim=c(450, 500), ylim=c(450,500))

Network data example

In this example, I will use network data. As many of you already know, the usual way to capture network data is the tcpdump tool. The output of the tcpdump utility is difficult to read but there are many tools (i.e. tcpshow, ethereal/wireshark) that will help you parse it. Anyway, imagine that you have readable tcpdump output that contains the following fields:


Figure 5: Limiting the values of the output

Packet Number: column title "Packet"

Time: column title "Time"

Time Difference from Previous Packet: column title "dt"

Source Port: column title "sp"

Destination Port: column title "dp"

I used the tcpshow package which produces output that looks as follows:

-----------------------------------
Packet 171
TIME:    15:00:15.367367 (0.000190)
LINK:    00:60:97:DE:54:36 -> 00:00:0C:04:41:BC type=IP
  IP:    207.46.130.139 -> 172.16.117.52 hlen=20 TOS=00 dgramlen=40 id=003A
    MF/DF=0/1 frag=0 TTL=64 proto=TCP cksum=C797
 TCP:    port http -> 1024 seq=1274940435 ack=3183900831
    hlen=20 (data=0) UAPRSF=010000 wnd=32735 cksum=2A2F urg=0
DATA:    <No data>
-----------------------------------
Packet 172
TIME:    15:00:15.455012 (0.087645)
LINK:    00:00:0C:04:41:BC -> 00:C0:4F:A3:58:23 type=IP
  IP:    172.16.112.20 -> 192.168.1.10 hlen=20 TOS=00 dgramlen=60 id=0080
    MF/DF=0/0 frag=0 TTL=63 proto=UDP cksum=9D5A
 UDP:    port domain -> domain hdr=8 data=32
DATA:    .9..........
    hostmaster.com.....
-----------------------------------

I used a small Perl script to extract the data (TCP traffic only) that I wanted from the tcpshow command output. Remember that you may have to replace text values like http, telnet, etc., found in tcpshow output with their service numbers so that R can use it.

This time, I will also bring into play a new R package for creating graphics called lattice. The following command shows how to load the lattice package in R.

> library(lattice)

In order to get some help about the lattice package, you can type the following command:

> help(lattice)

After executing the last command inside the graphical version of R, you will get the output shown in figure 6.


Figure 6: help(lattice) graphical output

For the first example, I will use the data from the first three columns (Packet, Time, and dt) of the extracted data (C3.data file). I executed the following three commands:

> c3 <- read.table("C3.data", header=TRUE)
> attach(c3)
> c3m <- as.matrix(read.table("C3.data", header=TRUE))

You already know the first command. The second command inserts the data as a matrix (as.matrix() command) because some graphics functions that plot more than two variables will only accept data as a matrix. Do not forget to also run the library(lattice) command.

Write the following commands in a text editor. After that, copy them and paste them inside R. You will get figure 7! This example is based on an existing example that uses the volcano data set.

x <- 10*(1:nrow(c3m))
y <- 10*(1:ncol(c3m))
image(x, y, c3m, col = terrain.colors(100), axes = FALSE)
contour(x, y, c3m, add = TRUE, col = "peru")
axis(1, at = seq(100, 800, by = 100))
axis(2, at = seq(100, 600, by = 100))
box()
title(main = "c3m plot", font.main = 4)

Do not ask me about the physical meaning of that graph. If you know your data, you can tell more about this image. This is just an example about getting an idea of R capabilities.

Now, I am going to show you a more down-to-earth example. The following commands

> plot.new()
> xyplot(Packet ~ Time)
> title(main = "Packet vs Time", font.main = 4)

will plot Packet number versus Time -using data from the c3 dataset- as can be seen in Figure 8. Straight lines may


Figure 7: Plotting the c3m data set

represent complete HTTP transactions. You can see that there are moments that there is not so much TCP traffic whereas other times the TCP traffic is very high.


Figure 8: Packet vs Time plot

For the last example I will use the data from the last two columns (sp, and dp) of the extracted information (C2.data file). First, run the following commands:

> c2 <- read.table("C2.data", header=TRUE)
> attach(c2)
> c2m <- as.matrix(read.table("C2.data", header=TRUE))
> summary(c2)
       sp             dp      
 http   :2472   http   :1923  
 smtp   : 202   smtp   : 223  
 telnet :  39   telnet :  42  
 1024   :  31   1024   :  38  
 2615   :  22   1306   :  36  
 1026   :  21   4233   :  27  
 (Other):2124   (Other):2622

As you can easily understand, the summary command is very useful and meaningful this time. This traffic has many HTTP, SMPT and TELNET requests. If you are concerned about security, you may try to lower the TELNET connections and replace them with secure connections (ssh).

As this sample contains non-numeric values, there are not so many things to do. Plotting your data set is something you can do (figure 9):

> plot(c2)
> title(main = "dp vs sp plot", font.main = 4)


Figure 9: dp vs sp plot

One more thing

You may wonder by now, why did I talk about web log files for seven days although I used only two of them! Well, the answer is that I am going to use them now.

Run the following commands for each one of the seven web log data files:

head -n 1 day1.data > hour1
grep "^13:" day1.data  >> hour1

This will create seven files, each of them containing the web log data between 13:00 and 13:59, one for each weekday. Also, execute the following command:

$ wc -l hour*
  25647 hour1
  23211 hour2
  70192 hour3
  67904 hour4
  59699 hour5
  60121 hour6
  58629 hour7
 365403 total
$ wc -l hour* > hour13.data
$ head -n 7 hour13.data > hour13

Now, let us go back to R and execute the following commands:

> hours <- read.table("hour13")
> attach(hour)
> barplot(V1, angle=c(45,135), density=20, col="grey", ¬
names=c("Sunday", "Monday", "Tuesday", "Wednesday", ¬
"Thursday", "Friday", "Saturday") ) > title(main="Web server connections from 13:00 to 13:59", font=5)

The output can be see in Figure 10.


Figure 10: A bar plot

This procedure can be easily automated (and also run using cron) and therefore you can have a report of your data every day at your email account!

What else can R do?

Now that you have learned more things about R, let me briefly tell you what else can R do. Well, R can also perform the following:

1. Advanced data analysis.

2. Advanced statistics.

3. R has an object oriented programming language that you can write your own programs.

Summary

In this article you learned a lot - you learned some basic things about R, how to import data into R, and how to create graphics with R.

Those things can be very valuable for regular users and, especially, for system administrators.

Conclusions

I hope that this article did not contain too much statistics. I also hope that you have, by now, understood some of the capabilities of R. If you want to learn more about R then visit its home page, and check out some of the proposed books.

The output of R should help you prove your points to either your colleagues or your manager and get a general overview of your data.

Please let me know if you have any questions or if you want another article about R and its rich graphical capabilities.

Bibliography and References

R home page: http://www.R-project.org/

S-PLUS home page: http://www.insightful.com/products
/splus/default.asp

R and DBMSs page: http://developer.R-project.org/db/

Venables, W.N. and Ripley, B.D., Modern Applied Statistics with S, 4th Ed., Springer Verlang, 2002

Venables, W. N., Smith, D. M., An Introduction to R, Network Theory Ltd., 2002

Murrel, Paul, R Graphics, Chapman & Hall/CRC, 2006

Crawley, Michael, Statistics: An introduction using R, Wiley, 2005

Dalgaard, Peter, Introductory Statistics with R, Springer, 2004

Venables, W.N. and Ripley, B.D., S Programming, Springer, 2004

Tom Christiansen and Nathan Torkington, Perl Cookbook, 2nd Ed., O'Reilly, 2003


Mihalis Tsouklos lives in Greece with his wife Eugenia and enjoys digital photography and writing articles. You can reach him at tsoukalos@sch.gr.

 
AAPL
$562.29
Apple Inc.
-3.03
MSFT
$29.06
Microsoft Corpora
-0.01
GOOG
$591.53
Google Inc.
-12.13
MacTech Search:
Community Search:

Men in Black 3 Review
Men in Black 3 Review By Rob Rich on May 25th, 2012 Our Rating: :: WE'LL TAKE IT FROM HEREUniversal App - Designed for iPhone and iPad Gameloft delivers a surprisingly awesome free-to-play management game based on a beloved series... | Read more »
SketchBook Ink Review
SketchBook Ink Review By Lisa Caplan on May 25th, 2012 Our Rating: :: SIMPLEiPad Only App - Designed for the iPad SketchBook Ink has a welcoming interface but lacks key features   Developer: Autodesk Inc. | Read more »
Autumn Dynasty Review
Autumn Dynasty Review By Kevin Stout on May 25th, 2012 Our Rating: :: NEARLY FLAWLESSiPad Only App - Designed for the iPad Autumn Dynasty is an oriental-themed real-time strategy game.   | Read more »
Our Annual “Holy Cow It’s Memorial Day A...
So, it’s that time of year again! BBQs, lawn chairs, beer, and the ability to finally wear shorts with sandals without fear of frostbite. Tan those legs and check out all the huge sales that are going on across the App Store below. We’ll try and... | Read more »
FREEday 5/25/12 – “They Call Me FREE but...
Another week of freebies, this time with very little in the way of “Big Name” titles. No need to panic, it’s intentional. Anyone browsing the App Store will no doubt see the more popular games anyway. | Read more »
Shoot the Zombirds Review
Shoot the Zombirds Review By Kevin Stout on May 25th, 2012 Our Rating: :: ADDICTINGUniversal App - Designed for iPhone and iPad Shoot the Zombirds is an archery game where the player shoots arrows at avian zombies.   | Read more »
Apple Debuts Free App of the Week Promot...
Apple has made a couple of changes to their weekly app features that pop up in the Featured tab of the App Store. While “App of the Week” and “Game of the Week” appear to be just rebranded as “Editors’ Choice,” there’s a new feature: the Free Game... | Read more »

Price Scanner via MacPrices.net

Apple Maintains Leading Mobile Device Manufacturer...
Milennial Media says Apple continued to be the number one mobile device manufacturer on their platform in Q1, representing 28% of the top manufacturers impression share. Apple iPhone accounted for 15... Read more
Asustek To Launch Three New ZenBook Ultrabook Mode...
Digitimes’ Rebecca Kuo and Steve Shen report that PC-maker Asustek Computer will launch three new models to its ZenBook Prime Ultrabook lineup – the UX21A, UX31A and UX32VD – in June, featuring full... Read more
Yahoo! Introduces Axis Search Browser For Mobile D...
Yahoo! has announced the availability of Yahoo! Axis, a new Web browser tool that it claims will re-imagine how people search and browse on the web, Axis offering a faster, smarter search with... Read more
Android- and iOS-Powered Smartphones Expand Market...
Smartphones powered by Android and iOS mobile operating systems accounted for more than eight out of ten smartphones shipped in the first quarter of 2012 (1Q12), according to the International Data... Read more
Roundup of Memorial Day Weekend MacBook Pro sales,...
 Apple resellers have MacBook Pros on sale for up to $240 off MSRP this Holiday weekend. Here is a roundup of the best prices available from any reseller: (1) B&H Photo has MacBook Pros on sale... Read more
iPad wait times down to 1-3 days at The Apple Stor...
The Apple Store Online is now reporting a 1-3 business day wait on all iPad orders, as it appears that Apple is clearing out their backlog. The iPad is available in Wi-Fi or Wi-Fi + Cellular... Read more
Roundup of Memorial Day Weekend MacBook Air sales,...
 Apple resellers have MacBook Airs on sale for up to $101 off MSRP this Holiday weekend. Here is a roundup of the best prices available from any reseller: (1) B&H Photo has 11-inch and 13-inch... Read more
13″ 2.8GHz MacBook Pro on sale for $100 off MSRP
Adorama has lowered their price on the 13″ 2.8GHz MacBook Pro to $1399 including free shipping plus NY/NJ sales tax only. Their price is $100 off MSRP, and it’s the lowest price for this model from... Read more

Jobs Board

Help Desk-Desk-Side Support (Apple, Mac...
9001 certification. Help Desk - Desk-Side Support (Apple, Mac and PC support strongly preferred) Location: Secaucus, ... equipment. 1+ years of experience in supporting MAC desktops as well as... Read more
*Apple* Solutions Consultant-Retail Sal...
The Apple Solutions Consultant is an Apple employee who oversees the sales, merchandising, and operations of an Apple Store-in-a-Store in a single unit retail Read more
iPad/iPhone Developer at Recruitarrow (P...
Job Responsibilities and Requirements: These solutions must be aligned with business and IT strategies and comply with the organization's architectural standards. Involved in the full systems life... Read more
Mobile iphone App with API Connections t...
See requirements. Develop mobile app that interfaces to access database on webserver and infusionsoft through API. Desired Skills: iPhone, Mobile, Infusionsoft, API Read more
*Apple* Retail - Manager - Natick Colle...
Much more than just a place for amazing products, the Apple Retail Store serves a dazzling range of needs for its customers. Not only can users get hands-on experience Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.