What Does it Mean to Say R is an Object-Oriented Programming Language?


You may hear R described as an “object-oriented programming” (OOP) language, but what does this mean? OOP languages organize a program around objects, which are pieces of data that carry a type, so that existing code can be reused across many kinds of data and less new code needs to be written. In the context of R, what is primarily helpful to know is that a single generic function can be coded so that its behavior changes depending on the type of construct (be it a variable, a data frame, or some other entity) it takes as its primary argument. This works by creating objects, assigning them a class, and defining methods that operate on objects according to the class they have been assigned.

An example will illustrate. Let’s first generate a data frame consisting of three variables: income, education, and gender.

set.seed(1234)

educ<-round(runif(1000,7.5,18.49))
gender<-rbinom(1000,1,.55)
income<-exp(10+.1*educ+.05*gender+rnorm(1000,0,.25))

The first line sets the seed of the random number generator so that the results presented here can be replicated exactly. Skipping this step would cause the random number generator to produce a different set of values each time the code is run.
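As a small illustration of what the seed does, the sketch below draws three uniform numbers, resets the seed, and draws again. The object names first, second, and third are only placeholders for this example.

```r
# Same seed, same draws: resetting the seed replays the random stream.
set.seed(1234)
first <- runif(3)

set.seed(1234)
second <- runif(3)

identical(first, second)   # TRUE

# Without resetting the seed, the generator continues along the stream,
# so the next three draws are different values.
third <- runif(3)
identical(first, third)    # FALSE
```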

The next line generates a variable educ consisting of 1,000 numbers drawn from a uniform distribution ranging from 7.5 to 18.49. Wrapping the runif() function inside a round() function rounds each draw to the nearest whole number, in this case a value between 8 and 18. Think of these as the number of years spent in school.
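A quick check of this step, written as a sketch (the intermediate object draws is introduced here only for illustration):

```r
set.seed(1234)
draws <- runif(1000, 7.5, 18.49)   # continuous values between 7.5 and 18.49
educ  <- round(draws)

# Every rounded value is a whole number ...
all(educ == floor(educ))           # TRUE

# ... and rounding cannot push a value outside 8 to 18.
all(educ >= 8 & educ <= 18)        # TRUE

# The values are still stored as doubles, so the class is "numeric"
# rather than "integer".
class(educ)                        # "numeric"
```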

The gender variable is next defined to consist of 1,000 observations drawn from a binomial distribution, where each draw consists of a single coin flip with a probability that the outcome is 1 (i.e. heads) equal to .55. Think of this variable as equaling one for females and zero for males.
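The coin-flip description can be checked directly. This is a standalone sketch: because it resets the seed and draws the flips first, the values will not match the gender variable created above, where educ was drawn first. The name flips is a placeholder.

```r
set.seed(1234)
flips <- rbinom(1000, size = 1, prob = .55)   # 1,000 single coin flips

# With size = 1 every draw is either 0 or 1.
all(flips %in% c(0, 1))   # TRUE

# The share of ones should be close to, but not exactly, .55.
mean(flips)
```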

The next line creates 1,000 observations for the income variable, where income is a function of education and gender. The linear combination of education plus gender plus random error is placed inside an exp() function, which has the effect of creating a right-skewed distribution for income similar to what is observed in the general population:

plot(density(income))

[Figure: density plot of income]
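The skew can also be seen numerically. The sketch below regenerates the three variables so that it runs on its own, then compares the mean and the median on the original scale and on the log scale.

```r
set.seed(1234)
educ   <- round(runif(1000, 7.5, 18.49))
gender <- rbinom(1000, 1, .55)
income <- exp(10 + .1 * educ + .05 * gender + rnorm(1000, 0, .25))

# Right skew: the long upper tail pulls the mean above the median.
mean(income) > median(income)   # TRUE

# Relative gap between mean and median on each scale. The gap is
# expected to be much smaller for log(income), since taking logs
# undoes the exp() transformation.
(mean(income) - median(income)) / median(income)
(mean(log(income)) - median(log(income))) / median(log(income))
```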

We can combine these variables into a single data frame object.

fakeData<-data.frame(educ,gender,income)

We want R to know that gender is a nominal variable (or a factor), so we use the lines:

fakeData$gender<-as.factor(fakeData$gender)
levels(fakeData$gender)<-c("Males","Females")
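Renaming levels by position works here because the underlying codes sort as 0, then 1. An alternative sketch, shown only for comparison, spells out the mapping from code to label in a single factor() call. The names by_position and by_label are placeholders.

```r
set.seed(1234)
gender <- rbinom(1000, 1, .55)

# Approach used above: convert, then rename the levels by position.
by_position <- as.factor(gender)
levels(by_position) <- c("Males", "Females")

# Alternative: state explicitly that 0 means "Males" and 1 means "Females".
by_label <- factor(gender, levels = c(0, 1), labels = c("Males", "Females"))

# Both approaches label every observation the same way.
all(by_position == by_label)   # TRUE

# Cross-tabulating codes against labels confirms the mapping.
table(gender, by_label)
```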

Finally, we can fit a regression of the log of income on education and gender and save the output into its own object, which we call regOut.

regOut<-lm(log(income)~gender+educ,data=fakeData)

We have now created several different objects. We have our variables, our data frame, and our regression output. R interacts with each of these objects differently, however, on the basis of their class. When you are using common R functions, or any R functions that have been properly coded into an add-on package, the function that creates the object assigns it the appropriate class. You can see what the class is using the class() function (note that class() returns a value for every object: some objects store their class explicitly as an attribute, while for others it is implied by the type of data they hold). Notice how the following output differs.

> class(fakeData)
[1] "data.frame"

> class(fakeData$educ)
[1] "numeric"

> class(fakeData$gender)
[1] "factor"

> class(fakeData$income)
[1] "numeric"
 
> class(regOut)
[1] "lm"

R knows that the fakeData object is a data frame, that the educ and income variables in the data frame are numeric, that the gender variable is a factor, and that the regression output came from the lm() function.
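The difference between an explicit class attribute and an implied class can be inspected with a few base R functions. This is a minimal sketch using throwaway objects x and df.

```r
x  <- c(8, 12, 16)
df <- data.frame(x = x)

class(x)             # "numeric": implied by the type of data
attr(x, "class")     # NULL: no class attribute has been set
is.object(x)         # FALSE

class(df)            # "data.frame"
attr(df, "class")    # "data.frame": stored explicitly as an attribute
is.object(df)        # TRUE

# inherits() is a convenient way to test for a class in code.
inherits(df, "data.frame")   # TRUE
```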

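R contains several generic functions that produce different output depending on the class of the object they receive as their argument. This is because programmers have written methods for these functions, which means they have defined class-specific behaviors that the generic function selects from when it is called. In R's most common object system, a method is simply a function whose name combines the generic and the class, such as summary.data.frame() or summary.lm().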
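To make the idea concrete, here is a minimal sketch of writing a method yourself. The class name "likert" and the constructor new_likert() are hypothetical and invented for this example; they are not part of R or of any package.

```r
# Constructor for a hypothetical class "likert" (answers on a 1-5 scale).
new_likert <- function(x) {
  stopifnot(is.numeric(x))
  structure(x, class = "likert")
}

# A summary() method for that class. The name pattern generic.class is
# what R looks for when summary() is called on a "likert" object.
summary.likert <- function(object, ...) {
  table(factor(unclass(object), levels = 1:5))
}

answers <- new_likert(c(1, 3, 3, 5, 4, 3))

# Dispatches to summary.likert(): a count for each of the five categories.
summary(answers)

# With the class removed, the same call falls back to the default
# method and returns the usual minimum, quartiles, mean, and maximum.
summary(unclass(answers))
```

The same call, summary(), produces two different kinds of output here for the same reason it does for fakeData and regOut: the class of the argument determines which method runs.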

Two examples are the summary() and plot() functions. Look at the output that comes from applying the former to each of our five objects:

> summary(fakeData)
      educ          gender        income      
 Min.   : 8.0   Males  :427   Min.   : 28205  
 1st Qu.:10.0   Females:573   1st Qu.: 62822  
 Median :13.0                 Median : 84716  
 Mean   :13.1                 Mean   : 91231  
 3rd Qu.:16.0                 3rd Qu.:113885  
 Max.   :18.0                 Max.   :225762  

> summary(fakeData$educ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   10.00   13.00   13.1   16.00   18.00 

> summary(fakeData$gender)
  Males Females 
    427     573 

> summary(fakeData$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28200   62820   84720   91230  113900  225800
 
> summary(regOut)

Call:
lm(formula = log(income) ~ gender + educ, data = fakeData)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77616 -0.15969 -0.00072  0.16405  0.78219 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.007807   0.034493 290.141  < 2e-16 ***
genderFemales  0.058006   0.015710   3.692 0.000234 ***
educ           0.099331   0.002442  40.674  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2455 on 997 degrees of freedom
Multiple R-squared: 0.6245,     Adjusted R-squared: 0.6238 
F-statistic: 829.1 on 2 and 997 DF,  p-value: < 2.2e-16

When the object is of class data.frame, the summary() function returns a description of each variable in the data frame. The description will change depending on whether the variable has a numeric class or a factor class. In the former case, the minimum, maximum, quartiles, and mean are presented. In the latter case, a frequency count is presented. These summaries can be provided for a single variable at a time, as illustrated with the next three calls to the summary() function. Notice that, for objects of class lm, the summary() function returns something very different: the coefficient table, residual quantiles, and fit statistics needed for reporting regression results.
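The result of summary() on an lm object is itself an object with its own class, and its pieces can be extracted for reporting. The sketch below rebuilds the data and the model so that it runs on its own; regSum is a name introduced here for illustration.

```r
set.seed(1234)
educ   <- round(runif(1000, 7.5, 18.49))
gender <- rbinom(1000, 1, .55)
income <- exp(10 + .1 * educ + .05 * gender + rnorm(1000, 0, .25))

fakeData <- data.frame(educ, gender, income)
fakeData$gender <- as.factor(fakeData$gender)
levels(fakeData$gender) <- c("Males", "Females")

regOut <- lm(log(income) ~ gender + educ, data = fakeData)

# The methods that summary() selects for two of the classes used above.
is.function(getS3method("summary", "data.frame"))   # TRUE
is.function(getS3method("summary", "lm"))           # TRUE

# summary() on an lm object returns an object of class "summary.lm".
regSum <- summary(regOut)
class(regSum)        # "summary.lm"

# Individual results can be pulled out of it.
coef(regSum)         # matrix of estimates, standard errors, t and p values
regSum$r.squared     # multiple R-squared
```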

Now consider the plot() function. We get a scatterplot matrix when the class is data.frame.

plot(fakeData)

[Figure: scatterplot matrix of educ, gender, and income]

We get a plot of values by row index when we request a plot of a single numeric variable (in this case not a very informative figure).

plot(fakeData$educ)

[Figure: educ plotted against row index]

We get a bar chart for plots of objects with a factor class:

plot(fakeData$gender)

[Figure: bar chart of gender counts]

Most interesting is what happens when we plot an lm object. Here we get a series of plots we can use for residual diagnostics:

par(mfrow=c(2,2))
plot(regOut)

[Figure: four-panel residual diagnostic plots for regOut]
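Two practical notes on this call, offered as a sketch. First, par(mfrow=c(2,2)) changes the layout for every later plot, so it can be worth saving and restoring the old settings. Second, the plot method for lm objects accepts a which argument for requesting individual diagnostic plots. The model is refit here, without the factor labels, only so that the block runs on its own.

```r
set.seed(1234)
educ   <- round(runif(1000, 7.5, 18.49))
gender <- rbinom(1000, 1, .55)
income <- exp(10 + .1 * educ + .05 * gender + rnorm(1000, 0, .25))
regOut <- lm(log(income) ~ gender + educ)

# par() returns the previous settings, so they can be restored afterwards.
op <- par(mfrow = c(2, 2))
plot(regOut)               # dispatches to the plot method for class "lm"
par(op)                    # back to the previous layout

# Request a single diagnostic plot: 1 is residuals versus fitted values.
plot(regOut, which = 1)
```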

This of course only scratches the surface of OOP, but it demonstrates some of the power of R’s OOP model (just compare running regression diagnostics in R with doing the same in, say, SPSS). Most applied researchers who are now beginning to discover R may not need to know much about other OOP features such as inheritance and superclasses, but it is very useful to understand how R knows which actions to perform when the same function is applied to very different types of objects.
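For readers curious about inheritance, one brief illustration is available in base R: an object can carry more than one class, and R tries them in order when looking for a method. The logistic regression below is a sketch fit only to show the class vector; gender was generated independently of educ, so the model itself is not meaningful. The name glmOut is a placeholder.

```r
set.seed(1234)
educ   <- round(runif(1000, 7.5, 18.49))
gender <- rbinom(1000, 1, .55)

glmOut <- glm(gender ~ educ, family = binomial)

# A glm object has two classes: "glm" first, then "lm".
class(glmOut)            # "glm" "lm"

# It therefore counts as an "lm" object as well, and a generic function
# with no method for "glm" would fall back to the method for "lm".
inherits(glmOut, "lm")   # TRUE
```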