Object Oriented Programming in Data Science with R

June 2018 · 8 minute read

Since R is mostly a functional language and data science work lends itself to being expressed in a functional form, you can get by just fine without learning about object-oriented programming.

Personally, I mostly follow a functional programming style (although often not a pure one, i.e. one without side effects, because of limited RAM). Expressing mathematical concepts in a functional way is quite natural in my opinion.

However, object-oriented programming offers a lot of benefits in certain use cases. The Python data science community embraces OOP, possibly because of its stronger background in computer science, as opposed to the math/stats background of the R community. While I think that OOP is sometimes taken too far (I do not want to write numpy.matmul(a, b) to do matrix multiplication, I prefer A %*% B :), I also think that there is a lot to like about it. OOP helps to hide complexity, e.g. by encapsulating the complexity of a prediction algorithm.

In this post I want to show you how to use the S3 class system to load data from different sources into R, and how to implement a class myPredictionAlgorithm with a fit_lm() and a get_coeff() method using R6 as a class system.

Object-oriented programming in R

R has multiple systems to implement object-oriented programming. In order of complexity, starting from the simplest, they are:

  1. S3 classes,
  2. S4 classes,
  3. Reference classes (~ R5) and
  4. R6 class system.

In contrast to ‘classic’ message-passing object-oriented languages like Python, C++ or Java, S3 uses so-called generic-function OOP. In message-passing OOP, messages (= methods) are sent to an object, which then tries to find an appropriate function to call (Hadley Wickham, ‘Advanced R’). In generic-function OOP, the generic function itself decides which method to call, which makes it quite similar to operator overloading: a generic function, say print(), dispatches to a method such as print.myClass(). S3 has no formal class definition. S4 and Reference Classes are both more formal than S3. R6 is what most programmers coming from, say, Python expect an OO system to look like.
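To make the print() / print.myClass() idea concrete, here is a minimal sketch. The object, its name field and the output format are made up for illustration; only the naming pattern generic.class comes from the text above.

```r
# A hypothetical object whose class attribute is "myClass"
obj = structure(list(name = "demo"), class = "myClass")

# The S3 method for the generic print(): the name follows the pattern generic.class
print.myClass = function(x, ...) {
  cat(paste0("<myClass: ", x$name, ">\n"))
  invisible(x)  # print methods conventionally return their argument invisibly
}

# print() looks at class(obj) and dispatches to print.myClass()
print(obj)
```

Note that the method belongs to the generic function, not to the object: the object only carries a class attribute.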

S3 generics

A generic function is a function whose functionality depends on the object it is used on. print() is one of the best examples that shows the power of generic functions:

# print for a vector:
print(1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
# print for a data.frame:
print(mtcars)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Depending on the class of the input, print() returns a different output.

Generic functions are also called polymorphic functions. When we inspect print() we can see that it is indeed an S3 generic function:

pryr::ftype(print)
## [1] "s3"      "generic"

Now, let’s use the S3 class system to connect to a data source and extract some data. We start by building a connection class that we can pass to our extraction class to get some data:

create_connection = function(connection_string, type) {
  class(connection_string) = type
  
  return(connection_string)
}

In S3 we can create a class simply by setting the class attribute. In our create_connection() function the class depends on the type argument, which will be either ‘local’ or ‘database’. Now we define a generic function extract() to load our data into R depending on the class of the connection object:

# Note: extract() is the generic that dispatches to functions named extract.class
# `...` lets methods accept extra arguments, such as getCols below
extract = function(object, ...) UseMethod("extract")

# generic.class = function() ...
extract.local = function(object, ...) cat("Extracting data from local source ...\n")
extract.database = function(object, getCols = NULL, ...) cat("Extracting data from database ...\n")

# default if no class-specific method is found
extract.default = function(object, ...) message("unknown connection object")

We define a generic function with UseMethod(). When we call extract(), the generic checks the class attribute of the object we passed. Based on the class, it searches for a function following the naming convention ‘extract.class_name’. If it finds one, it will use it; otherwise it will use our default implementation:

con_local = create_connection("local connection", type = "local")
con_db = create_connection("database connection", type = "database")
con_spark = create_connection("spark connection", type = "spark")

# methods for class 'local' and 'database' are defined:
extract(con_local)
## Extracting data from local source ...
extract(con_db)
## Extracting data from database ...
# there is no method for class 'spark':
extract(con_spark)
## unknown connection object
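The lookup that UseMethod() performs can be imitated by hand. The following is only a rough sketch of the idea, not how R implements dispatch internally; describe.local(), describe.default() and manual_dispatch() are hypothetical names chosen for this illustration.

```r
# Hypothetical methods following the generic.class naming convention
describe.local = function(x) "local source"
describe.default = function(x) "unknown source"

# Hand-written stand-in for UseMethod("describe"), for illustration only
manual_dispatch = function(x, generic = "describe") {
  # try each class from most to least specialized, then fall back to the default
  for (cls in c(class(x), "default")) {
    method_name = paste0(generic, ".", cls)
    if (exists(method_name, mode = "function")) {
      return(get(method_name, mode = "function")(x))
    }
  }
  stop("no applicable method for '", generic, "'")
}

con_local = structure("local connection", class = "local")
con_spark = structure("spark connection", class = "spark")

manual_dispatch(con_local)  # "local source"
manual_dispatch(con_spark)  # "unknown source", since describe.spark() does not exist
```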

You could also give the generic a different name than the one passed to UseMethod(), e.g. load() instead of extract(), but the convention is to use the same name in both places. This allows you to easily find all specific implementations of the generic function using methods():

methods(extract)
## [1] extract.database extract.default  extract.local   
## see '?methods' for accessing help and source code

We can also use methods() to list all methods implemented for a given class:

methods(class = "local")
## [1] extract
## see '?methods' for accessing help and source code

So essentially we created a simple class hierarchy (adapted from: Thomas Mailund, Advanced Object-Oriented Programming in R):

The abstract class ‘extractData’ defines an interface ‘extract’. We do not explicitly create the abstract class, we only define its method extract() using a generic function. Using this generic function, we implement concrete classes called ‘local’, ‘database’ and ‘default’ by writing corresponding extract.class functions.
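One consequence of this design is that supporting a new data source only requires one more method. The sketch below re-declares the generic and the default so that it runs on its own; the extract.spark() method and its message are hypothetical additions, not part of the code above.

```r
# Re-declared here so the snippet is self-contained
extract = function(object, ...) UseMethod("extract")
extract.default = function(object, ...) message("unknown connection object")

# Hypothetical new concrete class 'spark': one more method, nothing else changes
extract.spark = function(object, ...) cat("Extracting data from Spark ...\n")

con_spark = structure("spark connection", class = "spark")
extract(con_spark)
```

With the method in place, the call no longer falls through to extract.default().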

In the S3 class system inheritance works by specifying a character vector as class attribute like so:

object = 1
class(object) = c("C", "B", "A")

class(object)
## [1] "C" "B" "A"

The first element in the class attribute vector is the most specialized and the last the most general.
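A short sketch of how this ordering plays out during dispatch, using NextMethod() to hand over to the next class in the vector. The generic describe() and its methods are hypothetical and exist only for this example.

```r
# Hypothetical generic with one method per class
describe = function(x, ...) UseMethod("describe")
describe.A = function(x, ...) "A"
describe.B = function(x, ...) paste("B ->", NextMethod())
describe.C = function(x, ...) paste("C ->", NextMethod())

object = structure(1, class = c("C", "B", "A"))

# Dispatch starts at the most specialized class and walks towards the most general one
describe(object)       # "C -> B -> A"

# inherits() checks whether a class appears anywhere in the class vector
inherits(object, "A")  # TRUE
```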

R6

With a few notable exceptions (e.g. the data.table package), data in R is effectively immutable: modifying an object usually creates a modified copy. R6 is a class system that breaks this principle by allowing mutable data structures. This allows us to write methods that actually modify an object instead of making a copy. R6 can be seen as an improved version of the reference class system (R5), so I will not cover R5.
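The difference between the two behaviours can be seen in base R alone. This small sketch contrasts an ordinary vector with an environment, the mutable structure that R6 objects are built on; the variable names are arbitrary.

```r
# Ordinary vectors: modifying b leaves a untouched
a = c(1, 2, 3)
b = a
b[1] = 100
a[1]      # 1

# Environments are modified in place: e1 and e2 refer to the same object
e1 = new.env()
e1$value = 1
e2 = e1
e2$value = 100
e1$value  # 100
```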

Let’s try to build an R6 class for our prediction algorithm:

library(R6)

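myPredictionAlgorithm = R6Class("myPredictionAlgorithm",
  private = list(
    # the fitted model; only reachable from inside the class
    model = NULL
  ),
  public = list(
    formula = NULL,
    data = NULL,
    # called by $new(); stores the inputs on the object
    initialize = function(data = NA, formula = NA) {
      self$data = data
      self$formula = formula
      cat("model object created\n")
    },
    # custom print method; c() wraps the formula so that paste0() yields a single string
    print = function() {
      print(paste0("Formula = ", c(self$formula)))
    },
    # fits a linear model and stores it in the private field
    fit_lm = function() {
      private$model = lm(formula = self$formula, data = self$data)

      print(private$model)
    },
    # prints the coefficients of the fitted model
    get_coeff = function() {
      print(coefficients(private$model))
    }
  )
)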

To create an object of our class we call $new(), which will call initialize() if it exists:

ols_model = myPredictionAlgorithm$new(data = mtcars, formula = mpg ~ cyl + carb)
## model object created

Printing our model object gives us:

ols_model
## [1] "Formula = mpg ~ cyl + carb"

because we specified a custom print method.

We can use $ to get access to all public methods and values, like so:

ols_model$fit_lm()
## 
## Call:
## lm(formula = self$formula, data = self$data)
## 
## Coefficients:
## (Intercept)          cyl         carb  
##     37.8127      -2.6250      -0.5261

Or so:

ols_model$get_coeff()
## (Intercept)         cyl        carb 
##   37.812739   -2.625023   -0.526146

So, inside methods, public members are accessed using self$ and private members using private$. The R6 introduction vignette suggests having methods return invisible(self) if you want them to be chainable. Private members can be accessed only by methods defined in the class or its sub-classes.
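As a sketch of what chaining could look like, the following hypothetical class has a fit() method that returns invisible(self) and a predict() method. Neither method is part of myPredictionAlgorithm above; the class name chainableModel and the new_cars data are assumptions made for this example.

```r
library(R6)

# Hypothetical class for illustration; not the class defined above
chainableModel = R6Class("chainableModel",
  private = list(
    model = NULL
  ),
  public = list(
    formula = NULL,
    data = NULL,
    initialize = function(data, formula) {
      self$data = data
      self$formula = formula
    },
    fit = function() {
      private$model = lm(formula = self$formula, data = self$data)
      invisible(self)  # returning self makes the call chainable
    },
    predict = function(newdata) {
      if (is.null(private$model)) stop("call $fit() before $predict()")
      stats::predict(private$model, newdata = newdata)
    }
  )
)

# Create, fit and predict in one chained expression
new_cars = data.frame(cyl = c(4, 8), carb = c(2, 4))
predictions = chainableModel$new(data = mtcars, formula = mpg ~ cyl + carb)$fit()$predict(new_cars)
```

Because fit() modifies the object in place and returns it, no intermediate assignment is needed between the steps.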

Unfortunately, there is no way to enforce types of fields in R6 except by implementing checks manually (e.g. as a checker class).
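One way such a manual check could look is an R6 active binding that validates every assignment to a field. This is only a sketch: the class name typedModel and the private field .formula are hypothetical.

```r
library(R6)

# Hypothetical class whose formula field rejects anything that is not a formula
typedModel = R6Class("typedModel",
  private = list(
    .formula = NULL
  ),
  active = list(
    # called with no argument on read, and with `value` on assignment
    formula = function(value) {
      if (missing(value)) return(private$.formula)
      stopifnot(inherits(value, "formula"))
      private$.formula = value
    }
  ),
  public = list(
    initialize = function(formula) {
      self$formula = formula  # goes through the check above
    }
  )
)

model = typedModel$new(mpg ~ cyl)

# Assigning a character string triggers the check and raises an error
result = tryCatch({
  model$formula = "not a formula"
  "accepted"
}, error = function(e) "rejected")
result  # "rejected"
```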