Data serialization in R and Racket
When programming in R, I generally pass data around by reading and writing text files (typically, CSV files). The ubiquity of CSV files means that many different types of software will open them easily (e.g., Notepad, Excel, TextEdit, etc.). However, if the data structure is not flat or contains other attributes, then writing to CSV requires flattening and/or dropping attributes. The general solution to writing data to a file while retaining structure and attributes is serialization.
In R, I use
saveRDS for reading and writing serialized objects. Let’s pull a nested list from a recent post as an example.
nested_list = list("female" = list("0-1" = 0, "1-2" = 0, "2-3" = 0.15, "3-4" = 0.4, "4+" = 0.8), "male" = list("0-1" = 0, "1-2" = 0.1, "2-3" = 0.4, "3-4" = 0.6, "4+" = 0.9)) saveRDS(nested_list, "list.rds") # write list object to file identical(nested_list, readRDS("list.rds")) # read list object from file and compare to original
Even with a flat object, serialization can be useful. For example, we could write a matrix to CSV with
write.table but it won’t be exactly the same when reading from the CSV file.1 The approach with
readRDS is more straightforward.
mat = matrix(data = 0, nrow = 10, ncol = 5) write.table(mat, "matrix.csv", row.names = FALSE, col.names = FALSE) mat_csv = as.matrix(read.table("matrix.csv")) identical(mat, mat_csv) saveRDS(mat, "matrix.rds") mat_rds = readRDS("matrix.rds") identical(mat, mat_rds)
#lang racket (require racket/serialize) (define nested-hash (hash "female" (hash "0-1" 0 "1-2" 0 "2-3" 0.15 "3-4" 0.4 "4+" 0.8) "male" (hash "0-1" 0 "1-2" 0.1 "2-3" 0.4 "3-4" 0.6 "4+" 0.9))) ; define function for saving data to a rkdt file (define (save-rktd data path) (if (serializable? data) (with-output-to-file path (lambda () (write (serialize data))) #:exists 'replace) (error "Data is not serializable"))) ; define function for reading data from a rkdt file (define (read-rktd path) (with-input-from-file path (lambda () (deserialize (read))))) (save-rktd nested-hash "hash.rktd") ; write hash table to file (equal? nested-hash (read-rktd "hash.rktd")) ; read hash table from file and compare to original
save-rktd function first checks if the data structure is serializable and returns an error if it is not.2 The
with-output-to-file function handles opening and closing a port to the file. While the port is open, the anonymous function (specified with
(lambda)) indicates that the data should first be serialized and then written to the output port. The keyword argument
#:exists is set to
'replace because overwriting the existing file is familiar to me from
read-rktd, a port is opened and closed with
with-input-from-file and the anonymous function first reads the file and then deserializes the data. The
read-rktd functions are then applied in the same way as
read-rktd should work on any data structures that are serializable. Here is an example with a “matrix” comprised of a vector of vectors.
(define matrix (for/vector ([i (in-range 10)]) (make-vector 5 0))) (save-rktd matrix "matrix.rktd") (equal? matrix (read-rktd "matrix.rktd"))
One key difference between the R functions and my Racket versions is that the R functions apply compression by default and no compression is applied in the Racket functions. In fact, relative to writing/reading CSV files, using
readRDS provides the benefits of small size on disk and fast read/write operations in addition to retaining data structure and attributes.