As an exercise in my Scheme (R6RS) learning journey, I have implemented a dataframe record type and procedures to work with the dataframe record type. Dataframes are column-oriented, tabular data structures useful for data analysis found in several languages including R, Python, Julia, and Go. In this post, I will introduce the dataframe record type and basic procedures for working with dataframes. In subsequent posts, I will describe other dataframe procedures, e.g., filter, sort, aggregate, etc.

A key design decision was to opt for an immutable data structure. Thus, dataframes are based on association lists (alists) rather than hashtables. I find it easier to reason about immutable data and thought the performance trade-off was worth it for this excercise. Here are the properties of a dataframe:

  • an alist where each sublist is a column;
  • the first element of each column is the column name;
  • the column name must be a symbol;
  • all column names must be unique;
  • all columns must have the same length.

Dataframe record type

I won't profess to have a good understanding of record types. This is what I came up with for dataframes.

(define-record-type dataframe (fields alist names dim)
                    (protocol
                     (lambda (new)
                       (lambda (alist)
                         (let ([who "(make-dataframe alist)"])
                           (check-alist alist who))
                         (new alist
                              (map car alist)
                              (cons (length (cdar alist)) (length alist)))))))

A key component of the record definition is check-alist, which confirms that the alist meets the definition of a dataframe (see bulleted list above). Each dataframe has three fields (i.e., alist, names, and dim), but alist is the only required field. The other two are based on the properties of the alist. define-record-type creates a predicate, dataframe?, constructor procedure, make-dataframe, and accessor procedures for each field: dataframe-alist, dataframe-names, and dataframe-dim.

> (define df (make-dataframe '((a 1 2 3) (b 4 5 6))))

> df
#[#{dataframe cziqfonusl4ihl0gdwa8clop7-3} ((a 1 2 3) (b 4 5 6)) (a b) (3 . 2)]

> (dataframe? df)
#t

> (dataframe? '((a 1 2 3) (b 4 5 6)))
#f

> (dataframe-alist df)  
((a 1 2 3) (b 4 5 6))

> (dataframe-names df)
(a b)

> (dataframe-dim df)
(3 . 2)                  ; (rows . columns)

> (define df (make-dataframe '(("a" 1 2 3) ("b" 4 5 6))))
Exception in (make-dataframe alist): names are not symbols

> (dataframe-display df)
 dim: 3 rows x 2 cols
     a     b 
    1.    4. 
    2.    5. 
    3.    6. 

Head and tail

In R, I frequently use head to preview the first few rows of a dataframe and, less frequently, use tail to view the last few rows. Scheme provides list-head and list-tail with similar functionality. However, tail in R returns the last n rows of the dataframe whereas list-tail in Scheme returns the rest of the list starting at a given index. My first instinct was to write dataframe-tail to use the R behavior, but eventually decided that dataframe-tail should follow the behavior established by list-tail. I was trying to think in terms of the principle of least surprise, but the degree of surprise depends on the potential users. Am I targeting R or Scheme programmers? The most realistic scenario is that future me is the only potential user and I want that guy to think in terms of typical Scheme patterns.

(define (dataframe-head df n)
  (let ([who  "(dataframe-head df n)"])
    (check-dataframe df who)
    (check-integer-positive n "n" who)
    (check-index n (car (dataframe-dim df)) who)
    (make-dataframe (alist-head-tail (dataframe-alist df) n list-head))))

;; dataframe-tail is based on list-tail, which does not work the same as tail in R
(define (dataframe-tail df n)
  (let ([who  "(dataframe-tail df n)"])
    (check-dataframe df who)
    (check-integer-gte-zero n "n" who)
    (check-index (sub1 n) (car (dataframe-dim df)) who)
    (make-dataframe (alist-head-tail (dataframe-alist df) n list-tail))))

(define (alist-head-tail alist n proc)
  (map (lambda (col) (cons (car col) (proc (cdr col) n))) alist))

dataframe-head and dataframe-tail illustrate a common pattern in the dataframe library: extracting the alist, breaking the alist into sublists, working on the sublists, and then rebuilding the alist and dataframe. In the case of dataframe-head and dataframe-tail, the core logic is so simple that most of the code involves checking inputs.

Transpose

Dataframes are a column-oriented data structure. However, the more natural pattern when reading and writing CSV files is to use a row-oriented list, which I'm calling a rowtable. dataframe->rowtable and rowtable->dataframe allow for switching between row and column orientation.

> (define df (make-dataframe '((a 100 300) (b 4 6) (c 700 900))))

> (dataframe->rowtable df)
((a b c) (100 4 700) (300 6 900))

> (dataframe-display 
    (rowtable->dataframe '((a b c) (1 4 7) (2 5 8) (3 6 9))))

 dim: 3 rows x 3 cols
     a     b     c 
    1.    4.    7. 
    2.    5.    8. 
    3.    6.    9. 

> (dataframe-display 
    (rowtable->dataframe '(("a" "b" "c") (1 4 7) (2 5 8) (3 6 9))))

 dim: 3 rows x 3 cols
     a     b     c 
    1.    4.    7. 
    2.    5.    8. 
    3.    6.    9. 

> (dataframe-display 
    (rowtable->dataframe '((1 4 7) (2 5 8) (3 6 9)) #f))

 dim: 3 rows x 3 cols
    V0    V1    V2 
    1.    4.    7. 
    2.    5.    8. 
    3.    6.    9. 

Read and write

If you are working exclusively with dataframes, you can read and write them directly (i.e., without transposing to and from rowtables) with dataframe-read and dataframe-write. These procedures are straightforward because they are simply reading and writing the alists with read and write.

(define (dataframe-write df path overwrite?)
  (when (and (file-exists? path) (not overwrite?))
    (assertion-violation path "file already exists"))
  (delete-file path)
  (with-output-to-file path
    (lambda () (write (dataframe-alist df)))))

(define (dataframe-read path)
  (make-dataframe (with-input-from-file path read)))

Extract values

dataframe-values returns all the values in a column as a simple list. Following R, I've included $ as an alias for dataframe-values. This procedure is particularly useful when modifying and aggregating dataframes (as I will show in a future blog post). dataframe-values-unique returns the unique values from a column.

> (define df (make-dataframe '((a 100 200 300) (b 4 5 6) (c 700 800 900))))

> (dataframe-values df 'b)
(4 5 6)

> ($ df 'b)                 
(4 5 6)

> (map (lambda (name) ($ df name)) '(c a))
((700 800 900) (100 200 300))

> (define df1 (make-dataframe '((x a a b) (y c d e))))

> (dataframe-values-unique df1 'x)
(a b)

> (dataframe-values-unique df1 'y)
(c d e)

dataframe-ref returns a dataframe based on a list of row indices and, optionally, the selected column names. I did not follow the principle of least surprise here because dataframe-ref takes a list of indices rather than a single value as in list-ref. For dataframes, the scenario of referencing a single row seemed less likely than a range of rows and I wanted to provide the option to simultaneously select the columns returned.

> (define df (make-dataframe '((grp "a" "a" "b" "b" "b")
                               (trt "a" "b" "a" "b" "b")
                               (adult 1 2 3 4 5)
                               (juv 10 20 30 40 50))))

> (dataframe-display (dataframe-ref df '(0 2 4)))

 dim: 3 rows x 4 cols
   grp   trt  adult   juv 
     a     a     1.   10. 
     b     a     3.   30. 
     b     b     5.   50. 

> (dataframe-display (dataframe-ref df '(0 2 4) 'adult 'juv))

 dim: 3 rows x 2 cols
  adult   juv 
     1.   10. 
     3.   30. 
     5.   50.