This post is the third in a series on the dataframe library for Scheme (R6RS). In this post, I will contrast the dataframe library with functions from base R and the dplyr package for splitting, binding, and appending dataframes.

Set up

First, let's create a couple of dataframes in both languages.

df1 <- data.frame(trt = rep(c("a", "b"), each = 6),
                  grp = rep(rep(c("x", "y"), each = 3), times = 2),
                  rsp = rep(1:4, each = 3))

df2 <- data.frame(asc = 0:11, desc = 11:0)
                 
(define df1
  (make-dataframe
   (list (cons 'trt (append (make-list 6 "a") (make-list 6 "b")))
         (cons 'grp (append (make-list 3 "x") (make-list 3 "y")
                            (make-list 3 "x") (make-list 3 "y")))
         (cons 'rsp (append (make-list 3 1) (make-list 3 2)
                            (make-list 3 3) (make-list 3 4))))))

(define df2
  (make-dataframe
   (list (cons 'asc (iota 12))
         (cons 'desc (reverse (iota 12))))))

I think the Scheme code for creating df1 is clear, but I like the conciseness of the R code. I'm taking this post on a little detour to write a rep procedure to replicate (pun intended) the functionality of rep.

(define (rep ls n type)
  (cond [(symbol=? type 'each)
         (apply append (map (lambda (x) (make-list n x)) ls))]
        [(symbol=? type 'times)
         (rep-times ls n)]
        [else
         (assertion-violation "(rep ls n type)"
                              "type must be 'each or 'times")]))

(define (rep-times ls n)
  (define (loop ls-out n)
    (if (= n 1) ls-out (loop (append ls ls-out) (sub1 n))))
  (loop ls n))

(define df1
  (make-dataframe
   (list (cons 'trt (rep '("a" "b") 6 'each))
         (cons 'grp (rep (rep '("x" "y") 3 'each) 2 'times))
         (cons 'rsp (rep '(1 2 3 4) 3 'each)))))

The each case was a simple map, but I couldn't think of a way to use higher-order functions for the times case. Instead, I wrote a separate recursive procedure, rep-times, to handle that case.

Append

I'm using append to refer to a cbind operation in R.

> head(cbind(df1, df2))

  trt grp rsp asc desc
1   a   x   1   0   11
2   a   x   1   1   10
3   a   x   1   2    9
4   a   y   2   3    8
5   a   y   2   4    7
6   a   y   2   5    6

In Scheme, we append dataframes with equal numbers of rows via dataframe-append.

> (dataframe-display (dataframe-append df1 df2) 6)

 dim: 12 rows x 5 cols
   trt   grp   rsp   asc  desc 
     a     x    1.    0.   11. 
     a     x    1.    1.   10. 
     a     x    1.    2.    9. 
     a     y    2.    3.    8. 
     a     y    2.    4.    7. 
     a     y    2.    5.    6. 

I chose dataframe-append as the name because alists, which are at the heart of dataframes, are straightforwardly combined with append in Scheme.

> (append '((a 1 2 3)) '((b 4 5 6)))

((a 1 2 3) (b 4 5 6))

Split

In R, split returns a named list of dataframes where the names are based on the column names defining the groups.

> split(df1, list(df1$trt, df1$grp))

$a.x
  trt grp rsp
1   a   x   1
2   a   x   1
3   a   x   1

$b.x
  trt grp rsp
7   b   x   3
8   b   x   3
9   b   x   3

$a.y
  trt grp rsp
4   a   y   2
5   a   y   2
6   a   y   2

$b.y
   trt grp rsp
10   b   y   4
11   b   y   4
12   b   y   4

dataframe-split returns a list of dataframes.

> (for-each dataframe-display (dataframe-split df1 'trt 'grp))

 dim: 3 rows x 3 cols
   trt   grp   rsp 
     a     x    1. 
     a     x    1. 
     a     x    1. 
 dim: 3 rows x 3 cols
   trt   grp   rsp 
     a     y    2. 
     a     y    2. 
     a     y    2. 
 dim: 3 rows x 3 cols
   trt   grp   rsp 
     b     x    3. 
     b     x    3. 
     b     x    3. 
 dim: 3 rows x 3 cols
   trt   grp   rsp 
     b     y    4. 
     b     y    4. 
     b     y    4. 

Bind

For binding by rows, we will use functions from dplyr. In the first example, all dataframes in the list have the same columns.

> dplyr::bind_rows(split(df1, list(df1$trt, df1$grp)))

   trt grp rsp
1    a   x   1
2    a   x   1
3    a   x   1
4    b   x   3
5    b   x   3
6    b   x   3
7    a   y   2
8    a   y   2
9    a   y   2
10   b   y   4
11   b   y   4
12   b   y   4

dataframe-bind works similarly to dplyr::bind_rows, but we need to use apply to bust open the list of dataframes created by dataframe-split.

> (dataframe-display 
    (apply dataframe-bind (dataframe-split df1 'trt 'grp)) 
    12)

 dim: 12 rows x 3 cols
   trt   grp   rsp 
     a     x    1. 
     a     x    1. 
     a     x    1. 
     a     y    2. 
     a     y    2. 
     a     y    2. 
     b     x    3. 
     b     x    3. 
     b     x    3. 
     b     y    4. 
     b     y    4. 
     b     y    4. 

To show how to bind dataframes with different columns, let's split up df1.

df_a <- dplyr::filter(df1, trt == "a")
df_b <- dplyr::filter(df1, trt == "b")

(define-values (df-a df-b)
  (dataframe-partition* df1 (trt) (string=? trt "a")))

dplyr::bind_rows fills missing columns with NA.

dplyr::bind_rows(df_a, df_b[,c("trt", "grp")])

   trt grp rsp
1    a   x   1
2    a   x   1
3    a   x   1
4    a   y   2
5    a   y   2
6    a   y   2
7    b   x  NA
8    b   x  NA
9    b   x  NA
10   b   y  NA
11   b   y  NA
12   b   y  NA

Because Scheme doesn't have explicit missing values, I created a separate procedure, dataframe-bind-all, for binding dataframes where missing columns are filled by the specified missing value.

> (dataframe-display 
    (dataframe-bind-all -999 df-a (dataframe-drop df-b 'rsp)) 
    12)

 dim: 12 rows x 3 cols
   trt   grp    rsp 
     a     x     1. 
     a     x     1. 
     a     x     1. 
     a     y     2. 
     a     y     2. 
     a     y     2. 
     b     x  -999. 
     b     x  -999. 
     b     x  -999. 
     b     y  -999. 
     b     y  -999. 
     b     y  -999. 

In contrast, dataframe-bind will drop all columns not shared across the dataframes being bound.

> (dataframe-display 
    (dataframe-bind df-a (dataframe-drop df-b 'rsp)) 
    12)

 dim: 12 rows x 2 cols
   trt   grp 
     a     x 
     a     x 
     a     x 
     a     y 
     a     y 
     a     y 
     b     x 
     b     x 
     b     x 
     b     y 
     b     y 
     b     y 

Final thoughts

With the exception of dataframe-split, all of the procedures described in the first three posts in the dataframe series involve straightforward composition of Scheme's fundamental procedures (e.g., map, apply, append, cons, car, cdr, etc.) on Scheme's core data structure, i.e., lists. The next couple of posts involve procedures that forced me to wrestle with tradeoffs between convenient syntax via macros (e.g., dataframe-partition*) and familiarity/consistency with Scheme's standard library. In the next post, I will describe how to filter, partition, and sort dataframes in Scheme.