Split, bind, and append dataframes in Chez Scheme
This post is the third in a series on the dataframe
library for Chez Scheme. In this post, I will contrast the dataframe
library with functions from base R and the dplyr
package for splitting, binding, and appending dataframes.
Set up
First, let's create a couple of dataframes in both languages.
df1 <- data.frame(trt = rep(c("a", "b"), each = 6),
grp = rep(rep(c("x", "y"), each = 3), times = 2),
rsp = rep(1:4, each = 3))
df2 <- data.frame(asc = 0:11, desc = 11:0)
(define df1
(make-dataframe
(list (cons 'trt (append (make-list 6 "a") (make-list 6 "b")))
(cons 'grp (append (make-list 3 "x") (make-list 3 "y")
(make-list 3 "x") (make-list 3 "y")))
(cons 'rsp (append (make-list 3 1) (make-list 3 2)
(make-list 3 3) (make-list 3 4))))))
(define df2
(make-dataframe
(list (cons 'asc (iota 12))
(cons 'desc (reverse (iota 12))))))
I think the Scheme code for creating df1
is clear, but I like the conciseness of the R code. I'm taking this post on a little detour to write a rep
procedure to replicate (pun intended) the functionality of rep
.
(define (rep ls n type)
(cond [(symbol=? type 'each)
(apply append (map (lambda (x) (make-list n x)) ls))]
[(symbol=? type 'times)
(rep-times ls n)]
[else
(assertion-violation "(rep ls n type)"
"type must be 'each or 'times")]))
(define (rep-times ls n)
(define (loop ls-out n)
(if (= n 1) ls-out (loop (append ls ls-out) (sub1 n))))
(loop ls n))
(define df1
(make-dataframe
(list (cons 'trt (rep '("a" "b") 6 'each))
(cons 'grp (rep (rep '("x" "y") 3 'each) 2 'times))
(cons 'rsp (rep '(1 2 3 4) 3 'each)))))
The each
case was a simple map
, but I couldn't think of a way to use higher-order functions for the times
case. Instead, I wrote a separate recursive procedure, rep-times
, to handle that case.
Append
I'm using append
to refer to a cbind
operation in R.
> head(cbind(df1, df2))
trt grp rsp asc desc
1 a x 1 0 11
2 a x 1 1 10
3 a x 1 2 9
4 a y 2 3 8
5 a y 2 4 7
6 a y 2 5 6
In Chez Scheme, we append dataframes with equal numbers of rows via dataframe-append
.
> (dataframe-display (dataframe-append df1 df2) 6)
dim: 12 rows x 5 cols
trt grp rsp asc desc
a x 1. 0. 11.
a x 1. 1. 10.
a x 1. 2. 9.
a y 2. 3. 8.
a y 2. 4. 7.
a y 2. 5. 6.
I chose dataframe-append
as the name because alists, which are at the heart of dataframes, are straightforwardly combined with append
in Chez Scheme.
> (append '((a 1 2 3)) '((b 4 5 6)))
((a 1 2 3) (b 4 5 6))
Split
In R, split
returns a named list of dataframes where the names are based on the column names defining the groups.
> split(df1, list(df1$trt, df1$grp))
$a.x
trt grp rsp
1 a x 1
2 a x 1
3 a x 1
$b.x
trt grp rsp
7 b x 3
8 b x 3
9 b x 3
$a.y
trt grp rsp
4 a y 2
5 a y 2
6 a y 2
$b.y
trt grp rsp
10 b y 4
11 b y 4
12 b y 4
dataframe-split
returns a list of dataframes.
> (for-each dataframe-display (dataframe-split df1 'trt 'grp))
dim: 3 rows x 3 cols
trt grp rsp
b x 3.
b x 3.
b x 3.
dim: 3 rows x 3 cols
trt grp rsp
b y 4.
b y 4.
b y 4.
dim: 3 rows x 3 cols
trt grp rsp
a x 1.
a x 1.
a x 1.
dim: 3 rows x 3 cols
trt grp rsp
a y 2.
a y 2.
a y 2.
Implementation
The first step in dataframe-split
is to find the unique values of the grouping columns [1].
> (dataframe-display (dataframe-unique (dataframe-select df1 'trt 'grp)))
dim: 4 rows x 2 cols
trt grp
a x
a y
b x
b y
dataframe-unique
involves transposing the alist to a row-based structure to remove duplicates and then transposing back to the column-based structure. This is another example of me choosing a straightforward solution over an efficient one [2].
(define (transpose ls) (apply map list ls))
;; https://stackoverflow.com/questions/8382296/scheme-remove-duplicated-numbers-from-list
(define (remove-duplicates ls)
(cond [(null? ls)
'()]
[(member (car ls) (cdr ls))
(remove-duplicates (cdr ls))]
[else
(cons (car ls) (remove-duplicates (cdr ls)))]))
> (remove-duplicates '((trt "a" "a" "a") (grp "x" "x" "x")))
((trt "a" "a" "a") (grp "x" "x" "x"))
> (transpose '((trt "a" "a" "a") (grp "x" "x" "x")))
((trt grp) ("a" "x") ("a" "x") ("a" "x"))
> (remove-duplicates (cdr (transpose '((trt "a" "a" "a") (grp "x" "x" "x")))))
(("a" "x"))
We loop through the rows of the unique groups and partition the dataframe [3]. dataframe-partition
returns two dataframes. The keep
and drop
dataframes contain the rows where the filter-expr
is #t
and #f
, respectively.
> (define-values (keep drop)
(dataframe-partition
df1 (filter-expr (trt grp) (and (string=? trt "a") (string=? grp "x")))))
> (dataframe-display keep)
dim: 3 rows x 3 cols
trt grp rsp
a x 1.
a x 1.
a x 1.
> (dataframe-display drop)
dim: 9 rows x 3 cols
trt grp rsp
a y 2.
a y 2.
a y 2.
b x 3.
b x 3.
b x 3.
b y 4.
b y 4.
b y 4.
The keep
dataframe becomes the first dataframe in the list of dataframes returned by dataframe-split
. The algorithm continues looping through the rows of unique groups and partitions the drop
dataframe in each subsequent iteration.
Bind
For binding by rows, we will use functions from dplyr
. In the first example, all dataframes in the list have the same columns.
> dplyr::bind_rows(split(df1, list(df1$trt, df1$grp)))
trt grp rsp
1 a x 1
2 a x 1
3 a x 1
4 b x 3
5 b x 3
6 b x 3
7 a y 2
8 a y 2
9 a y 2
10 b y 4
11 b y 4
12 b y 4
dataframe-bind
works similarly to dplyr::bind_rows
, but we need to use apply
to bust open the list of dataframes created by dataframe-split
.
> (dataframe-display (apply dataframe-bind (dataframe-split df1 'trt 'grp)) 12)
dim: 12 rows x 3 cols
trt grp rsp
a x 1.
a x 1.
a x 1.
a y 2.
a y 2.
a y 2.
b x 3.
b x 3.
b x 3.
b y 4.
b y 4.
b y 4.
To show how to bind dataframes with different columns, let's split up df1
.
df_a <- dplyr::filter(df1, trt == "a")
df_b <- dplyr::filter(df1, trt == "b")
(define-values (df-a df-b)
(dataframe-partition df1 (filter-expr (trt) (string=? trt "a"))))
dplyr::bind_rows
fills missing columns with NA
.
dplyr::bind_rows(df_a, df_b[,c("trt", "grp")])
trt grp rsp
1 a x 1
2 a x 1
3 a x 1
4 a y 2
5 a y 2
6 a y 2
7 b x NA
8 b x NA
9 b x NA
10 b y NA
11 b y NA
12 b y NA
Because Chez Scheme doesn't have explicit missing values, I created a separate procedure, dataframe-bind-all
, for binding dataframes where missing columns are filled by the specified missing value.
> (dataframe-display (dataframe-bind-all -999 df-a (dataframe-drop df-b 'rsp)) 12)
dim: 12 rows x 3 cols
trt grp rsp
a x 1.
a x 1.
a x 1.
a y 2.
a y 2.
a y 2.
b x -999.
b x -999.
b x -999.
b y -999.
b y -999.
b y -999.
In contrast, dataframe-bind
will drop all columns not shared across the dataframes being bound.
> (dataframe-display (dataframe-bind df-a (dataframe-drop df-b 'rsp)) 12)
dim: 12 rows x 2 cols
trt grp
a x
a x
a x
a y
a y
a y
b x
b x
b x
b y
b y
b y
Final thoughts
With the exception of dataframe-split
, all of the procedures described in the first three posts in the dataframe series involve straightforward composition of Scheme's fundamental procedures (e.g., map
, apply
, append
, cons
, car
, cdr
, etc.) on Scheme's core data structure, i.e., lists. The next couple of posts involve procedures that forced me to wrestle with tradeoffs between convenient syntax via macros (e.g., filter-expr
) and familiarity/consistency with Chez Scheme's standard library. In the next post, I will describe how to filter, partition, and sort dataframes in Chez Scheme.
[1] I'm illustrating the ideas with the user-facing dataframe
procedures, but inside dataframe-split
, and most dataframe
procedures, are procedures that work directly on alists to avoid the overhead of unwrapping and rewrapping the alists as dataframes.
[2] This is not to say that I know a more efficient solution, but, rather, that I opted for a straightforward solution even though it contains the (significant?) overhead of transposing the list a couple of times.
[3] dataframe-partition
will be covered in the next post of this series.