Skip to content

Travis Hinkelman

Modify and aggregate dataframes in Chez Scheme

This post is part of a series on the dataframe library for Chez Scheme. In this post, I will contrast the dataframe library with functions from base R and the dplyr package for modifying and aggregating dataframes.

Set up

First, let's create a dataframe in both languages.

df <- data.frame(grp = c("a", "a", "b", "b", "b"),
                 trt = c("x", "y", "x", "y", "y"),
                 adult = c(1, 2, 3, 4, 5),
                 juv = c(10, 20, 30, 40, 50))
                 
(define df
  (make-dataframe '((grp "a" "a" "b" "b" "b")
                    (trt "x" "y" "x" "y" "y")
                    (adult 1 2 3 4 5)
                    (juv 10 20 30 40 50))))

Mutate/Modify [1]

In R, dplyr::mutate changes all the values in a column according to the expression provided. If the column name exists in the dataframe, then the old column is overwritten. If the column name doesn't exist, then a new column is created at the end of the dataframe. A scalar value is filled across all rows.

> df2 <- dplyr::mutate(df, 
                       grp = toupper(grp),
                       total = adult + juv,
                       scalar = 16,
                       lst = c(2, 4, 6, 8, 10))

> df2
  grp trt adult juv total scalar lst
1   A   x     1  10    11     16   2
2   A   y     2  20    22     16   4
3   B   x     3  30    33     16   6
4   B   y     4  40    44     16   8
5   B   y     5  50    55     16  10

dataframe-modify takes a modify-expr (see more below) to replicate the core behavior of dplyr::mutate. When passing values directly (e.g., scalar or list with length equal to number of rows), the column names used in the expression need to be explicitly specified as missing with ().

> (define df2
    (dataframe-modify df (modify-expr (grp (grp) (string-upcase grp))
                                      (total (adult juv) (+ adult juv))
                                      (scalar () 16)
                                      (lst () '(2 4 6 8 10)))))
> (dataframe-display df2)
       grp       trt     adult       juv     total    scalar       lst
         A         x         1        10        11        16         2
         A         y         2        20        22        16         4
         B         x         3        30        33        16         6
         B         y         4        40        44        16         8
         B         y         5        50        55        16        10

dplyr also provides mutate_at and mutate_all [2].

> dplyr::mutate_at(df2, c("total", "scalar", "lst"), sqrt)
  grp trt adult juv    total scalar      lst
1   A   x     1  10 3.316625      4 1.414214
2   A   y     2  20 4.690416      4 2.000000
3   B   x     3  30 5.744563      4 2.449490
4   B   y     4  40 6.633250      4 2.828427
5   B   y     5  50 7.416198      4 3.162278

dataframe-modify-at works similarly, but the procedure, sqrt in this example, is constrained to only accept one argument.

> (dataframe-display
   (dataframe-modify-at df2 sqrt 'total 'scalar 'lst))
       grp       trt     adult       juv     total    scalar       lst
         A         x         1        10  3.316624         4  1.414213
         A         y         2        20  4.690415         4         2
         B         x         3        30  5.744562         4  2.449489
         B         y         4        40  6.633249         4  2.828427
         B         y         5        50  7.416198         4  3.162277

Implementation

modify-expr is a macro that allows for a more concise syntax when writing the expressions used to modify a dataframe.

(define-syntax modify-expr
  (syntax-rules ()
    [(_ (new-name names expr) ...)
     (list
      (list (quote new-name) ...)
      (list (quote names) ...)
      (list (lambda names expr) ...))]))

The following modify-expr

(modify-expr (grp (grp) (string-upcase grp))
             (total (adult juv) (+ adult juv))
             (scalar () 16)
             (lst () '(2 4 6 8 10)))

expands to

'((grp total scalar lst)
  ((grp) (adult juv) () ())
  (lambda (grp) (string-upcase grp))
  (lambda (adult juv) (+ adult juv))
  (lambda () 16)
  (lambda () '(2 4 6 8 10)))

In previous posts on macros used in filtering and sorting dataframes, I've acknowledged that the filter-expr and sort-expr macros don't provide a very compelling simplification. In this case, though, the modify-expr macro is helpful both for reducing the number of characters and keeping the pieces of the expression together. The use of filter-expr and sort-expr follows from modify-expr and provides more consistent syntax across all of the dataframe procedures.

Aggregate

In base R, dataframes are aggregated by first splitting into groups, applying the summary statistic, and then complining the pieces. This example uses the formula syntax where the left side indicates the columns to be summarized and the right side indicates the grouping variables.

> aggregate(cbind(adult, juv) ~ grp + trt, data = df, sum)
  grp trt adult juv
1   a   x     1  10
2   b   x     3  30
3   a   y     2  20
4   b   y     9  90

dataframe-aggregate also uses the split-apply-combine approach, but uses similar syntax to dataframe-modify.

> (dataframe-display
   (dataframe-aggregate df
                        '(grp trt)
                         (aggregate-expr (adult-sum (adult) (apply + adult))
                         (juv-sum (juv) (apply + juv)))))
       grp       trt  adult-sum   juv-sum
         a         x          1        10
         a         y          2        20
         b         x          3        30
         b         y          9        90

[1] I opted not to use the term mutate in Chez's datafame library because it felt too directly contradictory with the fact that dataframes are immutable. Arguably, this was a silly choice because modify is a synonym of mutate.

[2] These functions have been superseded by the use of across within mutate.