Filter, partition, and sort dataframes in Chez Scheme
This post is the fourth in a series on the dataframe library for Chez Scheme. In this post, I will contrast the
dataframe library with functions from the
dplyr R package for filtering, partitioning, and sorting dataframes. And discuss implementation decisions in the
First, let’s create a dataframe in both languages. The
rep procedure was introduced in the previous post, but it is not part of the
df <- data.frame(trt = rep(c("a", "b"), each = 6), grp = rep(rep(c("x", "y"), each = 3), times = 2), rsp = rep(1:4, each = 3), ind = 0:11) (define df (make-dataframe (list (cons 'trt (rep '("a" "b") 6 'each)) (cons 'grp (rep (rep '("x" "y") 3 'each) 2 'times)) (cons 'rsp (rep '(1 2 3 4) 3 'each)) (cons 'ind (iota 12)))))
dplyr::filter takes an expression and returns the rows of the dataframe where the expression is
> dplyr::filter(df, trt == "a" & grp == "y") trt grp rsp ind 1 a y 2 3 2 a y 2 4 3 a y 2 5
dataframe-filter takes a
filter-expr (see more below) and returns the rows where the
> (dataframe-display (dataframe-filter df (filter-expr (trt grp) (and (string=? trt "a") (string=? grp "y"))))) trt grp rsp ind a y 2 3 a y 2 4 a y 2 5
filter-expr is a macro that allows for a slightly more concise syntax when writing the expressions used to filter a dataframe.
(define-syntax filter-expr (syntax-rules () [(_ names expr) (list (quote names) (lambda names expr))]))
I spent a lot of time wrestling with whether I should use
eval or macros to simplify the syntax in my
dataframe procedures. Or whether I should just stick to passing lambda expressions around. I pretty quickly concluded that I should avoid
eval thanks to some guidance from Reddit and was intrigued by some suggested neat tricks that didn’t involve
eval or macros. Eventually, though, a better phrased StackOverflow question prompted comments and answers that gave me clarity on understanding simple macros.
(filter-expr (trt grp) (and (string=? trt "a") (string=? grp "y")))
'((trt grp) (lambda (trt grp) (and (string=? trt "a") (string=? grp "y"))))
Admittedly, that is not a very compelling simplification. My primary concern with the expanded form was that passing names (e.g.,
(trt grp)) separately and as part of the
lambda expression introduces a potential source of errors (e.g, names provided in different order).
proc are extracted from the expanded
proc is mapped over the dataframe columns identified by names.
filter-map returns a list of boolean values.
(define (filter-map df filter-expr) (let ([names (car filter-expr)] [proc (cadr filter-expr)]) (apply check-df-names df "(dataframe-filter df filter-expr)" names) (apply map proc (dataframe-values-map df names))))
The boolean values from
filter-map are used to filter each column with
(define (filter-vals bools vals) (let ([bools-vals (map cons bools vals)]) (map cdr (filter (lambda (x) (car x)) bools-vals))))
filter only accepts one list, I first zip the
bools to the
cons, filter on the
car, and then unzip with
cdr to get the filtered
Throughout my posts on writing Chez Scheme libraries, you will find frequent disclaimers about how my libraries are not written with performance in mind. I simply don’t understand Chez Scheme well enough to have good intuition about performance pitfalls. For example, you may have noticed that filtering a dataframe involves zipping the same list of boolean values to every column before filtering each column. If dataframes were row based, then the
bools could be added to the begining of every row with
cons and the whole dataframe filtered in one pass. That seems obviously better. However, a row-based structure makes
filter-map trickier. Or, rather, it makes the part where you extract columns trickier.
Instead of filtering on every column separately, I could transpose the dataframe to the row-based form, add the
bools, filter on the whole dataframe at once, remove the
bools, and transpose back to the column-based form. My guess is that transposing a dataframe to row based and back might generally be faster than zipping and unzipping every column with
bools, particularly if a dataframe has lots of columns, but I find it easier to think about operating on columns and I was striving for internal consistency across the dataframe procedures.
Because R doesn’t allow multiple return values, you would partition a dataframe with two
> dplyr::filter(df, grp == "x") trt grp rsp ind 1 a x 1 0 2 a x 1 1 3 a x 1 2 4 b x 3 6 5 b x 3 7 6 b x 3 8 > dplyr::filter(df, grp == "y") trt grp rsp ind 1 a y 2 3 2 a y 2 4 3 a y 2 5 4 b y 4 9 5 b y 4 10 6 b y 4 11
As mentioned in the previous post,
dataframe-partition allows for partitioning a dataframe based on the
> (define-values (keep drop) (dataframe-partition df (filter-expr (grp) (string=? grp "x")))) > (dataframe-display keep) trt grp rsp ind a x 1 0 a x 1 1 a x 1 2 b x 3 6 b x 3 7 b x 3 8 > (dataframe-display drop) trt grp rsp ind a y 2 3 a y 2 4 a y 2 5 b y 4 9 b y 4 10 b y 4 11
The dirty little secret of
dataframe-partition is that it is simply two calls to
dataframe-filter under the covers and, thus, requires two passes over the whole dataframe (more inefficiency!).
dplyr, dataframes are sorted on multiple columns with
arrange. By default, columns are sorted in ascending order with
desc used to indicate descending order. The dataframe is sorted in the order that column names are listed (left to right).
> dplyr::arrange(df, grp, desc(ind)) trt grp rsp ind 1 b x 3 8 2 b x 3 7 3 b x 3 6 4 a x 1 2 5 a x 1 1 6 a x 1 0 7 b y 4 11 8 b y 4 10 9 b y 4 9 10 a y 2 5 11 a y 2 4 12 a y 2 3
dataframe-sort takes a
sort-expr (see more below) and sorts the dataframe in the order that predicates and column names are listed (left to right) in the
> (dataframe-display (dataframe-sort df (sort-expr (string<? grp) (> ind))) 12) trt grp rsp ind b x 3 8 b x 3 7 b x 3 6 a x 1 2 a x 1 1 a x 1 0 b y 4 11 b y 4 10 b y 4 9 a y 2 5 a y 2 4 a y 2 3
sort-expr is a very simple macro.
(define-syntax sort-expr (syntax-rules () [(_ (predicate name) ...) (list (list predicate ...) (list (quote name) ...))]))
(sort-expr (string<? grp) (> ind))
(list (list string<? >) '(grp ind))
The justification for this macro is similar to
filter-expr, i.e., passing predicates and names together reduces likelihood of mixing up the order of the predicate list and the column name list. The macro is also slightly shorter than this alternative:
(list (cons string<? 'grp) (cons > 'ind))
Does that save enough keystrokes to justify using a macro? It will be interesting to see where I stand on this issue after I become a more experience Scheme programmer.
Here are the steps involved in
- Extract the columns selected for sorting from the dataframe.
- Calculate the weights for each column.2
- Pick arbitrary weight for first column.
- Loop through the rest of the columns and set weight of next column to the weight of the previous column divided by the length of unique values in the next column.
- Sort unique values in each column by the predicate provided for that column.
- Enumerate (rank) each unique value and multiply rank by weight.
- Match weighted ranks for unique values to all values in the column.
- Sum weighted ranks across columns for each row.
- Sort all columns in the dataframe by the weighted rank sums.
dataframe-sort is applied to each column and carries the same potential performance pitfalls described above for