I have done some recent work on my dataframe library for Scheme (R6RS) and thought I would run through the examples in the Data Transformation chapter of R for Data Science (R4DS). In this post, I won't reproduce any of the R code and will provide limited commentary on the Scheme code (which is also available via this gist).

Setup

The nycflights13::flights dataset is used for all of the examples shown below. I've written it to a file and posted it here.

(import (chez-stats)
        (dataframe))

(define flights
  (-> (read-delim "nycflights.tsv" #\tab)
      (rowtable->dataframe)))

;; using this as a check of missingness
(define (all-true? . x)
  (for-all (lambda (y) y) x))

The dataframe library has no explicit concept of missingness. Instead, I use #f to indicate missingness. #f is returned when string->number fails to return a valid represention of a number. By default, rowtable->dataframe attempts automatic conversion of strings to numbers. Unlike dplyr::filter, dataframe-filter* does not exclude missing values. Thus, in every dataframe-filter* in this post, we will include all-true? to retain only the rows where all values in the relevant columns are #t.

The flights dataset has 336,776 rows and 19 columns. Datasets of this size strain the dataframe library and provide a suboptimal experience, especially compared to R. Skipped sections indicate that the dataframe library has no equivalent capabilities.

4.1 Introduction

4.1.3

(-> flights
    (dataframe-filter*
     (arr_delay dest)
     (and (all-true? arr_delay)
          (string=? dest "IAH"))) 
    (dataframe-aggregate*
     (year month day)
     (arr_delay (arr_delay) (sum arr_delay)))
    (dataframe-display))

4.2 Rows

4.2.1

Piping all of the output to a dataframe-display will often not show clearly the operation was successful. The one liners below show that the operation yielded the expected result.

(define delayed-flights
  (-> flights
      (dataframe-filter*
       (dep_delay)
       (and (all-true? dep_delay)
            (> dep_delay 120)))))

(apply min ($ delayed-flights 'dep_delay))

(-> flights
    (dataframe-filter*
     (month day)
     (and (all-true? month day)
          (= month 1)
          (= day 1)))
    (dataframe-display))

(define jan-feb-flights
  (-> flights
      (dataframe-filter*
       (month)
       (and (all-true? month)
            (or (= month 1)
                (= month 2))))))

(dataframe-values-unique jan-feb-flights 'month)

(-> flights
    (dataframe-filter*
     (month)
     (and (all-true? month)
          (member month '(1 2))))
     (dataframe-display))

4.2.3

(-> flights
    (dataframe-filter*
     (year month day dep_time)
     (all-true? year month day dep_time))
    (dataframe-sort*
     (< year) (< month) (< day) (< dep_time))
    (dataframe-display))


(-> flights
    (dataframe-filter*
     (dep_delay)
     (all-true? dep_delay))
    (dataframe-sort*
     (> dep_delay))
    (dataframe-display))

4.2.4

dplyr::distinct has a parameter, .keep_all, that keeps all columns for the first occurrence of each unique combo. dataframe-unique does not have that functionality.

(-> flights
    (dataframe-unique)
    (dataframe-display))

(-> flights
    (dataframe-select 'origin 'dest)
    (dataframe-unique)
    (dataframe-display))

The dataframe library does not include a procedure comparable to dplyr::count, but the same result can be obtained with dataframe-aggregate*.

(-> flights
    (dataframe-aggregate*
     (origin dest)
     (n (origin) (length origin)))
    (dataframe-sort*
     (> n))
    (dataframe-display))

4.3 Columns

4.3.1

(-> flights
    (dataframe-filter*
     (dep_delay arr_delay distance air_time)
     (all-true? dep_delay arr_delay distance air_time))
    (dataframe-modify*
     (gain (dep_delay arr_delay) (- dep_delay arr_delay))
     (speed (distance air_time) (* (/ distance air_time) 60)))
    (dataframe-display))

dataframe-modify* has no equivalent to the .before and .after parameters of dplyr::mutate. Using a dataframe-select to get focal columns but this drops all other columns.

(-> flights
    (dataframe-filter*
     (dep_delay arr_delay distance air_time)
     (all-true? dep_delay arr_delay distance air_time))
    (dataframe-modify*
     (gain (dep_delay arr_delay) (- dep_delay arr_delay))
     (speed (distance air_time) (* (/ distance air_time) 60)))
    (dataframe-select 'year 'month 'day 'gain 'speed)
    (dataframe-display))

4.3.2

dplyr::select has many wonderful helper functions to do complicated select operations easily. dataframe-select is very simple.

(-> flights
    (dataframe-select 'year 'month 'day)
    (dataframe-display))

4.3.3

(-> flights
    (dataframe-rename '(tailnum tail_num))
    (dataframe-display))

4.4 The Pipe

(-> flights
    (dataframe-filter*
     (dest distance air_time)
     (and (all-true? dest distance air_time)
          (string=? dest "IAH")))
    (dataframe-modify*
     (speed
      (distance air_time)
      (exact->inexact (* (/ distance air_time) 60))))
    (dataframe-select
     'year 'month 'day 'dep_time 'carrier 'flight 'speed)
    (dataframe-sort*
     (> speed))
    (dataframe-display))

(dataframe-display
 (dataframe-sort*
  (dataframe-select
   (dataframe-modify*
    (dataframe-filter*
     flights
     (dest distance air_time)
     (and (all-true? dest distance air_time)
          (string=? dest "IAH")))
    (speed
     (distance air_time)
     (exact->inexact (* (/ distance air_time) 60))))
   'year 'month 'day 'dep_time 'carrier 'flight 'speed)
  (> speed)))

(define flights1
  (dataframe-filter*
   flights
   (dest distance air_time)
   (and (all-true? dest distance air_time)
        (string=? dest "IAH"))))

(define flights2
  (dataframe-modify*
   flights1
   (speed
    (distance air_time)
    (exact->inexact (* (/ distance air_time) 60)))))

(define flights3
  (dataframe-select
   flights2
   'year 'month 'day 'dep_time 'carrier 'flight 'speed))

(dataframe-display
 (dataframe-sort* flights3 (> speed)))

4.5 Groups

4.5.2

dplyr::group_by automatically sorts by grouping variables, but that step needs to be done explicitly in dataframe-aggregate*.

(-> flights
    (dataframe-filter*
     (dep_delay)
     (all-true? dep_delay))
    (dataframe-aggregate*
     (month)
     (avg_delay (dep_delay) (exact->inexact (mean dep_delay))))
    (dataframe-sort*
     (< month))
    (dataframe-display))

(-> flights
    (dataframe-filter*
     (dep_delay)
     (all-true? dep_delay))
    (dataframe-aggregate*
     (month)
     (avg_delay (dep_delay) (exact->inexact (mean dep_delay)))
     (n (dep_delay) (length dep_delay)))
    (dataframe-sort*
     (< month))
    (dataframe-display))

4.5.5

dataframe-aggregate* struggles when splitting a large dataframe into many groups (in this case, 365). The operation to summarize monthly results (not shown) is reasonably quick. In this case, I didn't bother to filter with all-true? because I knew there were no missing values for year, month, and day.

(define daily_flights
  (dataframe-aggregate*
   flights
   (year month day)
   (n (dep_time) (length dep_time))))