Dataframe manipulation in Scheme

Published

April 22, 2024

Modified

May 10, 2024

This is yet another post where I try out my dataframe library for Scheme (R6RS) on examples found in blog posts. This blog post demonstrates data manipulation with dplyr (R). I provide only the Scheme code and some commentary. You will have to click through to the original post to see how it compares to R.

Library and Data

First, we import the dataframe library and the shuffle procedure from chez-stats. The original post uses the Palmer Penguins dataset, which I wrote to a csv file from R. We read the data from file and select only the columns used in this post. dataframe-display is similar to head in R and dataframe-glimpse is similar to str, but dataframe currently provides no functionality comparable to summary in R.

(import (dataframe)
        (only (chez-stats)
              shuffle))

(define dat
  (-> (csv->dataframe "penguins.csv")
      (dataframe-select* species body_mass_g sex year)))

(dataframe-display dat)

 dim: 344 rows x 4 cols
  species  body_mass_g     sex    year 
    <str>        <num>   <str>   <num> 
   Adelie         3750    male   2007. 
   Adelie         3800  female   2007. 
   Adelie         3250  female   2007. 
   Adelie           na      na   2007. 
   Adelie         3450  female   2007. 
   Adelie         3650    male   2007. 
   Adelie         3625  female   2007. 
   Adelie         4675    male   2007. 
   Adelie         3475      na   2007. 
   Adelie         4250      na   2007. 

(dataframe-glimpse dat)

 dim: 344 rows x 4 cols
 species      <str>   Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, ...       
 body_mass_g  <num>   3750, 3800, 3250, na, 3450, 3650, 3625, 4675, 3475, ...   
 sex          <str>   male, female, female, na, female, male, female, male, ... 
 year         <num>   2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, ...

Filter Observations

The dataframe library includes some handling of missing values, represented as 'na, but nothing currently that will automatically drop 'na values while filtering as in dplyr::filter. A couple of options for handling 'na are presented in the code below. In the original blog post, they built up to a compound filter with pipe operators, but here I will just cut to the chase and show the final version.

;; this version is more readable, but requires two passes through the dataframe
;;  i.e., first with dataframe-remove-na and then with dataframe-filter
(-> dat
    (dataframe-remove-na 'body_mass_g 'sex)
    (dataframe-filter*
     (body_mass_g sex)
     (and (> body_mass_g 4000)
          (string=? sex "female")))
    (dataframe-display))

;; this version is more awkward to read (and write)
;;  but does all the filtering in one pass
(-> dat
    (dataframe-filter*
     (body_mass_g sex)
     (and (and (not (na? body_mass_g))
               (> body_mass_g 4000))
          (and (not (na? sex))
               (string=? sex "female"))))
    (dataframe-display))

 dim: 58 rows x 4 cols
  species  body_mass_g     sex    year 
    <str>        <num>   <str>   <num> 
   Gentoo        4500.  female   2007. 
   Gentoo        4450.  female   2007. 
   Gentoo        4550.  female   2007. 
   Gentoo        4800.  female   2007. 
   Gentoo        4400.  female   2007. 
   Gentoo        4650.  female   2007. 
   Gentoo        4650.  female   2007. 
   Gentoo        4200.  female   2007. 
   Gentoo        4150.  female   2007. 
   Gentoo        4800.  female   2007.

Extract Observations

The dataframe library doesn’t include a slice function, but we can accomplish similar functionality with other dataframe procedures. You can slice by row indices using dataframe-ref. dataframe-head is similar to slice_head. dataframe-tail works similarly to list-tail in Scheme but differently than slice_tail in dplyr and requires jumping through some extra hoops to replicate slice_tail.

(dataframe-display (dataframe-ref dat '(1 4 36)))

 dim: 3 rows x 4 cols
  species  body_mass_g     sex    year 
    <str>        <num>   <str>   <num> 
   Adelie        3800.  female   2007. 
   Adelie        3450.  female   2007. 
   Adelie        3950.    male   2007. 

(dataframe-display (dataframe-head dat 3))

 dim: 3 rows x 4 cols
  species  body_mass_g     sex    year 
    <str>        <num>   <str>   <num> 
   Adelie        3750.    male   2007. 
   Adelie        3800.  female   2007. 
   Adelie        3250.  female   2007. 

(dataframe-display (dataframe-tail dat (- (car (dataframe-dim dat)) 3)))

 dim: 3 rows x 4 cols
    species  body_mass_g     sex    year 
      <str>        <num>   <str>   <num> 
  Chinstrap        3775.    male   2009. 
  Chinstrap        4100.    male   2009. 
  Chinstrap        3775.  female   2009.

It takes considerably more code to replicate slice_min and slice_max from dplyr.

;; equivalent of `slice_min` example in original post
(-> (dataframe-filter*
     dat
     (body_mass_g)
     (and (not (na? body_mass_g))
          (<= body_mass_g (quantile ($ dat 'body_mass_g) 0.25 7))))
    (dataframe-sort* (< body_mass_g))
    (dataframe-display))

;; previous version only works with filter as the first step in the pipe
;;  this version works with filter step anywhere in the pipe
(-> dat
    (dataframe-remove-na 'body_mass_g)
    (->>
     ((lambda (dfx)
        (dataframe-filter*
         dfx
         (body_mass_g)
         (<= body_mass_g (quantile ($ dfx 'body_mass_g) 0.25 7))))))
    (dataframe-sort* (< body_mass_g))
    (dataframe-display))

 dim: 89 rows x 4 cols
    species  body_mass_g     sex    year 
      <str>        <num>   <str>   <num> 
  Chinstrap        2700.  female   2008. 
     Adelie        2850.  female   2008. 
     Adelie        2850.  female   2008. 
     Adelie        2900.  female   2008. 
     Adelie        2900.  female   2008. 
     Adelie        2900.  female   2009. 
  Chinstrap        2900.  female   2007. 
     Adelie        2925.  female   2009. 
     Adelie        2975.      na   2007. 
     Adelie        3000.  female   2007.

;; equivalent of `slice_max` example in original post
(-> (dataframe-filter*
     dat
     (body_mass_g)
     (and (not (na? body_mass_g))
          (>= body_mass_g (quantile ($ dat 'body_mass_g) 0.75 7))))
    (dataframe-sort* (< body_mass_g))
    (dataframe-display))

 dim: 90 rows x 4 cols
    species  body_mass_g     sex    year 
      <str>        <num>   <str>   <num> 
     Gentoo        4750.    male   2008. 
     Gentoo        4750.  female   2008. 
     Gentoo        4750.  female   2008. 
     Gentoo        4750.  female   2009. 
     Gentoo        4750.  female   2009. 
     Adelie        4775.    male   2009. 
     Gentoo        4800.  female   2007. 
     Gentoo        4800.  female   2007. 
  Chinstrap        4800.    male   2008. 
     Gentoo        4850.  female   2008.

Sample Observations

The equivalent of sample_n and sample_frac don’t currently exist in dataframe, but it doesn’t take much code to reproduce that functionality. Here, we lean on shuffle from chez-stats to randomly arrange the indices and then call dataframe-ref.

(define (get-first-n lst n)
  (let loop ([lst lst]
             [out '()]
             [count 0])
    (if (or (null? lst) (= count n))
        (reverse out)
        (loop (cdr lst) (cons (car lst) out) (add1 count)))))

(define (sample-n df n)
  (let* ([ind1 (iota (car (dataframe-dim df)))]
         [ind2 (get-first-n (shuffle ind1) n)])
    (dataframe-ref df ind2)))

(define (sample-frac df frac)
  (let* ([rows (car (dataframe-dim df))]
         [n (floor (inexact (* rows frac)))])
    (sample-n df n)))

(dataframe-display (sample-n dat 3))

 dim: 3 rows x 4 cols
  species  body_mass_g     sex    year 
    <str>        <num>   <str>   <num> 
   Adelie        4250.    male   2009. 
   Gentoo        5250.    male   2009. 
   Gentoo        4600.  female   2008. 

 (dataframe-display (sample-frac dat 1/2))

 dim: 172 rows x 4 cols
    species  body_mass_g     sex    year 
      <str>        <num>   <str>   <num> 
     Adelie        3325.  female   2007. 
     Adelie        3275.  female   2009. 
     Adelie        3475.  female   2009. 
  Chinstrap        3650.    male   2007. 
     Gentoo        4550.  female   2007. 
  Chinstrap        3675.  female   2009. 
     Adelie        4600.    male   2007. 
     Gentoo        4650.  female   2007. 
  Chinstrap        3325.  female   2009. 
     Adelie        3075.  female   2009.

Sort Observations

Sorting also requires explicitly handling missing values. dataframe-sort first sorts based on the expression on the left and then moves right through all other expressions (in this case, it means first sorting by body mass and then sex).

(-> dat
    (dataframe-remove-na 'body_mass_g 'sex)
    (dataframe-sort* (< body_mass_g) (string<? sex))
    (dataframe-display))

 dim: 333 rows x 4 cols
    species  body_mass_g     sex    year 
      <str>        <num>   <str>   <num> 
  Chinstrap        2700.  female   2008. 
     Adelie        2850.  female   2008. 
     Adelie        2850.  female   2008. 
     Adelie        2900.  female   2008. 
     Adelie        2900.  female   2008. 
     Adelie        2900.  female   2009. 
  Chinstrap        2900.  female   2007. 
     Adelie        2925.  female   2009. 
     Adelie        3000.  female   2007. 
     Adelie        3000.  female   2009.

Select Variables

dplyr’s select provides particularly rich functionality for selecting columns from a dataframe. In the dataframe library, you can only select or drop columns by exact column names.

(dataframe-display (dataframe-select* dat body_mass_g year))

 dim: 344 rows x 2 cols
  body_mass_g    year 
        <num>   <num> 
         3750   2007. 
         3800   2007. 
         3250   2007. 
           na   2007. 
         3450   2007. 
         3650   2007. 
         3625   2007. 
         4675   2007. 
         3475   2007. 
         4250   2007. 

(dataframe-display (dataframe-drop* dat body_mass_g year))

 dim: 344 rows x 2 cols
  species     sex 
    <str>   <str> 
   Adelie    male 
   Adelie  female 
   Adelie  female 
   Adelie      na 
   Adelie  female 
   Adelie    male 
   Adelie  female 
   Adelie    male 
   Adelie      na 
   Adelie      na 

;; re-arrange column order
(dataframe-display (dataframe-select* dat sex species body_mass_g year))

 dim: 344 rows x 4 cols
     sex  species  body_mass_g    year 
   <str>    <str>        <num>   <num> 
    male   Adelie         3750   2007. 
  female   Adelie         3800   2007. 
  female   Adelie         3250   2007. 
      na   Adelie           na   2007. 
  female   Adelie         3450   2007. 
    male   Adelie         3650   2007. 
  female   Adelie         3625   2007. 
    male   Adelie         4675   2007. 
      na   Adelie         3475   2007. 
      na   Adelie         4250   2007.

Rename Variables

Renaming columns works similarly to dplyr, but the order is (old-name new-name) not new_name = old_name.

(dataframe-display
 (dataframe-rename* dat (body_mass_g body_mass) (year study_year)))

 dim: 344 rows x 4 cols
  species  body_mass     sex  study_year 
    <str>      <num>   <str>       <num> 
   Adelie       3750    male       2007. 
   Adelie       3800  female       2007. 
   Adelie       3250  female       2007. 
   Adelie         na      na       2007. 
   Adelie       3450  female       2007. 
   Adelie       3650    male       2007. 
   Adelie       3625  female       2007. 
   Adelie       4675    male       2007. 
   Adelie       3475      na       2007. 
   Adelie       4250      na       2007.

Create or Modify Variables

Again, we have to explicitly handle 'na values because / and < don’t handle symbols and dataframe doesn’t provide alternatives. In the second example, we can create a new column that is not based on any columns in the dataframe by not specifying any columns () to use in the expression.

(dataframe-display
 (dataframe-modify*
  dat
  (body_mass_kg
   (body_mass_g)
   (if (na? body_mass_g) 'na (inexact (/ body_mass_g 1000))))))

 dim: 344 rows x 5 cols
  species  body_mass_g     sex    year  body_mass_kg 
    <str>        <num>   <str>   <num>         <num> 
   Adelie         3750    male   2007.          3.75 
   Adelie         3800  female   2007.           3.8 
   Adelie         3250  female   2007.          3.25 
   Adelie           na      na   2007.            na 
   Adelie         3450  female   2007.          3.45 
   Adelie         3650    male   2007.          3.65 
   Adelie         3625  female   2007.         3.625 
   Adelie         4675    male   2007.         4.675 
   Adelie         3475      na   2007.         3.475 
   Adelie         4250      na   2007.          4.25 

(dataframe-display
 (dataframe-modify*
  dat
  (id () (iota (car (dataframe-dim dat))))))

 dim: 344 rows x 5 cols
  species  body_mass_g     sex    year      id 
    <str>        <num>   <str>   <num>   <num> 
   Adelie         3750    male   2007.      0. 
   Adelie         3800  female   2007.      1. 
   Adelie         3250  female   2007.      2. 
   Adelie           na      na   2007.      3. 
   Adelie         3450  female   2007.      4. 
   Adelie         3650    male   2007.      5. 
   Adelie         3625  female   2007.      6. 
   Adelie         4675    male   2007.      7. 
   Adelie         3475      na   2007.      8. 
   Adelie         4250      na   2007.      9. 

(dataframe-display
 (dataframe-modify*
  dat
  (body_mass_cat
   (body_mass_g)
   (if (na? body_mass_g)
       'na
       (if (> body_mass_g 4000) "High" "Low")))))

 dim: 344 rows x 5 cols
  species  body_mass_g     sex    year  body_mass_cat 
    <str>        <num>   <str>   <num>          <str> 
   Adelie         3750    male   2007.            Low 
   Adelie         3800  female   2007.            Low 
   Adelie         3250  female   2007.            Low 
   Adelie           na      na   2007.             na 
   Adelie         3450  female   2007.            Low 
   Adelie         3650    male   2007.            Low 
   Adelie         3625  female   2007.            Low 
   Adelie         4675    male   2007.           High 
   Adelie         3475      na   2007.            Low 
   Adelie         4250      na   2007.           High 

(dataframe-display
 (dataframe-modify*
  dat
  (body_mass_cat
   (body_mass_g)
   (cond [(na? body_mass_g) 'na]
         [(< body_mass_g 3500) "Low"]
         [(> body_mass_g 4750) "High"]
         [else "Medium"]))))

 dim: 344 rows x 5 cols
  species  body_mass_g     sex    year  body_mass_cat 
    <str>        <num>   <str>   <num>          <str> 
   Adelie         3750    male   2007.         Medium 
   Adelie         3800  female   2007.         Medium 
   Adelie         3250  female   2007.            Low 
   Adelie           na      na   2007.             na 
   Adelie         3450  female   2007.            Low 
   Adelie         3650    male   2007.         Medium 
   Adelie         3625  female   2007.         Medium 
   Adelie         4675    male   2007.         Medium 
   Adelie         3475      na   2007.            Low 
   Adelie         4250      na   2007.         Medium

Identify Distinct Values

(-> dat
    (dataframe-select* species)
    (dataframe-unique)
    (dataframe-display))

 dim: 3 rows x 1 cols
    species 
      <str> 
     Adelie 
     Gentoo 
  Chinstrap 

(-> dat
    (dataframe-select* species year)
    (dataframe-unique)
    (dataframe-display))

 dim: 9 rows x 2 cols
    species    year 
      <str>   <num> 
     Adelie   2007. 
     Adelie   2008. 
     Adelie   2009. 
     Gentoo   2007. 
     Gentoo   2008. 
     Gentoo   2009. 
  Chinstrap   2007. 
  Chinstrap   2008. 
  Chinstrap   2009.

Aggregate Observations

(dataframe-display
 (dataframe-aggregate*
  dat
  (species sex)
  (mean (body_mass_g) (inexact (mean body_mass_g)))
  (sd (body_mass_g) (standard-deviation body_mass_g))))

 dim: 8 rows x 4 cols
    species     sex       mean        sd 
      <str>   <str>      <num>     <num> 
     Adelie    male  4043.4932  346.8116 
     Adelie  female  3368.8356  269.3801 
     Adelie      na  3540.0000  477.1661 
     Gentoo  female  4679.7414  281.5783 
     Gentoo    male  5484.8361  313.1586 
     Gentoo      na  4587.5000  338.1937 
  Chinstrap  female  3527.2059  285.3339 
  Chinstrap    male  3938.9706  362.1376 

(dataframe-display
 (dataframe-aggregate*
  dat
  (species)
  (n_obs (species) (length species))))

 dim: 3 rows x 2 cols
    species   n_obs 
      <str>   <num> 
     Adelie    152. 
     Gentoo    124. 
  Chinstrap     68.