Dataframe manipulation in Scheme
This is yet another post where I try out my dataframe
library for Scheme (R6RS) on examples found in blog posts. This blog post demonstrates data manipulation with dplyr
(R). I provide only the Scheme code and some commentary. You will have to click through to the original post to see how it compares to R.
Library and Data
First, we import the dataframe
library and the shuffle
procedure from chez-stats
. The original post uses the Palmer Penguins dataset, which I wrote to a csv file from R. We read the data from file and select only the columns used in this post. dataframe-display
is similar to head
in R and dataframe-glimpse
is similar to str
, but dataframe
currently provides no functionality comparable to summary
in R.
(import (dataframe)
(only (chez-stats)
shuffle))
define dat
("penguins.csv")
(-> (csv->dataframe
(dataframe-select* species body_mass_g sex year)))
(dataframe-display dat)
344 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 3750 male 2007.
Adelie 3800 female 2007.
Adelie 3250 female 2007.
Adelie 2007.
Adelie na na 3450 female 2007.
Adelie 3650 male 2007.
Adelie 3625 female 2007.
Adelie 4675 male 2007.
Adelie 3475 na 2007.
Adelie 4250 na 2007.
Adelie
(dataframe-glimpse dat)
344 rows x 4 cols
dim: ...
species <str> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, 3750, 3800, 3250, na, 3450, 3650, 3625, 4675, 3475, ...
body_mass_g <num> ...
sex <str> male, female, female, na, female, male, female, male, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, ... year <num>
Filter Observations
The dataframe
library includes some handling of missing values, represented as 'na
, but nothing currently that will automatically drop 'na
values while filtering as in dplyr::filter
. A couple of options for handling 'na
are presented in the code below. In the original blog post, they built up to a compound filter with pipe operators, but here I will just cut to the chase and show the final version.
;; this version is more readable, but requires two passes through the dataframe
;; i.e., first with dataframe-remove-na and then with dataframe-filter
(-> dat
(dataframe-remove-na 'body_mass_g 'sex)
(dataframe-filter*
(body_mass_g sex)and (> body_mass_g 4000)
(string=? sex "female")))
(
(dataframe-display))
;; this version is more awkward to read (and write)
;; but does all the filtering in one pass
(-> dat
(dataframe-filter*
(body_mass_g sex)and (and (not (na? body_mass_g))
(> body_mass_g 4000))
(and (not (na? sex))
(string=? sex "female"))))
(
(dataframe-display))
58 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 4500. female 2007.
Gentoo 4450. female 2007.
Gentoo 4550. female 2007.
Gentoo 4800. female 2007.
Gentoo 4400. female 2007.
Gentoo 4650. female 2007.
Gentoo 4650. female 2007.
Gentoo 4200. female 2007.
Gentoo 4150. female 2007.
Gentoo 4800. female 2007. Gentoo
Extract Observations
The dataframe
library doesn’t include a slice function, but we can accomplish similar functionality with other dataframe
procedures. You can slice by row indices using dataframe-ref
. dataframe-head
is similar to slice_head
. dataframe-tail
works similarly to list-tail
in Scheme but differently than slice_tail
in dplyr
and requires jumping through some extra hoops to replicate slice_tail
.
1 4 36)))
(dataframe-display (dataframe-ref dat '(
3 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 3800. female 2007.
Adelie 3450. female 2007.
Adelie 3950. male 2007.
Adelie
3))
(dataframe-display (dataframe-head dat
3 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 3750. male 2007.
Adelie 3800. female 2007.
Adelie 3250. female 2007.
Adelie
- (car (dataframe-dim dat)) 3)))
(dataframe-display (dataframe-tail dat (
3 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 3775. male 2009.
Chinstrap 4100. male 2009.
Chinstrap 3775. female 2009. Chinstrap
It takes considerably more code to replicate slice_min
and slice_max
from dplyr
.
;; equivalent of `slice_min` example in original post
(-> (dataframe-filter*
dat
(body_mass_g)and (not (na? body_mass_g))
(<= body_mass_g (quantile ($ dat 'body_mass_g) 0.25 7))))
(< body_mass_g))
(dataframe-sort* (
(dataframe-display))
;; previous version only works with filter as the first step in the pipe
;; this version works with filter step anywhere in the pipe
(-> dat
(dataframe-remove-na 'body_mass_g)
(->>lambda (dfx)
((
(dataframe-filter*
dfx
(body_mass_g)<= body_mass_g (quantile ($ dfx 'body_mass_g) 0.25 7))))))
(< body_mass_g))
(dataframe-sort* (
(dataframe-display))
89 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 2700. female 2008.
Chinstrap 2850. female 2008.
Adelie 2850. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2009.
Adelie 2900. female 2007.
Chinstrap 2925. female 2009.
Adelie 2975. na 2007.
Adelie 3000. female 2007.
Adelie
;; equivalent of `slice_max` example in original post
(-> (dataframe-filter*
dat
(body_mass_g)and (not (na? body_mass_g))
(>= body_mass_g (quantile ($ dat 'body_mass_g) 0.75 7))))
(< body_mass_g))
(dataframe-sort* (
(dataframe-display))
90 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 4750. male 2008.
Gentoo 4750. female 2008.
Gentoo 4750. female 2008.
Gentoo 4750. female 2009.
Gentoo 4750. female 2009.
Gentoo 4775. male 2009.
Adelie 4800. female 2007.
Gentoo 4800. female 2007.
Gentoo 4800. male 2008.
Chinstrap 4850. female 2008. Gentoo
Sample Observations
The equivalent of sample_n
and sample_frac
don’t currently exist in dataframe
, but it doesn’t take much code to reproduce that functionality. Here, we lean on shuffle
from chez-stats
to randomly arrange the indices and then call dataframe-ref
.
define (get-first-n lst n)
(let loop ([lst lst]
([out '()]
[count 0])
if (or (null? lst) (= count n))
(reverse out)
(cdr lst) (cons (car lst) out) (add1 count)))))
(loop (
define (sample-n df n)
(let* ([ind1 (iota (car (dataframe-dim df)))]
([ind2 (get-first-n (shuffle ind1) n)])
(dataframe-ref df ind2)))
define (sample-frac df frac)
(let* ([rows (car (dataframe-dim df))]
([n (floor (inexact (* rows frac)))])
(sample-n df n)))
3))
(dataframe-display (sample-n dat
3 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 4250. male 2009.
Adelie 5250. male 2009.
Gentoo 4600. female 2008.
Gentoo
1/2))
(dataframe-display (sample-frac dat
172 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 3325. female 2007.
Adelie 3275. female 2009.
Adelie 3475. female 2009.
Adelie 3650. male 2007.
Chinstrap 4550. female 2007.
Gentoo 3675. female 2009.
Chinstrap 4600. male 2007.
Adelie 4650. female 2007.
Gentoo 3325. female 2009.
Chinstrap 3075. female 2009. Adelie
Sort Observations
Sorting also requires explicitly handling missing values. dataframe-sort
first sorts based on the expression on the left and then moves right through all other expressions (in this case, it means first sorting by body mass and then sex).
(-> dat
(dataframe-remove-na 'body_mass_g 'sex)< body_mass_g) (string<? sex))
(dataframe-sort* (
(dataframe-display))
333 rows x 4 cols
dim:
species body_mass_g sex year
<str> <num> <str> <num> 2700. female 2008.
Chinstrap 2850. female 2008.
Adelie 2850. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2008.
Adelie 2900. female 2009.
Adelie 2900. female 2007.
Chinstrap 2925. female 2009.
Adelie 3000. female 2007.
Adelie 3000. female 2009. Adelie
Select Variables
dplyr
’s select
provides particularly rich functionality for selecting columns from a dataframe. In the dataframe
library, you can only select or drop columns by exact column names.
(dataframe-display (dataframe-select* dat body_mass_g year))
344 rows x 2 cols
dim:
body_mass_g year
<num> <num> 3750 2007.
3800 2007.
3250 2007.
2007.
na 3450 2007.
3650 2007.
3625 2007.
4675 2007.
3475 2007.
4250 2007.
(dataframe-display (dataframe-drop* dat body_mass_g year))
344 rows x 2 cols
dim:
species sex
<str> <str>
Adelie male
Adelie female
Adelie female
Adelie na
Adelie female
Adelie male
Adelie female
Adelie male
Adelie na
Adelie na
;; re-arrange column order
(dataframe-display (dataframe-select* dat sex species body_mass_g year))
344 rows x 4 cols
dim:
sex species body_mass_g year
<str> <str> <num> <num> 3750 2007.
male Adelie 3800 2007.
female Adelie 3250 2007.
female Adelie 2007.
na Adelie na 3450 2007.
female Adelie 3650 2007.
male Adelie 3625 2007.
female Adelie 4675 2007.
male Adelie 3475 2007.
na Adelie 4250 2007. na Adelie
Rename Variables
Renaming columns works similarly to dplyr
, but the order is (old-name new-name)
not new_name = old_name
.
(dataframe-display
(dataframe-rename* dat (body_mass_g body_mass) (year study_year)))
344 rows x 4 cols
dim:
species body_mass sex study_year
<str> <num> <str> <num> 3750 male 2007.
Adelie 3800 female 2007.
Adelie 3250 female 2007.
Adelie 2007.
Adelie na na 3450 female 2007.
Adelie 3650 male 2007.
Adelie 3625 female 2007.
Adelie 4675 male 2007.
Adelie 3475 na 2007.
Adelie 4250 na 2007. Adelie
Create or Modify Variables
Again, we have to explicitly handle 'na
values because /
and <
don’t handle symbols and dataframe
doesn’t provide alternatives. In the second example, we can create a new column that is not based on any columns in the dataframe by not specifying any columns ()
to use in the expression.
(dataframe-display
(dataframe-modify*
dat
(body_mass_kg
(body_mass_g)if (na? body_mass_g) 'na (inexact (/ body_mass_g 1000))))))
(
344 rows x 5 cols
dim:
species body_mass_g sex year body_mass_kg
<str> <num> <str> <num> <num> 3750 male 2007. 3.75
Adelie 3800 female 2007. 3.8
Adelie 3250 female 2007. 3.25
Adelie 2007. na
Adelie na na 3450 female 2007. 3.45
Adelie 3650 male 2007. 3.65
Adelie 3625 female 2007. 3.625
Adelie 4675 male 2007. 4.675
Adelie 3475 na 2007. 3.475
Adelie 4250 na 2007. 4.25
Adelie
(dataframe-display
(dataframe-modify*
datcar (dataframe-dim dat))))))
(id () (iota (
344 rows x 5 cols
dim:
species body_mass_g sex year id
<str> <num> <str> <num> <num> 3750 male 2007. 0.
Adelie 3800 female 2007. 1.
Adelie 3250 female 2007. 2.
Adelie 2007. 3.
Adelie na na 3450 female 2007. 4.
Adelie 3650 male 2007. 5.
Adelie 3625 female 2007. 6.
Adelie 4675 male 2007. 7.
Adelie 3475 na 2007. 8.
Adelie 4250 na 2007. 9.
Adelie
(dataframe-display
(dataframe-modify*
dat
(body_mass_cat
(body_mass_g)if (na? body_mass_g)
(
'naif (> body_mass_g 4000) "High" "Low")))))
(
344 rows x 5 cols
dim:
species body_mass_g sex year body_mass_cat
<str> <num> <str> <num> <str> 3750 male 2007. Low
Adelie 3800 female 2007. Low
Adelie 3250 female 2007. Low
Adelie 2007. na
Adelie na na 3450 female 2007. Low
Adelie 3650 male 2007. Low
Adelie 3625 female 2007. Low
Adelie 4675 male 2007. High
Adelie 3475 na 2007. Low
Adelie 4250 na 2007. High
Adelie
(dataframe-display
(dataframe-modify*
dat
(body_mass_cat
(body_mass_g)cond [(na? body_mass_g) 'na]
([(< body_mass_g 3500) "Low"]
[(> body_mass_g 4750) "High"]
[else "Medium"]))))
344 rows x 5 cols
dim:
species body_mass_g sex year body_mass_cat
<str> <num> <str> <num> <str> 3750 male 2007. Medium
Adelie 3800 female 2007. Medium
Adelie 3250 female 2007. Low
Adelie 2007. na
Adelie na na 3450 female 2007. Low
Adelie 3650 male 2007. Medium
Adelie 3625 female 2007. Medium
Adelie 4675 male 2007. Medium
Adelie 3475 na 2007. Low
Adelie 4250 na 2007. Medium Adelie
Identify Distinct Values
(-> dat
(dataframe-select* species)
(dataframe-unique)
(dataframe-display))
3 rows x 1 cols
dim:
species
<str>
Adelie
Gentoo
Chinstrap
(-> dat
(dataframe-select* species year)
(dataframe-unique)
(dataframe-display))
9 rows x 2 cols
dim:
species year
<str> <num> 2007.
Adelie 2008.
Adelie 2009.
Adelie 2007.
Gentoo 2008.
Gentoo 2009.
Gentoo 2007.
Chinstrap 2008.
Chinstrap 2009. Chinstrap
Aggregate Observations
(dataframe-display
(dataframe-aggregate*
dat
(species sex)inexact (mean body_mass_g)))
(mean (body_mass_g) (
(sd (body_mass_g) (standard-deviation body_mass_g))))
8 rows x 4 cols
dim:
species sex mean sd
<str> <str> <num> <num> 4043.4932 346.8116
Adelie male 3368.8356 269.3801
Adelie female 3540.0000 477.1661
Adelie na 4679.7414 281.5783
Gentoo female 5484.8361 313.1586
Gentoo male 4587.5000 338.1937
Gentoo na 3527.2059 285.3339
Chinstrap female 3938.9706 362.1376
Chinstrap male
(dataframe-display
(dataframe-aggregate*
dat
(species)length species))))
(n_obs (species) (
3 rows x 2 cols
dim:
species n_obs
<str> <num> 152.
Adelie 124.
Gentoo 68. Chinstrap