A dataframe record type for Scheme
As an exercise in my Scheme (R6RS) learning journey, I have implemented a dataframe record type and procedures for working with it. Dataframes are column-oriented, tabular data structures useful for data analysis and are found in several languages, including R, Python, Julia, and Go. In this post, I will introduce the dataframe record type and the basic procedures for working with dataframes. In subsequent posts, I will describe other dataframe procedures, e.g., filter, sort, and aggregate.
Series record type
A dataframe is based on the series record type. A series record is a list where every element is the same type (one of bool, chr, str, sym, num, or other). The series record type includes the series name, source list (src), converted list (lst), list type, and list length. Only name and src are required to create a series; lst, type, and length are derived from src. Failed type conversions produce 'na values.
define-record-type creates a predicate, series?, a constructor procedure, make-series, and accessor procedures for each field: series-name, series-lst, series-type, and series-length. [series-src is also created, but not exported to the dataframe namespace.] make-series* is a macro that provides an alternative syntax for making a series.
(define-record-type series
  (fields name src lst type length)
  (protocol
    (lambda (new)
      (lambda (name src)
        (check-series name src "(make-series name src)")
        (let* ([type (guess-type src 1000)]
               [lst (convert-type src type)])
          (new name src lst type (length lst)))))))
> (define s1 (make-series 'a '(1 2 3)))
> (define s2 (make-series* (a 1 2 3)))
> (series-equal? s1 s2)
#t
> (series? s1)
#t
> (series-name s1)
a
> (series-lst s1)
(1 2 3)
> (series-length s1)
3
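The definition of make-series* is not shown here. As a rough sketch of how such a macro could work, assuming it only quotes the name and collects the remaining values into a list, the example below uses a hypothetical stand-in record (simple-series) and a hypothetical macro name (make-simple-series*) rather than the real series type.

```scheme
(import (rnrs))

;; Hypothetical stand-in for the real series record: only name and lst.
(define-record-type simple-series
  (fields name lst))

;; Sketch of a make-series*-style macro (an assumption, not the actual
;; definition): the name is quoted for the caller and the remaining
;; forms are evaluated and collected into a list.
(define-syntax make-simple-series*
  (syntax-rules ()
    [(_ (name vals ...))
     (make-simple-series 'name (list vals ...))]))

(define s (make-simple-series* (a 1 2 3)))
(simple-series-name s) ; the symbol a
(simple-series-lst s)  ; the list (1 2 3)
```

Because the values are evaluated, symbols still need to be quoted by the caller, which matches calls such as (make-df* (x 'b 'a 'b)) later in this post.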
Dataframe record type
The dataframe record type is based on a list of series (slist). The names and dim are derived from the slist.
(define-record-type dataframe
  (fields slist names dim)
  (protocol
    (lambda (new)
      (lambda (slist)
        (check-slist slist "(make-dataframe slist)")
        (let* ([names (map series-name slist)]
               [rows (series-length (car slist))]
               [cols (length names)])
          (new slist names (cons rows cols)))))))
A key component of the record definition is check-slist, which confirms that all elements of slist are series with the same length and unique names. define-record-type creates a predicate, dataframe?, a constructor procedure, make-dataframe, and accessor procedures for each field: dataframe-slist, dataframe-names, and dataframe-dim. make-df* is a macro that provides an alternative syntax for making a dataframe.
> (define df (make-df* (a 1 2 3) (b 4 5 6)))
> df
#[#{dataframe f4aik6efdmw9tjrhx8ell3b2e-58} (#[#{series f4aik6efdmw9tjrhx8ell3b2e-59} a (1 2 3) (1 2 3) num 3] #[#{series f4aik6efdmw9tjrhx8ell3b2e-59} b (4 5 6) (4 5 6) num 3]) (a b) (3 . 2)]
> (dataframe? df)
#t
> (dataframe-slist df)
(#[#{series f4aik6efdmw9tjrhx8ell3b2e-59} a (1 2 3) (1 2 3) num 3]
#[#{series f4aik6efdmw9tjrhx8ell3b2e-59} b (4 5 6) (4 5 6) num 3])
> (dataframe-names df)
(a b)
> (dataframe-dim df)
(3 . 2) ; (rows . columns)
> (make-df* (a 1 2 3) (a 4 5 6))
Exception in (make-dataframe slist): names not unique
> (dataframe-display df)
dim: 3 rows x 2 cols
a b
<num> <num>
1. 4.
2. 5.
3. 6.
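The body of check-slist is not shown in this post. A minimal sketch of the three checks described above (all elements are series, equal lengths, unique names) might look like the following. It assumes a hypothetical stand-in record (simple-series) and hypothetical helper names, and the error messages other than "names not unique" are invented for illustration.

```scheme
(import (rnrs))

;; Hypothetical stand-in for the series record: only name and lst.
(define-record-type simple-series
  (fields name lst))

;; #t when no symbol appears twice in names.
(define (unique-names? names)
  (or (null? names)
      (and (not (memq (car names) (cdr names)))
           (unique-names? (cdr names)))))

;; Sketch of a check-slist-style validation (an assumption, not the
;; actual implementation). Raises an assertion violation tagged with
;; who on the first failed check.
(define (check-simple-slist slist who)
  (unless (and (list? slist) (pair? slist) (for-all simple-series? slist))
    (assertion-violation who "slist must be a non-empty list of series"))
  (let ([lengths (map (lambda (s) (length (simple-series-lst s))) slist)])
    (unless (for-all (lambda (n) (= n (car lengths))) lengths)
      (assertion-violation who "series not all the same length")))
  (unless (unique-names? (map simple-series-name slist))
    (assertion-violation who "names not unique")))

;; Passes silently: two series, same length, distinct names.
(check-simple-slist
 (list (make-simple-series 'a '(1 2 3))
       (make-simple-series 'b '(4 5 6)))
 "(make-dataframe slist)")
```

Running the checks inside the record protocol means an invalid dataframe can never be constructed, so the other procedures do not need to revalidate their input.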
Head and tail
In R, I frequently use head to preview the first few rows of a dataframe and, less frequently, use tail to view the last few rows. Chez Scheme provides list-head and list-tail with similar functionality. However, tail in R returns the last n rows of the dataframe, whereas list-tail in Scheme returns the rest of the list starting at a given index. My first instinct was to write dataframe-tail to use the R behavior, but I eventually decided that dataframe-tail should follow the behavior established by list-tail. I was trying to think in terms of the principle of least surprise, but the degree of surprise depends on the potential users. Am I targeting R or Scheme programmers? The most realistic scenario is that future me is the only potential user and I want that guy to think in terms of typical Scheme patterns.
> (define df (make-df* (a 1 2 3 1 2 3) (b 4 5 6 4 5 6) (c 7 8 9 -999 -999 -999)))
> (dataframe-display (dataframe-head df 3))
dim: 3 rows x 3 cols
a b c
<num> <num> <num>
1. 4. 7.
2. 5. 8.
3. 6. 9.
> (dataframe-display df 3)
dim: 6 rows x 3 cols
a b c
<num> <num> <num>
1. 4. 7.
2. 5. 8.
3. 6. 9.
> (dataframe-display (dataframe-tail df 2))
dim: 4 rows x 3 cols
a b c
<num> <num> <num>
3. 6. 9.
1. 4. -999.
2. 5. -999.
3. 6. -999.
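The difference between the two conventions is easy to see on a plain list. In the sketch below, last-n is a hypothetical helper that mimics the R behavior; it is not part of the dataframe library.

```scheme
(import (rnrs))

(define rows '(r1 r2 r3 r4 r5 r6))

;; Hypothetical helper with R-like tail semantics: keep the last n
;; elements (or the whole list when n exceeds its length).
(define (last-n lst n)
  (list-tail lst (max 0 (- (length lst) n))))

;; Scheme convention: drop the first 2 elements, keep the rest.
(list-tail rows 2) ; (r3 r4 r5 r6)

;; R convention: keep only the last 2 elements.
(last-n rows 2)    ; (r5 r6)
```

With the Scheme convention, the argument is a starting index, which is why (dataframe-tail df 2) on a six-row dataframe returns four rows.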
Read and write
If you are working exclusively with dataframes in Scheme, you can read and write them directly with dataframe-read and dataframe-write. These procedures are straightforward because they are simply reading and writing the dataframe with read and write.
(define dataframe-write
  (case-lambda
    [(df path) (dataframe-write df path #t)]
    [(df path overwrite)
     (when (and (file-exists? path) (not overwrite))
       (assertion-violation path "file already exists"))
     (when (file-exists? path)
       (delete-file path))
     (with-output-to-file path
       (lambda () (write df)))]))

(define (dataframe-read path)
  (with-input-from-file path read))
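This approach works because read can parse what write prints. A small sketch of that round trip, using string ports instead of a file and plain lists instead of a dataframe record, is shown below; round-trip is a hypothetical helper name. Whether a record instance survives the same trip depends on how the implementation prints and reads records.

```scheme
(import (rnrs))

;; Write a datum to a string with write, then parse it back with read.
(define (round-trip datum)
  (let ([text (call-with-string-output-port
               (lambda (port) (write datum port)))])
    (read (open-string-input-port text))))

;; Strings, symbols, and numbers come back as equal? data.
(round-trip '((a 1 2 3) (b "x" "y" "z")))
```

Using write rather than display matters here: write keeps the double quotes around strings, so they are read back as strings rather than symbols.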
Extract values
dataframe-values returns all the values in a column as a list. Following R, I've included $ as an alias for dataframe-values. This procedure is particularly useful when modifying and aggregating dataframes (as I will show in a future blog post).
> (define df (make-df* (a 100 200 300) (b 4 5 6) (c 700 800 900)))
> (dataframe-values df 'b)
(4 5 6)
> ($ df 'b)
(4 5 6)
> (map (lambda (name) ($ df name)) '(c a))
((700 800 900) (100 200 300))
> (define df1 (make-df* (x 'b 'a 'b) (y 'd 'e 'c)))
> (remove-duplicates ($ df1 'x))
(b a)
> (remove-duplicates ($ df1 'y))
(d e c)
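To illustrate the idea of a column lookup with a short alias, here is a minimal sketch on a simplified table, assuming columns are stored as an association list of (name . values). The names table and table-values are hypothetical, and the real dataframe-values works on series records instead.

```scheme
(import (rnrs))

;; Hypothetical simplified table: each entry is (name . values).
(define table '((a 100 200 300) (b 4 5 6) (c 700 800 900)))

;; Return the values stored under name, or raise if name is missing.
(define (table-values table name)
  (let ([col (assq name table)])
    (unless col
      (assertion-violation 'table-values "name not in table" name))
    (cdr col)))

;; An alias is just a second binding for the same procedure.
(define $ table-values)

($ table 'b) ; (4 5 6)
```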
dataframe-ref returns a dataframe based on a list of row indices and, optionally, the selected column names. I did not follow the principle of least surprise here because dataframe-ref takes a list of indices rather than a single value as in list-ref. For dataframes, the scenario of referencing a single row seemed less likely than a range of rows and I wanted to provide the option to simultaneously select the columns returned.
> (define df
(make-df*
(grp "a" "a" "b" "b" "b")
(trt "a" "b" "a" "b" "b")
(adult 1 2 3 4 5)
(juv 10 20 30 40 50)))
> (dataframe-display df)
dim: 5 rows x 4 cols
grp trt adult juv
<str> <str> <num> <num>
a a 1. 10.
a b 2. 20.
b a 3. 30.
b b 4. 40.
b b 5. 50.
> (dataframe-display (dataframe-ref df '(0 2 4) 'adult 'juv))
dim: 3 rows x 2 cols
adult juv
<num> <num>
1. 10.
3. 30.
5. 50.
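The same selection can be sketched on a simplified table, assuming columns are stored as an association list of (name . values) and that row indices are zero-based, as in the example above. The names table, pick, and table-ref are hypothetical, and the sketch does no validation of indices or column names.

```scheme
(import (rnrs))

;; Hypothetical simplified table: each entry is (name . values).
(define table
  '((grp "a" "a" "b" "b" "b")
    (adult 1 2 3 4 5)
    (juv 10 20 30 40 50)))

;; Pick the elements of lst at the given zero-based indices.
(define (pick lst indices)
  (map (lambda (i) (list-ref lst i)) indices))

;; Sketch of dataframe-ref-style selection: choose columns by name
;; (all columns when no names are given), then choose rows by index.
(define (table-ref table indices . names)
  (let ([cols (if (null? names)
                  table
                  (map (lambda (name) (assq name table)) names))])
    (map (lambda (col) (cons (car col) (pick (cdr col) indices)))
         cols)))

(table-ref table '(0 2 4) 'adult 'juv)
;; => ((adult 1 3 5) (juv 10 30 50))
```

Taking the column names as a rest argument keeps the common case short: omitting them returns every column for the selected rows.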