A dataframe record type for Scheme
As an exercise in my Scheme (R6RS) learning journey, I have implemented a dataframe record type and procedures to work with it. Dataframes are column-oriented, tabular data structures for data analysis that are found in several languages, including R, Python, Julia, and Go. In this post, I will introduce the dataframe record type and basic procedures for working with dataframes. In subsequent posts, I will describe other dataframe procedures, e.g., filter, sort, aggregate, etc.
Series record type
A dataframe is based on the series record type. A series record is a list where every element is the same type (one of bool, chr, str, sym, num, or other). The series record type includes the series name, source list (src), converted list (lst), list type, and list length. Only name and src are required to create a series; lst, type, and length are derived from src. Failed type conversions produce 'na values.
define-record-type creates a predicate, series?, a constructor procedure, make-series, and accessor procedures for each field: series-name, series-lst, series-type, and series-length. [series-src is also created, but not exported to the dataframe namespace.] make-series* is a macro that provides alternative syntax for making a series.
(define-record-type series
  (fields name src lst type length)
  (protocol
    (lambda (new)
      (lambda (name src)
        (check-series name src "(make-series name src)")
        (let* ([type (guess-type src 1000)]
               [lst (convert-type src type)])
          (new name src lst type (length lst)))))))
> (define s1 (make-series 'a '(1 2 3)))
> (define s2 (make-series* (a 1 2 3)))
> (series-equal? s1 s2)
#t
> (series? s1)
#t
> (series-name s1)
a
> (series-lst s1)
(1 2 3)
> (series-length s1)
3
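To make the 'na behavior concrete, here is a minimal sketch of a conversion to the num type. It is not the library's convert-type; the helper name to-num is hypothetical and only illustrates the idea that a value that cannot be converted becomes 'na.

```scheme
(import (rnrs))

;; Hypothetical helper (not part of the dataframe library):
;; convert one value to a number, or return 'na if that fails.
(define (to-num x)
  (cond [(number? x) x]
        [(string? x) (or (string->number x) 'na)] ; string->number returns #f on failure
        [else 'na]))

(map to-num '(1 "2" "three" #t))  ; (1 2 na na)
```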
Dataframe record type
The dataframe record type is based on a list of series (slist). The names and dim are derived from the slist.
(define-record-type dataframe
  (fields slist names dim)
  (protocol
    (lambda (new)
      (lambda (slist)
        (check-slist slist "(make-dataframe slist)")
        (let* ([names (map series-name slist)]
               [rows (series-length (car slist))]
               [cols (length names)])
          (new slist names (cons rows cols)))))))
A key component of the record definition is check-slist, which confirms that all elements of slist are series with the same length and unique names. define-record-type creates a predicate, dataframe?, a constructor procedure, make-dataframe, and accessor procedures for each field: dataframe-slist, dataframe-names, and dataframe-dim. make-df* is a macro that provides alternative syntax for making a dataframe.
> (define df (make-df* (a 1 2 3) (b 4 5 6)))
> df
#[#{dataframe f4aik6efdmw9tjrhx8ell3b2e-58}
  (#[#{series f4aik6efdmw9tjrhx8ell3b2e-59} a (1 2 3) (1 2 3) num 3]
   #[#{series f4aik6efdmw9tjrhx8ell3b2e-59} b (4 5 6) (4 5 6) num 3])
  (a b) (3 . 2)]
> (dataframe? df)
#t
> (dataframe-slist df)
(#[#{series f4aik6efdmw9tjrhx8ell3b2e-59} a (1 2 3) (1 2 3) num 3]
 #[#{series f4aik6efdmw9tjrhx8ell3b2e-59} b (4 5 6) (4 5 6) num 3])
> (dataframe-names df)
(a b)
> (dataframe-dim df)
(3 . 2) ; (rows . columns)
> (make-df* (a 1 2 3) (a 4 5 6))
Exception in (make-dataframe slist): names not unique
> (dataframe-display df)
 dim: 3 rows x 2 cols
     a     b
 <num> <num>
    1.    4.
    2.    5.
    3.    6.
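The kind of validation that check-slist performs can be sketched on plain data. This is not the library's implementation: columns are represented here as hypothetical (name . values) pairs rather than series records, and check-columns, unique?, and same-length? are names invented for this sketch.

```scheme
(import (rnrs))

;; Hypothetical stand-in for a series: a pair of (name . values).

;; #t if no name appears more than once.
(define (unique? names)
  (or (null? names)
      (and (not (memq (car names) (cdr names)))
           (unique? (cdr names)))))

;; #t if every column holds the same number of values as the first one.
(define (same-length? columns)
  (let ([n (length (cdar columns))])
    (for-all (lambda (col) (= (length (cdr col)) n)) columns)))

;; Raise an assertion violation if either check fails; otherwise return #t.
(define (check-columns columns who)
  (unless (unique? (map car columns))
    (assertion-violation who "names not unique"))
  (unless (same-length? columns)
    (assertion-violation who "columns not the same length"))
  #t)

(check-columns '((a 1 2 3) (b 4 5 6)) "example")  ; #t
```

Passing the who string into the check, as the record protocols above do, lets the error message point at the constructor the user actually called.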
Head and tail
In R, I frequently use head to preview the first few rows of a dataframe and, less frequently, use tail to view the last few rows. Scheme provides list-head and list-tail with similar functionality. However, tail in R returns the last n rows of the dataframe whereas list-tail in Scheme returns the rest of the list starting at a given index. My first instinct was to write dataframe-tail to use the R behavior, but I eventually decided that dataframe-tail should follow the behavior established by list-tail. I was trying to think in terms of the principle of least surprise, but the degree of surprise depends on the potential users. Am I targeting R or Scheme programmers? The most realistic scenario is that future me is the only potential user and I want that guy to think in terms of typical Scheme patterns.
> (define df (make-df* (a 1 2 3 1 2 3) (b 4 5 6 4 5 6) (c 7 8 9 -999 -999 -999)))
> (dataframe-display (dataframe-head df 3))
 dim: 3 rows x 3 cols
     a     b     c
 <num> <num> <num>
    1.    4.    7.
    2.    5.    8.
    3.    6.    9.
> (dataframe-display df 3)
 dim: 5 rows x 4 cols
   grp   trt adult   juv
 <str> <str> <num> <num>
     a     a    1.   10.
     a     b    2.   20.
     b     a    3.   30.
> (dataframe-display (dataframe-tail df 2))
 dim: 4 rows x 3 cols
     a     b     c
 <num> <num> <num>
    3.    6.    9.
    1.    4. -999.
    2.    5. -999.
    3.    6. -999.
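The difference between the two conventions is easy to see on a plain list. The sketch below uses only R6RS procedures; first-n and last-n are hypothetical helpers written for this comparison (first-n mimics what list-head does in Chez Scheme, and last-n mimics R's tail).

```scheme
(import (rnrs))

;; Hypothetical helper: the first n elements of lst.
(define (first-n lst n)
  (if (= n 0)
      '()
      (cons (car lst) (first-n (cdr lst) (- n 1)))))

;; Hypothetical helper: R-style tail, i.e., the last n elements of lst.
(define (last-n lst n)
  (list-tail lst (- (length lst) n)))

(define rows '(1 2 3 4 5 6))

(first-n rows 3)    ; (1 2 3)
(list-tail rows 2)  ; (3 4 5 6), everything from index 2 onward
(last-n rows 2)     ; (5 6), what R's tail(x, 2) would give
```

With the list-tail convention, the R behavior is still one subtraction away: pass the number of rows minus n as the index.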
Read and write
If you are working exclusively with dataframes in Scheme, you can read and write them directly with dataframe-read and dataframe-write. These procedures are straightforward because they are simply reading and writing the dataframe with read and write.
(define dataframe-write
  (case-lambda
    [(df path) (dataframe-write df path #t)]
    [(df path overwrite)
     (when (and (file-exists? path) (not overwrite))
       (assertion-violation path "file already exists"))
     (when (file-exists? path)
       (delete-file path))
     (with-output-to-file path
       (lambda () (write df)))]))

(define (dataframe-read path)
  (with-input-from-file path read))
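The same write-then-read pattern can be tried without the dataframe library. In this sketch a plain list stands in for the dataframe, and both the round-trip procedure and the file name are hypothetical choices made for the example.

```scheme
(import (rnrs))

;; Hypothetical helper: write a datum to path, then read it back.
(define (round-trip datum path)
  ;; with-output-to-file may refuse to overwrite, so remove any old file first
  (when (file-exists? path)
    (delete-file path))
  (with-output-to-file path
    (lambda () (write datum)))
  (with-input-from-file path read))

;; A plain list standing in for a dataframe.
(define data '((a 1 2 3) (b 4 5 6)))

(equal? data (round-trip data "roundtrip-example.scm"))  ; #t
```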
Extract values
dataframe-values returns all the values in a column as a list. Following R, I’ve included $ as an alias for dataframe-values. This procedure is particularly useful when modifying and aggregating dataframes (as I will show in a future blog post).
> (define df (make-df* (a 100 200 300) (b 4 5 6) (c 700 800 900)))
> (dataframe-values df 'b)
(4 5 6)
> ($ df 'b)
(4 5 6)
> (map (lambda (name) ($ df name)) '(c a))
((700 800 900) (100 200 300))
> (define df1 (make-df* (x 'b 'a 'b) (y 'd 'e 'c)))
> (remove-duplicates ($ df1 'x))
(b a)
> (remove-duplicates ($ df1 'y))
(d e c)
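Looking up a column by name can be sketched with an association list standing in for the dataframe. This is an illustration only: column-values is a hypothetical helper, and the real dataframe-values works on series records rather than on (name . values) pairs.

```scheme
(import (rnrs))

;; Hypothetical stand-in for a dataframe: an association list of (name . values).
(define columns '((a 100 200 300) (b 4 5 6) (c 700 800 900)))

;; Hypothetical helper: return the values stored under name,
;; or raise an assertion violation if no such column exists.
(define (column-values columns name)
  (let ([col (assq name columns)])
    (unless col
      (assertion-violation 'column-values "name not found" name))
    (cdr col)))

(column-values columns 'b)  ; (4 5 6)
(map (lambda (name) (column-values columns name)) '(c a))
;; ((700 800 900) (100 200 300))
```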
dataframe-ref returns a dataframe based on a list of row indices and, optionally, the selected column names. I did not follow the principle of least surprise here because dataframe-ref takes a list of indices rather than a single value as in list-ref. For dataframes, the scenario of referencing a single row seemed less likely than a range of rows and I wanted to provide the option to simultaneously select the columns returned.
> (define df
    (make-df*
      (grp "a" "a" "b" "b" "b")
      (trt "a" "b" "a" "b" "b")
      (adult 1 2 3 4 5)
      (juv 10 20 30 40 50)))
> (dataframe-display df)
 dim: 5 rows x 4 cols
   grp   trt adult   juv
 <str> <str> <num> <num>
     a     a    1.   10.
     a     b    2.   20.
     b     a    3.   30.
     b     b    4.   40.
     b     b    5.   50.
> (dataframe-display (dataframe-ref df '(0 2 4) 'adult 'juv))
 dim: 3 rows x 2 cols
 adult   juv
 <num> <num>
    1.   10.
    3.   30.
    5.   50.
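Selecting rows by a list of indices amounts to calling list-ref once per index on every selected column. The sketch below shows that idea on an association list; list-ref-many and columns-ref are hypothetical helpers, and unlike dataframe-ref, columns-ref takes the column names as a single list rather than as trailing arguments.

```scheme
(import (rnrs))

;; Hypothetical helper: the elements of lst at the given zero-based indices.
(define (list-ref-many lst indices)
  (map (lambda (i) (list-ref lst i)) indices))

;; Hypothetical helper: apply the same row selection to each named column
;; of an association list of (name . values).
(define (columns-ref columns indices names)
  (map (lambda (name)
         (cons name (list-ref-many (cdr (assq name columns)) indices)))
       names))

(columns-ref '((grp "a" "a" "b" "b" "b")
               (adult 1 2 3 4 5)
               (juv 10 20 30 40 50))
             '(0 2 4)
             '(adult juv))
;; ((adult 1 3 5) (juv 10 30 50))
```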