Travis Hinkelman - Comparing dataframe operations in Scheme, Python, and R

I recently came across this blog post that calls pandas (Python) "clunky" compared to the "silky smooth" dplyr (R). No objections from me. dplyr is my favorite R package. I thought it would fun to compare the relative clunkiness of my dataframe library for Scheme (R6RS) to pandas and dplyr. I will include the R and Python code here to save some clicking back and forth, but will mostly be commenting on the Scheme code.

Reading in the data

# R 
library(dplyr)
purchases <- read.csv("purchases.csv")

# Python
import pandas as pd
purchases = pd.read_csv("purchases.csv")

;; Scheme
(import (dataframe))
(define purchases (csv->dataframe "purchases.csv"))

"How much do we sell..? Let's take the total sum!"

# R 
sum(purchases$amount)

# Python
sum(purchases["amount"])

;; Scheme
(sum ($ purchases 'amount))

"Ah, they wanted it by country..."

dataframe-aggregate* is a macro to slightly reduce the verbosity of the Scheme code, but it still falls short of the brevity and readability of the dplyr code.

# R 
purchases |>
  group_by(country) |>
  summarize(total = sum(amount))

# Python
(purchases
  .groupby("country")
  .agg(total=("amount", "sum")) 
  .reset_index()                
)

;; Scheme
(dataframe-aggregate* purchases (country) (total (amount) (sum amount)))

"And I guess I should deduct the discount."

In Version 1, subtracting the discount is included by mapping over the columns within dataframe-aggregate*. In Version 2, dataframe-modify* handles mapping down the columns and produces more compact code. Version 2 illustrates the similarities in dataframe-modify* and dataframe-aggregate* with the primary difference being that you need to specify the grouping columns, e.g., (country) in dataframe-aggregate*, but the general form is (new-name (names) (expression)).

# R 
purchases |> 
  group_by(country) |> 
  summarize(total = sum(amount - discount))

# Python
(purchases
  .groupby("country")
  .apply(lambda df: (df["amount"] - df["discount"]).sum()) 
  .reset_index()
  .rename(columns={0: "total"})                            
)

;; Scheme
;; Version 1
(-> purchases
    (dataframe-aggregate*
     (country)
     (total
      (amount discount)
      (sum (map (lambda (amt dis) (- amt dis)) amount discount)))))

;; Version 2
(-> purchases
    (dataframe-modify* (diff (amount discount) (- amount discount)))
    (dataframe-aggregate* (country) (total (diff) (sum diff))))

“Oh, and Maria asked me to remove any outliers.”

The Scheme version only works because the filter step is the first in the pipeline, i.e., the pipe is passing the same purchases that is referred to in ($ purchases 'amount).

# R 
purchases |>
  filter(amount <= median(amount) * 10) |> 
  group_by(country) |> 
  summarize(total = sum(amount - discount))

# Python
(purchases
  .query("amount <= amount.median() * 10")
  .groupby("country")
  .apply(lambda df: (df["amount"] - df["discount"]).sum())
  .reset_index()
  .rename(columns={0: "total"})
)

;; Scheme
(-> purchases
    (dataframe-filter* (amount) (<= amount (* 10 (median ($ purchases 'amount)))))
    (dataframe-modify* (diff (amount discount) (- amount discount)))
    (dataframe-aggregate* (country) (total (diff) (sum diff))))

“I probably should use the median within each country.”

In Version 1, the outlier filtering and discount subtraction happens within the hidden split-apply-combine approach in dataframe-aggregate*. In Version 2, the split-apply-combine is made explicit for outlier filtering. Personally, I find Version 2 more readable, but it should be avoided for large dataframes because it involves splitting twice (dataframe-split and dataframe-aggregate*). Of course, neither are as elegant as the dplyr version.

# R 
purchases |>
  group_by(country) |>                    
  filter(amount <= median(amount) * 10) |> 
  summarize(total = sum(amount - discount))

# Python
(purchases
  .groupby("country")                                               
  .apply(lambda df: df[df["amount"] <= df["amount"].median() * 10]) 
  .reset_index(drop=True)                                           
  .groupby("country")
  .apply(lambda df: (df["amount"] - df["discount"]).sum())
  .reset_index()
  .rename(columns={0: "total"})
)

;; Scheme
;; Version 1
(-> purchases
    (dataframe-aggregate*
     (country)
     (total
      (amount discount)
      (sum (map (lambda (amt dis)
                  (let ([multi (if (<= amt (* 10 (median amount))) 1 0)])
                    (* multi (- amt dis))))
                amount
                discount)))))

;; Version 2
(-> purchases
    (dataframe-split 'country)
    (->> (map (lambda (dfx)
                (dataframe-filter* dfx (amount) (<= amount (* 10 (median ($ dfx 'amount))))))))
    (dataframe-bind-all)
    (dataframe-modify* (diff (amount discount) (- amount discount)))
    (dataframe-aggregate* (country) (total (diff) (sum diff))))

Conclusion

I'm clearly biased, but I think a case could be made that my dataframe library for Scheme is at least not more clunky than pandas.

Contents