Adding string matching to chez-docs
I recently wrote a little library, chez-docs
, to make accessing documentation easier while learning Chez Scheme (blog post). The main procedure, doc
, in chez-docs
only returns results for exact matches with proc
. To aid in discovery, I’ve added a procedure, find-proc
, that provides exact and approximate matching of search strings.
proc
is shorthand for procedure, but the categories for documentation items include procedure
, global param
, thread param
, module
, and syntax
.Levenshtein Distance
My initial thought was that I should approach this problem with approximate string matching. After a little searching, I learned that Levenshtein distance was one of the simplest approaches to calculate the distance between two strings. This excellent blog post included a few MATLAB implementations of Levenshtein distance algorithms that were relatively easy for me to follow because of my experience with MATLAB and R.
I first implemented the recursive algorithm thinking that it would be most natural in Scheme, but it was unacceptably slow. I then implemented the iterative two-row algorithm and found the performance to be sufficiently snappy for my needs.
define (lev s t)
(let* ([s (list->vector (string->list s))]
([t (list->vector (string->list t))]
[m (vector-length s)]
[n (vector-length t)]
[x (list->vector (iota (add1 n)))]
[y (list->vector (make-list (add1 n) 0))])
do ((i 0 (add1 i)))
(= i m))
((vector-set! y 0 i)
(do ((j 0 (add1 j)))
(= j n))
((let ([c (if (char=? (vector-ref s i) (vector-ref t j)) 0 1)])
(vector-set! y (add1 j) (min (add1 (vector-ref y j))
(vector-ref x (add1 j)))
(add1 (+ c (vector-ref x j))))))
(;; swap x and y
let ([tmp x])
(set! x y)
(set! y tmp)))
(vector-ref x n))) (
This is the first time that I’ve used do
loops in Scheme. In the example below, the looping index i
is initialized to zero and incremented by one on each pass through the loop. The loop is exited when (= i 10)
. The equivalent code in R is for (i in 0:9) cat(paste0(i, " "))
.
> (do ((i 0 (add1 i)))
= i 10))
((display (string-append (number->string i) " ")))
(0 1 2 3 4 5 6 7 8 9
lev
tallies the numbers of insertions, deletions, and substitutions; a value of zero indicates an exact match.
> (map (lambda (x) (lev "head" x)) '("head" "read" "load" "list-head"))
0 1 2 5) (
Exact Substring Matching
doc
uses assoc
to find any exact matches of the full string in the list of procedures. After working with the Levenshtein distance, I realized that exact matching of substrings would generally be more useful than fuzzy matching. I wrote the string-match
procedure to test if a search string is present in the target string.
define (string-match s t)
(let* ([s-list (string->list s)]
([t-list (string->list t)])
if (char=? (car s-list) #\^)
(cdr s-list) t-list)
(string-match-helper (not (for-all (lambda (x) (equal? x #f))
(lambda (t-sub) (string-match-helper s-list t-sub))
(map (car s-list) t-list)))))))
(potential-matches (
;; loop through characters in search string
;; to check if search string is found in target string
define (string-match-helper s-list t-list)
(cond [(not t-list) #f]
([(null? s-list) #t]
[(< (length t-list) (length s-list)) #f]
[(char=? (car s-list) (car t-list))
cdr s-list) (cdr t-list))]
(string-match-helper ([else #f]))
;; loop through target string
;; to find all potential substring matches
define (potential-matches char t-list)
(let loop ([t-list t-list]
([results '()])
if (null? t-list)
(reverse results))
(remove-duplicates (cdr t-list) (cons (member char t-list) results)))))
(loop (
define (remove-duplicates ls)
(cond [(null? ls)
(]
'()[(member (car ls) (cdr ls))
cdr ls))]
(remove-duplicates ([else
cons (car ls) (remove-duplicates (cdr ls)))])) (
member
is the workhorse of string-match
(via potential-matches
). It’s an interesting turn for me because when I first started using member
in my Scheme code I was puzzled by why it didn’t work like %in%
in R. For example, (member 2 '(1 2 3))
returns (2 3)
, but 2 %in% c(1, 2, 3)
returns TRUE
. Because all values other than #f
count as #t
in Scheme, member
can be used as a predicate, e.g., (if (member 2 '(1 2 3)) 1 0)
returns 1
. Nonetheless, it wasn’t obvious to me how member
’s behavior was useful…until I started writing string-match
. Those experiences make programming fun.
string-match
returns a boolean value.
> (map (lambda (x) (string-match "head" x)) '("head" "read" "load" "list-head"))
#t #f #f #t) (
Procedure Discovery
find-proc
takes a search-string
and two optional arguments, search-type
and max-results
, which default to 'exact
and 10
, respectively.
define find-proc
(case-lambda
([(search-string)
10)]
(find-proc-helper search-string 'exact [(search-string search-type)
10)]
(find-proc-helper search-string search-type [(search-string search-type max-results)
])) (find-proc-helper search-string search-type max-results)
find-proc-helper
maps either lev
or string-match
to the full list of procedures, proc-list
, and then sorts or filters the results, respectively.
define (find-proc-helper search-string search-type max-results)
(string? search-string)
(unless (assertion-violation "(find-proc search-string)" "search-string is not a string"))
(cond [(symbol=? search-type 'fuzzy)
(let* ([dist-list (map (lambda (x) (lev search-string x))
(]
proc-list)[dist-proc (map (lambda (dist proc) (cons dist proc))
]
dist-list proc-list)[dist-proc-sort (sort (lambda (x y) (< (car x) (car y)))
])
dist-proc)]
(prepare-results dist-proc-sort search-type max-results))[(symbol=? search-type 'exact)
let* ([bool-list (map (lambda (x) (string-match search-string x))
(]
proc-list)[bool-proc (map (lambda (bool proc) (cons bool proc))
]
bool-list proc-list)[bool-proc-filter (filter (lambda (x) (car x)) bool-proc)])
]
(prepare-results bool-proc-filter search-type max-results))[else
assertion-violation "(find-proc search-string search-type)"
("search-type must be either 'exact or 'fuzzy")]))
define (prepare-results ls search-type max-results)
(let* ([len (length ls)]
([max-n (if (> max-results len) len max-results)])
and (symbol=? search-type 'exact) (> len max-results))
(when (display (string-append "Returning " (number->string max-results)
(" of " (number->string len)
" results\n")))
cdr (list-head ls max-n)))) (map
I first realized that Levenshtein distance might not be very useful for find-proc
when searching for head
, a commonly used procedure in R.
> (find-proc "head" 'fuzzy 5)
"read" "and" "cadr" "car" "cd") (
However, substring matching points us to the relevant function, list-head
, in Chez Scheme.
> (find-proc "head" 'exact 5)
"list-head" "lookahead-char" "lookahead-u8" "make-boot-header") (
Fuzzy matching is useful, though, for discovery when there are options with similar forms, e.g., hash-table?
and hashtable?
.
> (find-proc "hash-table?" 'exact 3)
"hash-table?")
(> (find-proc "hash-table?" 'fuzzy 3)
"hash-table?" "hashtable?" "eq-hashtable?") (
The ^
indicates that only search strings found at the start of the procedure should be returned.
> (find-proc "map")
"andmap" "hash-table-map" "map" "ormap" "vector-map")
(> (find-proc "^map")
"map")
(
> (find-proc "file" 'exact 3)
3 of 78 results
Returning "&i/o-file-already-exists"
("&i/o-file-does-not-exist"
"&i/o-file-is-read-only")
> (find-proc "^file" 'exact 3)
"file-access-time" "file-buffer-size" "file-change-time")
(
> (find-proc "let" 'exact 5)
5 of 20 results
Returning "delete-directory"
("delete-file"
"eq-hashtable-delete!"
"fluid-let"
"fluid-let-syntax")
> (find-proc "^let" 'exact)
"let" "let*" "let*-values" "let-syntax" "let-values"
("letrec" "letrec*" "letrec-syntax")
Under fuzzy matching, the ^
is included as part of the Levenshtein distance calculation and, thus, should not be included in search strings when using fuzzy matching.
> (find-proc "map" 'fuzzy 5)
"map" "max" "*" "+" "-")
(> (find-proc "^map" 'fuzzy 5)
"map" "max" "car" "exp" "memp") (
UPDATE (2025-05-08): I realized recently that one of my favorite little elements of chez-docs
is that, if doc
can’t find match, it returns a guess powered by find-proc
and fuzzy matching.
> (doc "ifelse")
not found in csug or tspl
Exception in (doc proc): ifelse Did you mean 'else'?
Searching for ifelse
(from R) and turning up else
isn’t the most direct route to understanding, but it will lead you to cond
and get you to a little closer to where you want to go.
It took me about three years to realize that it would be an easy addition (basically 3 lines of code added in July 2023) and another two years to realize that I should mention it somewhere.
`scheme (define (guess-proc proc) (string-append "Did you mean '" (car (find-proc proc 'fuzzy)) "'?"))
It has the same limitations of fuzzy matching as described above, but it is helpful if you simply introduced a typo or a misplaced dash in your call to doc
.
> (doc "make-byte-vector")
not found in csug or tspl
Exception in (doc proc): make-byte-vector Did you mean 'make-bytevector'?