Adding string matching to chez-docs
I recently wrote a little library,
chez-docs, to make accessing documentation easier while learning Chez Scheme (blog post). The main procedure,
chez-docs only returns results for exact matches with
proc.1 To aid in discovery, I’ve added a procedure,
find-proc, that provides exact and approximate matching of search strings.
My initial thought was that I should approach this problem with approximate string matching. After a little searching, I learned that Levenshtein distance was one of the simplest approaches to calculate the distance between two strings. This excellent blog post included a few MATLAB implementations of Levenshtein distance algorithms2 that were relatively easy for me to follow because of my experience with MATLAB and R.
I first implemented the recursive algorithm3 thinking that it would be most natural in Scheme, but it was unacceptably slow. I then implemented the iterative two-row algorithm and found the performance to be sufficiently snappy for my needs.
(define (lev s t) (let* ([s (list->vector (string->list s))] [t (list->vector (string->list t))] [m (vector-length s)] [n (vector-length t)] [x (list->vector (iota (add1 n)))] [y (list->vector (make-list (add1 n) 0))]) (do ((i 0 (add1 i))) ((= i m)) (vector-set! y 0 i) (do ((j 0 (add1 j))) ((= j n)) (let ([c (if (char=? (vector-ref s i) (vector-ref t j)) 0 1)]) (vector-set! y (add1 j) (min (add1 (vector-ref y j)) (add1 (vector-ref x (add1 j))) (+ c (vector-ref x j)))))) ;; swap x and y (let ([tmp x]) (set! x y) (set! y tmp))) (vector-ref x n)))
This is the first time that I’ve used
do loops in Scheme. In the example below, the looping index
i is initialized to zero and incremented by one on each pass through the loop. The loop is exited when
(= i 10). The (sort of) equivalent syntax in R is
for (i in 0:9) print(i).
> (do ((i 0 (add1 i))) ((= i 10)) (display (string-append (number->string i) " "))) 0 1 2 3 4 5 6 7 8 9
lev tallies the numbers of insertions, deletions, and substitutions; a value of zero indicates an exact match.
> (map (lambda (x) (lev "head" x)) '("head" "read" "load" "list-head")) (0 1 2 5)
Exact Substring Matching
assoc to find any exact matches of the full string in the list of procedures. After working with the Levenshtein distance, I realized that exact matching of substrings would generally be more useful than fuzzy matching. I wrote the
string-match procedure to test if a search string is present in the target string.
(define (string-match s t) (let* ([s-list (string->list s)] [t-list (string->list t)]) (if (char=? (car s-list) #\^) (string-match-helper (cdr s-list) t-list) (not (for-all (lambda (x) (equal? x #f)) (map (lambda (t-sub) (string-match-helper s-list t-sub)) (potential-matches (car s-list) t-list))))))) ;; loop through characters in search string ;; to check if search string is found in target string (define (string-match-helper s-list t-list) (cond [(not t-list) #f] [(null? s-list) #t] [(< (length t-list) (length s-list)) #f] [(char=? (car s-list) (car t-list)) (string-match-helper (cdr s-list) (cdr t-list))] [else #f])) ;; loop through target string ;; to find all potential substring matches (define (potential-matches char t-list) (define (loop t-list results) (cond [(null? t-list) (remove-duplicates (reverse results))] [else (loop (cdr t-list) (cons (member char t-list) results))])) (loop t-list '())) (define (remove-duplicates ls) (cond [(null? ls) '()] [(member (car ls) (cdr ls)) (remove-duplicates (cdr ls))] [else (cons (car ls) (remove-duplicates (cdr ls)))]))
member is the workhorse of
potential-matches). It’s an interesting turn for me because when I first started using
member in my Scheme code I was puzzled by why it didn’t work like
%in% in R. For example,
(member 2 '(1 2 3)) returns
(2 3), but
2 %in% c(1, 2, 3) returns
TRUE. Because all values other than
#f count as
#t in Scheme,
member can be used as a predicate, e.g.,
(if (member 2 '(1 2 3)) 1 0) returns
1. Nonetheless, it wasn’t obvious to me how
member’s behavior was useful…until I started writing
string-match. Those experiences make programming fun.
string-match returns a boolean value.
> (map (lambda (x) (string-match "head" x)) '("head" "read" "load" "list-head")) (#t #f #f #t)
find-proc takes a
search-string and two optional arguments,
fuzzy?, which default to
(define find-proc (case-lambda [(search-string) (find-proc-helper search-string 10 #f)] [(search-string max-results) (find-proc-helper search-string max-results #f)] [(search-string max-results fuzzy?) (find-proc-helper search-string max-results fuzzy?)]))
find-proc-helper maps either
string-match to the full list of procedures,
proc-list, and then sorts or filters the results, respectively.
(define (find-proc-helper search-string max-results fuzzy?) (unless (string? search-string) (assertion-violation "(find-proc search-string)" "search-string is not a string")) (cond [fuzzy? (let* ([dist-list (map (lambda (x) (lev search-string x)) proc-list)] [dist-proc (map (lambda (dist proc) (cons dist proc)) dist-list proc-list)] [dist-proc-sort (sort (lambda (x y) (< (car x) (car y))) dist-proc)]) (prepare-results dist-proc-sort max-results))] [else (let* ([bool-list (map (lambda (x) (string-match search-string x)) proc-list)] [bool-proc (map (lambda (bool proc) (cons bool proc)) bool-list proc-list)] [bool-proc-filter (filter (lambda (x) (car x)) bool-proc)]) (prepare-results bool-proc-filter max-results))])) (define (prepare-results ls max-results) (let* ([len (length ls)] [max-n (if (> max-results len) len max-results)]) (map cdr (list-head ls max-n))))
I first realized that Levenshtein distance might not be very useful for
find-proc when searching for
head, a commonly used procedure in R.
> (find-proc "head" 5 #t) ("read" "and" "cadr" "car" "cd")
However, substring matching points us to the relevant function,
list-head, in Chez Scheme.
> (find-proc "head" 5) ("list-head" "lookahead-char" "lookahead-u8" "make-boot-header")
Fuzzy matching is useful, though, for discovery when there are options with similar forms, e.g.,
> (find-proc "hash-table?" 3) ("hash-table?") > (find-proc "hash-table?" 3 #t) ("hash-table?" "hashtable?" "eq-hashtable?")
^ indicates that only search strings found at the start of the procedure should be returned.
> (find-proc "map") ("andmap" "hash-table-map" "map" "ormap" "vector-map") > (find-proc "^map") ("map") > (find-proc "file" 3) ("&i/o-file-already-exists" "&i/o-file-does-not-exist" "&i/o-file-is-read-only") > (find-proc "^file" 3) ("file-access-time" "file-buffer-size" "file-change-time") > (find-proc "let" 5) ("delete-directory" "delete-file" "eq-hashtable-delete!" "fluid-let" "fluid-let-syntax") > (find-proc "^let") ("let*" "let*-values" "let-syntax" "let-values" "letrec" "letrec*" "letrec-syntax")
Under fuzzy matching, the
^ is included as part of the Levenshtein distance calculation and, thus, should not be included in search strings when using fuzzy matching.
> (find-proc "map" 5 #t) ("map" "max" "*" "+" "-") > (find-proc "^map" 5 #t) ("map" "max" "car" "exp" "memp")
procis shorthand for procedure, but not all of the items in
chez-docsare procedures, e.g.,