This repository is meant to collect all of my machine learning algorithm implementations in GNU Guile or Scheme.

zelphir.kaltstahl 77f93cb446 add more notes about using fibers 4 years ago
old-racket-code 05f7208871 add racket code test file 4 years ago
scripts 7adfd9baf1 run metrics tests in run test script 4 years ago
test f272fac139 add dataset-get-columns procedure 4 years ago
utils 191af9c8c6 add range function 4 years ago
.gitignore 81913d80f7 update gitignore to include Emacs files 4 years ago
LICENSE 198d22ffbb Initial commit 7 years ago
README.org fde8c7b798 update readme 4 years ago
columns.csv ced580d8b0 initial commit 7 years ago
data-point.scm 3df084f217 add missing procedure 4 years ago
data_banknote_authentication.csv ced580d8b0 initial commit 7 years ago
dataset.scm f272fac139 add dataset-get-columns procedure 4 years ago
decision-tree.scm 43290ee473 add todo comments for parallelism 4 years ago
metrics.scm 51ef8a8b35 update comment 4 years ago
notes.org 77f93cb446 add more notes about using fibers 4 years ago
prediction.scm 90a79c8f89 separate prediction module 4 years ago
pruning.scm 526ce93aa3 separate pruning module 4 years ago
split-quality-measure.scm 0111a9d334 remove commented out Racket expression 4 years ago
todo.org 117ede1f47 update todo items 4 years ago
tree.scm 67d801c996 move tree printing procedure to tree module 4 years ago
utils.scm 74e8f4af8a move list procedures from utils into list utils 4 years ago

README.org

Tests

To run the tests, execute the script run-tests.bash from the scripts/ directory:


# from the root directory of this project:
bash scripts/run-tests.bash

Usage (outdated example)

This example is outdated: it was written for the older Racket code and has not yet been rewritten for GNU Guile.


(define shuffled-dataset (shuffle dataset))

(define small-dataset
  (data-range shuffled-dataset
              0
              ;; take only a fifth of the data to make this example run faster
              (exact-floor (/ (dataset-length shuffled-dataset)
                              5))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

;; requires a ~time~ macro
(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   (mean
    (evaluate-algorithm #:dataset (shuffle dataset)
                        #:n-folds 10
                        #:feature-column-indices (list 0 1 2 3)
                        #:label-column-index 4
                        #:max-depth 5
                        #:min-data-points 24
                        #:min-data-points-ratio 0.02
                        #:min-impurity-split (expt 10 -7)
                        #:stop-at-no-impurity-improvement #t
                        #:random-seed 0))))

;; be sure to collect all garbage, apparently this should be called thrice
(collect-garbage)
(collect-garbage)
(collect-garbage)

(time
 ;; ~for/list~ -- a Racketism, needs to be rewritten
 (for/list ([i (in-range 1)])
   ;; run with the whole dataset as an example, no random seed
   (define tree (fit #:train-data dataset
                     #:feature-column-indices (list 0 1 2 3)
                     #:label-column-index 4
                     #:max-depth 5
                     #:min-data-points 12
                     #:min-data-points-ratio 0.02
                     #:min-impurity-split (expt 10 -7)
                     #:stop-at-no-impurity-improvement #t))
   'done))
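
The example above relies on Racket's ~time~ and ~for/list~. As a rough sketch of how the timing part might look in GNU Guile, the snippet below defines a hypothetical ~time~ macro (it is not part of this repository) using Guile's built-in get-internal-real-time, and replaces ~for/list~ with map over iota. The squaring lambda is only a stand-in workload, not a call into this project's code.

```scheme
;; Hypothetical helper, not part of this repository: a minimal `time` macro.
;; It evaluates EXPR once, prints the elapsed wall-clock time in seconds,
;; and returns the value of EXPR.
(define-syntax-rule (time expr)
  (let* ((start (get-internal-real-time))
         (result expr)
         (end (get-internal-real-time)))
    (format #t "elapsed: ~a s~%"
            (exact->inexact
             (/ (- end start) internal-time-units-per-second)))
    result))

;; `map` over `iota` stands in for Racket's
;; (for/list ([i (in-range n)]) ...).
;; The lambda body is a placeholder workload for illustration only.
(define results
  (time (map (lambda (i) (* i i))
             (iota 5))))
```

Because the macro returns the value of the timed expression, it can wrap an existing call without changing the surrounding code.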

Approach

Data representation

  • A dataset is currently represented as a list of vectors, where each vector is one row.
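
To make the list-of-vectors representation concrete, here is a minimal sketch in GNU Guile. The values are made up for illustration (four feature columns followed by a label column), and the procedure names example-row-ref and example-column are hypothetical; they are not the accessors defined in dataset.scm.

```scheme
;; A tiny dataset: a list of rows, each row a vector.
;; Values are invented for illustration; columns 0-3 are features,
;; column 4 is the label.
(define example-dataset
  (list (vector 1.5 2.0 -0.5 0.25 0)
        (vector 0.5 1.0 -1.5 0.75 0)
        (vector -2.0 3.0 1.0 -0.5 1)))

;; Hypothetical accessor: the value at COLUMN-INDEX in one row.
(define (example-row-ref row column-index)
  (vector-ref row column-index))

;; Hypothetical accessor: all values of one column, in row order.
(define (example-column dataset column-index)
  (map (lambda (row) (example-row-ref row column-index))
       dataset))

;; The label column of the example dataset.
(define example-labels (example-column example-dataset 4))
```

With this layout, taking a subset of rows is an ordinary list operation, while reading one feature of a row is a constant-time vector-ref.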