- This directory includes some useful tools:
- 1. subset selection tools.
- 2. parameter selection tools.
- 3. LIBSVM format checking tools.
- Part I: Subset selection tools
- Introduction
- ============
- Training on large data sets is time-consuming. Sometimes one should work on a
- smaller subset first. The python script subset.py randomly selects a
- specified number of samples. For classification data, we provide a
- stratified selection to ensure the same class distribution in the
- subset.
- Usage: subset.py [options] dataset number [output1] [output2]
- This script selects a subset of the given data set.
- options:
- -s method : method of selection (default 0)
- 0 -- stratified selection (classification only)
- 1 -- random selection
- output1 : the subset (optional)
- output2 : the rest of data (optional)
- If output1 is omitted, the subset will be printed on the screen.
- Example
- =======
- > python subset.py heart_scale 100 file1 file2
- From heart_scale, 100 samples are selected (using the default stratified
- selection) and stored in file1. All remaining instances are stored in file2.
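The stratified selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of subset.py: the function name `stratified_subset` and its `seed` argument are hypothetical, and it assumes every line is non-empty and starts with its class label, as in the LIBSVM format.

```python
import random
from collections import defaultdict


def stratified_subset(lines, subset_size, seed=0):
    """Pick about subset_size lines while keeping the class proportions.

    Hypothetical helper: the label is assumed to be the first
    whitespace-separated token of each (non-empty) line.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, line in enumerate(lines):
        by_label[line.split(None, 1)[0]].append(index)

    total = len(lines)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Per-class quota proportional to the class frequency.
        quota = round(subset_size * len(indices) / total)
        chosen.extend(rng.sample(indices, min(quota, len(indices))))

    chosen_set = set(chosen)
    subset = [lines[i] for i in sorted(chosen_set)]
    rest = [line for i, line in enumerate(lines) if i not in chosen_set]
    return subset, rest
```

Because each class quota is rounded separately, the subset returned by this sketch can differ from the requested size by a few samples.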
- Part II: Parameter selection tools
- Introduction
- ============
- grid.py is a parameter selection tool for C-SVM classification using
- the RBF (radial basis function) kernel. It uses the cross validation (CV)
- technique to estimate the accuracy of each parameter combination in
- the specified range and helps you decide the best parameters for
- your problem.
- grid.py directly executes libsvm binaries (so no python binding is needed)
- for cross validation and then draws a contour plot of CV accuracy using gnuplot.
- You must have libsvm and gnuplot installed before using it. The package
- gnuplot is available at http://www.gnuplot.info/
- On Mac OS X, the precompiled gnuplot binary needs the AquaTerm library,
- which must therefore be installed as well. In addition, this version of
- gnuplot does not support png, so you need to change "set term png
- transparent small" to use another image format. For example, you may
- use "set term pbm small color".
- Usage: grid.py [-log2c begin,end,step] [-log2g begin,end,step] [-v fold]
- [-svmtrain pathname] [-gnuplot pathname] [-out pathname] [-png pathname]
- [additional parameters for svm-train] dataset
- The program conducts v-fold cross validation using parameter C (and gamma)
- = 2^begin, 2^(begin+step), ..., 2^end.
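As a rough illustration of how a begin,end,step triple expands into parameter values, the following sketch enumerates the exponents and the corresponding powers of two. The helper name `grid_values` is hypothetical, and the order in which grid.py actually evaluates the combinations may differ.

```python
def grid_values(begin, end, step):
    """Return (exponent, 2**exponent) for begin, begin+step, ..., end."""
    if step == 0:
        raise ValueError("step must be non-zero")
    exponents = []
    value = begin
    # Support both increasing and decreasing ranges.
    while (step > 0 and value <= end) or (step < 0 and value >= end):
        exponents.append(value)
        value += step
    return [(e, 2.0 ** e) for e in exponents]


# Ranges corresponding to "-log2c -5,5,1" and "-log2g -4,0,1".
c_grid = grid_values(-5, 5, 1)
g_grid = grid_values(-4, 0, 1)
pairs = [(c, g) for _, c in c_grid for _, g in g_grid]
print(len(pairs))  # number of (C, gamma) combinations to cross-validate
```

With these two ranges there are 11 values of C and 5 values of gamma, so 55 combinations are cross-validated.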
- You can specify where the libsvm executable and gnuplot are using the
- -svmtrain and -gnuplot parameters.
- Windows users should use pgnuplot.exe. If you are using gnuplot
- 3.7.1, please upgrade to version 3.7.3 or higher, because version 3.7.1
- has a bug. If you use cygwin on Windows, please use gnuplot-x11.
- Example
- =======
- > python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale
- Users (in particular MS Windows users) may need to specify the path of
- executable files. You can either change the paths at the beginning of
- grid.py or specify them on the command line. For example,
- > grid.py -log2c -5,5,1 -svmtrain c:\libsvm\windows\svm-train.exe -gnuplot c:\tmp\gnuplot\bin\pgnuplot.exe -v 10 heart_scale
- Output: two files
- dataset.png: the CV accuracy contour plot generated by gnuplot
- dataset.out: the CV accuracy at each (log2(C),log2(gamma))
- Parallel grid search
- ====================
- You can conduct a parallel grid search by dispatching jobs to a
- cluster of computers which share the same file system. First, add
- the machine names in grid.py:
- ssh_workers = ["linux1", "linux5", "linux5"]
- and then set up ssh so that authentication works without
- asking for a password.
- The same machine (e.g., linux5 here) can be listed more than once if
- it has multiple CPUs or more RAM. If the local machine is the
- best, you can also increase nr_local_worker. For example:
- nr_local_worker = 2
- Example:
- > python grid.py heart_scale
- [local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889)
- [linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
- [linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333)
- [linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
- .
- .
- .
- If -log2c, -log2g, or -v is not specified, default values are used.
- If your system uses telnet instead of ssh, list the computer names
- in telnet_workers.
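The progress lines shown in the parallel example above can be post-processed with a short script. The sketch below assumes each line has the form `[worker] log2(C) log2(gamma) rate (best ...)`, as in that example; the names `LINE_RE` and `parse_progress` are hypothetical and are not part of grid.py.

```python
import re

# Assumed layout: [worker] log2c log2g rate (best ...)
LINE_RE = re.compile(
    r"^\[(?P<worker>[^\]]+)\]\s+(?P<log2c>-?\d+(?:\.\d+)?)\s+"
    r"(?P<log2g>-?\d+(?:\.\d+)?)\s+(?P<rate>\d+(?:\.\d+)?)"
)


def parse_progress(line):
    """Parse one progress line; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "worker": match.group("worker"),
        "log2c": float(match.group("log2c")),
        "log2g": float(match.group("log2g")),
        "rate": float(match.group("rate")),
    }


sample = [
    "[local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889)",
    "[linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)",
    "[linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333)",
    "[linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)",
]
records = [r for r in map(parse_progress, sample) if r is not None]
# max() keeps the first record among ties.
best = max(records, key=lambda r: r["rate"])
print(2.0 ** best["log2c"], 2.0 ** best["log2g"], best["rate"])
```

Converting the exponents back with `2.0 ** value` recovers the c and g values reported in the "best" part of each line.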
- Part III: LIBSVM format checking tools
- Introduction
- ============
- `svm-train' conducts only a simple check of the input data. To do a
- detailed check, we provide a python script `checkdata.py.'
- Usage: checkdata.py dataset
- This tool was written by Rong-En Fan at National Taiwan University.
- Example
- =======
- > cat bad_data
- 1 3:1 2:4
- > python checkdata.py bad_data
- line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
- Found 1 lines with error.
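The kind of check reported in this example can be approximated as follows. This is a simplified sketch rather than the logic of checkdata.py: the function name `check_line` and its messages are hypothetical, the label is not validated, and treating a repeated index as an error is an assumption.

```python
def check_line(line):
    """Return an error message for a LIBSVM-format line, or None if it looks valid."""
    tokens = line.split()
    if not tokens:
        return "empty line"
    previous_index = None
    for token in tokens[1:]:  # tokens[0] is the label
        index_text, sep, value_text = token.partition(":")
        if not sep:
            return "feature '%s' is not an index:value pair" % token
        try:
            index = int(index_text)
            float(value_text)
        except ValueError:
            return "feature '%s' has a malformed index or value" % token
        # Assumption: indices must strictly increase within a line.
        if previous_index is not None and index <= previous_index:
            return "feature indices must be in ascending order"
        previous_index = index
    return None
```

Applied to the line in bad_data, `check_line("1 3:1 2:4")` returns the ascending-order message, while the reordered line `"1 2:4 3:1"` passes.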