This directory includes some useful tools:

1. subset selection tools.
2. parameter selection tools.
3. LIBSVM format checking tools.

Part I: Subset selection tools

Introduction
============

Training on large data sets is time consuming. Sometimes one should
work on a smaller subset first. The python script subset.py randomly
selects a specified number of samples. For classification data, we
provide a stratified selection so that the subset has the same class
distribution as the full data set.

Usage: subset.py [options] dataset number [output1] [output2]

This script selects a subset of the given data set.

options:
-s method : method of selection (default 0)
     0 -- stratified selection (classification only)
     1 -- random selection

output1 : the subset (optional)
output2 : the rest of data (optional)

If output1 is omitted, the subset will be printed on the screen.

Example
=======

> python subset.py heart_scale 100 file1 file2

From heart_scale, 100 samples are randomly selected, using the default
stratified method (-s 0), and stored in file1. All remaining instances
are stored in file2.
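
The idea behind stratified selection can be illustrated with a short
sketch. This is not the code of subset.py. The function name
stratified_subset, its arguments, and the proportional rounding rule
are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def stratified_subset(labels, subset_size, seed=None):
    """Pick subset_size indices so that class proportions are roughly kept.

    Hypothetical helper for illustration; subset.py may allocate
    per-class counts differently.
    """
    if subset_size > len(labels):
        raise ValueError("subset_size exceeds the number of samples")

    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    selected = []
    remaining_quota = subset_size
    remaining_total = len(labels)
    for members in by_class.values():
        # Give this class its proportional share of what is still to be picked.
        # Updating the counters as we go makes the quotas sum to subset_size.
        quota = round(len(members) * remaining_quota / remaining_total)
        selected.extend(rng.sample(members, quota))
        remaining_quota -= quota
        remaining_total -= len(members)

    return sorted(selected)
```

With 60 samples of one class and 40 of another, asking for 10 samples
yields 6 and 4 respectively, which matches the 60/40 split of the full
data set.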

Part II: Parameter Selection Tools

Introduction
============

grid.py is a parameter selection tool for C-SVM classification using
the RBF (radial basis function) kernel. It uses the cross validation
(CV) technique to estimate the accuracy of each parameter combination
in the specified range and helps you decide the best parameters for
your problem.

grid.py directly executes libsvm binaries (so no python binding is
needed) for cross validation and then draws a contour plot of CV
accuracy using gnuplot. You must have libsvm and gnuplot installed
before using it. The package gnuplot is available at
http://www.gnuplot.info/

On Mac OS X, the precompiled gnuplot file needs the library AquaTerm,
which thus must be installed as well. In addition, this version of
gnuplot does not support png, so you need to change "set term png
transparent small" to use another image format. For example, you may
use "set term pbm small color".

Usage: grid.py [-log2c begin,end,step] [-log2g begin,end,step] [-v fold]
       [-svmtrain pathname] [-gnuplot pathname] [-out pathname] [-png pathname]
       [additional parameters for svm-train] dataset

The program conducts v-fold cross validation using parameter C (and gamma)
= 2^begin, 2^(begin+step), ..., 2^end.
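
To make the begin,end,step notation concrete, the following sketch
enumerates the (C, gamma) pairs that such a range describes. The
function names log2_range and parameter_grid are hypothetical and are
not taken from grid.py; the sketch only mirrors the formula above.

```python
def log2_range(begin, end, step):
    """Return the exponents begin, begin+step, ... up to and including end."""
    if step == 0:
        raise ValueError("step must be nonzero")
    exponents = []
    current = begin
    # A negative step counts downward, so the comparison flips.
    while (step > 0 and current <= end) or (step < 0 and current >= end):
        exponents.append(current)
        current += step
    return exponents


def parameter_grid(log2c_spec, log2g_spec):
    """Return every (C, gamma) pair for two (begin, end, step) triples."""
    return [
        (2.0 ** log2c, 2.0 ** log2g)
        for log2c in log2_range(*log2c_spec)
        for log2g in log2_range(*log2g_spec)
    ]
```

For instance, parameter_grid((-5, 5, 1), (-4, 0, 1)) covers 11 values
of C and 5 values of gamma, so 55 combinations would each need one
cross validation run.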

You can specify where the libsvm executable and gnuplot are using the
-svmtrain and -gnuplot parameters.

For Windows users, please use pgnuplot.exe. If you are using gnuplot
3.7.1, please upgrade to version 3.7.3 or higher. The version 3.7.1
has a bug. If you use cygwin on Windows, please use gnuplot-x11.

Example
=======

> python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale

Users (in particular MS Windows users) may need to specify the path of
executable files. You can either change paths at the beginning of
grid.py or specify them on the command line. For example,

> grid.py -log2c -5,5,1 -svmtrain c:\libsvm\windows\svm-train.exe -gnuplot c:\tmp\gnuplot\bin\pgnuplot.exe -v 10 heart_scale

Output: two files
dataset.png: the CV accuracy contour plot generated by gnuplot
dataset.out: the CV accuracy at each (log2(C),log2(gamma))

Parallel grid search
====================

You can conduct a parallel grid search by dispatching jobs to a
cluster of computers which share the same file system. First, add
machine names in grid.py:

ssh_workers = ["linux1", "linux5", "linux5"]

and then set up your ssh so that the authentication works without
asking for a password.

The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or has more RAM. If the local machine is the
best, you can also increase nr_local_worker. For example:

nr_local_worker = 2

Example:

> python grid.py heart_scale
[local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889)
[linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
[linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333)
[linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
.
.
.

If -log2c, -log2g, or -v is not specified, default values are used.

If your system uses telnet instead of ssh, list the computer names
in telnet_workers.
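
If you want to post-process the progress lines shown in the example
above, a small parser along the following lines may help. It assumes
the line layout is exactly as in that sample: worker name in brackets,
then log2(C), log2(gamma), and the CV rate, followed by the best
values so far. The column meaning is inferred from the sample (for
example, -1 -7 appears together with c=0.5, g=0.0078125), and the name
parse_progress is hypothetical.

```python
import re

# A plain decimal number, optionally signed, optionally with an exponent.
NUMBER = r"[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?"

# Assumed layout: [worker] log2c log2g rate (best c=..., g=..., rate=...)
PROGRESS_LINE = re.compile(
    rf"^\[(?P<worker>[^\]]+)\]\s+(?P<log2c>{NUMBER})\s+(?P<log2g>{NUMBER})\s+"
    rf"(?P<rate>{NUMBER})\s+"
    rf"\(best c=(?P<best_c>{NUMBER}), g=(?P<best_g>{NUMBER}), "
    rf"rate=(?P<best_rate>{NUMBER})\)$"
)


def parse_progress(line):
    """Parse one progress line into a dict, or return None if it does not match."""
    match = PROGRESS_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    worker = fields.pop("worker")
    # Every remaining field is numeric.
    result = {name: float(text) for name, text in fields.items()}
    result["worker"] = worker
    return result
```

Counting the parsed "worker" values over a whole run is one simple way
to see how the jobs were spread across the machines listed in
ssh_workers.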

Part III: LIBSVM format checking tools

Introduction
============

`svm-train' conducts only a simple check of the input data. To do a
detailed check, we provide a python script `checkdata.py'.

Usage: checkdata.py dataset

This tool was written by Rong-En Fan at National Taiwan University.

Example
=======

> cat bad_data
1 3:1 2:4
> python checkdata.py bad_data
line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
Found 1 lines with error.
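
The ascending-index rule from the example can be sketched as follows.
This covers only that single check and is not the code of
checkdata.py, which may test more conditions. The function name
check_line is hypothetical, and treating a repeated index as an error
is an assumption.

```python
def check_line(line):
    """Return an error message for one data line, or None if no problem is found.

    Only index:value syntax and ascending feature indices are checked.
    """
    tokens = line.split()
    if not tokens:
        return "empty line"

    previous_index = None
    previous_token = None
    # tokens[0] is the label; the rest should be index:value pairs.
    for token in tokens[1:]:
        index_text, separator, value_text = token.partition(":")
        if not separator:
            return "feature '%s' is not an index:value pair" % token
        try:
            index = int(index_text)
            float(value_text)
        except ValueError:
            return "feature '%s' is not an index:value pair" % token
        # Assumption: indices must strictly increase, so duplicates also fail.
        if previous_index is not None and index <= previous_index:
            return (
                "feature indices must be in an ascending order, "
                "previous/current features %s %s" % (previous_token, token)
            )
        previous_index, previous_token = index, token
    return None
```

Applied to the line in bad_data, check_line("1 3:1 2:4") reports the
same ordering problem for the pair 3:1 2:4, while the reordered line
"1 2:4 3:1" passes.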