1. FTS3 Tokenizers

  When creating a new full-text table, FTS3 allows the user to select
  the text tokenizer implementation to be used when indexing text
  by specifying a "tokenize" clause as part of the CREATE VIRTUAL TABLE
  statement:

    CREATE VIRTUAL TABLE <table-name> USING fts3(
      <columns ...> [, tokenize <tokenizer-name> [<tokenizer-args>]]
    );

  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
  "simple", "porter" and "unicode61".

  <tokenizer-args> should consist of zero or more white-space separated
  arguments to pass to the selected tokenizer implementation. The
  interpretation of the arguments, if any, depends on the individual
  tokenizer.
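
  As a rough illustration of the syntax above, the following C sketch
  assembles a CREATE VIRTUAL TABLE statement with an optional tokenize
  clause. The helper name buildFts3Ddl() and its parameters are
  hypothetical and not part of FTS3. It assumes every input is a trusted
  identifier, since nothing is quoted or escaped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
** Hypothetical helper (not part of FTS3): write a CREATE VIRTUAL TABLE
** statement for an fts3 table into zBuf. zTokenizer and zArgs may be 0.
** Inputs are assumed to be trusted; nothing is quoted or escaped.
** Returns 0 on success, or -1 if the statement does not fit in nBuf bytes.
*/
int buildFts3Ddl(
  char *zBuf, size_t nBuf,
  const char *zTable,
  const char *zColumns,
  const char *zTokenizer,
  const char *zArgs
){
  int n;
  assert( zBuf!=0 && nBuf>0 );
  if( zTokenizer==0 ){
    /* No tokenize clause: FTS3 falls back to its default tokenizer. */
    n = snprintf(zBuf, nBuf, "CREATE VIRTUAL TABLE %s USING fts3(%s);",
                 zTable, zColumns);
  }else if( zArgs==0 ){
    n = snprintf(zBuf, nBuf,
                 "CREATE VIRTUAL TABLE %s USING fts3(%s, tokenize %s);",
                 zTable, zColumns, zTokenizer);
  }else{
    /* Tokenizer arguments follow the name, separated by white-space. */
    n = snprintf(zBuf, nBuf,
                 "CREATE VIRTUAL TABLE %s USING fts3(%s, tokenize %s %s);",
                 zTable, zColumns, zTokenizer, zArgs);
  }
  return (n<0 || (size_t)n>=nBuf) ? -1 : 0;
}
```

  For instance, buildFts3Ddl(z, sizeof(z), "docs", "title, body", "porter", 0)
  leaves the following statement in z:
  CREATE VIRTUAL TABLE docs USING fts3(title, body, tokenize porter);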

2. Custom Tokenizers

  FTS3 allows users to provide custom tokenizer implementations. The
  interface used to create a new tokenizer is defined and described in
  the fts3_tokenizer.h source file.

  Registering a new FTS3 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to various callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts3_tokenizer.h) is called
  "sqlite3_tokenizer_module".

  FTS3 does not expose a C function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS3 through the SQL
  engine by evaluating a special scalar function, "fts3_tokenizer()".
  The fts3_tokenizer() function may be called with one or two arguments,
  as follows:

    SELECT fts3_tokenizer(<tokenizer-name>);
    SELECT fts3_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);

  Where <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it is
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. If no such tokenizer exists, an SQL exception
  (error) is raised.
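
  To make the blob encoding concrete: the blob holds the raw bytes of the
  pointer itself, not a copy of the structure it points to. The sketch
  below shows that round trip in plain C, without SQLite. The names
  demo_module, encodePointer() and decodePointer() are hypothetical
  stand-ins used only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module. */
typedef struct demo_module demo_module;
struct demo_module { int iVersion; };

/*
** Copy the bytes of the pointer p (not the structure *p) into aBlob.
** Returns the number of bytes written, which is always sizeof(p).
*/
size_t encodePointer(unsigned char *aBlob, size_t nBlob, const demo_module *p){
  assert( aBlob!=0 && nBlob>=sizeof(p) );
  memcpy(aBlob, &p, sizeof(p));
  return sizeof(p);
}

/*
** Recover a pointer from a blob produced by encodePointer(). Returns 0
** if the blob is not exactly the size of a pointer.
*/
const demo_module *decodePointer(const unsigned char *aBlob, size_t nBlob){
  const demo_module *p = 0;
  if( aBlob!=0 && nBlob==sizeof(p) ){
    memcpy(&p, aBlob, sizeof(p));
  }
  return p;
}
```

  A blob of this kind is only meaningful inside the process that created
  it, because it is nothing more than a memory address.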

  SECURITY: If the fts3 extension is used in an environment where potentially
  malicious users may execute arbitrary SQL (e.g. Gears), they should be
  prevented from invoking the fts3_tokenizer() function. The
  fts3_tokenizer() function is disabled by default. It is only enabled
  by setting the SQLITE_DBCONFIG_ENABLE_FTS3_TOKENIZER option through
  sqlite3_db_config(). Do not enable it in security sensitive environments.

  See "Sample code" below for an example of calling the fts3_tokenizer()
  function from C code.

3. ICU Library Tokenizers

  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts3_tokenizer.h) of this tokenizer may be
  an ICU locale identifier, such as "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

    "CREATE VIRTUAL TABLE thai_text USING fts3(text, tokenize icu th_TH)"

  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

  When using the ICU tokenizer this way, it is safe to overwrite the
  contents of the strings returned by the xNext() method (see
  fts3_tokenizer.h).
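
  As one illustration of the extra processing a wrapping tokenizer might
  apply, the sketch below tests whether a token consists only of
  punctuation, so that the wrapper's xNext() could skip it and ask the
  ICU tokenizer for the next token. The helper name isPunctOnly() is
  hypothetical. Recognising only ASCII punctuation in the default C
  locale is a simplifying assumption, and a real implementation would
  more likely use ICU's own character classification.

```c
#include <assert.h>
#include <ctype.h>

/*
** Hypothetical helper: return 1 if the nToken-byte token zToken is
** non-empty and every byte is an ASCII punctuation character, else 0.
** In the default C locale, bytes of multi-byte UTF-8 characters are not
** classified as punctuation, so such tokens are always kept.
*/
int isPunctOnly(const char *zToken, int nToken){
  int i;
  assert( zToken!=0 || nToken==0 );
  for(i=0; i<nToken; i++){
    if( !ispunct((unsigned char)zToken[i]) ) return 0;
  }
  return nToken>0;
}
```

  With this helper, isPunctOnly("...", 3) returns 1 and
  isPunctOnly("a.", 2) returns 0.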

4. Sample code

  The following two code samples illustrate the way C code should invoke
  the fts3_tokenizer() scalar function:

      #include <string.h>
      #include "sqlite3.h"
      #include "fts3_tokenizer.h"

      /* Register module p as the tokenizer named zName. */
      int registerTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module *p
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts3_tokenizer(?, ?)";

        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        /* The blob holds the bytes of the pointer itself, not the structure. */
        sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
        sqlite3_step(pStmt);

        /* sqlite3_finalize() reports any error raised by sqlite3_step(). */
        return sqlite3_finalize(pStmt);
      }

      /* Look up the tokenizer named zName. *pp is left as 0 if not found. */
      int queryTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module **pp
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts3_tokenizer(?)";

        *pp = 0;
        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        if( SQLITE_ROW==sqlite3_step(pStmt) ){
          if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
            /* Copy the pointer bytes out of the blob into *pp. */
            memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp));
          }
        }

        return sqlite3_finalize(pStmt);
      }