Single header source code tokenizer written in ANSI C

Tokenizer

First, this Tok library has nothing to do with the Tok'ra. :-) Instead, it is a really simple, dependency-free, stb-style single header source code tokenizer written in ANSI C. It has a simple-to-use interface to manipulate the tokens and then concatenate them into a string again.

[[TOC]]

Usage

Just include tok.h. In exactly one of your source files, also define the implementation:

#define TOK_IMPLEMENTATION
#include "tok.h"

Normally it only uses libc, but you can also provide your own functions (and with that, remove even the libc dependency) by overriding the function names with defines, like TOK_REALLOC, TOK_STRLEN, TOK_MEMCPY, TOK_VA_LIST etc.

Language Rules

Tok accepts language rules in a two-dimensional string array, where each element is a NULL-terminated list of simple regexp patterns (lowercase only; matching is case-insensitive), and where the top dimension encodes the token's type. For example:

char *c_comments[] =    { "\\/\\/.*?$", "\\/\\*.*?\\*\\/", NULL };
char *c_precompiler[] = { "#.*?$", NULL };
char *c_operators[] =   { "->", "[=\\<\\>\\+\\-\\*\\/%&\\^\\|!:\\.][=]?", NULL };
char *c_numbers[] =     { "[0-9][0-9bx]?[0-9\\.a-f\\+\\-]*[UL]*", NULL };
char *c_strings[] =     { "\"", "'", "L\"", "L'", NULL };
char *c_separators[] =  { "[", "]", "{", "}", ",", ";", NULL };
char *c_types[] =       { "signed", "unsigned", "char", "short", "int", "long", "float", "double", "true", NULL };
char *c_keywords[] =    { "if", "else", "switch", "case", "for", "while", "do", "break", "continue", "return", NULL };

char **c_rules[] =      { c_comments, c_precompiler, c_operators, c_numbers, c_strings, c_separators, c_types, c_keywords };

You can then use this c_rules language rules set as:

tok_t tok;

tok_new(&tok, c_rules, source_string, -1);
tok_dump(&tok, 0, 0);
tok_free(&tok);

Note this is just an example, not a complete rule set for ANSI C. If you provide an incomplete rule set, the worst that can happen is that some language-specific keywords or operators are reported as variables.
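To make the array layout concrete, here is a minimal sketch that walks such a rules array by hand. It is not how tok.h matches patterns: the helper names (rule_equal_nocase, rule_index) are made up for this example, the number of rule lists is passed explicitly as an assumption, and only literal words are compared, so the regexp entries are never interpreted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two zero terminated strings
 * (ANSI C has no strcasecmp). Returns 1 if equal, 0 otherwise. */
int rule_equal_nocase(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Hypothetical helper: return the index of the first rule list that holds
 * `word` as a literal entry, or -1 if none does. `rules` has the same shape
 * as the rules argument of tok_new(): `nrules` lists, each NULL terminated. */
int rule_index(char ***rules, int nrules, const char *word)
{
    int t, i;
    assert(rules != NULL && word != NULL);
    for (t = 0; t < nrules; t++)
        for (i = 0; rules[t][i] != NULL; i++)
            if (rule_equal_nocase(rules[t][i], word))
                return t;
    return -1;
}
```

With the c_rules array above, rule_index(c_rules, 8, "While") gives 7, the position of c_keywords in the top dimension.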

Tokens

There are tok.num tokens, stored in a plain string array in tok.tokens. The first character of each string is always the token type; the rest is the text from the source code. K.I.S.S. The available token types are:

| Enum Type | Description |
|---|---|
| TOK_COMMENT | A comment, could be multiline if you have used such a pattern |
| TOK_PRECOMPILER | A precompiler directive |
| TOK_OPERATOR | An operator |
| TOK_NUMBER | A number literal |
| TOK_STRING | A string literal |
| TOK_SEPARATOR | A separator. Any whitespace plus the elements you specify to be separators |
| TOK_TYPE | A language built-in variable type |
| TOK_KEYWORD | A language keyword |
| TOK_FUNCTION | A function name (a variable that's followed by an opening parenthesis) |
| TOK_VARIABLE | A variable |

Note that the last two aren't part of the rules set; they are detected automatically. To access tokens, one can do:

if (tok.tokens[i][0] == TOK_KEYWORD)
    printf("The %uth token is '%s' and it is a keyword.\n", i, tok.tokens[i] + 1);
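The "type byte first, text after" layout can also be tried out without the library. In the sketch below the lay_kind enum, its values, and the lay_* helpers are hypothetical stand-ins (the real type codes are the TOK_* enums from tok.h); only the string layout itself is taken from this README.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type codes for illustration only. They start at 1 so that a
 * type byte is never mistaken for the string terminator. */
enum lay_kind { LAY_KEYWORD = 1, LAY_VARIABLE, LAY_OPERATOR };

/* Type of a token string: its first byte. */
int lay_type(const char *token)
{
    assert(token != NULL);
    return (unsigned char)token[0];
}

/* Text of a token string: everything after the type byte. */
const char *lay_text(const char *token)
{
    assert(token != NULL);
    return token + 1;
}

/* Count how many of the `num` tokens have the given type. */
int lay_count(const char *const *tokens, int num, int type)
{
    int i, count = 0;
    assert(tokens != NULL);
    for (i = 0; i < num; i++)
        if (lay_type(tokens[i]) == type)
            count++;
    return count;
}
```

For a hand-built array { "\001if", "\002x", "\003==", "\002y" }, lay_count(tokens, 4, LAY_VARIABLE) gives 2 and lay_text(tokens[0]) points at "if".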

HINT: you can use tok_dump() to dump all the parsed tokens to stdout.

API

Tokenizer

tok_t tok;

int tok_new(tok_t *tok, char ***rules, char *src, int len);

Reads source code from a string and fills in the tokenizer's structure.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| rules | The language rules set to be used |
| src | Zero terminated UTF-8 source string |
| len | Length of string if it's not zero terminated, otherwise -1 |

Returns 1 on success, 0 otherwise.

Detokenizer

int tok_tostr(tok_t *tok, char *dst, int maxlen);

Constructs a string from the list of tokens.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| dst | Destination buffer |
| maxlen | Destination buffer's length (see tok_strlen() below) |

Returns the size of the constructed string (and the string in dst), or -1 if the destination buffer wasn't big enough.

String Length

int tok_strlen(tok_t *tok);

Returns the size of the destination buffer required by tok_tostr() for detokenization.

| Argument | Description |
|---|---|
| tok | The Tok instance |

Returns the size of the constructed string plus one (for the zero terminator).
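The "ask for the size first, then fill the buffer" pairing of tok_strlen() and tok_tostr() can be sketched on a hand-built token array. This is only a model under stated assumptions: join_size() and join_tokens() are hypothetical names, and plain concatenation of the token texts is assumed from the description at the top of this README, not read from tok.h.

```c
#include <assert.h>
#include <string.h>

/* Bytes needed for all token texts joined together, plus one for the
 * zero terminator. Every token is assumed to start with a type byte. */
int join_size(const char *const *tokens, int num)
{
    int i, size = 1;
    assert(tokens != NULL);
    for (i = 0; i < num; i++)
        size += (int)strlen(tokens[i] + 1);   /* skip the type byte */
    return size;
}

/* Join the token texts into dst. Returns the length of the resulting
 * string, or -1 if maxlen is too small to hold it and its terminator. */
int join_tokens(const char *const *tokens, int num, char *dst, int maxlen)
{
    int i, len = 0;
    assert(tokens != NULL && dst != NULL);
    if (maxlen < join_size(tokens, num))
        return -1;
    for (i = 0; i < num; i++) {
        size_t n = strlen(tokens[i] + 1);
        memcpy(dst + len, tokens[i] + 1, n);
        len += (int)n;
    }
    dst[len] = '\0';
    return len;
}
```

For the five tokens x, space, =, space, 1 the required size is 6, the joined string "x = 1" has length 5, and a 5 byte buffer is rejected with -1.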

Delete

int tok_delete(tok_t *tok, int idx);

Removes the token at index idx. The tokens after it shift down by one position to fill the gap.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| idx | Index of the token |

Returns 1 on success, 0 otherwise.

Insert

int tok_insert(tok_t *tok, int idx, char type, char *str);

Inserts a token into the list before the given index idx. Calling it with idx set to -1 or tok.num is the same as calling tok_append().

| Argument | Description |
|---|---|
| tok | The Tok instance |
| idx | Where to insert (use -1 to insert at the end) |
| type | One of the TOK_* type enums |
| str | A zero terminated UTF-8 string |

Returns 1 on success, 0 otherwise.
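The index rules for deleting and inserting described above can be modelled on a small growable array. Everything here is a sketch with made-up names (demo_list, list_insert, list_delete, list_clear); that each token is stored as a private copy is an assumption of this example, not something taken from tok.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the token list: num strings, type byte first. */
typedef struct { char **items; int num; } demo_list;

/* Insert a copy of type+str before idx; idx == -1 or idx == num appends.
 * Returns 1 on success, 0 on a bad index or allocation failure. */
int list_insert(demo_list *l, int idx, char type, const char *str)
{
    char **grown, *tok;
    size_t len;
    assert(l != NULL && str != NULL);
    if (idx == -1)
        idx = l->num;
    if (idx < 0 || idx > l->num)
        return 0;
    len = strlen(str);
    tok = malloc(len + 2);                 /* type byte + text + terminator */
    if (tok == NULL)
        return 0;
    tok[0] = type;
    memcpy(tok + 1, str, len + 1);
    grown = realloc(l->items, (size_t)(l->num + 1) * sizeof *grown);
    if (grown == NULL) {
        free(tok);
        return 0;
    }
    l->items = grown;
    /* open a gap at idx by moving the tail one slot towards the end */
    memmove(grown + idx + 1, grown + idx, (size_t)(l->num - idx) * sizeof *grown);
    grown[idx] = tok;
    l->num++;
    return 1;
}

/* Remove the token at idx; later tokens shift down by one position. */
int list_delete(demo_list *l, int idx)
{
    assert(l != NULL);
    if (idx < 0 || idx >= l->num)
        return 0;
    free(l->items[idx]);
    memmove(l->items + idx, l->items + idx + 1,
            (size_t)(l->num - idx - 1) * sizeof *l->items);
    l->num--;
    return 1;
}

/* Free every token and reset the list to its empty state. */
void list_clear(demo_list *l)
{
    int i;
    assert(l != NULL);
    for (i = 0; i < l->num; i++)
        free(l->items[i]);
    free(l->items);
    l->items = NULL;
    l->num = 0;
}
```

Starting from an empty list, inserting "b" at -1, "a" at 0 and "c" at index num yields a, b, c; deleting index 1 afterwards leaves a, c.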

Replace

int tok_replace(tok_t *tok, int idx, char type, char *str);

Replaces a token at the given index idx.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| idx | Index of the token |
| type | One of the TOK_* type enums |
| str | A zero terminated UTF-8 string |

Returns 1 on success, 0 otherwise.

Append

int tok_append(tok_t *tok, char type, char *str);

Appends a token at the end of the token list.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| type | One of the TOK_* type enums |
| str | A zero terminated UTF-8 string |

Returns 1 on success, 0 otherwise.

Find

int tok_find(tok_t *tok, int idx, char type, char *str);

Looks for a token starting from index idx. It can either match type (if that's not -1) or str (if that's not NULL) or both. This function does not take parentheses into account.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| idx | Index of the token |
| type | One of the TOK_* type enums or -1 (any token) |
| str | A zero terminated UTF-8 string or NULL (any string) |

Returns the index of the first occurrence of the token or -1 if not found.

Find Parenthesis Correctly

int tok_next(tok_t *tok, int idx, char type, char *str);

Looks for a token starting from index idx. It can either match type (if that's not -1) or str (if that's not NULL) or both. It differs from tok_find() in that it takes parentheses into account. For example, given the token list (, a, ,, b, ), ,, c, tok_find(tok,0,-1,",") returns 2, but tok_next(tok,0,-1,",") returns 5.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| idx | Index of the token |
| type | One of the TOK_* type enums or -1 (any token) |
| str | A zero terminated UTF-8 string or NULL (any string) |

Returns the index of the first occurrence of the token at the same parenthesis level or -1 if not found.
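The difference between the two searches can be reproduced with a small stand-alone model. The functions paren_find() and paren_next() below are hypothetical and work on plain strings without a type byte; they only mimic the 2 versus 5 example above and say nothing about how tok.h handles other cases.

```c
#include <assert.h>
#include <string.h>

/* Plain search: index of the first token equal to str at or after idx,
 * or -1 if there is none. Parentheses get no special treatment. */
int paren_find(const char *const *toks, int num, int idx, const char *str)
{
    int i;
    assert(toks != NULL && str != NULL);
    for (i = idx < 0 ? 0 : idx; i < num; i++)
        if (strcmp(toks[i], str) == 0)
            return i;
    return -1;
}

/* Parenthesis-aware search: like paren_find(), but a token only counts
 * while no group opened at or after idx is still unclosed. */
int paren_next(const char *const *toks, int num, int idx, const char *str)
{
    int i, depth = 0;
    assert(toks != NULL && str != NULL);
    for (i = idx < 0 ? 0 : idx; i < num; i++) {
        if (depth == 0 && strcmp(toks[i], str) == 0)
            return i;
        if (strcmp(toks[i], "(") == 0)
            depth++;                       /* entering a nested group */
        else if (strcmp(toks[i], ")") == 0 && depth > 0)
            depth--;                       /* leaving a nested group */
    }
    return -1;
}
```

On the list (, a, ,, b, ), ,, c the plain search from index 0 stops at the comma inside the parentheses (index 2), while the parenthesis-aware one skips the group and stops at index 5.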

Match Pattern

int tok_match(tok_t *tok, int idx, int num, ...);

Matches a token pattern starting from index idx.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| idx | Index of the token |
| num | Number of tokens to match |
| ... | List of TOK_* type enums |

Returns 1 if the pattern matches, 0 otherwise.
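Matching a run of token types through a variable argument list can be sketched like this. The pat_kind enum and pat_match() are hypothetical and operate on a hand-built array with an explicit token count; the sketch assumes each type is passed as an int-sized argument, which is how enum constants travel through varargs in C.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical type codes; the real ones are the TOK_* enums from tok.h. */
enum pat_kind { PAT_TYPE = 1, PAT_SEPARATOR, PAT_VARIABLE, PAT_OPERATOR };

/* Return 1 if the `num` tokens starting at idx carry exactly the listed
 * type bytes, 0 otherwise (also when the pattern runs past the end). */
int pat_match(const char *const *tokens, int count, int idx, int num, ...)
{
    va_list args;
    int i, ok = 1;
    assert(tokens != NULL);
    if (idx < 0 || num < 0 || idx + num > count)
        return 0;
    va_start(args, num);
    for (i = 0; i < num; i++) {
        int want = va_arg(args, int);      /* enum constants arrive as int */
        if ((unsigned char)tokens[idx + i][0] != want) {
            ok = 0;
            break;
        }
    }
    va_end(args);
    return ok;
}
```

For the tokens int, space, x, = (typed as type, separator, variable, operator), pat_match(tokens, 4, 0, 3, PAT_TYPE, PAT_SEPARATOR, PAT_VARIABLE) gives 1, while leaving the separator out of the pattern gives 0.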

Free

void tok_free(tok_t *tok);

Frees the token list and clears the instance to zero.

| Argument | Description |
|---|---|
| tok | The Tok instance |

Debugging

void tok_dump(tok_t *tok, int sidx, int eidx);

This function is only available if TOK_NODEBUG isn't defined. It simply dumps the tokens to stdout using printf.

| Argument | Description |
|---|---|
| tok | The Tok instance |
| sidx | Starting from this index (inclusive, use 0 to dump all) |
| eidx | Ending with this index (exclusive, use 0 to dump all) |

License

This library is licensed under the terms of the MIT license.

Cheers, bzt