This is flex.info, produced by makeinfo version 4.5 from flex.texi.

INFO-DIR-SECTION Programming
START-INFO-DIR-ENTRY
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
END-INFO-DIR-ENTRY

   The flex manual is placed under the same licensing conditions as the rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California. All rights reserved.

   This code is derived from software contributed to Berkeley by Vern Paxson.

   The United States Government has rights in this work pursuant to contract no. DE-AC03-76SF00098 between the United States Department of Energy and the University of California.

   Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

   Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info, Node: Debugging Options, Next: Miscellaneous Options, Prev: Options for Scanner Speed and Size, Up: Scanner Options

Debugging Options
=================

`-b, --backup, `%option backup''
     Generate backing-up information to `lex.backup'. This is a list of scanner states which require backing up and the input characters on which they do so. By adding rules one can remove backing-up states.
     If _all_ backing-up states are eliminated and `-Cf' or `-CF' is used, the generated scanner will run faster (see the `--perf-report' flag). Only users who wish to squeeze every last cycle out of their scanners need worry about this option (*note Performance::).

`-d, --debug, `%option debug''
     makes the generated scanner run in "debug" mode. Whenever a pattern is recognized and the global variable `yy_flex_debug' is non-zero (which is the default), the scanner will write to `stderr' a line of the form:

          --accepting rule at line 53 ("the matched text")

     The line number refers to the location of the rule in the file defining the scanner (i.e., the file that was fed to flex). Messages are also generated when the scanner backs up, accepts the default rule, reaches the end of its input buffer (or encounters a NUL; at this point, the two look the same as far as the scanner's concerned), or reaches an end-of-file.

`-p, --perf-report, `%option perf-report''
     generates a performance report to `stderr'. The report consists of comments regarding features of the `flex' input file which will cause a serious loss of performance in the resulting scanner. If you give the flag twice, you will also get comments regarding features that lead to minor performance losses.

     Note that the use of `REJECT' and variable trailing context (*note Limitations::) entails a substantial performance penalty; use of `yymore()', the `^' operator, and the `--interactive' flag entail minor performance penalties.

`-s, --nodefault, `%option nodefault''
     causes the _default rule_ (that unmatched scanner input is echoed to `stdout') to be suppressed. If the scanner encounters input that does not match any of its rules, it aborts with an error. This option is useful for finding holes in a scanner's rule set.

`-T, --trace, `%option trace''
     makes `flex' run in "trace" mode.
     It will generate a lot of messages to `stderr' concerning the form of the input and the resultant non-deterministic and deterministic finite automata. This option is mostly for use in maintaining `flex'.

`-w, --nowarn, `%option nowarn''
     suppresses warning messages.

`-v, --verbose, `%option verbose''
     specifies that `flex' should write to `stderr' a summary of statistics regarding the scanner it generates. Most of the statistics are meaningless to the casual `flex' user, but the first line identifies the version of `flex' (same as reported by `--version'), and the next line the flags used when generating the scanner, including those that are on by default.

`--warn, `%option warn''
     warns about certain things. In particular, if the default rule can be matched but no default rule has been given, `flex' will warn you. We recommend always using this option.

File: flex.info, Node: Miscellaneous Options, Prev: Debugging Options, Up: Scanner Options

Miscellaneous Options
=====================

`-c'
     is a do-nothing option included for POSIX compliance.

`-h, -?, --help'
     generates a "help" summary of `flex''s options to `stdout' and then exits.

`-n'
     is another do-nothing option included only for POSIX compliance.

`-V, --version'
     prints the version number to `stdout' and exits.

File: flex.info, Node: Performance, Next: Cxx, Prev: Scanner Options, Up: Top

Performance Considerations
**************************

   The main design goal of `flex' is that it generate high-performance scanners. It has been optimized for dealing well with large sets of rules. Aside from the effects on scanner speed of the table compression `-C' options outlined above, there are a number of options/actions which degrade performance.
These are, from most expensive to least:

         REJECT
         arbitrary trailing context
         pattern sets that require backing up
         %option yylineno
         %array
         %option interactive
         %option always-interactive
         `^' beginning-of-line operator
         yymore()

   with the first two being quite expensive and the last two being quite cheap. Note also that `unput()' is implemented as a routine call that potentially does quite a bit of work, while `yyless()' is a quite-cheap macro. So if you are just putting back some excess text you scanned, use `yyless()'.

   `REJECT' should be avoided at all costs when performance is important. It is a particularly expensive option.

   There is one case when `%option yylineno' can be expensive. That is when your patterns match long tokens that could _possibly_ contain a newline character. There is no performance penalty for rules that can not possibly match newlines, since flex does not need to check them for newlines. In general, you should avoid rules such as `[^f]+', which match very long tokens, including newlines, and may possibly match your entire file! A better approach is to separate `[^f]+' into two rules:

         %option yylineno
         %%
             [^f\n]+
             \n+

   The above scanner does not incur a performance penalty.

   Getting rid of backing up is messy and often may be an enormous amount of work for a complicated scanner. In principle, one begins by using the `-b' flag to generate a `lex.backup' file. For example, on the input:

         %%
         foo        return TOK_KEYWORD;
         foobar     return TOK_KEYWORD;

   the file looks like:

         State #6 is non-accepting -
          associated rule line numbers:
                2       3
          out-transitions: [ o ]
          jam-transitions: EOF [ \001-n  p-\177 ]

         State #8 is non-accepting -
          associated rule line numbers:
                3
          out-transitions: [ a ]
          jam-transitions: EOF [ \001-`  b-\177 ]

         State #9 is non-accepting -
          associated rule line numbers:
                3
          out-transitions: [ r ]
          jam-transitions: EOF [ \001-q  s-\177 ]

         Compressed tables always back up.

   The first few lines tell us that there's a scanner state in which it can make a transition on an 'o' but not on any other character, and that in that state the currently scanned text does not match any rule. The state occurs when trying to match the rules found at lines 2 and 3 in the input file. If the scanner is in that state and then reads something other than an 'o', it will have to back up to find a rule which is matched. With a bit of headscratching one can see that this must be the state it's in when it has seen `fo'. When this has happened, if anything other than another `o' is seen, the scanner will have to back up to simply match the `f' (by the default rule).

   The comment regarding State #8 indicates there's a problem when `foob' has been scanned. Indeed, on any character other than an `a', the scanner will have to back up to accept "foo". Similarly, the comment for State #9 concerns when `fooba' has been scanned and an `r' does not follow.

   The final comment reminds us that there's no point going to all the trouble of removing backing up from the rules unless we're using `-Cf' or `-CF', since there's no performance gain doing so with compressed scanners.

   The way to remove the backing up is to add "error" rules:

         %%
         foo         return TOK_KEYWORD;
         foobar      return TOK_KEYWORD;

         fooba       |
         foob        |
         fo          {
                     /* false alarm, not really a keyword */
                     return TOK_ID;
                     }

   Eliminating backing up among a list of keywords can also be done using a "catch-all" rule:

         %%
         foo         return TOK_KEYWORD;
         foobar      return TOK_KEYWORD;

         [a-z]+      return TOK_ID;

   This is usually the best solution when appropriate.

   Backing up messages tend to cascade. With a complicated set of rules it's not uncommon to get hundreds of messages. If one can decipher them, though, it often only takes a dozen or so rules to eliminate the backing up (though it's easy to make a mistake and have an error rule accidentally match a valid token. A possible future `flex' feature will be to automatically add rules to eliminate backing up).
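
   The three non-accepting states reported in `lex.backup' can be modelled in a few lines of C. The following is only a sketch and not code that `flex' generates: the function names are hypothetical, and the automaton is hard-wired for the two rules `foo' and `foobar' plus the default rule. It reports how many characters the scanner would have to give back once it can make no further transition.

```c
#include <stddef.h>

/* How many leading characters of s follow the path f-o-o-b-a-r.
   This is how far the automaton advances before it jams. */
static size_t chars_consumed(const char *s)
{
    static const char path[] = "foobar";
    size_t i = 0;

    while (i < 6 && s[i] == path[i])
        i++;
    return i;
}

/* Length of the longest prefix that some rule accepts:
   6 for "foobar", 3 for "foo", 1 for the default rule on "f". */
static size_t last_accept(size_t consumed)
{
    if (consumed >= 6) return 6;
    if (consumed >= 3) return 3;
    if (consumed >= 1) return 1;
    return 0;
}

/* Characters that must be given back when the automaton jams. */
size_t backup_distance(const char *s)
{
    size_t consumed = chars_consumed(s);

    return consumed - last_accept(consumed);
}
```

   In this model the distance is non-zero exactly for inputs that stop after `fo', `foob' or `fooba', which are the three states listed in `lex.backup'. An error rule or a catch-all rule turns each of those states into an accepting one, so nothing has to be given back.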

   It's important to keep in mind that you gain the benefits of eliminating backing up only if you eliminate _every_ instance of backing up. Leaving just one means you gain nothing.

   _Variable_ trailing context (where both the leading and trailing parts do not have a fixed length) entails almost the same performance loss as `REJECT' (i.e., substantial). So when possible a rule like:

         %%
         mouse|rat/(cat|dog)   run();

   is better written:

         %%
         mouse/cat|dog         run();
         rat/cat|dog           run();

   or as

         %%
         mouse|rat/cat         run();
         mouse|rat/dog         run();

   Note that here the special '|' action does _not_ provide any savings, and can even make things worse (*note Limitations::).

   Another area where the user can increase a scanner's performance (and one that's easier to implement) arises from the fact that the longer the tokens matched, the faster the scanner will run. This is because with long tokens the processing of most input characters takes place in the (short) inner scanning loop, and does not often have to go through the additional work of setting up the scanning environment (e.g., `yytext') for the action. Recall the scanner for C comments:

         %x comment
         %%
                 int line_num = 1;

         "/*"         BEGIN(comment);

         <comment>[^*\n]*
         <comment>"*"+[^*/\n]*
         <comment>\n             ++line_num;
         <comment>"*"+"/"        BEGIN(INITIAL);

   This could be sped up by writing it as:

         %x comment
         %%
                 int line_num = 1;

         "/*"         BEGIN(comment);

         <comment>[^*\n]*
         <comment>[^*\n]*\n      ++line_num;
         <comment>"*"+[^*/\n]*
         <comment>"*"+[^*/\n]*\n ++line_num;
         <comment>"*"+"/"        BEGIN(INITIAL);

   Now instead of each newline requiring the processing of another action, recognizing the newlines is distributed over the other rules to keep the matched text as long as possible. Note that _adding_ rules does _not_ slow down the scanner! The speed of the scanner is independent of the number of rules or (modulo the considerations given at the beginning of this section) how complicated the rules are with regard to operators such as `*' and `|'.
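
   The saving from folding newlines into the neighbouring rules can be made concrete by counting how many actions each variant would run. The C below is a rough sketch under simplifying assumptions: the function names are hypothetical, the text is taken to be comment body without any `*' characters, and only the `[^*\n]*' and `\n' rules (first variant) or the `[^*\n]*' and `[^*\n]*\n' rules (second variant) are modelled.

```c
#include <stddef.h>

/* First variant: a run of ordinary characters is one match and
   every newline is a further match with its own action. */
size_t actions_separate(const char *text)
{
    size_t n = 0;

    while (*text) {
        if (*text == '\n') {
            text++;
        } else {
            while (*text && *text != '\n')
                text++;
        }
        n++;
    }
    return n;
}

/* Second variant: the newline is matched together with the run
   that precedes it, so each line costs a single match. */
size_t actions_merged(const char *text)
{
    size_t n = 0;

    while (*text) {
        while (*text && *text != '\n')
            text++;
        if (*text == '\n')
            text++;
        n++;
    }
    return n;
}
```

   For the text "ab\ncd\n" the first function counts four matches and the second counts two. Lines that are already empty gain nothing, since a lone newline is one match either way.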

   A final example in speeding up a scanner: suppose you want to scan through a file containing identifiers and keywords, one per line and with no other extraneous characters, and recognize all the keywords. A natural first approach is:

         %%
         asm      |
         auto     |
         break    |
         ... etc ...
         volatile |
         while    /* it's a keyword */

         .|\n     /* it's not a keyword */

   To eliminate the back-tracking, introduce a catch-all rule:

         %%
         asm      |
         auto     |
         break    |
         ... etc ...
         volatile |
         while    /* it's a keyword */

         [a-z]+   |
         .|\n     /* it's not a keyword */

   Now, if it's guaranteed that there's exactly one word per line, then we can reduce the total number of matches by half by merging in the recognition of newlines with that of the other tokens:

         %%
         asm\n      |
         auto\n     |
         break\n    |
         ... etc ...
         volatile\n |
         while\n    /* it's a keyword */

         [a-z]+\n   |
         .|\n       /* it's not a keyword */

   One has to be careful here, as we have now reintroduced backing up into the scanner. In particular, while _we_ know that there will never be any characters in the input stream other than letters or newlines, `flex' can't figure this out, and it will plan for possibly needing to back up when it has scanned a token like `auto' and then the next character is something other than a newline or a letter. Previously it would then just match the `auto' rule and be done, but now it has no `auto' rule, only an `auto\n' rule. To eliminate the possibility of backing up, we could either duplicate all rules but without final newlines, or, since we never expect to encounter such an input and therefore don't care how it's classified, we can introduce one more catch-all rule, this one which doesn't include a newline:

         %%
         asm\n      |
         auto\n     |
         break\n    |
         ... etc ...
         volatile\n |
         while\n    /* it's a keyword */

         [a-z]+\n   |
         [a-z]+     |
         .|\n       /* it's not a keyword */

   Compiled with `-Cf', this is about as fast as one can get a `flex' scanner to go for this particular problem.

   A final note: `flex' is slow when matching `NUL's, particularly when a token contains multiple `NUL's.
It's best to write rules which match _short_ amounts of text if it's anticipated that the text will often include `NUL's.

   Another final note regarding performance: as mentioned in *Note Matching::, dynamically resizing `yytext' to accommodate huge tokens is a slow process because it presently requires that the (huge) token be rescanned from the beginning. Thus if performance is vital, you should attempt to match "large" quantities of text but not "huge" quantities, where the cutoff between the two is at about 8K characters per token.

File: flex.info, Node: Cxx, Next: Reentrant, Prev: Performance, Up: Top

Generating C++ Scanners
***********************

   *IMPORTANT*: the present form of the scanning class is _experimental_ and may change considerably between major releases.

   `flex' provides two different ways to generate scanners for use with C++. The first way is to simply compile a scanner generated by `flex' using a C++ compiler instead of a C compiler. You should not encounter any compilation errors (*note Reporting Bugs::). You can then use C++ code in your rule actions instead of C code. Note that the default input source for your scanner remains `yyin', and default echoing is still done to `yyout'. Both of these remain `FILE *' variables and not C++ _streams_.

   You can also use `flex' to generate a C++ scanner class, using the `-+' option (or, equivalently, `%option c++'), which is automatically specified if the name of the `flex' executable ends in a '+', such as `flex++'. When using this option, `flex' defaults to generating the scanner to the file `lex.yy.cc' instead of `lex.yy.c'. The generated scanner includes the header file `FlexLexer.h', which defines the interface to two C++ classes.

   The first class, `FlexLexer', provides an abstract base class defining the general scanner class interface. It provides the following member functions:

`const char* YYText()'
     returns the text of the most recently matched token, the equivalent of `yytext'.

`int YYLeng()'
     returns the length of the most recently matched token, the equivalent of `yyleng'.

`int lineno() const'
     returns the current input line number (see `%option yylineno'), or `1' if `%option yylineno' was not used.

`void set_debug( int flag )'
     sets the debugging flag for the scanner, equivalent to assigning to `yy_flex_debug' (*note Scanner Options::). Note that you must build the scanner using `%option debug' to include debugging information in it.

`int debug() const'
     returns the current setting of the debugging flag.

   Also provided are member functions equivalent to `yy_switch_to_buffer()', `yy_create_buffer()' (though the first argument is an `istream*' object pointer and not a `FILE*'), `yy_flush_buffer()', `yy_delete_buffer()', and `yyrestart()' (again, the first argument is an `istream*' object pointer).

   The second class defined in `FlexLexer.h' is `yyFlexLexer', which is derived from `FlexLexer'. It defines the following additional member functions:

`yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )'
     constructs a `yyFlexLexer' object using the given streams for input and output. If not specified, the streams default to `cin' and `cout', respectively.

`virtual int yylex()'
     performs the same role as `yylex()' does for ordinary `flex' scanners: it scans the input stream, consuming tokens, until a rule's action returns a value. If you derive a subclass `S' from `yyFlexLexer' and want to access the member functions and variables of `S' inside `yylex()', then you need to use `%option yyclass="S"' to inform `flex' that you will be using that subclass instead of `yyFlexLexer'. In this case, rather than generating `yyFlexLexer::yylex()', `flex' generates `S::yylex()' (and also generates a dummy `yyFlexLexer::yylex()' that calls `yyFlexLexer::LexerError()' if called).

`virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)'
     reassigns `yyin' to `new_in' (if non-null) and `yyout' to `new_out' (if non-null), deleting the previous input buffer if `yyin' is reassigned.

`int yylex( istream* new_in, ostream* new_out = 0 )'
     first switches the input streams via `switch_streams( new_in, new_out )' and then returns the value of `yylex()'.

   In addition, `yyFlexLexer' defines the following protected virtual functions which you can redefine in derived classes to tailor the scanner:

`virtual int LexerInput( char* buf, int max_size )'
     reads up to `max_size' characters into `buf' and returns the number of characters read. To indicate end-of-input, return 0 characters. Note that `interactive' scanners (see the `-B' and `-I' flags in *Note Scanner Options::) define the macro `YY_INTERACTIVE'. If you redefine `LexerInput()' and need to take different actions depending on whether or not the scanner might be scanning an interactive input source, you can test for the presence of this name via `#ifdef' statements.

`virtual void LexerOutput( const char* buf, int size )'
     writes out `size' characters from the buffer `buf', which, while `NUL'-terminated, may also contain internal `NUL's if the scanner's rules can match text with `NUL's in them.

`virtual void LexerError( const char* msg )'
     reports a fatal error message. The default version of this function writes the message to the stream `cerr' and exits.

   Note that a `yyFlexLexer' object contains its _entire_ scanning state. Thus you can use such objects to create reentrant scanners, but see also *Note Reentrant::. You can instantiate multiple instances of the same `yyFlexLexer' class, and you can also combine multiple C++ scanner classes together in the same program using the `-P' option discussed above.

   Finally, note that the `%array' feature is not available to C++ scanner classes; you must use `%pointer' (the default).

   Here is an example of a simple C++ scanner:

             // An example of using the flex C++ scanner class.

         %{
         #include <iostream>   // for cout
         using namespace std;
         int mylineno = 0;
         %}

         string  \"[^\n"]+\"

         ws      [ \t]+

         alpha   [A-Za-z]
         dig     [0-9]
         name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
         num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
         num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
         number  {num1}|{num2}

         %%

         {ws}    /* skip blanks and tabs */

         "/*"    {
                 int c;

                 while((c = yyinput()) != 0)
                     {
                     if(c == '\n')
                         ++mylineno;

                     else if(c == '*')
                         {
                         if((c = yyinput()) == '/')
                             break;
                         else
                             unput(c);
                         }
                     }
                 }

         {number}  cout << "number " << YYText() << '\n';

         \n        mylineno++;

         {name}    cout << "name " << YYText() << '\n';

         {string}  cout << "string " << YYText() << '\n';

         %%

         int main( int /* argc */, char** /* argv */ )
             {
             FlexLexer* lexer = new yyFlexLexer;
             while(lexer->yylex() != 0)
                 ;
             return 0;
             }

   If you want to create multiple (different) lexer classes, you use the `-P' flag (or the `prefix=' option) to rename each `yyFlexLexer' to some other `xxFlexLexer'. You then can include `<FlexLexer.h>' in your other sources once per lexer class, first renaming `yyFlexLexer' as follows:

         #undef yyFlexLexer
         #define yyFlexLexer xxFlexLexer
         #include <FlexLexer.h>

         #undef yyFlexLexer
         #define yyFlexLexer zzFlexLexer
         #include <FlexLexer.h>

   if, for example, you used `%option prefix="xx"' for one of your scanners and `%option prefix="zz"' for the other.

File: flex.info, Node: Reentrant, Next: Lex and Posix, Prev: Cxx, Up: Top

Reentrant C Scanners
********************

   `flex' has the ability to generate a reentrant C scanner. This is accomplished by specifying `%option reentrant' (`-R'). The generated scanner is both portable and safe to use in one or more separate threads of control. The most common use for reentrant scanners is from within multi-threaded applications. Any thread may create and execute a reentrant `flex' scanner without the need for synchronization with other threads.

* Menu:

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

File: flex.info, Node: Reentrant Uses, Next: Reentrant Overview, Prev: Reentrant, Up: Reentrant

Uses for Reentrant Scanners
===========================

   However, there are other uses for a reentrant scanner. For example, you could scan two or more files simultaneously to implement a `diff' at the token level (i.e., instead of at the character level):

         /* Example of maintaining more than one active scanner. */

         /* Declared outside the loop body so that the loop
            condition can see them. */
         int tok1, tok2;

         do {
             tok1 = yylex( scanner_1 );
             tok2 = yylex( scanner_2 );

             if( tok1 != tok2 )
                 printf("Files are different.");

         } while ( tok1 && tok2 );

   Another use for a reentrant scanner is recursion. (Note that a recursive scanner can also be created using a non-reentrant scanner and buffer states. *Note Multiple Input Buffers::.)

   The following crude scanner supports the `eval' command by invoking another instance of itself.

         /* Example of recursive invocation. */

         %option reentrant

         %%
         "eval(".+")"  {
                           yyscan_t scanner;
                           YY_BUFFER_STATE buf;

                           yylex_init( &scanner );
                           yytext[yyleng-1] = ' ';

                           buf = yy_scan_string( yytext + 5, scanner );
                           yylex( scanner );

                           yy_delete_buffer(buf,scanner);
                           yylex_destroy( scanner );
                      }
         ...
         %%

File: flex.info, Node: Reentrant Overview, Next: Reentrant Example, Prev: Reentrant Uses, Up: Reentrant

An Overview of the Reentrant API
================================

   The API for reentrant scanners is different than for non-reentrant scanners. Here is a quick overview of the API:

   * `%option reentrant' must be specified.

   * All functions take one additional argument: `yyscanner'

   * All global variables are replaced by their macro equivalents. (We tell you this because it may be important to you during debugging.)

   * `yylex_init' and `yylex_destroy' must be called before and after `yylex', respectively.

   * Accessor methods (get/set functions) provide access to common `flex' variables.

   * User-specific data can be stored in `yyextra'.
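
   The reason a token-level `diff' needs reentrancy is that each input must carry its own scanning position. The sketch below shows the idea with a tiny hand-written tokenizer rather than a `flex'-generated one; `struct mini_scanner', `mini_init', `mini_next' and `same_tokens' are hypothetical names chosen for this illustration, and a token is simply a run of non-whitespace characters.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* All scanning state lives in this object, never in globals. */
struct mini_scanner {
    const char *pos;
};

void mini_init(struct mini_scanner *s, const char *input)
{
    s->pos = input;
}

/* Store the start of the next token in *start and return its
   length.  A length of 0 means the input is exhausted. */
size_t mini_next(struct mini_scanner *s, const char **start)
{
    while (*s->pos && isspace((unsigned char)*s->pos))
        s->pos++;
    *start = s->pos;
    while (*s->pos && !isspace((unsigned char)*s->pos))
        s->pos++;
    return (size_t)(s->pos - *start);
}

/* Return 1 if both inputs yield the same token sequence. */
int same_tokens(const char *a, const char *b)
{
    struct mini_scanner sa, sb;
    const char *ta, *tb;
    size_t la, lb;

    mini_init(&sa, a);
    mini_init(&sb, b);
    do {
        la = mini_next(&sa, &ta);
        lb = mini_next(&sb, &tb);
        if (la != lb || memcmp(ta, tb, la) != 0)
            return 0;
    } while (la != 0);
    return 1;
}
```

   Because the two scanners are advanced alternately and share nothing, `same_tokens("a  b\nc", "a b c")' reports a match even though the inputs differ character by character. A reentrant `flex' scanner offers the same property through its `yyscanner' argument.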

File: flex.info, Node: Reentrant Example, Next: Reentrant Detail, Prev: Reentrant Overview, Up: Reentrant

Reentrant Example
=================

   First, an example of a reentrant scanner:

         /* This scanner prints "//" comments. */
         %option reentrant stack
         %x COMMENT
         %%
         "//"                 yy_push_state( COMMENT, yyscanner);
         .|\n
         <COMMENT>\n          yy_pop_state( yyscanner );
         <COMMENT>[^\n]+      fprintf( yyout, "%s\n", yytext);
         %%
         int main ( int argc, char * argv[] )
         {
             yyscan_t scanner;

             yylex_init ( &scanner );
             yylex ( scanner );
             yylex_destroy ( scanner );
             return 0;
         }

File: flex.info, Node: Reentrant Detail, Next: Reentrant Functions, Prev: Reentrant Example, Up: Reentrant

The Reentrant API in Detail
===========================

   Here are the things you need to do or know to use the reentrant C API of `flex'.

* Menu:

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

File: flex.info, Node: Specify Reentrant, Next: Extra Reentrant Argument, Prev: Reentrant Detail, Up: Reentrant Detail

Declaring a Scanner As Reentrant
--------------------------------

   `%option reentrant' (`--reentrant') must be specified.

   Notice that `%option reentrant' is specified in the above example (*note Reentrant Example::). Had this option not been specified, `flex' would have happily generated a non-reentrant scanner without complaining. You may explicitly specify `%option noreentrant' if you do _not_ want a reentrant scanner, although it is not necessary. The default is to generate a non-reentrant scanner.

File: flex.info, Node: Extra Reentrant Argument, Next: Global Replacement, Prev: Specify Reentrant, Up: Reentrant Detail

The Extra Argument
------------------

   All functions take one additional argument: `yyscanner'.

   Notice that the calls to `yy_push_state' and `yy_pop_state' both have an argument, `yyscanner', that is not present in a non-reentrant scanner.
Here are the declarations of `yy_push_state' and `yy_pop_state' in the generated scanner:

         static void yy_push_state  ( int new_state , yyscan_t yyscanner ) ;
         static void yy_pop_state  ( yyscan_t yyscanner  ) ;

   Notice that the argument `yyscanner' appears in the declaration of both functions. In fact, all `flex' functions in a reentrant scanner have this additional argument. It is always the last argument in the argument list, it is always of type `yyscan_t' (which is typedef'd to `void *') and it is always named `yyscanner'. As you may have guessed, `yyscanner' is a pointer to an opaque data structure encapsulating the current state of the scanner. For a list of function declarations, see *Note Reentrant Functions::. Note that preprocessor macros, such as `BEGIN', `ECHO', and `REJECT', do not take this additional argument.

File: flex.info, Node: Global Replacement, Next: Init and Destroy Functions, Prev: Extra Reentrant Argument, Up: Reentrant Detail

Global Variables Replaced By Macros
-----------------------------------

   All global variables in traditional flex have been replaced by macro equivalents.

   Note that in the above example, `yyout' and `yytext' are not plain variables. These are macros that will expand to their equivalent lvalue. All of the familiar `flex' globals have been replaced by their macro equivalents. In particular, `yytext', `yyleng', `yylineno', `yyin', `yyout', `yyextra', `yylval', and `yylloc' are macros. You may safely use these macros in actions as if they were plain variables. We only tell you this so you don't expect to link to these variables externally. Currently, each macro expands to a member of an internal struct, e.g.,

         #define yytext (((struct yyguts_t*)yyscanner)->yytext_r)

   One important thing to remember about `yytext' and friends is that, because `yytext' is not a global variable in a reentrant scanner, you can not access it directly from outside an action or from other functions.
You must use an accessor method, e.g., `yyget_text', to accomplish this. (See below).

File: flex.info, Node: Init and Destroy Functions, Next: Accessor Methods, Prev: Global Replacement, Up: Reentrant Detail

Init and Destroy Functions
--------------------------

   `yylex_init' and `yylex_destroy' must be called before and after `yylex', respectively.

         int yylex_init ( yyscan_t * ptr_yy_globals ) ;
         int yylex ( yyscan_t yyscanner ) ;
         int yylex_destroy ( yyscan_t yyscanner ) ;

   The function `yylex_init' must be called before calling any other function. The argument to `yylex_init' is the address of an uninitialized pointer to be filled in by `flex'. The contents of `ptr_yy_globals' need not be initialized, since `flex' will overwrite it anyway. The value stored in `ptr_yy_globals' should thereafter be passed to `yylex()' and `yylex_destroy()'. Flex does not save the argument passed to `yylex_init', so it is safe to pass the address of a local pointer to `yylex_init'.

   The function `yylex' should be familiar to you by now. The reentrant version takes one argument, which is the value returned (via an argument) by `yylex_init'. Otherwise, it behaves the same as the non-reentrant version of `yylex'.

   `yylex_init' returns 0 (zero) on success, or non-zero on failure, in which case `errno' is set to one of the following values:

   * ENOMEM Memory allocation error. *Note memory-management::.

   * EINVAL Invalid argument.

   The function `yylex_destroy' should be called to free resources used by the scanner. After `yylex_destroy' is called, the contents of `yyscanner' should not be used. Of course, there is no need to destroy a scanner if you plan to reuse it. A `flex' scanner (both reentrant and non-reentrant) may be restarted by calling `yyrestart'.

   Below is an example of a program that creates a scanner, uses it, then destroys it when done:

         int main ()
         {
             yyscan_t scanner;
             int tok;

             yylex_init(&scanner);

             /* The reentrant yylex takes the scanner as its argument. */
             while ((tok=yylex(scanner)) > 0)
                 printf("tok=%d  yytext=%s\n", tok, yyget_text(scanner));

             yylex_destroy(scanner);
             return 0;
         }

File: flex.info, Node: Accessor Methods, Next: Extra Data, Prev: Init and Destroy Functions, Up: Reentrant Detail

Accessing Variables with Reentrant Scanners
-------------------------------------------

   Accessor methods (get/set functions) provide access to common `flex' variables.

   Many scanners that you build will be part of a larger project. Portions of your project will need access to `flex' values, such as `yytext'. In a non-reentrant scanner, these values are global, so there is no problem accessing them. However, in a reentrant scanner, there are no global `flex' values. You can not access them directly. Instead, you must access `flex' values using accessor methods (get/set functions). Each accessor method is named `yyget_NAME' or `yyset_NAME', where `NAME' is the name of the `flex' variable you want. For example:

         /* Set the last character of yytext to NUL. */
         void chop ( yyscan_t scanner )
         {
             int len = yyget_leng( scanner );
             yyget_text( scanner )[len - 1] = '\0';
         }

   The above code may be called from within an action like this:

         %%
         .+\n    { chop( yyscanner );}

   You may find that `%option header-file' is particularly useful for generating prototypes of all the accessor functions. *Note option-header::.

File: flex.info, Node: Extra Data, Next: About yyscan_t, Prev: Accessor Methods, Up: Reentrant Detail

Extra Data
----------

   User-specific data can be stored in `yyextra'.

   In a reentrant scanner, it is unwise to use global variables to communicate with or maintain state between different pieces of your program. However, you may need access to external data or invoke external functions from within the scanner actions.
Likewise, you may need to pass information to your scanner (e.g., open file descriptors, or database connections). In a non-reentrant scanner, the only way to do this would be through the use of global variables. `flex' allows you to store arbitrary, "extra" data in a scanner. This data is accessible through the accessor methods `yyget_extra' and `yyset_extra' from outside the scanner, and through the shortcut macro `yyextra' from within the scanner itself. They are defined as follows:

         #define YY_EXTRA_TYPE  void*
         YY_EXTRA_TYPE  yyget_extra ( yyscan_t scanner );
         void           yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner);

   By default, `YY_EXTRA_TYPE' is defined as type `void *'. You will have to cast `yyextra' and the return value from `yyget_extra' to the appropriate type each time you access the extra data. To avoid casting, you may override the default type by defining `YY_EXTRA_TYPE' in section 1 of your scanner:

         /* An example of overriding YY_EXTRA_TYPE. */
         %{
         #include <sys/stat.h>
         #include <unistd.h>
         #define YY_EXTRA_TYPE  struct stat*
         %}
         %option reentrant
         %%

         __filesize__     printf( "%ld", (long) yyextra->st_size  );
         __lastmod__      printf( "%ld", (long) yyextra->st_mtime );
         %%
         void scan_file( char* filename )
         {
             yyscan_t scanner;
             struct stat buf;

             yylex_init ( &scanner );
             yyset_in( fopen(filename,"r"), scanner );

             stat( filename, &buf);
             yyset_extra( &buf, scanner );
             yylex ( scanner );
             yylex_destroy( scanner );
         }

File: flex.info, Node: About yyscan_t, Prev: Extra Data, Up: Reentrant Detail

About yyscan_t
--------------

   `yyscan_t' is defined as:

         typedef void* yyscan_t;

   It is initialized by `yylex_init()' to point to an internal structure. You should never access this value directly. In particular, you should never attempt to free it (use `yylex_destroy()' instead.)
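
   The combination of an opaque `void *' handle, an init function that fills it in, and accessors that cast it back to a private structure can be sketched in plain C. This is an illustration of the pattern only and is not `flex''s implementation: `handle_t', `struct guts' and the `handle_*' functions are hypothetical stand-ins for `yyscan_t', the internal structure, and the generated `yylex_init', `yyget_lineno', `yyset_lineno' and `yylex_destroy'.

```c
#include <errno.h>
#include <stdlib.h>

typedef void *handle_t;          /* plays the role of yyscan_t */

struct guts {                    /* private state, one per handle */
    int lineno;
};

/* Allocate the private state and store its address in *out.
   Returns 0 on success; on failure returns non-zero and sets errno. */
int handle_init(handle_t *out)
{
    struct guts *g;

    if (out == NULL) {
        errno = EINVAL;
        return 1;
    }
    g = calloc(1, sizeof *g);
    if (g == NULL) {
        errno = ENOMEM;
        return 1;
    }
    g->lineno = 1;
    *out = g;
    return 0;
}

/* Accessors take the handle as their last argument. */
int handle_get_lineno(handle_t h)
{
    return ((struct guts *)h)->lineno;
}

void handle_set_lineno(int line_number, handle_t h)
{
    ((struct guts *)h)->lineno = line_number;
}

/* Release the private state; the handle must not be used afterwards. */
int handle_destroy(handle_t h)
{
    free(h);
    return 0;
}
```

   Two handles created this way never interfere with each other: setting the line number through one leaves the other at its initial value. Callers only ever hold the `void *', which is why they must go through the destroy function rather than freeing the pointer themselves.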

File: flex.info, Node: Reentrant Functions, Prev: Reentrant Detail, Up: Reentrant

Functions and Macros Available in Reentrant C Scanners
======================================================

The following functions are available in a reentrant scanner:

         char *yyget_text ( yyscan_t scanner );
         int yyget_leng ( yyscan_t scanner );
         FILE *yyget_in ( yyscan_t scanner );
         FILE *yyget_out ( yyscan_t scanner );
         int yyget_lineno ( yyscan_t scanner );
         YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner );
         int  yyget_debug ( yyscan_t scanner );

         void yyset_debug ( int flag, yyscan_t scanner );
         void yyset_in  ( FILE * in_str , yyscan_t scanner );
         void yyset_out  ( FILE * out_str , yyscan_t scanner );
         void yyset_lineno ( int line_number , yyscan_t scanner );
         void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner );

   There are no "set" functions for yytext and yyleng.  This is
intentional.

   The following macro shortcuts are available in actions in a
reentrant scanner:

         yytext
         yyleng
         yyin
         yyout
         yylineno
         yyextra
         yy_flex_debug

   In a reentrant C scanner, support for yylineno is always present
(i.e., you may access yylineno), but the value is never modified by
`flex' unless `%option yylineno' is enabled.  This is to allow the user
to maintain the line count independently of `flex'.

   The following functions and macros are made available when `%option
bison-bridge' (`--bison-bridge') is specified:

         YYSTYPE * yyget_lval ( yyscan_t scanner );
         void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
         yylval

   The following functions and macros are made available when `%option
bison-locations' (`--bison-locations') is specified:

         YYLTYPE *yyget_lloc ( yyscan_t scanner );
         void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
         yylloc

   Support for yylval assumes that `YYSTYPE' is a valid type.  Support
for yylloc assumes that `YYLTYPE' is a valid type.  Typically, these
types are generated by `bison', and are included in section 1 of the
`flex' input.

File: flex.info, Node: Lex and Posix, Next: Memory Management, Prev: Reentrant, Up: Top

Incompatibilities with Lex and Posix
************************************

`flex' is a rewrite of the AT&T Unix _lex_ tool (the two
implementations do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to both implementations.  `flex' is fully
compliant with the POSIX `lex' specification, except that when using
`%pointer' (the default), a call to `unput()' destroys the contents of
`yytext', which is counter to the POSIX specification.  In this section
we discuss all of the known areas of incompatibility between `flex',
AT&T `lex', and the POSIX specification.  `flex''s `-l' option turns on
maximum compatibility with the original AT&T `lex' implementation, at
the cost of a major loss in the generated scanner's performance.  We
note below which incompatibilities can be overcome using the `-l'
option.  `flex' is fully compatible with `lex' with the following
exceptions:

   * The undocumented `lex' scanner internal variable `yylineno' is not
     supported unless `-l' or `%option yylineno' is used.

   * `yylineno' should be maintained on a per-buffer basis, rather than
     a per-scanner (single global variable) basis.

   * `yylineno' is not part of the POSIX specification.

   * The `input()' routine is not redefinable, though it may be called
     to read characters following whatever has been matched by a rule.
     If `input()' encounters an end-of-file the normal `yywrap()'
     processing is done.  A "real" end-of-file is returned by `input()'
     as `EOF'.

   * Input is instead controlled by defining the `YY_INPUT()' macro.

   * The `flex' restriction that `input()' cannot be redefined is in
     accordance with the POSIX specification, which simply does not
     specify any way of controlling the scanner's input other than by
     making an initial assignment to `yyin'.

   * The `unput()' routine is not redefinable.
     This restriction is in accordance with POSIX.

   * `flex' scanners are not as reentrant as `lex' scanners.  In
     particular, if you have an interactive scanner and an interrupt
     handler which long-jumps out of the scanner, and the scanner is
     subsequently called again, you may get the following message:

              fatal flex scanner internal error--end of buffer missed

     To reenter the scanner, first use:

              yyrestart( yyin );

     Note that this call will throw away any buffered input; usually
     this isn't a problem with an interactive scanner.  *Note
     Reentrant::, for `flex''s reentrant API.

   * Also note that `flex' C++ scanner classes _are_ reentrant, so if
     using C++ is an option for you, you should use them instead.
     *Note Cxx::, and *Note Reentrant:: for details.

   * `output()' is not supported.  Output from the ECHO macro is done
     to the file-pointer `yyout' (default `stdout').

   * `output()' is not part of the POSIX specification.

   * `lex' does not support exclusive start conditions (%x), though they
     are in the POSIX specification.

   * When definitions are expanded, `flex' encloses them in parentheses.
     With `lex', the following:

              NAME    [A-Z][A-Z0-9]*
              %%
              foo{NAME}?      printf( "Found it\n" );
              %%

     will not match the string `foo' because when the macro is expanded
     the rule is equivalent to `foo[A-Z][A-Z0-9]*?' and the precedence
     is such that the `?' is associated with `[A-Z0-9]*'.  With `flex',
     the rule will be expanded to `foo([A-Z][A-Z0-9]*)?' and so the
     string `foo' will match.

   * Note that if the definition begins with `^' or ends with `$' then
     it is _not_ expanded with parentheses, to allow these operators to
     appear in definitions without losing their special meanings.  But
     the `<s>', `/', and `<<EOF>>' operators cannot be used in a `flex'
     definition.

   * Using `-l' results in the `lex' behavior of no parentheses around
     the definition.

   * The POSIX specification is that the definition be enclosed in
     parentheses.

   * Some implementations of `lex' allow a rule's action to begin on a
     separate line, if the rule's pattern has trailing whitespace:

              %%
              foo|bar<space here>
                { foobar_action();}

     `flex' does not support this feature.

   * The `lex' `%r' (generate a Ratfor scanner) option is not
     supported.  It is not part of the POSIX specification.

   * After a call to `unput()', _yytext_ is undefined until the next
     token is matched, unless the scanner was built using `%array'.
     This is not the case with `lex' or the POSIX specification.  The
     `-l' option does away with this incompatibility.

   * The precedence of the `{,}' (numeric range) operator is different.
     The AT&T and POSIX specifications of `lex' interpret `abc{1,3}' as
     "match one, two, or three occurrences of `abc'", whereas `flex'
     interprets it as "match `ab' followed by one, two, or three
     occurrences of `c'".  The `-l' and `--posix' options do away with
     this incompatibility.

   * The precedence of the `^' operator is different.  `lex' interprets
     `^foo|bar' as "match either `foo' at the beginning of a line, or
     `bar' anywhere", whereas `flex' interprets it as "match either
     `foo' or `bar' if they come at the beginning of a line".  The
     latter is in agreement with the POSIX specification.

   * The special table-size declarations such as `%a' supported by
     `lex' are not required by `flex' scanners.  `flex' ignores them.

   * The name `FLEX_SCANNER' is `#define''d so scanners may be written
     for use with either `flex' or `lex'.  Scanners also include
     `YY_FLEX_MAJOR_VERSION', `YY_FLEX_MINOR_VERSION' and
     `YY_FLEX_SUBMINOR_VERSION' indicating which version of `flex'
     generated the scanner.  For example, for the 2.5.22 release, these
     defines would be 2, 5 and 22 respectively.  If the version of
     `flex' being used is a beta version, then the symbol `FLEX_BETA'
     is defined.

   * The symbols `[[' and `]]' in the code sections of the input may
     conflict with the m4 delimiters.  *Note M4 Dependency::.
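
   The `FLEX_SCANNER' and `YY_FLEX_*_VERSION' symbols described in the
list above lend themselves to conditional compilation.  The following
is a minimal sketch, assuming the code is placed in the user code
section of a scanner source that must build with either tool.  The
function names `built_with_flex' and `flex_version_at_least' are
hypothetical helpers, not part of `flex'.  Compiled outside a
`flex'-generated scanner, where none of the symbols are defined, both
functions return 0.

```c
/* Returns 1 when compiled as part of a flex-generated scanner, 0
 * otherwise (for example under lex, or in a standalone file). */
int built_with_flex(void)
{
#ifdef FLEX_SCANNER
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the generating flex is at least major.minor.subminor,
 * and 0 if it is older or if the version symbols are unavailable. */
int flex_version_at_least(int major, int minor, int subminor)
{
#if defined(FLEX_SCANNER) && defined(YY_FLEX_MAJOR_VERSION) \
    && defined(YY_FLEX_MINOR_VERSION) && defined(YY_FLEX_SUBMINOR_VERSION)
    /* Compare the three components from most to least significant. */
    if (YY_FLEX_MAJOR_VERSION != major)
        return YY_FLEX_MAJOR_VERSION > major;
    if (YY_FLEX_MINOR_VERSION != minor)
        return YY_FLEX_MINOR_VERSION > minor;
    return YY_FLEX_SUBMINOR_VERSION >= subminor;
#else
    (void)major;
    (void)minor;
    (void)subminor;
    return 0;
#endif
}
```

   In a scanner generated by the 2.5.22 release mentioned above,
`flex_version_at_least(2, 5, 9)' would be expected to return 1.  Code
that relies on `flex'-only features can be wrapped in `#ifdef
FLEX_SCANNER' in the same way, so that the file still compiles under
`lex'.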

   The following `flex' features are not included in `lex' or the POSIX
specification:

   * C++ scanners
   * %option
   * start condition scopes
   * start condition stacks
   * interactive/non-interactive scanners
   * yy_scan_string() and friends
   * yyterminate()
   * yy_set_interactive()
   * yy_set_bol()
   * YY_AT_BOL()
   * <<EOF>>
   * <*>
   * YY_DECL
   * YY_START
   * YY_USER_ACTION
   * YY_USER_INIT
   * #line directives
   * %{}'s around actions
   * reentrant C API
   * multiple actions on a line
   * almost all of the `flex' command-line options

   The feature "multiple actions on a line" refers to the fact that
with `flex' you can put multiple actions on the same line, separated
with semicolons, while with `lex', the following:

         foo    handle_foo(); ++num_foos_seen;

   is (rather surprisingly) truncated to

         foo    handle_foo();

   `flex' does not truncate the action.  Actions that are not enclosed
in braces are simply terminated at the end of the line.

File: flex.info, Node: Memory Management, Next: Serialized Tables, Prev: Lex and Posix, Up: Top

Memory Management
*****************

This chapter describes how flex handles dynamic memory, and how you can
override the default behavior.

* Menu:

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

File: flex.info, Node: The Default Memory Management, Next: Overriding The Default Memory Management, Prev: Memory Management, Up: Memory Management

The Default Memory Management
=============================

Flex allocates dynamic memory during initialization, and once in a
while from within a call to yylex().  Initialization takes place during
the first call to yylex().  Thereafter, flex may reallocate more memory
if it needs to enlarge a buffer.  As of version 2.5.9, Flex will clean
up all memory when you call `yylex_destroy'.  *Note faq-memory-leak::.

   Flex allocates dynamic memory for five purposes, listed below (1)

16kB for the input buffer.
     Flex allocates memory for the character buffer used to perform
     pattern matching.
     Flex must read ahead from the input stream and store it in a large
     character buffer.  This buffer is typically the largest chunk of
     dynamic memory flex consumes.  This buffer will grow if necessary,
     doubling the size each time.  Flex frees this memory when you call
     yylex_destroy().  The default size of this buffer (16384 bytes) is
     almost always too large.  The ideal size for this buffer is the
     length of the longest token expected, in bytes, plus a little
     more.  Flex will allocate a few extra bytes for housekeeping.
     Currently, to override the size of the input buffer you must
     `#define YY_BUF_SIZE' to whatever number of bytes you want.  We
     don't plan to change this in the near future, but we reserve the
     right to do so if we ever add a more robust memory management API.

64kB for the REJECT state.  This will only be allocated if you use REJECT.
     The size is large enough to hold the same number of states as
     characters in the input buffer.  If you override the size of the
     input buffer (via `YY_BUF_SIZE'), then you automatically override
     the size of this buffer as well.

100 bytes for the start condition stack.
     Flex allocates memory for the start condition stack.  This is the
     stack used for pushing start states, i.e., with yy_push_state().
     It will grow if necessary.  Since the states are simply integers,
     this stack doesn't consume much memory.  This stack is not present
     if `%option stack' is not specified.  You will rarely need to tune
     this buffer.  The ideal size for this stack is the maximum depth
     expected.  The memory for this stack is automatically destroyed
     when you call yylex_destroy().  *Note option-stack::.

40 bytes for each YY_BUFFER_STATE.
     Flex allocates memory for each YY_BUFFER_STATE.  The buffer state
     itself is about 40 bytes, plus an additional large character
     buffer (described above).  The initial buffer state is created
     during initialization, and an additional one is created with each
     call to yy_create_buffer().
     You can't tune the size of this, but you can tune the character
     buffer as described above.  Any buffer state that you explicitly
     create by calling yy_create_buffer() is _NOT_ destroyed
     automatically.  You must call yy_delete_buffer() to free the
     memory.  The exception to this rule is that flex will delete the
     current buffer automatically when you call yylex_destroy().  If you
     delete the current buffer, be sure to set it to NULL.  That way,
     flex will not try to delete the buffer a second time (possibly
     crashing your program!)  At the time of this writing, flex does not
     provide a growable stack for the buffer states.  You have to manage
     that yourself.  *Note Multiple Input Buffers::.

84 bytes for the reentrant scanner guts
     Flex allocates about 84 bytes for the reentrant scanner structure
     when you call yylex_init().  It is destroyed when the user calls
     yylex_destroy().

   ---------- Footnotes ----------

   (1) The quantities given here are approximate, and may vary due to
host architecture, compiler configuration, or due to future
enhancements to flex.
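
   The growth policy described for the input buffer (start at the
default size and double whenever more room is needed) can be made
concrete with a little arithmetic.  The sketch below is an illustration
only and is not taken from `flex''s source.  The names
`SKETCH_DEFAULT_BUF_SIZE' and `sketch_grown_size' are hypothetical, and
the few housekeeping bytes mentioned above are ignored.  It may help
when choosing a value for `YY_BUF_SIZE': a starting size just above the
longest expected token avoids any doubling at all.

```c
#include <stddef.h>

/* The default input buffer size quoted in the text (an approximation
 * that may differ between flex versions and platforms). */
#define SKETCH_DEFAULT_BUF_SIZE 16384

/* Returns the first size, reached from `initial' by repeated doubling,
 * that is at least `needed' bytes.  Returns 0 if `initial' is 0 or if
 * doubling would overflow size_t. */
size_t sketch_grown_size(size_t initial, size_t needed)
{
    size_t size = initial;

    if (size == 0)
        return 0;
    while (size < needed) {
        if (size > (size_t)-1 / 2)
            return 0;   /* the next doubling would overflow */
        size *= 2;
    }
    return size;
}
```

   Under this model a 100000-byte token forces the default buffer
through three doublings (32768, 65536, 131072), whereas a scanner whose
longest token is a few hundred bytes never grows the buffer and could
start with a much smaller `YY_BUF_SIZE'.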