This is flex.info, produced by makeinfo version 4.5 from flex.texi. INFO-DIR-SECTION Programming START-INFO-DIR-ENTRY * flex: (flex). Fast lexical analyzer generator (lex replacement). END-INFO-DIR-ENTRY The flex manual is placed under the same licensing conditions as the rest of flex: Copyright (C) 1990, 1997 The Regents of the University of California. All rights reserved. This code is derived from software contributed to Berkeley by Vern Paxson. The United States Government has rights in this work pursuant to contract no. DE-AC03-76SF00098 between the United States Department of Energy and the University of California. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. File: flex.info, Node: Overriding The Default Memory Management, Next: A Note About yytext And Memory, Prev: The Default Memory Management, Up: Memory Management Overriding The Default Memory Management ======================================== Flex calls the functions `yyalloc', `yyrealloc', and `yyfree' when it needs to allocate or free memory. By default, these functions are wrappers around the standard C functions, `malloc', `realloc', and `free', respectively. You can override the default implementations by telling flex that you will provide your own implementations. To override the default implementations, you must do two things: 1. Suppress the default implementations by specifying one or more of the following options: * `%option noyyalloc' * `%option noyyrealloc' * `%option noyyfree'. 2. Provide your own implementation of the following functions: (1) // For a non-reentrant scanner void * yyalloc (size_t bytes); void * yyrealloc (void * ptr, size_t bytes); void yyfree (void * ptr); // For a reentrant scanner void * yyalloc (size_t bytes, void * yyscanner); void * yyrealloc (void * ptr, size_t bytes, void * yyscanner); void yyfree (void * ptr, void * yyscanner); In the following example, we will override all three memory routines. We assume that there is a custom allocator with garbage collection. In order to make this example interesting, we will use a reentrant scanner, passing a pointer to the custom allocator through `yyextra'. %{ #include "some_allocator.h" %} /* Suppress the default implementations. */ %option noyyalloc noyyrealloc noyyfree %option reentrant /* Initialize the allocator. */ #define YY_EXTRA_TYPE struct allocator* #define YY_USER_INIT yyextra = allocator_create(); %% .|\n ; %% /* Provide our own implementations. */ void * yyalloc (size_t bytes, void* yyscanner) { return allocator_alloc (yyextra, bytes); } void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) { return allocator_realloc (yyextra, bytes); } void yyfree (void * ptr, void * yyscanner) { /* Do nothing -- we leave it to the garbage collector. */ } ---------- Footnotes ---------- (1) It is not necessary to override all (or any) of the memory management routines. You may, for example, override `yyrealloc', but not `yyfree' or `yyalloc'. File: flex.info, Node: A Note About yytext And Memory, Prev: Overriding The Default Memory Management, Up: Memory Management A Note About yytext And Memory ============================== When flex finds a match, `yytext' points to the first character of the match in the input buffer. The string itself is part of the input buffer, and is _NOT_ allocated separately. The value of yytext will be overwritten the next time yylex() is called. In short, the value of yytext is only valid from within the matched rule's action. Often, you want the value of yytext to persist for later processing, i.e., by a parser with non-zero lookahead. In order to preserve yytext, you will have to copy it with strdup() or a similar function. But this introduces some headache because your parser is now responsible for freeing the copy of yytext. If you use a yacc or bison parser, (commonly used with flex), you will discover that the error recovery mechanisms can cause memory to be leaked. To prevent memory leaks from strdup'd yytext, you will have to track the memory somehow. Our experience has shown that a garbage collection mechanism or a pooled memory mechanism will save you a lot of grief when writing parsers. File: flex.info, Node: Serialized Tables, Next: Diagnostics, Prev: Memory Management, Up: Top Serialized Tables ***************** A `flex' scanner has the ability to save the DFA tables to a file, and load them at runtime when needed. The motivation for this feature is to reduce the runtime memory footprint. Traditionally, these tables have been compiled into the scanner as C arrays, and are sometimes quite large. Since the tables are compiled into the scanner, the memory used by the tables can never be freed. This is a waste of memory, especially if an application uses several scanners, but none of them at the same time. The serialization feature allows the tables to be loaded at runtime, before scanning begins. The tables may be discarded when scanning is finished. * Menu: * Creating Serialized Tables:: * Loading and Unloading Serialized Tables:: * Tables File Format:: File: flex.info, Node: Creating Serialized Tables, Next: Loading and Unloading Serialized Tables, Prev: Serialized Tables, Up: Serialized Tables Creating Serialized Tables ========================== You may create a scanner with serialized tables by specifying: %option tables-file=FILE or --tables-file=FILE These options instruct flex to save the DFA tables to the file FILE. The tables will _not_ be embedded in the generated scanner. The scanner will not function on its own. The scanner will be dependent upon the serialized tables. You must load the tables from this file at runtime before you can scan anything. If you do not specify a filename to `--tables-file', the tables will be saved to `lex.yy.tables', where `yy' is the appropriate prefix. If your project uses several different scanners, you can concatenate the serialized tables into one file, and flex will find the correct set of tables, using the scanner prefix as part of the lookup key. An example follows: $ flex --tables-file --prefix=cpp cpp.l $ flex --tables-file --prefix=c c.l $ cat lex.cpp.tables lex.c.tables > all.tables The above example created two scanners, `cpp', and `c'. Since we did not specify a filename, the tables were serialized to `lex.c.tables' and `lex.cpp.tables', respectively. Then, we concatenated the two files together into `all.tables', which we will distribute with our project. At runtime, we will open the file and tell flex to load the tables from it. Flex will find the correct tables automatically. (See next section). File: flex.info, Node: Loading and Unloading Serialized Tables, Next: Tables File Format, Prev: Creating Serialized Tables, Up: Serialized Tables Loading and Unloading Serialized Tables ======================================= If you've built your scanner with `%option tables-file', then you must load the scanner tables at runtime. This can be accomplished with the following function: - Function: int yytables_fload (FILE* FP [, yyscan_t SCANNER]) Locates scanner tables in the stream pointed to by FP and loads them. Memory for the tables is allocated via `yyalloc'. You must call this function before the first call to `yylex'. The argument SCANNER only appears in the reentrant scanner. This function returns `0' (zero) on success, or non-zero on error. The loaded tables are *not* automatically destroyed (unloaded) when you call `yylex_destroy'. The reason is that you may create several scanners of the same type (in a reentrant scanner), each of which needs access to these tables. To avoid a nasty memory leak, you must call the following function: - Function: int yytables_destroy ([yyscan_t SCANNER]) Unloads the scanner tables. The tables must be loaded again before you can scan any more data. The argument SCANNER only appears in the reentrant scanner. This function returns `0' (zero) on success, or non-zero on error. *The functions `yytables_fload' and `yytables_destroy' are not thread-safe.* You must ensure that these functions are called exactly once (for each scanner type) in a threaded program, before any thread calls `yylex'. After the tables are loaded, they are never written to, and no thread protection is required thereafter - until you destroy them. File: flex.info, Node: Tables File Format, Prev: Loading and Unloading Serialized Tables, Up: Serialized Tables Tables File Format ================== This section defines the file format of serialized `flex' tables. The tables format allows for one or more sets of tables to be specified, where each set corresponds to a given scanner. Scanners are indexed by name, as described below. The file format is as follows: TABLE SET 1 +-------------------------------+ Header | uint32 th_magic; | | uint32 th_hsize; | | uint32 th_ssize; | | uint16 th_flags; | | char th_version[]; | | char th_name[]; | | uint8 th_pad64[]; | +-------------------------------+ Table 1 | uint16 td_id; | | uint16 td_flags; | | uint32 td_lolen; | | uint32 td_hilen; | | void td_data[]; | | uint8 td_pad64[]; | +-------------------------------+ Table 2 | | . . . . . . . . . . . . Table n | | +-------------------------------+ TABLE SET 2 . . . TABLE SET N The above diagram shows that a complete set of tables consists of a header followed by multiple individual tables. Furthermore, multiple complete sets may be present in the same file, each set with its own header and tables. The sets are contiguous in the file. The only way to know if another set follows is to check the next four bytes for the magic number (or check for EOF). The header and tables sections are padded to 64-bit boundaries. Below we describe each field in detail. This format does not specify how the scanner will expand the given data, i.e., data may be serialized as int8, but expanded to an int32 array at runtime. This is to reduce the size of the serialized data where possible. Remember, _all integer values are in network byte order_. Fields of a table header: `th_magic' Magic number, always 0xF13C57B1. `th_hsize' Size of this entire header, in bytes, including all fields plus any padding. `th_ssize' Size of this entire set, in bytes, including the header, all tables, plus any padding. `th_flags' Bit flags for this table set. Currently unused. `th_version[]' Flex version in NULL-termninated string format. e.g., `2.5.13a'. This is the version of flex that was used to create the serialized tables. `th_name[]' Contains the name of this table set. The default is `yytables', and is prefixed accordingly, e.g., `footables'. Must be NULL-terminated. `th_pad64[]' Zero or more NULL bytes, padding the entire header to the next 64-bit boundary as calculated from the beginning of the header. Fields of a table: `td_id' Specifies the table identifier. Possible values are: `YYTD_ID_ACCEPT (0x01)' `yy_accept' `YYTD_ID_BASE (0x02)' `yy_base' `YYTD_ID_CHK (0x03)' `yy_chk' `YYTD_ID_DEF (0x04)' `yy_def' `YYTD_ID_EC (0x05)' `yy_ec ' `YYTD_ID_META (0x06)' `yy_meta' `YYTD_ID_NUL_TRANS (0x07)' `yy_NUL_trans' `YYTD_ID_NXT (0x08)' `yy_nxt'. This array may be two dimensional. See the `td_hilen' field below. `YYTD_ID_RULE_CAN_MATCH_EOL (0x09)' `yy_rule_can_match_eol' `YYTD_ID_START_STATE_LIST (0x0A)' `yy_start_state_list'. This array is handled specially because it is an array of pointers to structs. See the `td_flags' field below. `YYTD_ID_TRANSITION (0x0B)' `yy_transition'. This array is handled specially because it is an array of structs. See the `td_lolen' field below. `YYTD_ID_ACCLIST (0x0C)' `yy_acclist' `td_flags' Bit flags describing how to interpret the data in `td_data'. The data arrays are one-dimensional by default, but may be two dimensional as specified in the `td_hilen' field. `YYTD_DATA8 (0x01)' The data is serialized as an array of type int8. `YYTD_DATA16 (0x02)' The data is serialized as an array of type int16. `YYTD_DATA32 (0x04)' The data is serialized as an array of type int32. `YYTD_PTRANS (0x08)' The data is a list of indexes of entries in the expanded `yy_transition' array. Each index should be expanded to a pointer to the corresponding entry in the `yy_transition' array. We count on the fact that the `yy_transition' array has already been seen. `YYTD_STRUCT (0x10)' The data is a list of yy_trans_info structs, each of which consists of two integers. There is no padding between struct elements or between structs. The type of each member is determined by the `YYTD_DATA*' bits. `td_lolen' Specifies the number of elements in the lowest dimension array. If this is a one-dimensional array, then it is simply the number of elements in this array. The element size is determined by the `td_flags' field. `td_hilen' If `td_hilen' is non-zero, then the data is a two-dimensional array. Otherwise, the data is a one-dimensional array. `td_hilen' contains the number of elements in the higher dimensional array, and `td_lolen' contains the number of elements in the lowest dimension. Conceptually, `td_data' is either `sometype td_data[td_lolen]', or `sometype td_data[td_hilen][td_lolen]', where `sometype' is specified by the `td_flags' field. It is possible for both `td_lolen' and `td_hilen' to be zero, in which case `td_data' is a zero length array, and no data is loaded, i.e., this table is simply skipped. Flex does not currently generate tables of zero length. `td_data[]' The table data. This array may be a one- or two-dimensional array, of type `int8', `int16', `int32', `struct yy_trans_info', or `struct yy_trans_info*', depending upon the values in the `td_flags', `td_lolen', and `td_hilen' fields. `td_pad64[]' Zero or more NULL bytes, padding the entire table to the next 64-bit boundary as calculated from the beginning of this table. File: flex.info, Node: Diagnostics, Next: Limitations, Prev: Serialized Tables, Up: Top Diagnostics *********** The following is a list of `flex' diagnostic messages: * `warning, rule cannot be matched' indicates that the given rule cannot be matched because it follows other rules that will always match the same text as it. For example, in the following `foo' cannot be matched because it comes after an identifier "catch-all" rule: [a-z]+ got_identifier(); foo got_foo(); Using `REJECT' in a scanner suppresses this warning. * `warning, -s option given but default rule can be matched' means that it is possible (perhaps only in a particular start condition) that the default rule (match any single character) is the only one that will match a particular input. Since `-s' was given, presumably this is not intended. * `reject_used_but_not_detected undefined' or `yymore_used_but_not_detected undefined'. These errors can occur at compile time. They indicate that the scanner uses `REJECT' or `yymore()' but that `flex' failed to notice the fact, meaning that `flex' scanned the first two sections looking for occurrences of these actions and failed to find any, but somehow you snuck some in (via a #include file, for example). Use `%option reject' or `%option yymore' to indicate to `flex' that you really do use these features. * `flex scanner jammed'. a scanner compiled with `-s' has encountered an input string which wasn't matched by any of its rules. This error can also occur due to internal problems. * `token too large, exceeds YYLMAX'. your scanner uses `%array' and one of its rules matched a string longer than the `YYLMAX' constant (8K bytes by default). You can increase the value by #define'ing `YYLMAX' in the definitions section of your `flex' input. * `scanner requires -8 flag to use the character 'x''. Your scanner specification includes recognizing the 8-bit character `'x'' and you did not specify the -8 flag, and your scanner defaulted to 7-bit because you used the `-Cf' or `-CF' table compression options. See the discussion of the `-7' flag, *Note Scanner Options::, for details. * `flex scanner push-back overflow'. you used `unput()' to push back so much text that the scanner's buffer could not hold both the pushed-back text and the current token in `yytext'. Ideally the scanner should dynamically resize the buffer in this case, but at present it does not. * `input buffer overflow, can't enlarge buffer because scanner uses REJECT'. the scanner was working on matching an extremely large token and needed to expand the input buffer. This doesn't work with scanners that use `REJECT'. * `fatal flex scanner internal error--end of buffer missed'. This can occur in a scanner which is reentered after a long-jump has jumped out (or over) the scanner's activation frame. Before reentering the scanner, use: yyrestart( yyin ); or, as noted above, switch to using the C++ scanner class. * `too many start conditions in <> construct!' you listed more start conditions in a <> construct than exist (so you must have listed at least one of them twice). File: flex.info, Node: Limitations, Next: Bibliography, Prev: Diagnostics, Up: Top Limitations *********** Some trailing context patterns cannot be properly matched and generate warning messages (`dangerous trailing context'). These are patterns where the ending of the first part of the rule matches the beginning of the second part, such as `zx*/xy*', where the 'x*' matches the 'x' at the beginning of the trailing context. (Note that the POSIX draft states that the text matched by such patterns is undefined.) For some trailing context rules, parts which are actually fixed-length are not recognized as such, leading to the abovementioned performance loss. In particular, parts using `|' or `{n}' (such as `foo{3}') are always considered variable-length. Combining trailing context with the special `|' action can result in _fixed_ trailing context being turned into the more expensive _variable_ trailing context. For example, in the following: %% abc | xyz/def Use of `unput()' invalidates yytext and yyleng, unless the `%array' directive or the `-l' option has been used. Pattern-matching of `NUL's is substantially slower than matching other characters. Dynamic resizing of the input buffer is slow, as it entails rescanning all the text matched so far by the current (generally huge) token. Due to both buffering of input and read-ahead, you cannot intermix calls to `<stdio.h>' routines, such as, getchar(), with `flex' rules and expect it to work. Call `input()' instead. The total table entries listed by the `-v' flag excludes the number of table entries needed to determine what rule has been matched. The number of entries is equal to the number of DFA states if the scanner does not use `REJECT', and somewhat greater than the number of states if it does. `REJECT' cannot be used with the `-f' or `-F' options. The `flex' internal algorithms need documentation. File: flex.info, Node: Bibliography, Next: FAQ, Prev: Limitations, Up: Top Additional Reading ****************** You may wish to read more about the following programs: * lex * yacc * sed * awk The following books may contain material of interest: John Levine, Tony Mason, and Doug Brown, _Lex & Yacc_, O'Reilly and Associates. Be sure to get the 2nd edition. M. E. Lesk and E. Schmidt, _LEX - Lexical Analyzer Generator_ Alfred Aho, Ravi Sethi and Jeffrey Ullman, _Compilers: Principles, Techniques and Tools_, Addison-Wesley (1986). Describes the pattern-matching techniques used by `flex' (deterministic finite automata). File: flex.info, Node: FAQ, Next: Appendices, Prev: Bibliography, Up: Top FAQ *** From time to time, the `flex' maintainer receives certain questions. Rather than repeat answers to well-understood problems, we publish them here. * Menu: * When was flex born?:: * How do I expand \ escape sequences in C-style quoted strings?:: * Why do flex scanners call fileno if it is not ANSI compatible?:: * Does flex support recursive pattern definitions?:: * How do I skip huge chunks of input (tens of megabytes) while using flex?:: * Flex is not matching my patterns in the same order that I defined them.:: * My actions are executing out of order or sometimes not at all.:: * How can I have multiple input sources feed into the same scanner at the same time?:: * Can I build nested parsers that work with the same input file?:: * How can I match text only at the end of a file?:: * How can I make REJECT cascade across start condition boundaries?:: * Why cant I use fast or full tables with interactive mode?:: * How much faster is -F or -f than -C?:: * If I have a simple grammar cant I just parse it with flex?:: * Why doesnt yyrestart() set the start state back to INITIAL?:: * How can I match C-style comments?:: * The period isnt working the way I expected.:: * Can I get the flex manual in another format?:: * Does there exist a "faster" NDFA->DFA algorithm?:: * How does flex compile the DFA so quickly?:: * How can I use more than 8192 rules?:: * How do I abandon a file in the middle of a scan and switch to a new file?:: * How do I execute code only during initialization (only before the first scan)?:: * How do I execute code at termination?:: * Where else can I find help?:: * Can I include comments in the "rules" section of the file?:: * I get an error about undefined yywrap().:: * How can I change the matching pattern at run time?:: * How can I expand macros in the input?:: * How can I build a two-pass scanner?:: * How do I match any string not matched in the preceding rules?:: * I am trying to port code from AT&T lex that uses yysptr and yysbuf.:: * Is there a way to make flex treat NULL like a regular character?:: * Whenever flex can not match the input it says "flex scanner jammed".:: * Why doesnt flex have non-greedy operators like perl does?:: * Memory leak - 16386 bytes allocated by malloc.:: * How do I track the byte offset for lseek()?:: * How do I use my own I/O classes in a C++ scanner?:: * How do I skip as many chars as possible?:: * deleteme00:: * Are certain equivalent patterns faster than others?:: * Is backing up a big deal?:: * Can I fake multi-byte character support?:: * deleteme01:: * Can you discuss some flex internals?:: * unput() messes up yy_at_bol:: * The | operator is not doing what I want:: * Why can't flex understand this variable trailing context pattern?:: * The ^ operator isn't working:: * Trailing context is getting confused with trailing optional patterns:: * Is flex GNU or not?:: * ERASEME53:: * I need to scan if-then-else blocks and while loops:: * ERASEME55:: * ERASEME56:: * ERASEME57:: * Is there a repository for flex scanners?:: * How can I conditionally compile or preprocess my flex input file?:: * Where can I find grammars for lex and yacc?:: * I get an end-of-buffer message for each character scanned.:: * unnamed-faq-62:: * unnamed-faq-63:: * unnamed-faq-64:: * unnamed-faq-65:: * unnamed-faq-66:: * unnamed-faq-67:: * unnamed-faq-68:: * unnamed-faq-69:: * unnamed-faq-70:: * unnamed-faq-71:: * unnamed-faq-72:: * unnamed-faq-73:: * unnamed-faq-74:: * unnamed-faq-75:: * unnamed-faq-76:: * unnamed-faq-77:: * unnamed-faq-78:: * unnamed-faq-79:: * unnamed-faq-80:: * unnamed-faq-81:: * unnamed-faq-82:: * unnamed-faq-83:: * unnamed-faq-84:: * unnamed-faq-85:: * unnamed-faq-86:: * unnamed-faq-87:: * unnamed-faq-88:: * unnamed-faq-90:: * unnamed-faq-91:: * unnamed-faq-92:: * unnamed-faq-93:: * unnamed-faq-94:: * unnamed-faq-95:: * unnamed-faq-96:: * unnamed-faq-97:: * unnamed-faq-98:: * unnamed-faq-99:: * unnamed-faq-100:: * unnamed-faq-101:: * What is the difference between YYLEX_PARAM and YY_DECL?:: * Why do I get "conflicting types for yylex" error?:: * How do I access the values set in a Flex action from within a Bison action?:: File: flex.info, Node: When was flex born?, Next: How do I expand \ escape sequences in C-style quoted strings?, Up: FAQ When was flex born? =================== Vern Paxson took over the `Software Tools' lex project from Jef Poskanzer in 1982. At that point it was written in Ratfor. Around 1987 or so, Paxson translated it into C, and a legend was born :-). File: flex.info, Node: How do I expand \ escape sequences in C-style quoted strings?, Next: Why do flex scanners call fileno if it is not ANSI compatible?, Prev: When was flex born?, Up: FAQ How do I expand \ escape sequences in C-style quoted strings? ============================================================= A key point when scanning quoted strings is that you cannot (easily) write a single rule that will precisely match the string if you allow things like embedded escape sequences and newlines. If you try to match strings with a single rule then you'll wind up having to rescan the string anyway to find any escape sequences. Instead you can use exclusive start conditions and a set of rules, one for matching non-escaped text, one for matching a single escape, one for matching an embedded newline, and one for recognizing the end of the string. Each of these rules is then faced with the question of where to put its intermediary results. The best solution is for the rules to append their local value of `yytext' to the end of a "string literal" buffer. A rule like the escape-matcher will append to the buffer the meaning of the escape sequence rather than the literal text in `yytext'. In this way, `yytext' does not need to be modified at all. File: flex.info, Node: Why do flex scanners call fileno if it is not ANSI compatible?, Next: Does flex support recursive pattern definitions?, Prev: How do I expand \ escape sequences in C-style quoted strings?, Up: FAQ Why do flex scanners call fileno if it is not ANSI compatible? ============================================================== Flex scanners call `fileno()' in order to get the file descriptor corresponding to `yyin'. The file descriptor may be passed to `isatty()' or `read()', depending upon which `%options' you specified. If your system does not have `fileno()' support, to get rid of the `read()' call, do not specify `%option read'. To get rid of the `isatty()' call, you must specify one of `%option always-interactive' or `%option never-interactive'. File: flex.info, Node: Does flex support recursive pattern definitions?, Next: How do I skip huge chunks of input (tens of megabytes) while using flex?, Prev: Why do flex scanners call fileno if it is not ANSI compatible?, Up: FAQ Does flex support recursive pattern definitions? ================================================ e.g., %% block "{"({block}|{statement})*"}" No. You cannot have recursive definitions. The pattern-matching power of regular expressions in general (and therefore flex scanners, too) is limited. In particular, regular expressions cannot "balance" parentheses to an arbitrary degree. For example, it's impossible to write a regular expression that matches all strings containing the same number of '{'s as '}'s. For more powerful pattern matching, you need a parser, such as `GNU bison'. File: flex.info, Node: How do I skip huge chunks of input (tens of megabytes) while using flex?, Next: Flex is not matching my patterns in the same order that I defined them., Prev: Does flex support recursive pattern definitions?, Up: FAQ How do I skip huge chunks of input (tens of megabytes) while using flex? ======================================================================== Use `fseek()' (or `lseek()') to position yyin, then call `yyrestart()'. File: flex.info, Node: Flex is not matching my patterns in the same order that I defined them., Next: My actions are executing out of order or sometimes not at all., Prev: How do I skip huge chunks of input (tens of megabytes) while using flex?, Up: FAQ Flex is not matching my patterns in the same order that I defined them. ======================================================================= `flex' picks the rule that matches the most text (i.e., the longest possible input string). This is because `flex' uses an entirely different matching technique ("deterministic finite automata") that actually does all of the matching simultaneously, in parallel. (Seems impossible, but it's actually a fairly simple technique once you understand the principles.) A side-effect of this parallel matching is that when the input matches more than one rule, `flex' scanners pick the rule that matched the _most_ text. This is explained further in the manual, in the section *Note Matching::. If you want `flex' to choose a shorter match, then you can work around this behavior by expanding your short rule to match more text, then put back the extra: data_.* yyless( 5 ); BEGIN BLOCKIDSTATE; Another fix would be to make the second rule active only during the `<BLOCKIDSTATE>' start condition, and make that start condition exclusive by declaring it with `%x' instead of `%s'. A final fix is to change the input language so that the ambiguity for `data_' is removed, by adding characters to it that don't match the identifier rule, or by removing characters (such as `_') from the identifier rule so it no longer matches `data_'. (Of course, you might also not have the option of changing the input language.) File: flex.info, Node: My actions are executing out of order or sometimes not at all., Next: How can I have multiple input sources feed into the same scanner at the same time?, Prev: Flex is not matching my patterns in the same order that I defined them., Up: FAQ My actions are executing out of order or sometimes not at all. ============================================================== Most likely, you have (in error) placed the opening `{' of the action block on a different line than the rule, e.g., ^(foo|bar) { <<<--- WRONG! } `flex' requires that the opening `{' of an action associated with a rule begin on the same line as does the rule. You need instead to write your rules as follows: ^(foo|bar) { // CORRECT! } File: flex.info, Node: How can I have multiple input sources feed into the same scanner at the same time?, Next: Can I build nested parsers that work with the same input file?, Prev: My actions are executing out of order or sometimes not at all., Up: FAQ How can I have multiple input sources feed into the same scanner at the same time? ================================================================================== If ... * your scanner is free of backtracking (verified using `flex''s `-b' flag), * AND you run your scanner interactively (`-I' option; default unless using special table compression options), * AND you feed it one character at a time by redefining `YY_INPUT' to do so, then every time it matches a token, it will have exhausted its input buffer (because the scanner is free of backtracking). This means you can safely use `select()' at the point and only call `yylex()' for another token if `select()' indicates there's data available. That is, move the `select()' out from the input function to a point where it determines whether `yylex()' gets called for the next token. With this approach, you will still have problems if your input can arrive piecemeal; `select()' could inform you that the beginning of a token is available, you call `yylex()' to get it, but it winds up blocking waiting for the later characters in the token. Here's another way: Move your input multiplexing inside of `YY_INPUT'. That is, whenever `YY_INPUT' is called, it `select()''s to see where input is available. If input is available for the scanner, it reads and returns the next byte. If input is available from another source, it calls whatever function is responsible for reading from that source. (If no input is available, it blocks until some input is available.) I've used this technique in an interpreter I wrote that both reads keyboard input using a `flex' scanner and IPC traffic from sockets, and it works fine. File: flex.info, Node: Can I build nested parsers that work with the same input file?, Next: How can I match text only at the end of a file?, Prev: How can I have multiple input sources feed into the same scanner at the same time?, Up: FAQ Can I build nested parsers that work with the same input file? ============================================================== This is not going to work without some additional effort. The reason is that `flex' block-buffers the input it reads from `yyin'. This means that the "outermost" `yylex()', when called, will automatically slurp up the first 8K of input available on yyin, and subsequent calls to other `yylex()''s won't see that input. You might be tempted to work around this problem by redefining `YY_INPUT' to only return a small amount of text, but it turns out that that approach is quite difficult. Instead, the best solution is to combine all of your scanners into one large scanner, using a different exclusive start condition for each. File: flex.info, Node: How can I match text only at the end of a file?, Next: How can I make REJECT cascade across start condition boundaries?, Prev: Can I build nested parsers that work with the same input file?, Up: FAQ How can I match text only at the end of a file? =============================================== There is no way to write a rule which is "match this text, but only if it comes at the end of the file". You can fake it, though, if you happen to have a character lying around that you don't allow in your input. Then you redefine `YY_INPUT' to call your own routine which, if it sees an `EOF', returns the magic character first (and remembers to return a real `EOF' next time it's called). Then you could write: <COMMENT>(.|\n)*{EOF_CHAR} /* saw comment at EOF */ File: flex.info, Node: How can I make REJECT cascade across start condition boundaries?, Next: Why cant I use fast or full tables with interactive mode?, Prev: How can I match text only at the end of a file?, Up: FAQ How can I make REJECT cascade across start condition boundaries? ================================================================ You can do this as follows. Suppose you have a start condition `A', and after exhausting all of the possible matches in `<A>', you want to try matches in `<INITIAL>'. Then you could use the following: %x A %% <A>rule_that_is_long ...; REJECT; <A>rule ...; REJECT; /* shorter rule */ <A>etc. ... <A>.|\n { /* Shortest and last rule in <A>, so * cascaded REJECT's will eventually * wind up matching this rule. We want * to now switch to the initial state * and try matching from there instead. */ yyless(0); /* put back matched text */ BEGIN(INITIAL); } File: flex.info, Node: Why cant I use fast or full tables with interactive mode?, Next: How much faster is -F or -f than -C?, Prev: How can I make REJECT cascade across start condition boundaries?, Up: FAQ Why can't I use fast or full tables with interactive mode? ========================================================== One of the assumptions flex makes is that interactive applications are inherently slow (they're waiting on a human after all). It has to do with how the scanner detects that it must be finished scanning a token. For interactive scanners, after scanning each character the current state is looked up in a table (essentially) to see whether there's a chance of another input character possibly extending the length of the match. If not, the scanner halts. For non-interactive scanners, the end-of-token test is much simpler, basically a compare with 0, so no memory bus cycles. Since the test occurs in the innermost scanning loop, one would like to make it go as fast as possible. Still, it seems reasonable to allow the user to choose to trade off a bit of performance in this area to gain the corresponding flexibility. There might be another reason, though, why fast scanners don't support the interactive option. File: flex.info, Node: How much faster is -F or -f than -C?, Next: If I have a simple grammar cant I just parse it with flex?, Prev: Why cant I use fast or full tables with interactive mode?, Up: FAQ How much faster is -F or -f than -C? ==================================== Much faster (factor of 2-3). File: flex.info, Node: If I have a simple grammar cant I just parse it with flex?, Next: Why doesnt yyrestart() set the start state back to INITIAL?, Prev: How much faster is -F or -f than -C?, Up: FAQ If I have a simple grammar can't I just parse it with flex? =========================================================== Is your grammar recursive? That's almost always a sign that you're better off using a parser/scanner rather than just trying to use a scanner alone. File: flex.info, Node: Why doesnt yyrestart() set the start state back to INITIAL?, Next: How can I match C-style comments?, Prev: If I have a simple grammar cant I just parse it with flex?, Up: FAQ Why doesn't yyrestart() set the start state back to INITIAL? ============================================================ There are two reasons. The first is that there might be programs that rely on the start state not changing across file changes. The second is that beginning with `flex' version 2.4, use of `yyrestart()' is no longer required, so fixing the problem there doesn't solve the more general problem. File: flex.info, Node: How can I match C-style comments?, Next: The period isnt working the way I expected., Prev: Why doesnt yyrestart() set the start state back to INITIAL?, Up: FAQ How can I match C-style comments? ================================= You might be tempted to try something like this: "/*".*"*/" // WRONG! or, worse, this: "/*"(.|\n)"*/" // WRONG! The above rules will eat too much input, and blow up on things like: /* a comment */ do_my_thing( "oops */" ); Here is one way which allows you to track line information: <INITIAL>{ "/*" BEGIN(IN_COMMENT); } <IN_COMMENT>{ "*/" BEGIN(INITIAL); [^*\n]+ // eat comment in chunks "*" // eat the lone star \n yylineno++; } File: flex.info, Node: The period isnt working the way I expected., Next: Can I get the flex manual in another format?, Prev: How can I match C-style comments?, Up: FAQ The '.' isn't working the way I expected. ========================================= Here are some tips for using `.': * A common mistake is to place the grouping parenthesis AFTER an operator, when you really meant to place the parenthesis BEFORE the operator, e.g., you probably want this `(foo|bar)+' and NOT this `(foo|bar+)'. The first pattern matches the words `foo' or `bar' any number of times, e.g., it matches the text `barfoofoobarfoo'. The second pattern matches a single instance of `foo' or a single instance of `bar' followed by one or more `r's, e.g., it matches the text `barrrr' . * A `.' inside `[]''s just means a literal`.' (period), and NOT "any character except newline". * Remember that `.' matches any character EXCEPT `\n' (and `EOF'). If you really want to match ANY character, including newlines, then use `(.|\n)' Beware that the regex `(.|\n)+' will match your entire input! * Finally, if you want to match a literal `.' (a period), then use `[.]' or `"."' File: flex.info, Node: Can I get the flex manual in another format?, Next: Does there exist a "faster" NDFA->DFA algorithm?, Prev: The period isnt working the way I expected., Up: FAQ Can I get the flex manual in another format? ============================================ The `flex' source distribution includes a texinfo manual. You are free to convert that texinfo into whatever format you desire. The `texinfo' package includes tools for conversion to a number of formats. File: flex.info, Node: Does there exist a "faster" NDFA->DFA algorithm?, Next: How does flex compile the DFA so quickly?, Prev: Can I get the flex manual in another format?, Up: FAQ Does there exist a "faster" NDFA->DFA algorithm? ================================================ There's no way around the potential exponential running time - it can take you exponential time just to enumerate all of the DFA states. In practice, though, the running time is closer to linear, or sometimes quadratic. File: flex.info, Node: How does flex compile the DFA so quickly?, Next: How can I use more than 8192 rules?, Prev: Does there exist a "faster" NDFA->DFA algorithm?, Up: FAQ How does flex compile the DFA so quickly? ========================================= There are two big speed wins that `flex' uses: 1. It analyzes the input rules to construct equivalence classes for those characters that always make the same transitions. It then rewrites the NFA using equivalence classes for transitions instead of characters. This cuts down the NFA->DFA computation time dramatically, to the point where, for uncompressed DFA tables, the DFA generation is often I/O bound in writing out the tables. 2. It maintains hash values for previously computed DFA states, so testing whether a newly constructed DFA state is equivalent to a previously constructed state can be done very quickly, by first comparing hash values. File: flex.info, Node: How can I use more than 8192 rules?, Next: How do I abandon a file in the middle of a scan and switch to a new file?, Prev: How does flex compile the DFA so quickly?, Up: FAQ How can I use more than 8192 rules? =================================== `Flex' is compiled with an upper limit of 8192 rules per scanner. If you need more than 8192 rules in your scanner, you'll have to recompile `flex' with the following changes in `flexdef.h': < #define YY_TRAILING_MASK 0x2000 < #define YY_TRAILING_HEAD_MASK 0x4000 -- > #define YY_TRAILING_MASK 0x20000000 > #define YY_TRAILING_HEAD_MASK 0x40000000 This should work okay as long as your C compiler uses 32 bit integers. But you might want to think about whether using such a huge number of rules is the best way to solve your problem. The following may also be relevant: With luck, you should be able to increase the definitions in flexdef.h for: #define JAMSTATE -32766 /* marks a reference to the state that always jams */ #define MAXIMUM_MNS 31999 #define BAD_SUBSCRIPT -32767 recompile everything, and it'll all work. Flex only has these 16-bit-like values built into it because a long time ago it was developed on a machine with 16-bit ints. I've given this advice to others in the past but haven't heard back from them whether it worked okay or not... File: flex.info, Node: How do I abandon a file in the middle of a scan and switch to a new file?, Next: How do I execute code only during initialization (only before the first scan)?, Prev: How can I use more than 8192 rules?, Up: FAQ How do I abandon a file in the middle of a scan and switch to a new file? ========================================================================= Just call `yyrestart(newfile)'. Be sure to reset the start state if you want a "fresh start, since `yyrestart' does NOT reset the start state back to `INITIAL'. File: flex.info, Node: How do I execute code only during initialization (only before the first scan)?, Next: How do I execute code at termination?, Prev: How do I abandon a file in the middle of a scan and switch to a new file?, Up: FAQ How do I execute code only during initialization (only before the first scan)? ============================================================================== You can specify an initial action by defining the macro `YY_USER_INIT' (though note that `yyout' may not be available at the time this macro is executed). Or you can add to the beginning of your rules section: %% /* Must be indented! */ static int did_init = 0; if ( ! did_init ){ do_my_init(); did_init = 1; } File: flex.info, Node: How do I execute code at termination?, Next: Where else can I find help?, Prev: How do I execute code only during initialization (only before the first scan)?, Up: FAQ How do I execute code at termination? ===================================== You can specify an action for the `<<EOF>>' rule. File: flex.info, Node: Where else can I find help?, Next: Can I include comments in the "rules" section of the file?, Prev: How do I execute code at termination?, Up: FAQ Where else can I find help? =========================== You can find the flex homepage on the web at `http://flex.sourceforge.net/'. See that page for details about flex mailing lists as well. File: flex.info, Node: Can I include comments in the "rules" section of the file?, Next: I get an error about undefined yywrap()., Prev: Where else can I find help?, Up: FAQ Can I include comments in the "rules" section of the file? ========================================================== Yes, just about anywhere you want to. See the manual for the specific syntax. File: flex.info, Node: I get an error about undefined yywrap()., Next: How can I change the matching pattern at run time?, Prev: Can I include comments in the "rules" section of the file?, Up: FAQ I get an error about undefined yywrap(). ======================================== You must supply a `yywrap()' function of your own, or link to `libfl.a' (which provides one), or use %option noyywrap in your source to say you don't want a `yywrap()' function. File: flex.info, Node: How can I change the matching pattern at run time?, Next: How can I expand macros in the input?, Prev: I get an error about undefined yywrap()., Up: FAQ How can I change the matching pattern at run time? ================================================== You can't, it's compiled into a static table when flex builds the scanner. File: flex.info, Node: How can I expand macros in the input?, Next: How can I build a two-pass scanner?, Prev: How can I change the matching pattern at run time?, Up: FAQ How can I expand macros in the input? ===================================== The best way to approach this problem is at a higher level, e.g., in the parser. However, you can do this using multiple input buffers. %% macro/[a-z]+ { /* Saw the macro "macro" followed by extra stuff. */ main_buffer = YY_CURRENT_BUFFER; expansion_buffer = yy_scan_string(expand(yytext)); yy_switch_to_buffer(expansion_buffer); } <<EOF>> { if ( expansion_buffer ) { // We were doing an expansion, return to where // we were. yy_switch_to_buffer(main_buffer); yy_delete_buffer(expansion_buffer); expansion_buffer = 0; } else yyterminate(); } You probably will want a stack of expansion buffers to allow nested macros. From the above though hopefully the idea is clear. File: flex.info, Node: How can I build a two-pass scanner?, Next: How do I match any string not matched in the preceding rules?, Prev: How can I expand macros in the input?, Up: FAQ How can I build a two-pass scanner? =================================== One way to do it is to filter the first pass to a temporary file, then process the temporary file on the second pass. You will probably see a performance hit, do to all the disk I/O. When you need to look ahead far forward like this, it almost always means that the right solution is to build a parse tree of the entire input, then walk it after the parse in order to generate the output. In a sense, this is a two-pass approach, once through the text and once through the parse tree, but the performance hit for the latter is usually an order of magnitude smaller, since everything is already classified, in binary format, and residing in memory.