flex.info-4 [plain text]

This is flex.info, produced by makeinfo version 4.5 from flex.texi.

INFO-DIR-SECTION Programming
START-INFO-DIR-ENTRY
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
END-INFO-DIR-ENTRY


   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1.  Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.
   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info,  Node: Overriding The Default Memory Management,  Next: A Note About yytext And Memory,  Prev: The Default Memory Management,  Up: Memory Management

Overriding The Default Memory Management
========================================

   Flex calls the functions `yyalloc', `yyrealloc', and `yyfree' when
it needs to allocate or free memory. By default, these functions are
wrappers around the standard C functions, `malloc', `realloc', and
`free', respectively. You can override the default implementations by
telling flex that you will provide your own implementations.

   To override the default implementations, you must do two things:

  1. Suppress the default implementations by specifying one or more of
     the following options:

        * `%option noyyalloc'

        * `%option noyyrealloc'

        * `%option noyyfree'.

  2. Provide your own implementation of the following functions: (1)


          // For a non-reentrant scanner
          void * yyalloc (size_t bytes);
          void * yyrealloc (void * ptr, size_t bytes);
          void   yyfree (void * ptr);
          
          // For a reentrant scanner
          void * yyalloc (size_t bytes, void * yyscanner);
          void * yyrealloc (void * ptr, size_t bytes, void * yyscanner);
          void   yyfree (void * ptr, void * yyscanner);


   In the following example, we will override all three memory
routines. We assume that there is a custom allocator with garbage
collection. In order to make this example interesting, we will use a
reentrant scanner, passing a pointer to the custom allocator through
`yyextra'.


     %{
     #include "some_allocator.h"
     %}
     
     /* Suppress the default implementations. */
     %option noyyalloc noyyrealloc noyyfree
     %option reentrant
     
     /* Initialize the allocator. */
     #define YY_EXTRA_TYPE  struct allocator*
     #define YY_USER_INIT  yyextra = allocator_create();
     
     %%
     .|\n   ;
     %%
     
     /* Provide our own implementations. */
     void * yyalloc (size_t bytes, void* yyscanner) {
         return allocator_alloc (yyextra, bytes);
     }
     
     void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
         return allocator_realloc (yyextra, bytes);
     }
     
     void yyfree (void * ptr, void * yyscanner) {      
         /* Do nothing -- we leave it to the garbage collector. */
     }

   ---------- Footnotes ----------

   (1) It is not necessary to override all (or any) of the memory
management routines.  You may, for example, override `yyrealloc', but
not `yyfree' or `yyalloc'.


File: flex.info,  Node: A Note About yytext And Memory,  Prev: Overriding The Default Memory Management,  Up: Memory Management

A Note About yytext And Memory
==============================

   When flex finds a match, `yytext' points to the first character of
the match in the input buffer. The string itself is part of the input
buffer, and is _NOT_ allocated separately. The value of yytext will be
overwritten the next time yylex() is called. In short, the value of
yytext is only valid from within the matched rule's action.

   Often, you want the value of yytext to persist for later processing,
i.e., by a parser with non-zero lookahead. In order to preserve yytext,
you will have to copy it with strdup() or a similar function. But this
introduces some headache because your parser is now responsible for
freeing the copy of yytext. If you use a yacc or bison parser,
(commonly used with flex), you will discover that the error recovery
mechanisms can cause memory to be leaked.

   To prevent memory leaks from strdup'd yytext, you will have to track
the memory somehow. Our experience has shown that a garbage collection
mechanism or a pooled memory mechanism will save you a lot of grief
when writing parsers.


File: flex.info,  Node: Serialized Tables,  Next: Diagnostics,  Prev: Memory Management,  Up: Top

Serialized Tables
*****************

   A `flex' scanner has the ability to save the DFA tables to a file,
and load them at runtime when needed.  The motivation for this feature
is to reduce the runtime memory footprint.  Traditionally, these tables
have been compiled into the scanner as C arrays, and are sometimes
quite large.  Since the tables are compiled into the scanner, the
memory used by the tables can never be freed.  This is a waste of
memory, especially if an application uses several scanners, but none of
them at the same time.

   The serialization feature allows the tables to be loaded at runtime,
before scanning begins. The tables may be discarded when scanning is
finished.

* Menu:

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::


File: flex.info,  Node: Creating Serialized Tables,  Next: Loading and Unloading Serialized Tables,  Prev: Serialized Tables,  Up: Serialized Tables

Creating Serialized Tables
==========================

   You may create a scanner with serialized tables by specifying:


         %option tables-file=FILE
     or
         --tables-file=FILE

   These options instruct flex to save the DFA tables to the file FILE.
The tables will _not_ be embedded in the generated scanner. The scanner
will not function on its own. The scanner will be dependent upon the
serialized tables. You must load the tables from this file at runtime
before you can scan anything.

   If you do not specify a filename to `--tables-file', the tables will
be saved to `lex.yy.tables', where `yy' is the appropriate prefix.

   If your project uses several different scanners, you can concatenate
the serialized tables into one file, and flex will find the correct set
of tables, using the scanner prefix as part of the lookup key. An
example follows:


     $ flex --tables-file --prefix=cpp cpp.l
     $ flex --tables-file --prefix=c   c.l
     $ cat lex.cpp.tables lex.c.tables  >  all.tables

   The above example created two scanners, `cpp', and `c'. Since we did
not specify a filename, the tables were serialized to `lex.c.tables' and
`lex.cpp.tables', respectively. Then, we concatenated the two files
together into `all.tables', which we will distribute with our project.
At runtime, we will open the file and tell flex to load the tables from
it.  Flex will find the correct tables automatically. (See next
section).


File: flex.info,  Node: Loading and Unloading Serialized Tables,  Next: Tables File Format,  Prev: Creating Serialized Tables,  Up: Serialized Tables

Loading and Unloading Serialized Tables
=======================================

   If you've built your scanner with `%option tables-file', then you
must load the scanner tables at runtime. This can be accomplished with
the following function:

 - Function: int yytables_fload (FILE* FP [, yyscan_t SCANNER])
     Locates scanner tables in the stream pointed to by FP and loads
     them.  Memory for the tables is allocated via `yyalloc'.  You must
     call this function before the first call to `yylex'. The argument
     SCANNER only appears in the reentrant scanner.  This function
     returns `0' (zero) on success, or non-zero on error.

   The loaded tables are *not* automatically destroyed (unloaded) when
you call `yylex_destroy'. The reason is that you may create several
scanners of the same type (in a reentrant scanner), each of which needs
access to these tables.  To avoid a nasty memory leak, you must call
the following function:

 - Function: int yytables_destroy ([yyscan_t SCANNER])
     Unloads the scanner tables. The tables must be loaded again before
     you can scan any more data.  The argument SCANNER only appears in
     the reentrant scanner.  This function returns `0' (zero) on
     success, or non-zero on error.

   *The functions `yytables_fload' and `yytables_destroy' are not
thread-safe.* You must ensure that these functions are called exactly
once (for each scanner type) in a threaded program, before any thread
calls `yylex'.  After the tables are loaded, they are never written to,
and no thread protection is required thereafter - until you destroy
them.


File: flex.info,  Node: Tables File Format,  Prev: Loading and Unloading Serialized Tables,  Up: Serialized Tables

Tables File Format
==================

   This section defines the file format of serialized `flex' tables.

   The tables format allows for one or more sets of tables to be
specified, where each set corresponds to a given scanner. Scanners are
indexed by name, as described below. The file format is as follows:


                      TABLE SET 1
                     +-------------------------------+
             Header  | uint32          th_magic;     |
                     | uint32          th_hsize;     |
                     | uint32          th_ssize;     |
                     | uint16          th_flags;     |
                     | char            th_version[]; |
                     | char            th_name[];    |
                     | uint8           th_pad64[];   |
                     +-------------------------------+
             Table 1 | uint16          td_id;        |
                     | uint16          td_flags;     |
                     | uint32          td_lolen;     |
                     | uint32          td_hilen;     |
                     | void            td_data[];    |
                     | uint8           td_pad64[];   |
                     +-------------------------------+
             Table 2 |                               |
                .    .                               .
                .    .                               .
                .    .                               .
                .    .                               .
             Table n |                               |
                     +-------------------------------+
                      TABLE SET 2
                           .
                           .
                           .
                      TABLE SET N

   The above diagram shows that a complete set of tables consists of a
header followed by multiple individual tables. Furthermore, multiple
complete sets may be present in the same file, each set with its own
header and tables. The sets are contiguous in the file. The only way to
know if another set follows is to check the next four bytes for the
magic number (or check for EOF). The header and tables sections are
padded to 64-bit boundaries. Below we describe each field in detail.
This format does not specify how the scanner will expand the given
data, i.e., data may be serialized as int8, but expanded to an int32
array at runtime. This is to reduce the size of the serialized data
where possible.  Remember, _all integer values are in network byte
order_.

Fields of a table header:

`th_magic'
     Magic number, always 0xF13C57B1.

`th_hsize'
     Size of this entire header, in bytes, including all fields plus
     any padding.

`th_ssize'
     Size of this entire set, in bytes, including the header, all
     tables, plus any padding.

`th_flags'
     Bit flags for this table set. Currently unused.

`th_version[]'
     Flex version in NULL-termninated string format. e.g., `2.5.13a'.
     This is the version of flex that was used to create the serialized
     tables.

`th_name[]'
     Contains the name of this table set. The default is `yytables',
     and is prefixed accordingly, e.g., `footables'. Must be
     NULL-terminated.

`th_pad64[]'
     Zero or more NULL bytes, padding the entire header to the next
     64-bit boundary as calculated from the beginning of the header.

Fields of a table:

`td_id'
     Specifies the table identifier. Possible values are:
    `YYTD_ID_ACCEPT (0x01)'
          `yy_accept'

    `YYTD_ID_BASE   (0x02)'
          `yy_base'

    `YYTD_ID_CHK    (0x03)'
          `yy_chk'

    `YYTD_ID_DEF    (0x04)'
          `yy_def'

    `YYTD_ID_EC     (0x05)'
          `yy_ec '

    `YYTD_ID_META   (0x06)'
          `yy_meta'

    `YYTD_ID_NUL_TRANS (0x07)'
          `yy_NUL_trans'

    `YYTD_ID_NXT (0x08)'
          `yy_nxt'. This array may be two dimensional. See the
          `td_hilen' field below.

    `YYTD_ID_RULE_CAN_MATCH_EOL (0x09)'
          `yy_rule_can_match_eol'

    `YYTD_ID_START_STATE_LIST (0x0A)'
          `yy_start_state_list'. This array is handled specially
          because it is an array of pointers to structs. See the
          `td_flags' field below.

    `YYTD_ID_TRANSITION (0x0B)'
          `yy_transition'. This array is handled specially because it
          is an array of structs. See the `td_lolen' field below.

    `YYTD_ID_ACCLIST (0x0C)'
          `yy_acclist'

`td_flags'
     Bit flags describing how to interpret the data in `td_data'.  The
     data arrays are one-dimensional by default, but may be two
     dimensional as specified in the `td_hilen' field.

    `YYTD_DATA8 (0x01)'
          The data is serialized as an array of type int8.

    `YYTD_DATA16 (0x02)'
          The data is serialized as an array of type int16.

    `YYTD_DATA32 (0x04)'
          The data is serialized as an array of type int32.

    `YYTD_PTRANS (0x08)'
          The data is a list of indexes of entries in the expanded
          `yy_transition' array.  Each index should be expanded to a
          pointer to the corresponding entry in the `yy_transition'
          array. We count on the fact that the `yy_transition' array
          has already been seen.

    `YYTD_STRUCT (0x10)'
          The data is a list of yy_trans_info structs, each of which
          consists of two integers. There is no padding between struct
          elements or between structs.  The type of each member is
          determined by the `YYTD_DATA*' bits.

`td_lolen'
     Specifies the number of elements in the lowest dimension array. If
     this is a one-dimensional array, then it is simply the number of
     elements in this array.  The element size is determined by the
     `td_flags' field.

`td_hilen'
     If `td_hilen' is non-zero, then the data is a two-dimensional
     array.  Otherwise, the data is a one-dimensional array. `td_hilen'
     contains the number of elements in the higher dimensional array,
     and `td_lolen' contains the number of elements in the lowest
     dimension.

     Conceptually, `td_data' is either `sometype td_data[td_lolen]', or
     `sometype td_data[td_hilen][td_lolen]', where `sometype' is
     specified by the `td_flags' field.  It is possible for both
     `td_lolen' and `td_hilen' to be zero, in which case `td_data' is a
     zero length array, and no data is loaded, i.e., this table is
     simply skipped. Flex does not currently generate tables of zero
     length.

`td_data[]'
     The table data. This array may be a one- or two-dimensional array,
     of type `int8', `int16', `int32', `struct yy_trans_info', or
     `struct yy_trans_info*',  depending upon the values in the
     `td_flags', `td_lolen', and `td_hilen' fields.

`td_pad64[]'
     Zero or more NULL bytes, padding the entire table to the next
     64-bit boundary as calculated from the beginning of this table.


File: flex.info,  Node: Diagnostics,  Next: Limitations,  Prev: Serialized Tables,  Up: Top

Diagnostics
***********

   The following is a list of `flex' diagnostic messages:

   * `warning, rule cannot be matched' indicates that the given rule
     cannot be matched because it follows other rules that will always
     match the same text as it.  For example, in the following `foo'
     cannot be matched because it comes after an identifier "catch-all"
     rule:


              [a-z]+    got_identifier();
              foo       got_foo();

     Using `REJECT' in a scanner suppresses this warning.

   * `warning, -s option given but default rule can be matched' means
     that it is possible (perhaps only in a particular start condition)
     that the default rule (match any single character) is the only one
     that will match a particular input.  Since `-s' was given,
     presumably this is not intended.

   * `reject_used_but_not_detected undefined' or
     `yymore_used_but_not_detected undefined'. These errors can occur
     at compile time.  They indicate that the scanner uses `REJECT' or
     `yymore()' but that `flex' failed to notice the fact, meaning that
     `flex' scanned the first two sections looking for occurrences of
     these actions and failed to find any, but somehow you snuck some in
     (via a #include file, for example).  Use `%option reject' or
     `%option yymore' to indicate to `flex' that you really do use
     these features.

   * `flex scanner jammed'. a scanner compiled with `-s' has
     encountered an input string which wasn't matched by any of its
     rules.  This error can also occur due to internal problems.

   * `token too large, exceeds YYLMAX'. your scanner uses `%array' and
     one of its rules matched a string longer than the `YYLMAX'
     constant (8K bytes by default).  You can increase the value by
     #define'ing `YYLMAX' in the definitions section of your `flex'
     input.

   * `scanner requires -8 flag to use the character 'x''. Your scanner
     specification includes recognizing the 8-bit character `'x'' and
     you did not specify the -8 flag, and your scanner defaulted to
     7-bit because you used the `-Cf' or `-CF' table compression
     options.  See the discussion of the `-7' flag, *Note Scanner
     Options::, for details.

   * `flex scanner push-back overflow'. you used `unput()' to push back
     so much text that the scanner's buffer could not hold both the
     pushed-back text and the current token in `yytext'.  Ideally the
     scanner should dynamically resize the buffer in this case, but at
     present it does not.

   * `input buffer overflow, can't enlarge buffer because scanner uses
     REJECT'.  the scanner was working on matching an extremely large
     token and needed to expand the input buffer.  This doesn't work
     with scanners that use `REJECT'.

   * `fatal flex scanner internal error--end of buffer missed'. This can
     occur in a scanner which is reentered after a long-jump has jumped
     out (or over) the scanner's activation frame.  Before reentering
     the scanner, use:

              yyrestart( yyin );
     or, as noted above, switch to using the C++ scanner class.

   * `too many start conditions in <> construct!'  you listed more start
     conditions in a <> construct than exist (so you must have listed at
     least one of them twice).


File: flex.info,  Node: Limitations,  Next: Bibliography,  Prev: Diagnostics,  Up: Top

Limitations
***********

   Some trailing context patterns cannot be properly matched and
generate warning messages (`dangerous trailing context').  These are
patterns where the ending of the first part of the rule matches the
beginning of the second part, such as `zx*/xy*', where the 'x*' matches
the 'x' at the beginning of the trailing context.  (Note that the POSIX
draft states that the text matched by such patterns is undefined.)  For
some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the abovementioned performance loss.
In particular, parts using `|' or `{n}' (such as `foo{3}') are always
considered variable-length.  Combining trailing context with the
special `|' action can result in _fixed_ trailing context being turned
into the more expensive _variable_ trailing context.  For example, in
the following:


         %%
         abc      |
         xyz/def

   Use of `unput()' invalidates yytext and yyleng, unless the `%array'
directive or the `-l' option has been used.  Pattern-matching of `NUL's
is substantially slower than matching other characters.  Dynamic
resizing of the input buffer is slow, as it entails rescanning all the
text matched so far by the current (generally huge) token.  Due to both
buffering of input and read-ahead, you cannot intermix calls to
`<stdio.h>' routines, such as, getchar(), with `flex' rules and expect
it to work.  Call `input()' instead.  The total table entries listed by
the `-v' flag excludes the number of table entries needed to determine
what rule has been matched.  The number of entries is equal to the
number of DFA states if the scanner does not use `REJECT', and somewhat
greater than the number of states if it does.  `REJECT' cannot be used
with the `-f' or `-F' options.

   The `flex' internal algorithms need documentation.


File: flex.info,  Node: Bibliography,  Next: FAQ,  Prev: Limitations,  Up: Top

Additional Reading
******************

   You may wish to read more about the following programs:
   * lex

   * yacc

   * sed

   * awk

   The following books may contain material of interest:

   John Levine, Tony Mason, and Doug Brown, _Lex & Yacc_, O'Reilly and
Associates.  Be sure to get the 2nd edition.

   M. E. Lesk and E. Schmidt, _LEX - Lexical Analyzer Generator_

   Alfred Aho, Ravi Sethi and Jeffrey Ullman, _Compilers: Principles,
Techniques and Tools_, Addison-Wesley (1986).  Describes the
pattern-matching techniques used by `flex' (deterministic finite
automata).


File: flex.info,  Node: FAQ,  Next: Appendices,  Prev: Bibliography,  Up: Top

FAQ
***

   From time to time, the `flex' maintainer receives certain questions.
Rather than repeat answers to well-understood problems, we publish them
here.

* Menu:

* When was flex born?::
* How do I expand \ escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesnt yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isnt working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesnt flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::


File: flex.info,  Node: When was flex born?,  Next: How do I expand \ escape sequences in C-style quoted strings?,  Up: FAQ

When was flex born?
===================

   Vern Paxson took over the `Software Tools' lex project from Jef
Poskanzer in 1982.  At that point it was written in Ratfor.  Around
1987 or so, Paxson translated it into C, and a legend was born :-).


File: flex.info,  Node: How do I expand \ escape sequences in C-style quoted strings?,  Next: Why do flex scanners call fileno if it is not ANSI compatible?,  Prev: When was flex born?,  Up: FAQ

How do I expand \ escape sequences in C-style quoted strings?
=============================================================

   A key point when scanning quoted strings is that you cannot (easily)
write a single rule that will precisely match the string if you allow
things like embedded escape sequences and newlines.  If you try to
match strings with a single rule then you'll wind up having to rescan
the string anyway to find any escape sequences.

   Instead you can use exclusive start conditions and a set of rules,
one for matching non-escaped text, one for matching a single escape,
one for matching an embedded newline, and one for recognizing the end
of the string.  Each of these rules is then faced with the question of
where to put its intermediary results.  The best solution is for the
rules to append their local value of `yytext' to the end of a "string
literal" buffer.  A rule like the escape-matcher will append to the
buffer the meaning of the escape sequence rather than the literal text
in `yytext'.  In this way, `yytext' does not need to be modified at all.


File: flex.info,  Node: Why do flex scanners call fileno if it is not ANSI compatible?,  Next: Does flex support recursive pattern definitions?,  Prev: How do I expand \ escape sequences in C-style quoted strings?,  Up: FAQ

Why do flex scanners call fileno if it is not ANSI compatible?
==============================================================

   Flex scanners call `fileno()' in order to get the file descriptor
corresponding to `yyin'. The file descriptor may be passed to
`isatty()' or `read()', depending upon which `%options' you specified.
If your system does not have `fileno()' support, to get rid of the
`read()' call, do not specify `%option read'. To get rid of the
`isatty()' call, you must specify one of `%option always-interactive' or
`%option never-interactive'.


File: flex.info,  Node: Does flex support recursive pattern definitions?,  Next: How do I skip huge chunks of input (tens of megabytes) while using flex?,  Prev: Why do flex scanners call fileno if it is not ANSI compatible?,  Up: FAQ

Does flex support recursive pattern definitions?
================================================

   e.g.,


     %%
     block   "{"({block}|{statement})*"}"

   No. You cannot have recursive definitions.  The pattern-matching
power of regular expressions in general (and therefore flex scanners,
too) is limited.  In particular, regular expressions cannot "balance"
parentheses to an arbitrary degree.  For example, it's impossible to
write a regular expression that matches all strings containing the same
number of '{'s as '}'s.  For more powerful pattern matching, you need a
parser, such as `GNU bison'.


File: flex.info,  Node: How do I skip huge chunks of input (tens of megabytes) while using flex?,  Next: Flex is not matching my patterns in the same order that I defined them.,  Prev: Does flex support recursive pattern definitions?,  Up: FAQ

How do I skip huge chunks of input (tens of megabytes) while using flex?
========================================================================

   Use `fseek()' (or `lseek()') to position yyin, then call
`yyrestart()'.


File: flex.info,  Node: Flex is not matching my patterns in the same order that I defined them.,  Next: My actions are executing out of order or sometimes not at all.,  Prev: How do I skip huge chunks of input (tens of megabytes) while using flex?,  Up: FAQ

Flex is not matching my patterns in the same order that I defined them.
=======================================================================

   `flex' picks the rule that matches the most text (i.e., the longest
possible input string).  This is because `flex' uses an entirely
different matching technique ("deterministic finite automata") that
actually does all of the matching simultaneously, in parallel.  (Seems
impossible, but it's actually a fairly simple technique once you
understand the principles.)

   A side-effect of this parallel matching is that when the input
matches more than one rule, `flex' scanners pick the rule that matched
the _most_ text. This is explained further in the manual, in the
section *Note Matching::.

   If you want `flex' to choose a shorter match, then you can work
around this behavior by expanding your short rule to match more text,
then put back the extra:


     data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;

   Another fix would be to make the second rule active only during the
`<BLOCKIDSTATE>' start condition, and make that start condition
exclusive by declaring it with `%x' instead of `%s'.

   A final fix is to change the input language so that the ambiguity for
`data_' is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as `_') from the
identifier rule so it no longer matches `data_'.  (Of course, you might
also not have the option of changing the input language.)


File: flex.info,  Node: My actions are executing out of order or sometimes not at all.,  Next: How can I have multiple input sources feed into the same scanner at the same time?,  Prev: Flex is not matching my patterns in the same order that I defined them.,  Up: FAQ

My actions are executing out of order or sometimes not at all.
==============================================================

   Most likely, you have (in error) placed the opening `{' of the action
block on a different line than the rule, e.g.,


     ^(foo|bar)
     {  <<<--- WRONG!
     
     }

   `flex' requires that the opening `{' of an action associated with a
rule begin on the same line as does the rule.  You need instead to
write your rules as follows:


     ^(foo|bar)   {  // CORRECT!
     
     }


File: flex.info,  Node: How can I have multiple input sources feed into the same scanner at the same time?,  Next: Can I build nested parsers that work with the same input file?,  Prev: My actions are executing out of order or sometimes not at all.,  Up: FAQ

How can I have multiple input sources feed into the same scanner at the same time?
==================================================================================

   If ...
   * your scanner is free of backtracking (verified using `flex''s `-b'
     flag),

   * AND you run your scanner interactively (`-I' option; default
     unless using special table compression options),

   * AND you feed it one character at a time by redefining `YY_INPUT'
     to do so,

   then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking).  This means you
can safely use `select()' at the point and only call `yylex()' for
another token if `select()' indicates there's data available.

   That is, move the `select()' out from the input function to a point
where it determines whether `yylex()' gets called for the next token.

   With this approach, you will still have problems if your input can
arrive piecemeal; `select()' could inform you that the beginning of a
token is available, you call `yylex()' to get it, but it winds up
blocking waiting for the later characters in the token.

   Here's another way:  Move your input multiplexing inside of
`YY_INPUT'.  That is, whenever `YY_INPUT' is called, it `select()''s to
see where input is available.  If input is available for the scanner,
it reads and returns the next byte.  If input is available from another
source, it calls whatever function is responsible for reading from that
source.  (If no input is available, it blocks until some input is
available.)  I've used this technique in an interpreter I wrote that
both reads keyboard input using a `flex' scanner and IPC traffic from
sockets, and it works fine.


File: flex.info,  Node: Can I build nested parsers that work with the same input file?,  Next: How can I match text only at the end of a file?,  Prev: How can I have multiple input sources feed into the same scanner at the same time?,  Up: FAQ

Can I build nested parsers that work with the same input file?
==============================================================

   This is not going to work without some additional effort.  The
reason is that `flex' block-buffers the input it reads from `yyin'.
This means that the "outermost" `yylex()', when called, will
automatically slurp up the first 8K of input available on yyin, and
subsequent calls to other `yylex()''s won't see that input.  You might
be tempted to work around this problem by redefining `YY_INPUT' to only
return a small amount of text, but it turns out that that approach is
quite difficult.  Instead, the best solution is to combine all of your
scanners into one large scanner, using a different exclusive start
condition for each.


File: flex.info,  Node: How can I match text only at the end of a file?,  Next: How can I make REJECT cascade across start condition boundaries?,  Prev: Can I build nested parsers that work with the same input file?,  Up: FAQ

How can I match text only at the end of a file?
===============================================

   There is no way to write a rule which is "match this text, but only
if it comes at the end of the file".  You can fake it, though, if you
happen to have a character lying around that you don't allow in your
input.  Then you redefine `YY_INPUT' to call your own routine which, if
it sees an `EOF', returns the magic character first (and remembers to
return a real `EOF' next time it's called).  Then you could write:


     <COMMENT>(.|\n)*{EOF_CHAR}    /* saw comment at EOF */


File: flex.info,  Node: How can I make REJECT cascade across start condition boundaries?,  Next: Why cant I use fast or full tables with interactive mode?,  Prev: How can I match text only at the end of a file?,  Up: FAQ

How can I make REJECT cascade across start condition boundaries?
================================================================

   You can do this as follows.  Suppose you have a start condition `A',
and after exhausting all of the possible matches in `<A>', you want to
try matches in `<INITIAL>'.  Then you could use the following:


     %x A
     %%
     <A>rule_that_is_long    ...; REJECT;
     <A>rule                 ...; REJECT; /* shorter rule */
     <A>etc.
     ...
     <A>.|\n  {
     /* Shortest and last rule in <A>, so
     * cascaded REJECT's will eventually
     * wind up matching this rule.  We want
     * to now switch to the initial state
     * and try matching from there instead.
     */
     yyless(0);    /* put back matched text */
     BEGIN(INITIAL);
     }


File: flex.info,  Node: Why cant I use fast or full tables with interactive mode?,  Next: How much faster is -F or -f than -C?,  Prev: How can I make REJECT cascade across start condition boundaries?,  Up: FAQ

Why can't I use fast or full tables with interactive mode?
==========================================================

   One of the assumptions flex makes is that interactive applications
are inherently slow (they're waiting on a human after all).  It has to
do with how the scanner detects that it must be finished scanning a
token.  For interactive scanners, after scanning each character the
current state is looked up in a table (essentially) to see whether
there's a chance of another input character possibly extending the
length of the match.  If not, the scanner halts.  For non-interactive
scanners, the end-of-token test is much simpler, basically a compare
with 0, so no memory bus cycles.  Since the test occurs in the
innermost scanning loop, one would like to make it go as fast as
possible.

   Still, it seems reasonable to allow the user to choose to trade off
a bit of performance in this area to gain the corresponding
flexibility.  There might be another reason, though, why fast scanners
don't support the interactive option.


File: flex.info,  Node: How much faster is -F or -f than -C?,  Next: If I have a simple grammar cant I just parse it with flex?,  Prev: Why cant I use fast or full tables with interactive mode?,  Up: FAQ

How much faster is -F or -f than -C?
====================================

   Much faster (factor of 2-3).


File: flex.info,  Node: If I have a simple grammar cant I just parse it with flex?,  Next: Why doesnt yyrestart() set the start state back to INITIAL?,  Prev: How much faster is -F or -f than -C?,  Up: FAQ

If I have a simple grammar can't I just parse it with flex?
===========================================================

   Is your grammar recursive? That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a
scanner alone.


File: flex.info,  Node: Why doesnt yyrestart() set the start state back to INITIAL?,  Next: How can I match C-style comments?,  Prev: If I have a simple grammar cant I just parse it with flex?,  Up: FAQ

Why doesn't yyrestart() set the start state back to INITIAL?
============================================================

   There are two reasons.  The first is that there might be programs
that rely on the start state not changing across file changes.  The
second is that beginning with `flex' version 2.4, use of `yyrestart()'
is no longer required, so fixing the problem there doesn't solve the
more general problem.


File: flex.info,  Node: How can I match C-style comments?,  Next: The period isnt working the way I expected.,  Prev: Why doesnt yyrestart() set the start state back to INITIAL?,  Up: FAQ

How can I match C-style comments?
=================================

   You might be tempted to try something like this:


     "/*".*"*/"       // WRONG!

   or, worse, this:


     "/*"(.|\n)"*/"   // WRONG!

   The above rules will eat too much input, and blow up on things like:


     /* a comment */ do_my_thing( "oops */" );

   Here is one way which allows you to track line information:


     <INITIAL>{
     "/*"              BEGIN(IN_COMMENT);
     }
     <IN_COMMENT>{
     "*/"      BEGIN(INITIAL);
     [^*\n]+   // eat comment in chunks
     "*"       // eat the lone star
     \n        yylineno++;
     }


File: flex.info,  Node: The period isnt working the way I expected.,  Next: Can I get the flex manual in another format?,  Prev: How can I match C-style comments?,  Up: FAQ

The '.' isn't working the way I expected.
=========================================

   Here are some tips for using `.':

   * A common mistake is to place the grouping parenthesis AFTER an
     operator, when you really meant to place the parenthesis BEFORE
     the operator, e.g., you probably want this `(foo|bar)+' and NOT
     this `(foo|bar+)'.

     The first pattern matches the words `foo' or `bar' any number of
     times, e.g., it matches the text `barfoofoobarfoo'. The second
     pattern matches a single instance of `foo' or a single instance of
     `bar' followed by one or more `r's, e.g., it matches the text
     `barrrr' .

   * A `.' inside `[]''s just means a literal`.' (period), and NOT "any
     character except newline".

   * Remember that `.' matches any character EXCEPT `\n' (and `EOF').
     If you really want to match ANY character, including newlines,
     then use `(.|\n)' Beware that the regex `(.|\n)+' will match your
     entire input!

   * Finally, if you want to match a literal `.' (a period), then use
     `[.]' or `"."'


File: flex.info,  Node: Can I get the flex manual in another format?,  Next: Does there exist a "faster" NDFA->DFA algorithm?,  Prev: The period isnt working the way I expected.,  Up: FAQ

Can I get the flex manual in another format?
============================================

   The `flex' source distribution  includes a texinfo manual. You are
free to convert that texinfo into whatever format you desire. The
`texinfo' package includes tools for conversion to a number of formats.


File: flex.info,  Node: Does there exist a "faster" NDFA->DFA algorithm?,  Next: How does flex compile the DFA so quickly?,  Prev: Can I get the flex manual in another format?,  Up: FAQ

Does there exist a "faster" NDFA->DFA algorithm?
================================================

   There's no way around the potential exponential running time - it
can take you exponential time just to enumerate all of the DFA states.
In practice, though, the running time is closer to linear, or sometimes
quadratic.


File: flex.info,  Node: How does flex compile the DFA so quickly?,  Next: How can I use more than 8192 rules?,  Prev: Does there exist a "faster" NDFA->DFA algorithm?,  Up: FAQ

How does flex compile the DFA so quickly?
=========================================

   There are two big speed wins that `flex' uses:

  1. It analyzes the input rules to construct equivalence classes for
     those characters that always make the same transitions.  It then
     rewrites the NFA using equivalence classes for transitions instead
     of characters.  This cuts down the NFA->DFA computation time
     dramatically, to the point where, for uncompressed DFA tables, the
     DFA generation is often I/O bound in writing out the tables.

  2. It maintains hash values for previously computed DFA states, so
     testing whether a newly constructed DFA state is equivalent to a
     previously constructed state can be done very quickly, by first
     comparing hash values.


File: flex.info,  Node: How can I use more than 8192 rules?,  Next: How do I abandon a file in the middle of a scan and switch to a new file?,  Prev: How does flex compile the DFA so quickly?,  Up: FAQ

How can I use more than 8192 rules?
===================================

   `Flex' is compiled with an upper limit of 8192 rules per scanner.
If you need more than 8192 rules in your scanner, you'll have to
recompile `flex' with the following changes in `flexdef.h':


     <    #define YY_TRAILING_MASK 0x2000
     <    #define YY_TRAILING_HEAD_MASK 0x4000
     --
     >    #define YY_TRAILING_MASK 0x20000000
     >    #define YY_TRAILING_HEAD_MASK 0x40000000

   This should work okay as long as your C compiler uses 32 bit
integers.  But you might want to think about whether using such a huge
number of rules is the best way to solve your problem.

   The following may also be relevant:

   With luck, you should be able to increase the definitions in
flexdef.h for:


     #define JAMSTATE -32766 /* marks a reference to the state that always jams */
     #define MAXIMUM_MNS 31999
     #define BAD_SUBSCRIPT -32767

   recompile everything, and it'll all work.  Flex only has these
16-bit-like values built into it because a long time ago it was
developed on a machine with 16-bit ints.  I've given this advice to
others in the past but haven't heard back from them whether it worked
okay or not...


File: flex.info,  Node: How do I abandon a file in the middle of a scan and switch to a new file?,  Next: How do I execute code only during initialization (only before the first scan)?,  Prev: How can I use more than 8192 rules?,  Up: FAQ

How do I abandon a file in the middle of a scan and switch to a new file?
=========================================================================

   Just call `yyrestart(newfile)'. Be sure to reset the start state if
you want a "fresh start, since `yyrestart' does NOT reset the start
state back to `INITIAL'.


File: flex.info,  Node: How do I execute code only during initialization (only before the first scan)?,  Next: How do I execute code at termination?,  Prev: How do I abandon a file in the middle of a scan and switch to a new file?,  Up: FAQ

How do I execute code only during initialization (only before the first scan)?
==============================================================================

   You can specify an initial action by defining the macro
`YY_USER_INIT' (though note that `yyout' may not be available at the
time this macro is executed).  Or you can add to the beginning of your
rules section:


     %%
     /* Must be indented! */
     static int did_init = 0;
     
     if ( ! did_init ){
     do_my_init();
     did_init = 1;
     }


File: flex.info,  Node: How do I execute code at termination?,  Next: Where else can I find help?,  Prev: How do I execute code only during initialization (only before the first scan)?,  Up: FAQ

How do I execute code at termination?
=====================================

   You can specify an action for the `<<EOF>>' rule.


File: flex.info,  Node: Where else can I find help?,  Next: Can I include comments in the "rules" section of the file?,  Prev: How do I execute code at termination?,  Up: FAQ

Where else can I find help?
===========================

   You can find the flex homepage on the web at
`http://flex.sourceforge.net/'. See that page for details about flex
mailing lists as well.


File: flex.info,  Node: Can I include comments in the "rules" section of the file?,  Next: I get an error about undefined yywrap().,  Prev: Where else can I find help?,  Up: FAQ

Can I include comments in the "rules" section of the file?
==========================================================

   Yes, just about anywhere you want to. See the manual for the
specific syntax.


File: flex.info,  Node: I get an error about undefined yywrap().,  Next: How can I change the matching pattern at run time?,  Prev: Can I include comments in the "rules" section of the file?,  Up: FAQ

I get an error about undefined yywrap().
========================================

   You must supply a `yywrap()' function of your own, or link to
`libfl.a' (which provides one), or use


     %option noyywrap

   in your source to say you don't want a `yywrap()' function.


File: flex.info,  Node: How can I change the matching pattern at run time?,  Next: How can I expand macros in the input?,  Prev: I get an error about undefined yywrap().,  Up: FAQ

How can I change the matching pattern at run time?
==================================================

   You can't, it's compiled into a static table when flex builds the
scanner.


File: flex.info,  Node: How can I expand macros in the input?,  Next: How can I build a two-pass scanner?,  Prev: How can I change the matching pattern at run time?,  Up: FAQ

How can I expand macros in the input?
=====================================

   The best way to approach this problem is at a higher level, e.g., in
the parser.

   However, you can do this using multiple input buffers.


     %%
     macro/[a-z]+	{
     /* Saw the macro "macro" followed by extra stuff. */
     main_buffer = YY_CURRENT_BUFFER;
     expansion_buffer = yy_scan_string(expand(yytext));
     yy_switch_to_buffer(expansion_buffer);
     }
     
     <<EOF>>	{
     if ( expansion_buffer )
     {
     // We were doing an expansion, return to where
     // we were.
     yy_switch_to_buffer(main_buffer);
     yy_delete_buffer(expansion_buffer);
     expansion_buffer = 0;
     }
     else
     yyterminate();
     }

   You probably will want a stack of expansion buffers to allow nested
macros.  From the above though hopefully the idea is clear.


File: flex.info,  Node: How can I build a two-pass scanner?,  Next: How do I match any string not matched in the preceding rules?,  Prev: How can I expand macros in the input?,  Up: FAQ

How can I build a two-pass scanner?
===================================

   One way to do it is to filter the first pass to a temporary file,
then process the temporary file on the second pass. You will probably
see a performance hit, do to all the disk I/O.

   When you need to look ahead far forward like this, it almost always
means that the right solution is to build a parse tree of the entire
input, then walk it after the parse in order to generate the output.
In a sense, this is a two-pass approach, once through the text and once
through the parse tree, but the performance hit for the latter is
usually an order of magnitude smaller, since everything is already
classified, in binary format, and residing in memory.