This is flex.info, produced by makeinfo version 4.5 from flex.texi.

INFO-DIR-SECTION Programming
START-INFO-DIR-ENTRY
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
END-INFO-DIR-ENTRY


   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1.  Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.

   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info,  Node: Top,  Next: Copyright,  Prev: (dir),  Up: (dir)

flex
****

   This manual describes `flex', a tool for generating programs that
perform pattern-matching on text.  The manual includes both tutorial and
reference sections.

   This edition of `The flex Manual' documents `flex' version 2.5.33.
It was last updated on 20 February 2006.

* Menu:

* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifing Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand \ escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesnt yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isnt working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesnt flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::


File: flex.info,  Node: Copyright,  Next: Reporting Bugs,  Prev: Top,  Up: Top

Copyright
*********


   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1.  Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.

   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

File: flex.info,  Node: Reporting Bugs,  Next: Introduction,  Prev: Copyright,  Up: Top

Reporting Bugs
**************

   If you have problems with `flex' or think you have found a bug,
please send mail detailing your problem to
<flex-help@lists.sourceforge.net>. Patches are always welcome.


File: flex.info,  Node: Introduction,  Next: Simple Examples,  Prev: Reporting Bugs,  Up: Top

Introduction
************

   `flex' is a tool for generating "scanners".  A scanner is a program
which recognizes lexical patterns in text.  The `flex' program reads
the given input files, or its standard input if no file names are
given, for a description of a scanner to generate.  The description is
in the form of pairs of regular expressions and C code, called "rules".
`flex' generates as output a C source file, `lex.yy.c' by default,
which defines a routine `yylex()'.  This file can be compiled and
linked with the flex runtime library to produce an executable.  When
the executable is run, it analyzes its input for occurrences of the
regular expressions.  Whenever it finds one, it executes the
corresponding C code.


File: flex.info,  Node: Simple Examples,  Next: Format,  Prev: Introduction,  Up: Top

Some Simple Examples
********************

   First some simple examples to get the flavor of how one uses `flex'.

   The following `flex' input specifies a scanner which, when it
encounters the string `username' will replace it with the user's login
name:


         %%
         username    printf( "%s", getlogin() );

   By default, any text not matched by a `flex' scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of `username' expanded.  In this
input, there is just one rule.  `username' is the "pattern" and the
`printf' is the "action".  The `%%' symbol marks the beginning of the
rules.
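
   As a rough illustration of what this scanner does at run time, the
following C sketch performs the same substitution by hand.  It is not
the code that `flex' generates; the function name `expand_username' and
its parameters are hypothetical, and the login name is passed in rather
than obtained from `getlogin()' so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of flex: copy `in' to `out', replacing
   each occurrence of "username" with `login'.  Returns the number of
   bytes written (excluding the terminating NUL), or (size_t) -1 if
   `out', which holds `outsz' bytes, is too small. */
size_t expand_username( const char *in, const char *login,
                        char *out, size_t outsz )
    {
    static const char pattern[] = "username";
    const size_t patlen = sizeof pattern - 1;
    size_t loglen, n = 0;

    assert( in != NULL && login != NULL && out != NULL );
    loglen = strlen( login );

    while ( *in != '\0' )
        {
        if ( strncmp( in, pattern, patlen ) == 0 )
            {   /* the rule's action: emit the login name */
            if ( n + loglen >= outsz )
                return (size_t) -1;
            memcpy( out + n, login, loglen );
            n += loglen;
            in += patlen;
            }
        else
            {   /* default rule: copy one unmatched character */
            if ( n + 1 >= outsz )
                return (size_t) -1;
            out[n++] = *in++;
            }
        }

    if ( n >= outsz )
        return (size_t) -1;
    out[n] = '\0';
    return n;
    }
```

   Calling `expand_username( "hi username!", "vern", buf, sizeof buf )'
with a sufficiently large `buf' leaves `hi vern!' in `buf'.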

   Here's another simple example:


                 int num_lines = 0, num_chars = 0;
     
         %%
         \n      ++num_lines; ++num_chars;
         .       ++num_chars;
     
         %%
         int main( void )
                 {
                 yylex();
                 printf( "# of lines = %d, # of chars = %d\n",
                         num_lines, num_chars );
                 return 0;
                 }

   This scanner counts the number of characters and the number of lines
in its input. It produces no output other than the final report on the
character and line counts.  The first line declares two globals,
`num_lines' and `num_chars', which are accessible both inside `yylex()'
and in the `main()' routine declared after the second `%%'.  There are
two rules, one which matches a newline (`\n') and increments both the
line count and the character count, and one which matches any character
other than a newline (indicated by the `.' regular expression).
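
   The effect of the two rules can be sketched in plain C as follows.
This is only an illustration of the counting logic, not generated
scanner code; `struct counts' and `count_input' are hypothetical names,
and the input is taken from a string instead of `yyin'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the sketch. */
struct counts
    {
    int lines;
    int chars;
    };

/* Count newlines and characters in `in', mirroring the two rules. */
struct counts count_input( const char *in )
    {
    struct counts c = { 0, 0 };

    assert( in != NULL );

    for ( ; *in != '\0'; ++in )
        {
        if ( *in == '\n' )
            ++c.lines;      /* the \n rule also counts a line */
        ++c.chars;          /* both rules count the character */
        }

    return c;
    }
```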

   A somewhat more complicated example:


         /* scanner for a toy Pascal-like language */
     
         %{
         /* need this for the call to atof() below */
         #include <math.h>
         %}
     
         DIGIT    [0-9]
         ID       [a-z][a-z0-9]*
     
         %%
     
         {DIGIT}+    {
                     printf( "An integer: %s (%d)\n", yytext,
                             atoi( yytext ) );
                     }
     
         {DIGIT}+"."{DIGIT}*        {
                     printf( "A float: %s (%g)\n", yytext,
                             atof( yytext ) );
                     }
     
         if|then|begin|end|procedure|function        {
                     printf( "A keyword: %s\n", yytext );
                     }
     
         {ID}        printf( "An identifier: %s\n", yytext );
     
         "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
     
         "{"[^{}\n]*"}"     /* eat up one-line comments */
     
         [ \t\n]+          /* eat up whitespace */
     
         .           printf( "Unrecognized character: %s\n", yytext );
     
         %%
     
         int main( int argc, char **argv )
             {
             ++argv, --argc;  /* skip over program name */
             if ( argc > 0 )
                     yyin = fopen( argv[0], "r" );
             else
                     yyin = stdin;
     
             yylex();
             return 0;
             }

   This is the beginning of a simple scanner for a language like
Pascal.  It identifies different types of "tokens" and reports on what
it has seen.

   The details of this example will be explained in the following
sections.


File: flex.info,  Node: Format,  Next: Patterns,  Prev: Simple Examples,  Up: Top

Format of the Input File
************************

   The `flex' input file consists of three sections, separated by a
line containing only `%%'.


         definitions
         %%
         rules
         %%
         user code

* Menu:

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::


File: flex.info,  Node: Definitions Section,  Next: Rules Section,  Prev: Format,  Up: Format

Format of the Definitions Section
=================================

   The "definitions section" contains declarations of simple "name"
definitions to simplify the scanner specification, and declarations of
"start conditions", which are explained in a later section.

   Name definitions have the form:


         name definition

   The `name' is a word beginning with a letter or an underscore (`_')
followed by zero or more letters, digits, `_', or `-' (dash).  The
definition is taken to begin at the first non-whitespace character
following the name and continuing to the end of the line.  The
definition can subsequently be referred to using `{name}', which will
expand to `(definition)'.  For example,


         DIGIT    [0-9]
         ID       [a-z][a-z0-9]*

   defines `DIGIT' to be a regular expression which matches a single
digit, and `ID' to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.  A subsequent reference to


         {DIGIT}+"."{DIGIT}*

   is identical to


         ([0-9])+"."([0-9])*

   and matches one-or-more digits followed by a `.' followed by
zero-or-more digits.
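
   To make the expanded pattern concrete, here is a hand-written C
sketch of a matcher for `([0-9])+"."([0-9])*'.  The function name
`match_float' is hypothetical, and a real `flex' scanner uses generated
tables rather than code of this shape.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the length of the longest prefix of `s' that matches
   ([0-9])+"."([0-9])*, or 0 if no prefix matches. */
size_t match_float( const char *s )
    {
    size_t i = 0;

    assert( s != NULL );

    while ( isdigit( (unsigned char) s[i] ) )   /* ([0-9])+ */
        ++i;
    if ( i == 0 || s[i] != '.' )
        return 0;                               /* no digits, or no "." */
    ++i;                                        /* "." */
    while ( isdigit( (unsigned char) s[i] ) )   /* ([0-9])* */
        ++i;

    return i;
    }
```

   Note that `42.' matches (zero digits may follow the `.'), while `.5'
and `42' do not.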

   An unindented comment (i.e., a line beginning with `/*') is copied
verbatim to the output up to the next `*/'.

   Any _indented_ text or text enclosed in `%{' and `%}' is also copied
verbatim to the output (with the %{ and %} symbols removed).  The %{
and %} symbols must appear unindented on lines by themselves.

   A `%top' block is similar to a `%{' ... `%}' block, except that the
code in a `%top' block is relocated to the _top_ of the generated file,
before any flex definitions (1).  The `%top' block is useful when you
want certain preprocessor macros to be defined or certain files to be
included before the generated code.  The single characters `{' and `}'
are used to delimit the `%top' block, as shown in the example below:


         %top{
             /* This code goes at the "top" of the generated file. */
             #include <stdint.h>
             #include <inttypes.h>
         }

   Multiple `%top' blocks are allowed, and their order is preserved.

   ---------- Footnotes ----------

   (1) Actually, `yyIN_HEADER' is defined before the `%top' block.


File: flex.info,  Node: Rules Section,  Next: User Code Section,  Prev: Definitions Section,  Up: Format

Format of the Rules Section
===========================

   The "rules" section of the `flex' input contains a series of rules
of the form:


         pattern   action

   where the pattern must be unindented and the action must begin on
the same line.  *Note Patterns::, for a further description of patterns
and actions.

   In the rules section, any indented or %{ %} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered.  Other indented or
%{ %} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for POSIX compliance. *Note Lex and Posix::,
for other such features).

   Any _indented_ text or text enclosed in `%{' and `%}' is copied
verbatim to the output (with the %{ and %} symbols removed).  The %{
and %} symbols must appear unindented on lines by themselves.


File: flex.info,  Node: User Code Section,  Next: Comments in the Input,  Prev: Rules Section,  Up: Format

Format of the User Code Section
===============================

   The user code section is simply copied to `lex.yy.c' verbatim.  It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
`%%' in the input file may be skipped, too.


File: flex.info,  Node: Comments in the Input,  Prev: User Code Section,  Up: Format

Comments in the Input
=====================

   Flex supports C-style comments, that is, anything between /* and */
is considered a comment. Whenever flex encounters a comment, it copies
the entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:

   * Comments may not appear in the Rules Section wherever flex is
     expecting a regular expression. This means comments may not appear
     at the beginning of a line, or immediately following a list of
     scanner states.

   * Comments may not appear on an `%option' line in the Definitions
     Section.

   If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
`/*'.  This rule will work anywhere in the input file.

   All the comments in the following example are valid:


     %{
     /* code block */
     %}
     
     /* Definitions Section */
     %x STATE_X
     
     %%
         /* Rules Section */
     ruleA   /* after regex */ { /* code block */ } /* after code block */
             /* Rules Section (indented) */
     <STATE_X>{
     ruleC   ECHO;
     ruleD   ECHO;
     %{
     /* code block */
     %}
     }
     %%
     /* User Code Section */


File: flex.info,  Node: Patterns,  Next: Matching,  Prev: Format,  Up: Top

Patterns
********

   The patterns in the input (see *Note Rules Section::) are written
using an extended set of regular expressions.  These are:

`x'
     match the character 'x'

`.'
     any character (byte) except newline

`[xyz]'
     a "character class"; in this case, the pattern matches either an
     'x', a 'y', or a 'z'

`[abj-oZ]'
     a "character class" with a range in it; matches an 'a', a 'b', any
     letter from 'j' through 'o', or a 'Z'

`[^A-Z]'
     a "negated character class", i.e., any character but those in the
     class.  In this case, any character EXCEPT an uppercase letter.

`[^A-Z\n]'
     any character EXCEPT an uppercase letter or a newline

`r*'
     zero or more r's, where r is any regular expression

`r+'
     one or more r's

`r?'
     zero or one r's (that is, "an optional r")

`r{2,5}'
     anywhere from two to five r's

`r{2,}'
     two or more r's

`r{4}'
     exactly 4 r's

`{name}'
     the expansion of the `name' definition (*note Format::).

`"[xyz]\"foo"'
     the literal string: `[xyz]"foo'

`\X'
     if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
     interpretation of `\x'.  Otherwise, a literal `X' (used to escape
     operators such as `*')

`\0'
     a NUL character (ASCII code 0)

`\123'
     the character with octal value 123

`\x2a'
     the character with hexadecimal value 2a

`(r)'
     match an `r'; parentheses are used to override precedence (see
     below)

`rs'
     the regular expression `r' followed by the regular expression `s';
     called "concatenation"

`r|s'
     either an `r' or an `s'

`r/s'
     an `r' but only if it is followed by an `s'.  The text matched by
     `s' is included when determining whether this rule is the longest
     match, but is then returned to the input before the action is
     executed.  So the action only sees the text matched by `r'.  This
     type of pattern is called "trailing context".  (There are some
     combinations of `r/s' that flex cannot match correctly. *Note
     Limitations::, regarding dangerous trailing context.)

`^r'
     an `r', but only at the beginning of a line (i.e., when just
     starting to scan, or right after a newline has been scanned).

`r$'
     an `r', but only at the end of a line (i.e., just before a
     newline).  Equivalent to `r/\n'.

     Note that `flex''s notion of "newline" is exactly whatever the C
     compiler used to compile `flex' interprets `\n' as; in particular,
     on some DOS systems you must either filter out `\r's in the input
     yourself, or explicitly use `r/\r\n' for `r$'.

`<s>r'
     an `r', but only in start condition `s' (see *Note Start
     Conditions:: for discussion of start conditions).

`<s1,s2,s3>r'
     same, but in any of start conditions `s1', `s2', or `s3'.

`<*>r'
     an `r' in any start condition, even an exclusive one.

`<<EOF>>'
     an end-of-file.

`<s1,s2><<EOF>>'
     an end-of-file when in start condition `s1' or `s2'
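
   The trailing context operator `r/s' described above can be sketched
in C for the restricted case where both `r' and `s' are literal
strings.  The function name `match_trailing' is hypothetical, and this
is only a model of the behavior, not of how `flex' implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Literal-only model of r/s: return the number of characters consumed
   (the length of `r') if `input' starts with `r' immediately followed
   by `s', and 0 otherwise.  The text of `s' is checked but not
   consumed, so it remains available to the next match. */
size_t match_trailing( const char *input, const char *r, const char *s )
    {
    size_t rlen, slen;

    assert( input != NULL && r != NULL && s != NULL );
    rlen = strlen( r );
    slen = strlen( s );

    if ( strncmp( input, r, rlen ) != 0 )
        return 0;                   /* `r' itself does not match */
    if ( strncmp( input + rlen, s, slen ) != 0 )
        return 0;                   /* `r' is not followed by `s' */

    return rlen;                    /* the action sees only `r' */
    }
```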

   Note that inside of a character class, all regular expression
operators lose their special meaning except escape (`\') and the
character class operators, `-', `]', and, at the beginning of the
class, `^'.

   The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, `{}', under the documentation for
the `--posix' POSIX compliance option).  For example,


         foo|bar*

   is the same as


         (foo)|(ba(r*))

   since the `*' operator has higher precedence than concatenation, and
concatenation higher than alternation (`|').  This pattern therefore
matches _either_ the string `foo' _or_ the string `ba' followed by
zero-or-more `r''s.  To match `foo' or zero-or-more repetitions of the
string `bar', use:


         foo|(bar)*

   And to match a sequence of zero or more repetitions of `foo' and
`bar':


         (foo|bar)*
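
   The difference between these three patterns can be checked with the
POSIX `regcomp()' and `regexec()' functions.  This is an assumption-
laden sketch: POSIX extended regular expressions are not the `flex'
pattern language, but for these three patterns the precedence of `*',
concatenation, and `|' is the same.  The helper `matches_whole' is a
hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if all of `text' matches the POSIX extended regular
   expression `pattern', 0 if it does not, and -1 if the pattern is
   too long or fails to compile. */
int matches_whole( const char *pattern, const char *text )
    {
    char anchored[256];
    regex_t re;
    int n, rc;

    assert( pattern != NULL && text != NULL );

    /* anchor the pattern so that it must cover all of `text' */
    n = snprintf( anchored, sizeof anchored, "^(%s)$", pattern );
    if ( n < 0 || (size_t) n >= sizeof anchored )
        return -1;

    if ( regcomp( &re, anchored, REG_EXTENDED | REG_NOSUB ) != 0 )
        return -1;
    rc = regexec( &re, text, 0, NULL, 0 );
    regfree( &re );

    return rc == 0;
    }
```

   With this helper, `foo|bar*' accepts `barrr' but not `barbar',
`foo|(bar)*' accepts `barbar' but not `barrr', and `(foo|bar)*' accepts
`foobarfoo'.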

   In addition to characters and ranges of characters, character classes
can also contain "character class expressions".  These are expressions
enclosed inside `[:' and `:]' delimiters (which themselves must appear
between the `[' and `]' of the character class; other elements may
occur inside the character class, too).  The valid expressions are:


         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

   These expressions all designate a set of characters equivalent to the
corresponding standard C `isXXX' function.  For example, `[:alnum:]'
designates those characters for which `isalnum()' returns true - i.e.,
any alphabetic or numeric character.  Some systems don't provide
`isblank()', so flex defines `[:blank:]' as a blank or a tab.

   For example, the following character classes are all equivalent:


         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:][0-9]]
         [a-zA-Z0-9]
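
   The equivalence can be checked in C with the corresponding
`<ctype.h>' functions.  The sketch below assumes the default "C" locale
and an ASCII character set, and only examines 7-bit character codes;
all of the function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* [[:alnum:]] */
int in_alnum( int c )
    {
    /* the <ctype.h> functions require EOF or an unsigned char value */
    assert( c == EOF || ( c >= 0 && c <= UCHAR_MAX ) );
    return isalnum( c ) != 0;
    }

/* [[:alpha:][:digit:]] */
int in_alpha_or_digit( int c )
    {
    return isalpha( c ) != 0 || isdigit( c ) != 0;
    }

/* [a-zA-Z0-9], written out as explicit ranges (assumes ASCII) */
int in_explicit_ranges( int c )
    {
    return ( c >= 'a' && c <= 'z' )
        || ( c >= 'A' && c <= 'Z' )
        || ( c >= '0' && c <= '9' );
    }

/* Return 1 if the three predicates agree on every 7-bit code. */
int classes_agree( void )
    {
    int c;

    for ( c = 0; c < 128; ++c )
        if ( in_alnum( c ) != in_alpha_or_digit( c )
             || in_alnum( c ) != in_explicit_ranges( c ) )
            return 0;

    return 1;
    }
```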

   Some notes on patterns are in order.

   * If your scanner is case-insensitive (the `-i' flag), then
     `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.

   * Character classes with ranges, such as `[a-Z]', should be used with
     caution in a case-insensitive scanner if the range spans upper or
     lowercase characters. Flex does not know if you want to fold all
     upper and lowercase characters together, or if you want the
     literal numeric range specified (with no case folding). When in
     doubt, flex will assume that you meant the literal numeric range,
     and will issue a warning. The exception to this rule is a
     character range such as `[a-z]' or `[S-W]' where it is obvious
     that you want case-folding to occur. Here are some examples with
     the `-i' flag enabled:

     Range        Result      Literal Range        Alternate Range
     `[a-t]'      ok          `[a-tA-T]'           
     `[A-T]'      ok          `[a-tA-T]'           
     `[A-t]'      ambiguous   `[A-Z\[\\\]_`a-t]'   `[a-tA-T]'
     `[_-{]'      ambiguous   `[_`a-z{]'           `[_`a-zA-Z{]'
     `[@-C]'      ambiguous   `[@ABC]'             `[@A-Z\[\\\]_`abc]'

   * A negated character class such as the example `[^A-Z]' above
     _will_ match a newline unless `\n' (or an equivalent escape
     sequence) is one of the characters explicitly present in the
     negated character class (e.g., `[^A-Z\n]').  This is unlike how
     many other regular expression tools treat negated character
     classes, but unfortunately the inconsistency is historically
     entrenched.  Matching newlines means that a pattern like `[^"]*'
     can match the entire input unless there's another quote in the
     input.

   * A rule can have at most one instance of trailing context (the `/'
     operator or the `$' operator).  The start condition, `^', and
     `<<EOF>>' patterns can only occur at the beginning of a pattern,
     and, as well as with `/' and `$', cannot be grouped inside
     parentheses.  A `^' which does not occur at the beginning of a
     rule or a `$' which does not occur at the end of a rule loses its
     special properties and is treated as a normal character.

   * The following are invalid:


              foo/bar$
              <sc1>foo<sc2>bar

     Note that the first of these can be written `foo/bar\n'.

   * The following will result in `$' or `^' being treated as a normal
     character:


              foo|(bar$)
              foo|^bar

     If the desired meaning is a `foo' or a
     `bar'-followed-by-a-newline, the following could be used (the
     special `|' action is explained below, *note Actions::):


              foo      |
              bar$     /* action goes here */

     A similar trick will work for matching a `foo' or a
     `bar'-at-the-beginning-of-a-line.
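
   The note above about negated character classes matching newlines
can be illustrated with the standard `strcspn()' function, which
behaves like a negated class applied at the start of a string.  The two
helper names are hypothetical, and the sketch only models the patterns
`[^"]*' and `[^"\n]*'.

```c
#include <assert.h>
#include <string.h>

/* Length of the text that [^"]* would match at the start of `s'.
   A newline is not a quote, so the match runs across line ends. */
size_t span_not_quote( const char *s )
    {
    assert( s != NULL );
    return strcspn( s, "\"" );
    }

/* Length of the text that [^"\n]* would match at the start of `s'.
   Listing \n in the class makes the match stop at the end of a line. */
size_t span_not_quote_or_newline( const char *s )
    {
    assert( s != NULL );
    return strcspn( s, "\"\n" );
    }
```

   When the input contains no further quote, `span_not_quote()' returns
the length of the whole remaining string, which corresponds to the
pattern swallowing the rest of the input.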


File: flex.info,  Node: Matching,  Next: Actions,  Prev: Patterns,  Up: Top

How the Input Is Matched
************************

   When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns.  If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input).  If it finds two or more matches of
the same length, the rule listed first in the `flex' input file is
chosen.
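
   The selection rule can be sketched in C for the simplified case in
which every pattern is a literal string.  This is a model of the rule
only: a generated scanner finds the match with a table-driven automaton
and does not try the rules one by one.  The name `pick_rule' and its
parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each "rule" is a literal string, listed in input-file order.
   Return the index of the rule chosen for the start of `input', or -1
   if none matches; the length of the match is stored in `*len'. */
int pick_rule( const char *const rules[], size_t nrules,
               const char *input, size_t *len )
    {
    int best = -1;
    size_t best_len = 0;
    size_t i;

    assert( rules != NULL && input != NULL && len != NULL );

    for ( i = 0; i < nrules; ++i )
        {
        size_t n = strlen( rules[i] );

        if ( n == 0 || strncmp( input, rules[i], n ) != 0 )
            continue;           /* this rule does not match here */

        /* Only a strictly longer match replaces the current choice,
           so among matches of equal length the earlier rule wins. */
        if ( n > best_len )
            {
            best = (int) i;
            best_len = n;
            }
        }

    *len = best_len;
    return best;
    }
```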

   Once the match is determined, the text corresponding to the match
(called the "token") is made available in the global character pointer
`yytext', and its length in the global integer `yyleng'.  The "action"
corresponding to the matched pattern is then executed (*note
Actions::), and then the remaining input is scanned for another match.

   If no match is found, then the "default rule" is executed: the next
character in the input is considered matched and copied to the standard
output.  Thus, the simplest valid `flex' input is:


         %%

   which generates a scanner that simply copies its input (one
character at a time) to its output.

   Note that `yytext' can be defined in two different ways: either as a
character _pointer_ or as a character _array_. You can control which
definition `flex' uses by including one of the special directives
`%pointer' or `%array' in the first (definitions) section of your flex
input.  The default is `%pointer', unless you use the `-l' lex
compatibility option, in which case `yytext' will be an array.  The
advantage of using `%pointer' is substantially faster scanning and no
buffer overflow when matching very large tokens (unless you run out of
dynamic memory).  The disadvantage is that you are restricted in how
your actions can modify `yytext' (*note Actions::), and calls to the
`unput()' function destroy the present contents of `yytext', which can
be a considerable porting headache when moving between different `lex'
versions.

   The advantage of `%array' is that you can then modify `yytext' to
your heart's content, and calls to `unput()' do not destroy `yytext'
(*note Actions::).  Furthermore, existing `lex' programs sometimes
access `yytext' externally using declarations of the form:


         extern char yytext[];

   This definition is erroneous when used with `%pointer', but correct
for `%array'.

   The `%array' declaration defines `yytext' to be an array of `YYLMAX'
characters, which defaults to a fairly large value.  You can change the
size by simply #define'ing `YYLMAX' to a different value in the first
section of your `flex' input.  As mentioned above, with `%pointer'
yytext grows dynamically to accommodate large tokens.  While this means
your `%pointer' scanner can accommodate very large tokens (such as
matching entire blocks of comments), bear in mind that each time the
scanner must resize `yytext' it also must rescan the entire token from
the beginning, so matching such tokens can prove slow.  `yytext'
presently does _not_ dynamically grow if a call to `unput()' results in
too much text being pushed back; instead, a run-time error results.

   Also note that you cannot use `%array' with C++ scanner classes
(*note Cxx::).


File: flex.info,  Node: Actions,  Next: Generated Scanner,  Prev: Matching,  Up: Top

Actions
*******

   Each pattern in a rule has a corresponding "action", which can be
any arbitrary C statement.  The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action.  If the
action is empty, then when the pattern is matched the input token is
simply discarded.  For example, here is the specification for a program
which deletes all occurrences of `zap me' from its input:


         %%
         "zap me"

   This example will copy all other characters in the input to the
output since they will be matched by the default rule.

   Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:


         %%
         [ \t]+        putchar( ' ' );
         [ \t]+$       /* ignore this token */
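
   A hand-written C equivalent of these two rules might look like the
sketch below.  It assumes, as described under Patterns, that `r$'
behaves like `r/\n', so a run of blanks at the very end of the input
with no newline after it is compressed rather than dropped.  The name
`squeeze_blanks' is hypothetical, and the function edits a string in
place instead of reading `yyin'.

```c
#include <assert.h>
#include <stddef.h>

/* Rewrite `s' in place: each run of blanks and tabs becomes a single
   blank, except that a run immediately before a newline is removed.
   Returns the new length of the string. */
size_t squeeze_blanks( char *s )
    {
    size_t r = 0;   /* read position */
    size_t w = 0;   /* write position, never ahead of `r' */

    assert( s != NULL );

    while ( s[r] != '\0' )
        {
        if ( s[r] == ' ' || s[r] == '\t' )
            {
            while ( s[r] == ' ' || s[r] == '\t' )   /* [ \t]+ */
                ++r;
            if ( s[r] != '\n' )
                s[w++] = ' ';   /* first rule: emit one blank */
            /* otherwise [ \t]+$ matched: discard the run */
            }
        else
            s[w++] = s[r++];    /* default rule: copy the character */
        }

    s[w] = '\0';
    return w;
    }
```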

   If the action contains a `{', then the action spans till the
balancing `}' is found, and the action may cross multiple lines.
`flex' knows about C strings and comments and won't be fooled by braces
found within them, but also allows actions to begin with `%{' and will
consider the action to be all the text up to the next `%}' (regardless
of ordinary braces inside the action).

   An action consisting solely of a vertical bar (`|') means "same as
the action for the next rule".  See below for an illustration.

   Actions can include arbitrary C code, including `return' statements
to return a value to whatever routine called `yylex()'.  Each time
`yylex()' is called it continues processing tokens from where it last
left off until it either reaches the end of the file or executes a
return.

   Actions are free to modify `yytext' except for lengthening it
(adding characters to its end would overwrite later characters in
the input stream).  This restriction does not apply when using `%array'
(*note Matching::).  In that case, `yytext' may be freely modified in
any way.

   Actions are free to modify `yyleng' except they should not do so if
the action also includes use of `yymore()' (see below).

   There are a number of special directives which can be included
within an action:

`ECHO'
     copies `yytext' to the scanner's output.

`BEGIN'
     followed by the name of a start condition places the scanner in the
     corresponding start condition (see below).

`REJECT'
     directs the scanner to proceed on to the "second best" rule which
     matched the input (or a prefix of the input).  The rule is chosen
     as described above in *Note Matching::, and `yytext' and `yyleng'
     set up appropriately.  It may either be one which matched as much
     text as the originally chosen rule but came later in the `flex'
     input file, or one which matched less text.  For example, the
     following will both count the words in the input and call the
     routine `special()' whenever `frob' is seen:


                      int word_count = 0;
              %%
          
              frob        special(); REJECT;
              [^ \t\n]+   ++word_count;

     Without the `REJECT', any occurrences of `frob' in the input would
     not be counted as words, since the scanner normally executes only
     one action per token.  Multiple uses of `REJECT' are allowed, each
     one finding the next best choice to the currently active rule.  For
     example, when the following scanner scans the token `abcd', it will
     write `abcdabcaba' to the output:


              %%
              a        |
              ab       |
              abc      |
              abcd     ECHO; REJECT;
              .|\n     /* eat up any unmatched character */

     The first three rules share the fourth's action since they use the
     special `|' action.

     `REJECT' is a particularly expensive feature in terms of scanner
     performance; if it is used in _any_ of the scanner's actions it
     will slow down _all_ of the scanner's matching.  Furthermore,
     `REJECT' cannot be used with the `-Cf' or `-CF' options (*note
     Scanner Options::).

     Note also that unlike the other special actions, `REJECT' is a
     _branch_; code immediately following it in the action will _not_
     be executed.

`yymore()'
     tells the scanner that the next time it matches a rule, the
     corresponding token should be _appended_ onto the current value of
     `yytext' rather than replacing it.  For example, given the input
     `mega-kludge' the following will write `mega-mega-kludge' to the
     output:


              %%
              mega-    ECHO; yymore();
              kludge   ECHO;

     First `mega-' is matched and echoed to the output.  Then `kludge'
     is matched, but the previous `mega-' is still hanging around at the
     beginning of `yytext' so the `ECHO' for the `kludge' rule will
     actually write `mega-kludge'.

   Two notes regarding use of `yymore()'.  First, `yymore()' depends on
the value of `yyleng' correctly reflecting the size of the current
token, so you must not modify `yyleng' if you are using `yymore()'.
Second, the presence of `yymore()' in the scanner's action entails a
minor performance penalty in the scanner's matching speed.

   `yyless(n)' returns all but the first `n' characters of the current
token back to the input stream, where they will be rescanned when the
scanner looks for the next match.  `yytext' and `yyleng' are adjusted
appropriately (e.g., `yyleng' will now be equal to `n').  For example,
on the input `foobar' the following will write out `foobarbar':


         %%
         foobar    ECHO; yyless(3);
         [a-z]+    ECHO;

   An argument of 0 to `yyless()' will cause the entire current input
string to be scanned again.  Unless you've changed how the scanner will
subsequently process its input (using `BEGIN', for example), this will
result in an endless loop.

   Note that `yyless()' is a macro and can only be used in the flex
input file, not from other source files.

   `unput(c)' puts the character `c' back onto the input stream.  It
will be the next character scanned.  The following action will take the
current token and cause it to be rescanned enclosed in parentheses:


         {
         int i;
         /* Copy yytext because unput() trashes yytext */
         char *yycopy = strdup( yytext );
         unput( ')' );
         for ( i = yyleng - 1; i >= 0; --i )
             unput( yycopy[i] );
         unput( '(' );
         free( yycopy );
         }

   Note that since each `unput()' puts the given character back at the
_beginning_ of the input stream, pushing back strings must be done
back-to-front.
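
   As a rough illustration of why the order matters, the following
sketch models the pushed-back input as a stack in plain C.  The names
`model_unput', `model_input', and `model_unput_string' are made up for
this example and are not part of `flex'; the array is only a toy
stand-in for the scanner's real buffer.  In an actual action, the call
to `model_unput()' would be a call to `unput()'.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model of a pushed-back input stream (not flex's real buffer):
   the most recently pushed character is the next one read. */
static char pushback[64];
static size_t depth = 0;

/* Put one character back at the beginning of the modelled input. */
static void model_unput(int c)
{
    assert(depth < sizeof pushback);
    pushback[depth++] = (char)c;
}

/* Read the next character of the modelled input, or EOF if none is left. */
static int model_input(void)
{
    return depth > 0 ? (unsigned char)pushback[--depth] : EOF;
}

/* Push a whole string back so that it is re-read in its original order:
   the last character goes first, the first character goes last. */
static void model_unput_string(const char *s)
{
    size_t i;

    for (i = strlen(s); i > 0; --i)
        model_unput(s[i - 1]);
}
```

   After `model_unput_string( "(ab)" )', successive calls to
`model_input()' yield `(', `a', `b', and `)', in that order.  Pushing
the characters front-to-back instead would reverse the string.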

   An important potential problem when using `unput()' is that if you
are using `%pointer' (the default), a call to `unput()' _destroys_ the
contents of `yytext', starting with its rightmost character and
devouring one character to the left with each call.  If you need the
value of `yytext' preserved after a call to `unput()' (as in the above
example), you must either first copy it elsewhere, or build your
scanner using `%array' instead (*note Matching::).

   Finally, note that you cannot put back `EOF' to attempt to mark the
input stream with an end-of-file.

   `input()' reads the next character from the input stream.  For
example, the following is one way to eat up C comments:


         %%
         "/*"        {
                     register int c;
     
                     for ( ; ; )
                         {
                         while ( (c = input()) != '*' &&
                                 c != EOF )
                             ;    /* eat up text of comment */
     
                         if ( c == '*' )
                             {
                             while ( (c = input()) == '*' )
                                 ;
                             if ( c == '/' )
                                 break;    /* found the end */
                             }
     
                         if ( c == EOF )
                             {
                             error( "EOF in comment" );
                             break;
                             }
                         }
                     }

   (Note that if the scanner is compiled using `C++', then `input()' is
instead referred to as `yyinput()', in order to avoid a name clash with
the `C++' stream by the name of `input'.)

   `YY_FLUSH_BUFFER()' flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using `YY_INPUT()' (*note Generated Scanner::).  This
action is a special case of the more general `yy_flush_buffer()'
function, described below (*note Multiple Input Buffers::).

   `yyterminate()' can be used in lieu of a return statement in an
action.  It terminates the scanner and returns a 0 to the scanner's
caller, indicating "all done".  By default, `yyterminate()' is also
called when an end-of-file is encountered.  It is a macro and may be
redefined.


File: flex.info,  Node: Generated Scanner,  Next: Start Conditions,  Prev: Actions,  Up: Top

The Generated Scanner
*********************

   The output of `flex' is the file `lex.yy.c', which contains the
scanning routine `yylex()', a number of tables used by it for matching
tokens, and a number of auxiliary routines and macros.  By default,
`yylex()' is declared as follows:


         int yylex()
             {
             ... various definitions and the actions in here ...
             }

   (If your environment supports function prototypes, then it will be
`int yylex( void )'.)  This definition may be changed by defining the
`YY_DECL' macro.  For example, you could use:


         #define YY_DECL float lexscan( a, b ) float a, b;

   to give the scanning routine the name `lexscan', returning a float,
and taking two floats as arguments.  Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (;).
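
   With a prototyped definition, the same routine might be declared as
in the sketch below.  No trailing semi-colon is needed here, since there
are no separate parameter declarations to terminate.  The function body
shown is only a placeholder so that the fragment compiles without
`flex'; in a real scanner, `flex' supplies the body.

```c
/* Prototyped counterpart of the K&R-style definition above.
   No trailing semicolon: there are no parameter declarations to end. */
#define YY_DECL float lexscan(float a, float b)

/* flex normally emits the function body right after YY_DECL.  The body
   below is only a stand-in so the sketch compiles without flex. */
YY_DECL
{
    return a + b;
}
```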

   `flex' generates `C99' function definitions by default.  However,
`flex' can also generate obsolete, er, `traditional', function
definitions.  This is to support bootstrapping gcc on old systems.
Unfortunately, traditional definitions prevent us from using any
standard data types smaller than int (such as short, char, or bool) as
function arguments.  For this reason, future versions of `flex' may
generate standard C99 code only, leaving K&R-style functions to the
historians.  Currently, if you do _not_ want `C99' definitions, then
you must use `%option noansi-definitions'.

   Whenever `yylex()' is called, it scans tokens from the global input
file `yyin' (which defaults to stdin).  It continues until it either
reaches an end-of-file (at which point it returns the value 0) or one
of its actions executes a `return' statement.

   If the scanner reaches an end-of-file, later calls have undefined
behavior unless either `yyin' is pointed at a new input file (in which
case scanning continues from that file), or `yyrestart()' is called.
`yyrestart()' takes one argument, a `FILE *' pointer (which can be
NULL, if you've set up `YY_INPUT' to scan from a source other than
`yyin'), and initializes `yyin' for scanning from that file.
Essentially there is no difference between just assigning `yyin' to a
new input file or using `yyrestart()' to do so; the latter is available
for compatibility with previous versions of `flex', and because it can
be used to switch input files in the middle of scanning.  It can also
be used to throw away the current input buffer, by calling it with an
argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER'
(*note Actions::).  Note that `yyrestart()' does _not_ reset the start
condition to `INITIAL' (*note Start Conditions::).

   If `yylex()' stops scanning due to executing a `return' statement in
one of the actions, the scanner may then be called again and it will
resume scanning where it left off.

   By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple `getc()' calls to read characters from
`yyin'.  The nature of how it gets its input can be controlled by
defining the `YY_INPUT' macro.  The calling sequence for `YY_INPUT()'
is `YY_INPUT(buf,result,max_size)'.  Its action is to place up to
`max_size' characters in the character array `buf' and return in the
integer variable `result' either the number of characters read or the
constant `YY_NULL' (0 on Unix systems) to indicate `EOF'.  The default
`YY_INPUT' reads from the global file-pointer `yyin'.

   Here is a sample definition of `YY_INPUT' (in the definitions
section of the input file):


         %{
         #define YY_INPUT(buf,result,max_size) \
             { \
             int c = getchar(); \
             result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
             }
         %}

   This definition will change the input processing to occur one
character at a time.
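
   As a further sketch, `YY_INPUT' could instead hand the scanner
blocks of text taken from a string held in memory.  The helper
`read_from_string()' and the variables `src_text' and `src_pos' are
hypothetical names chosen for this example; they are not provided by
`flex'.  The macro would go in the definitions section, as above.  This
is meant only to illustrate the `YY_INPUT' calling sequence; the manual
describes dedicated facilities for scanning strings elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source text and read position; neither is part of flex. */
static const char *src_text = "zap me now\n";
static size_t src_pos = 0;

/* Copy up to max_size bytes of the remaining text into buf.
   Returns the number of bytes copied, or 0 once the text is used up. */
static int read_from_string(char *buf, int max_size)
{
    size_t left, n;

    assert(buf != NULL && max_size > 0);
    left = strlen(src_text) - src_pos;
    n = left < (size_t)max_size ? left : (size_t)max_size;
    memcpy(buf, src_text + src_pos, n);
    src_pos += n;
    return (int)n;
}

/* Report the number of characters placed in buf, or YY_NULL at the end
   of the text.  YY_NULL is defined by the generated scanner. */
#define YY_INPUT(buf,result,max_size) \
    { \
    int n = read_from_string(buf, max_size); \
    result = (n > 0) ? n : YY_NULL; \
    }
```

   Because the helper fills as much of `buf' as it can on each call,
the scanner keeps the benefit of block reads.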

   When the scanner receives an end-of-file indication from `YY_INPUT',
it then checks the `yywrap()' function.  If `yywrap()' returns false
(zero), then it is assumed that the function has gone ahead and set up
`yyin' to point to another input file, and scanning continues.  If it
returns true (non-zero), then the scanner terminates, returning 0 to
its caller.  Note that in either case, the start condition remains
unchanged; it does _not_ revert to `INITIAL'.

   If you do not supply your own version of `yywrap()', then you must
either use `%option noyywrap' (in which case the scanner behaves as
though `yywrap()' returned 1), or you must link with `-lfl' to obtain
the default version of the routine, which always returns 1.
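
   A user-supplied `yywrap()' that moves on to further input files
might be built on a helper along the following lines.  This is only a
sketch: `advance_input()' and `pending_files' are hypothetical names,
and the helper takes a pointer to the `FILE *' variable so that it can
be compiled and tried without a generated scanner.  Closing the
previous file here is a choice made for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical NULL-terminated list of further inputs, set up by the
   caller before scanning starts. */
static const char *const *pending_files = NULL;

/* Point *in at the next file that can be opened.  Returns 0 if one was
   found (keep scanning) or 1 if the list is exhausted (stop), which is
   the return convention of yywrap(). */
static int advance_input(FILE **in)
{
    assert(in != NULL);
    while (pending_files != NULL && *pending_files != NULL) {
        FILE *next = fopen(*pending_files++, "r");

        if (next != NULL) {
            if (*in != NULL && *in != stdin)
                fclose(*in);        /* done with the previous file */
            *in = next;
            return 0;
        }
    }
    return 1;
}

/* In the user code section of the scanner, one might then write:
       int yywrap(void) { return advance_input(&yyin); }             */
```

   Files that cannot be opened are skipped silently in this sketch; a
real scanner would probably want to report them.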

   For scanning from in-memory buffers (e.g., scanning strings), see
*Note Scanning Strings::, and *Note Multiple Input Buffers::.

   The scanner writes its `ECHO' output to the `yyout' global (by
default, `stdout'), which may be redefined by the user simply by
assigning it to some other `FILE' pointer.