This is flex.info, produced by makeinfo version 4.5 from flex.texi. INFO-DIR-SECTION Programming START-INFO-DIR-ENTRY * flex: (flex). Fast lexical analyzer generator (lex replacement). END-INFO-DIR-ENTRY The flex manual is placed under the same licensing conditions as the rest of flex: Copyright (C) 1990, 1997 The Regents of the University of California. All rights reserved. This code is derived from software contributed to Berkeley by Vern Paxson. The United States Government has rights in this work pursuant to contract no. DE-AC03-76SF00098 between the United States Department of Energy and the University of California. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. File: flex.info, Node: Top, Next: Copyright, Prev: (dir), Up: (dir) flex **** This manual describes `flex', a tool for generating programs that perform pattern-matching on text. The manual includes both tutorial and reference sections. This edition of `The flex Manual' documents `flex' version 2.5.33. It was last updated on 20 February 2006. * Menu: * Copyright:: * Reporting Bugs:: * Introduction:: * Simple Examples:: * Format:: * Patterns:: * Matching:: * Actions:: * Generated Scanner:: * Start Conditions:: * Multiple Input Buffers:: * EOF:: * Misc Macros:: * User Values:: * Yacc:: * Scanner Options:: * Performance:: * Cxx:: * Reentrant:: * Lex and Posix:: * Memory Management:: * Serialized Tables:: * Diagnostics:: * Limitations:: * Bibliography:: * FAQ:: * Appendices:: * Indices:: --- The Detailed Node Listing --- Format of the Input File * Definitions Section:: * Rules Section:: * User Code Section:: * Comments in the Input:: Scanner Options * Options for Specifing Filenames:: * Options Affecting Scanner Behavior:: * Code-Level And API Options:: * Options for Scanner Speed and Size:: * Debugging Options:: * Miscellaneous Options:: Reentrant C Scanners * Reentrant Uses:: * Reentrant Overview:: * Reentrant Example:: * Reentrant Detail:: * Reentrant Functions:: The Reentrant API in Detail * Specify Reentrant:: * Extra Reentrant Argument:: * Global Replacement:: * Init and Destroy Functions:: * Accessor Methods:: * Extra Data:: * About yyscan_t:: Memory Management * The Default Memory Management:: * Overriding The Default Memory Management:: * A Note About yytext And Memory:: Serialized Tables * Creating Serialized Tables:: * Loading and Unloading Serialized Tables:: * Tables File Format:: FAQ * When was flex born?:: * How do I expand \ escape sequences in C-style quoted strings?:: * Why do flex scanners call fileno if it is not ANSI compatible?:: * Does flex support recursive pattern definitions?:: * How do I skip huge chunks of input (tens of megabytes) while using flex?:: * Flex is not matching my patterns in the same order that I defined them.:: * My actions are executing out of order or sometimes not at all.:: * How can I have multiple input sources feed into the same scanner at the same time?:: * Can I build nested parsers that work with the same input file?:: * How can I match text only at the end of a file?:: * How can I make REJECT cascade across start condition boundaries?:: * Why cant I use fast or full tables with interactive mode?:: * How much faster is -F or -f than -C?:: * If I have a simple grammar cant I just parse it with flex?:: * Why doesnt yyrestart() set the start state back to INITIAL?:: * How can I match C-style comments?:: * The period isnt working the way I expected.:: * Can I get the flex manual in another format?:: * Does there exist a "faster" NDFA->DFA algorithm?:: * How does flex compile the DFA so quickly?:: * How can I use more than 8192 rules?:: * How do I abandon a file in the middle of a scan and switch to a new file?:: * How do I execute code only during initialization (only before the first scan)?:: * How do I execute code at termination?:: * Where else can I find help?:: * Can I include comments in the "rules" section of the file?:: * I get an error about undefined yywrap().:: * How can I change the matching pattern at run time?:: * How can I expand macros in the input?:: * How can I build a two-pass scanner?:: * How do I match any string not matched in the preceding rules?:: * I am trying to port code from AT&T lex that uses yysptr and yysbuf.:: * Is there a way to make flex treat NULL like a regular character?:: * Whenever flex can not match the input it says "flex scanner jammed".:: * Why doesnt flex have non-greedy operators like perl does?:: * Memory leak - 16386 bytes allocated by malloc.:: * How do I track the byte offset for lseek()?:: * How do I use my own I/O classes in a C++ scanner?:: * How do I skip as many chars as possible?:: * deleteme00:: * Are certain equivalent patterns faster than others?:: * Is backing up a big deal?:: * Can I fake multi-byte character support?:: * deleteme01:: * Can you discuss some flex internals?:: * unput() messes up yy_at_bol:: * The | operator is not doing what I want:: * Why can't flex understand this variable trailing context pattern?:: * The ^ operator isn't working:: * Trailing context is getting confused with trailing optional patterns:: * Is flex GNU or not?:: * ERASEME53:: * I need to scan if-then-else blocks and while loops:: * ERASEME55:: * ERASEME56:: * ERASEME57:: * Is there a repository for flex scanners?:: * How can I conditionally compile or preprocess my flex input file?:: * Where can I find grammars for lex and yacc?:: * I get an end-of-buffer message for each character scanned.:: * unnamed-faq-62:: * unnamed-faq-63:: * unnamed-faq-64:: * unnamed-faq-65:: * unnamed-faq-66:: * unnamed-faq-67:: * unnamed-faq-68:: * unnamed-faq-69:: * unnamed-faq-70:: * unnamed-faq-71:: * unnamed-faq-72:: * unnamed-faq-73:: * unnamed-faq-74:: * unnamed-faq-75:: * unnamed-faq-76:: * unnamed-faq-77:: * unnamed-faq-78:: * unnamed-faq-79:: * unnamed-faq-80:: * unnamed-faq-81:: * unnamed-faq-82:: * unnamed-faq-83:: * unnamed-faq-84:: * unnamed-faq-85:: * unnamed-faq-86:: * unnamed-faq-87:: * unnamed-faq-88:: * unnamed-faq-90:: * unnamed-faq-91:: * unnamed-faq-92:: * unnamed-faq-93:: * unnamed-faq-94:: * unnamed-faq-95:: * unnamed-faq-96:: * unnamed-faq-97:: * unnamed-faq-98:: * unnamed-faq-99:: * unnamed-faq-100:: * unnamed-faq-101:: * What is the difference between YYLEX_PARAM and YY_DECL?:: * Why do I get "conflicting types for yylex" error?:: * How do I access the values set in a Flex action from within a Bison action?:: Appendices * Makefiles and Flex:: * Bison Bridge:: * M4 Dependency:: Indices * Concept Index:: * Index of Functions and Macros:: * Index of Variables:: * Index of Data Types:: * Index of Hooks:: * Index of Scanner Options:: File: flex.info, Node: Copyright, Next: Reporting Bugs, Prev: Top, Up: Top Copyright ********* The flex manual is placed under the same licensing conditions as the rest of flex: Copyright (C) 1990, 1997 The Regents of the University of California. All rights reserved. This code is derived from software contributed to Berkeley by Vern Paxson. The United States Government has rights in this work pursuant to contract no. DE-AC03-76SF00098 between the United States Department of Energy and the University of California. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of the University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. File: flex.info, Node: Reporting Bugs, Next: Introduction, Prev: Copyright, Up: Top Reporting Bugs ************** If you have problems with `flex' or think you have found a bug, please send mail detailing your problem to <flex-help@lists.sourceforge.net>. Patches are always welcome. File: flex.info, Node: Introduction, Next: Simple Examples, Prev: Reporting Bugs, Up: Top Introduction ************ `flex' is a tool for generating "scanners". A scanner is a program which recognizes lexical patterns in text. The `flex' program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called "rules". `flex' generates as output a C source file, `lex.yy.c' by default, which defines a routine `yylex()'. This file can be compiled and linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code. File: flex.info, Node: Simple Examples, Next: Format, Prev: Introduction, Up: Top Some Simple Examples ******************** First some simple examples to get the flavor of how one uses `flex'. The following `flex' input specifies a scanner which, when it encounters the string `username' will replace it with the user's login name: %% username printf( "%s", getlogin() ); By default, any text not matched by a `flex' scanner is copied to the output, so the net effect of this scanner is to copy its input file to its output with each occurrence of `username' expanded. In this input, there is just one rule. `username' is the "pattern" and the `printf' is the "action". The `%%' symbol marks the beginning of the rules. Here's another simple example: int num_lines = 0, num_chars = 0; %% \n ++num_lines; ++num_chars; . ++num_chars; %% main() { yylex(); printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars ); } This scanner counts the number of characters and the number of lines in its input. It produces no output other than the final report on the character and line counts. The first line declares two globals, `num_lines' and `num_chars', which are accessible both inside `yylex()' and in the `main()' routine declared after the second `%%'. There are two rules, one which matches a newline (`\n') and increments both the line count and the character count, and one which matches any character other than a newline (indicated by the `.' regular expression). A somewhat more complicated example: /* scanner for a toy Pascal-like language */ %{ /* need this for the call to atof() below */ #include math.h> %} DIGIT [0-9] ID [a-z][a-z0-9]* %% {DIGIT}+ { printf( "An integer: %s (%d)\n", yytext, atoi( yytext ) ); } {DIGIT}+"."{DIGIT}* { printf( "A float: %s (%g)\n", yytext, atof( yytext ) ); } if|then|begin|end|procedure|function { printf( "A keyword: %s\n", yytext ); } {ID} printf( "An identifier: %s\n", yytext ); "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext ); "{"[\^{}}\n]*"}" /* eat up one-line comments */ [ \t\n]+ /* eat up whitespace */ . printf( "Unrecognized character: %s\n", yytext ); %% main( argc, argv ) int argc; char **argv; { ++argv, --argc; /* skip over program name */ if ( argc > 0 ) yyin = fopen( argv[0], "r" ); else yyin = stdin; yylex(); } This is the beginnings of a simple scanner for a language like Pascal. It identifies different types of "tokens" and reports on what it has seen. The details of this example will be explained in the following sections. File: flex.info, Node: Format, Next: Patterns, Prev: Simple Examples, Up: Top Format of the Input File ************************ The `flex' input file consists of three sections, separated by a line containing only `%%'. definitions %% rules %% user code * Menu: * Definitions Section:: * Rules Section:: * User Code Section:: * Comments in the Input:: File: flex.info, Node: Definitions Section, Next: Rules Section, Prev: Format, Up: Format Format of the Definitions Section ================================= The "definitions section" contains declarations of simple "name" definitions to simplify the scanner specification, and declarations of "start conditions", which are explained in a later section. Name definitions have the form: name definition The `name' is a word beginning with a letter or an underscore (`_') followed by zero or more letters, digits, `_', or `-' (dash). The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line. The definition can subsequently be referred to using `{name}', which will expand to `(definition)'. For example, DIGIT [0-9] ID [a-z][a-z0-9]* Defines `DIGIT' to be a regular expression which matches a single digit, and `ID' to be a regular expression which matches a letter followed by zero-or-more letters-or-digits. A subsequent reference to {DIGIT}+"."{DIGIT}* is identical to ([0-9])+"."([0-9])* and matches one-or-more digits followed by a `.' followed by zero-or-more digits. An unindented comment (i.e., a line beginning with `/*') is copied verbatim to the output up to the next `*/'. Any _indented_ text or text enclosed in `%{' and `%}' is also copied verbatim to the output (with the %{ and %} symbols removed). The %{ and %} symbols must appear unindented on lines by themselves. A `%top' block is similar to a `%{' ... `%}' block, except that the code in a `%top' block is relocated to the _top_ of the generated file, before any flex definitions (1). The `%top' block is useful when you want certain preprocessor macros to be defined or certain files to be included before the generated code. The single characters, `{' and `}' are used to delimit the `%top' block, as show in the example below: %top{ /* This code goes at the "top" of the generated file. */ #include <stdint.h> #include <inttypes.h> } Multiple `%top' blocks are allowed, and their order is preserved. ---------- Footnotes ---------- (1) Actually, `yyIN_HEADER' is defined before the `%top' block. File: flex.info, Node: Rules Section, Next: User Code Section, Prev: Definitions Section, Up: Format Format of the Rules Section =========================== The "rules" section of the `flex' input contains a series of rules of the form: pattern action where the pattern must be unindented and the action must begin on the same line. *Note Patterns::, for a further description of patterns and actions. In the rules section, any indented or %{ %} enclosed text appearing before the first rule may be used to declare variables which are local to the scanning routine and (after the declarations) code which is to be executed whenever the scanning routine is entered. Other indented or %{ %} text in the rule section is still copied to the output, but its meaning is not well-defined and it may well cause compile-time errors (this feature is present for POSIX compliance. *Note Lex and Posix::, for other such features). Any _indented_ text or text enclosed in `%{' and `%}' is copied verbatim to the output (with the %{ and %} symbols removed). The %{ and %} symbols must appear unindented on lines by themselves. File: flex.info, Node: User Code Section, Next: Comments in the Input, Prev: Rules Section, Up: Format Format of the User Code Section =============================== The user code section is simply copied to `lex.yy.c' verbatim. It is used for companion routines which call or are called by the scanner. The presence of this section is optional; if it is missing, the second `%%' in the input file may be skipped, too. File: flex.info, Node: Comments in the Input, Prev: User Code Section, Up: Format Comments in the Input ===================== Flex supports C-style comments, that is, anything between /* and */ is considered a comment. Whenever flex encounters a comment, it copies the entire comment verbatim to the generated source code. Comments may appear just about anywhere, but with the following exceptions: * Comments may not appear in the Rules Section wherever flex is expecting a regular expression. This means comments may not appear at the beginning of a line, or immediately following a list of scanner states. * Comments may not appear on an `%option' line in the Definitions Section. If you want to follow a simple rule, then always begin a comment on a new line, with one or more whitespace characters before the initial `/*'). This rule will work anywhere in the input file. All the comments in the following example are valid: %{ /* code block */ %} /* Definitions Section */ %x STATE_X %% /* Rules Section */ ruleA /* after regex */ { /* code block */ } /* after code block */ /* Rules Section (indented) */ <STATE_X>{ ruleC ECHO; ruleD ECHO; %{ /* code block */ %} } %% /* User Code Section */ File: flex.info, Node: Patterns, Next: Matching, Prev: Format, Up: Top Patterns ******** The patterns in the input (see *Note Rules Section::) are written using an extended set of regular expressions. These are: `x' match the character 'x' `.' any character (byte) except newline `[xyz]' a "character class"; in this case, the pattern matches either an 'x', a 'y', or a 'z' `[abj-oZ]' a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z' `[^A-Z]' a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter. `[^A-Z\n]' any character EXCEPT an uppercase letter or a newline `r*' zero or more r's, where r is any regular expression `r+' one or more r's `r?' zero or one r's (that is, "an optional r") `r{2,5}' anywhere from two to five r's `r{2,}' two or more r's `r{4}' exactly 4 r's `{name}' the expansion of the `name' definition (*note Format::). `"[xyz]\"foo"' the literal string: `[xyz]"foo' `\X' if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C interpretation of `\x'. Otherwise, a literal `X' (used to escape operators such as `*') `\0' a NUL character (ASCII code 0) `\123' the character with octal value 123 `\x2a' the character with hexadecimal value 2a `(r)' match an `r'; parentheses are used to override precedence (see below) `rs' the regular expression `r' followed by the regular expression `s'; called "concatenation" `r|s' either an `r' or an `s' `r/s' an `r' but only if it is followed by an `s'. The text matched by `s' is included when determining whether this rule is the longest match, but is then returned to the input before the action is executed. So the action only sees the text matched by `r'. This type of pattern is called "trailing context". (There are some combinations of `r/s' that flex cannot match correctly. *Note Limitations::, regarding dangerous trailing context.) `^r' an `r', but only at the beginning of a line (i.e., when just starting to scan, or right after a newline has been scanned). `r$' an `r', but only at the end of a line (i.e., just before a newline). Equivalent to `r/\n'. Note that `flex''s notion of "newline" is exactly whatever the C compiler used to compile `flex' interprets `\n' as; in particular, on some DOS systems you must either filter out `\r's in the input yourself, or explicitly use `r/\r\n' for `r$'. `<s>r' an `r', but only in start condition `s' (see *Note Start Conditions:: for discussion of start conditions). `<s1,s2,s3>r' same, but in any of start conditions `s1', `s2', or `s3'. `<*>r' an `r' in any start condition, even an exclusive one. `<<EOF>>' an end-of-file. `<s1,s2><<EOF>>' an end-of-file when in start condition `s1' or `s2' Note that inside of a character class, all regular expression operators lose their special meaning except escape (`\') and the character class operators, `-', `]]', and, at the beginning of the class, `^'. The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence (see special note on the precedence of the repeat operator, `{}', under the documentation for the `--posix' POSIX compliance option). For example, foo|bar* is the same as (foo)|(ba(r*)) since the `*' operator has higher precedence than concatenation, and concatenation higher than alternation (`|'). This pattern therefore matches _either_ the string `foo' _or_ the string `ba' followed by zero-or-more `r''s. To match `foo' or zero-or-more repetitions of the string `bar', use: foo|(bar)* And to match a sequence of zero or more repetitions of `foo' and `bar': (foo|bar)* In addition to characters and ranges of characters, character classes can also contain "character class expressions". These are expressions enclosed inside `[': and `:]' delimiters (which themselves must appear between the `[' and `]' of the character class. Other elements may occur inside the character class, too). The valid expressions are: [:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:] These expressions all designate a set of characters equivalent to the corresponding standard C `isXXX' function. For example, `[:alnum:]' designates those characters for which `isalnum()' returns true - i.e., any alphabetic or numeric character. Some systems don't provide `isblank()', so flex defines `[:blank:]' as a blank or a tab. For example, the following character classes are all equivalent: [[:alnum:]] [[:alpha:][:digit:]] [[:alpha:][0-9]] [a-zA-Z0-9] Some notes on patterns are in order. * If your scanner is case-insensitive (the `-i' flag), then `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'. * Character classes with ranges, such as `[a-Z]', should be used with caution in a case-insensitive scanner if the range spans upper or lowercase characters. Flex does not know if you want to fold all upper and lowercase characters together, or if you want the literal numeric range specified (with no case folding). When in doubt, flex will assume that you meant the literal numeric range, and will issue a warning. The exception to this rule is a character range such as `[a-z]' or `[S-W]' where it is obvious that you want case-folding to occur. Here are some examples with the `-i' flag enabled: Range Result Literal Range Alternate Range `[a-t]' ok `[a-tA-T]' `[A-T]' ok `[a-tA-T]' `[A-t]' ambiguous `[A-Z\[\\\]_`a-t]' `[a-tA-T]' `[_-{]' ambiguous `[_`a-z{]' `[_`a-zA-Z{]' `[@-C]' ambiguous `[@ABC]' `[@A-Z\[\\\]_`abc]' * A negated character class such as the example `[^A-Z]' above _will_ match a newline unless `\n' (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., `[^A-Z\n]'). This is unlike how many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like `[^"]*' can match the entire input unless there's another quote in the input. * A rule can have at most one instance of trailing context (the `/' operator or the `$' operator). The start condition, `^', and `<<EOF>>' patterns can only occur at the beginning of a pattern, and, as well as with `/' and `$', cannot be grouped inside parentheses. A `^' which does not occur at the beginning of a rule or a `$' which does not occur at the end of a rule loses its special properties and is treated as a normal character. * The following are invalid: foo/bar$ <sc1>foo<sc2>bar Note that the first of these can be written `foo/bar\n'. * The following will result in `$' or `^' being treated as a normal character: foo|(bar$) foo|^bar If the desired meaning is a `foo' or a `bar'-followed-by-a-newline, the following could be used (the special `|' action is explained below, *note Actions::): foo | bar$ /* action goes here */ A similar trick will work for matching a `foo' or a `bar'-at-the-beginning-of-a-line. File: flex.info, Node: Matching, Next: Actions, Prev: Patterns, Up: Top How the Input Is Matched ************************ When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text (for trailing context rules, this includes the length of the trailing part, even though it will then be returned to the input). If it finds two or more matches of the same length, the rule listed first in the `flex' input file is chosen. Once the match is determined, the text corresponding to the match (called the "token") is made available in the global character pointer `yytext', and its length in the global integer `yyleng'. The "action" corresponding to the matched pattern is then executed (*note Actions::), and then the remaining input is scanned for another match. If no match is found, then the "default rule" is executed: the next character in the input is considered matched and copied to the standard output. Thus, the simplest valid `flex' input is: %% which generates a scanner that simply copies its input (one character at a time) to its output. Note that `yytext' can be defined in two different ways: either as a character _pointer_ or as a character _array_. You can control which definition `flex' uses by including one of the special directives `%pointer' or `%array' in the first (definitions) section of your flex input. The default is `%pointer', unless you use the `-l' lex compatibility option, in which case `yytext' will be an array. The advantage of using `%pointer' is substantially faster scanning and no buffer overflow when matching very large tokens (unless you run out of dynamic memory). The disadvantage is that you are restricted in how your actions can modify `yytext' (*note Actions::), and calls to the `unput()' function destroys the present contents of `yytext', which can be a considerable porting headache when moving between different `lex' versions. The advantage of `%array' is that you can then modify `yytext' to your heart's content, and calls to `unput()' do not destroy `yytext' (*note Actions::). Furthermore, existing `lex' programs sometimes access `yytext' externally using declarations of the form: extern char yytext[]; This definition is erroneous when used with `%pointer', but correct for `%array'. The `%array' declaration defines `yytext' to be an array of `YYLMAX' characters, which defaults to a fairly large value. You can change the size by simply #define'ing `YYLMAX' to a different value in the first section of your `flex' input. As mentioned above, with `%pointer' yytext grows dynamically to accommodate large tokens. While this means your `%pointer' scanner can accommodate very large tokens (such as matching entire blocks of comments), bear in mind that each time the scanner must resize `yytext' it also must rescan the entire token from the beginning, so matching such tokens can prove slow. `yytext' presently does _not_ dynamically grow if a call to `unput()' results in too much text being pushed back; instead, a run-time error results. Also note that you cannot use `%array' with C++ scanner classes (*note Cxx::). File: flex.info, Node: Actions, Next: Generated Scanner, Prev: Matching, Up: Top Actions ******* Each pattern in a rule has a corresponding "action", which can be any arbitrary C statement. The pattern ends at the first non-escaped whitespace character; the remainder of the line is its action. If the action is empty, then when the pattern is matched the input token is simply discarded. For example, here is the specification for a program which deletes all occurrences of `zap me' from its input: %% "zap me" This example will copy all other characters in the input to the output since they will be matched by the default rule. Here is a program which compresses multiple blanks and tabs down to a single blank, and throws away whitespace found at the end of a line: %% [ \t]+ putchar( ' ' ); [ \t]+$ /* ignore this token */ If the action contains a `}', then the action spans till the balancing `}' is found, and the action may cross multiple lines. `flex' knows about C strings and comments and won't be fooled by braces found within them, but also allows actions to begin with `%{' and will consider the action to be all the text up to the next `%}' (regardless of ordinary braces inside the action). An action consisting solely of a vertical bar (`|') means "same as the action for the next rule". See below for an illustration. Actions can include arbitrary C code, including `return' statements to return a value to whatever routine called `yylex()'. Each time `yylex()' is called it continues processing tokens from where it last left off until it either reaches the end of the file or executes a return. Actions are free to modify `yytext' except for lengthening it (adding characters to its end-these will overwrite later characters in the input stream). This however does not apply when using `%array' (*note Matching::). In that case, `yytext' may be freely modified in any way. Actions are free to modify `yyleng' except they should not do so if the action also includes use of `yymore()' (see below). There are a number of special directives which can be included within an action: `ECHO' copies yytext to the scanner's output. `BEGIN' followed by the name of a start condition places the scanner in the corresponding start condition (see below). `REJECT' directs the scanner to proceed on to the "second best" rule which matched the input (or a prefix of the input). The rule is chosen as described above in *Note Matching::, and `yytext' and `yyleng' set up appropriately. It may either be one which matched as much text as the originally chosen rule but came later in the `flex' input file, or one which matched less text. For example, the following will both count the words in the input and call the routine `special()' whenever `frob' is seen: int word_count = 0; %% frob special(); REJECT; [^ \t\n]+ ++word_count; Without the `REJECT', any occurences of `frob' in the input would not be counted as words, since the scanner normally executes only one action per token. Multiple uses of `REJECT' are allowed, each one finding the next best choice to the currently active rule. For example, when the following scanner scans the token `abcd', it will write `abcdabcaba' to the output: %% a | ab | abc | abcd ECHO; REJECT; .|\n /* eat up any unmatched character */ The first three rules share the fourth's action since they use the special `|' action. `REJECT' is a particularly expensive feature in terms of scanner performance; if it is used in _any_ of the scanner's actions it will slow down _all_ of the scanner's matching. Furthermore, `REJECT' cannot be used with the `-Cf' or `-CF' options (*note Scanner Options::). Note also that unlike the other special actions, `REJECT' is a _branch_. code immediately following it in the action will _not_ be executed. `yymore()' tells the scanner that the next time it matches a rule, the corresponding token should be _appended_ onto the current value of `yytext' rather than replacing it. For example, given the input `mega-kludge' the following will write `mega-mega-kludge' to the output: %% mega- ECHO; yymore(); kludge ECHO; First `mega-' is matched and echoed to the output. Then `kludge' is matched, but the previous `mega-' is still hanging around at the beginning of `yytext' so the `ECHO' for the `kludge' rule will actually write `mega-kludge'. Two notes regarding use of `yymore()'. First, `yymore()' depends on the value of `yyleng' correctly reflecting the size of the current token, so you must not modify `yyleng' if you are using `yymore()'. Second, the presence of `yymore()' in the scanner's action entails a minor performance penalty in the scanner's matching speed. `yyless(n)' returns all but the first `n' characters of the current token back to the input stream, where they will be rescanned when the scanner looks for the next match. `yytext' and `yyleng' are adjusted appropriately (e.g., `yyleng' will now be equal to `n'). For example, on the input `foobar' the following will write out `foobarbar': %% foobar ECHO; yyless(3); [a-z]+ ECHO; An argument of 0 to `yyless()' will cause the entire current input string to be scanned again. Unless you've changed how the scanner will subsequently process its input (using `BEGIN', for example), this will result in an endless loop. Note that `yyless()' is a macro and can only be used in the flex input file, not from other source files. `unput(c)' puts the character `c' back onto the input stream. It will be the next character scanned. The following action will take the current token and cause it to be rescanned enclosed in parentheses. { int i; /* Copy yytext because unput() trashes yytext */ char *yycopy = strdup( yytext ); unput( ')' ); for ( i = yyleng - 1; i >= 0; --i ) unput( yycopy[i] ); unput( '(' ); free( yycopy ); } Note that since each `unput()' puts the given character back at the _beginning_ of the input stream, pushing back strings must be done back-to-front. An important potential problem when using `unput()' is that if you are using `%pointer' (the default), a call to `unput()' _destroys_ the contents of `yytext', starting with its rightmost character and devouring one character to the left with each call. If you need the value of `yytext' preserved after a call to `unput()' (as in the above example), you must either first copy it elsewhere, or build your scanner using `%array' instead (*note Matching::). Finally, note that you cannot put back `EOF' to attempt to mark the input stream with an end-of-file. `input()' reads the next character from the input stream. For example, the following is one way to eat up C comments: %% "/*" { register int c; for ( ; ; ) { while ( (c = input()) != '*' && c != EOF ) ; /* eat up text of comment */ if ( c == '*' ) { while ( (c = input()) == '*' ) ; if ( c == '/' ) break; /* found the end */ } if ( c == EOF ) { error( "EOF in comment" ); break; } } } (Note that if the scanner is compiled using `C++', then `input()' is instead referred to as yyinput(), in order to avoid a name clash with the `C++' stream by the name of `input'.) `YY_FLUSH_BUFFER()' flushes the scanner's internal buffer so that the next time the scanner attempts to match a token, it will first refill the buffer using `YY_INPUT()' (*note Generated Scanner::). This action is a special case of the more general `yy_flush_buffer()' function, described below (*note Multiple Input Buffers::) `yyterminate()' can be used in lieu of a return statement in an action. It terminates the scanner and returns a 0 to the scanner's caller, indicating "all done". By default, `yyterminate()' is also called when an end-of-file is encountered. It is a macro and may be redefined. File: flex.info, Node: Generated Scanner, Next: Start Conditions, Prev: Actions, Up: Top The Generated Scanner ********************* The output of `flex' is the file `lex.yy.c', which contains the scanning routine `yylex()', a number of tables used by it for matching tokens, and a number of auxiliary routines and macros. By default, `yylex()' is declared as follows: int yylex() { ... various definitions and the actions in here ... } (If your environment supports function prototypes, then it will be `int yylex( void )'.) This definition may be changed by defining the `YY_DECL' macro. For example, you could use: #define YY_DECL float lexscan( a, b ) float a, b; to give the scanning routine the name `lexscan', returning a float, and taking two floats as arguments. Note that if you give arguments to the scanning routine using a K&R-style/non-prototyped function declaration, you must terminate the definition with a semi-colon (;). `flex' generates `C99' function definitions by default. However flex does have the ability to generate obsolete, er, `traditional', function definitions. This is to support bootstrapping gcc on old systems. Unfortunately, traditional definitions prevent us from using any standard data types smaller than int (such as short, char, or bool) as function arguments. For this reason, future versions of `flex' may generate standard C99 code only, leaving K&R-style functions to the historians. Currently, if you do *not* want `C99' definitions, then you must use `%option noansi-definitions'. Whenever `yylex()' is called, it scans tokens from the global input file `yyin' (which defaults to stdin). It continues until it either reaches an end-of-file (at which point it returns the value 0) or one of its actions executes a `return' statement. If the scanner reaches an end-of-file, subsequent calls are undefined unless either `yyin' is pointed at a new input file (in which case scanning continues from that file), or `yyrestart()' is called. `yyrestart()' takes one argument, a `FILE *' pointer (which can be NULL, if you've set up `YY_INPUT' to scan from a source other than `yyin'), and initializes `yyin' for scanning from that file. Essentially there is no difference between just assigning `yyin' to a new input file or using `yyrestart()' to do so; the latter is available for compatibility with previous versions of `flex', and because it can be used to switch input files in the middle of scanning. It can also be used to throw away the current input buffer, by calling it with an argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER' (*note Actions::). Note that `yyrestart()' does _not_ reset the start condition to `INITIAL' (*note Start Conditions::). If `yylex()' stops scanning due to executing a `return' statement in one of the actions, the scanner may then be called again and it will resume scanning where it left off. By default (and for purposes of efficiency), the scanner uses block-reads rather than simple `getc()' calls to read characters from `yyin'. The nature of how it gets its input can be controlled by defining the `YY_INPUT' macro. The calling sequence for `YY_INPUT()' is `YY_INPUT(buf,result,max_size)'. Its action is to place up to `max_size' characters in the character array `buf' and return in the integer variable `result' either the number of characters read or the constant `YY_NULL' (0 on Unix systems) to indicate `EOF'. The default `YY_INPUT' reads from the global file-pointer `yyin'. Here is a sample definition of `YY_INPUT' (in the definitions section of the input file): %{ #define YY_INPUT(buf,result,max_size) \ { \ int c = getchar(); \ result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ } %} This definition will change the input processing to occur one character at a time. When the scanner receives an end-of-file indication from YY_INPUT, it then checks the `yywrap()' function. If `yywrap()' returns false (zero), then it is assumed that the function has gone ahead and set up `yyin' to point to another input file, and scanning continues. If it returns true (non-zero), then the scanner terminates, returning 0 to its caller. Note that in either case, the start condition remains unchanged; it does _not_ revert to `INITIAL'. If you do not supply your own version of `yywrap()', then you must either use `%option noyywrap' (in which case the scanner behaves as though `yywrap()' returned 1), or you must link with `-lfl' to obtain the default version of the routine, which always returns 1. For scanning from in-memory buffers (e.g., scanning strings), see *Note Scanning Strings::. *Note Multiple Input Buffers::. The scanner writes its `ECHO' output to the `yyout' global (default, `stdout'), which may be redefined by the user simply by assigning it to some other `FILE' pointer.