Yacc: Yet Another Compiler-Compiler


                     Stephen C. Johnson



          Computer program  input  generally  has  some
     structure;  in  fact,  every computer program that
     does input  can  be  thought  of  as  defining  an
     ``input  language''  which  it  accepts.  An input
     language  may  be  as  complex  as  a  programming
     language,  or  as simple as a sequence of numbers.
     Unfortunately, usual input facilities are limited,
     difficult to use, and often are lax about checking
     their inputs for validity.

          Yacc provides a general tool  for  describing
     the  input  to  a computer program.  The Yacc user
     specifies the structures of  his  input,  together
     with  code to be invoked as each such structure is
     recognized.  Yacc turns such a specification  into
     a  subroutine that handles the input process; fre-
     quently, it is convenient and appropriate to  have
     most of the flow of control in the user's applica-
     tion handled by this subroutine.

          The input subroutine produced by Yacc calls a
     user-supplied  routine  to  return  the next basic
     input item.  Thus, the user can specify his  input
     in  terms  of  individual  input characters, or in
     terms of higher level constructs such as names and
     numbers.   The user-supplied routine may also han-
     dle idiomatic features such as  comment  and  con-
     tinuation  conventions,  which typically defy easy
     grammatical specification.

          Yacc is written in portable C.  The class  of
     specifications  accepted  is  a  very general one:
     LALR(1) grammars with disambiguating rules.

          In addition to compilers for C, APL,  Pascal,
     RATFOR,  etc.,  Yacc  has  also been used for less
     conventional   languages,   including   a   photo-
     typesetter   language,   several  desk  calculator
     languages, a document retrieval system, and a For-
     tran debugging system.





     Yacc provides a general tool for imposing structure  on
the  input  to a computer program.  The Yacc user prepares a
specification of the  input  process;  this  includes  rules
describing  the  input  structure,  code  to be invoked when
these rules are recognized, and a low-level  routine  to  do
the  basic input.  Yacc then generates a function to control
the input process.  This function, called  a  parser,  calls
the  user-supplied  low-level  input  routine  (the lexical
analyzer) to pick up the basic items  (called  tokens)  from
the  input  stream.  These tokens are organized according to
the input structure rules, called grammar rules; when one of
these rules has been recognized, then user code supplied for
this rule, an action, is invoked; actions have  the  ability
to  return  values  and  make  use  of  the  values of other
actions.

     Yacc is written in a portable dialect of C Ritchie Ker-
nighan Language Prentice and the actions, and output subrou-
tine, are in C as well.  Moreover,  many  of  the  syntactic
conventions of Yacc follow C.

     The heart of the input specification is a collection of
grammar  rules.   Each rule describes an allowable structure
and gives it a name.  For example, one grammar rule might be

        date  :  month_name  day  ','  year   ;

Here, date, month_name, day, and year  represent  structures
of  interest  in  the input process; presumably, month_name,
day, and year are defined elsewhere.   The  comma  ``,''  is
enclosed in single quotes; this implies that the comma is to
appear literally in the  input.   The  colon  and  semicolon
merely  serve as punctuation in the rule, and have no signi-
ficance in controlling the input.  Thus, with proper defini-
tions, the input

        July  4, 1776

might be matched by the above rule.

     An important part of the input process is  carried  out
by  the lexical analyzer.  This user routine reads the input
stream, recognizing the lower level structures, and communi-
cates these tokens to the parser.  For historical reasons, a
structure recognized by the lexical  analyzer  is  called  a
terminal  symbol,  while  the  structure  recognized  by the
parser is called a  nonterminal.  To avoid  confusion,
terminal symbols will usually be referred to as tokens.

     There is considerable leeway  in  deciding  whether  to
recognize  structures  using the lexical analyzer or grammar
rules.  For example, the rules


        month_name  :  'J' 'a' 'n'   ;
        month_name  :  'F' 'e' 'b'   ;

                 . . .

        month_name  :  'D' 'e' 'c'   ;

might be used in the above example.   The  lexical  analyzer
would   only  need  to  recognize  individual  letters,  and
month_name would be a nonterminal  symbol.   Such  low-level
rules  tend  to waste time and space, and may complicate the
specification beyond Yacc's ability to deal with  it.   Usu-
ally,  the lexical analyzer would recognize the month names,
and return an indication that a month_name was seen; in this
case, month_name would be a token.

     Literal characters such as ``,'' must  also  be  passed
through  the  lexical  analyzer,  and  are  also  considered
tokens.

     Specification files are very flexible.  It is realively
easy to add to the above example the rule

        date  :  month '/' day '/' year   ;

allowing

        7 / 4 / 1776

as a synonym for

        July 4, 1776

In most cases, this new rule could be ``slipped  in''  to  a
working  system  with  minimal  effort, and little danger of
disrupting existing input.

     The input being read may not conform to the  specifica-
tions.   These  input  errors  are  detected  as early as is
theoretically possible with a left-to-right scan; thus,  not
only  is  the chance of reading and computing with bad input
data substantially reduced, but the bad data can usually  be
quickly  found.   Error  handling,  provided  as part of the
input specifications, permits the reentry of  bad  data,  or
the  continuation  of  the input process after skipping over
the bad data.

     In some cases, Yacc fails  to  produce  a  parser  when
given  a set of specifications.  For example, the specifica-
tions may be self contradictory, or they may require a  more
powerful  recognition mechanism than that available to Yacc.
The former cases represent design errors; the  latter  cases
can  often  be corrected by making the lexical analyzer more
powerful, or by rewriting some of the grammar rules.   While
Yacc  cannot  handle  all possible specifications, its power
compares favorably with similar systems; moreover, the  con-
structions  which  are difficult for Yacc to handle are also
frequently difficult for human beings to handle.  Some users
have  reported that the discipline of formulating valid Yacc
specifications for their input revealed errors of conception
or design early in the program development.

     The theory underlying Yacc  has  been  described  else-
where.   Aho  Johnson  Surveys LR Parsing Aho Johnson Ullman
Ambiguous Grammars Aho  Ullman  Principles  Compiler  Design
Yacc  has been extensively used in numerous practical appli-
cations, including lint, Johnson Lint the  Portable  C  Com-
piler,  Johnson  Portable  Compiler  Theory and a system for
typesetting mathematics.  Kernighan Cherry typesetting  sys-
tem CACM

     The next several sections describe the basic process of
preparing  a  Yacc  specification;  Section  1 describes the
preparation of grammar rules, Section 2 the  preparation  of
the  user  supplied actions associated with these rules, and
Section 3 the preparation of lexical analyzers.   Section  4
describes  the operation of the parser.  Section 5 discusses
various reasons why Yacc may be unable to produce  a  parser
from  a  specification,  and what to do about it.  Section 6
describes a simple  mechanism  for  handling  operator  pre-
cedences  in  arithmetic  expressions.   Section 7 discusses
error detection  and  recovery.   Section  8  discusses  the
operating  environment  and  special features of the parsers
Yacc produces.   Section  9  gives  some  suggestions  which
should  improve  the  style and efficiency of the specifica-
tions.  Section 10 discusses some advanced topics, and  Sec-
tion  11  gives  acknowledgements.   Appendix  A has a brief
example, and Appendix B gives a summary of  the  Yacc  input
syntax.   Appendix C gives an example using some of the more
advanced  features  of  Yacc,  and,  finally,   Appendix   D
describes  mechanisms  and  syntax  no  longer actively sup-
ported, but provided for historical  continuity  with  older
versions of Yacc.

1: Basic Specifications

     Names refer to either tokens  or  nonterminal  symbols.
Yacc  requires token names to be declared as such.  In addi-
tion, for reasons discussed in Section 3, it is often desir-
able to include the lexical analyzer as part of the specifi-
cation file; it may be useful to include other  programs  as
well.  Thus, every specification file consists of three sec-
tions: the declarations, (grammar) rules, and programs.  The
sections are separated by double percent ``%%'' marks.  (The
percent ``%'' is generally used in Yacc specifications as an
escape character.)


     In other words, a full specification file looks like

        declarations
        %%
        rules
        %%
        programs


     The declaration section may be empty.  Moreover, if the
programs section is omitted, the second %% mark may be omit-
ted also; thus, the smallest legal Yacc specification is

        %%
        rules


     Blanks, tabs, and newlines are ignored except that they
may not appear in names or multi-character reserved symbols.
Comments may appear wherever  a  name  is  legal;  they  are
enclosed in /* . . . */, as in C and PL/I.

     The rules section is made up of  one  or  more  grammar
rules.  A grammar rule has the form:

        A  :  BODY  ;

A represents a  nonterminal  name,  and  BODY  represents  a
sequence  of zero or more names and literals.  The colon and
the semicolon are Yacc punctuation.

     Names may be of arbitrary length, and may be made up of
letters,   dot  ``.'',  underscore  ``_'',  and  non-initial
digits.  Upper and lower case  letters  are  distinct.   The
names  used  in  the  body  of  a grammar rule may represent
tokens or nonterminal symbols.

     A literal consists of a character  enclosed  in  single
quotes  ``'''.   As  in  C, the backslash ``\'' is an escape
character within literals, and all the C escapes are  recog-
nized.  Thus

        '\n'    newline
        '\r'    return
        '\''    single quote ``'''
        '\\'    backslash ``\''
        '\t'    tab
        '\b'    backspace
        '\f'    form feed
        '\xxx'  ``xxx'' in octal

For a number of technical reasons, the NUL  character  ('\0'
or 0) should never be used in grammar rules.

     If there are several grammar rules with the  same  left
hand  side,  the  vertical  bar  ``|''  can be used to avoid
rewriting the left hand side.  In addition, the semicolon at
the  end  of  a  rule  can be dropped before a vertical bar.
Thus the grammar rules

        A       :       B  C  D   ;
        A       :       E  F   ;
        A       :       G   ;

can be given to Yacc as

        A       :       B  C  D
                |       E  F
                |       G
                ;

It is not necessary that all grammar  rules  with  the  same
left  side  appear  together  in  the grammar rules section,
although it makes the input much more readable,  and  easier
to change.

     If a nonterminal symbol matches the empty string,  this
can be indicated in the obvious way:

        empty :   ;


     Names representing tokens must  be  declared;  this  is
most simply done by writing

        %token   name1  name2 . . .

in the declarations section.  (See Sections 3 , 5, and 6 for
much  more  discussion).   Every  name  not  defined  in the
declarations section is assumed to represent  a  nonterminal
symbol.   Every  nonterminal  symbol must appear on the left
side of at least one rule.

     Of all the nonterminal symbols, one, called  the  start
symbol,  has  particular importance.  The parser is designed
to recognize the start symbol; thus, this symbol  represents
the largest, most general structure described by the grammar
rules.  By default, the start symbol is taken to be the left
hand  side  of  the first grammar rule in the rules section.
It is possible, and in fact desirable, to declare the  start
symbol  explicitly  in  the  declarations  section using the
%start keyword:

        %start   symbol


     The end of the input to the parser  is  signaled  by  a
special  token,  called the endmarker.  If the tokens up to,

but not including, the  endmarker  form  a  structure  which
matches the start symbol, the parser function returns to its
caller after the endmarker is seen; it  accepts  the  input.
If  the  endmarker  is  seen  in any other context, it is an
error.

     It is the job of the user-supplied lexical analyzer  to
return the endmarker when appropriate; see section 3, below.
Usually the endmarker represents some reasonably obvious I/O
status, such as ``end-of-file'' or ``end-of-record''.

2: Actions

     With each grammar rule, the user may associate  actions
to  be  performed  each  time  the rule is recognized in the
input process.  These actions may  return  values,  and  may
obtain  the  values returned by previous actions.  Moreover,
the lexical  analyzer  can  return  values  for  tokens,  if
desired.

     An action is an arbitrary C statement, and as such  can
do  input  and  output, call subprograms, and alter external
vectors and variables.  An action is  specified  by  one  or
more  statements,  enclosed in curly braces ``{'' and ``}''.
For example,

        A       :       '('  B  ')'
                                {       hello( 1, "abc" );  }

and

        XXX     :       YYY  ZZZ
                                {       printf("a message\n");
                                        flag = 25;   }

are grammar rules with actions.

     To facilitate easy communication  between  the  actions
and  the parser, the action statements are altered slightly.
The symbol ``dollar sign'' ``$'' is used as a signal to Yacc
in this context.

     To  return  a  value,  the  action  normally  sets  the
pseudo-variable  ``$$''  to  some  value.   For  example, an
action that does nothing but return the value 1 is

                {  $$ = 1;  }


     To obtain the values returned by previous  actions  and
the  lexical  analyzer,  the  action  may  use  the  pseudo-
variables $1, $2, . . ., which refer to the values  returned
by  the components of the right side of a rule, reading from
left to right.  Thus, if the rule is

        A       :       B  C  D   ;

for example, then $2 has the value returned by C, and $3 the
value returned by D.

     As a more concrete example, consider the rule

        expr    :       '('  expr  ')'   ;

The value returned by this rule is usually the value of  the
_e_x_p_r in parentheses.  This can be indicated by

        expr    :        '('  expr  ')'         {  $$ = $2 ;  }


     By default, the value of a rule is  the  value  of  the
first element in it ($1).  Thus, grammar rules of the form

        A       :       B    ;

frequently need not have an explicit action.

     In the examples above, all the actions came at the  end
of  their  rules.  Sometimes, it is desirable to get control
before a rule is fully parsed.  Yacc permits an action to be
written in the middle of a rule as well as at the end.  This
rule is assumed to return a value,  accessible  through  the
usual mechanism by the actions to the right of it.  In turn,
it may access the values returned  by  the  symbols  to  its
left.  Thus, in the rule

        A       :       B
                                {  $$ = 1;  }
                        C
                                {   x = $2;   y = $3;  }
                ;

the effect is to set x to 1, and y to the value returned  by
C.

     Actions that do not terminate a rule are actually  han-
dled by Yacc by manufacturing a new nonterminal symbol name,
and a new rule matching this name to the empty string.   The
interior  action  is the action triggered off by recognizing
this added rule.  Yacc actually treats the above example  as
if it had been written:


        $ACT    :       /* empty */
                                {  $$ = 1;  }
                ;

        A       :       B  $ACT  C
                                {   x = $2;   y = $3;  }
                ;


     In many applications, output is not  done  directly  by
the actions; rather, a data structure, such as a parse tree,
is constructed in memory, and transformations are applied to
it before output is generated.  Parse trees are particularly
easy to construct, given routines to build and maintain  the
tree  structure  desired.  For example, suppose there is a C
function node, written so that the call

        node( L, n1, n2 )

creates a node with label L, and descendants n1 and n2,  and
returns  the  index  of  the newly created node.  Then parse
tree can be built by supplying actions such as:

        expr    :       expr  '+'  expr
                                {  $$ = node( '+', $1, $3 );  }

in the specification.

     The user may define other variables to be used  by  the
actions.   Declarations  and  definitions  can appear in the
declarations section,  enclosed  in  the  marks  ``%{''  and
``%}''.   These  declarations  and  definitions  have global
scope, so they are known to the action  statements  and  the
lexical analyzer.  For example,

        %{   int variable = 0;   %}

could be placed in the declarations section, making variable
accessible to all of the actions.  The Yacc parser uses only
names beginning in ``yy''; the user should avoid such names.

     In these examples, all the values are integers: a  dis-
cussion  of  values  of other types will be found in Section
10.

3: Lexical Analysis

     The user must supply a lexical  analyzer  to  read  the
input   stream  and  communicate  tokens  (with  values,  if
desired)  to  the  parser.   The  lexical  analyzer  is   an
integer-valued  function called yylex.  The function returns
an integer, the token number, representing the kind of token
read.   If  there  is a value associated with that token, it

should be assigned to the external variable yylval.

     The parser and the lexical analyzer must agree on these
token  numbers  in  order  for communication between them to
take place.  The numbers may be chosen by Yacc, or chosen by
the  user.   In either case, the ``# define'' mechanism of C
is used to  allow  the  lexical  analyzer  to  return  these
numbers  symbolically.   For example, suppose that the token
name DIGIT has been defined in the declarations  section  of
the  Yacc  specification  file.  The relevant portion of the
lexical analyzer might look like:

        yylex(){
                extern int yylval;
                int c;
                . . .
                c = getchar();
                . . .
                switch( c ) {
                        . . .
                case '0':
                case '1':
                  . . .
                case '9':
                        yylval = c-'0';
                        return( DIGIT );
                        . . .
                        }
                . . .


     The intent is to return a token number of DIGIT, and  a
value  equal  to the numerical value of the digit.  Provided
that the lexical analyzer code is  placed  in  the  programs
section of the specification file, the identifier DIGIT will
be defined as the token number  associated  with  the  token
DIGIT.

     This mechanism leads to clear, easily modified  lexical
analyzers;  the  only pitfall is the need to avoid using any
token names in the grammar that are reserved or  significant
in  C  or the parser; for example, the use of token names if
or while will almost  certainly  cause  severe  difficulties
when the lexical analyzer is compiled.  The token name error
is reserved for error  handling,  and  should  not  be  used
naively (see Section 7).

     As mentioned above, the token numbers may be chosen  by
Yacc  or by the user.  In the default situation, the numbers
are chosen by Yacc.  The default token number for a  literal
character  is  the  numerical  value of the character in the
local character set.  Other names are assigned token numbers
starting at 257.

     To  assign  a  token  number  to  a  token   (including
literals), the first appearance of the token name or literal
in the declarations section can be immediately followed by a
nonnegative  integer.  This integer is taken to be the token
number of the name  or  literal.   Names  and  literals  not
defined  by  this mechanism retain their default definition.
It is important that all token numbers be distinct.

     For historical reasons, the endmarker must  have  token
number 0 or negative.  This token number cannot be redefined
by the user; thus, all lexical analyzers should be  prepared
to  return 0 or negative as a token number upon reaching the
end of their input.

     A very useful tool for constructing  lexical  analyzers
is  the  Lex program developed by Mike Lesk.  Lesk Lex These
lexical analyzers are designed to work in close harmony with
Yacc   parsers.    The   specifications  for  these  lexical
analyzers use regular expressions instead of grammar  rules.
Lex  can be easily used to produce quite complicated lexical
analyzers, but there remain some languages (such as FORTRAN)
which  do not fit any theoretical framework, and whose lexi-
cal analyzers must be crafted by hand.

4: How the Parser Works

     Yacc turns the specification file  into  a  C  program,
which parses the input according to the specification given.
The algorithm used to  go  from  the  specification  to  the
parser  is  complex, and will not be discussed here (see the
references for more information).  The parser  itself,  how-
ever,  is relatively simple, and understanding how it works,
while not strictly necessary, will nevertheless make  treat-
ment  of error recovery and ambiguities much more comprehen-
sible.

     The parser produced by Yacc consists of a finite  state
machine with a stack.  The parser is also capable of reading
and remembering the next input token (called  the lookahead
token).   The  current state is always the one on the top of
the stack.  The states of the finite state machine are given
small  integer labels; initially, the machine is in state 0,
the stack contains only state 0, and no lookahead token  has
been read.

     The machine has only  four  actions  available  to  it,
called  shift,  reduce,  accept,  and  error.  A move of the
parser is done as follows:

1.   Based on its current state, the parser decides  whether
     it needs a lookahead token to decide what action should
     be done; if it needs one, and does  not  have  one,  it
     calls yylex to obtain the next token.

2.   Using the current state, and  the  lookahead  token  if
     needed, the parser decides on its next action, and car-
     ries it out.  This may result in  states  being  pushed
     onto  the stack, or popped off of the stack, and in the
     lookahead token being processed or left alone.

     The shift action is the most common action  the  parser
takes.   Whenever a shift action is taken, there is always a
lookahead token.  For example, in state 56 there may  be  an
action:

                IF      shift 34

which says, in state 56, if the lookahead token is  IF,  the
current state (56) is pushed down on the stack, and state 34
becomes the current state (on the top of  the  stack).   The
lookahead token is cleared.

     The reduce action keeps the stack from growing  without
bounds.   Reduce actions are appropriate when the parser has
seen the right hand side of a grammar rule, and is  prepared
to  announce  that  it  has  seen  an  instance of the rule,
replacing the right hand side by the left hand side.  It may
be  necessary  to  consult  the  lookahead  token  to decide
whether to reduce, but usually  it  is  not;  in  fact,  the
default  action  (represented  by a ``.'') is often a reduce
action.

     Reduce actions are associated with  individual  grammar
rules.   Grammar rules are also given small integer numbers,
leading to some confusion.  The action

                .       reduce 18

refers to grammar rule 18, while the action

                IF      shift 34

refers to state 34.

     Suppose the rule being reduced is

        A       :       x  y  z    ;

The reduce action depends on the left hand symbol (A in this
case),  and  the  number  of  symbols on the right hand side
(three in this case).  To reduce,  first  pop  off  the  top
three  states  from  the  stack  (In  general, the number of
states popped equals the number of symbols on the right side
of  the rule).  In effect, these states were the ones put on
the stack while recognizing x, y, and z, and no longer serve
any  useful purpose.  After popping these states, a state is
uncovered which was the  state  the  parser  was  in  before
beginning  to process the rule.  Using this uncovered state,
and the symbol on the left side of the rule, perform what is
in  effect  a  shift  of A.  A new state is obtained, pushed
onto the stack, and parsing continues.  There  are  signifi-
cant  differences  between  the  processing of the left hand
symbol and an ordinary shift of a token,  however,  so  this
action  is  called a goto action.  In particular, the looka-
head token is cleared by a shift, and is not affected  by  a
goto.   In  any  case, the uncovered state contains an entry
such as:

                A       goto 20

causing state 20 to be pushed onto the stack, and become the
current state.

     In effect, the reduce action ``turns back  the  clock''
in the parse, popping the states off the stack to go back to
the state where the right hand side of the  rule  was  first
seen.   The  parser  then behaves as if it had seen the left
side at that time.  If the right hand side of  the  rule  is
empty,  no states are popped off of the stack: the uncovered
state is in fact the current state.

     The reduce action is also important in the treatment of
user-supplied  actions  and values.  When a rule is reduced,
the code supplied with the rule is executed before the stack
is  adjusted.   In addition to the stack holding the states,
another stack, running in parallel with it, holds the values
returned  from the lexical analyzer and the actions.  When a
shift takes place, the external variable  yylval  is  copied
onto  the value stack.  After the return from the user code,
the reduction is carried out.  When the goto action is done,
the  external variable yyval is copied onto the value stack.
The pseudo-variables $1, $2, etc., refer to the value stack.

     The other two  parser  actions  are  conceptually  much
simpler.   The accept action indicates that the entire input
has been seen and that it matches the  specification.   This
action  appears  only  when  the lookahead token is the end-
marker, and indicates that the parser has successfully  done
its  job.  The error action, on the other hand, represents a
place where  the  parser  can  no  longer  continue  parsing
according  to  the  specification.   The input tokens it has
seen, together with the lookahead token, cannot be  followed
by  anything that would result in a legal input.  The parser
reports an error, and attempts to recover the situation  and
resume parsing: the error recovery (as opposed to the detec-
tion of error) will be covered in Section 7.

     It is time for an example!  Consider the specification

        %token  DING  DONG  DELL
        %%
        rhyme   :       sound  place
                ;
        sound   :       DING  DONG
                ;
        place   :       DELL
                ;


     When Yacc is invoked with the -_v option, a file  called
y.output  is  produced, with a human-readable description of
the parser.  The y.output file corresponding  to  the  above
grammar (with some statistics stripped off the end) is:


        state 0
                $accept  :  _rhyme  $end

                DING  shift 3
                .  error

                rhyme  goto 1
                sound  goto 2

        state 1
                $accept  :   rhyme_$end

                $end  accept
                .  error

        state 2
                rhyme  :   sound_place

                DELL  shift 5
                .  error

                place   goto 4

        state 3
                sound   :   DING_DONG

                DONG  shift 6
                .  error

        state 4
                rhyme  :   sound  place_    (1)

                .   reduce  1

        state 5
                place  :   DELL_    (3)

                .   reduce  3

        state 6
                sound   :   DING  DONG_    (2)

                .   reduce  2

Notice that, in addition to  the  actions  for  each  state,
there  is a description of the parsing rules being processed
in each state.  The _ character is used to indicate what has
been  seen,  and what is yet to come, in each rule.  Suppose
the input is

        DING  DONG  DELL

It is instructive to follow the steps of  the  parser  while
processing this input.

     Initially, the current state is state  0.   The  parser
needs  to  refer to the input in order to decide between the
actions available in state 0, so the first token,  DING,  is
read,  becoming  the lookahead token.  The action in state 0
on DING is is ``shift 3'', so state 3  is  pushed  onto  the
stack,  and the lookahead token is cleared.  State 3 becomes
the current state.  The next token, DONG, is read,  becoming
the  lookahead  token.   The  action in state 3 on the token
DONG is ``shift 6'', so state 6 is pushed  onto  the  stack,
and  the lookahead is cleared.  The stack now contains 0, 3,
and 6.  In state 6, without even consulting  the  lookahead,
the parser reduces by rule 2.

                sound  :   DING  DONG

This rule has two symbols on the right  hand  side,  so  two
states,  6  and  3,  are popped off of the stack, uncovering
state 0.  Consulting the description of state 0, looking for
a goto on sound,

                sound   goto 2

is obtained; thus state 2 is pushed onto the stack, becoming
the current state.

     In state 2, the next token, DELL, must  be  read.   The
action  is ``shift 5'', so state 5 is pushed onto the stack,
which now has 0, 2, and 5 on it, and the lookahead token  is
cleared.   In  state 5, the only action is to reduce by rule
3.  This has one symbol on  the  right  hand  side,  so  one
state, 5, is popped off, and state 2 is uncovered.  The goto
in state 2 on place, the left side of rule 3,  is  state  4.
Now,  the  stack contains 0, 2, and 4.  In state 4, the only
action is to reduce by rule 1.  There are two symbols on the
right,  so  the  top  two  states are popped off, uncovering
state 0 again.  In state 0, there is a goto on rhyme causing
the parser to enter state 1.  In state 1, the input is read;
the endmarker is obtained,  indicated  by  ``$end''  in  the
y.output  file.  The action in state 1 when the endmarker is
seen is to accept, successfully ending the parse.

     The reader is urged to consider how  the  parser  works
when  confronted  with  such  incorrect strings as DING DONG
DONG, DING DONG, DING DONG DELL DELL, etc.   A  few  minutes
spend  with  this and other simple examples will probably be
repaid when problems arise in more complicated contexts.

5: Ambiguity and Conflicts

     A set of grammar rules is ambiguous if  there  is  some
input string that can be structured in two or more different
ways.  For example, the grammar rule
        expr    :       expr  '-'  expr

is a natural way of expressing the  fact  that  one  way  of
forming an arithmetic expression is to put two other expres-
sions together with  a  minus  sign  between  them.   Unfor-
tunately,  this grammar rule does not completely specify the
way that all complex inputs should be structured.  For exam-
ple, if the input is

        expr  -  expr  -  expr

the rule allows this input to be structured as either

        (  expr  -  expr  )  -  expr

or as

        expr  -  (  expr  -  expr  )

(The first is called left  association,  the  second  right
association).

     Yacc detects such ambiguities when it is attempting  to
build the parser.  It is instructive to consider the problem
that confronts the parser when it is given an input such as

        expr  -  expr  -  expr

When the parser has read the second expr, the input that  it
has seen:

        expr  -  expr

matches the right side  of  the  grammar  rule  above.   The
parser  could  reduce the input by applying this rule; after
applying the rule; the input is  reduced  to  expr(the  left
side  of  the  rule).   The parser would then read the final
part of the input:

        -  expr

and again reduce.  The effect of this is to  take  the  left
associative interpretation.

     Alternatively, when the parser has seen

        expr  -  expr

it could defer the immediate application of  the  rule,  and
continue reading the input until it had seen

        expr  -  expr  -  expr

It could then apply the rule to the rightmost three symbols,
reducing them to expr and leaving

        expr  -  expr

Now the rule can be reduced once more; the effect is to take
the right associative interpretation.  Thus, having read

        expr  -  expr

the parser can do two legal things, a shift or a  reduction,
and  has  no way of deciding between them.  This is called a
shift / reduce conflict.  It may also happen that the parser
has  a  choice  of  two  legal  reductions; this is called a
reduce / reduce conflict.  Note that  there  are  never  any
``Shift/shift'' conflicts.

     When there are shift/reduce or reduce/reduce conflicts,
Yacc still produces a parser.  It does this by selecting one
of the valid  steps  wherever  it  has  a  choice.   A  rule
describing  which  choice  to  make  in a given situation is
called a disambiguating rule.

     Yacc invokes two disambiguating rules by default:

1.   In a shift/reduce conflict, the default is  to  do  the
     shift.

2.   In a reduce/reduce conflict, the default is  to  reduce
     by the earlier grammar rule (in the input sequence).

     Rule 1 implies that reductions  are  deferred  whenever
there  is  a  choice,  in favor of shifts.  Rule 2 gives the
user rather crude control over the behavior of the parser in
this   situation,  but  reduce/reduce  conflicts  should  be
avoided whenever possible.

     Conflicts may arise because of  mistakes  in  input  or
logic,  or  because  the  grammar  rules,  while consistent,
require a more complex parser than Yacc can construct.   The
use of actions within rules can also cause conflicts, if the
action must be done before the parser can be sure which rule
is  being  recognized.   In  these cases, the application of
disambiguating rules  is  inappropriate,  and  leads  to  an
incorrect  parser.  For this reason, Yacc always reports the
number of shift/reduce and reduce/reduce conflicts  resolved
by Rule 1 and Rule 2.

     In general, whenever it is possible to  apply  disambi-
guating rules to produce a correct parser, it is also possi-
ble to rewrite the grammar rules so that the same inputs are
read but there are no conflicts.  For this reason, most pre-
vious parser generators  have  considered  conflicts  to  be
fatal  errors.   Our  experience  has  suggested  that  this
rewriting  is  somewhat  unnatural,  and   produces   slower
parsers;  thus,  Yacc will produce parsers even in the pres-
ence of conflicts.

     As an example of the  power  of  disambiguating  rules,
consider a fragment from a programming language involving an
``if-then-else'' construction:

        stat    :       IF  '('  cond  ')'  stat
                |       IF  '('  cond  ')'  stat  ELSE  stat
                ;

In these rules, IF and ELSE are tokens, cond is a  nontermi-
nal symbol describing conditional (logical) expressions, and
stat is a nonterminal  symbol  describing  statements.   The
first rule will be called the simple-if rule, and the second
the if-else rule.

     These two rules form an ambiguous  construction,  since
input of the form

        IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2

can be structured according to these rules in two ways:

        IF  (  C1  )  {
                IF  (  C2  )  S1
                }
        ELSE  S2

or

        IF  (  C1  )  {
                IF  (  C2  )  S1
                ELSE  S2
                }

The second interpretation is the one given in most  program-
ming  languages having this construct.  Each ELSE is associ-
ated with the last  preceding  ``un-ELSE'd''  IF.   In  this
example, consider the situation where the parser has seen

        IF  (  C1  )  IF  (  C2  )  S1

and is looking at the ELSE.  It can  immediately  reduce  by
the simple-if rule to get

        IF  (  C1  )  stat

and then read the remaining input,

        ELSE  S2

and reduce

        IF  (  C1  )  stat  ELSE  S2

by the if-else rule.  This leads to the first of  the  above
groupings of the input.

     On the other hand, the ELSE may be  shifted,  S2  read,
and then the right hand portion of

        IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2

can be reduced by the if-else rule to get

        IF  (  C1  )  stat

which can be reduced by the simple-if rule.  This  leads  to
the  second  of  the  above groupings of the input, which is
usually desired.

     Once again the parser can do two valid things  -  there
is a shift/reduce conflict.  The application of disambiguat-
ing rule 1 tells the parser to shift  in  this  case,  which
leads to the desired grouping.

     This shift/reduce conflict arises only when there is  a
particular current input symbol, ELSE, and particular inputs
already seen, such as

        IF  (  C1  )  IF  (  C2  )  S1

In general, there may be many conflicts, and each  one  will
be  associated  with an input symbol and a set of previously
read inputs.  The previously read inputs  are  characterized
by the state of the parser.

     The conflict messages of Yacc are  best  understood  by
examining the verbose (-v) option output file.  For example,
the output corresponding to the above conflict  state  might
be:

23: shift/reduce conflict (shift 45, reduce 18) on ELSE

state 23

          stat  :  IF  (  cond  )  stat_         (18)
          stat  :  IF  (  cond  )  stat_ELSE  stat

         ELSE     shift 45
         .        reduce 18


The first line describes the conflict, giving the state  and
the  input  symbol.  The ordinary state description follows,
giving the grammar rules active in the state, and the parser
actions.  Recall that the underline marks the portion of the
grammar rules which has been seen.  Thus in the example,  in
state 23 the parser has seen input corresponding to

        IF  (  cond  )  stat

and the two grammar rules shown are  active  at  this  time.
The  parser can do two possible things.  If the input symbol
is ELSE, it is possible to shift into state  45.   State  45
will have, as part of its description, the line

        stat  :  IF  (  cond  )  stat  ELSE_stat

since the ELSE will have been shifted in this  state.   Back
in  state 23, the alternative action, described by ``.'', is
to be done if the input symbol is not  mentioned  explicitly
in  the above actions; thus, in this case, if the input sym-
bol is not ELSE, the parser reduces by grammar rule 18:

        stat  :  IF  '('  cond  ')'  stat

Once again, notice that the numbers following ``shift'' com-
mands  refer  to  other  states, while the numbers following
``reduce'' commands refer to grammar rule numbers.   In  the
y.output  file,  the  rule  numbers  are printed after those
rules which can be reduced.  In most one states, there  will
be  at  most  reduce  action possible in the state, and this
will be the default command.  The user who encounters  unex-
pected  shift/reduce conflicts will probably want to look at
the verbose output to decide whether the default actions are
appropriate.   In really tough cases, the user might need to
know more about the behavior and construction of the  parser
than can be covered here.  In this case, one of the theoret-
ical references Aho Johnson Surveys Parsing Aho Johnson Ull-
man  Deterministic  Ambiguous  Aho  Ullman Principles Design
might be consulted; the services of a local guru might  also
be appropriate.

6: Precedence

     There is one common situation  where  the  rules  given
above for resolving conflicts are not sufficient; this is in
the parsing of arithmetic expressions.  Most of the commonly
used  constructions for arithmetic expressions can be natur-
ally described by the notion of precedence levels for opera-
tors, together with information about left or right associa-
tivity.  It turns out that ambiguous grammars with appropri-
ate  disambiguating rules can be used to create parsers that
are faster and easier to write than parsers constructed from
unambiguous  grammars.  The basic notion is to write grammar
rules of the form

        expr  :  expr  OP  expr

and

        expr  :  UNARY  expr

for all binary and unary operators desired.  This creates  a
very  ambiguous  grammar,  with  many parsing conflicts.  As
disambiguating rules, the user specifies the precedence,  or
binding  strength,  of  all  the operators, and the associa-
tivity of the binary operators.  This information is  suffi-
cient  to  allow  Yacc  to  resolve the parsing conflicts in
accordance with these rules, and  construct  a  parser  that
realizes the desired precedences and associativities.

     The precedences and  associativities  are  attached  to
tokens  in  the  declarations  section.   This  is done by a
series of  lines  beginning  with  a  Yacc  keyword:  %left,
%right,  or %nonassoc, followed by a list of tokens.  All of
the tokens on the same line are assumed  to  have  the  same
precedence  level and associativity; the lines are listed in
order of increasing precedence or binding strength.  Thus,

        %left  '+'  '-'
        %left  '*'  '/'

describes the  precedence  and  associativity  of  the  four
arithmetic  operators.  Plus and minus are left associative,
and have lower precedence than star  and  slash,  which  are
also  left  associative.   The  keyword  %right  is  used to
describe  right  associative  operators,  and  the   keyword
%nonassoc  is  used to describe operators, like the operator
.LT. in Fortran, that may  not  associate  with  themselves;
thus,

        A  .LT.  B  .LT.  C

is illegal  in  Fortran,  and  such  an  operator  would  be
described with the keyword %nonassoc in Yacc.  As an example
of the behavior of these declarations, the description

        %right  '='
        %left  '+'  '-'
        %left  '*'  '/'

        %%

        expr    :       expr  '='  expr
                |       expr  '+'  expr
                |       expr  '-'  expr
                |       expr  '*'  expr
                |       expr  '/'  expr
                |       NAME
                ;

might be used to structure the input

        a  =  b  =  c*d  -  e  -  f*g

as follows:

        a = ( b = ( ((c*d)-e) - (f*g) ) )

When this mechanism is used, unary operators must,  in  gen-
eral, be given a precedence.  Sometimes a unary operator and
a binary operator have the same symbolic representation, but
different  precedences.  An example is unary and binary '-';
unary minus may be given the same  strength  as  multiplica-
tion,  or  even  higher,  while  binary  minus  has  a lower
strength than multiplication.  The keyword,  %prec,  changes
the  precedence  level  associated with a particular grammar
rule.  %prec appears immediately after the body of the gram-
mar  rule,  before  the  action or closing semicolon, and is
followed by a token name or literal.   It  causes  the  pre-
cedence  of the grammar rule to become that of the following
token name or literal.  For example,  to  make  unary  minus
have  the  same precedence as multiplication the rules might
resemble:

        %left  '+'  '-'
        %left  '*'  '/'

        %%

        expr    :       expr  '+'  expr
                |       expr  '-'  expr
                |       expr  '*'  expr
                |       expr  '/'  expr
                |       '-'  expr      %prec  '*'
                |       NAME
                ;


     A token declared by %left, %right, and  %nonassoc  need
not be, but may be, declared by %token as well.

     The precedences and associativities are used by Yacc to
resolve  parsing conflicts; they give rise to disambiguating
rules.  Formally, the rules work as follows:

1.   The precedences and associativities  are  recorded  for
     those tokens and literals that have them.

2.   A precedence and associativity is associated with  each
     grammar rule; it is the precedence and associativity of
     the last token or literal in the body of the rule.   If
     the  %prec  construction  is  used,  it  overrides this
     default.  Some grammar rules may have no precedence and
     associativity associated with them.

3.   When there is a reduce/reduce conflict, or there  is  a
     shift/reduce  conflict  and  either the input symbol or
     the grammar rule has no precedence  and  associativity,
     then  the  two disambiguating rules given at the begin-
     ning of the section are used,  and  the  conflicts  are
     reported.

4.   If there is a shift/reduce conflict, and both the gram-
     mar  rule  and  the input character have precedence and
     associativity associated with them, then  the  conflict
     is  resolved  in  favor of the action (shift or reduce)
     associated with the higher  precedence.   If  the  pre-
     cedences  are the same, then the associativity is used;
     left  associative  implies  reduce,  right  associative
     implies shift, and nonassociating implies error.

     Conflicts resolved by precedence are not counted in the
number  of shift/reduce and reduce/reduce conflicts reported
by Yacc.  This means that mistakes in the  specification  of
precedences  may disguise errors in the input grammar; it is
a good idea to be sparing with precedences, and use them  in
an  essentially  ``cookbook'' fashion, until some experience
has been gained.  The y.output file is very useful in decid-
ing whether the parser is actually doing what was intended.

7: Error Handling

     Error handling is an extremely difficult area, and many
of  the problems are semantic ones.  When an error is found,
for example, it may  be  necessary  to  reclaim  parse  tree
storage,  delete  or  alter symbol table entries, and, typi-
cally, set switches to avoid generating any further output.

     It is seldom acceptable to stop all processing when  an
error  is  found; it is more useful to continue scanning the
input to find further syntax  errors.   This  leads  to  the
problem  of getting the parser ``restarted'' after an error.
A general class of algorithms to do this involves discarding
a  number of tokens from the input string, and attempting to
adjust the parser so that input can continue.

     To allow the user some control over this process,  Yacc
provides  a  simple,  but  reasonably general, feature.  The
token name ``error'' is reserved for error  handling.   This
name  can  be  used in grammar rules; in effect, it suggests
places where errors are expected, and  recovery  might  take
place.   The  parser  pops its stack until it enters a state
where the token ``error'' is legal.  It then behaves  as  if
the  token  ``error''  were the current lookahead token, and
performs the action encountered.   The  lookahead  token  is
then  reset  to the token that caused the error.  If no spe-
cial error rules have been specified, the  processing  halts
when an error is detected.

     In order to prevent a cascade of  error  messages,  the
parser,  after  detecting  an  error, remains in error state
until three tokens have been successfully read and  shifted.
If  an error is detected when the parser is already in error
state, no message is given, and the input token  is  quietly
deleted.

     As an example, a rule of the form

        stat    :       error

would, in effect, mean that on a  syntax  error  the  parser
would  attempt to skip over the statement in which the error
was seen.  More precisely, the parser will scan ahead, look-
ing  for three tokens that might legally follow a statement,
and start processing at the first of these;  if  the  begin-
nings of statements are not sufficiently distinctive, it may
make a false start in the middle of a statement, and end  up
reporting a second error where there is in fact no error.

     Actions may be used with  these  special  error  rules.
These  actions might attempt to reinitialize tables, reclaim
symbol table space, etc.

     Error rules such as the above  are  very  general,  but
difficult to control.  Somewhat easier are rules such as

        stat    :       error  ';'

Here, when there is an error, the parser  attempts  to  skip
over  the  statement, but will do so by skipping to the next
';'.  All tokens after the error and  before  the  next  ';'
cannot be shifted, and are discarded.  When the ';' is seen,
this rule will be reduced, and any ``cleanup'' action  asso-
ciated with it performed.

     Another form of error rule arises in interactive appli-
cations,  where  it  may be desirable to permit a line to be
reentered after an error.  A possible error rule might be

        input   :       error  '\n'  {  printf( "Reenter last line: " );  }  input
                                {       $$  =  $4;  }

There is one potential difficulty with  this  approach;  the
parser  must  correctly process three input tokens before it
admits that it has correctly resynchronized after the error.
If  the  reentered  line  contains an error in the first two
tokens, the parser deletes the offending tokens,  and  gives
no  message; this is clearly unacceptable.  For this reason,
there is a mechanism that can be used to force the parser to
believe  that  an  error has been fully recovered from.  The
statement

        yyerrok ;









PS1:15-26                Yacc: Yet Another Compiler-Compiler


in an action resets the parser to its normal mode.  The last
example is better written

        input   :       error  '\n'
                                {       yyerrok;
                                        printf( "Reenter last line: " );   }
                        input
                                {       $$  =  $4;  }
                ;


     As mentioned above, the token  seen  immediately  after
the  ``error''  symbol is the input token at which the error
was discovered.  Sometimes, this is inappropriate; for exam-
ple, an error recovery action might take upon itself the job
of finding the correct place to resume input.  In this case,
the previous lookahead token must be cleared.  The statement

        yyclearin ;

in an action will have this effect.   For  example,  suppose
the  action  after  error  were  to  call some sophisticated
resynchronization  routine,  supplied  by  the  user,   that
attempted  to advance the input to the beginning of the next
valid statement.  After this routine was  called,  the  next
token  returned by yylex would presumably be the first token
in a legal statement; the old, illegal token  must  be  dis-
carded,  and the error state reset.  This could be done by a
rule like

        stat    :       error
                                {       resynch();
                                        yyerrok ;
                                        yyclearin ;   }
                ;


     These mechanisms are admittedly crude, but do allow for
a  simple, fairly effective recovery of the parser from many
errors; moreover, the user can get control to deal with  the
error actions required by other portions of the program.

8: The Yacc Environment

     When the user inputs a specification to Yacc, the  out-
put  is a file of C programs, called y.tab.c on most systems
(due to local file system conventions, the names may  differ
from  installation  to installation).  The function produced
by Yacc is called yyparse; it is an integer valued function.
When  it  is  called, it in turn repeatedly calls yylex, the
lexical analyzer supplied by the user  (see  Section  3)  to
obtain   input  tokens.   Eventually,  either  an  error  is
detected, in which case (if no error recovery  is  possible)
yyparse returns the value 1, or the lexical analyzer returns
the endmarker token and the parser accepts.  In  this  case,
yyparse returns the value 0.

     The user must provide a certain amount  of  environment
for  this  parser in order to obtain a working program.  For
example, as with every C program, a program called main must
be  defined,  that eventually calls yyparse.  In addition, a
routine called yyerror prints a message when a syntax  error
is detected.

     These two routines must be  supplied  in  one  form  or
another  by  the  user.  To ease the initial effort of using
Yacc, a library has been provided with default  versions  of
main and yyerror.  The name of this library is system depen-
dent; on many systems the library is accessed by a -ly argu-
ment to the loader.  To show the triviality of these default
programs, the source is given below:

        main(){
                return( yyparse() );
                }

and

        # include < stdio.h>

        yyerror(s) char *s; {
                fprintf( stderr, "%s\n", s );
                }

The argument to yyerror is a string containing an error mes-
sage,  usually  the  string  ``syntax  error''.  The average
application will want to do better than  this.   Ordinarily,
the  program should keep track of the input line number, and
print it along with the  message  when  a  syntax  error  is
detected.  The external integer variable yychar contains the
lookahead token number at the time the error  was  detected;
this  may  be of some interest in giving better diagnostics.
Since the main program is probably supplied by the user  (to
read  arguments,  etc.)  the  Yacc library is useful only in
small projects, or in the earliest stages of larger ones.

     The external integer variable yydebug is  normally  set
to 0.  If it is set to a nonzero value, the parser will out-
put a verbose description of its actions, including  a  dis-
cussion  of which input symbols have been read, and what the
parser actions are.  Depending on the operating environment,
it may be possible to set this variable by using a debugging
system.

9: iHints for Preparing Specifications

     This section contains miscellaneous hints on  preparing
efficient,  easy  to  change, and clear specifications.  The
individual subsections are more or less independent.

Input Style

     It is  difficult  to  provide  rules  with  substantial
actions  and  still have a readable specification file.  The
following style hints owe much to Brian Kernighan.

a.   Use all capital letters for token names, all lower case
     letters  for  nonterminal names.  This rule comes under
     the heading of ``knowing who to blame  when  things  go
     wrong.''

b.   Put grammar rules and actions on separate lines.   This
     allows  either  to be changed without an automatic need
     to change the other.

c.   Put all rules with the same left  hand  side  together.
     Put  the  left hand side in only once, and let all fol-
     lowing rules begin with a vertical bar.

d.   Put a semicolon only after the last rule with  a  given
     left  hand  side,  and  put the semicolon on a separate
     line.  This allows new rules to be easily added.

e.   Indent rule bodies by two tab stops, and action  bodies
     by three tab stops.

     The example in Appendix A  is  written  following  this
style,  as are the examples in the text of this paper (where
space permits).  The user must make up his  own  mind  about
these  stylistic questions; the central problem, however, is
to make the rules visible through the morass of action code.

Left Recursion

     The algorithm used by the  Yacc  parser  encourages  so
called ``left recursive'' grammar rules: rules of the form

        name    :       name  rest_of_rule  ;

These rules frequently arise when writing specifications  of
sequences and lists:

        list    :       item
                |       list  ','  item
                ;

and

        seq     :       item
                |       seq  item
                ;

In each of these cases, the first rule will be  reduced  for
the first item only, and the second rule will be reduced for
the second and all succeeding items.

     With right recursive rules, such as

        seq     :       item
                |       item  seq
                ;

the parser would be a bit bigger, and  the  items  would  be
seen,  and  reduced, from right to left.  More seriously, an
internal stack in the parser would be in danger of overflow-
ing  if  a  very  long  sequence  were read.  Thus, the user
should use left recursion wherever reasonable.

     It is worth considering whether a  sequence  with  zero
elements  has  any  meaning, and if so, consider writing the
sequence specification with an empty rule:

        seq     :       /* empty */
                |       seq  item
                ;

Once again, the first rule would always be  reduced  exactly
once,  before  the  first item was read, and then the second
rule would be reduced once for each item  read.   Permitting
empty  sequences  often leads to increased generality.  How-
ever, conflicts might arise if Yacc is asked to decide which
empty  sequence  it  has seen, when it hasn't seen enough to
know!

Lexical Tie-ins

     Some lexical decisions depend on context.  For example,
the  lexical  analyzer might want to delete blanks normally,
but not within quoted strings.  Or names  might  be  entered
into a symbol table in declarations, but not in expressions.

     One way of handling this situation is to create a  glo-
bal  flag  that is examined by the lexical analyzer, and set
by actions.  For example, suppose a program consists of 0 or
more  declarations,  followed by 0 or more statements.  Con-
sider:

        %{
                int dflag;
        %}
          ...  other declarations ...

        %%

        prog    :       decls  stats
                ;

        decls   :       /* empty */
                                {       dflag = 1;  }
                |       decls  declaration
                ;

        stats   :       /* empty */
                                {       dflag = 0;  }
                |       stats  statement
                ;

            ...  other rules ...

The flag dflag is now 0 when reading statements, and 1  when
reading  declarations,  except for  the  first token in the
first statement.  This token must  be  seen  by  the  parser
before  it  can  tell that the declaration section has ended
and the statements have begun.  In many cases,  this  single
token exception does not affect the lexical scan.

     This kind of ``backdoor'' approach can be elaborated to
a  noxious  degree.   Nevertheless,  it  represents a way of
doing some things that are difficult, if not impossible,  to
do otherwise.

Reserved Words

     Some programming languages permit the user to use words
like  ``if'', which are normally reserved, as label or vari-
able names, provided that such use does  not  conflict  with
the  legal  use  of these names in the programming language.
This is extremely hard to do in the framework of Yacc; it is
difficult  to  pass information to the lexical analyzer tel-
ling it ``this instance of  `if'  is  a  keyword,  and  that
instance  is  a variable''.  The user can make a stab at it,
using the mechanism described in the last subsection, but it
is difficult.

     A number of  ways  of  making  this  easier  are  under
advisement.   Until  then, it is better that the keywords be
reserved; that is, be forbidden for use as  variable  names.
There  are  powerful  stylistic reasons for preferring this,
anyway.

10: Advanced Topics

     This section discusses a number of advanced features of
Yacc.

Simulating Error and Accept in Actions

     The parsing actions of error and accept  can  be  simu-
lated  in  an  action by use of macros YYACCEPT and YYERROR.
YYACCEPT causes yyparse  to  return  the  value  0;  YYERROR
causes  the  parser to behave as if the current input symbol
had been a  syntax  error;  yyerror  is  called,  and  error
recovery takes place.  These mechanisms can be used to simu-
late parsers with multiple endmarkers  or  context-sensitive
syntax checking.

Accessing Values in Enclosing Rules.

     An action may refer to values returned  by  actions  to
the  left  of the current rule.  The mechanism is simply the
same as with ordinary actions, a dollar sign followed  by  a
digit,  but  in  this  case  the digit may be 0 or negative.
Consider

        sent    :       adj  noun  verb  adj  noun
                                {  look at the sentence . . .  }
                ;

        adj     :       THE             {       $$ = THE;  }
                |       YOUNG   {       $$ = YOUNG;  }
                . . .
                ;

        noun    :       DOG
                                {       $$ = DOG;  }
                |       CRONE
                                {       if( $0 == YOUNG ){
                                                printf( "what?\n" );
                                                }
                                        $$ = CRONE;
                                        }
                ;
                . . .

In the action following the word CRONE, a check is made that
the  preceding token shifted was not YOUNG.  Obviously, this
is only possible when a great deal is known about what might
precede  the symbol noun in the input.  There is also a dis-
tinctly unstructured flavor about  this.   Nevertheless,  at
times  this  mechanism  will  save  a great deal of trouble,
especially when a few combinations are to be  excluded  from
an otherwise regular structure.

Support for Arbitrary Value Types

     By default, the values returned by actions and the lex-
ical analyzer are integers.  Yacc can also support values of
other types, including structures.  In addition, Yacc  keeps
track  of  the  types,  and inserts appropriate union member
names so that the resulting parser  will  be  strictly  type
checked.   The  Yacc value stack (see Section 4) is declared
to be a union of the various types of values  desired.   The
user  declares  the union, and associates union member names
to each token and nonterminal symbol having a  value.   When
the  value  is  referenced  through a $$ or $n construction,
Yacc will automatically insert the appropriate  union  name,
so  that  no unwanted conversions will take place.  In addi-
tion, type checking  commands  such  as  Lint  Johnson  Lint
Checker 1273 will be far more silent.

     There are three mechanisms used  to  provide  for  this
typing.   First,  there is a way of defining the union; this
must be done by the user since other programs,  notably  the
lexical  analyzer,  must  know about the union member names.
Second, there is a way of associating a  union  member  name
with tokens and nonterminals.  Finally, there is a mechanism
for describing the type of those few values where  Yacc  can
not easily determine the type.

     To declare the union, the user includes in the declara-
tion section:

        %union  {
                body of union ...
                }

This declares the Yacc value stack, and the  external  vari-
ables  yylval  and  yyval, to have type equal to this union.
If Yacc was invoked with the -_d option, the  union  declara-
tion  is  copied  onto the y.tab.h file.  Alternatively, the
union may be declared in a header file, and a  typedef  used
to  define  the  variable  YYSTYPE  to represent this union.
Thus, the header file might also have said:

        typedef union {
                body of union ...
                } YYSTYPE;

The header file must be included in  the  declarations  sec-
tion, by use of %{ and %}.

     Once YYSTYPE is defined, the union member names must be
associated  with the various terminal and nonterminal names.
The construction

        < name >


is used to indicate a union member name.   If  this  follows
one  of  the  keywords %token, %left, %right, and %nonassoc,
the union member name is associated with the tokens  listed.
Thus, saying

        %left  < optype>  '+'  '-'

will cause any reference to values  returned  by  these  two
tokens  to  be  tagged  with  the  union member name optype.
Another keyword, %type, is used similarly to associate union
member names with nonterminals.  Thus, one might say

        %type  < nodetype>  expr  stat


     There remain a couple of cases where  these  mechanisms
are  insufficient.  If there is an action within a rule, the
value returned by this action has no a priori  type.   Simi-
larly,  reference  to  left context values (such as $0 - see
the previous subsection ) leaves Yacc with no  easy  way  of
knowing  the  type.   In this case, a type can be imposed on
the reference by inserting a union member  name,  between  < 
and  >,  immediately  after the first $.  An example of this
usage is

        rule    :       aaa  {  $< intval>$  =  3;  } bbb
                                {       fun( $< intval>2, $< other>0 );  }
                ;

This syntax has little to recommend it,  but  the  situation
arises rarely.

     A sample specification is given  in  Appendix  C.   The
facilities  in  this subsection are not triggered until they
are used: in particular, the use of %type will turn on these
mechanisms.   When  they  are used, there is a fairly strict
level of checking.  For example, use of $n or $$ to refer to
something  with  no  defined  type  is  diagnosed.  If these
facilities are not triggered, the Yacc value stack  is  used
to hold _i_n_t's, as was true historically.

11: Acknowledgements

     Yacc owes much to  a  most  stimulating  collection  of
users,  who  have  goaded me beyond my inclination, and fre-
quently beyond my ability, in their endless search for ``one
more feature''.  Their irritating unwillingness to learn how
to do things my way has usually led to my doing things their
way;  most  of  the  time, they have been right.  B. W. Ker-
nighan, P. J. Plauger, S. I. Feldman, C. Imagna, M. E. Lesk,
and  A.  Snyder  will  recognize  some of their ideas in the
current version of Yacc.  C. B.  Haley  contributed  to  the
error  recovery  algorithm.  D. M. Ritchie, B. W. Kernighan,
and  M.  O.  Harris  helped  translate  this  document  into
English.   Al  Aho also deserves special credit for bringing
the mountain to Mohammed, and other favors.


Appendix A:  A Simple Example

     This example gives the complete Yacc specification  for
a  small  desk calculator; the desk calculator has 26 regis-
ters, labeled ``a'' through ``z'',  and  accepts  arithmetic
expressions  made  up  of  the  operators +, -, *, /, % (mod
operator), & (bitwise and), | (bitwise or), and  assignment.
If  an  expression  at  the  top level is an assignment, the
value is not printed; otherwise it is.  As in C, an  integer
that begins with 0 (zero) is assumed to be octal; otherwise,
it is assumed to be decimal.

     As an example of a Yacc specification, the desk  calcu-
lator  does  a reasonable job of showing how precedences and
ambiguities  are  used,  and  demonstrating   simple   error
recovery.   The major oversimplifications are that the lexi-
cal analysis phase is much simpler than  for  most  applica-
tions, and the output is produced immediately, line by line.
Note the way that decimal and octal integers are read in  by
the  grammar  rules; This job is probably better done by the
lexical analyzer.


%{
#  include  < stdio.h>
#  include  < ctype.h>

int  regs[26];
int  base;

%}

%start  list

%token  DIGIT  LETTER

%left  '|'
%left  '&'
%left  '+'  '-'
%left  '*'  '/'  '%'
%left  UMINUS      /*  supplies  precedence  for  unary  minus  */

%%      /*  beginning  of  rules  section  */

list :    /*  empty  */
     |    list  stat  '\n'
     |    list  error  '\n'
               {    yyerrok;  }
     ;

stat :    expr
               {    printf( "%d\n", $1 );  }
     |    LETTER  '='  expr
               {    regs[$1]  =  $3;  }









Yacc: Yet Another Compiler-Compiler                PS1:15-37


     ;

expr :    '('  expr  ')'
               {    $$  =  $2;  }
     |    expr  '+'  expr
               {    $$  =  $1  +  $3;  }
     |    expr  '-'  expr
               {    $$  =  $1  -  $3;  }
     |    expr  '*'  expr
               {    $$  =  $1  *  $3;  }
     |    expr  '/'  expr
               {    $$  =  $1  /  $3;  }
     |    expr  '%'  expr
               {    $$  =  $1  %  $3;  }
     |    expr  '&'  expr
               {    $$  =  $1  &  $3;  }
     |    expr  '|'  expr
               {    $$  =  $1  |  $3;  }
     |    '-'  expr        %prec  UMINUS
               {    $$  =  -  $2;  }
     |    LETTER
               {    $$  =  regs[$1];  }
     |    number
     ;

number    :    DIGIT
               {    $$ = $1;    base  =  ($1==0)  ?  8  :  10;  }
     |    number  DIGIT
               {    $$  =  base * $1  +  $2;  }
     ;

%%      /*  start  of  programs  */

yylex() {      /*  lexical  analysis  routine  */
              /*  returns  LETTER  for  a  lower  case  letter,  yylval = 0  through  25  */
              /*  return  DIGIT  for  a  digit,  yylval = 0  through  9  */
              /*  all  other  characters  are  returned  immediately  */

     int  c;

     while(  (c=getchar())  ==  ' '  )  {/*  skip  blanks  */  }

     /*  c  is  now  nonblank  */

     if(  islower(  c  )  )  {
          yylval  =  c  -  'a';
          return  (  LETTER  );
          }
     if(  isdigit(  c  )  )  {
          yylval  =  c  -  '0';
          return(  DIGIT  );
          }
     return(  c  );
     }


Appendix B: Yacc Input Syntax

     This Appendix has a description of the Yacc input  syn-
tax,  as  a Yacc specification.  Context dependencies, etc.,
are not considered.  Ironically, the Yacc  input  specifica-
tion  language is most naturally specified as an LR(2) gram-
mar; the sticky part comes when an identifier is seen  in  a
rule,  immediately  following an action.  If this identifier
is followed by a colon, it is the start of  the  next  rule;
otherwise  it  is  a continuation of the current rule, which
just happens to have an action embedded in  it.   As  imple-
mented,  the  lexical  analyzer  looks ahead after seeing an
identifier, and decide  whether  the  next  token  (skipping
blanks,  newlines,  comments,  etc.)  is a colon.  If so, it
returns the token C_IDENTIFIER.  Otherwise, it returns IDEN-
TIFIER.   Literals  (quoted  strings)  are  also returned as
IDENTIFIERS, but never as part of C_IDENTIFIERs.


            /*  grammar  for  the  input  to  Yacc  */

      /*  basic  entities  */
%token      IDENTIFIER  /*   includes  identifiers   and  literals  */
%token      C_IDENTIFIER      /*    identifier  (but  not  literal)  followed  by  colon    */
%token      NUMBER            /*    [0-9]+    */

      /*  reserved  words:    %type  =>  TYPE,  %left  =>  LEFT,  etc.  */

%token      LEFT  RIGHT  NONASSOC  TOKEN  PREC  TYPE  START  UNION

%token      MARK  /*  the  %%  mark  */
%token      LCURL /*  the  %{  mark  */
%token      RCURL /*  the  %}  mark  */

      /*  ascii  character  literals  stand  for  themselves  */

%start      spec

%%

spec  :     defs  MARK  rules  tail
      ;

tail  :     MARK  {    In  this  action,  eat  up  the  rest  of  the  file    }
      |     /*  empty:  the  second  MARK  is  optional  */
      ;

defs  :     /*  empty  */
      |     defs  def
      ;

def   :     START  IDENTIFIER
      |     UNION  {  Copy union definition  _t_o  output  }
      |     LCURL  {  Copy  C  code  to  output  file   }  RCURL


      |     ndefs  rword  tag  nlist
      ;

rword :     TOKEN
      |     LEFT
      |     RIGHT
      |     NONASSOC
      |     TYPE
      ;

tag   :     /*  empty:  union  tag  is  optional  */
      |     '< '  IDENTIFIER  '>'
      ;

nlist :     nmno
      |     nlist  nmno
      |     nlist  ','  nmno
      ;

nmno  :     IDENTIFIER        /*  NOTE:  literal  illegal  with  %type  */
      |     IDENTIFIER  NUMBER      /*  NOTE:  illegal  with  %type  */
      ;

      /*  rules  section  */

rules :     C_IDENTIFIER  rbody  prec
      |     rules  rule
      ;

rule  :     C_IDENTIFIER  rbody  prec
      |     '|'  rbody  prec
      ;

rbody :     /*  empty  */
      |     rbody  IDENTIFIER
      |     rbody  act
      ;

act   :     '{'  { Copy  action,  translate  $$,  etc.  }  '}'
      ;

prec  :     /*  empty  */
      |     PREC  IDENTIFIER
      |     PREC  IDENTIFIER  act
      |     prec  ';'
      ;


Appendix C: An Advanced Example

     This Appendix gives an example of a grammar using  some
of  the advanced features discussed in Section 10.  The desk
calculator example in Appendix A is modified  to  provide  a
desk  calculator  that  does  floating point interval arith-
metic.  The calculator understands floating point constants,
the  arithmetic  operations  +,  -,  *,  /,  unary  -, and =
(assignment), and has 26  floating  point  variables,  ``a''
through  ``z''.   Moreover,  it  also understands intervals,
written

                ( x , y )

where x is less than or equal to y.  There are  26  interval
valued  variables ``A'' through ``Z'' that may also be used.
The usage is similar to  that  in  Appendix  A;  assignments
return  no value, and print nothing, while expressions print
the (floating or interval) value.

     This example explores a number of interesting  features
of  Yacc  and  C.  Intervals are represented by a structure,
consisting of the left and right endpoint values, stored  as
double's.  This structure is given a type name, INTERVAL, by
using typedef.  The Yacc value stack can also contain float-
ing  point  scalars,  and  integers  (used to index into the
arrays holding  the  variable  values).   Notice  that  this
entire  strategy  depends  strongly  on being able to assign
structures and unions in C.  In fact, many  of  the  actions
call functions that return structures as well.

     It is also worth noting the use of  YYERROR  to  handle
error  conditions: division by an interval containing 0, and
an interval presented in the wrong order.   In  effect,  the
error  recovery  mechanism of Yacc is used to throw away the
rest of the offending line.

     In addition to the mixing of types on the value  stack,
this  grammar also demonstrates an interesting use of syntax
to keep track of the  type  (e.g.  scalar  or  interval)  of
intermediate   expressions.   Note  that  a  scalar  can  be
automatically promoted to an interval if the context demands
an  interval value.  This causes a large number of conflicts
when the grammar is run through Yacc: 18 Shift/Reduce and 26
Reduce/Reduce.   The  problem  can be seen by looking at the
two input lines:

                2.5 + ( 3.5 - 4. )

and

                2.5 + ( 3.5 , 4. )

Notice that the 2.5 is to be  used  in  an  interval  valued









Yacc: Yet Another Compiler-Compiler                PS1:15-41


expression in the second example, but this fact is not known
until the ``,'' is read; by this time, 2.5 is finished,  and
the  parser  cannot  go back and change its mind.  More gen-
erally, it might be necessary to  look  ahead  an  arbitrary
number of tokens to decide whether to convert a scalar to an
interval.  This problem is evaded by having  two  rules  for
each  binary  interval  valued  operator:  one when the left
operand is a scalar, and one when the  left  operand  is  an
interval.   In the second case, the right operand must be an
interval, so the conversion will be  applied  automatically.
Despite  this  evasion, there are still many cases where the
conversion may be applied or not, leading to the above  con-
flicts.   They  are resolved by listing the rules that yield
scalars first in the specification file; in  this  way,  the
conflicts  will  be  resolved  in  the  direction of keeping
scalar valued  expressions  scalar  valued  until  they  are
forced to become intervals.

     This way of handling multiple types  is  very  instruc-
tive,  but  not  very  general.  If there were many kinds of
expression types, instead of just two, the number  of  rules
needed  would  increase dramatically, and the conflicts even
more dramatically.  Thus, while this example is instructive,
it  is better practice in a more normal programming language
environment to keep the type  information  as  part  of  the
value, and not as part of the grammar.

     Finally, a word about the lexical analysis.   The  only
unusual  feature  is  the  treatment  of floating point con-
stants.  The C library routine atof is used to do the actual
conversion  from  a  character  string to a double precision
value.   If  the  lexical  analyzer  detects  an  error,  it
responds  by  returning a token that is illegal in the gram-
mar, provoking a syntax error  in  the  parser,  and  thence
error recovery.



%{

#  include  < stdio.h>
#  include  < ctype.h>

typedef  struct  interval  {
        double  lo,  hi;
        }  INTERVAL;

INTERVAL  vmul(),  vdiv();

double  atof();

double  dreg[ 26 ];
INTERVAL  vreg[ 26 ];

%}

%start    lines

%union    {
        int  ival;
        double  dval;
        INTERVAL  vval;
        }

%token  < ival>  DREG  VREG      /*  indices  into  dreg,  vreg  arrays  */

%token  < dval>  CONST           /*  floating  point  constant  */

%type  < dval>  dexp             /*  expression  */

%type  < vval>  vexp             /*  interval  expression  */

        /*  precedence  information  about  the  operators  */

%left   '+'  '-'
%left   '*'  '/'
%left   UMINUS        /*  precedence  for  unary  minus  */

%%

lines   :       /*  empty  */
        |       lines  line
        ;

line    :       dexp  '\n'
                        {       printf(  "%15.8f\n",  $1  );  }
        |       vexp  '\n'
                        {       printf(  "(%15.8f  ,  %15.8f  )\n",  $1.lo,  $1.hi  );  }
        |       DREG  '='  dexp  '\n'
                        {       dreg[$1]  =  $3;  }



        |       VREG  '='  vexp  '\n'
                        {       vreg[$1]  =  $3;  }
        |       error  '\n'
                        {       yyerrok;  }
        ;

dexp    :       CONST
        |       DREG
                        {       $$  =  dreg[$1];  }
        |       dexp  '+'  dexp
                        {       $$  =  $1  +  $3;  }
        |       dexp  '-'  dexp
                        {       $$  =  $1  -  $3;  }
        |       dexp  '*'  dexp
                        {       $$  =  $1  *  $3;  }
        |       dexp  '/'  dexp
                        {       $$  =  $1  /  $3;  }
        |       '-'  dexp       %prec  UMINUS
                        {       $$  =  - $2;  }
        |       '('  dexp  ')'
                        {       $$  =  $2;  }
        ;

vexp    :       dexp
                        {       $$.hi  =  $$.lo  =  $1;  }
        |       '('  dexp  ','  dexp  ')'
                        {
                        $$.lo  =  $2;
                        $$.hi  =  $4;
                        if(  $$.lo  >  $$.hi  ){
                                printf(  "interval  out  of  order\n"  );
                                YYERROR;
                                }
                        }
        |       VREG
                        {       $$  =  vreg[$1];    }
        |       vexp  '+'  vexp
                        {       $$.hi  =  $1.hi  +  $3.hi;
                                $$.lo  =  $1.lo  +  $3.lo;    }
        |       dexp  '+'  vexp
                        {       $$.hi  =  $1  +  $3.hi;
                                $$.lo  =  $1  +  $3.lo;    }
        |       vexp  '-'  vexp
                        {       $$.hi  =  $1.hi  -  $3.lo;
                                $$.lo  =  $1.lo  -  $3.hi;    }
        |       dexp  '-'  vexp
                        {       $$.hi  =  $1  -  $3.lo;
                                $$.lo  =  $1  -  $3.hi;    }
        |       vexp  '*'  vexp
                        {       $$  =  vmul(  $1.lo,  $1.hi,  $3  );  }
        |       dexp  '*'  vexp
                        {       $$  =  vmul(  $1,  $1,  $3  );  }
        |       vexp  '/'  vexp
                        {       if(  dcheck(  $3  )  )  YYERROR;
                                $$  =  vdiv(  $1.lo,  $1.hi,  $3  );  }
        |       dexp  '/'  vexp
                        {       if(  dcheck(  $3  )  )  YYERROR;
                                $$  =  vdiv(  $1,  $1,  $3  );  }
        |       '-'  vexp       %prec  UMINUS
                        {       $$.hi  =  -$2.lo;    $$.lo  =  -$2.hi;    }
        |       '('  vexp  ')'
                        {       $$  =  $2;  }
        ;

%%

#  define  BSZ  50        /*  buffer  size  for  floating  point  numbers  */

        /*  lexical  analysis  */

yylex(){
        register  c;

        while(  (c=getchar())  ==  ' '  ){  /*  skip  over  blanks  */  }

        if(  isupper(  c  )  ){
                yylval.ival  =  c  -  'A';
                return(  VREG  );
                }
        if(  islower(  c  )  ){
                yylval.ival  =  c  -  'a';
                return(  DREG  );
                }

        if(  isdigit(  c  )  ||  c=='.'  ){
                /*  gobble  up  digits,  points,  exponents  */

                char  buf[BSZ+1],  *cp  =  buf;
                int  dot  =  0,  exp  =  0;

                for(  ;  (cp-buf)< BSZ  ;  ++cp,c=getchar()  ){

                        *cp  =  c;
                        if(  isdigit(  c  )  )  continue;
                        if(  c  ==  '.'  ){
                                if(  dot++  ||  exp  )  return(  '.'  );    /*  will  cause  syntax  error  */
                                continue;
                                }

                        if(  c  ==  'e'  ){
                                if(  exp++  )  return(  'e'  );    /*  will  cause  syntax  error  */
                                continue;
                                }

                        /*  end  of  number  */
                        break;
                        }
                *cp  =  '\0';


                if(  (cp-buf)  >=  BSZ  )  printf(  "constant  too  long:  truncated\n"  );
                else  ungetc(  c,  stdin  );    /*  push  back  last  char  read  */
                yylval.dval  =  atof(  buf  );
                return(  CONST  );
                }
        return(  c  );
        }

INTERVAL  hilo(  a,  b,  c,  d  )  double  a,  b,  c,  d;  {
        /*  returns  the  smallest  interval  containing  a,  b,  c,  and  d  */
        /*  used  by  *,  /  routines  */
        INTERVAL  v;

        if(  a>b  )  {  v.hi  =  a;    v.lo  =  b;  }
        else  {  v.hi  =  b;    v.lo  =  a;  }

        if(  c>d  )  {
                if(  c>v.hi  )  v.hi  =  c;
                if(  d< v.lo  )  v.lo  =  d;
                }
        else  {
                if(  d>v.hi  )  v.hi  =  d;
                if(  c< v.lo  )  v.lo  =  c;
                }
        return(  v  );
        }

INTERVAL  vmul(  a,  b,  v  )  double  a,  b;    INTERVAL  v;  {
        return(  hilo(  a*v.hi,  a*v.lo,  b*v.hi,  b*v.lo  )  );
        }

dcheck(  v  )  INTERVAL  v;  {
        if(  v.hi  >=  0.  &&  v.lo  < =  0.  ){
                printf(  "divisor  interval  contains  0.\n"  );
                return(  1  );
                }
        return(  0  );
        }

INTERVAL  vdiv(  a,  b,  v  )  double  a,  b;    INTERVAL  v;  {
        return(  hilo(  a/v.hi,  a/v.lo,  b/v.hi,  b/v.lo  )  );
        }

Appendix D: Old Features Supported but not Encouraged

     This Appendix mentions synonyms and features which  are
supported  for  historical continuity, but, for various rea-
sons, are not encouraged.

1.   Literals may also be delimited by double quotes ``"''.

2.   Literals may be more than one character long.   If  all
     the  characters are alphabetic, numeric, or _, the type
     number of the  literal  is  defined,  just  as  if  the
     literal  did not have the quotes around it.  Otherwise,
     it is difficult to find the value for such literals.

     The  use  of  multi-character  literals  is  likely  to
     mislead  those  unfamiliar with Yacc, since it suggests
     that Yacc is doing a job which must be actually done by
     the lexical analyzer.

3.   Most places where % is legal, backslash  ``\''  may  be
     used.   In  particular, \\ is the same as %%, \left the
     same as %left, etc.

4.   There are a number of other synonyms:

             %<  is the same as %left
             %> is the same as %right
             %binary and %2 are the same as %nonassoc
             %0 and %term are the same as %token
             %= is the same as %prec


5.   Actions may also have the form

             ={ . . . }

     and the curly braces can be dropped if the action is  a
     single C statement.

6.   C code between %{ and %} used to be  permitted  at  the
     head  of  the rules section, as well as in the declara-
     tion section.