1 (17 Feb 00) ************************************** * * * Section 5 - Programmer's Reference * * * ************************************** This section describes features of the GAMESS implementation which are true for all machines. See the section 'hardware specifics' for information on each machine type. The contents of this section are: o Installation overview (sequential mode) o Files comprising the GAMESS distribution. o Running distributed data parallel GAMESS. parallelization history DDI process/memory schematic memory allocations and check jobs transport protocols and installation representative performance examples a very few programming details o Altering program limits o Names of source code modules o Programming conventions o Parallel broadcast identifiers o Disk files used by GAMESS o Contents of DICTNRY master file 1 Installation overview GAMESS will run on a number of different machines under FORTRAN 77 compilers. However, even given the F77 standard there are still a number of differences between various machines. For example some machines have 32 bit word lengths, requiring the use of double precision, while others have 64 bit words and are used in single precision. Although there are many types of computers, there is only one (1) version of GAMESS. This portability is made possible mainly by keeping machine dependencies to a minimum (that is, writing in F77, not vendor specific language extensions). The unavoidable few statements which do depend on the hardware are commented out, for example, with "*IBM" in columns 1-4. Before compiling GAMESS on an IBM machine, these four columns must be replaced by 4 blanks. The process of turning on a particular machine's specialized code is dubbed "activation". A semi-portable FORTRAN 77 program to activate the desired machine dependent lines is supplied with the GAMESS package as program ACTVTE. Before compiling ACTVTE on your machine, use your text editor to activate the very few machine dependent lines in ACTVTE before compiling it. Be careful not to change the DATA initialization! The task of building an executable form of GAMESS is this: activate compile link *.SRC ---> *.FOR ---> *.OBJ ---> *.EXE source FORTRAN object executable code code code image where the intermediate files *.FOR and *.OBJ are discarded once the executable has been linked. It may seem odd at first to delete FORTRAN code, but this can always be reconstructed from the master source code using ACTVTE. The advantage of maintaining only one master version is obvious. Whenever any improvements are made, they are automatically in place for all the currently supported machines. There is no need to make the same changes in a plethora of other versions. The control language needed to activate, compile, and link GAMESS on your brand of computer is probably present on the distribution tape. These files should not be used without some examination and thought on your part, but should give you a starting point. 1 There may be some control language procedures for one computer that cannot be duplicated on another. However, some general comments apply: Files named COMP will compile a single module. COMPALL will compile all modules. LKED will link together an executable image. RUNGMS will run a GAMESS job, and RUNALL will run all the example jobs. The first step in installing GAMESS should be to print the manual. If you are reading this, you've got that done! 
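Before going on to the activator itself, it may help to see what "activation" does to a line of source. The fragment below is hypothetical (the routine name WCLOCK and the *UNX tag are invented here purely for illustration; the real tags and machine dependent lines can be seen in any *.SRC file). Before activation, each machine dependent line hides behind a tag in columns 1-4, so every FORTRAN compiler treats it as a comment:

*IBM  CALL WCLOCK(TIM)
*UNX  CALL WCLOCK(TIM)

After ACTVTE is run for, say, a UNIX target, the chosen tag has been replaced by 4 blanks, so that line alone becomes live FORTRAN while the lines for all other machines remain comments:

*IBM  CALL WCLOCK(TIM)
      CALL WCLOCK(TIM)

This is why a single master source can serve every supported machine.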
The second step would be to get the source code activator compiled and linked (note that the activator must be activated manually before it is compiled). Third, you should now compile all the source modules (if you have an IBM, you should also assemble the two provided files). Fourth, link the program. Finally, run all the short tests, and very carefully compare the key results shown in the 'sample input' section against your outputs. These "correct" results are from a IBM RS/6000, so there may be very tiny (last digit) precision differences for other machines. That's it! Before starting the installation, you should read the pages decribing your computer in the 'Hardware Specifics' section of the manual. There may be special instructions for your machine. 1 Files for GAMESS *.DOC The files you are reading now. You should print these on 8.5 by 11 inch white paper, using column one as carriage control. Double sided, 3 hole, 10 pitch laser output is best! *.SRC source code for each module *.ASM IBM mainframe assembler source *.C C code used by some UNIX systems. EXAM*.INP short test jobs (see TESTS.DOC). These are files related to some utility programs: ACTVTE.CODE Source code activator. Note that you must use a text editor to MANUALLY activate this program before using it. MBLDR.* model builder (internal to Cartesian) CARTIC.* Cartesian to internal coordinates CLENMO.* cleans up $VEC groups There are files related to X windows graphics. See the file INTRO.MAN for their names. The remaining files are command language for the various machines. *.COM VAX command language. PROBE is especially useful for persons learning GAMESS. *.MVS IBM command language for MVS (dreaded JCL). *.CMS IBM command language for CMS. These should be copied to filetype EXEC. *.CSH UNIX C shell command language. These should have the "extension" omitted, and have their mode changed to executable. 1 Running Distributed Data Parallel GAMESS It is difficult to write a description of the parallel nature of GAMESS that separates what is important for the installer of GAMESS or programmers of GAMESS from the end user. Users of GAMESS should read most of this section, skipping only the most technical parts, in order to be able to effectively run this program. Efficient use of GAMESS requires an understanding of three critical issues: The first is the difference between two types of memory (replicated MEMORY and distributed MEMDDI) and how these relate to the physical memory of the computer which you are using. Second, you must understand to some extent the degree to which each type of computation scales so that the proper number of nodes is selected. Finally, most systems run two copies of GAMESS on each processor, and if you read on you will find out why this is so. Since all code needed to implement the Distributed Data Interface (DDI) is provided with the GAMESS source code distribution, the program compiles and links ready for parallel execution on all machine types. Of course, you may choose to run on only one processor, in which case GAMESS will behave as if it is a sequential code, and the full functionality of the program is available. Below you will find the following topics: parallelization history DDI process/memory schematic memory allocations and check jobs transport protocols and installation representative performance examples a very few programming details * * * We began to parallelize GAMESS in 1991 as part of the joint ARPA/Air Force piece of the Touchstone Delta project. 
Today, nearly all ab initio methods run in parallel, although some of these still have a step or two running sequentially only. Only the MP2 energy for MCSCF, and RHF CI gradients have no parallel method coded. We have not parallelized the semi-empirical MOPAC runs, and probably never will. Additional parallel work is in progress under a DoD CHSSI software initiative which "kicked off" in 1996. This has already led to the DDI-based parallel MP2 gradient program, after development of the DDI programming toolkit itself. 1 In 1991, the parallel machine of choice was the Intel Hypercube although small clusters of workstations could also be used as a parallel computer. In order to have the best blend of portability and functionality, we chose in 1991 to use the TCGMSG message passing library rather than one of the early vendor's specialized libraries. As the major companies began to market parallel machines, and as MPI version 1 emerged as a standard, we began to use MPI on some equipment in 1996, while still using the very resilient TCGMSG library on everything else. However, in June 1999, we retired our old friend TCGMSG when the message passing library used by GAMESS changed to the Distributed Data Interface, or DDI. We will discuss later the low level message transports which DDI relies on: SHMEM, TCP/IP sockets, or MPI-1. Two people have been extremely influential upon the current parallel methodology. Theresa Windus, a graduate student in the early 1990s, created the first parallel versions. Graham Fletcher, a postdoc in the late 1990s, is responsible for the addition of distributed data programming concepts. * * * DDI contains the usual parallel programming calls, such as initialization/closure, point to point messages, and the collective operations global sum and broadcast. These simple parts of DDI support all parallel methods developed in GAMESS from 1991-1999, which were based on replicated storage rather than distributed data. However, DDI also contains additional routines to support distributed memory usage. DDI attempts to exploit the entire machine in a scalable way. While our early work concentrated on exploiting the use of p processors and p disks, it required that all data in memory be replicated on every one of the p nodes. The use of memory also becomes scalable only if the data is distributed across the aggregate memory of the parallel machine. The concept of distributed memory is contained in the Remote Memory Access portion of MPI version 2, but so far MPI-2 is not available from American computer vendors. A similar concept has been implemented in the Global Array tools of Pacific Northwest National Laboratory. Basically, the idea is to provide three subroutine calls to access memory on remote nodes: PUT, GET, and ACCUMULATE. These give access to a class of memory which is assumed to be slower than local memory, but faster than disk: <--- fastest slowest ---> registers cache(s) local_memory remote_memory disks tapes <--- smallest biggest ---> 1 Because DDI accesses memory on other nodes by means of an explicit subroutine call, the programmer is aware that a message must be transmitted. This awareness of the access overhead should encourage algorithms that transfer many data items in a single message. Use of a subroutine call to reach remote memory is a recognition of the non-uniform memory access (NUMA) nature of parallel computers. In other words, the Distributed Data Interface (DDI) is an explicitly message passing implementation of global shared memory. 
In order to have one node pass data items to a second node when the second node needs them, without any significant delay, the computing job on the first node must interrupt its computation briefly to furnish the data. This type of communication is referred to as "one sided messages" or "active messages" since the first node is an unwitting participant in the process, which is driven entirely by the requirements of the second node. The Cray T3E has a library named SHMEM to support this type of one sided messages (and good hardware support for this too) so, on the T3E, GAMESS runs as a single process per CPU. Its memory image looks like this: node 0 node 1 p=0 p=1 --------------- --------------- | GAMESS | | GAMESS | | quantum | | quantum | | chem code | | chem code | --------------- --------------- | DDI code | | DDI code | Input keywords: --------------- --------------- | replicated | | replicated | <-- MEMORY | data | | data | ----------------------------------------- | | | | | | <-- MEMDDI | | distributed| | distributed | | | | data | | data | | | | | | | | | | | | | | | | | | | | | --------------- --------------- | ----------------------------------------- where the box drawn around the distributed data is meant to imply that a large data array is residing in the memory of all nodes, in this example, half on one and half on the other. At the present time, the DDI routines support only two dimensional FORTRAN arrays, organized so that columns are kept on a single node's memory. Up to 10 matrices may be distributed in this fashion. 1 Note that the input keyword MEMORY gives the amount of storage used to duplicate small matrices on every node, while MEMDDI gives the -total- distributed memory required by the job. Thus, if you are running on p nodes, the memory that is used on any given node is total on any 1 node = MEMORY + MEMDDI/p Since MEMDDI is very large, its units are in millions of words. The keyword MEMORY is in units of words (64 bit quantity) and so you must either convert units carefully or use the MWORDS synonym for MEMORY (for which the units are also millions of words). Since good execution speed requires that you not exceed the physical memory belonging to your nodes, it is important to understand that when MEMDDI is large, you will need to choose a sufficiently large number of nodes to keep the memory on any 1 node reasonable. To repeat, the DDI philosophy is to add more processors not just for their compute performance or extra disk space, but also to aggregate a very large total memory. Bigger problems will require more nodes to obtain sufficiently large total memories! We will give an example of how you can estimate the number of nodes a little ways below. If the GAMESS task running as process p=1 in the above example needs some values previously computed, it issues a call to DDI_GET. The DDI routines in process p=1 then figure out where this "patch" of data in the big rectangular distributed storage actually resides. Suppose this is on process p=0. The DDI routines in p=1 send a message to p=0 to interupt its computations, after which p=0 sends a bulk data message to process p=1's buffer. This buffer resides in part of the replicated storage of p=1, where computations can occur. Thus distributed data is accessed only by DDI_GET, DDI_PUT (its counterpart for storage of data items), and DDI_ACC (which accumulates new terms into the distributed data). 
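To make this a little more concrete, the fragment below sketches what such an access looks like from inside a compute process. DDI_GET, DDI_PUT, and DDI_ACC are the real routine names, but the argument lists shown, the DDI_CREATE/DDI_DESTROY calls bracketing them, and the little subroutine itself are assumptions made only for illustration; the comments at the beginning of the source file DDI.SRC are the authoritative description of each call, and should be consulted before writing anything like this.

      subroutine ddidem(v,nrows,ncols,jlo,jhi)
c         illustrative sketch only -- the exact ddi argument
c         lists below are assumed, check DDI.SRC before use.
      implicit double precision(a-h,o-z)
      dimension v(nrows,*)
c         create an nrows x ncols matrix, stored column-wise
c         across the aggregate MEMDDI memory of all nodes
c         (assumed form: rows, columns, returned handle)
      call ddi_create(nrows,ncols,idmat)
c         fetch columns jlo through jhi into the local buffer
c         v, which lives in this node's replicated MEMORY.
c         the DDI layer works out which node owns that patch
c         and has it sent back as a bulk data message.
      call ddi_get(idmat,1,nrows,jlo,jhi,v)
c
c         ...arithmetic is done only on the local copy in v...
c
c         add the updated patch back into distributed storage
      call ddi_acc(idmat,1,nrows,jlo,jhi,v)
c         release the distributed array when it is finished
      call ddi_destroy(idmat)
      return
      end

Note that only the distributed matrix is charged against MEMDDI; the buffer V is ordinary replicated storage, so every node holds the whole buffer but only about 1/p-th of the distributed matrix.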
Note that the quantum chemistry layer of process p=1 was sheltered from most of the details regarding which node owned the patch of data that process p=1 wanted to obtain. These details are managed by the DDI layer. It is the programmer's responsibility to minimize the number of GET/PUT/ACC calls, and to design algorithms that maximize the chance that the patches of data are actually within the local node's portion of the distributed data. Note that with the exception of DDI_ACC's simple addition, no arithmetic is done directly upon the distributed data. Instead, DDI_GET and DDI_PUT should be thought of as analogous to the FORTRAN READ and WRITE statements that transfer data between disk storage and local memory where computations may occur. 1 Since MPI-2 is unavailable, and vendor specific "one sided messaging" libraries such as the T3E's SHMEM are scarce, all other platforms adopt the following strategy. It involves two GAMESS processes running on every node: node 0 node 1 p=0 p=1 --------------- --------------- | GAMESS X| | GAMESS X| compute | quantum | | quantum | processes | chem code | | chem code | --------------- --------------- | DDI code | | DDI code | Input keyword: --------------- --------------- | replicated | | replicated | <-- MEMORY | data | | data | --------------- --------------- p=2 p=3 --------------- --------------- | GAMESS | | GAMESS | data | quantum | | quantum | servers | chem code | | chem code | --------------- --------------- | DDI code X| | DDI code X| --------------- --------------- ----------------------------------------- Input keyword: | | | | | | <-- MEMDDI | | distributed| | distributed | | | | data | | data | | | | | | | | | | | | | | | | | | | | | --------------- --------------- | ----------------------------------------- The first half of the processes do quantum chemistry, and the X indicates that they spend most of their time executing some sort of chemistry. Hence the name "compute process". Soon after execution, the second half of the processes call a servicing DDI routine which consists of an infinite loop to deal with GET, PUT, and ACC requests until such time as the job ends. The X shows that these "data servers" execute only DDI support code. This makes the data server's quantum chemistry routines the equivalent of the human appendix. The whole problem of interupts is now in the hands of the operating system, as the data servers are distinct processes. To follow the same example as before, when the compute process p=1 needs data that turns out to reside on node 0, a request is sent to the data server p=2 to transfer information back to the compute process p=1. The compute process p=0 is completely unaware that such a transaction has occurred. The formula for the memory required by any single node is unchanged, if p is the total number of nodes used, total on any 1 node = MEMORY + MEMDDI/p. 1 * * * At present, only closed shell MP2 gradients, and the ZAPT open shell MP2 energy take advantage of the new distributed memory options. We expect to adapt other methods to use this technique of memory aggregation, but currently all other types of jobs run with MEMDDI=0 and therefore use only replicated storage. In this case the data server processes still run, but are dormant because no distributed memory access is attempted. For example, in an SCF computation (no hessian or MP2 follow on) the memory needed is on the order of the square of the basis set size, for such quantities as the orbital coefficients, density, Fock, overlap matrices, and so on. 
These are simply duplicated on every node in the MEMORY region. Check runs (EXETYP=CHECK) need to run quickly, and the fastest turn around always comes on one node only. Runs which do not currently exploit MEMDDI distributed storage will formally allocate their MEMORY needs, and feel out their storage needs while skipping almost all of the real work. Since MEMORY is replicated, the amount that is needed on 1 node remains unchanged if you later do the true computation on more than 1 node. Check jobs which involve MEMDDI storage are a little bit trickier. As noted, we want to run on only 1 node to get fast turn around. However, MEMDDI is typically a large amount of memory, and this is unlikely to be available on a single node. The solution is that the data server process does not actually allocate the MEMDDI storage, instead it just remembers what you gave as input and checks to see if this will be adequate. So, you can input MEMDDI=1000 (1000 million words is equal to 1,000 * 1,000,000 * 8 = 8 GBytes and run this check job on a computer with only 256 MB of RAM. Of course, the actual computation will have to run on a large number of such processors. Let us continue with this example of a run requiring 8 GBytes of distributed data on 256 MB nodes. Suppose that MEMORY is 2500000 in this case (when MEMDDI is used, MEMORY is typically just a few million words). We need to reserve some memory for the operating system (16 MBytes, say) and for the GAMESS program and local storage (approx 16 MB, it is a big program, and the compute processes should be swapped into memory). Thus our hypothetical 256 MB node has 224 MB available, assuming no one else is running. The rest of the computations proceed in million/mega words, so the available memory per node is 224/8 = 28. We must choose the number of processors p to satisfy needed <= available MEMORY + MEMDDI/p <= free physical memory 2.5 + 1000/p <= 28 so this example requires p >= 39 compute processes. 1 One more subtle point about CHECK runs with MEMDDI is that since you are running on 1 node only, the code does not know that you wish to run the parallel MP2 algorithm instead of the sequential algorithm. You must force the CHECK job into the parallel section of the program by $system parall=.true. $end There's no harm leaving this line in for the true runs, as any job with more than one compute process is parallel regardless of the input value PARALL. * * * The next section deals with compilation and execution of GAMESS. If someone else has already figured these things out for you, you may skip ahead to the section that illustrates how the code's performance scales. The purpose of this section is to describe only how the choices for low level message passing to support the DDI subroutines impact upon installation. More explicit directions for the compiling process can be found in the first two sections of this chapter, in the readme.unix file, and notes on your machine and its compilers are to be found in the IRON.DOC chapter and in the 'comp' script. This section has the best explanation available of how to execute the program. The message traffic generated by DDI calls is sent by SHMEM on the Cray T3E, by MPI-1 on large parallel computers, and by TCP/IP sockets on networks of workstations. We cover each of these three classes of machines next. --- The Cray T3E's SHMEM library affords a single process implementation of GAMESS. 
The T3E's message passing is contained in DDIT3E.SRC, and selecting the T3E target when compiling will use only this file and link against the SHMEM library. The 'rungms' script has a special target to permit execution using this library. --- In general we expect a large vendor supplied parallel computer such as the IBM SP, SGI Origin, and large systems from companies such as Fujitsu, NEC, and Hitachi to have a MPI-1 library available. It is furthermore reasonable to assume that an expensive machine in this class has a budget sufficient to purchase the vendor's MPI library. Therefore the compute process/data server model outlined above will activate *MPI lines in the source file DDI.SRC, and link with the MPI-1 library. Each requires a special target in the 'rungms' execution script. Execution will require the vendor's "kickoff" routine to start two processes on each node, the second half of these will automatically become the data servers. 1 Since DDI is brand new, we have correct control language in the scripts 'compall', 'comp', 'lked', and 'rungms' for the IBM SP only. Sometime this summer we will try to get this finished for the Origin, and we will try to work with persons owning the Japanese systems to re-enable GAMESS on them. There's no reason to doubt the MPI-1 messaging is correct since it is fully functional on the IBM SP, and so these other machines require only control language. Or so we hope! --- The third class of machines are technical workstations running Unix. In this category we include the IBM RS/6000, Compaq AXP (yours may say Digital on the front), Sun, HP, and SGI workstations, and also Intel-based Linux systems. These are characterized by low cost, implying that even if a vendor offers MPI-1 on these systems, the software may not have been purchased. However, all of these have the TCP/IP socket library that has been in Unix for decades now. Chances are that a vendor MPI-1 runs over sockets on this class equipment anyway, so DDI might just as well talk directly to sockets. The DDI layer consists of C language routines to open the socket connections and transmit data through them, in the file DDISOC.C. Higher level concepts such as global communications are written in FORTRAN, as the *SOC lines in DDI.SRC. Besides these two files which link into the GAMESS executable, we need a way to fire up the compute processes and data servers. This is DDIKICK.C which is referred to as the "kickoff program". The compiling and linking scripts 'compall', 'comp', and 'lked' have targets for each kind of workstation, as their compilers have various options. When the compiling and linking is done you should have two programs, namely ddikick.x and gamess.01.x. The latter can be run on one or more CPUs, as it is sequential if you run on one node, and parallel whenever you run on more than one. The execution script 'rungms' has a common target of 'sockets' since all six machines we have mentioned used ddikick.x to start processes. This script has more details about how to run, but we will describe here what the arguments to the kickoff program are. The command in 'rungms' to fire up GAMESS is % ddikick.x Inputfile Exepath Exename Scratchdir \ Nhosts Hostname_0 Hostname_1 ... Hostname_N-1 The Inputfile name is not actually used, but it will be displayed by the 'ps' command so you can tell what is actually being run. Exepath is the name of the directory that contains the program to be executed. 
Exename is the name of the GAMESS executable (which might have a different "version number" than 01). The best situation is to have Exepath in an NFS mounted partition that all nodes can access, so that you have only one copy of the big GAMESS executable. However, you could carefully FTP a copy to all nodes, always using exactly the same file name, such as /usr/local/bin/gamess.01.x.
1
Note that since only one executable name is specified, only one vendor's computers can be used at a time. This limitation arises from a lack of XDR calls in the DDI layer to convert data types from one machine's internal representation of numeric data to another's.

Scratchdir is the name of a large working disk space, such as /scr/mike, in which all temporary files are placed. These files should be automatically deleted by the execution script as the job ends. If the nodes do not happen to have the same scratch area name, you can make it "feel like" they do with soft links such as "ln -s /actualname /scr". Under no circumstance should you make Scratchdir an NFS partition, as serious I/O happens to this directory. Ideally Scratchdir is a striped multi-disk Ultra-2-wide partition, with 9+ GBytes free space per node. However, the GAMESS output (stdout) and two supplemental ASCII output files PUNCH and IRCDATA can and probably should be sent over NFS to the user's permanent disk space on a file server. This serves the purpose of allowing the user to monitor the simulation as it runs, and gets the results to a place where they can be backed up once in a while.

The files written into Scratchdir should be erased by the 'rungms' script upon normal exit. Individual file names are set by the 'rungms' script's setenv commands. Some of the files are written to only by the master process running on node 0 (stdout and PUNCH are good examples of this), but other files are distributed across all nodes' Scratchdirs (scalable disk usage). The atomic integral file AOINTS is a good example of this. The rungms setenv's will define this file as xxx.F08 on node 0, where xxx is the name of the input file, and the rest of the name comes from its being FORTRAN unit 8 internally. On other nodes the file name will have the node number appended, xxx.F08.001, xxx.F08.002, and so on. Obviously, only the compute processes own disk files.

Nhosts is the number of compute processes to be run. If you want to run sequentially, just ensure Nhosts is 1.

Hostname_0 is the "master node", which handles reading the one input file, and writing the one output file. This host must be the same host that is executing the 'rungms' script, or else the environment variables that define the files don't get properly accepted. Supply a total of Nhosts Hostnames. One compute process will be started on each of these (process IDs 0,1,...Nhosts-1), and then one data server will be run on each as well (for a total of 2 times Nhosts processes). If you have SMP systems, such as a four processor machine, set Nhosts=4, and repeat its Hostname 4 times.
1
Execution is by a direct system call if the process is to run on the host that is running 'rungms' (and which is therefore also running 'ddikick.x'). Remote hosts are reached by the command 'rsh', so users will need to use a .rhosts file to authenticate themselves (unless your system is using some replacement for this such as Kerberos). The .rhosts file needs to be in your home directory, and looks like this:
   si.fi.ameslab.gov mike
   ga.fi.ameslab.gov mike
   ge.fi.ameslab.gov mike
   ...and so on...
except your user name is probably not 'mike'. Note that ddikick.x has no mechanism to support the user name on one of the machines being 'schmidt' instead of 'mike'. Assuming that all goes well, the job will terminate orderly by each compute process telling its local data server to cease execution. Upon the successful suicide of the data server, the compute process reports to the dormant (but still running) ddikick.x that it is ready to end. When all compute processes have checked in, the kickoff program informs each that it is OK to stop, and following this ddikick.x exits. Abnormal terminations are of course less predictable. However, when ddikick.x is informed by the system that one of its children has died, it tries to send a kill command to all its other children, and so hopefully all processes are then eliminated. However, depending on the exact circumstances in which the abnormal end occurs, the system may have a few processes left over for manual termination. If you decide that a GAMESS job should be killed, use the Unix 'kill' command to take out either the compute or data server process on the master node, or one of the 'rsh' processes that have launched GAMESS onto the remote nodes. Do not kill ddikick.x directly, instead stop any of these child processes, so that ddikick.x will terminate all the other processes for you. Before ending this section on DDI over TCP/IP sockets on workstation class machines, we should comment on the network requirements. It is not reasonable to run jobs that use MEMDDI distributed memory on 10 megabit/second Ethernet since the bandwidth is just too small. However, if you use only the replicated MEMORY storage you should be able to get by on this old network cable. As will be shown below, a switched Fast Ethernet is capable of decent performance on such 100 megabit/second cables. Both the host adapters and the switch itself are now inexpensive. Gigabit ethernet (1000 mbit/sec) is pricy, and although the bandwidth is good, the latency remains too large. 1 * * * This section describes the way in which the various quantum chemistry computations run in parallel, and shows some typical performance data. This should give you as the user some idea how many nodes can be efficiently used for various SCFTYP and RUNTYP jobs. There's a different subsection for 4 different kinds of runs, followed by a summary. Many of the performance data you will see below were obtained on a 16 node Intel Pentium II Linux (Beowulf-type) cluster costing $49,000, of which $3,000 went into the switched Fast Ethernet component. 512 MB/node means this cluster has an aggregate memory of 8 GB. For more details, see http://www.msg.ameslab.gov/GAMESS/page/not/written/yet --- The HF wavefunctions can be evaluated in parallel using either conventional disk storage of the integrals, or via direct recomputation of the integrals. Assuming the I/O speed of your system is good, direct SCF is *always* slower than disk storage. Some experimenting will show which is more effective on your hardware. As an example of the scaling performance of RHF, ROHF, UHF, or GVB jobs that involve only computation of the energy or its gradient, we include here a timing table from the 16 node PC cluster. The molecule is luciferin, which together with the enzyme luciferase is involved in firefly light production. The chemical formula is C11N2S2O3H8, and RHF/6-31G(d) has 294 atomic orbitals. There's no molecular symmetry. 
The run is done as direct SCF because the total amount of AO integrals is 3.8 GBytes, and Linux does not permit files to exceed 2 GBytes (of course, a run on 2 or more nodes can use conventional integral storage, since this disk file will then be distributed across all available disks). The CPU timing data are

                 p=1    p=2    p=3    p=4    p=8   p=12   p=16
 1e- ints        1.6    0.8    0.6    0.4    0.3    0.3    0.1
 Huckel guess     22     18     16     14     14     12     12
 15 RHF iters   5536   2802   1891   1436    753    519    406
 properties      7.5    7.3    7.3    7.3    7.8    7.0    7.0
 1e- gradient   11.5    5.7    4.1    2.7    1.4    1.0    0.8
 2e- gradient   1339    658    437    328    105    110     83
                ----   ----   ----   ----   ----   ----   ----
 total CPU      6917   3491   2357   1790    941    649    509 seconds
 total wall     6924   3540   2408   1820    979    696    559 seconds

Note that direct SCF should run with the wall time very close to the CPU time as there is essentially no I/O and not that much communication (MEMDDI storage is not used by this kind of run). Wall clock speedup from 1 to 16 nodes is 12.4, and for this type of run we frequently use 8, 16, or 32 nodes depending on availability.
1
An idea of the variation in time with basis set size can be gained from the following runs made by Johannes Grotendorst, Juelich, Germany, on a Cray T3E or Intel Paragon, using 32 nodes on either. These data were collected in about 1996, pre-DDI days, and as you can see, before all the Paragons were unplugged. The data is still representative. Each molecule is an asymmetric organic compound, computing the RHF energy and gradient, using the 6-31G(d) basis set:

                                        T3E    Paragon
   taxol,     1032 AOs, CPU TIME =    546.8      --      minutes
   cAMP,       356 AOs, CPU TIME =     14.6     106.4
   luciferin,  294 AOs, CPU TIME =      8.9      67.2
   nicotine,   208 AOs, CPU TIME =      3.8      26.1
   thymine,    149 AOs, CPU TIME =      1.5      12.2
   alanine,    104 AOs, CPU TIME =      0.5       5.2
   glycine,     85 AOs, CPU TIME =      0.3       3.2

If you are interested in an explanation of how the parallel SCF is implemented, see the main GAMESS paper,
   M.W.Schmidt, K.K.Baldridge, J.A.Boatz, S.T.Elbert,
   M.S.Gordon, J.H.Jensen, S.Koseki, N.Matsunaga, K.A.Nguyen,
   S.J.Su, T.L.Windus, M.Dupuis, J.A.Montgomery
   J.Comput.Chem. 14, 1347-1363(1993)

    ---

For the next type of computation, we discuss the MP2 correction. For UHF + MP2, only the second order energy can be computed, and the parallelization strategy is similar to the replicated MEMORY code used by the MCSCF program. This is described below. The MCSCF + MP2 code does not run parallel, unfortunately. So here we are describing the closed shell RHF + MP2 energy or gradient, or the ROHF + ZAPT-type MP2 energy. These two types of computations make use of the MEMDDI distributed data region. The example is a benzoquinone precursor to hongconin, a cardioprotective natural product. The formula is C11O4H10, and 6-31G(d) has 245 AOs. There are 39 valence orbitals included in the MP2 treatment, and 15 core orbitals. MEMDDI must be 156 million words, so the same type of memory computation that was used above tells us that our 512 MB/node PC cluster must have at least three processors to aggregate the required MEMDDI. MOREAD was used to provide converged RHF orbitals, so only 3 RHF iterations are performed.
The timing data are CPU and wall times (seconds) in the 1st/2nd lines: 1 p=3 p=4 p=8 p=12 p=16 RHF iters 208 157 83 58 46 214 163 91 68 55 MP2 step 8,935 6,966 3,417 2,283 1,724 12,529 10,046 5,763 4,013 3,056 2e- grad 2,181 1,712 838 552 420 2,490 1,981 991 677 499 total CPU 11,335 8,846 4,347 2,902 2,199 total wall 15,248 12,206 6,859 4,772 3,624 3-->12 4-->16 CPU speedup 3.91 4.02 wall speedup 3.20 3.37 On a T3E machine with 600 MHz nodes and 256 MB/node, we should have been able to run on as few as 6 nodes, but the available data for the same calculation starts from 8: p=8 p=32 p=128 total CPU 3108 814 282 total wall 3154 850 340 Wall clock performance is considerably better as you would expect on a machine where very good communications exist. Wall speedup for a 4x or 16x increase in node number is 3.7 and 9.3. Larger molecules which still have some computation left at 128 nodes do better than this. We often use 64 or 128 nodes for this type of run. As noted, the number of nodes is more influenced by a need to aggregate the necessary total MEMDDI, more than by concerns about scalability. MEMDDI is typically large for MP2 parallel runs, as it is proportional to the number of occupied orbitals squared times the number of AOs squared. For more details on the distributed data parallel MP2 program, see G.D.Fletcher, A.P.Rendell, P.Sherwood Mol.Phys. 91, 431-438(1997) G.D.Fletcher, M.W.Schmidt, M.S.Gordon Adv.Chem.Phys. 110, 267-294 (1999) G.D.Fletcher, M.W.Schmidt, B.M.Bode, M.S.Gordon Comput.Phys.Commun. submitted 1 --- The next type of computation we will consider is the analytic computation of the hessian for RHF, ROHF, or GVB wavefunctions. The current implementation of the response equations is in the MO basis, and since the solver is not parallelized, so when this stage of the computation is reached, work is done only by process 0 while the other processes sit idle. This is a sequential bottleneck. The integral transformation is parallelized according to the same strategy as described below for MCSCF jobs. Thus our early paper on analytic hessians which ran a sequential transformation no longer correctly describes the program. Thus the total scalability is better than was shown in T.L.Windus, M.W.Schmidt, M.S.Gordon Chem.Phys.Lett. 216, 375-379(1993) The example we will consider is the same SbC4O2NH4 test that was used in this early paper. We use the 3-21G* basis (110 AOs, 2 million words used). The hardware is the same PC Linux cluster. Table 1 of the reference should be: p=1 p=2 p=3 p=4 1e- ints 0.15 0.09 0.09 0.07 seconds Huckel 4.23 3.97 4.05 4.15 2e- ints 14.84 7.26 4.89 3.68 RHF iters 35.17 19.29 13.89 10.84 properties 0.39 0.40 0.38 0.42 dupl.2e- ints N/A 14.75 14.71 14.78 int.transf. 232.89 123.91 85.01 68.20 1e- hess 4.47 2.84 2.04 1.21 2e- hess 668.13 334.60 225.50 168.31 CPHF 224.48 210.80 206.21 192.29 ------- ------ ------ ------ total CPU 1185.0 718.1 557.0 464.2 total wall 1188 733 600 494 Clearly, the final response equation (CPHF) step is a sequential bottleneck, as is the fact that the orbital hessian in this step is stored entirely on the disk space of node 0. Since the integral transformation is run in replicated MEMORY rather than distributing this, and since it also needs a duplicated AO integral file be stored on every node, the code is clearly not scalable to very many processors. Typically we would not request more than 3 or 4 processors for an analytic hessian job. --- As the final example, we turn to MCSCF energy/gradient runs. 
The parallelization of the integral transformation was done before the introduction of the distributed data concept, and hence this kind of job uses only replicated MEMORY at present. In addition, the determinant CI step is not yet converted to parallel execution, so if you run on more than one node, MCSCF jobs must use $MCSCF CISTEP=GUGA.
1
Our work on parallelization of MCSCF was described in the paper
   T.L.Windus, M.W.Schmidt, M.S.Gordon
   Theoret.Chim.Acta 89, 77-88(1994).
This points out that MCSCF has many bottlenecks, of which the most important are the integral transformation, the optimization of the CI coefficients, and the optimization of the orbital coefficients. The amount of time spent in each depends on the number of atomic orbitals, the size of the active space, and the number of filled MOs. Since this parallelization paper came out, we have added a new converger for MCSCF, namely SOSCF, and here we will show the performance of this default converger on a fairly large example. Before doing so, we point out again that our new CI optimization step over determinants is faster on one node, but it will refuse to run in parallel, so the data shown use the older GUGA CSF program.

The integral transformation step is run in replicated MEMORY, using "passes" over the occupied orbitals. The different passes involve different occupied orbitals, so the work is trivially distributed across all nodes. A better description of this can be found in the reference given. Basically the same approach is currently being used during analytic hessians or UHF MP2. The method distributes CPU time fairly well, but it does not make efficient use of memory, as every node must have the same amount of memory as is required to run on 1 node (on the order of the basis set size cubed).

There are two strategies to govern the placement of the AO integrals, which must be available to every node. One is to store the file on the disks in a distributed fashion, and as each node reads its subset, to broadcast these on the communication channel to all other nodes. This is AOINTS=DIST in the $TRANS or $MP2 input, and is appropriate only for machines with good communications. When disk I/O is faster than the communications, such as is likely for workstation clusters, the entire 2e- AO integral file is duplicated on every node (AOINTS=DUP). This is not scalable either in generation of integrals, or in their disk storage, but it takes pressure off the communications channel. You may want to experiment with the AOINTS keyword to see if the default for your machine is well chosen.

The example we choose is at a transition state for the water molecule assisted proton transfer in the first excited state of 7-azaindole. The formula is C7N2H6(H2O), there are 190 active orbitals, and the active space is the 10 pi electrons in 9 pi orbitals of the azaindole portion. There are 5,292 CSFs. See Figure 6 of G.M.Chaban, M.S.Gordon J.Phys.Chem.A 103, 185-189(1999) if you are interested in this chemistry. The timing data (seconds) from our PC Linux cluster are
1
                    p=1      p=2      p=3      p=4
 dup. 2e- ints    373.2    381.2    381.4    389.9
 DRT                0.4      0.4      0.4      0.4
 transform.       448.0    237.4    181.3    139.1
 Hamilton.          3.1      2.5      2.4      2.3
 diag. H           27.4     15.1     11.1      9.1
 2e- dens.          2.4      2.3      2.2      2.2
 orb. update       60.1     56.7     56.2     56.4
 iters 2-16      6723.3   4429.9   3554.2   2885.7
 1e- grad           6.2      2.6      3.2      1.4
 2e- grad         912.3    463.9    327.3    242.8
                 ------   ------   ------   ------
 total CPU       9485.1   5599.1   4526.9   3736.1
 total wall      15,703    9,537    7,763    6,192

The first iteration is broken down into its primary steps from the integral transformation to the orbital update, inclusive. Typically we would not use more than 4 to 8 processors for a parallel MCSCF job.

    ---

In summary, most ab initio computations will run in less time on more than one node. However, some things can be run only on 1 node, namely
   semi-empirical runs
   determinant based MCSCF
   MCQDPT2 perturbation correction to MCSCF
   RHF+CI gradient
   PCM solvation model
Several steps run with little or no speedup, and thus represent sequential bottlenecks that limit scalability. They do not prevent jobs from running, but restrict the total number of nodes that can be effectively used:
   HF: solution of SCF equations
   HF analytic hessians: the coupled Hartree-Fock
   MCSCF: orbital improvement steps
   MCSCF/CI: Hamiltonian and 2e- density matrix
   energy localizations: the orbital localization step
   transition moments/spin-orbit: the final property step
Future versions of GAMESS will address these bottlenecks.

A short summary of the useful number of nodes (based on data like the above) would be approximately
   RHF, ROHF, UHF, GVB energy and/or gradient    16-32+
   analytic hessians for these                     3-4
   RHF + MP2 gradient, ZAPT energy               64-128+
   UHF + MP2 energy                                8-16
   GUGA CI or MCSCF                                4-8
1
                         * * *

The final section of this description of DDI is a very sketchy introduction to programming. At this point, if you are interested only in using the program, you may cease reading.

DDI has subroutine calls to do ordinary message passing parallel programming. These are calls to initialize and terminate the various processes, point to point send and receive, and collective operations like global sum and the broadcast. Not necessarily every routine one would expect is included, as we just programmed what we needed in GAMESS. In addition we have calls for distributed data manipulation, which include creation and destruction of the arrays, the put, get, and accumulation operations mentioned above, and a routine to query what part of the distributed array is stored locally. The full API for DDI is in comments in the beginning of the source file DDI.SRC, and you can then look at the individual routines to see what the calling arguments do. The source code in DDI.SRC, and the calls to it from GAMESS, are what serve as documentation for the use of DDI at present. Every DDI routine is a subroutine, not a function, and each begins with "DDI_", so you can easily locate all parallel constructs in GAMESS by a search for "CALL DDI_".

We don't really intend for DDI to be a general parallel programming library, rather it's a part of GAMESS. For example, we need to link a small program using DDI against some GAMESS objects to have memory management code, and so forth. These are ddi.o, ddisoc.o, unport.o, and zunix.o and maybe a timing routine. We close with a simple (and not very useful) program that broadcasts information to all nodes. It illustrates the proper initialization and closure of the DDI library, requests replicated MEMORY but not distributed MEMDDI, and so does not illustrate distributed data programming. The idea was to keep it a one-pager.
1
      program bcast
      implicit double precision(a-h,o-z)
      parameter (maxmsg=500000)
      real tarray(2)
      common /fmcom / xx(1)
      data exetyp/8hRUN     /
c         open DDI, tell it integer word length is 32 bit
      nwdvar=2
      call ddi_pbeg(nwdvar)
c         request allocation of only replicated memory
      memrep=maxmsg
      memddi=0
      call ddi_memory(memrep,memddi,exetyp)
      call setfm(memrep)
c         start a clock so we can time this
      xstart = etime(tarray)
c         learn which of the processes I am
      call ddi_nproc(nproc,me)
      master=0
      if(me.eq.master) write(6,9000) nproc
c         dynamically allocate a replicated array
      call valfm(loadfm)
      lbuff = loadfm + 1
      last  = lbuff + maxmsg
      need  = last - loadfm - 1
      call getfm(need)
c         fill it up with nothing but ones
      if(me.eq.master) then
         do i=1,maxmsg
            xx(lbuff-1+i) = 1.0d+00
         end do
      end if
c         send it to all the other compute processes
      call ddi_bcast(102,'F',xx(lbuff),maxmsg,master)
c         we are now done with the replicated storage
      call retfm(need)
c         so, all we've done is time a broadcast.
      xstop = etime(tarray)
      write(6,9010) me,xstop-xstart
c         close the DDI library gracefully
      istat=0
      call ddi_pend(istat)
      stop
 9000 format(1x,'running',i4,' processes.')
 9010 format(1x,'node',i4,' total job time between',
     *       ' pbeg/pend is',f7.2)
      end
1
                    Altering program limits

Almost all arrays in GAMESS are allocated dynamically, but some variables must be held in common as their use is ubiquitous. An example would be the common block which holds the basis set. The following Unix script, which we call 'mung', changes the PARAMETER statements that set various limitations:

#!/bin/csh
#
#   automatically change GAMESS' built-in dimensions
#
chdir /u3/mike/gamess/source
#
foreach FILE (*.src)
   set FILE=$FILE:r
   echo ===== redimensioning in $FILE =====
   echo "C 01 JAN 98 - SELECT NEW DIMENSIONS" \
         > $FILE.munged
   sed -e "/MXATM=500/s//MXATM=100/" \
       -e "/MXFRG=50/s//MXFRG=1/" \
       -e "/MXDFG=5/s//MXDFG=1/" \
       -e "/MXPT=100/s//MXPT=1/" \
       -e "/MXAOCI=768/s//MXAOCI=768/" \
       -e "/MXRT=100/s//MXRT=100/" \
       -e "/MXSP=100/s//MXSP=1/" \
       -e "/MXTS=2500/s//MXTS=1/" \
       -e "/MXSH=1000/s//MXSH=1000/" \
       -e "/MXGSH=30/s//MXGSH=30/" \
       -e "/MXGTOT=5000/s//MXGTOT=5000/" \
       $FILE.src >> $FILE.munged
   mv $FILE.munged $FILE.src
end
exit

In this script,
   MXATM = max number of atoms
   MXFRG = max number of effective fragment potentials
   MXDFG = max number of different effective fragments
   MXPT  = max number of effective fragment points
   MXAOCI= max number of basis functions in CI/MCSCF
   MXRT  = max number of CI roots saved by $GUGDIA
   MXSP  = max number of spheres (sfera) in PCM
   MXTS  = max number of tesserae in PCM
   MXSH  = max number of symmetry unique shells
   MXGSH = max number of Gaussians per shell
   MXGTOT= max number of symmetry unique Gaussians

The script shows how to -minimize- memory use, by a small decrease in the number of atoms, and turning off the effective fragment and PCM dimensioning. Little can be saved by reducing the other adjustable parameters. Of course, the 'mung' script shown above could also be used to increase the dimensions...
1
                 Names of source code modules

The source code for GAMESS is divided into a number of sections, called modules, each of which does related things, and is a handy size to edit. The following is a list of the different modules, what they do, and notes on their machine dependencies.
machine module description dependency ------- ------------------------- ---------- ALDECI Ames Lab determinant full CI code 1 BASECP SBKJC and HW valence basis sets BASEXT DH, MC, 6-311G extended basis sets BASHUZ Huzinaga MINI/MIDI basis sets to Xe BASHZ2 Huzinaga MINI/MIDI basis sets Cs-Rn BASN21 N-21G basis sets BASN31 N-31G basis sets BASSTO STO-NG basis sets BLAS level 1 basic linear algebra subprograms CPHF coupled perturbed Hartree-Fock 1 CPROHF open shell/TCSCF CPHF 1 DDI message passing library interface code 9 DDIT3E message passing code (used on T3E only) 9 DELOCL delocalized coordinates DFTSTB dummy subroutines DRC dynamic reaction coordinate ECP pseudopotential integrals ECPDER pseudopotential derivative integrals ECPHW Hay/Wadt effective core potentials ECPLIB initialization code for ECP ECPSBK Stevens/Basch/Krauss/Jasien/Cundari ECPs EIGEN Givens-Householder, Jacobi diagonalization EFDRVR fragment only calculation drivers EFELEC fragment-fragment interactions EFGRD2 2e- integrals for EFP numerical hessian EFGRDA ab initio/fragment gradient integrals EFGRDB " " " " " EFGRDC " " " " " EFINP effective fragment potential input EFINTA ab initio/fragment integrals EFINTB " " " " EFPAUL effective fragment Pauli repulsion EFPCOV EFP style QM/MM boundary code FFIELD finite field polarizabilities FRFMT free format input scanner GAMESS main program, single point energy and energy gradient drivers, misc. GRADEX traces gradient extremals GRD1 one electron gradient integrals GRD2A two electron gradient integrals 1 GRD2B specialized sp gradient integrals GRD2C general spdfg gradient integrals (continued...) 1 machine module description dependency ------- ------------------------- ---------- GUESS initial orbital guess GUGDGA Davidson CI diagonalization 1 GUGDGB " " " 1 GUGDM 1 particle density matrix GUGDM2 2 particle density matrix 1 GUGDRT distinct row table generation GUGEM GUGA method energy matrix formation 1 GUGSRT sort transformed integrals 1 GVB generalized valence bond HF-SCF 1 HESS hessian computation drivers HSS1A one electron hessian integrals HSS1B " " " " HSS2A two electron hessian integrals 1 HSS2B " " " " INPUTA read geometry, basis, symmetry, etc. INPUTB " " " " INPUTC " " " " INT1 one electron integrals INT2A two electron integrals 1 INT2B " " " IOLIB input/output routines,etc. 2 LAGRAN CI Lagrangian matrix 1 LOCAL various localization methods 1 LOCCD LCD SCF localization analysis LOCPOL LCD SCF polarizability analysis MCCAS FOCAS/SOSCF MCSCF calculation 1 MCQDPT multireference perturbation theory 1 MCQDWT weights for MR-perturbation theory MCQUD QUAD MCSCF calculation 1 MCSCF FULLNR MCSCF calculation 1 MCTWO two electron terms for FULLNR MCSCF 1 MOROKM Morokuma energy decomposition 1 MP2 2nd order Moller-Plesset 1 MP2DDI distributed data parallel MP2 MP2GRD CPHF and density for MP2 gradients 1 MPCDAT MOPAC parameterization MPCGRD MOPAC gradient MPCINT MOPAC integrals MPCMOL MOPAC molecule setup MPCMSC miscellaneous MOPAC routines MTHLIB printout, matrix math utilities NAMEIO namelist I/O simulator ORDINT sort atomic integrals 1 PARLEY communicate to other programs PCM Polarizable Continuum Model setup PCMCAV PCM cavity creation PCMDER PCM gradients PCMDIS PCM dispersion energy PCMPOL PCM polarizabilities PCMVCH PCM repulsion and escaped charge (continued...) 
1 machine module description dependency ------- ------------------------- ---------- PRPEL electrostatic properties PRPLIB miscellaneous properties PRPPOP population properties QMMM temporary dummy routines RESC relativistic elim. small component RHFUHF RHF, UHF, and ROHF HF-SCF 1 RXNCRD intrinsic reaction coordinate RYSPOL roots for Rys polynomials SCFMI molecular interaction SCF code SCFLIB HF-SCF utility routines, DIIS code SCRF self consistent reaction field SOBRT full Breit-Pauli spin-orbit compling SOFFAC spin-orbit matrix element form factors SOZEFF 1e- spin-orbit coupling terms STATPT geometry and transition state finder SURF PES scanning SYMORB orbital symmetry assignment SYMSLC " " " TDHF time-dependent Hartree-Fock NLO 1 TRANS partial integral transformation 1 TRFDM2 two particle density backtransform 1 TRNSTN CI transition moments TRUDGE nongradient optimization UNPORT unportable, nasty code 3,4,5,6,7,8 VECTOR vectorized version routines 10 VIBANL normal coordinate analysis VSCF anharmonic frequencies ZHEEV complex matrix diagonalization ZMATRX internal coordinates UNIX versions use the C code ZUNIX.C for dynamic memory. Most UNIX versions use DDISOC.C to talk to TCP/IP sockets, and DDIKICK.C to load GAMESS for execution. The IBM mainframe version uses the following assembler language routines: ZDATE.ASM, ZTIME.ASM. The machine dependencies noted above are: 1) packing/unpacking 2) OPEN/CLOSE statments 3) machine specification 4) fix total dynamic memory 5) subroutine walkback 6) error handling calls 7) timing calls 8) LOGAND function 9) message passing calls. DDI.SRC has both socket calls (*SOC) and MPI-1 calls (*MPI) programmed. 10) vector library calls 1 Programming Conventions The following "rules" should be adhered to in making changes in GAMESS. These rules are important in maintaining portability, and should be strictly adhered to. Rule 1. If there is a way to do it that works on all computers, do it that way. Commenting out statements for the different types of computers should be your last resort. If it is necessary to add lines specific to your computer, PUT IN CODE FOR ALL OTHER SUPPORTED MACHINES. Even if you don't have access to all the types of supported hardware, you can look at the other machine specific examples found in GAMESS, or ask for help from someone who does understand the various machines. If a module does not already contain some machine specific statements (see the above list) be especially reluctant to introduce dependencies. Rule 2. a) Use IMPLICIT DOUBLE PRECISION(A-H,O-Z) specification statements throughout. b) All floating point constants should be entered as if they were in double precision. The constants should contain a decimal point and a signed two digit exponent. A legal constant is 1.234D-02. Illegal examples are 1D+00, 5.0E+00, and 3.0D-2. c) Double precision BLAS names are used throughout, for example DDOT instead of SDOT. The source code activator ACTVTE will automatically convert these double precision constructs into the correct single precision expressions for machines that have 64 rather than 32 bit words. Rule 3. FORTRAN 77 allows the use of generic functions. Thus the routine SQRT should be used in place of DSQRT, as this will automatically be given the correct precision by the compilers. Use ABS, COS, INT, etc. Your compiler manual will tell you all the generic names. Rule 4. Every routine in GAMESS begins with a card containing the name of the module and the routine. An example is "C*MODULE xxxxxx *DECK yyyyyy". 
The second star is in column 18. Here, xxxxxx is the name of the module, and yyyyyy is the name of the routine. Furthermore, the individual decks yyyyyy are stored in alphabetical order. This rule is designed to make it easier for a person completely unfamiliar with GAMESS to find routines. The trade off for this is that the driver for a particular module is often found somewhere in the middle of that module. 1 Rule 5. Whenever a change is made to a module, this should be recorded at the top of the module. The information required is the date, initials of the person making the change, and a terse summary of the change. Rule 6. No lower case characters, no more than 6 letter variable names, no imbedded tabs, statements must lie between columns 7 and 72, etc. In other words, old style syntax. * * * The next few "rules" are not adhered to in all sections of GAMESS. Nonetheless they should be followed as much as possible, whether you are writing new code, or modifying an old section. Rule 7. Stick to the FORTRAN naming convention for integer (I-N) and floating point variables (A-H,O-Z). If you've ever worked with a program that didn't obey this, you'll understand why. Rule 8. Always use a dynamic memory allocation routine that calls the real routine. A good name for the memory routine is to replace the last letter of the real routine with the letter M for memory. Rule 9. All the usual good programming techniques, such as indented DO loops ending on CONTINUEs, IF-THEN-ELSE where this is clearer, 3 digit statement labels in ascending order, no three branch GO TO's, descriptive variable names, 4 digit FORMATs, etc, etc. The next set of rules relates to coding practices which are necessary for the parallel version of GAMESS to function sensibly. They must be followed without exception! Rule 10. All open, rewind, and close operations on sequential files must be performed with the subroutines SEQOPN, SEQREW, and SEQCLO respectively. You can find these routines in IOLIB, they are easy to use. 1 Rule 11. All READ and WRITE statements for the formatted files 5, 6, 7 (variables IR, IW, IP, or named files INPUT, OUTPUT, PUNCH) must be performed only by the master task. Therefore, these statements must be enclosed in "IF (MASWRK) THEN" clauses. The MASWRK variable is found in the /PAR/ common block, and is true on the master process only. This avoids duplicate output from the other processes. At the present time, all other disk files in GAMESS also obey this rule. Rule 12. All error termination is done by means of "CALL ABRT" rather than a STOP statement. Since this subroutine never returns, it is OK to follow it with a STOP statement, as compilers may not be happy without a STOP as the final executable statment in a routine. 1 List of parallel broadcast identifiers GAMESS uses DDI calls to pass messages between the parallel processes. Every message is identified by a unique number, hence the following list of how the numbers are used at present. If you need to add to these, look at the existing code and use the following numbers as guidelines to make your decision. All broadcast numbers must be between 1 and 32767. 
1

           List of parallel broadcast identifiers

    GAMESS uses DDI calls to pass messages between the
parallel processes.  Every message is identified by a
unique number, so the list below records how the numbers
are currently used.  If you need to add new messages, look
at the existing code and use this list as a guideline for
choosing numbers.  All broadcast numbers must be between 1
and 32767.

            20  :  Parallel timing
     100 - 199  :  DICTNRY file reads
     200 - 204  :  Restart info from the DICTNRY file
     210 - 214  :  Pread
     220 - 224  :  PKread
           225  :  RAread
           230  :  SQread
     250 - 265  :  Nameio
     275 - 310  :  Free format
     325 - 329  :  $PROP group input
     350 - 354  :  $VEC group input
     400 - 424  :  $GRAD group input
     425 - 449  :  $HESS group input
     450 - 474  :  $DIPDR group input
     475 - 499  :  $VIB group input
     500 - 599  :  matrix utility routines
     800 - 830  :  Orbital symmetry
           900  :  ECP 1e- integrals
           910  :  1e- integrals
     920 - 975  :  EFP and SCRF integrals
     980 - 999  :  property integrals
    1000 - 1025 :  SCF wavefunctions
    1030 - 1040 :  reserved for Kurt
          1050  :  Coulomb integrals
    1200 - 1215 :  MP2
    1300 - 1320 :  localization
    1495 - 1499 :  reserved for Jim Shoemaker
          1500  :  One-electron gradients
    1505 - 1599 :  EFP and SCRF gradients
    1600 - 1602 :  Two-electron gradients
    1605 - 1620 :  One-electron hessians
    1650 - 1665 :  Two-electron hessians
    1700 - 1750 :  integral transformation
          1800  :  GUGA sorting
    1850 - 1865 :  GUGA CI diagonalization
    1900 - 1910 :  GUGA DM2 generation
    2000 - 2010 :  MCSCF
    2100 - 2120 :  coupled perturbed HF
    2300 - 2399 :  reserved for spin-orbit
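    As an illustration of how a message number appears in
code, the hypothetical fragment below broadcasts an array
from the master process.  The broadcast routine's name and
argument order (message tag, data type, buffer, length,
origin) are assumptions patterned after DDI.SRC and should
be verified there, the /PAR/ layout is likewise assumed,
and 2500 is simply an example of a tag chosen from an
unused range.

C     hypothetical sketch -- verify the broadcast routine's name
C     and argument order against DDI.SRC before copying this
      SUBROUTINE DEMOBC(N,X)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      DIMENSION X(N)
      LOGICAL GOPARR,DSKWRK,MASWRK
      COMMON /PAR/ ME,MASTER,NPROC,IBTYP,IPTIM,GOPARR,DSKWRK,MASWRK
C
C         broadcast N values of X from the master process to all
C         others, using a message number chosen from an unused
C         range (2500 here is only an example)
      IF (GOPARR) CALL DDI_BCAST(2500,'F',X,N,MASTER)
      RETURN
      END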
1

                 Disk files used by GAMESS

    These files must be defined by your control language
for executing GAMESS.  For example, on UNIX the "name"
field shown below should be set in the environment to the
actual file name to be used.  Most runs will open only a
subset of the files shown below, with only files 5, 6, 7,
and 10 existing in every run.  Only files 4, 5, 6, and 7
contain formatted data.

 unit  name     contents
 ----  ----     --------
   4   IRCDATA  archive results punched by IRC runs,
                restart data for numerical HESSIAN runs,
                summary of results for DRC.
   5   INPUT    Namelist input file.  This MUST be a disk
                file, as GAMESS rewinds this file often.
   6   OUTPUT   Print output (FT06F001 on IBM mainframes).
                If not defined, UNIX systems will use the
                standard output for this file.
   7   PUNCH    Punch output.  A copy of the $DATA deck,
                orbitals for every geometry calculated,
                hessian matrix, normal modes from FORCE,
                properties output, IRC restart data, etc.
   8   AOINTS   Two e- integrals in AO basis
   9   MOINTS   Two e- integrals in MO basis
  10   DICTNRY  Master dictionary, for contents see below.
  11   DRTFILE  Distinct row table file for -CI- or -MCSCF-
  12   CIVECTR  Eigenvector file for -CI- or -MCSCF-
  13   CASINTS  semi-transformed ints for FOCAS/SOSCF MCSCF,
                scratch file during spin-orbit coupling
  14   CIINTS   Sorted integrals for -CI- or -MCSCF-
  15   WORK15   GUGA loops for Hamiltonian diagonal;
                ordered two body density matrix for MCSCF;
                scratch storage during GUGA Davidson diag;
                Hessian update info during 2nd order SCF;
                [ia|jb] integrals during MP2 gradient
  16   WORK16   GUGA loops for Hamiltonian off-diagonal;
                unordered GUGA DM2 matrix for MCSCF;
                orbital hessian during MCSCF;
                orbital hessian for analytic hessian CPHF;
                orbital hessian during MP2 gradient CPHF;
                two body density during MP2 gradient
1
                 (disk files, continued)

 unit  name     contents
 ----  ----     --------
  17   CSFSAVE  CSF data for state to state transition runs.
  18   FOCKDER  derivative Fock matrices for analytic hess
  20   DASORT   Sort file for various -MCSCF- or -CI- steps;
                also used by SCF level DIIS
  23   JKFILE   J and K "Fock" matrices for -GVB-;
                Hessian update info during SOSCF MCSCF;
                orbital gradient and hessian for QUAD MCSCF
  24   ORDINT   sorted AO integrals;
                integral subsets during Morokuma analysis
  25   EFPIND   electric field integrals for EFP
  26   PCMDATA  gradient and D-inverse data for PCM runs
  27   PCMINTS  normal projections of PCM field gradients
  30   DAFL30   direct access file for FOCAS MCSCF's DIIS;
                form factor sorting for Breit spin-orbit

    Files 50-63 are used primarily for MCQDPT runs;
files 51-54 are also used during spin-orbit runs.

  50   MCQD50   Direct access file for MC-QDPT, its
                contents are documented in source code.
  51   MCQD51   One-body coupling constants for CAS-CI and
                other routines
  52   MCQD52   One-body coupling constants for perturb.
  53   MCQD53   One-body coupling constants extracted from
                MCQD52
  54   MCQD54   One-body coupling constants extracted
                further from MCQD52
  55   MCQD55   Sorted 2-e integrals
  56   MCQD56   Half transformed 2-e integrals
  57   MCQD57   Sorted half transformed 2-e integrals of
                the (ii/aa) type
  58   MCQD58   Sorted half transformed 2-e integrals of
                the (ei/aa) type
  59   MCQD59   2-e integrals in MO basis of the (ii/ii),
                (ei/ii), (ei/ei) types
  60   MCQD60   2-e integrals in MO basis arranged for
                perturbation calculations
  61   MCQD61   One-body coupling constants between state
                and CSF
  62   MCQD62   Two-body coupling constants between state
                and CSF
  63   MCQD63   canonical Fock orbitals (FORMATTED)
  64   MCQD64   Spin functions and orbital configuration
                functions (FORMATTED)
1

       Contents of the direct access file 'DICTNRY'

     1. Atomic coordinates
     2. various energy quantities in /ENRGYS/
     3. Gradient vector
     4. Hessian (force constant) matrix
   5-6. not used
     7. PTR - symmetry transformation for p orbitals
     8. DTR - symmetry transformation for d orbitals
     9. FTR - symmetry transformation for f orbitals
    10. GTR - symmetry transformation for g orbitals
    11. Bare nucleus Hamiltonian integrals
    12. Overlap integrals
    13. Kinetic energy integrals
    14. Alpha Fock matrix (current)
    15. Alpha orbitals
    16. Alpha density matrix
    17. Alpha energies or occupation numbers
    18. Beta Fock matrix (current)
    19. Beta orbitals
    20. Beta density matrix
    21. Beta energies or occupation numbers
    22. Error function interpolation table
    23. Old alpha Fock matrix
    24. Older alpha Fock matrix
    25. Oldest alpha Fock matrix
    26. Old beta Fock matrix
    27. Older beta Fock matrix
    28. Oldest beta Fock matrix
    29. Vib 0 gradient for FORCE runs
    30. Vib 0 alpha orbitals in FORCE
    31. Vib 0 beta orbitals in FORCE
    32. Vib 0 alpha density matrix in FORCE
    33. Vib 0 beta density matrix in FORCE
    34. dipole derivative tensor in FORCE
    35. frozen core Fock operator
    36. Lagrangian multipliers
    37. floating point part of common block /OPTGRD/
int 38. integer part of common block /OPTGRD/
    39. ZMAT of input internal coords
int 40. IZMAT of input internal coords
    41. B matrix of redundant internal coords
    42. not used
    43. Force constant matrix in internal coordinates
    44. SALC transformation
    45. symmetry adapted Q matrix
    46. S matrix for symmetry coordinates
    47. ZMAT for symmetry internal coords
int 48. IZMAT for symmetry internal coords
    49. B matrix
    50. B inverse matrix
1
    51. overlap matrix in Lowdin basis,
        temp Fock matrix storage for ROHF
    52. genuine MOPAC overlap matrix
    53. MOPAC repulsion integrals
    54. exchange integrals for screening
    55. orbital gradient during SOSCF MCSCF
    56. orbital displacement during SOSCF MCSCF
    57. orbital hessian during SOSCF MCSCF
    58. reserved for Pradipta
    59. Coulomb integrals in Ruedenberg localizations
    60. exchange integrals in Ruedenberg localizations
    61. temp MO storage for GVB and ROHF-MP2
    62. temp density for GVB
    63. dS/dx matrix for hessians
    64. dS/dy matrix for hessians
    65. dS/dz matrix for hessians
    66. derivative hamiltonian for OS-TCSCF hessians
    67. partially formed EG and EH for hessians
    68. MCSCF first order density in MO basis
    69. alpha Lowdin populations
    70. beta Lowdin populations
    71. alpha orbitals during localization
    72. beta orbitals during localization
    73. alpha localization transformation
    74. beta localization transformation
    75. fitted EFP interfragment repulsion values
 76-77. not used
    78. "Erep derivative" matrix associated with F-a terms
    79. "Erep derivative" matrix associated with S-a terms
    80. EFP 1-e Fock matrix including induced dipole terms
    81. not used
    82. MO-based Fock matrix without any EFP contributions
    83. LMO centroids of charge
    84. d/dx dipole velocity integrals
    85. d/dy dipole velocity integrals
    86. d/dz dipole velocity integrals
    87. unmodified h matrix during SCRF or EFP
    88. not used
    89. EFP multipole contribution to one e- Fock matrix
    90. ECP coefficients
int 91. ECP labels
    92. ECP coefficients
int 93. ECP labels
    94. bare nucleus Hamiltonian during FFIELD runs
    95. x dipole integrals, in AO basis
    96. y dipole integrals, in AO basis
    97. z dipole integrals, in AO basis
    98. former coords for Schlegel geometry search
    99. former gradients for Schlegel geometry search
   100. not used
1
        records 101-248 are used for NLO properties

   101. U'x(0)      149. U''xx(-2w;w,w)   200. UM''xx(-w;w,0)
   102.   y         150.     xy           201.      xy
   103.   z         151.     xz           202.      xz
   104. G'x(0)      152.     yy           203.      yz
   105.   y         153.     yz           204.      yy
   106.   z         154.     zz           205.      yz
   107. U'x(w)      155. G''xx(-2w;w,w)   206.      zx
   108.   y         156.     xy           207.      zy
   109.   z         157.     xz           208.      zz
   110. G'x(w)      158.     yy           209. U''xx(0;w,-w)
   111.   y         159.     yz           210.      xy
   112.   z         160.     zz           211.      xz
   113. U'x(2w)     161. e''xx(-2w;w,w)   212.      yz
   114.   y         162.     xy           213.      yy
   115.   z         163.     xz           214.      yz
   116. G'x(2w)     164.     yy           215.      zx
   117.   y         165.     yz           216.      zy
   118.   z         166.     zz           217.      zz
   119. U'x(3w)     167. UM''xx(-2w;w,w)  218. G''xx(0;w,-w)
   120.   y         168.     xy           219.      xy
   121.   z         169.     xz           220.      xz
   122. G'x(3w)     170.     yy           221.      yz
   123.   y         171.     yz           222.      yy
   124.   z         172.     zz           223.      yz
   125. U''xx(0)    173. U''xx(-w;w,0)    224.      zx
   126.     xy      174.     xy           225.      zy
   127.     xz      175.     xz           226.      zz
   128.     yy      176.     yz           227. e''xx(0;w,-w)
   129.     yz      177.     yy           228.      xy
   130.     zz      178.     yz           229.      xz
   131. G''xx(0)    179.     zx           230.      yz
   132.     xy      180.     zy           231.      yy
   133.     xz      181.     zz           232.      yz
   134.     yy      182. G''xx(-w;w,0)    233.      zx
   135.     yz      183.     xy           234.      zy
   136.     zz      184.     xz           235.      zz
   137. e''xx(0)    185.     yz           236. UM''xx(0;w,-w)
   138.     xy      186.     yy           237.      xy
   139.     xz      187.     yz           238.      xz
   140.     yy      188.     zx           239.      yz
   141.     yz      189.     zy           240.      yy
   142.     zz      190.     zz           241.      yz
   143. UM''xx(0)   191. e''xx(-w;w,0)    242.      zx
   144.     xy      192.     xy           243.      zy
   145.     xz      193.     xz           244.      zz
   146.     yy      194.     yz
   147.     yz      195.     yy
   148.     zz      196.     yz
                    197.     zx
                    198.     zy
                    199.     zz
1
   245. old NLO Fock matrix
   246. older NLO Fock matrix
   247. oldest NLO Fock matrix
   249. not used
   250. transition density matrix in AO basis
   251. static polarizability tensor alpha
   252. X dipole integrals in MO basis
   253. Y dipole integrals in MO basis
   254. Z dipole integrals in MO basis
   255. alpha MO symmetry labels
   256. beta MO symmetry labels
257-261. reserved for Cheol Choi
262-279. not used
   280. Zero field LMOs during numerical polarizability
   281. Alpha zero field dens. during num. polarizability
   282. Beta zero field dens. during num. polarizability
   283. zero field Fock matrix during num. polarizability
284-289. not used
290-299. reserved for Alex Granovsky
   300. Z-vector during MP2 gradient
   301. Pocc during MP2 gradient
   302. Pvir during MP2 gradient
   303. Wai during MP2 gradient
   304. Lagrangian Lai during MP2 or CI gradient
   305. Wocc during MP2 gradient
   306. Wvir during MP2 gradient
   307. P(MP2)-P(RHF) during MP2 gradient
   308. SCF density during MP2 gradient
   309. energy weighted density during MP2 gradient
   311. Supermolecule h during Morokuma
   312. Supermolecule S during Morokuma
   313. Monomer 1 orbitals during Morokuma
   314. Monomer 2 orbitals during Morokuma
   315. combined monomer orbitals during Morokuma
   316. nonorthogonal SCF orbitals during SCF-MI
   317. unzeroed Fock matrix when MOs are frozen
   318. MOREAD orbitals when MOs are frozen
   319. bare Hamiltonian without EFP contribution
   320. MCSCF active orbital density
   321. MCSCF DIIS error matrix
   322. MCSCF orbital rotation indices
   323. Hamiltonian matrix during QUAD MCSCF
   324. MO symmetry labels during MCSCF
   330. CEL matrix during PCM
   331. VEF matrix during PCM
   332. QEFF matrix during PCM
   333. ELD matrix during PCM
340-354. reserved for Kurt's code
360-369. reserved for Rob Bell
   370. left transf. during RESC spin-orbit
   371. right transf. during RESC spin-orbit
   370. basis A (large component) during NESC
   371. basis B (small component) during NESC
   372. difference basis set A-B1
   373. basis N (rel. normalized large component)
   374. basis B1 (small component) during NESC
1
   375. charges of non-relativistic atoms in NESC
   376. common nuclear charges for all NESC basis
   377. common coordinates for all NESC basis
   378. common exponent values for all NESC basis
   379. Lz integrals

    In order to correctly pass data between different
machine types when running in parallel, a DAF record must
contain only floating point values, or only integer
values.  No logical or Hollerith data may be stored.  The
final calling argument to DAWRIT and DAREAD must be 0 or 1
to indicate floating point or integer values are involved.
The records containing integers are so marked in the list
above.

    Physical record 1 (containing the DAF directory) is
written whenever a new record is added to the file.  This
is invisible to the programmer.  The numbers shown above
are "logical record numbers", and are the only thing that
the programmer need be concerned with.
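    To illustrate the type flag, the hedged sketch below
writes and then reads logical record 3 (the gradient
vector), which holds only floating point data.  The DAWRIT
and DAREAD argument lists shown (DAF unit, directory
array, buffer, length, logical record number, type flag)
are an assumption about the actual routines and should be
checked against the source; only the meaning of the final
0 or 1 argument comes from the paragraph above, and the
routine itself is invented for this illustration.

C     hypothetical sketch -- check the true DAWRIT/DAREAD argument
C     lists in the GAMESS source; only the final 0/1 type flag is
C     documented above.
      SUBROUTINE DEMODA(IDAF,IODA,GRAD,NAT)
      IMPLICIT DOUBLE PRECISION(A-H,O-Z)
      DIMENSION IODA(*),GRAD(3*NAT)
C
C         write the gradient as a pure floating point record
      L = 3*NAT
      CALL DAWRIT(IDAF,IODA,GRAD,L,3,0)
C
C         read it back; an integer-only record would use flag 1
      CALL DAREAD(IDAF,IODA,GRAD,L,3,0)
      RETURN
      END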