Flag Description for Fujitsu Siemens Computers

fsc-mix-pgi-path.xml Flag Description for Fujitsu Siemens Computers

Compilers: PGI Workstation Complete 7.2 and PathScale Compiler Suite 3.2

Operating system: Linux

One or more of the following settings may have been set. If so, the corresponding notes sections of the report will say so; and you can read below to find out more about what these settings mean.

Environment Variables

MP_BIND

This Environment Variable controls the runtime behavior of binaries compiled with the PGI compilers.
It can be set to yes or y to bind processes or threads executing in a parallel region to physical processors, or to no or n to disable such binding. The default is to not bind processes to processors.
This is an execution time environment variable interpreted by the PGI runtime support libraries. It does not affect the behavior of the PGI compilers in any way. Note: the MP_BIND environment variable is not supported on all platforms.

MP_BLIST

This Environment Variable controls the runtime behavior of binaries compiled with the PGI compilers.
In addition to the MP_BIND variable, it is possible to define the thread-CPU relationship. For example, setting MP_BLIST=3,2,1,0 maps CPUs 3, 2, 1 and 0 to threads 0, 1, 2 and 3 respectively.
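As a rough illustration of how the two variables combine, the sketch below sets both and prints the thread-to-CPU mapping implied by the list. The helper name print_blist_mapping is hypothetical; it only parses the string and does not query the PGI runtime.

```shell
# Bind threads to processors and give an explicit CPU order
export MP_BIND=yes
export MP_BLIST=3,2,1,0

# Hypothetical helper: print which CPU each thread index maps to
print_blist_mapping() {
    thread=0
    # Split the comma-separated list into separate words
    for cpu in $(echo "$1" | tr ',' ' '); do
        echo "thread $thread -> CPU $cpu"
        thread=$((thread + 1))
    done
}

print_blist_mapping "$MP_BLIST"
```

Both variables must be set in the environment of the process that starts the binary; they have no effect at compile time.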

OMP_NUM_THREADS

This Environment Variable controls the runtime behavior of binaries compiled with the PGI and PathScale compilers.
This Environment Variable sets the maximum number of threads to use for OpenMP* parallel regions if no other value is specified in the application. This environment variable applies to both -openmp and -parallel (Linux and Mac OS X) or /Qopenmp and /Qparallel (Windows). Example syntax on a Linux system with 8 cores:
export OMP_NUM_THREADS=8
Default is the number of cores visible to the OS.

PGI_HUGE_PAGES

This Environment Variable controls the runtime behavior of binaries compiled with the PGI compilers.
The maximum number of huge pages an application is allowed to use can be set at run time via the environment variable PGI_HUGE_PAGES. If not set, then the process may use all available huge pages when compiled with "-Msmartalloc=huge" or a maximum of n pages where the value of n is set via the compile time flag "-Msmartalloc=huge:n".

KMP_AFFINITY

KMP_AFFINITY = < physical | logical >, starting-core-id
This Environment Variable specifies the static mapping of user threads to physical cores. For example, if you have a system configured with 8 cores, OMP_NUM_THREADS=8 and KMP_AFFINITY=physical,2, then thread 0 will be mapped to core 2, thread 1 will be mapped to core 3, and so on in a round-robin fashion.
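The round-robin mapping can be sketched with a little shell arithmetic. The helper affinity_core is hypothetical, and the wrap-around to core 0 after the last core is an assumption about what "round-robin" means here; the runtime itself is not consulted.

```shell
# Settings from the example above: 8 cores, 8 threads, start at core 2
export OMP_NUM_THREADS=8
export KMP_AFFINITY=physical,2

# Hypothetical helper: core = (start + thread) modulo number of cores
# (wrap-around is an assumption, not confirmed by the runtime documentation)
affinity_core() {
    start=$1; thread=$2; ncores=$3
    echo $(( (start + thread) % ncores ))
}

# Print the assumed mapping for every thread
t=0
while [ "$t" -lt "$OMP_NUM_THREADS" ]; do
    echo "thread $t -> core $(affinity_core 2 "$t" 8)"
    t=$((t + 1))
done
```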

Linux commands

ulimit -s < n | unlimited >

This Linux command (a bash builtin command) sets the stack size to n kbytes, or unlimited to allow the stack size to grow without limit.

ulimit -l < n | unlimited >

This Linux command (a bash builtin command) sets the maximum size of memory that may be locked into physical memory.
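A minimal sketch of how these limits might be inspected before a run. It assumes a bash-like shell; the attempt to lift the stack limit happens in a subshell, so the calling shell keeps its settings.

```shell
# Report the current soft limits (values are in kbytes, or "unlimited")
stack_limit=$(ulimit -s)
lock_limit=$(ulimit -l 2>/dev/null || echo "unknown")
echo "stack size limit:    $stack_limit"
echo "locked memory limit: $lock_limit"

# Try to lift the stack limit inside a subshell so the calling shell is unaffected
if ( ulimit -s unlimited ) 2>/dev/null; then
    echo "stack limit can be set to unlimited"
else
    echo "stack limit cannot be raised here (hard limit applies)"
fi
```

A limit set with ulimit applies to the current shell and to the processes it starts, so it has to be set in the shell that launches the benchmark.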

numactl -m nodes --physcpubind=cpus command

numactl runs processes with a specific NUMA scheduling or memory placement policy. The policy is set for command and inherited by all of its children. The arguments used here are:

-m nodes
Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. Synonym: --membind nodes
--physcpubind=cpus
Only execute process on cpus. This accepts physical cpu numbers as shown in the processor fields of /proc/cpuinfo. Synonym: -C cpus
command
Command to be executed under control of numactl.

numactl has many more options which are not described here since they are not used.

SPEC config file feature submit

submit = echo "$command" >run.sh ; $BIND bash run.sh

When running multiple copies of benchmarks, the SPEC config file feature submit is sometimes used to cause individual jobs to be bound to specific processors. This specific submit command is used for Linux. The elements of the command are:

$command
Program to be started. In this case, the benchmark instance to be started.
echo "$command" >run.sh
The Linux command echo is used to write the command to be executed into the file run.sh.
bash run.sh
Execute the commands in file run.sh with the bash shell.
$BIND
Use a value from your bind list. This variable is actually interpreted by specinvoke.
bind list (contained in config file)
List of values with which $BIND should be replaced in a submit command. For example, the following list gives good results with a 2-chip quad-core configuration.
```
bind0 = numactl -m 0 --physcpubind=0
bind1 = numactl -m 0 --physcpubind=1
bind2 = numactl -m 0 --physcpubind=2
bind3 = numactl -m 0 --physcpubind=3
bind4 = numactl -m 1 --physcpubind=4
bind5 = numactl -m 1 --physcpubind=5
bind6 = numactl -m 1 --physcpubind=6
bind7 = numactl -m 1 --physcpubind=7
```
In this example, the first benchmark instance uses bind0, so "$BIND bash run.sh" is expanded to become "numactl -m 0 --physcpubind=0 bash run.sh". The second instance uses bind1, and so on.
If there are more copies than bind values, they will be re-used in a circular fashion. If there are more bind values specified than copies, then only as many as needed will be used.
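The circular re-use of bind values can be illustrated with a short shell sketch. This is not how specinvoke is implemented; the helper bind_for_copy is hypothetical and only mimics the selection rule described above, printing the expanded commands without running them.

```shell
# Bind list from the example above, one entry per line (bind0 .. bind7)
bind_list="numactl -m 0 --physcpubind=0
numactl -m 0 --physcpubind=1
numactl -m 0 --physcpubind=2
numactl -m 0 --physcpubind=3
numactl -m 1 --physcpubind=4
numactl -m 1 --physcpubind=5
numactl -m 1 --physcpubind=6
numactl -m 1 --physcpubind=7"

# Hypothetical helper: return the bind value for a given copy number,
# wrapping around when there are more copies than bind values
bind_for_copy() {
    copy=$1
    count=$(printf '%s\n' "$bind_list" | wc -l)
    index=$(( copy % count + 1 ))
    printf '%s\n' "$bind_list" | sed -n "${index}p"
}

# Show the expanded submit command for 10 copies (more copies than bind values)
copy=0
while [ "$copy" -lt 10 ]; do
    echo "copy $copy: $(bind_for_copy "$copy") bash run.sh"
    copy=$((copy + 1))
done
```

With 10 copies and 8 bind values, copies 8 and 9 fall back to the first two entries of the list.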

Linux Huge Page settings

To take full advantage of PGI's huge page runtime library, your system must be configured to use huge pages. It is safe to run binaries compiled with "-Msmartalloc=huge" on systems not configured to use huge pages; however, you will not benefit from the performance improvements huge pages offer. To configure your system for huge pages, perform the following steps:

Create a mount point for the huge pages: "mkdir /mnt/hugepages"
The huge page file system needs to be mounted when the system reboots. Add the following to a system boot configuration file before any services are started: "mount -t hugetlbfs nodev /mnt/hugepages"
Set vm/nr_hugepages=N in your /etc/sysctl.conf file where N is the maximum number of pages the system may allocate.
Reboot to have the changes take effect.
If desired, use the Environment Variable PGI_HUGE_PAGES to control the run-time behavior.

Note that further information about huge pages may be found in your Linux documentation file: /usr/src/linux/Documentation/vm/hugetlbpage.txt
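The steps above can be collected into a small dry-run sketch. The helper hugepage_setup_steps is hypothetical and only prints the commands; the page counts 512 and 256 are arbitrary illustration values, and appending to /etc/sysctl.conf with echo is just one assumed way of making the setting. Running the printed commands for real requires root privileges and a reboot.

```shell
# Hypothetical helper: print (not run) the setup commands for a given number of huge pages
hugepage_setup_steps() {
    pages=$1
    echo "mkdir /mnt/hugepages"
    echo "mount -t hugetlbfs nodev /mnt/hugepages"
    echo "echo 'vm/nr_hugepages=$pages' >> /etc/sysctl.conf"
}

# 512 is an arbitrary example value for N
hugepage_setup_steps 512

# Optionally cap the pages a PGI-compiled binary may use at run time
# (256 is an arbitrary example value)
export PGI_HUGE_PAGES=256
```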

]]> Invoke the PGI C compiler.
Also used to invoke linker for C programs.

]]> pgcc Invoke the PGI C++ compiler.
Also used to invoke linker for C++ programs.

]]> pgcpp Invoke the PGI Fortran 95 compiler.
Also used to invoke linker for Fortran programs and for mixed C / Fortran.

]]> pgf95 pathcc Invoke the PathScale C compiler.
Also used to invoke linker for C programs.

]]> pathCC Invoke the PathScale C++ compiler.
Also used to invoke linker for C++ programs.

]]> pathf95 Invoke the PathScale Fortran 77, 90 and 95 compilers.
Also used to invoke linker for Fortran programs and for mixed C / Fortran. pathf90 and pathf95 are synonymous.

]]> Disable warning messages.

]]> -w Don't include Fortran main program object module.

]]> -Mnomain Use C99 language features.

]]> -c9x -fno-second-underscore CFP2006:

If -funderscoring is in effect, and the original Fortran external identifier contained an underscore, -fsecond-underscore appends a second underscore to the one added by -funderscoring. -fno-second-underscore does not append a second underscore. The default is both -funderscoring and -fsecond-underscore, the same defaults as g77 uses. -fno-second-underscore corresponds to the default policies of PGI Fortran and Intel Fortran. ]]> Chooses generally optimal flags for the target platform. As of the PGI 7.0 release, the flags "-fast" and "-fastsse" are equivalent for 64-bit compilations. For 32-bit compilations "-fast" does not include "-Mscalarsse", "-Mcache_align", or "-Mvect=sse".

]]> -fast Disable C++ exception handling support.

]]> --no_exceptions Disable C++ run time type information support.

]]> --no_rtti Generate zero-overhead C++ exception handlers.

]]> Inline functions declared with the inline keyword.

]]> -Mautoinline Disable inlining of functions declared with the inline keyword.

]]> -Mnoautoinline Align "unconstrained" data objects of size greater than or equal to 16 bytes on cache-line boundaries. An "unconstrained" object is a variable or array that is not a member of an aggregate structure or common block, is not allocatable, and is not an automatic array. On by default on 64-bit Linux systems.

]]> -Mcache_align Align doubles on double alignment boundaries

]]> -Mdalign Do not align doubles on double alignment boundaries

]]> -Mnodalign Set SSE to flush-to-zero mode; if a floating-point underflow occurs, the value is set to zero.

]]> -Mflushz Treat denormalized numbers as zero. Included with "-fast" on Intel based systems. For AMD based systems, "-Mdaz" is not included by default with "-fast".

]]> -Mdaz Generate code to set up a stack frame.

]]> -Mframe Eliminates operations that set up a true stack frame pointer for every function. With this option enabled, you cannot perform a traceback on the generated code and you cannot access local variables.

]]> -Mnoframe CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Instructs the compiler to use relaxed precision in the calculation of floating-point reciprocal square root (1/sqrt). Can result in improved performance at the expense of numerical accuracy.

]]> -Mfprelaxed=rsqrt Instructs the compiler to use relaxed precision in the calculation of floating-point square root. Can result in improved performance at the expense of numerical accuracy.

]]> -Mfprelaxed=sqrt Instructs the compiler to use relaxed precision in the calculation of floating-point division. Can result in improved performance at the expense of numerical accuracy.

]]> -Mfprelaxed=div Instructs the compiler to allow floating-point expression reordering, including factoring. Can result in improved performance at the expense of numerical accuracy.

]]> -Mfprelaxed=order Instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The default on an AMD system is "-Mfprelaxed=sqrt,rsqrt,order". The default on an Intel system is "-Mfprelaxed=rsqrt,sqrt,div,order"

]]> -Mfprelaxed CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Instructs the compiler to use low-precision approximation in the calculation of reciprocal square root (1/sqrt). Can result in improved performance at the expense of numerical accuracy.

]]> -Mfpapprox=rsqrt Instructs the compiler to use low-precision approximation in the calculation of square root. Can result in improved performance at the expense of numerical accuracy.

]]> -Mfpapprox=sqrt Instructs the compiler to use low-precision approximation in the calculation of divides. Can result in improved performance at the expense of numerical accuracy.

]]> -Mfpapprox=div Instructs the compiler to perform low-precision approximation in the calculation of floating-point division, square-root, and reciprocal square root. Can result in improved performance at the expense of numerical accuracy.

]]> -Mfpapprox -Mpre -Mpre CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Set the fetch-ahead distance for prefetch instructions to $1 cache lines

]]> -Mprefetch=d:m Set maximum number of prefetch instructions to generate for a given loop to $1.

]]> -Mprefetch=n:p Use the prefetchnta instruction.

]]> -Mprefetch_nta Use the prefetch instruction.

]]> -Mprefetch=plain Use the prefetcht0 instruction.

]]> -Mprefetch=t0 Use the AMD-specific prefetchw instruction.

]]> -Mprefetch=w Enable generation of prefetch instructions on processors where they are supported.

]]> -Mprefetch Disable generation of prefetch instructions.

]]> -Mnoprefetch Use SSE/SSE2 instructions to perform scalar floating-point arithmetic on targets where these instructions are supported.

]]> -Mscalarsse Do not use SSE/SSE2 instructions to perform scalar floating-point arithmetic; use x87 operations instead.

]]> -Mnoscalarsse Instructs the compiler to extend the sign bit that is set as a result of an object's conversion from one data type to an object of a larger signed data type.

]]> -Msignextend Aligns or does not align innermost loops on 32 byte boundaries with -tp barcelona. Small loops on barcelona systems may run fast if aligned on 32-byte boundaries; however, in practice, most assemblers do not yet implement efficient padding, causing some programs to run more slowly with this as default. Use -Mloop32 on systems with an assembler tuned for barcelona. The default is -Mnoloop32.

]]> -Mloop32 Treat individual array element references as candidates for possible loop-carried redundancy elimination. The default is to eliminate only redundant expressions involving two or more operands.

]]> -Mlre_array Allow expression re-association; specifying this sub-option can increase opportunities for loop-carried redundancy elimination.

]]> -Mlre=assoc Disable expression re-association.

]]> -Mlre=noassoc Enables loop-carried redundancy elimination, an optimization that can reduce the number of arithmetic operations and memory references in loops.

]]> -Mlre Disable loop-carried redundancy elimination.

]]> -Mnolre Instructs the compiler not to perform idiom recognition or introduce calls to hand-optimized vector functions.

]]> -Mnovintr Generate profile-feedback instrumentation (PFI); this includes extra code to collect run-time statistics and dump them to a trace file for use in a subsequent compilation. PFI gathers information about a program's execution and data values but does not gather information from hardware performance counters. PFI does gather data for optimizations which are unique to profile-feedback optimization.

The indirect sub-option enables collection of indirect function call targets, which can be used for indirect function call inlining.

]]> -Mpfi=indirect Enable profile-feedback optimizations including indirect function call inlining. This option requires a pgfi.out file generated from a binary built with -Mpfi=indirect.

]]> -Mpfo=indirect Generate profile-feedback instrumentation (PFI); this includes extra code to collect run-time statistics and dump them to a trace file for use in a subsequent compilation. PFI gathers information about a program's execution and data values but does not gather information from hardware performance counters. PFI does gather data for optimizations which are unique to profile-feedback optimization.

]]> -Mpfi Enable profile-feedback optimizations.

]]> -Mpfo CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Interprocedural Analysis option: Recognize when targets of pointer dummy are aligned.

]]> -Mipa=align Interprocedural Analysis option: Disable recognition when targets of pointer dummy are aligned.

]]> -Mipa=noalign Interprocedural Analysis option: Remove arguments replaced by -Mipa=ptr,const

]]> -Mipa=arg Interprocedural Analysis option: Do not remove arguments replaced by -Mipa=ptr,const

]]> -Mipa=noarg Interprocedural Analysis option: Generate call graph information for pgicg tool.

]]> -Mipa=cg Interprocedural Analysis option: Do not generate call graph information for pgicg tool.

]]> -Mipa=nocg Interprocedural Analysis option: Enable interprocedural constant propagation.

]]> -Mipa=const Interprocedural Analysis option: Disable interprocedural constant propagation.

]]> -Mipa=noconst Interprocedural Analysis option: Used with -Mipa=inline to specify functions which should not be inlined.

]]> -Mipa=except:func Instructs the compiler to perform interprocedural analysis. Equivalent to -Mipa=align,arg,const,f90ptr,shape,globals,libc,localarg,ptr,pure.

]]> -Mipa=fast Interprocedural Analysis option: Force all objects to recompile regardless whether IPA information has changed.

]]> -Mipa=force Interprocedural Analysis option: Optimize references to global values.

]]> -Mipa=globals Interprocedural Analysis option: Do not optimize references to global values.

]]> -Mipa=noglobals Interprocedural Analysis option: Automatically determine which functions to inline, limit to $1 levels where $1 is a supplied constant value. If no value is supplied, then the default value of 2 is used. IPA-based function inlining is performed from leaf routines upward.

]]> n -Mipa=inline:4 Interprocedural Analysis option: Automatically determine which functions to inline, limit to 2 levels (default). IPA-based function inlining is performed from leaf routines upward.

]]> -Mipa=inline Interprocedural Analysis option: Automatically determine which functions to inline, independent of information gathered from profile guided feedback (-Mpfi), limit to $1 levels where $1 is a supplied constant value. If no value is supplied, then the default value of 2 is used. IPA-based function inlining is performed from leaf routines upward.

]]> n -Mipa=inlinenopfo:4 Interprocedural Analysis option: Automatically determine which functions to inline, independent of information gathered from profile guided feedback (-Mpfi), limit to 2 levels (default). IPA-based function inlining is performed from leaf routines upward.

]]> -Mipa=inlinenopfo Interprocedural Analysis option: Inline static functions which are outside the scope of the current file.

]]> -Mipa=staticfunc Interprocedural Analysis option: Allow inlining of routines from libraries.

]]> -Mipa=libinline Interprocedural Analysis option: Do not inline routines from libraries.

]]> -Mipa=nolibinline Interprocedural Analysis option: Specifies the number of concurrent IPA second-pass compilation processes that may be performed. This option speeds up compilation on multi-core systems but does not perform any optimizations.

]]> -Mipa=libc Interprocedural Analysis option: Used to optimize calls to certain functions in the system standard C library, libc.

]]> -Mipa=libc Interprocedural Analysis option: Allow recompiling and optimization of routines from libraries using IPA information.

]]> -Mipa=libopt Interprocedural Analysis option: Don't optimize routines in libraries.

]]> -Mipa=nolibopt Interprocedural Analysis option: -Mipa=arg plus externalizes local pointer targets.

]]> -Mipa=localarg Interprocedural Analysis option: -Mipa=arg plus externalizes local pointer targets.

]]> -Mipa=localarg Interprocedural Analysis option: Do not externalize local pointer targets.

]]> -Mipa=nolocalarg Interprocedural Analysis option: Enable pointer disambiguation across procedure calls.

]]> -Mipa=ptr Interprocedural Analysis option: Disable pointer disambiguation.

]]> -Mipa=noptr Interprocedural Analysis option: Fortran 90/95 Pointer disambiguation across calls.

]]> -Mipa=f90ptr Interprocedural Analysis option: Disable Fortran 90/95 pointer disambiguation

]]> -Mipa=nof90ptr Interprocedural Analysis option: Pure function detection.

]]> -Mipa=pure Interprocedural Analysis option: Disable pure function detection.

]]> -Mipa=nopure Interprocedural Analysis option: Allows inlining in Fortran even when array shapes do not match.

]]> -Mipa=reshape Interprocedural Analysis option: Perform Fortran 90 array shape propagation.

]]> -Mipa=shape Interprocedural Analysis option: Disable Fortran 90 array shape propagation.

]]> -Mipa=noshape Interprocedural Analysis option: Remove functions that are never called.

]]> -Mipa=vestigial Interprocedural Analysis option: Do not remove functions that are never called.

]]> -Mipa=novestigial Enable Interprocedural Analysis.

]]> -Mipa CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Instructs the parallelizer to generate alternate serial code for parallelized loops. Without arguments, the parallelizer determines an appropriate cutoff length and generates serial code to be executed whenever the loop count is less than or equal to that length.

]]> -Mconcur=altcode Instructs the parallelizer to generate alternate serial code for parallelized loops. With arguments, the serial altcode is executed whenever the loop count is less than or equal to $1.

]]> -Mconcur=altcode:n Always execute the parallelized version of a loop regardless of the loop count.

]]> -Mconcur=noaltcode Disables parallelization of loops with reductions.

]]> -Mconcur=noassoc Assume loops containing calls are safe to parallelize and allows loops containing calls to be candidates for parallelization. Also, no minimum loop count threshold must be satisfied before parallelization will occur, and last values of scalars are assumed to be safe.

]]> -Mconcur=cncall Do not assume loops containing calls are safe to parallelize.

]]> -Mconcur=nocncall Parallelize with block distribution. Contiguous blocks of iterations of a parallelizable loop are assigned to the available processors.

]]> -Mconcur=dist:bloc Parallelize with cyclic distribution. The outermost parallelizable loop in any loop nest is parallelized. If a parallelized loop is innermost, its iterations are allocated to processors cyclically. For example, if there are 3 processors executing a loop, processor 0 performs iterations 0, 3, 6, etc.; processor 1 performs iterations 1, 4, 7, etc.; and processor 2 performs iterations 2, 5, 8, etc.

]]> -Mconcur=dist:cyclic Enable parallelization of innermost loops.

]]> -Mconcur=innermost Disable parallelization of innermost loops.

]]> -Mconcur=noinnermost Instructs the compiler to enable auto-concurrentization of loops. If -Mconcur is specified, multiple processors will be used to execute loops that the compiler determines to be parallelizable.

The environment variables MP_BIND, MP_BLIST, and OMP_NUM_THREADS may be used to optimise the runtime behavior of binaries compiled with -Mconcur. These variables are described below in the section "System and other Tuning Information".

]]> -Mconcur CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Instructs the inliner to inline the functions within the library filename.ext.

]]> -Minline=lib:filename.ext Instructs the inliner to inline all eligible functions except $1, a function in the source text. Multiple functions can be listed, comma-separated.

]]> -Minline=except:func Instructs the inliner to inline function func.

]]> -Minline=name:func Allows inlining in Fortran even when array shapes do not match.

]]> -Minline=reshape Instructs the inliner to inline functions with $1 or fewer statements where $1 is a supplied constant value.

]]> n -Minline=size:n Instructs the inliner to perform $1 levels of inlining where $1 is a supplied constant value. If no value is supplied, then the default value of 2 is used.

]]> n -Minline=levels:4 Instructs the inliner to perform 1 level of inlining.

]]> -Minline Disable constant propagation from assertions derived from equality conditionals.

]]> -Mnopropcond Link with the huge page runtime library and allocate a maximum of $1 huge pages where $1 is a supplied constant value. If no constant value is supplied, then the maximum number of huge pages the application can use is limited by the number of huge pages the operating system has available or the value of the environment variable PGI_HUGE_PAGES.

Note that setting PGI_HUGE_PAGES will override the value of $1. This environment variable is described below in the section "System and Other Tuning Information".

]]> n -Msmartalloc=huge:150 Link with the huge page runtime library. The maximum number of huge pages the application can use is limited by the number of huge pages the operating system has available or the value of the environment variable PGI_HUGE_PAGES. This environment variable is described below in the section "System and Other Tuning Information".

]]> -Msmartalloc=huge Link with the huge page runtime library. Use huge pages for an executable's .BSS section.

]]> -Msmartalloc\=hugebss Adds a call to the routine "mallopt" in the main routine. This option can have a dramatic impact on the performance of programs that dynamically allocate memory, especially for those which have a few large mallocs. To be effective, this switch must be specified when compiling the file containing the Fortran, C, or C++ main routine.

]]> -Msmartalloc CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Assume all pointers and arrays are independent and safe for aggressive optimizations, and in particular that no pointers or arrays overlap or conflict with each other.

]]> -Msafeptr=all Instructs the compiler that arrays and pointers are treated with the same copyin and copyout semantics as Fortran dummy arguments.

]]> -Msafeptr=arg Instructs the compiler that local pointers and arrays do not overlap or conflict with each other and are independent.

]]> -Msafeptr=auto Instructs the compiler that local pointers and arrays do not overlap or conflict with each other and are independent.

]]> -Msafeptr=local Instructs the compiler that static pointers and arrays do not overlap or conflict with each other and are independent.

]]> -Msafeptr=static Instructs the compiler that global or external pointers and arrays do not overlap or conflict with each other and are independent.

]]> -Msafeptr=global Instructs the C/C++ compiler to override data dependencies between pointers of a given storage class.

]]> -Msafeptr CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Instructs the compiler to completely unroll loops with a constant loop count of less than or equal to $1 where $1 is a supplied constant value. If no constant value is given, then a default of 4 is used.

]]> n -Munroll=c:1 "-Munroll=n:n" instructs the compiler to unroll loops $1 times where $1 is a supplied constant value. If no constant value is given, then a default of 4 is used.

]]> n -Munroll=n:n "-Munroll=m:n" instructs the compiler to unroll loops with multiple blocks $1 times where $1 is a supplied constant value. If no constant value is given, then a default of 4 is used.

]]> n -Munroll=m:n Instructs the compiler to unroll loops with multiple blocks using the default value of 4 times

]]> -Munroll=m Invokes the loop unroller.

]]> -Munroll Disable loop unrolling.

]]> -Mnounroll Don't check dependence relations for vector or parallel code.

]]> Allow parallelization of loops with conditional scalar assignments.

]]> Generate code to check for zero loop increments.

]]> Enable an optional post-pass instruction scheduling.

]]> -Msmart Disable an optional post-pass instruction scheduling.

]]> -Mnosmart CPU2006 flags file rule used to split an optimization flag containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. Disable automatic vector pipelining.

]]> -Mnovect Instructs the vectorizer to generate alternate code for vectorized loops when appropriate. For each vectorized loop the compiler decides whether to generate altcode and what type or types to generate, which may be any or all of:

Altcode without iteration peeling
Altcode with non-temporal stores and other data cache optimizations
Altcode based on array alignments calculated dynamically at runtime.

The compiler also determines suitable loop count and array alignment conditions for executing the altcode.

]]> -Mvect=altcode Disables alternate code generation for vectorized loops.

]]> -Mvect=noaltcode Instructs the vectorizer to enable certain associativity conversions that can change the results of a computation due to round-off error. A typical optimization is to change an arithmetic operation to an arithmetic operation that is mathematically correct, but can be computationally different, due to round-off error.

]]> -Mvect=assoc Instructs the vectorizer to disable associativity conversions.

]]> -Mvect=noassoc Instructs the vectorizer, when performing cache tiling optimizations, to assume a cache size of $1. The default size is processor dependent.

]]> n -Mvect=cachesize:n Instructs the vectorizer to enable loop fusion.

]]> -Mvect=fuse Instructs the vectorizer to disable vectorization of indirect array references.

]]> -Mvect=nogather Instructs the vectorizer to enable idiom recognition.

]]> -Mvect=idiom Instructs the vectorizer to disable idiom recognition.

]]> -Mvect=noidiom Generate vector loops for all loops where possible regardless of the number of statements in the loop. This overrides a heuristic in the vectorizer that ordinarily prevents vectorization of loops with a number of statements that exceed a certain threshold.

]]> -Mvect=nosizelimit Instructs the vectorizer to generate partial vectorization.

]]> -Mvect=partial Instructs the vectorizer to generate prefetch instructions.

]]> -Mvect=prefetch Instructs the vectorizer to search for vectorizable loops and, where possible, make use of SSE, SSE2, and prefetch instructions.

]]> -Mvect=sse Instructs the driver to disable the -Mvect=sse option which is part of the "-fast" option.

]]> -Mvect=nosse Enable automatic vector pipelining.

]]> -Mvect Disables -Ktrap=fp.

]]> -Mnofptrap -Ktrap is only processed by the compilers when compiling main functions or programs. The options inv, denorm, divz, ovf, unf, and inexact correspond to the processor's exception mask bits invalid operation, denormalized operand, divide-by-zero, overflow, underflow, and precision, respectively. Normally, the processor's exception mask bits are on (floating-point exceptions are masked; the processor recovers from the exceptions and continues). If a floating-point exception occurs and its corresponding mask bit is off (or unmasked), execution terminates with an arithmetic exception (C's SIGFPE signal). -Ktrap=fp is equivalent to -Ktrap=inv,divz,ovf.

]]> -Ktrap=fp Enable long branches.

]]> -Mlongbranch Link with the AMD Core Math Library. Available from www.amd.com

]]> -lacml Use the -mp option to instruct the compiler to interpret user-inserted OpenMP shared-memory parallel programming directives and generate an executable file which will utilize multiple processors in a shared-memory parallel system.

When used strictly as a linker flag, the PGI OpenMP runtime will be linked and users can use the environment variables MP_BIND and MP_BLIST to bind a serial program to a CPU.

The environment variables MP_BIND, MP_BLIST, and OMP_NUM_THREADS may be used to optimise the runtime behavior of binaries compiled with -mp. These variables are described below in the section "System and other Tuning Information".

]]> -mp The align sub-option to -mp forces loop iterations to be allocated to OpenMP processes using an algorithm that maximizes alignment of vector sub-sections in loops that are both parallelized and vectorized for SSE. This can improve performance in program units that include many such loops. It can result in load-balancing problems that significantly decrease performance in program units with relatively short loops that contain a large amount of work in each iteration.

]]> -mp=align The numa suboption to -mp uses libnuma on systems where it is available.

]]> -mp=numa The nonuma suboption to -mp tells the driver to not link with libnuma.

]]> -mp=nonuma (For use only on 64-bit Linux targets) Generate code for the medium memory model in the linux86-64 execution environment. The default small memory model of the linux86-64 environment limits the combined area for a user's object or executable to 1GB, with the Linux kernel managing usage of the second 1GB of address for system routines, shared libraries, stacks, etc. Programs are started at a fixed address, and the program can use a single instruction to make most memory references. The medium memory model allows for larger than 2GB data areas, or .bss sections. Program units compiled using either -mcmodel=medium or -fpic require additional instructions to reference memory. The effect on performance is a function of the data-use of the application. The -mcmodel=medium switch must be used at both compile time and link time to create 64-bit executables. Program units compiled for the default small memory model can be linked into medium memory model executables as long as they are compiled -fpic, or position-independent.

]]> -mcmodel=medium Enable support for 64-bit indexing and single static data objects larger than 2GB in size. This option is default in the presence of -mcmodel=medium. Can be used separately together with the default small memory model for certain 64-bit applications that manage their own memory space.

]]> -Mlarge_arrays Enable dead store elimination.

]]> Enable optimizations using ANSI C type-based pointer disambiguation.

]]> Set the optimization level to -O2

]]> -O A basic block is generated for each C statement. No scheduling is done between statements. No global optimizations are performed.

]]> -O0 Level-one optimization specifies local optimization (-O1). The compiler performs scheduling of basic blocks as well as register allocation. This optimization level is a good choice when the code is very irregular; that is, it contains many short statements containing IF statements and the program does not contain loops (DO or DO WHILE statements). For certain types of code, this optimization level may perform better than level-two (-O2) although this case rarely occurs.

The PGI compilers perform many different types of local optimizations, including but not limited to:

Algebraic identity removal
Constant folding
Common subexpression elimination
Local register optimization
Peephole optimizations
Redundant load and store elimination
Strength reductions

]]> -O1 Level-two optimization (-O2 or -O) specifies global optimization. The -fast option generally will specify global optimization; however, the -fast switch will vary from release to release depending on a reasonable selection of switches for any one particular release. The -O or -O2 level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.

The PGI compilers perform many different types of global optimizations, including but not limited to:

Branch to branch elimination
Constant propagation
Copy propagation
Dead store elimination
Global register allocation
Invariant code motion
Induction variable elimination

]]> -O2 All level 1 and 2 optimizations are performed. In addition, this level enables more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable.

]]> -O3 Performs all level 1, 2, and 3 optimizations and enables hoisting of guarded invariant floating point expressions.

]]> -O4 Create a Unified Binary using multiple targets. Specify the type of the target processor as AMD64 Processor 32-bit mode.

]]> -tp k8-32 Specify the type of the target processor as AMD64 Processor 64-bit mode.

]]> -tp k8-64 Specify the type of the target processor as AMD64 Barcelona Processor 64-bit mode.

]]> -tp barcelona-64 Specify the type of the target processor as AMD64 Barcelona Processor 32-bit mode.

]]> -tp barcelona-32 Specify the type of the target processor as AMD64 Barcelona Processor 32-bit mode.

]]> -tp barcelona Specify the type of the target processor as Intel Penryn Processor 64-bit mode.

]]> -tp penryn-64 Specify the type of the target processor as Intel Penryn Processor 32-bit mode.

]]> -tp penryn-32 Specify the type of the target processor as Intel Penryn Processor 32-bit mode.

]]> -tp penryn Specify the type of the target processor as Intel P7 Architecture with EM64t, 64-bit mode.

]]> -tp p7-64 Specify the type of the target processor as Intel P7 Architecture (Pentium 4, Xeon, Centrino).

]]> -tp p7 Specify the type of the target processor as Intel Core 2 EM64T or compatible architecture using 64-bit mode.

]]> -tp core2-64 Specify the type of the target processor as Intel Core 2 or compatible architecture using 32-bit mode.

]]> -tp core2 Use the unified AMD/Intel 64-bit mode.

]]> -tp x64 Experimental flags.

]]> -fb_create fbdata -fb_create <path>
Used to specify that an instrumented executable program is to be generated. Such an executable is suitable for producing feedback data files with the specified prefix for use in feedback-directed optimization (FDO). The commonly used prefix is "fbdata".
This is OFF by default.

During the training run, the instrumented executable produces information regarding execution paths and data values, but does not generate information by using hardware performance counters.

]]> -fb_opt fbdata -fb_opt <prefix for feedback data files>
Used to specify feedback-directed optimization (FDO) by extracting feedback data from files with the specified prefix, which were previously generated using -fb_create. The commonly used prefix is "fbdata". The same optimization flags should be used for both the -fb_create and -fb_opt compile steps. Feedback data files created from executables compiled with different optimization flags may give checksum errors.
FDO is OFF by default.

During the -fb_opt compilation phase, information regarding execution paths and data values is used to improve the information available to the optimizer. FDO enables some optimizations which are only performed when the feedback data file is available. The safety of optimizations performed under FDO is consistent with the level of safety implied by the other optimization flags (outside of -fb_create and -fb_opt) specified on the compile and link lines.
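The three-step feedback workflow described above (instrumented build, training run, optimized rebuild) can be sketched as a small Python helper that only assembles the command lines; it does not run anything. The driver name `pathcc`, the file names, and the helper `fdo_commands` itself are illustrative assumptions, while the `-fb_create` and `-fb_opt` flags and the "fbdata" prefix come from the text.

```python
def fdo_commands(source, exe, opt_flags, prefix="fbdata", compiler="pathcc"):
    """Return the three command lines of a feedback-directed build.

    Hypothetical helper: compiler name and file names are assumptions.
    """
    flags = list(opt_flags)
    # Step 1: build an instrumented executable that writes <prefix>* files.
    instrument = [compiler, *flags, "-fb_create", prefix, "-o", exe, source]
    # Step 2: training run; executing the binary produces the feedback files.
    train = ["./" + exe]
    # Step 3: rebuild with the SAME optimization flags, reading the feedback.
    optimize = [compiler, *flags, "-fb_opt", prefix, "-o", exe, source]
    return [instrument, train, optimize]
```

Passing one shared `opt_flags` list to both compile steps mirrors the recommendation to keep the optimization flags identical, which avoids the checksum errors mentioned above.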

]]> -march=<cpu-type>

Compiler will optimize code for selected cpu-type: opteron, athlon, athlon64, athlon64fx, barcelona, em64t, pentium4, xeon, core, anyx86, auto.
The default value, auto, means to optimize for the platform on which the compiler is running, as determined by reading /proc/cpuinfo.
anyx86 means a generic 32-bit x86 processor without SSE2 support.
barcelona is AMD's first Quad-core processor family.

]]> -O3 Specify the basic level of optimization desired.
The options can be one of the following:

0 Turn off all optimizations.

1 Turn on local optimizations that can be done quickly. Do peephole optimizations and instruction scheduling.

2 Turn on extensive optimization. This is the default.
The optimizations at this level are generally conservative, in the sense that they are virtually always beneficial and avoid changes which affect such things as floating point accuracy. In addition to the level 1 optimizations, do inner loop unrolling, if-conversion, two passes of instruction scheduling, global register allocation, dead store elimination, instruction scheduling across basic blocks, and partial redundancy elimination.

3 Turn on aggressive optimization.
The optimizations at this level are distinguished from -O2 by their aggressiveness, generally seeking highest-quality generated code even if it requires extensive compile time. They may include optimizations that are generally beneficial but may hurt performance.
This includes but is not limited to turning on the Loop Nest Optimizer, -LNO:opt=1, and setting -OPT:roundoff=1:IEEE_arithmetic=2:Olimit=9000:reorg_common=ON.

s Specify that code size is to be given priority in tradeoffs with execution time.

If no value is specified, 2 is assumed.

]]> -Ofast Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math.
Use optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating point accuracy due to rearrangement of computations.

NOTE: -Ofast enables -ipa (inter-procedural analysis), which places limitations on how libraries and .o files are built.

]]> -apo This auto-parallelizing option signals the compiler to automatically convert sequential code into parallel code where it is safe and beneficial to do so.

The default number of threads used at run-time is the number of CPUs available in the machine. This number of threads can also be controlled by setting the OMP_NUM_THREADS environment variable.

]]> -ipa Invoke inter-procedural analysis (IPA). Specifying this option is identical to specifying -IPA or -IPA:. Default settings for the individual IPA suboptions are used.

]]> -m3dnow Enable the use of 3DNow instructions.

]]> -ffast-math -ffast-math improves FP speed by relaxing ANSI & IEEE rules. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed. -ffast-math implies -OPT:IEEE_arithmetic=2 -fno-math-errno. -fno-fast-math implies -OPT:IEEE_arithmetic=1 -fmath-errno.

]]> -fno-exceptions (For C++ only) -fexceptions enables exception handling. This is the default. -fno-exceptions disables exception handling.

]]> -fno-math-errno Do not set ERRNO after calling math functions that are executed with a single instruction, e.g. sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility. This is implied by -Ofast. The default is -fmath-errno.

]]> -m32 Compile for 32-bit ABI, also known as x86 or IA32.

]]> -m64 Compile for 64-bit ABI, also known as AMD64, x86_64, or IA32e. On a 32-bit host, the default is 32-bit ABI. On a 64-bit host, the default is 64-bit ABI if the target platform (-march/-mcpu/-mtune) is 64-bit; otherwise the default is 32-bit.

]]> This rule is used to split a flag group containing sub-options into multiple flag descriptions. Please refer to the flag file rule of the various sub-options for the actual flag description. -CG:cflow The Code Generation option group -CG: controls the optimizations and transformations of the instruction-level code generator.

-CG:cflow : OFF disables control flow optimization in the code generation. Default is ON.

]]> -CG:cse_regs -CG:cse_regs=N : When performing common subexpression elimination during code generation, assume there are N extra integer registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity. See also -CG:sse_cse_regs.

]]> -CG:gcm -CG:gcm : Specifying OFF disables the instruction-level global code motion optimization phase. The default is ON.

]]> -CG:load_exe -CG:load_exe=N : Specify the threshold for subsuming a memory load operation into the operand of an arithmetic instruction. The value of 0 turns off this subsumption optimization. If N is 1, this subsumption is performed only when the result of the load has only one use. This subsumption is not performed if the number of times the result of the load is used exceeds the value N, a non-negative integer.
If the ABI is 64-bit and the language is Fortran, the default for N is 2, otherwise the default is 1.

]]> -CG:local_fwd_sched -CG:local_fwd_sched : this optimization option is deprecated.

]]> -CG:local_sched_alg -CG:local_sched_alg : Select the basic block instruction scheduling algorithm. If 0, perform backward scheduling, where instructions are scheduled from the bottom of the basic block to the top. If 1, perform forward scheduling. If 2, schedule the instructions twice - once in the forward direction and once in the backward direction - and take the better of the two schedules. The default value of this option is determined by the compiler during compilation.

]]> -CG:locs_shallow_depth -CG:locs_shallow_depth=(ON|OFF): When performing local instruction scheduling to reduce register usage, give priority to instructions that have shallow depths in the dependence graph. The default is OFF.

]]> -CG:movnti -CG:movnti=N : Convert ordinary stores to non-temporal stores when writing memory blocks of size larger than N KB. When N is set to 0, this transformation is avoided. The default value is 1000 (KB).

]]> -CG:p2align -CG:p2align : Align loop heads to 64-byte boundaries. The default is off.

]]> -CG:post_local_sched -CG:post_local_sched: Enable the local scheduler phase after register allocation. The default is on.

]]> -CG:prefetch -CG:prefetch : Enable or disable generation of prefetch instructions in the code generator. The default is on.

]]> -CG:prefer_legacy_regs -CG:prefer_legacy_regs: Tell the local register allocator to use the first 8 integer and SSE registers whenever possible (%rax-%rbp,%xmm0-%xmm7). Instructions using these registers have smaller instruction sizes. The default is OFF.

]]> -CG:ptr_load_use -CG:ptr_load_use=N: Add a latency of N cycles between an instruction that loads a pointer and an instruction that uses the pointer. The extra latency will force the instruction scheduler to schedule the pointer load earlier. In general, it is beneficial to load pointers as soon as possible so that dependent memory instructions can begin execution. N is 4 by default. ("Load pointer" instructions include load-execute instructions that compute a pointer result.)

]]> -CG:push_pop_int_saved_regs -CG:push_pop_int_saved_regs: Use the X86 push and pop instructions to save the integer callee-saved registers at function prologs and epilogs instead of mov instructions to and from memory locations based off the stack pointer. The default is ON when the CPU target is barcelona, and OFF otherwise.

]]> -CG:sse_cse_regs -CG:sse_cse_regs=N : When performing common subexpression elimination during code generation, assume there are N extra SSE registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity.

]]> -CG:use_prefetchnta -CG:use_prefetchnta: Prefetch when data is non-temporal at all levels of the cache hierarchy. This is for data streaming situations in which the data will not need to be re-used soon. The default is OFF.

]]> -INLINE:aggressive -INLINE:aggressive : Tell the compiler to be more aggressive about inlining. The default is -INLINE:aggressive=OFF.

]]> -IPA:callee_limit The inter-procedural analyzer option group -IPA: controls application of inter-procedural analysis and optimization.

-IPA:callee_limit=N : Functions whose size exceeds this limit will never be automatically inlined by the compiler. The default is 500.

]]> -IPA:linear -IPA:linear : Controls conversion of a multi-dimensional array to a single dimensional (linear) array that covers the same block of memory. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameter. In the case that it cannot map the parameter, it linearizes the array reference. By default, IPA will not inline such callsites because they may cause performance problems. The default is OFF.

]]> -IPA:plimit -IPA:plimit=N : This option stops inlining into a specific subprogram once it reaches size N in the intermediate representation. Default is 2500.

]]> -IPA:pu_reorder -IPA:pu_reorder=N : Control re-ordering the layout of program units based on their invocation patterns in feedback compilation to minimize instruction cache misses. This option is ignored unless under feedback compilation.

0 Disable procedure reordering. This is the default for non-C++ programs.

1 Reorder based on the frequency in which different procedures are invoked. This is the default for C++ programs.

2 Reorder based on caller-callee relationship.

]]> -IPA:space -IPA:space=N : Inline until a program expansion of N % is reached. For example, -IPA:space=20 limits code expansion due to inlining to approximately 20 %. Default is no limit.

]]> -LNO:blocking Specify options and transformations performed on loop nests by the Loop Nest Optimizer (LNO). The -LNO options are enabled only if -O3 is also specified on the pathf95 command line.

-LNO:blocking : Enable or disable the cache blocking transformation. The default is ON.
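As a rough illustration of what cache blocking does to a loop nest, the sketch below transposes a square matrix first with plain loops and then tile by tile. It is written in Python purely to show the loop structure; the function names and the tile size are assumptions, and the compiler's actual choice of tile sizes is not described here.

```python
def transpose_plain(a):
    """Straightforward two-level loop nest."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[j][i] = a[i][j]
    return out


def transpose_blocked(a, block=2):
    """Same computation, visiting the matrix one block x block tile at a time."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):          # tile row origin
        for jj in range(0, n, block):      # tile column origin
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j][i] = a[i][j]
    return out
```

Both versions produce the same result; the blocked form only changes the order in which elements are touched, so that each tile can stay in cache while it is being used.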

]]> -LNO:full_unroll -LNO:full_unroll,fu=N : Fully unroll loops with trip_count <= N inside LNO. N can be any integer between 0 and 100. The default value for N is 5. Setting this flag to 0 disables full unrolling of small trip count loops inside LNO.

]]> -LNO:full_unroll_outer -LNO:full_unroll_outer=(on|off|0|1) : Control the full unrolling of loops with known trip count that do not contain a loop and are not contained in a loop. The conditions implied by both the full_unroll and the full_unroll_size options must be satisfied for the loop to be fully unrolled. The default is OFF.

]]> -LNO:full_unroll_size -LNO:full_unroll_size=N : Fully unroll loops with unrolled loop size <= N inside LNO. N can be any integer between 0 and 10000. The conditions implied by the full_unroll option must also be satisfied for the loop to be fully unrolled. The default value for N is 2000.

]]> -LNO:fission -LNO:fission=N : Perform loop fission. N can be one of the following:
0 = Disable loop fission (default)
1 = Perform normal loop fission as necessary
2 = Specify that fission be tried before fusion

Because -LNO:fusion is on by default, turning on fission without turning off fusion may result in their effects being nullified. Ordinarily, fusion is applied before fission. Specifying -LNO:fission=2 will turn on fission and cause it to be applied before fusion.

]]> -LNO:fusion -LNO:fusion=N : Perform loop fusion. N can be one of the following:
0 = Loop fusion is off
1 = Perform conservative loop fusion
2 = Perform aggressive loop fusion
The default is 1.
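To make the fission/fusion trade-off concrete, here is a minimal Python sketch of the same work written as two separate loops (the fissioned form) and as one combined loop (the fused form). The function names are hypothetical and the code only shows the loop shapes, not how LNO decides between them.

```python
def fissioned(a, b):
    """Two loops, each with a single statement."""
    n = len(a)
    c = [0] * n
    d = [0] * n
    for i in range(n):
        c[i] = a[i] + b[i]
    for i in range(n):
        d[i] = a[i] * b[i]
    return c, d


def fused(a, b):
    """One loop doing both statements, reusing a[i] and b[i] while they are hot."""
    n = len(a)
    c = [0] * n
    d = [0] * n
    for i in range(n):
        c[i] = a[i] + b[i]
        d[i] = a[i] * b[i]
    return c, d
```

Because each transformation undoes the other, applying fusion after fission can restore the original loop, which is why the order in which they are applied matters.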

]]> -LNO:ignore_feedback -LNO:ignore_feedback=(on|off|0|1) : If the flag is ON then feedback information from the loop annotations will be ignored in LNO transformations. The default is OFF.

]]> -LNO:interchange -LNO:interchange=(on|off|0|1) : Enable or disable the loop interchange transformation in the loop nest optimizer. Default is ON.

]]> -LNO:minvariant Enable or disable moving loop-invariant expressions out of loops. The default is ON.

]]> -LNO:opt=0 This option controls the LNO optimization level. The options can be one of the following:
0 = Disable nearly all loop nest optimizations.
1 = Perform full loop nest transformations. This is the default.

]]> -LNO:outer_unroll_max -LNO:outer_unroll_max,ou_max=N : The Outer_unroll_max option indicates that the compiler may unroll outer loops in a loop nest by as many as N per loop, but no more. The default is 5.

]]> -LNO:ou_prod_max -LNO:ou_prod_max=N : This option indicates that the product of unrolling of the various outer loops in a given loop nest is not to exceed N, where N is a positive integer. The default is 16.

]]> -LNO:prefetch_ahead -LNO:prefetch_ahead=N : Prefetch N cache line(s) ahead. The default is 2.

]]> -LNO:prefetch -LNO:prefetch=(0|1|2|3) : This option specifies the level of prefetching.

0 = Prefetch disabled.

1 = Prefetch is done only for arrays that are always referenced in each iteration of a loop.

2 = Prefetch is done without the above restriction. This is the default.

3 = Most aggressive.

]]> -LNO:pf2 This option selectively disables or enables prefetching for Level 2 cache. The default is ON.

]]> -LNO:sclrze Turn ON or OFF the optimization that replaces an array by a scalar variable. The default is ON.

]]> -LNO:simd -LNO:simd=(0|1|2) : This option enables or disables inner loop vectorization.

0 = Turn off the vectorizer.

1 = (Default) Vectorize only if the compiler can determine that there is no undesirable performance impact due to sub-optimal alignment. Vectorize only if vectorization does not introduce accuracy problems with floating-point operations.

2 = Vectorize without any constraints (most aggressive).

]]> -LNO:trip_count -LNO:trip_count=N : This flag is to provide an assumed loop trip-count if it is unknown at compile time. LNO uses this information for loop transformations and prefetch, etc. N can be any positive integer, and the default value is 1000.

]]> -LNO:vintr -LNO:vintr=(0|1|2) : This flag controls loop vectorization to make use of vector intrinsic routines (Note: a vector intrinsic routine is called once to compute a math intrinsic for the entire vector). -LNO:vintr=1 is the default. -LNO:vintr=0 turns off the vintr optimization. Under -LNO:vintr=2 the compiler will do aggressive optimization for all vector intrinsic routines. Note that -LNO:vintr=2 could be unsafe in that some of these routines could have accuracy problems.

]]> -OPT:alias The -OPT: option group controls miscellaneous optimizations. These options override defaults based on the main optimization level.

-OPT:alias=<name>
Specify the pointer aliasing model to be used. By specifying one or more of the following for <name>, the compiler is able to make assumptions throughout the compilation:

typed
Assume that the code adheres to the ANSI/ISO C standard which states that two pointers of different types cannot point to the same location in memory. This is ON by default when -OPT:Ofast is specified.

restrict
Specify that distinct pointers are assumed to point to distinct, non-overlapping objects. This is OFF by default.

disjoint
Specify that any two pointer expressions are assumed to point to distinct, non-overlapping objects. This is OFF by default.

no_f90_pointer_alias
Specify that any two Fortran 90 pointer expressions are assumed to point to distinct, non-overlapping objects. This is OFF by default.
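The reason these aliasing assumptions matter can be shown with a small Python sketch, using lists in place of pointers. The first function re-reads `src[0]` on every iteration; the second hoists that read out of the loop, which is what an optimizer may do once it assumes `dst` and `src` do not overlap. The function names are hypothetical and this is only an analogy for the compiler's behavior.

```python
def add_first_reload(dst, src):
    """Reads src[0] on every iteration (safe even if dst and src overlap)."""
    for i in range(len(dst)):
        dst[i] = dst[i] + src[0]


def add_first_hoisted(dst, src):
    """Reads src[0] once, as an optimizer may do when told dst and src are distinct."""
    first = src[0]
    for i in range(len(dst)):
        dst[i] = dst[i] + first
```

With distinct lists both functions give the same answer. When the same list is passed for both arguments, the first iteration changes `src[0]`, so the two versions diverge; that is the kind of silent change a wrong aliasing assumption can cause.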

]]> -OPT:div_split -OPT:div_split=(ON|OFF)
Enable or disable changing x/y into x*(recip(y)). This is OFF by default, but enabled by -OPT:Ofast or -OPT:IEEE_arithmetic=3. This transformation generates fairly accurate code.
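A minimal Python sketch of why this transformation is only "fairly accurate": computing the reciprocal first introduces an extra rounding step, so the product can differ from the directly rounded quotient in the last bit. The function names are illustrative; Python floats are IEEE 754 doubles on common platforms, which is assumed here.

```python
def divide(x, y):
    """Single correctly rounded division."""
    return x / y


def divide_split(x, y):
    """Division rewritten as multiplication by a reciprocal (two roundings)."""
    recip = 1.0 / y
    return x * recip
```

For example, `divide(49.0, 49.0)` is exactly 1.0, while `divide_split(49.0, 49.0)` is slightly below 1.0 because 1/49 is not exactly representable.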

]]> -OPT:fast_complex -OPT:fast_complex
Setting fast_complex=ON enables fast calculations for values declared to be of the type complex. When this is set to ON, complex absolute value (norm) and complex division use fast algorithms that overflow for an operand (the divisor, in the case of division) that has an absolute value that is larger than the square root of the largest representable floating-point number. This would also apply to an underflow for a value that is smaller than the square root of the smallest representable floating point number.
OFF is the default.
fast_complex=ON is enabled if -OPT:roundoff=3 is in effect.
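The overflow behavior described above can be reproduced with a short Python sketch. `abs_fast` squares the components directly, so it overflows once a component exceeds the square root of the largest double; `abs_scaled` divides by the larger component first. Both functions are illustrative assumptions about the general technique, not the compiler's actual runtime routines.

```python
import math


def abs_fast(re, im):
    """Direct formula: cheap, but re*re can overflow to infinity."""
    return math.sqrt(re * re + im * im)


def abs_scaled(re, im):
    """Scaled formula: extra divisions, but no premature overflow."""
    big = max(abs(re), abs(im))
    if big == 0.0:
        return 0.0
    r = re / big
    i = im / big
    return big * math.sqrt(r * r + i * i)
```

For ordinary magnitudes the two agree; for components around 1e200 the direct formula returns infinity while the scaled one returns a finite value.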

]]> -OPT:goto -OPT:goto
Disable or enable the conversion of GOTOs into higher-level structures like FOR loops. The default is ON for -O2 or higher.

]]> -OPT:IEEE_arithmetic -OPT:IEEE_arithmetic,IEEE_arith,IEEE_a=(1|2|3)
Specify the level of conformance to IEEE 754 floating-point roundoff/overflow behavior. The options can be one of the following:

1 Adhere to IEEE accuracy. This is the default when optimization levels -O0, -O1 and -O2 are in effect.

2 May produce inexact result not conforming to IEEE 754. This is the default when -O3 is in effect.

3 All mathematically valid transformations are allowed.

]]> -OPT:malloc_alg -OPT:malloc_alg=(0|1)
Select an alternate malloc algorithm which may improve speed. The compiler adds setup code in the C/C++/Fortran "main" function to enable the chosen algorithm. The default is 1 when -OPT:Ofast is specified. Otherwise, the default is 0.

]]> -OPT:Ofast -OPT:Ofast
Use optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating point accuracy due to rearrangement of computations. This effectively turns on the following optimizations: -OPT:ro=2:Olimit=0:div_split=ON:alias=typed:malloc_alg=1.

]]> -OPT:Olimit -OPT:Olimit=N
Disable optimization when size of program unit is > N. When N is 0, program unit size is ignored and optimization process will not be disabled due to compile time limit. The default is 0 when -OPT:Ofast is specified, 9000 when -O3 is specified; otherwise the default is 6000.

]]> -OPT:ro -OPT:roundoff,ro=(0|1|2|3)
Specify the level of acceptable departure from source language floating-point, round-off, and overflow semantics. The options can be one of the following:

0 = Inhibit optimizations that might affect the floating-point behavior. This is the default when optimization levels -O0, -O1, and -O2 are in effect.

1 = Allow simple transformations that might cause limited round-off or overflow differences. Compounding such transformations could have more extensive effects. This is the default when -O3 is in effect.

2 = Allow more extensive transformations, such as the reordering of reduction loops. This is the default level when -OPT:Ofast is specified.

3 = Enable any mathematically valid transformation.
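The effect of reordering a reduction loop, as permitted from level 2 upward, can be seen in a small Python sketch. `sum_in_order` adds the values left to right; `sum_two_lanes` keeps two partial sums over alternating elements and combines them at the end, which is one way a reordered reduction might look. The function names and the two-lane split are assumptions for illustration.

```python
def sum_in_order(values):
    """Left-to-right summation, as written in the source."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_two_lanes(values):
    """Reordered summation: even and odd elements accumulate separately."""
    lanes = [0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 2] += v
    return lanes[0] + lanes[1]
```

With the input `[1e16, 1.0, -1e16, 1.0]` the in-order sum loses the first 1.0 to rounding, while the two-lane sum does not, so the two results differ even though both orderings are mathematically equivalent.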

]]> -OPT:rsqrt -OPT:rsqrt=(0|1|2)
This option specifies if the RSQRT machine instruction should be used to calculate reciprocal square root. RSQRT is faster but potentially less accurate than the regular square root operation.
0 means not to use RSQRT.
1 means to use RSQRT followed by instructions to refine the result.
2 means to use RSQRT by itself.
Default is 1 when -OPT:roundoff=2 or greater, else the default is 0.
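The difference between settings 1 and 2 can be sketched in Python by simulating an approximate reciprocal square root and then applying one Newton-Raphson refinement step. The relative error of about 2^-12 used for the estimate and the single-step refinement are assumptions for illustration; the instruction sequence the compiler actually emits is not described here.

```python
import math


def rsqrt_estimate(x, rel_error=2.0 ** -12):
    """Stand-in for a hardware estimate: exact value perturbed by rel_error."""
    return (1.0 / math.sqrt(x)) * (1.0 + rel_error)


def refine(x, y):
    """One Newton-Raphson step for y ~ 1/sqrt(x)."""
    return y * (1.5 - 0.5 * x * y * y)


def rel_err(x, y):
    """Relative error of y against 1/sqrt(x)."""
    exact = 1.0 / math.sqrt(x)
    return abs(y - exact) / exact
```

One refinement step roughly squares the relative error, which is why the refined variant is much closer to the regular square root result than the raw estimate.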

]]> -OPT:treeheight -OPT:treeheight=(ON|OFF)
The value ON enables re-association in expressions to reduce the expressions' tree height. The default is OFF.

]]> -OPT:unroll_size -OPT:unroll_size=N
Set the ceiling of maximum number of instructions for an unrolled inner loop. If N=0, the ceiling is disregarded. The default is 40.

]]> -OPT:unroll_times_max -OPT:unroll_times_max=N
Unroll inner loops by a maximum of N. The default is 4.

]]> -WOPT:aggstr The -WOPT: option group specifies options that affect the global optimizer. The options are enabled at -O2 or above.

-WOPT:aggstr=N
This controls the aggressiveness of the strength reduction optimization performed by the scalar optimizer, in which induction expressions within a loop are replaced by temporaries that are incremented together with the loop variable. When strength reduction is overdone, the additional temporaries increase register pressure, resulting in excessive register spills that decrease performance. The value specified must be a non-negative integer, which specifies the maximum number of induction expressions that will be strength-reduced across an index variable increment. When set at 0, strength reduction is only performed for non-trivial induction expressions. The default is 11.
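Strength reduction of an induction expression can be sketched in Python as follows. The first function recomputes `base + i * stride` with a multiplication on every iteration; the second keeps a temporary that is incremented alongside the loop variable. The function names are hypothetical, and the extra variable `addr` stands for the temporary that adds to register pressure when too many expressions are reduced.

```python
def addresses_multiply(base, stride, n):
    """Induction expression evaluated with a multiply per iteration."""
    return [base + i * stride for i in range(n)]


def addresses_reduced(base, stride, n):
    """Same values, using a temporary incremented together with the loop."""
    out = []
    addr = base              # temporary introduced by strength reduction
    for _ in range(n):
        out.append(addr)
        addr += stride       # add replaces the multiply
    return out
```

Each reduced expression needs its own temporary, which is the cost that the limit N is meant to keep in check.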

]]> -WOPT:if_conv -WOPT:if_conv=(0|1|2):
Controls the optimization that translates simple IF statements to conditional move instructions in the target CPU. Setting to 0 suppresses this optimization. The value of 1 designates conservative if-conversion, in which the context around the IF statement is used in deciding whether to if-convert. The value of 2 enables aggressive if-conversion by causing it to be performed regardless of the context. The default is 1.

]]> -WOPT:mem_opnds -WOPT:mem_opnds=(ON|OFF)
Makes the scalar optimizer preserve any memory operands of arithmetic operations so as to help bring about subsumption of memory loads into the operands of arithmetic operations. Load subsumption is the combining of an arithmetic instruction and a memory load into one instruction. Default is OFF.

]]> -WOPT:retype_expr -WOPT:retype_expr=(ON|OFF)
Enables the optimization in the compiler that converts 64-bit address computation to use 32-bit arithmetic as much as possible. Default is OFF.

]]> -WOPT:unroll Control the unrolling of innermost loops in the scalar optimizer. Setting to 0 suppresses this unroller. The default is 1, which makes the scalar optimizer unroll only loops that contain IF statements. Setting to 2 makes the unrolling also apply to loop bodies that are straight line code, which duplicates the unrolling done in the code generator, and is thus unnecessary. The default setting of 1 makes this unrolling complementary to what is done in the code generator. This unrolling is not affected by the unrolling options under the -OPT group.

]]> -GRA:optimize_boundary The -GRA: option group controls the Global Register Allocator.

-GRA:optimize_boundary=(ON|OFF)
Allow the Global Register Allocator to allocate the same register to different variables in the same basic-block. Default is OFF.

]]> -LANG:copyinout The -LANG: option group controls language options.

-LANG:copyinout=(ON|OFF)
When an array section is passed as the actual argument in a call, the compiler sometimes copies the array section to a temporary array and passes the temporary array, thus promoting locality in the accesses to the array argument. This optimization is relevant only to Fortran, and this flag controls the aggressiveness of this optimization. The default is ON for -O2 or higher and OFF otherwise.

]]> -TENV:frame_pointer The -TENV: option group specifies the target environment. These options control the target environment assumed and/or produced by the compiler.

-TENV:frame_pointer=(ON|OFF)
Default is ON for C++ and OFF otherwise. Local variables in the function stack frame are addressed via the frame pointer register. Ordinarily, the compiler will replace this use of frame pointer by addressing local variables via the stack pointer when it determines that the stack pointer is fixed throughout the function invocation. This frees up the frame pointer for other purposes. Turning this flag on forces the compiler to use the frame pointer to address local variables. This flag defaults to ON for C++ because the exception handling mechanism relies on the frame pointer register being used to address local variables. This flag can be turned OFF for C++ for programs that do not throw exceptions.

]]> Link with static libraries.

]]> Statically link with the PGI runtime libraries. System libraries may still be dynamically linked.

]]> Link with dynamic libraries.

]]> -static -static
Suppress dynamic linking at runtime for shared libraries; use static linking instead.

]]> Specifies a directory to search for libraries. Use -L to add directories to the search path for library files. Multiple -L options are valid. However, the position of multiple -L options is important relative to -l options supplied.

]]> -L/path/to/libs Link using MicroQuill's SmartHeap 8 (32-bit) library for Linux. Description from MicroQuill:

SmartHeap is a fast (3X-100X faster than compiler-supplied libraries), portable (Windows, Linux, Solaris, HP-UX, IBM-AIX, Dec OSF Tru64, SGI Irix), reliable, ANSI-compliant malloc/operator new library. SmartHeap supports multiple memory pools, includes a fixed-size allocator, and is thread-safe. SmartHeap also includes comprehensive memory debugging APIs to detect leakage, overwrites, double-frees, wild pointers, out of memory, references to previously freed memory, and other memory errors.

]]>