At high rank counts, an integer overflow bug causes unnecessary loop iterations, which results in abnormally long runtimes at certain rank counts. The fix is to bound the loops so that the overflow cannot occur.

In the original parallel.f:

       DO NX = 1, NPROCS
          DO NY = 1, NPROCS
             DO NZ = 1, NPROCS
                IF ( NX * NY * NZ .EQ. NPROCS ) THEN
                   ......
                   DO K = 1, NZ
                      DO J = 1, NY
                         DO I = 1, NX


The problem occurs when the product NX*NY*NZ exceeds the range of a 4-byte integer. The fix is to avoid the unnecessary iterations with the following changes, which keep NX*NY*NZ at or below NPROCS:

        DO NX = 1, NPROCS
           DO NY = 1, NPROCS/NX
              DO NZ = 1, NPROCS/NX/NY
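
As a rough illustration of why the bounded loops are safe, the standalone sketch below counts the ordered (NX, NY, NZ) triples whose product equals NPROCS. It is not code from parallel.f: the program name, the counter `nfound`, and the value `nprocs = 2048` are assumptions chosen for the example, and it assumes the default INTEGER is 4 bytes, as on most compilers. With 2048 ranks the unbounded loops would evaluate products as large as 2048**3 = 2**33, which does not fit in a 4-byte integer, while the bounded loops never form a product above NPROCS.

```fortran
program count_decompositions
  implicit none
  ! Hypothetical rank count: nprocs**3 = 2**33 exceeds the range of a
  ! default 4-byte INTEGER, so unbounded loops could overflow the product.
  integer, parameter :: nprocs = 2048
  integer :: nx, ny, nz, nfound

  nfound = 0
  do nx = 1, nprocs
     ! ny <= nprocs/nx guarantees nx*ny <= nprocs
     do ny = 1, nprocs / nx
        ! nz <= nprocs/nx/ny guarantees nx*ny*nz <= nprocs
        do nz = 1, nprocs / nx / ny
           ! The product is bounded by nprocs here, so it cannot overflow.
           if (nx * ny * nz == nprocs) nfound = nfound + 1
        end do
     end do
  end do

  print *, 'decompositions found:', nfound
end program count_decompositions
```

Because the integer divisions round down, every valid decomposition is still visited: any triple with NX*NY*NZ equal to NPROCS satisfies NY <= NPROCS/NX and NZ <= NPROCS/NX/NY.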

