Topic: INVALID: x: Transactions per second was x% of target (threshold is at least x%)

Okkreh

Hi,

I saw a thread more than a year old about something similar, with no progress or resolution, so I decided to create a new topic.

Issue:

Quote
WARNING: CPU:LU:100%: Transactions per second was 103.2% of target, near the limit of at most 104.0%
INVALID: CPU:Sort:75%: Transactions per second was 95.4% of target (threshold is at least 98.0%)
WARNING: CPU:Compress:100%: The coefficient of variation among clients was 74.5%
WARNING: CPU:Compress:75%: The coefficient of variation among clients was 72.2%
WARNING: CPU:Compress:50%: The coefficient of variation among clients was 72.2%
WARNING: CPU:Compress:25%: The coefficient of variation among clients was 72.2%
WARNING: CPU:CryptoAES:100%: The coefficient of variation among clients was 83.8%
WARNING: CPU:CryptoAES:75%: The coefficient of variation among clients was 85.2%
WARNING: CPU:CryptoAES:50%: The coefficient of variation among clients was 85.2%
WARNING: CPU:CryptoAES:25%: The coefficient of variation among clients was 85.2%
WARNING: CPU:Sort:75%: The coefficient of variation among clients was 5.6%
WARNING: CPU:SSJ:87.5%: The coefficient of variation among clients was 79.0%
WARNING: CPU:SSJ:75%: The coefficient of variation among clients was 78.8%
WARNING: CPU:SSJ:62.5%: The coefficient of variation among clients was 78.8%
WARNING: CPU:SSJ:50%: The coefficient of variation among clients was 78.7%
WARNING: CPU:SSJ:37.5%: The coefficient of variation among clients was 79.1%
WARNING: CPU:SSJ:25%: The coefficient of variation among clients was 78.7%
WARNING: CPU:SSJ:12.5%: The coefficient of variation among clients was 79.1%
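The INVALID and WARNING lines above compare each interval's measured throughput against the target for that load level. As a rough illustration only (this is not SERT's code), the check can be sketched as follows. The 98.0% and 104.0% limits are copied from the log lines; the `NEAR_MARGIN` warning band and the function names are hypothetical.

```python
# Minimal sketch of a throughput-vs-target check like the one behind the log lines.
# ASSUMPTIONS: this is not SERT's implementation. MIN_PCT and MAX_PCT are copied
# from the log output; NEAR_MARGIN is a hypothetical "near the limit" band.

MIN_PCT = 98.0      # "threshold is at least 98.0%"
MAX_PCT = 104.0     # "near the limit of at most 104.0%"
NEAR_MARGIN = 1.0   # hypothetical warning band, in percentage points


def percent_of_target(measured_tps: float, target_tps: float) -> float:
    """Return measured throughput as a percentage of the load level's target."""
    if target_tps <= 0:
        raise ValueError("target_tps must be positive")
    return 100.0 * measured_tps / target_tps


def classify(pct: float) -> str:
    """Classify a percentage-of-target value as INVALID, WARNING or OK."""
    if pct < MIN_PCT or pct > MAX_PCT:
        return "INVALID"  # outside the allowed band
    if pct - MIN_PCT <= NEAR_MARGIN or MAX_PCT - pct <= NEAR_MARGIN:
        return "WARNING"  # inside the band but close to one of its edges
    return "OK"
```

Under these assumptions, `classify(95.4)` returns "INVALID" (like CPU:Sort:75%) and `classify(103.2)` returns "WARNING" (like CPU:LU:100%).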

Setup:

Quote
Windows 10 as "test orchestrator"
RHEL 7.7 as "SUT"
Yokogawa WT500
PTDaemon 1.9.2
PCsensor USB9097+DS18B20

Number of Nodes   1
CPU Name   Intel(R) Xeon(R) Gold 6330N CPU @ 2.20GHz
Total Number of Processors   2
Total Number of Cores   56
Total Number of Threads   112
Total Physical Memory   251.3 GB
Total Number of Memory DIMMs   16
Total Number of Storage Devices   2

SERT   2.0.3
OS Version   7.7
Filesystem   XFS
Additional Software   None
JVM Vendor   OpenJDK
JVM Version   11.0.1+13
Client Configuration ID   Intel_Lin_HS18_2

Debug/other:

We have experimented quite a bit, and now the only worklet still reporting "INVALID" is "CPU:Sort 75%". We've tried adjusting the CPU warmup timers, but once we go above "8", some other tests start reporting "INVALID". The effect on "Sort" seems minor and random, and its result always stays in the "INVALID" range anyway.

Quote
CPU: Sort
Total Clients   112
CPU Threads per Client   1
Sample Client Command-line   numactl -l --physcpubind=88 /opt/jdk-11.0.1/bin/java -classpath lib/sert.jar:lib/chauffeur.jar:lib/chauffeurCommon.jar:lib/ptdaemonClientApi.jar:lib/mtrandom.jar:lib/xsrandom.jar:lib/saxon9he.jar:lib/groovy.jar:lib/groovy-jsr223.jar -Djava.util.logging.config.file=logging.properties -DtotalHostHardwareThreads=112 -Xms256m -Xmx256m -XX:+UseParallelOldGC -XX:+AggressiveOpts -XX:+UseLargePages -XX:ParallelGCThreads=1 -Djava.security.egd=file:/dev/./urandom -XX:SurvivorRatio=60 -XX:TargetSurvivorRatio=90 org.spec.chauffeur.client.ClientJvm -director localhost:33137 -jvmid 65 -numJvms 112 -hostId localhost

Any ideas are welcome.

A few things I'm still unsure about or haven't had time to study:

- Is the discrepancy between the client, CPU, and thread counts (112 clients, 88 CPUs, 112 threads) expected?
- Does the host OS installation (RHEL 7.7, kernel 3.10.0-1062) fully support this CPU? (CPU frequency governors etc. aren't currently available with this OS.)
- Am I right that no "Client Configuration ID" is available for RHEL 8 with this CPU/setup?
- Any idea whether switching to Windows Server 2019 with "Intel_Win_HS18_5" could work as a workaround (WA)?

BR,

GregDarnell

  • Moderator
I'd suggest that you focus on fixing the apparent memory imbalance, as shown by the extremely high client CVs (coefficients of variation), before trying to solve the Sort issue. The extremely high client CVs on the most memory-intensive CPU worklets indicate that some threads have much better access to memory than others. I suspect that the memory worklets would fail with similar or worse client CVs, or perhaps not complete at all.
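For reference, the coefficient of variation (CV) is the standard deviation of the per-client results divided by their mean, expressed as a percentage. A minimal sketch, assuming per-client throughput values are available; it uses the population standard deviation, and whether SERT uses the population or sample form is an assumption. The function name and the numbers below are hypothetical.

```python
import statistics


def client_cv_percent(client_tps):
    """Coefficient of variation (%) of per-client throughput.

    ASSUMPTION: population standard deviation (statistics.pstdev);
    SERT may use a different form.
    """
    mean = statistics.mean(client_tps)
    if mean == 0:
        raise ValueError("mean throughput is zero")
    return 100.0 * statistics.pstdev(client_tps) / mean
```

With perfectly balanced clients, for example `[100, 100, 100, 100]`, the CV is 0%. With half the clients at 100 and half at 20 transactions per second, the CV is about 66.7%. This shows how two groups of clients with very different memory access can push the CV into the range reported above for Compress, CryptoAES and SSJ.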

There are many possible causes. The most common is that the memory is not installed symmetrically between processors. Other potential issues are a failing DIMM, memory mirroring or sparing being enabled, or problems with the OS installation, such as affinity/numactl. Changing to a Windows OS would help with the last issue on that list, but not the others.
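One way to check for asymmetric population is to list the DIMM size in each slot, per socket, and compare the layouts. A minimal sketch, assuming you have already collected the per-slot sizes by hand or from a tool such as `dmidecode -t memory`. The dictionary layout and the `is_symmetric` name are hypothetical, and the example layout (8 channels per socket, 16 GB DIMMs) is illustrative only. The sketch compares sizes only, not DIMM speed, rank or health.

```python
def is_symmetric(population):
    """Return True if every socket has the same DIMM layout.

    population: hypothetical mapping of socket id -> list of DIMM sizes in GB,
    one entry per slot in channel order, with 0 for an empty slot.
    """
    layouts = {tuple(slots) for slots in population.values()}
    return len(layouts) <= 1


balanced = {0: [16] * 8, 1: [16] * 8}
unbalanced = {0: [16] * 8, 1: [16] * 6 + [0, 0]}
```

Here `is_symmetric(balanced)` is True, while `is_symmetric(unbalanced)` is False because socket 1 has two empty channels.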

The "88" you are referring to is simply an example of the processor binding for one particular thread and is not related to the CPU quantity.
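To illustrate the point about the binding: each client command line carries its own `--physcpubind` value, so the 88 in the sample is simply the logical CPU that client `-jvmid 65` (of `-numJvms 112`) was pinned to. The small sketch below pulls those values out of an abbreviated copy of the sample command line ("..." marks the omitted options). It assumes a single CPU id rather than a range, and the function names are hypothetical.

```python
import shlex

# Abbreviated copy of the sample client command line; "..." marks omitted options.
SAMPLE = ("numactl -l --physcpubind=88 /opt/jdk-11.0.1/bin/java ... "
          "org.spec.chauffeur.client.ClientJvm -director localhost:33137 "
          "-jvmid 65 -numJvms 112 -hostId localhost")


def describe_binding(cmdline: str) -> dict:
    """Extract the CPU binding and client identity from one client command line.

    Handles only a single CPU id in --physcpubind (not ranges such as 0-3).
    """
    tokens = shlex.split(cmdline)
    cpu = next(int(t.split("=", 1)[1]) for t in tokens
               if t.startswith("--physcpubind="))
    jvm_id = int(tokens[tokens.index("-jvmid") + 1])
    num_jvms = int(tokens[tokens.index("-numJvms") + 1])
    return {"cpu": cpu, "jvmid": jvm_id, "num_jvms": num_jvms}


def expected_clients(hw_threads: int, threads_per_client: int) -> int:
    """Number of clients needed to give every hardware thread its own client."""
    return hw_threads // threads_per_client
```

With 112 hardware threads and 1 CPU thread per client, `expected_clients(112, 1)` gives 112. This matches the 112 clients reported, so the counts are consistent.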

Okkreh

Hi,

When I created the post, our setup had 1 DIMM per memory channel (16 of 32 slots populated). After your comments, we populated all DIMM slots (32/32), and at first this seemed to resolve the issue (in a partial run). But as soon as we did a full run, the same warnings and INVALID results appeared.

Then we reverted to the 16/32 DIMM setup and noticed that the outcome varies fairly randomly between server restarts. By random I mean that a specific test might run without any INVALIDs or warnings, and every 2nd or 3rd server reset/full run produced no warnings or INVALIDs at all. In the end, neither the memory setup (16 or 32 DIMMs) nor the SERT training values seemed to matter much.

We experimented with the setup quite a bit and couldn't reach a solid conclusion. The good news is that we ultimately got valid results with both memory setups. But it was a huge struggle, as we couldn't understand how, e.g., three back-to-back runs could have such different outcomes when nothing in the software or hardware setup changed.

BR,

SertUser

Hello,

I am seeing a similar error with the CPU "LU" worklet, with no memory issues. Everything else seems to be valid except this one worklet.

Could you please shed some light on how to resolve this? Maybe another run with updated measurement times? I am using the latest SERT version on a server with a single CPU and more than 1 GB of memory allocated per thread.

SertUser

Sanjay Sharma

  • Moderator
This response applies to the last two comments in this thread:

@Okkreh - We're sorry to hear that you had to go through numerous tries to get a valid set of results with SERT. We have occasionally seen issues with a worklet or two in the past; unless they were caused by an unbalanced hardware configuration or faulty hardware, they have usually been alleviated by adding more warmup runs. The situation you describe is not something we have encountered, and we understand that it can be very frustrating. We would need to investigate this further, so we ask that you file a SERT defect and provide us with the logs. You can file the defect using the SERT Support Request Form located here: https://www.spec.org/sert/feedback/issuereport.html, attaching the log files as described in section 7.1 of the user guide: https://www.spec.org/sert2/SERT-userguide.pdf


@SertUser - For your issue, if you have not already tried it, I would recommend increasing the number of warmup intervals. The mechanics of doing so are documented in section 7.2.3 of the SERT User Guide located at: https://www.spec.org/sert2/SERT-userguide.pdf.