Topic: Troubleshooting "The coefficient of variation among hosts was x%" errors

GregDarnell

In multi-node and blade server configurations, if the nodes are not all identically configured, the variation in performance between nodes can be too high.  In this case, the run will be invalidated with an error such as "INVALID: The coefficient of variation among hosts was 17.6%, which is greater than the threshold of 10.0% in the 100% interval of worklet Compress".
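
For anyone unfamiliar with the statistic, the coefficient of variation (CV) is the standard deviation divided by the mean, expressed as a percentage. The sketch below is only an illustration: the host names and scores are made up, and the use of the population standard deviation is my assumption, not a statement of how SERT computes the value internally.

```python
from statistics import mean, pstdev


def coefficient_of_variation(host_scores):
    """Return the CV of per-host scores as a percentage.

    Assumption: population standard deviation divided by the mean.
    SERT's exact formula may differ.
    """
    values = list(host_scores.values())
    return 100.0 * pstdev(values) / mean(values)


# Hypothetical per-host scores for a single measurement interval.
scores = {"node1": 100.0, "node2": 100.0, "node3": 100.0, "node4": 60.0}

cv = coefficient_of_variation(scores)
print(f"CV = {cv:.1f}%")  # prints "CV = 19.2%"
```

Note that one slow node out of four is enough to push the CV well past the 10.0% threshold quoted in the error message.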

In virtually all cases we have seen, the issue was caused by configuration problems in one or more nodes.  SERT requires that all nodes be identically configured.   

However, configuration mistakes and hardware issues can occur, and this error can be difficult to isolate. There are several steps that should be followed to isolate the problem node(s) and then test the fixes.

If you are using the GUI, you can preview the discovery data early in the test by using the preview button in the Discovery section.  For multi-node configurations, any differences detected by the discovery process will be highlighted, and those should be resolved before continuing.  The report will identify many common causes of variability, such as mismatched processors or memory.

If the discovery process does not find any mismatches between nodes, and a run has been completed with errors, you can use an XSL template provided with SERT to generate a detailed result file. 

If, for example, your results file is in results\sert-0005, you can execute the following command from the SERT base directory:
reporter.bat -s -c -x results\XSL-File\csv-subset-detail.xsl -o result-subset-detail.csv -r results\sert-0005\results.xml

This will produce a CSV file in results\sert-0005 (named result-subset-detail.csv by the -o option in the command above), which includes performance results for every Java instance for each SERT measurement interval.  Each individual result is named "hostname:instance", where instance identifies a Java instance on that node.  For a system with high variation between hosts, it is usually fairly easy to visually inspect the CSV file and identify the node(s) whose values are inconsistent with the majority.
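
If there are too many nodes to check by eye, the same comparison can be scripted. The following is a minimal sketch, not part of SERT: it assumes you have already pulled the scores for one worklet interval out of the CSV into a dictionary keyed by the "hostname:instance" names, and the function names, sample values, and 10% tolerance are all hypothetical choices for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def host_means(instance_scores):
    """Average the per-instance scores for each host.

    Keys are expected in the "hostname:instance" form used in the CSV.
    """
    per_host = defaultdict(list)
    for name, score in instance_scores.items():
        # Split at the last colon so the instance suffix is dropped.
        host, _, _instance = name.rpartition(":")
        per_host[host].append(score)
    return {host: mean(values) for host, values in per_host.items()}


def flag_outliers(instance_scores, tolerance=0.10):
    """Return hosts whose mean differs from the median host by more than tolerance."""
    means = host_means(instance_scores)
    typical = median(means.values())
    return sorted(
        host for host, value in means.items()
        if abs(value - typical) / typical > tolerance
    )


# Hypothetical scores copied by hand from one interval of the CSV.
interval_scores = {
    "node1:0": 100.0, "node1:1": 102.0,
    "node2:0": 99.0, "node2:1": 101.0,
    "node3:0": 70.0, "node3:1": 72.0,
}

suspects = flag_outliers(interval_scores)
print(suspects)  # prints ['node3']
```

Comparing against the median rather than the mean keeps one badly behaved node from dragging the reference value toward itself.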

There are many potential causes of such variations, including differing BIOS settings, thermal issues, memory errors, and incorrect memory population.

To shorten the time required to test possible fixes, it is recommended that you pick a known "good" node and run only the worklet showing the highest variation, using those results as a baseline.  Then test each inconsistent node in the same way, applying fixes until its results match the baseline.  Once the problem nodes have performance matching the baseline node, you should be able to do a full run of all nodes without any further CV errors.
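
The baseline comparison described above can be reduced to a simple relative-difference check. This is only a sketch under my own assumptions: the 5% tolerance, the function name, and the scores are hypothetical, and what counts as "matching" is a judgment call for your configuration.

```python
def matches_baseline(baseline_score, node_score, tolerance=0.05):
    """Return True if node_score is within tolerance (a fraction) of baseline_score."""
    return abs(node_score - baseline_score) / baseline_score <= tolerance


# Hypothetical single-worklet scores: one known "good" node and two retested nodes.
baseline = 1000.0
retests = {"node3": 700.0, "node7": 990.0}

status = {host: matches_baseline(baseline, score) for host, score in retests.items()}
print(status)  # prints {'node3': False, 'node7': True}
```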