SPEC Community

Product Support => SPECvirt_sc2013 => Topic started by: Miles on August 03, 2017, 01:42:12 AM

Title: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 03, 2017, 01:42:12 AM
Hi
I tried to run a 2T1W (batch workload) test, but it failed.
I use two clients, client1 and client2, and execute runspecvirt.sh on client1.
Should I correct the hosts files?
Thanks.

primectrl.out
2017-08-03 11:36:08:456 Thu Aug 03 11:36:08 CST 2017
2017-08-03 11:36:08:456 specvirt: maxPreRunTime = 1501
2017-08-03 11:36:08:456 specvirt: runTime = 7200
2017-08-03 11:36:08:456 specvirt: runTime = 7200
2017-08-03 11:36:08:457 specvirt: runTime = 600
2017-08-03 11:36:08:457 specvirt: runTime = 600
2017-08-03 11:36:08:458 Validator: [WARNING] WORKLOAD_LABEL[0] value is: Batch Server; should be Application Server
2017-08-03 11:36:08:458 Validator: [WARNING] WORKLOAD_SCORE_TMAX_VALUE[3] value is: 143.60; should be 0
2017-08-03 11:36:08:458 Validator: [WARNING] WORKLOAD_LABEL[3] value is: Mail Server; should be Batch Server
2017-08-03 11:36:08:458 Validator: [WARNING] WORKLOAD_SCORE_TMAX_VALUE[2] value is: 174.30; should be 143.60
2017-08-03 11:36:08:458 Validator: [WARNING] WORKLOAD_LABEL[2] value is: Application Server; should be Mail Server
2017-08-03 11:36:08:458 Validator: [WARNING] WORKLOAD_LOAD_LEVEL[0] value is: 0; should be 100
2017-08-03 11:36:08:458 Validator: [WARNING] WORKLOAD_LOAD_LEVEL[3] value is: 500; should be 0
2017-08-03 11:36:08:459 Validator: [WARNING] NUM_WORKLOADS value is: 1; should be 4
2017-08-03 11:36:08:459 Validator: [WARNING] WORKLOAD_LOAD_LEVEL[2] value is: 100; should be 500
2017-08-03 11:36:08:459 Validator: [WARNING] WORKLOAD_SCORE_TMAX_VALUE[0] value is: 0; should be 174.30
2017-08-03 11:36:08:459 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain Atomicity.html
2017-08-03 11:36:08:459 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain  Audit.report
2017-08-03 11:36:08:459 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain  Dealer.detail
2017-08-03 11:36:08:460 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain  Dealer.summary
2017-08-03 11:36:08:460 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain  Mfg.detail
2017-08-03 11:36:08:460 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain  Mfg.summary
2017-08-03 11:36:08:460 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain  result.props
2017-08-03 11:36:08:460 Validator: [WARNING] RESULT_FILE_NAMES[0] must contain  SPECjAppServer.summary
2017-08-03 11:36:08:460 Validator: [WARNING] Non-compliant configuration.
2017-08-03 11:36:08:460 [WARNING] This will be a non-compliant benchmark result!
2017-08-03 11:36:08:480 RMI server started: client1:9990
2017-08-03 11:36:08:483 [INFO] This is a perf-only benchmark run. Skipping active idle polling interval.
2017-08-03 11:36:08:483 PrimeControl: preparing client drivers.
2017-08-03 11:36:08:483 PrimeControl: PRIME_HOST 0 = client1:1092
2017-08-03 11:36:08:483 PrimeControl: PRIME_HOST 0 = client2:1092
2017-08-03 11:36:08:484 PrimeControl: Master 1: client1:1092
2017-08-03 11:36:08:484 PrimeControl: Master 2: client2:1092
2017-08-03 11:36:08:485 PrimeControl: adding host client1:1092
2017-08-03 11:36:08:489 PrimeControl: adding host client2:1092
2017-08-03 11:36:08:497 First client for 0: 192.168.1.8:1902
2017-08-03 11:36:08:512 PrimeControl: [ERROR] exception  thrown:
java.lang.NullPointerException
   at org.spec.virt.clientmgr.getClients(clientmgr.java:232)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
   at sun.rmi.transport.Transport$1.run(Transport.java:177)
   at sun.rmi.transport.Transport$1.run(Transport.java:174)
   at java.security.AccessController.doPrivileged(Native Method)
   at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
   at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
   at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
   at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
   at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:275)
   at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:252)
   at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:161)
   at org.spec.virt.clientmgr_Stub.getClients(Unknown Source)
   at org.spec.virt.PrimeControl.initClients(PrimeControl.java:600)
   at org.spec.virt.PrimeControl.runInterval(PrimeControl.java:326)
   at org.spec.virt.PrimeControl.access$800(PrimeControl.java:32)
   at org.spec.virt.PrimeControl$1.run(PrimeControl.java:201)
2017-08-03 11:36:08:513 PrimeControl: terminating run. Please wait...
2017-08-03 11:36:09:515 specvirt: Done!
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 03, 2017, 10:42:55 AM
Miles, if you're doing test runs, you might set POLL_INTERVAL_SEC to something shorter so you don't have to wait so long to see if it failed. Batch needs more than a half hour to run, but to make sure it's working, you could set RAMP_SECONDS = 300, WARMUP_SECONDS = 600, and POLL_INTERVAL_SEC = 900.

I need the Clientmgr*.log files to diagnose further. What's in Clientmgr*_1092.log?

Is the specpoll process running on batchserver2?

Code: [Select]
ssh batchserver2 "ps -ef|grep -i poll"
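
To repeat that check across several VMs, a minimal sketch follows. It assumes the SPECpoll process has "poll" somewhere in its command line, which is what the grep above relies on. The helper name `has_specpoll` and the VM list in the usage comment are hypothetical.

```shell
#!/bin/sh
# has_specpoll: reads `ps -ef` output on stdin and succeeds only if some
# process other than grep itself mentions "poll" (case-insensitive).
# Assumption: the SPECpoll command line contains "poll".
has_specpoll() {
    grep -i 'poll' | grep -v 'grep' | grep -q .
}

# Hypothetical usage from the prime client (hostnames are examples):
# for vm in batchserver1 batchserver2; do
#     if ssh -n "$vm" "ps -ef" | has_specpoll; then
#         echo "$vm: SPECpoll is up"
#     else
#         echo "$vm: SPECpoll is NOT running"
#     fi
# done
```

Filtering out the grep line avoids a false positive, because `ps -ef|grep -i poll` always matches its own grep process.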
Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 03, 2017, 09:22:34 PM
Hi Lisa
Yes, the specpoll process is running on batchserver2.
No error message is in Clientmgr1_1092.out.

Another question: in exampleVM, there is a "specdriver" entry in /etc/hosts, and I don't know what it means.
In a 2-tile test, is client1 or client2 the specdriver, or both?

Clientmgr1_1092.out
2017-08-03 15:02:06:287 Creating clientmgr using RMI Registry port 1092
2017-08-03 15:02:06:306 client1:1092 ready...

Clientmgr1_1088.out
2017-08-03 15:02:06:301 Creating clientmgr using RMI Registry port 1088
2017-08-03 15:02:06:320 client1:1088 ready...

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 04, 2017, 11:01:32 AM
The specdriver alias on the client is needed for appserver. The client pieces of SPECjAppServer2004 look for this alias on the appserver VM the same way they look for the specdb alias for the dbserver.

Have you tried running only one tile with batchserver? Maybe the problem is with the client when you add the second tile.

Lisa

Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 06, 2017, 09:23:32 PM
Hi
I modified WORKLOAD_CLIENTS

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 15, 2017, 12:41:05 PM
Hi
It passed with 2T1W, but failed with 2T4W.
I think there is some failure in my web workload, because the run completed with only 3 workloads (without the web workload).

primectrl.out
2017-08-15 18:46:19:718 specvirt: waiting on 5 prime clients.
2017-08-15 18:46:19:725 setting hostsReady = true
2017-08-15 18:46:19:800 specvirt: waiting on 4 prime clients.
2017-08-15 18:46:20:258 RemoteException while trying to get workload build number: exception java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
   java.io.EOFException
2017-08-15 18:46:20:259 PrimeControl: [ERROR] masters[0] build numbers (null) do not match the specvirt prime controller's (80). Please update complete harness and retry.
2017-08-15 18:46:20:259 PrimeControl: [ERROR] masters[1] build numbers (null) do not match the specvirt prime controller's (80). Please update complete harness and retry.
2017-08-15 18:46:20:259 PrimeControl: [ERROR] masters[4] build numbers (null) do not match the specvirt prime controller's (80). Please update complete harness and retry.
2017-08-15 18:46:20:259 PrimeControl: [ERROR] masters[5] build numbers (null) do not match the specvirt prime controller's (80). Please update complete harness and retry.
2017-08-15 18:46:20:259 PrimeControl: [ERROR] startMasters() failed!
2017-08-15 18:46:20:259 PrimeControl: sending abortTest() to prime clients.
2017-08-15 18:46:20:259 PrimeControl: id=1, abortID=-1
2017-08-15 18:46:20:259 PrimeControl: masters[1]=client1:1096
2017-08-15 18:46:20:260 PrimeControl: id=6, abortID=-1
2017-08-15 18:46:20:260 PrimeControl: id=7, abortID=-1
2017-08-15 18:46:20:260 PrimeControl: masters[6]=client2:1094
2017-08-15 18:46:20:260 PrimeControl: masters[7]=client2:1092
2017-08-15 18:46:20:260 PrimeControl: id=2, abortID=-1
2017-08-15 18:46:20:260 PrimeControl: id=3, abortID=-1
2017-08-15 18:46:20:260 PrimeControl: masters[2]=client1:1094
2017-08-15 18:46:20:260 PrimeControl: masters[3]=client1:1092
2017-08-15 18:46:20:260 PrimeControl: [ERROR] exception occurred sending abortTest signal to specweb_Stub[UnicastRef [liveRef: [endpoint:[192.168.1.8:39656](remote),objID:[-26afdb66:15de58053da:-7ffe, 3407266484835053569]]]]. Exception was:
 java.rmi.ConnectException: Connection refused to host: 192.168.1.8; nested exception is:
   java.net.ConnectException: Connection refused
2017-08-15 18:46:28:106 PrimeControl: id=5, abortID=-1
..

Clientmgr1_1096.out
-> 2017-08-15 18:46:19:523 SpecwebControl: * Running SPECweb_Support workload
-> 2017-08-15 18:46:19:523 Configuration: Clearing workload.
-> 2017-08-15 18:46:19:526 RemoteLoadGen: Total clients: 1
-> 2017-08-15 18:46:19:596 HttpRequestSched: [ERROR] initServers() exception reported making HTTP request:
-> java.lang.NullPointerException
-> 2017-08-15 18:46:19:596 HttpRequestSched: [ERROR] initConorg.spec.specweb.Connection@46a329dc; threadByteArray: [B@2114ebf; useSSL: false
-> 2017-08-15 18:46:19:596 HttpRequestSched: [ERROR] response: HTTP/1.0 302 Redirect
-> Date: Fri Jun 16 16:29:40 2017
...
-> 2017-08-15 18:46:19:596 RemoteLoadGen: [ERROR] Unable to successfully initialize workload variables. Terminating.
-> 2017-08-15 18:46:19:596 SpecwebControl: [ERROR] Could not create all client threads.
-> 2017-08-15 18:46:19:596 SpecwebControl: [ERROR] setupWorkload() failed!
-> 2017-08-15 18:46:19:596 SpecwebControl: [ERROR] runTests() failed!
-> 2017-08-15 18:46:19:596 SpecwebControl: [ERROR] Benchmark run failed!
-> 2017-08-15 18:46:19:601 SpecwebControl: Terminating run. Please wait...


Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 15, 2017, 01:22:40 PM
Hmm. From the prime client, please check the output of the following on all workload VMs, clients, and prime client. This case assumes that 211 is the last octet of infraserver1, 217 is the last octet of the prime client, and 10.140.3 is the network:

Code: [Select]
for i in `seq 211 217`; do  ssh 10.140.3.$i "java -jar /opt/SPECvirt/specvirt.jar -v"; echo $i;  done
Does this command report anything other than:

Code: [Select]
SPECvirt_sc2013 v1.1, build: 80
If it reports anything else, you need to make sure you've installed SPECvirt correctly. It's easiest to do a full installation on every VM.
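
To spot a mismatched host without reading every line by eye, here is a minimal sketch. It assumes the version string has the "build: N" form shown above. The helper names `build_of` and `report_mismatches` are hypothetical, and the address range in the usage comment is the same example range as in the loop above.

```shell
#!/bin/sh
# build_of: extract the build number from a line such as
# "SPECvirt_sc2013 v1.1, build: 80"; prints nothing if there is no match.
build_of() {
    sed -n 's/.*build: *\([0-9][0-9]*\).*/\1/p'
}

# report_mismatches EXPECTED: reads "host version-string" lines on stdin
# and prints one line for every host whose build differs from EXPECTED
# (including hosts where no build number could be parsed).
report_mismatches() {
    expected=$1
    while read -r host version; do
        [ -n "$host" ] || continue
        build=$(printf '%s\n' "$version" | build_of)
        [ "$build" = "$expected" ] || echo "$host: build '${build:-unknown}', expected $expected"
    done
}

# Hypothetical usage from the prime client:
# for i in $(seq 211 217); do
#     printf '%s %s\n' "10.140.3.$i" \
#         "$(ssh -n "10.140.3.$i" 'java -jar /opt/SPECvirt/specvirt.jar -v' 2>&1 | tail -n 1)"
# done | report_mismatches 80
```

No output means every host reported the expected build. A host listed as "unknown" usually means the command itself failed there, so check the raw output for that host.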

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 16, 2017, 12:11:13 AM
Hi Lisa
Do you mean that I should copy the "SPECvirt" folder onto all VMs?
The run still failed after I did that.

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 16, 2017, 11:20:28 AM
Yes, SPECvirt needs to be installed on every VM and client. You can't just copy the directory - you need the entire /opt directory with all the workload and harness software.

Rather than cloning a clean VM and installing the workload software every time, you can clone an existing, working tile. That's the easiest way to scale up. See https://www.spec.org/forums/index.php?topic=15.msg145#msg145 for the steps on cloning a tile.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 22, 2017, 05:25:43 AM
Hi
It can start the run but aborts after the polling phase starts because of
Connection refused to host: 100.100.1.8 from webserver1.
(100.100.1.8 is the SPECVIRT_HOST)

But the connection between them works well.
The firewall is disabled.

primectrl.out
            ...
2017-08-22 16:08:12:627 PrimeControl: checking polling start response times...
2017-08-22 16:08:12:628 PrimeControl: sleeping for 0 sec
2017-08-22 16:08:12:628 PrimeControl: sending results counter reset command.
2017-08-22 16:08:12:629 PrimeControl: polling for 7200 sec
0,0,2017-08-22 16:08:22:691,14,10.0,11,10.0,18,10.0,298,17.25
0,1,2017-08-22 16:08:22:734,475,0,0,475,0,0,0,0,0,2147483647,0,50
0,2,2017-08-22 16:08:22:656,730,626,104,0,2037856,699,21459
0,3,2017-08-22 16:08:22:634,1,1,3487,3487,3487,3487,0,1
1,0,2017-08-22 16:08:20:148,14,10.0,16,10.0,29,10.0,0,0.00
1,1,2017-08-22 16:08:20:170,127,41,0,86,4748035,1620291,0,0,41,8558,417399,57
1,2,2017-08-22 16:08:20:122,757,670,87,0,1984379,712,21226
1,3,2017-08-22 16:08:20:098,1,1,1700,1700,1700,1700,1,0
                    (aborted)         

Clientmgr1_1096.out
-> 2017-08-22 15:53:12:610 SpecwebControl: Warming up for 900 seconds.
-> 2017-08-22 16:08:12:612 SpecwebControl: Clearing results.
-> 2017-08-22 16:08:12:614 SpecwebControl: Starting 7200-second runtime.
-> 2017-08-22 16:08:12:633 SpecwebControl: Clearing results.
-> 2017-08-22 16:08:22:734,475,0,0,475,0,0,0,0,0,2147483647,0,50
-> 2017-08-22 16:12:12:655 RemoteLoadGen: Warning: RMI exception trying to contact client1:1010. Retrying...
-> 2017-08-22 16:12:12:656 RemoteLoadGen: [ERROR] Unable to contact client1:1010
-> 2017-08-22 16:12:12:656 RemoteLoadGen: [ERROR] 1 remote clients, but only 0 responded
-> 2017-08-22 16:12:12:656 SpecwebControl: [ERROR] Client(s) not responding. Aborting test.
-> 2017-08-22 16:12:12:656 RemoteLoadGen: [ERROR] Remote exception setting server reset data collection from client1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->    java.net.ConnectException: Connection refused
-> 2017-08-22 16:12:12:656 SpecwebControl: Stopping remote clients.
-> 2017-08-22 16:12:12:658 RemoteLoadGen: 180-second ramp-down starting.
-> 2017-08-22 16:12:12:658 RemoteLoadGen: stopping client client1:1010; abort=false
-> 2017-08-22 16:12:12:658 RemoteLoadGen: [ERROR] Remote exception stopping clients from client1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->    java.net.ConnectException: Connection refused
-> 2017-08-22 16:12:12:659 SpecwebControl: Waiting for remote clients to stop.
-> 2017-08-22 16:12:12:659 RemoteLoadGen: [ERROR] Remote exception waiting for clients to complete from client1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->    java.net.ConnectException: Connection refused
-> 2017-08-22 16:12:12:659 SpecwebControl: [ERROR] runWorkload() failed!
-> 2017-08-22 16:12:12:659 SpecwebControl: [ERROR] runTests() failed!
-> 2017-08-22 16:12:12:659 SpecwebControl: [ERROR] Benchmark run failed!
-> 2017-08-22 16:12:12:661 SpecwebControl: Terminating run. Please wait...
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 22, 2017, 01:00:05 PM
Looks like the SPECpolling agent is down on webserver1. What are the contents of:

Code: [Select]
ssh webserver1 "cat /tmp/pollme*"
Should look something like:

Code: [Select]
Creating RMI listener using RMI Registry port 8001
webserver1-int/10.100.1.8:8001 ready...

I stop and restart SPECpoll on each workload VM before every test. To automate this, have runspecvirt.sh call pollInit.sh in the helper directory, or you can run pollmecheck.sh manually to see that all SPECpoll processes are up.
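
If you want a quick manual check in addition to pollmecheck.sh, the following is a minimal sketch. It assumes a healthy /tmp/pollme* file contains a "host/ip:port ready..." line like the example above. The helper name `poll_ready` and the VM list in the usage comment are hypothetical, and this is not what pollmecheck.sh itself does.

```shell
#!/bin/sh
# poll_ready: reads the contents of /tmp/pollme* on stdin and succeeds
# only if a "<something>:<port> ready..." line is present.
poll_ready() {
    grep -q ':[0-9][0-9]* ready\.\.\.'
}

# Hypothetical usage from the prime client (hostnames are examples):
# for vm in infraserver1 webserver1 mailserver1 appserver1 dbserver1 batchserver1; do
#     ssh -n "$vm" 'cat /tmp/pollme*' 2>/dev/null | poll_ready \
#         || echo "$vm: SPECpoll listener not ready"
# done
```

A "ready..." line only shows that the listener started at some point. The process could have died later, so it is worth combining this with a process check.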

I recommend setting NUM_WORKLOADS = 2 to run only app/dbserver and web/infraserver until you get these working as well as setting RAMP_SECONDS = 600.

btw, looking through your Control.config, you can simplify some of the entries. Instead of having each tile and workload number in the indexes for PRIME_HOST_INIT_SCRIPT, you can set the value for the entire tile and/or workload this way:

Code: [Select]
PRIME_HOST_INIT_SCRIPT[0][0] = "jAppInitRstr.sh"
PRIME_HOST_INIT_SCRIPT[1][0] = "jAppInit.sh"
PRIME_HOST_INIT_SCRIPT[2][0] = "jAppInit.sh"
PRIME_HOST_INIT_SCRIPT[3][0] = "jAppInit.sh"
PRIME_HOST_INIT_SCRIPT[1] = "webInit.sh"
PRIME_HOST_INIT_SCRIPT[2] = "mailInit.sh"
PRIME_HOST_INIT_SCRIPT[3] = "batchInit.sh"


Same for RAMP_SECONDS:

Code: [Select]
RAMP_SECONDS[0] = 600
RAMP_SECONDS[1] = 600
RAMP_SECONDS[2] = 600
RAMP_SECONDS[3] = 600

Better yet:
Code: [Select]
RAMP_SECONDS= 600
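
To find other entries in Control.config that could be collapsed the same way, here is a rough sketch. It assumes entries are written one per line as `KEY[n] = value`, and it only looks at single-index entries. The function name `collapsible_keys` is hypothetical, and the output is only a hint: whether an unindexed value is equivalent for a given key still needs to be checked against the harness documentation.

```shell
#!/bin/sh
# collapsible_keys FILE: list keys that appear more than once as KEY[n]
# with the same value every time, printed as "KEY = value".
# Entries with two indexes (KEY[t][w]) and unindexed entries are ignored.
collapsible_keys() {
    awk -F'=' '
        /^[A-Z_]+\[[0-9]+\][ \t]*=/ {
            key = $1; sub(/\[.*/, "", key)             # strip the index
            val = $2; gsub(/^[ \t]+|[ \t]+$/, "", val) # trim whitespace
            count[key]++
            if (!(key in first)) first[key] = val
            else if (first[key] != val) differs[key] = 1
        }
        END {
            for (k in count)
                if (count[k] > 1 && !(k in differs)) print k " = " first[k]
        }
    ' "$1" | sort
}

# Hypothetical usage:
# collapsible_keys Control.config
```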
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 22, 2017, 01:03:19 PM
Also, you're on the right track with testing web with WORKLOAD_LOAD_LEVEL[1] = 1000 vs. WORKLOAD_LOAD_LEVEL[1] = 2500. This shows you quickly if your SUT is under-configured. Are you running 10GbE between the clients and SUT? Each tile uses about 1.4 Gb/s of network bandwidth. If you don't have 10GbE, you can split off webserver onto its own client.
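
As a back-of-the-envelope check, the per-tile estimate above can be turned into a rough tile count per link. This is only a sketch: it treats the 1.4 figure as a flat per-tile cost, ignores protocol overhead and any other traffic on the link, and the function name `tiles_per_link` is hypothetical.

```shell
#!/bin/sh
# tiles_per_link LINK_GBPS [PER_TILE_GBPS]: whole number of tiles that fit
# on one link, assuming a flat per-tile bandwidth (default 1.4, the
# estimate quoted in this thread).
tiles_per_link() {
    awk -v link="$1" -v per_tile="${2:-1.4}" 'BEGIN { print int(link / per_tile) }'
}

tiles_per_link 10   # 10GbE link
tiles_per_link 1    # 1GbE link
```

With these assumptions a 10GbE link has room for several tiles, while a 1GbE link cannot carry even one, which is why splitting webserver onto its own client matters on 1GbE.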

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 23, 2017, 01:59:07 PM
Hi
I got "Creating RMI listener using RMI Registry port 8001 webserver1-int/10.100.1.8:8001 ready..."
from the command: ssh webserver1 "cat /tmp/pollme*"

I changed the network to 10GbE, but it still failed.
The following is the main error message in Clientmgr1_1098.out:

WARNING: IOP00810011: Exception from readValue on ValueHandler in CDRInputStream
org.omg.CORBA.MARSHAL: WARNING: IOP00810011: Exception from readValue on ValueHandler in CDRInputStream  vmcid: OMG  minor code: 11 completed: Maybe
...

WARNING [javax.enterprise.resource.corba.ORBUtil]: IOP00810061: Could not
read exception from UEInfoServiceContext
...



Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 23, 2017, 02:56:08 PM
The SPECpoll process is running on webserver1 - that's good.

The new error is with GlassFish and the dbserver. I run into this error on occasion and found the only way to fix it is to clone a dbserver VM if you have one, or to re-run the example VM scripts on a clean VM to remake it as a dbserver. I wish I had a better answer.

Make sure you don't run jAppInitRstr.sh on appserver2 or 3 or 4. Only run jAppInit.sh on them.

Are you restoring the dbserver between runs? Please post the contents of jAppInitRstr.sh and post the output of:

Code: [Select]
ssh dbserver1 "ls -lh /dbstore/backup"
It should be about 989MB.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 24, 2017, 03:18:27 AM
Hi
Yes, I ran jAppInitRstr.sh only on appserver1.
I execute loaddb.sh before I start a run.
I cloned a new dbserver and continued with it, but got the following errors:

primectrl.out
2017-08-24 14:42:54:016 PrimeControl: checking polling start response times...
2017-08-24 14:42:54:018 PrimeControl: sleeping for 0 sec
2017-08-24 14:42:54:018 PrimeControl: sending results counter reset command.
2017-08-24 14:42:54:018 PrimeControl: polling for 7200 sec
2017-08-24 14:42:54:593 resetCounters() call failed on 100.100.1.8; aborting...
2017-08-24 14:43:23:919 [ERROR] One or more workloads exceeded maximum allowed polling response delay!
2017-08-24 14:43:23:919 PrimeControl: sending abortTest() to prime clients.
       ...
2017-08-24 14:43:23:919 PrimeControl: masters[2]=client1:1094
2017-08-24 14:43:23:922 PrimeControl: [ERROR] exception occurred sending abortTest signal to specweb_Stub[UnicastRef [liveRef: [endpoint:[100.100.1.8:6228](remote),objID:[ca345ed:15e12e621cd:-7ffe, -4227181212894052388]]]]. Exception was:
 java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
   java.net.ConnectException: Connection refused
2017-08-24 14:43:24:923 specvirt: benchmark run failed!
2017-08-24 14:43:24:923 specvirt: Done!


I also still find the following in Clientmgr1_1096.out:

[ERROR] Remote exception clearing statistics from client1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8;

Should I increase any value in Control.config such as "POLL_INTERVAL_SEC"?
Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 24, 2017, 10:39:13 AM
Hi. Don't increase POLL_INTERVAL_SEC. That just makes the duration of the test longer, and you're not able to run the test in the first place. Randomly changing parameters in Control.config won't fix this.

I think your problem is with the internal vs. external VNICs. Your error on the client is "Connection refused to host: 100.100.1.8; nested exception is...". The client doesn't have an internal VNIC and doesn't use the 100.100.X.X network - only web/infraserver and app/dbserver do.
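
To see which addresses the harness is actually being refused on, a minimal sketch that summarizes the "Connection refused to host" lines in the client manager logs follows. It assumes the message format shown in the logs in this thread, and the function name `refused_hosts` is hypothetical.

```shell
#!/bin/sh
# refused_hosts FILE...: print each address that appears in a
# "Connection refused to host: <ip>;" line, followed by how many times.
refused_hosts() {
    sed -n 's/.*Connection refused to host: \([0-9.]*\);.*/\1/p' "$@" \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Hypothetical usage on a client, in the directory holding the logs:
# refused_hosts Clientmgr*.out
```

Comparing the addresses it prints with the output of ifconfig on the client shows quickly whether the refused address is on the network the client is supposed to use.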

Is /etc/hosts on all of the clients populated correctly? On client1 it should look something like:

Code: [Select]
192.168.1.1     infraserver1 infraserver
192.168.1.2     webserver1 webserver
192.168.1.3     mailserver1 mailserver
192.168.1.4     appserver1 appserver specdelivery specemulator
192.168.1.5     dbserver1 dbserver
192.168.1.6     batchserver1 batchserver
192.168.1.8     client1 specdriver

192.168.1.11    infraserver2
192.168.1.12    webserver2
192.168.1.13    mailserver2
192.168.1.14    appserver2
192.168.1.16    batchserver2
192.168.1.18    client2

On client2 it should look like:

Code: [Select]
192.168.1.1     infraserver1
192.168.1.2     webserver1
192.168.1.3     mailserver1
192.168.1.4     appserver1
192.168.1.5     dbserver dbserver1
192.168.1.6     batchserver1
192.168.1.8     client1

192.168.1.11    infraserver2 infraserver
192.168.1.12    webserver2 webserver
192.168.1.13    mailserver2 mailserver
192.168.1.14    appserver2 appserver specdelivery specemulator
192.168.1.16    batchserver2 batchserver
192.168.1.18    client2 specdriver

Only the workload VMs need to have the internal VNIC addresses. Do not have these in a client's /etc/hosts. That is, the clients don't need and can't use:

Code: [Select]
100.100.1.X       infraserverX-int
100.100.1.X       webserverX-int
100.100.1.X       appserverX-int
100.100.1.X       dbserverX-int specdb

These work only on the app/dbserver and web/infraserver VMs because they have internal VNICs. Traffic between the clients and VMs is on an external VNIC.
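
A minimal sketch for checking a client's hosts file for such entries follows. It assumes internal names follow the "-int" suffix and "specdb" alias conventions shown above. The function name `internal_entries` is hypothetical.

```shell
#!/bin/sh
# internal_entries FILE: print every non-comment hosts line that carries a
# hostname ending in "-int" or the "specdb" alias. On a client, the
# expected output is empty.
internal_entries() {
    awk '
        /^[ \t]*#/ || NF < 2 { next }          # skip comments and blank lines
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /-int$/ || $i == "specdb") { print; next }
        }
    ' "$1"
}

# Hypothetical usage on each client:
# internal_entries /etc/hosts
```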

Please check the contents of /etc/hosts. Have runspecvirt.sh call pollInit.sh and then pollmecheck.sh before every test to make sure SPECpoll is running on the workload VMs. Reduce all RAMP_SECONDS values to 300. Run one tile with appserver and webserver only, and see if that runs (NUM_TILES = 1, NUM_WORKLOADS = 2). If that works, run two tiles with appserver and webserver only (NUM_TILES = 2, NUM_WORKLOADS = 2). Let us know how it goes.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 24, 2017, 10:55:14 PM
Hi
[1]
My external VNICs are set to "100.100.1.x" and the internal ones to "10.10.1.x".
The 1T2W run aborted with the error "[ERROR] Unable to contact client1:1010", but it worked before...
I have checked with pollmecheck.sh and all VMs worked well.
I will check my Control.config.
Thanks.

Clientmgr1_1096.out
-> 2017-08-25 09:50:13:113 RemoteLoadGen: Finished starting clients.
-> 2017-08-25 09:50:13:120 SpecwebControl: Warming up for 600 seconds.
-> 2017-08-25 10:00:13:122 SpecwebControl: Clearing results.
-> 2017-08-25 10:00:13:123 SpecwebControl: Starting 7200-second runtime.
-> 2017-08-25 10:00:13:130 SpecwebControl: Clearing results.
-> 2017-08-25 10:00:23:127 RemoteLoadGen: Warning: RMI exception trying to contact client1:1010. Retrying...
-> 2017-08-25 10:00:23:128 RemoteLoadGen: [ERROR] Unable to contact client1:1010
-> 2017-08-25 10:00:23:128 RemoteLoadGen: [ERROR] 1 remote clients, but only 0 responded
-> 2017-08-25 10:00:23:128 SpecwebControl: [ERROR] Client(s) not responding. Aborting test.
-> 2017-08-25 10:00:23:128 RemoteLoadGen: [ERROR] Remote exception setting server reset data collection from client1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->    java.net.ConnectException: Connection refused
-> 2017-08-25 10:00:23:128 SpecwebControl: Stopping remote clients.
-> 2017-08-25 10:00:23:129 RemoteLoadGen: 180-second ramp-down starting.

[2]
One more question: if I run more than 5 tiles, what is the hostname of the second dbserver for tile2~8? dbserver2 or dbserver5?
Do I only need to modify the /etc/hosts files of all VMs and clients for more than one dbserver and SPECvirt will let them work?
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 25, 2017, 12:50:01 PM
Hi Miles,

I'm sorry this is giving you such trouble. The Connection refused errors typically are due to either the firewall being on or the SPECpoll process not being up. Are you sure when you cloned the client and the web/infraservers you changed the hostnames and IP addresses and that they're correct in each client's /etc/hosts file? Can you ping all the workload VMs from the clients? It looks like you have a conflict with webserver2. Do you use a different version of Java JDK there?
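
For the ping check, here is a minimal sketch that walks a client's hosts file. It assumes a Linux-style `ping -c 1` and that ICMP is allowed between the client and the VMs. The function names `primary_names` and `unreachable_hosts` are hypothetical.

```shell
#!/bin/sh
# primary_names FILE: print the first hostname of every non-comment entry.
primary_names() {
    awk '!/^[ \t]*#/ && NF >= 2 { print $2 }' "$1"
}

# unreachable_hosts FILE: ping each primary hostname once and print the
# ones that do not answer (or do not resolve).
unreachable_hosts() {
    for h in $(primary_names "$1"); do
        ping -c 1 "$h" >/dev/null 2>&1 || echo "$h"
    done
}

# Hypothetical usage on each client:
# unreachable_hosts /etc/hosts
```

A successful ping only shows that the name resolves and the VM is up. It does not show that the RMI ports are open, so a "Connection refused" can still occur afterwards.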

Can you run only app/dbserver on client1 (NUM_TILES = 1, NUM_WORKLOADS = 1)? Can you run two of the appservers against one dbserver (NUM_TILES = 2, NUM_WORKLOADS = 1)?

> if I run more than 5 tiles, what is the hostname of the second dbserver for tile2~8?
> dbserver2 or dbserver5? Do I only need to modify the /etc/hosts files of all VMs and
> clients for more than one dbserver and SPECvirt will let them work?

You can call it whichever you want as long as appserver5-8 and client5-8 point to the correct hostname and dbserver alias. Under the helper directory in hosts.clients-5tile, we recommend that you call it dbserver5, which is what most if not all submissions use. Here's how tile 6 would look:

Code: [Select]
# ssh client6 "grep db /etc/hosts"
10.10.1.x dbserver5 dbserver

# ssh appserver6 "grep db /etc/hosts"
10.10.1.x dbserver5 dbserver
100.100.1.x dbserver5-int specdb

# ssh dbserver5 "grep db /etc/hosts"
10.10.1.x dbserver5 dbserver
100.100.1.x dbserver5-int specdb

# ssh dbserver5 "cat /tmp/pollme.out"
Creating RMI listener using RMI Registry port 8001
dbserver5-int/100.100.1.x:8001 ready...
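
The naming in this example can be sketched as a small helper. It assumes one dbserver per group of four tiles, named after the first tile in the group (dbserver1 for tiles 1-4, dbserver5 for tiles 5-8). That rule is inferred from this thread and the hosts.clients-5tile recommendation, not a requirement of the harness, and the function name `dbserver_for_tile` is hypothetical.

```shell
#!/bin/sh
# dbserver_for_tile N: print the dbserver hostname that tile N would use
# under the assumed "one dbserver per four tiles" naming convention.
dbserver_for_tile() {
    tile=$1
    echo "dbserver$(( (tile - 1) / 4 * 4 + 1 ))"
}

dbserver_for_tile 6   # tile 6 falls in the 5-8 group
```

Any other name works as long as the dbserver alias in each tile's /etc/hosts points at the right VM.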

Hope this helps.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on August 25, 2017, 12:58:17 PM
Also, you don't have to have every tile's hostname and IP address in every /etc/hosts file. You only need the VMs that the particular tile is using. So for client6 the minimum you need is:

Code: [Select]
# External client-to-VM traffic for client
10.10.1.x     infraserver infraserver6
10.10.1.x     webserver webserver6
10.10.1.x     mailserver mailserver6
10.10.1.x     appserver appserver6 specdelivery specemulator
10.10.1.x     dbserver dbserver5
10.10.1.x     batchserver batchserver6
10.10.1.x     client6 specdriver

For appserver6 you need at a minimum:

Code: [Select]
# External client-to-VM traffic for workload VM
10.10.1.x     appserver appserver6 specdelivery specemulator
10.10.1.x     dbserver dbserver5
10.10.1.x     client6 specdriver

# Internal VM-to-VM only traffic on workload VM
100.100.1.x   appserver6-int
100.100.1.x   dbserver5-int specdb

Finally, it'd probably help you to review a recent submission to see how others have set up their testbeds. Lenovo ran 22 tiles in submission #52 at https://www.spec.org/virt_sc2013/results/res2016q2, so download the supporting TGZ tarball to see changes they made to the VMs' OSes (/etc/hosts, /etc/sysconfig/network-scripts/ifcfg-eth*, and so on).

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on August 31, 2017, 10:51:52 PM
Hi
Sorry, the following is my error message:

1. in Clientmgr1_1088.out:
-> 2017-08-25 10:00:13:195 WorkloadScheduler[1164]: [FATAL] Exceeded max allowed overthink time of 72 sec. Please ensure that neither the server or client(s) are overloaded. If server is overloaded, consider reducing the number of SIMULTANEOUS_SESSIONS requested. If client(s) appear overloaded, add more clients.

2. in Clientmgr1_1096.out:

-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->     java.net.ConnectException: Connection refused
-> 2017-08-25 10:00:23:128 SpecwebControl: Stopping remote clients.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 01, 2017, 11:09:11 AM
Please work on getting one workload working before adding a second. See https://www.spec.org/forums/index.php?topic=80.msg558#msg558.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 01, 2017, 12:18:42 PM
Hi
Yes, it works with 1 tile but fails after adding the second.

The connection is good, but there is always an error in Clientmgr1_1096.out when running 2 or more tiles:
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->     java.net.ConnectException: Connection refused
-> 2017-08-25 10:00:23:128 SpecwebControl: Stopping remote clients.

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 01, 2017, 04:02:27 PM
Are you mounting infraserver2-int on webserver2? After you cloned webserver1 to webserver2, did you run Wafgen to generate the unique datastore for webserver2? After cloning, you need to run Wafgen on every infra/webserver pair. See https://www.spec.org/forums/index.php?topic=15.msg145#msg145 under Extra for webserver and Extra for infraserver.

I'd prefer, though, that you focus on getting one workload working first. Pick appserver (https://www.spec.org/forums/index.php?topic=80.msg558#msg558) or webserver (here), and let's concentrate on one to get it going with six tiles.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 04, 2017, 08:23:09 AM
Hi
[1]
Yes, I have mounted infraserver2-int on webserver2 and run Wafgen on every infra/webserver pair.
The run always terminates due to connection refused to client1, but the connection is working.

100.100.1.7   : wclient1
100.100.1.17 : wclient2

[2]
I modified WORKLOAD_LOAD_LEVEL.
But I think it was a bad modification, because I got a low score for the web workload.

Clientmgr1_1096.out
-> 2017-09-04 21:08:10:035 ResultsFile: Invalid Run! Weighted percentage difference (2.62%) for home in Iteration 1 is too high. Expected: 62630 requests, Actual: 42363
-> 2017-09-04 21:08:10:035 ResultsFile: Invalid Run! Weighted percentage difference (4.07%) for search in Iteration 1 is too high. Expected: 97435 requests, Actual: 65965
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (3.81%) for catalog in Iteration 1 is too high. Expected: 90454 requests, Actual: 61038
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (8.13%) for product in Iteration 1 is too high. Expected: 191391 requests, Actual: 128591
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (7.37%) for fileCatalog in Iteration 1 is too high. Expected: 174002 requests, Actual: 117072
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (4.35%) for file in Iteration 1 is too high. Expected: 104396 requests, Actual: 70764
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (2.18%) for download in Iteration 1 is too high. Expected: 52220 requests, Actual: 35405
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Sum of weighted percentage difference (32.53%) exceeds 1.5% for Iteration 1
-> 2017-09-04 21:08:10:036 SpecwebControl: **** SPECweb2005 benchmark completed
-> RESULT[0][0].home.TIME_GOOD is 0!
-> RESULT[0][0].search.TIME_GOOD is 0!
-> RESULT[0][0].catalog.TIME_GOOD is 0!
-> RESULT[0][0].product.TIME_GOOD is 0!
-> RESULT[0][0].fileCatalog.TIME_GOOD is 0!
-> RESULT[0][0].file.TIME_GOOD is 0!
-> RESULT[0][0].download.TIME_GOOD is 0!
-> RESULT[0][0].home.TIME_GOOD is 0!
-> RESULT[0][0].search.TIME_GOOD is 0!
-> RESULT[0][0].catalog.TIME_GOOD is 0!
-> RESULT[0][0].product.TIME_GOOD is 0!
-> RESULT[0][0].fileCatalog.TIME_GOOD is 0!
-> RESULT[0][0].file.TIME_GOOD is 0!
-> RESULT[0][0].download.TIME_GOOD is 0!
2017-09-04 21:08:12:429 Terminating processes. Please wait...
2017-09-04 21:08:12:430 Killing master procs ...
2017-09-04 21:08:12:430 Done killing procs ...

[3]
Although I have installed 10 GbE on the SUT and the client, I didn't set up an SR-IOV environment. Is it necessary?

Thanks.

Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 05, 2017, 10:55:35 AM
1. You said you're using 100.100 for the internal vNICs?

Code: [Select]
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->     java.net.ConnectException: Connection refused

The clients don't use the internal vNICs, only the external vNICs on 10.10, so they need to be on the 10.10 network. Please capture the output of ifconfig on client1, webserver1, and infraserver1 and post it here. Also please post the contents of each VM's /etc/hosts file.

What is wclient? Did you split the web/infraserver workload onto a dedicated client? This is only required when you have 1 GbE, since each client uses about 1.4 GbE. With 10 GbE, you don't need to split off web/infraserver.

Have you looked at the Example VM guide and made sure you've run all the steps?

2. The run didn't complete successfully. TIME_GOOD is 0, which means no web transactions happened, which is why your score is 0.

3. SR-IOV isn't necessary. If it was, we'd have instructed you how to set up and use it in the Example VM guide.
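
As a rough cross-check on point 2, the "Invalid Run!" lines in Clientmgr1_1096.out can be parsed to compare expected and actual request counts. This is only a minimal sketch in Python, not part of the SPECvirt harness; the function name and the regular expression are assumptions based solely on the log format quoted in this thread.

```python
import re

# Matches the tail of an "Invalid Run!" line, e.g.
# "... for home in Iteration 1 is too high. Expected: 62630 requests, Actual: 42363"
LINE_RE = re.compile(
    r"for (\w+) in Iteration \d+ is too high\. "
    r"Expected: (\d+) requests, Actual: (\d+)"
)

def request_shortfall(log_text):
    """Return {request_type: fraction of the expected requests that is missing}."""
    shortfall = {}
    for match in LINE_RE.finditer(log_text):
        name = match.group(1)
        expected = int(match.group(2))
        actual = int(match.group(3))
        shortfall[name] = (expected - actual) / expected
    return shortfall

# Two lines copied from the Clientmgr1_1096.out excerpt in this thread.
sample = (
    "-> 2017-09-04 21:08:10:035 ResultsFile: Invalid Run! Weighted percentage "
    "difference (2.62%) for home in Iteration 1 is too high. "
    "Expected: 62630 requests, Actual: 42363\n"
    "-> 2017-09-04 21:08:10:035 ResultsFile: Invalid Run! Weighted percentage "
    "difference (4.07%) for search in Iteration 1 is too high. "
    "Expected: 97435 requests, Actual: 65965\n"
)

for name, fraction in request_shortfall(sample).items():
    print(f"{name}: {fraction:.1%} short of expected")
```

For the seven request types in the posted log, each one comes out roughly one-third short of its expected count, so the shortfall is spread evenly across the workload rather than confined to one page type.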

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 05, 2017, 09:56:28 PM
Hi Lisa
1.
    100.100.xx.xx is for the external vNICs       
    10.10.xx.xx     is for the internal vNICs
    wclient is a client which is dedicated to web workload

    I will attach or POST more details later.

    Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 06, 2017, 10:21:33 AM
Hi Miles. I'm sorry I confused which network the internal vNICs use. Can webserver1-int and infraserver1-int ping each other over the internal network? And you're sure SPECpoll is running?

Separating the web workloads to their own dedicated clients can work but is unnecessary if you use 10 GbE.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 06, 2017, 11:14:40 AM
Hi Lisa
The internal vNICs use 10.10.1.xx.
Webserver1-int and infraserver1-int can ping each other over the internal network.
I have checked all SPECpoll running in the VMs with pollmecheck.sh.

Both the SUT and the client have a 10 GbE NIC installed.

The symptom is that the run progresses normally through the warm-up step, and then "connection refused" occurs.

Thanks.



Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: abond on September 06, 2017, 01:25:25 PM
Hey Miles,

I noticed on this thread that you have various connection refused type messages.  What steps do you go through between your run attempts?  Many times on an unsuccessful run various processes can be left around that prevent the appropriate process from starting correctly.  I would recommend that you reboot the client and your workload VMs between each run attempt.  It is always good to start everything in a fresh state when trying to debug these kinds of issues.  Is this what you are currently doing?
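
One way to tell whether a missing or leftover process is behind a "connection refused" is to probe the port from the machine that reports the error before starting a run. The following is a minimal sketch in Python, not part of the SPECvirt harness; the probe function is hypothetical, and the host and port mentioned in the note after it (wclient1, 1010) are taken from the Clientmgr log output posted elsewhere in this thread.

```python
import socket

def probe(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something is listening and accepted us
    except ConnectionRefusedError:
        return "refused"           # host answered, but nothing listens on the port
    except socket.timeout:
        return "timeout"           # no answer at all within the timeout
    except OSError as exc:
        return f"error: {exc}"     # name resolution failure, unreachable network, etc.
```

Calling probe("wclient1", 1010) on client1 and getting "refused" would suggest the host is reachable but no process is listening on that port, whereas "timeout" would point more toward routing or firewall problems.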

I was also curious about which 1 tile and 2 tile tests you have gotten to work correctly.  Is it true that a full 1 tile test has run successfully in the past, but no 2 tile tests have run successfully?

Thanks,
Andy
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 06, 2017, 10:43:23 PM
Hi
I have rebooted all the clients, wclients and VMs to re-run but got the same failure again.

The run completes successfully with only one tile; runs with two or more tiles fail.

According to the following message
Please ensure that neither the server or client(s) are overloaded. If server is overloaded, consider reducing the number of SIMULTANEOUS_SESSIONS requested. If client(s) appear overloaded, add more clients.

Should client2 and the web clients all be physical machines rather than VMs?

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 11, 2017, 12:14:18 PM
You don't need physical machines to run the client. Clients run just fine on VMs. Most of the submissions in the last several years use 10 GbE with virtual clients. It still looks like your problem is with the internal connections between appserver and dbserver as well as webserver and infraserver.

Would you please post the primectrl.out log from your successful one-tile run?

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: DavidSchmidt on September 11, 2017, 06:13:54 PM
Hi Miles.

I had a couple of questions about your configuration:

1.  When you set up your second webserver/infraserver, did you edit the support_image.props and support_download.props files and change the TILEINDEX value to "1" before you ran Wafgen?

2. Did you confirm that your infraserver2 nfs share was mounted properly on webserver2 before you ran Wafgen?

Errors about incorrect file size are sometimes related to a corrupt dataset.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 12, 2017, 07:28:37 AM
Hi
Code: [Select]
1.  When you setup your second webserver/infraserver, did you edit the support_image.props and support_download.props file and change the TILEINDEX value to "1" before you ran Wafgen?
     Yes, I did. I followed the steps in the User Guide.


Code: [Select]
2. Did you confirm that your infraserver2 nfs share was mounted properly on webserver2 before you ran Wafgen?
     Yes, I have checked the /etc/exports and /etc/fstab files.

I am out of ideas now, since there are no further clues for me to check.
I have attached the primectrl.out of my successful 1T4W run.

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: abond on September 14, 2017, 10:31:57 PM
Hey Miles,

Some of the data you have posted suggests that the machine configuration you are running on might not be able to handle the load of a full second tile.  Have you looked at the utilization numbers when trying to run a second tile?

One thing you could do to validate that your second tile is set up correctly is to run it as a partial tile.  Many of the benchmark publications run the last tile at less than 100% load.  You could try running the first tile at full load and the second tile at 10% load to see if the second tile completes OK.  If one of the subsystems is pushing back when you try to run the second tile, this might help determine whether that is happening.

Thanks,
Andy
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 20, 2017, 03:10:26 PM
To try Andy's suggestion, in Control.config set

Code: [Select]
LOAD_SCALE_FACTORS = "0.1"
POLL_INTERVAL_SEC = 1200

Then rerun a short run. It'll help determine if the subsystems are saturated.

Could you also tell us more about the physical host and the amount of memory you have?

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 20, 2017, 10:53:19 PM
Hi
Code: [Select]
LOAD_SCALE_FACTORS = "0.1,0"
My 3T1W configuration (web workload) can run without "connection refused" error after the value is changed to "0.1".
Code: [Select]
Host CPU0: E5-2683v4@2100 MHz   
     CPU1: E5-2683v4@2100 MHz
     (total 64 cores)
     memory : 256GB

Should I try other values of LOAD_SCALE_FACTORS, such as "0.9" or "0.8",
     to determine the best value that lets my run complete?

Will it be a compliant run?

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 21, 2017, 12:30:34 AM
A run with a non-default setting of LOAD_SCALE_FACTORS is non-compliant. Are you intending to submit your result to SPEC for review? What is the goal of your testing?

What are the virtual memory settings for each VM? Does the total allocated virtual memory for the VMs fit into physical memory (256GB)? How many vCPUs are you using for each VM?

I would also work with IT to make sure that 10 GbE network is configured properly. It supports several more tiles than you're running.

To be compliant, you run with as many full tiles as you can, then set the final tile to whatever the SUT can handle using LOAD_SCALE_FACTORS for that last tile. So if you pass at six tiles but fail at seven, use NUM_TILES = 7 and LOAD_SCALE_FACTORS[6] = "0.1" (the tile index is zero-based) to run one-tenth of tile seven and see if the test passes. If it does, add 0.1 to its value and retest until it fails. That's how you get a score of (for example) 6.7 tiles. See https://www.spec.org/virt_sc2013/docs/SPECvirt_ClientHarnessUserGuide.html#mozTocId87692 and other submissions for how they did this.
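
The step-by-step search for the last tile's load factor can be sketched as a simple loop. This is an illustration in Python under stated assumptions, not harness code: run_passes is a hypothetical callback that stands in for "edit Control.config, run the benchmark, and check whether the run passed".

```python
def find_partial_tile_factor(run_passes, step=0.1):
    """Raise the last tile's load factor in fixed steps until a run fails.

    run_passes(factor) is a hypothetical callback that performs one test run
    with the last tile at the given factor and returns True if it passed.
    Returns the highest factor that passed, or 0.0 if even the first step failed.
    """
    best = 0.0
    steps = int(round(1.0 / step))
    for k in range(1, steps + 1):
        factor = round(k * step, 10)   # avoid floating-point drift such as 0.30000000000000004
        if not run_passes(factor):
            break
        best = factor
    return best

# Stand-in for real test runs: pretend the SUT copes with up to 70% of the last tile.
full_tiles = 6
partial = find_partial_tile_factor(lambda factor: factor <= 0.7)
print(f"score would be about {full_tiles + partial:.1f} tiles")
```

Each call to run_passes corresponds to a complete benchmark run, so in practice the loop is carried out by hand over several runs rather than automated like this.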

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 21, 2017, 02:23:48 AM
Hi
There are both LOAD_SCALE_FACTORS and LOAD_SCALE_FACTORS[] in Control.config.
If I want to modify only the last tile but keep the others at their defaults, which property should I modify? Only LOAD_SCALE_FACTORS[], or both?
I have set LOAD_SCALE_FACTORS[3]="0.1", but the run still does not complete successfully.

In my 3T1W configuration, I allocated 35840 MB of memory and 4 vCPUs to each webserver. I use this much memory to keep the web VMs from rebooting abnormally due to "out of memory".

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 21, 2017, 11:40:08 AM
What are you trying to do? Submit a test to SPEC for publication (needs to be compliant)? Generate load to do software regression testing (doesn't need to be compliant)? There are several ways to use SPECvirt.

If you just want to generate load against a server, you could use LOAD_SCALE_FACTORS = "0.6" or whatever percent of the tile works for you. In this case, without the "[tile#]", it runs all tiles at 60%.

If you want to use a partial tile, you need both LOAD_SCALE_FACTORS and "LOAD_SCALE_FACTORS[tile#]". Look at https://www.spec.org/virt_sc2013/results/res2014q3/virt_sc2013-20140730-00016-perf.html for a 6.7 tile result. In the first table for Performance Summary, you can see tile 7 was run at 70%. You set this in Control.config, but we don't require people to submit Control.config as part of the supporting tarball because we can get the settings from the raw data file. See that at https://www.spec.org/virt_sc2013/results/res2014q3/virt_sc2013-20140730-00016.raw and search for LOAD_SCALE_FACTORS. (Ignore the first few with the -tile# after them, such as LOAD_SCALE_FACTOR-0 and -1....) It shows that in Control.config they used:

Code: [Select]
LOAD_SCALE_FACTORS    = "1.0"    # run all tiles at 100%
LOAD_SCALE_FACTORS[6] = "0.7"    # run only 7th tile at 70% using tile index [6]

That's how you run 6.7 tiles.
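
To make the arithmetic explicit, here is a small Python model of how a default factor plus a per-tile override adds up to 6.7 tiles. It only mirrors the behavior described in this post (a default for all tiles, overridden by a zero-based tile index); the function name is an assumption and it is not how the harness itself is implemented.

```python
def effective_load_factors(num_tiles, default, overrides):
    """Return the per-tile load factors implied by a default and indexed overrides.

    overrides maps a zero-based tile index to its factor, mirroring
    LOAD_SCALE_FACTORS[6] = "0.7" for the seventh tile.
    """
    factors = [default] * num_tiles
    for index, factor in overrides.items():
        if not 0 <= index < num_tiles:
            raise IndexError(f"tile index {index} is outside 0..{num_tiles - 1}")
        factors[index] = factor
    return factors

factors = effective_load_factors(num_tiles=7, default=1.0, overrides={6: 0.7})
print(factors)                 # six tiles at 1.0, the seventh at 0.7
print(round(sum(factors), 1))  # total load in tiles
```

Under this model an override index equal to NUM_TILES (for example [3] with three tiles) does not refer to any tile, which is why the index check is included.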
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 22, 2017, 01:59:39 AM
Hi
Our objective is to run and generate a compliant report and evaluate the performance of our servers.

I have modified the properties and got the following results:

Code: [Select]
NUM_TILES=3
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS = "0.6,0"
Error "connection refused" occurred after polling start and the run aborted.

Code: [Select]
NUM_TILES=3
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS = "0.4,0"
The run completed successfully.

Code: [Select]
NUM_TILES=2
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS[1] = "0.1"
LOAD_SCALE_FACTORS = "1.0,0"
Error "connection refused" occurred after polling start and the run aborted.

I am trying to generate a compliant report. How should I proceed?
Are there any settings in Control.config or other files that I should check?
Thanks.

Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 22, 2017, 12:41:43 PM
If you can run at 40% but not 60%, your configuration is underconfigured. You need to figure out which resources are bottlenecked and increase them if you can. What are the CPU, network, and disk utilization? Are you running top or iostat? It sounds like you have enough memory.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 24, 2017, 10:23:24 PM
Hi
Code: [Select]
NUM_TILES=2
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS[1] = "0.1"
LOAD_SCALE_FACTORS = "1.0,0"

"Connection Refused" error still occurs in the configuration.
Is this setting correct?

And its information during warm-up is as follows:
webserver1:
CPU usage: CPU1:55%,CPU2:72%,CPU3:72.3%,CPU4:26.7%
Memory usage:12.8GB(37.2%) of 34.4GB
Network History:Receiving:36.3MiB/s, Sending:75.1MiB/s
[root@webserver Desktop]# top

top - 09:16:38 up 2 days, 18:55,  5 users,  load average: 0.15, 0.22, 0.09
Tasks: 695 total,   1 running, 693 sleeping,   0 stopped,   1 zombie
Cpu(s):  3.1%us,  1.4%sy,  0.0%ni, 93.7%id,  0.0%wa,  1.0%hi,  0.8%si,  0.0%s
Mem:  36117916k total, 35735976k used,   381940k free,   142180k buffers
Swap:  4194296k total,        0k used,  4194296k free, 31392024k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
 5563 root      20   0  345m  35m  15m S  2.7  0.1  59:23.06 gnome-system-mo
 2545 root      20   0  769m 510m 9028 S  2.3  1.4  44:33.68 Xorg           
 3376 root      20   0  302m  15m  10m S  0.7  0.0   0:00.88 gnome-terminal 
12176 apache    20   0  563m  19m 8060 S  0.7  0.1   0:00.26 httpd           
12267 apache    20   0  564m  19m 8064 S  0.7  0.1   0:00.80 httpd           
12414 apache    20   0  563m  19m 8012 S  0.7  0.1   0:00.19 httpd           
12428 apache    20   0  563m  19m 8040 S  0.7  0.1   0:00.29 httpd           
12461 root      20   0 15548 1820 1000 R  0.7  0.0   0:01.39 top             
12492 apache    20   0  563m  19m 8048 S  0.7  0.1   0:00.26 httpd           
12523 apache    20   0  563m  17m 5816 S  0.7  0.0   0:00.16 httpd           
12627 apache    20   0  563m  19m 8016 S  0.7  0.1   0:00.19 httpd           
12640 apache    20   0  563m  19m 8044 S  0.7  0.1   0:00.25 httpd           
12661 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.17 httpd           
12662 apache    20   0  563m  19m 8056 S  0.7  0.1   0:00.20 httpd           
12725 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.14 httpd           
   42 root      20   0     0    0    0 S  0.3  0.0   4:04.61 ata_sff/0       
 1652 root      20   0     0    0    0 S  0.3  0.0   2:03.81 rpciod/3       

webserver2:
CPU usage: CPU1:1%,CPU2:1%,CPU3:2%,CPU4:18.4%
Memory usage:3.6GB(10.4%) of 34.4GB
Network History:Receiving:150.6KiB/s, Sending:8.9MiB/s
[root@webserver Desktop]# top

top - 09:16:38 up 2 days, 18:55,  5 users,  load average: 0.15, 0.22, 0.09
Tasks: 695 total,   1 running, 693 sleeping,   0 stopped,   1 zombie
Cpu(s):  3.1%us,  1.4%sy,  0.0%ni, 93.7%id,  0.0%wa,  1.0%hi,  0.8%si,  0.0%s
Mem:  36117916k total, 35735976k used,   381940k free,   142180k buffers
Swap:  4194296k total,        0k used,  4194296k free, 31392024k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
 5563 root      20   0  345m  35m  15m S  2.7  0.1  59:23.06 gnome-system-mo
 2545 root      20   0  769m 510m 9028 S  2.3  1.4  44:33.68 Xorg           
 3376 root      20   0  302m  15m  10m S  0.7  0.0   0:00.88 gnome-terminal 
12176 apache    20   0  563m  19m 8060 S  0.7  0.1   0:00.26 httpd           
12267 apache    20   0  564m  19m 8064 S  0.7  0.1   0:00.80 httpd           
12414 apache    20   0  563m  19m 8012 S  0.7  0.1   0:00.19 httpd           
12428 apache    20   0  563m  19m 8040 S  0.7  0.1   0:00.29 httpd           
12461 root      20   0 15548 1820 1000 R  0.7  0.0   0:01.39 top             
12492 apache    20   0  563m  19m 8048 S  0.7  0.1   0:00.26 httpd           
12523 apache    20   0  563m  17m 5816 S  0.7  0.0   0:00.16 httpd           
12627 apache    20   0  563m  19m 8016 S  0.7  0.1   0:00.19 httpd           
12640 apache    20   0  563m  19m 8044 S  0.7  0.1   0:00.25 httpd           
12661 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.17 httpd           
12662 apache    20   0  563m  19m 8056 S  0.7  0.1   0:00.20 httpd           
12725 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.14 httpd           
   42 root      20   0     0    0    0 S  0.3  0.0   4:04.61 ata_sff/0       
 1652 root      20   0     0    0    0 S  0.3  0.0   2:03.81 rpciod/3       

Clientmgr1_1096.out
 ...
-> 2017-09-25 10:46:03:824 SpecwebControl: Warming up for 300 seconds.
-> 2017-09-25 10:51:03:833 SpecwebControl: Clearing results.
-> 2017-09-25 10:51:05:538 SpecwebControl: Starting 7200-second runtime.
-> 2017-09-25 10:51:05:549 SpecwebControl: Clearing results.
-> 2017-09-25 10:51:06:261 RemoteLoadGen: [ERROR] Remote exception setting server reset data collection from wclient1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.7; nested exception is:
->    java.net.ConnectException: Connection refused
....
-> 2017-09-25 10:51:16:266 RemoteLoadGen: Warning: RMI exception trying to contact wclient1:1010. Retrying...
-> 2017-09-25 10:51:16:267 RemoteLoadGen: [ERROR] Unable to contact wclient1:1010
-> 2017-09-25 10:51:16:267,0,0,0,0,0,0,0,0,0,2147483647,0,0
-> 2017-09-25 10:51:16:268 SpecwebControl: Finished; collecting statistics.
-> 2017-09-25 10:51:16:269 RemoteLoadGen: [ERROR] Remote exception waiting for getStatistics response from wclient1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.7; nested exception is:
->    java.net.ConnectException: Connection refused
-> 2017-09-25 10:51:16:278 SpecwebControl: Test Complete.
->
-> *** Test Summary ***
-> 2017-09-25 10:51:16:278 SpecwebControl: [ERROR] doBenchmark() throws Exception java.lang.NullPointerException
-> java.lang.NullPointerException
->    at org.spec.specweb.SpecwebControl.reportResults(SpecwebControl.java:1030)
->    at org.spec.specweb.SpecwebControl.runWorkload(SpecwebControl.java:484)
->    at org.spec.specweb.SpecwebControl.runTests(SpecwebControl.java:689)
->    at org.spec.specweb.SpecwebControl.doBenchmark(SpecwebControl.java:234)
->    at org.spec.specweb.SpecwebControl.access$000(SpecwebControl.java:25)
->    at org.spec.specweb.SpecwebControl$1.run(SpecwebControl.java:764)
-> 2017-09-25 10:51:16:281 SpecwebControl: Terminating run. Please wait...



Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 25, 2017, 11:24:06 AM
Miles, I'm reviewing this with David who regularly runs with a dedicated web client. It could be that I'm missing something in Control.config.

Meanwhile, please add the pollmecheck.sh helper script to your runspecvirt.sh command. This outputs into primectrl.out confirming that SPECpoll is up on the workload VMs. Don't rerun just yet - let's hear back from David first.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: DavidSchmidt on September 25, 2017, 11:51:59 AM
Hi Miles. It looks like the issue is actually due to communications between your clients and wclients. You should verify that the mapping between your clients and wclients matches the tile they are using (e.g. the specdriver name for tile 2 should point to client2 in wclient2's hosts file).
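
A check along those lines could be scripted. The following Python sketch parses hosts-file text and reports whether the specdriver alias resolves to the same address as the expected client; the function names are hypothetical, and the sample text is an excerpt of the wclient2 hosts file posted later in this thread.

```python
def parse_hosts(text):
    """Map every host name and alias in hosts-file text to its IP address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and surrounding blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)      # first entry wins, as in resolver lookups
    return mapping

def specdriver_points_to(hosts_text, expected_client):
    """True if 'specdriver' resolves to the same IP as expected_client."""
    mapping = parse_hosts(hosts_text)
    target = mapping.get("specdriver")
    return target is not None and target == mapping.get(expected_client)

wclient2_hosts = """\
#tile-2
100.100.1.11     infraserver infraserver2
100.100.1.12     webserver webserver2
100.100.1.17   wclient2

100.100.1.8   client1
100.100.1.18   client2 specdriver
"""

print(specdriver_points_to(wclient2_hosts, "client2"))
print(specdriver_points_to(wclient2_hosts, "client1"))
```

Running the same check against each client's and wclient's hosts file, with the client that belongs to that tile as the expected name, gives a quick way to spot a mismatched alias.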

You say that your clients are using 10GbE NICs, so there really should be no reason why you cannot run the web workload on the same client as the other workloads. This also has the benefit of simplifying your benchmark harness. Is there some compelling reason you are using two client VMs per tile? I would suggest you try utilizing a single client per tile to see if you encounter the same problems.

One other thing to note is that I have seen better performance on the client side if I use SRIOV virtual functions rather than bridges for the virtual NICs, particularly when you have a lot of vClients on the same bridge. You might want to try using virtual functions.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on September 26, 2017, 05:24:14 AM
Hi
Code: [Select]
Is there some compelling reason you are using two client VMs per tile?
When I used only one client to control the corresponding tile, the "connection refused" failure always occurred. I followed the User's Guide to add a wclient dedicated to the web workload, but that doesn't resolve the error.

Code: [Select]
You might want to try using virtual functions.
Is SR-IOV necessary, even with only 2 or 3 tiles?
But the symptom occurs even when I set "LOAD_SCALE_FACTORS[1]=0.1". Does that mean there is an incorrect setting in my environment?

Should I provide any other log files to help you fix this error?
Or could you help me check by accessing my testbed remotely?
I have been stuck on the "Connection Refused" error for a long time.

Thanks.

/etc/hosts of client1
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications

#tile-1
100.100.1.1     infraserver infraserver1
100.100.1.2     webserver webserver1
100.100.1.3     mailserver mailserver1
100.100.1.4     appserver appserver1 specdelivery specemulator
100.100.1.5     dbserver dbserver1 dbserver2 dbserver3 dbserver4
100.100.1.6     batchserver batchserver1
100.100.1.7   wclient1
100.100.1.8   client1 specdriver
#tile-2
100.100.1.11    infraserver2
100.100.1.12    webserver2
100.100.1.13    mailserver2
100.100.1.14    appserver2
100.100.1.16    batchserver2
100.100.1.17   wclient2
100.100.1.18   client2

/etc/hosts of client2
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications
100.100.1.1     infraserver1
100.100.1.2     webserver1
100.100.1.3     mailserver1
100.100.1.4     appserver1
100.100.1.6     batchserver1
100.100.1.11     infraserver infraserver2
100.100.1.12     webserver webserver2
100.100.1.13     mailserver mailserver2
100.100.1.14     appserver appserver2 specdelivery specemulator
100.100.1.5      dbserver dbserver1 dbserver2
100.100.1.16     batchserver batchserver2
100.100.1.7   wclient1
100.100.1.17   wclient2
100.100.1.8   client1
100.100.1.18   client2 specdriver

/etc/hosts of wclient1
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications

100.100.1.1     infraserver infraserver1
100.100.1.2     webserver webserver1
100.100.1.3     mailserver mailserver1
100.100.1.4     appserver appserver1
100.100.1.5     dbserver dbserver1
100.100.1.6     batchserver batchserver1
100.100.1.7   wclient1
100.100.1.8   client1 specdriver
100.100.1.18   client2

/etc/hosts of wclient2
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications

#tile-2
100.100.1.11     infraserver infraserver2
100.100.1.12     webserver webserver2
100.100.1.13     mailserver mailserver2
100.100.1.14     appserver appserver2 specdelivery specemulator
100.100.1.5      dbserver dbserver1 dbserver2 dbserver3 dbserver4
100.100.1.16     batchserver batchserver2
100.100.1.17   wclient2

100.100.1.8   client1
100.100.1.18   client2 specdriver
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: lroderic on September 28, 2017, 10:56:36 AM
Have you confirmed that your 10 GbE network is working? It's simplest and best with a 10 GbE network since it lets you run all workloads against one client.

We need to focus on the client(s). All signs point to network bandwidth on the client NICs/vNICs. Are these physical clients or VMs? If they're VMs, are their vNICs configured correctly? Do they reside on a separate host from the SUT? Do you have a dedicated network for the testbed, or are you sharing it with a busy lab? Is the entire network on the same network switch?

Since you can run with three tiles @ 0.4 but fail @ 0.6, it looks like you're crossing that 1 GbE network threshold for the client. (Each client needs about 1.4 GbE.)

Please reboot the client right before a test run and capture the output of ifconfig. Then run the test and run top on the client during the test run. At the end of the test, do another ifconfig on the client to get packets sent/received and any collisions:

Code: [Select]
        RX packets 291259  bytes 201474352 (192.1 MiB)
        RX errors 0  dropped 566  overruns 0  frame 0
        TX packets 111161  bytes 19175356 (18.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

You can use ethtool to drill down on network stats as well.
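
Comparing the before and after counters by hand is error-prone, so a small script can do the subtraction. This is a minimal Python sketch that assumes the ifconfig output format shown in the block above (lines beginning with RX or TX followed by name/value pairs); the function names are made up for illustration.

```python
import re

# A counter name followed by its integer value, e.g. "dropped 566".
PAIR_RE = re.compile(r"([a-z]+)\s+(\d+)")

def parse_ifconfig_counters(text):
    """Extract RX/TX counters from ifconfig output into a flat dictionary."""
    counters = {}
    for raw in text.splitlines():
        parts = raw.split(None, 1)
        if len(parts) != 2 or parts[0] not in ("RX", "TX"):
            continue                      # skip flags, inet, ether and other lines
        direction, rest = parts
        for name, value in PAIR_RE.findall(rest):
            counters[f"{direction}_{name}"] = int(value)
    return counters

def counter_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {key: after[key] - before.get(key, 0) for key in after}

snapshot = """\
        RX packets 291259  bytes 201474352 (192.1 MiB)
        RX errors 0  dropped 566  overruns 0  frame 0
        TX packets 111161  bytes 19175356 (18.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
"""

counters = parse_ifconfig_counters(snapshot)
print(counters["RX_dropped"], counters["TX_collisions"])
```

Feeding the pre-run and post-run snapshots to counter_delta shows directly whether drops, errors, or collisions grew during the test.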

If the client is a VM, please post its XML definition, and post the output of ifconfig and lspci on the host.

Lastly, get /etc/sysconfig/network-scripts/ifcfg-* on the host for the NIC and bridge settings and post it here.

Lisa
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on October 02, 2017, 09:31:21 PM
Hi
All my clients and wclients are VMs.
My SPECvirt runs in a dedicated network rather than a busy lab.

The NIC used as VM network device on the host:(Please refer to HOST_ifconfig.txt for the details)
ens6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 90.90.1.80  netmask 255.255.0.0  broadcast 90.90.255.255
        inet6 fe80::3efd:feff:fe9d:6c70  prefixlen 64  scopeid 0x20<link>
        ether 3c:fd:fe:9d:6c:70  txqueuelen 1000  (Ethernet)
        RX packets 10807669  bytes 11371129452 (10.5 GiB)
        RX errors 0  dropped 377  overruns 0  frame 0
        TX packets 4458658  bytes 300435255 (286.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0



Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: DavidSchmidt on October 03, 2017, 02:50:17 PM
Hi Miles. I looked at the files you attached. The networking looks a little odd to me. You say that ens6 is the port you are using for the VM network device. How is this device configured? In the lspci info, I see details about 3 network devices:
  1 x 2-port I350 Intel 1GbE NIC (eno1 and eno2 in ifconfig data), 
  1 x Intel XL710 1-port 40GbE NIC,
  1 x Intel X540 2-port 10GbE NIC (ens5f0 and ens5f1, along with 32 virtual functions enp5s[xx]).

I actually don't see which device ens6 is. I presume it is the XL710 NIC port, but I don't know for certain since that NIC is in slot 3, not slot 6. I also don't see any bridge information, so I don't know what that layout is. If you don't use SR-IOV, then you should have a bridge configured.

Can you run brctl show to show which devices are attached to which bridges, if any?

You say that your SUT network is on an isolated network. Which NIC ports are connected to this network? I am presuming it's ens5f0 at least, but it's not set up as a bridge device as near as I can tell, so I am not sure how your client is actually talking to your SUT.

Thanks,
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on October 06, 2017, 04:14:08 AM
Hi
Yes, ens6 is the XL710 NIC port.

I use SR-IOV on the X540, but no bridge is configured.
Code: [Select]
Can you run brctl show to show which devices are attached to which bridges, if any?
bridge name   bridge id      STP enabled   interfaces
virbr0      8000.000000000000   yes   

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: DavidSchmidt on October 09, 2017, 09:43:02 AM
Hi Miles. I have a couple of questions regarding your SR-IOV configuration. Per the lspci info you provided, it looks like only one port of your X540 is configured with VFs (only even-numbered functions show up in the list of Virtual Functions). Did you only configure one port to use VFs? Would you please provide the output of dmesg?

Also, would you mind providing the XML file for the client that is using SR-IOV?

Finally, can you confirm that your X540 ports are connected to the SUT switch?
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: Miles on October 12, 2017, 09:09:08 PM
Hi
I have completed 3T1W (web workload) successfully.

I modified TILEINDEX of support_image_props.rc and support_downloads_props.rc on webservers.

The value is now 0 on webserver1, 1 on webserver2, and 2 on webserver3; in my previous failed runs it was 2 on all of them.

But my webservers still often encounter "Out of memory: Kill process 16465 (httpd) score 1 or sacrifice child", followed by "connection refused".

Should I increase the VRAM on the webservers? (35840 MB is allocated now.)
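
The rule that fixed the run (TILEINDEX 0 on webserver1, 1 on webserver2, 2 on webserver3) can be written as a quick sanity check. This is only an illustrative Python sketch: the function name is made up, and the values have to be collected by hand from each webserver's props files.

```python
def check_tileindex(tileindex_by_webserver):
    """Return a list of problems; an empty list means webserverN has TILEINDEX N-1."""
    problems = []
    for name, value in sorted(tileindex_by_webserver.items()):
        tile_number = int(name[len("webserver"):])   # "webserver2" -> 2
        expected = tile_number - 1                   # TILEINDEX is zero-based
        if value != expected:
            problems.append(f"{name}: TILEINDEX is {value}, expected {expected}")
    return problems

# Values from the earlier failed runs, where every webserver had TILEINDEX 2.
failed = {"webserver1": 2, "webserver2": 2, "webserver3": 2}
for problem in check_tileindex(failed):
    print(problem)
```

With the corrected values (0, 1, 2) the function returns an empty list.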

Thanks.
Title: Re: "PrimeControl: terminating run" while running 2 tiles
Post by: DavidSchmidt on October 16, 2017, 02:58:46 PM
Hi Miles. This looks like a tuning issue with the Apache webserver; it appears the webserver application is running out of memory. I would look at the tuning options of a published SPECvirt_sc2013 result that uses Apache and verify you have the same settings in your httpd.conf files.