Topic: "PrimeControl: terminating run" while running 2 tiles

lroderic

  • Moderator
  • Full Member
  • *****
  • Posts: 167
  • Karma: +6/-0
Re: "PrimeControl: terminating run" while running 2 tiles
« Reply #15 on: August 24, 2017, 10:39:13 AM »
Hi. Don't increase POLL_INTERVAL_SEC. That just makes the duration of the test longer, and you're not able to run the test in the first place. Randomly changing parameters in Control.config won't fix this.

I think your problem is with the internal vs. external VNICS. Your error on the client is "Connection refused to host: 100.100.1.8; nested exception is...".  The client doesn't have an internal VNIC and doesn't use the 100.100.X.X network - only web/infraserver and app/dbserver do. 
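
To act on a diagnosis like this, it can help to pull the refused address out of the client manager log and check which network it falls in. The following is a minimal sketch, not part of the SPECvirt harness. The function names and the `networks` mapping are hypothetical, and which range is internal or external depends on your own testbed, so the caller supplies the ranges.

```python
import ipaddress
import re

# Matches the address in log lines such as:
#   java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
REFUSED_RE = re.compile(r"Connection refused to host: (\d+\.\d+\.\d+\.\d+)")


def refused_hosts(log_text):
    """Return the distinct addresses that refused a connection, in log order."""
    seen = []
    for match in REFUSED_RE.finditer(log_text):
        addr = match.group(1)
        if addr not in seen:
            seen.append(addr)
    return seen


def classify(addr, networks):
    """Return the label of the first network that contains addr, or None.

    `networks` maps a label of your choosing to a CIDR string.
    """
    ip = ipaddress.ip_address(addr)
    for label, cidr in networks.items():
        if ip in ipaddress.ip_network(cidr):
            return label
    return None
```

For example, `classify(addr, {"external": "...", "internal": "..."})` with your own two ranges tells you whether each address returned by `refused_hosts` is on the network the clients are supposed to use.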

Is /etc/hosts on all of the clients populated correctly? On client1 it should look something like:

Code: [Select]
192.168.1.1     infraserver1 infraserver
192.168.1.2     webserver1 webserver
192.168.1.3     mailserver1 mailserver
192.168.1.4     appserver1 appserver specdelivery specemulator
192.168.1.5     dbserver1 dbserver
192.168.1.6     batchserver1 batchserver
192.168.1.8     client1 specdriver

192.168.1.11    infraserver2
192.168.1.12    webserver2
192.168.1.13    mailserver2
192.168.1.14    appserver2
192.168.1.16    batchserver2
192.168.1.18    client2

On client2 it should look like:

Code: [Select]
192.168.1.1     infraserver1
192.168.1.2     webserver1
192.168.1.3     mailserver1
192.168.1.4     appserver1
192.168.1.5     dbserver dbserver1
192.168.1.6     batchserver1
192.168.1.8     client1

192.168.1.11    infraserver2 infraserver
192.168.1.12    webserver2 webserver
192.168.1.13    mailserver2 mailserver
192.168.1.14    appserver2 appserver specdelivery specemulator
192.168.1.16    batchserver2 batchserver
192.168.1.18    client2 specdriver

Only the workload VMs need to have the internal VNIC addresses. Do not have these in a client's /etc/hosts. That is, the clients don't need and can't use:

Code: [Select]
100.100.1.X       infraserverX-int
100.100.1.X       webserverX-int
100.100.1.X       appserverX-int
100.100.1.X       dbserverX-int specdb

These work only on the app/dbserver and web/infraserver VMs because they have internal VNICs. Traffic between the clients and VMs is on an external VNIC.
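
A quick way to audit a client's /etc/hosts for entries that belong only on the workload VMs is sketched below. It is an illustration rather than a SPECvirt tool: the function name is hypothetical, and it assumes that internal names follow the pattern shown above (a `-int` suffix, plus the `specdb` alias).

```python
def internal_entries(hosts_text):
    """Return (address, name) pairs whose name looks internal-only.

    Names ending in "-int" and the alias "specdb" are treated as internal,
    following the entries listed in the post above.
    """
    found = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            if name.endswith("-int") or name == "specdb":
                found.append((address, name))
    return found
```

Reading the client's hosts file and passing its text to `internal_entries` should give an empty list on a correctly configured client. Any pair it returns is a line to remove.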

Please check the contents of /etc/hosts. Have runspecvirt.sh call pollInit.sh and then pollmecheck.sh before every test to make sure SPECpoll is running on the workload VMs. Reduce all RAMP_SECONDS to 300. Run one tile with appserver and webserver only, and see if that runs (NUM_TILES = 1, NUM_WORKLOADS = 2). If that works, run two tiles with appserver and webserver only (NUM_TILES = 2, NUM_WORKLOADS = 2). Let us know how it goes.

Lisa

Miles

  • Jr. Member
  • **
  • Posts: 72
  • Karma: +0/-0
« Reply #16 on: August 24, 2017, 10:55:14 PM »
Hi
[1]
My external vNIC network is "100.100.1.x" and my internal vNIC network is "10.10.1.x".
The 1T2W run aborted because of the error "[ERROR] Unable to contact client1:1010", but it worked before...
I have checked with pollmecheck.sh and all VMs worked well.
I will check my Control.config.
Thanks.

Clientmgr1_1096.out
-> 2017-08-25 09:50:13:113 RemoteLoadGen: Finished starting clients.
-> 2017-08-25 09:50:13:120 SpecwebControl: Warming up for 600 seconds.
-> 2017-08-25 10:00:13:122 SpecwebControl: Clearing results.
-> 2017-08-25 10:00:13:123 SpecwebControl: Starting 7200-second runtime.
-> 2017-08-25 10:00:13:130 SpecwebControl: Clearing results.
-> 2017-08-25 10:00:23:127 RemoteLoadGen: Warning: RMI exception trying to contact client1:1010. Retrying...
-> 2017-08-25 10:00:23:128 RemoteLoadGen: [ERROR] Unable to contact client1:1010
-> 2017-08-25 10:00:23:128 RemoteLoadGen: [ERROR] 1 remote clients, but only 0 responded
-> 2017-08-25 10:00:23:128 SpecwebControl: [ERROR] Client(s) not responding. Aborting test.
-> 2017-08-25 10:00:23:128 RemoteLoadGen: [ERROR] Remote exception setting server reset data collection from client1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->    java.net.ConnectException: Connection refused
-> 2017-08-25 10:00:23:128 SpecwebControl: Stopping remote clients.
-> 2017-08-25 10:00:23:129 RemoteLoadGen: 180-second ramp-down starting.

[2]
One more question: if I run more than 5 tiles, what is the hostname of the second dbserver for tile2~8? dbserver2 or dbserver5?
Do I only need to modify the /etc/hosts files of all VMs and clients for more than one dbserver and SPECvirt will let them work?
« Last Edit: August 25, 2017, 03:21:47 AM by Miles »

lroderic

« Reply #17 on: August 25, 2017, 12:50:01 PM »
Hi Miles,

I'm sorry this is giving you such trouble. The Connection refused errors typically are due to either the firewall being on or the SPECpoll process not being up. Are you sure when you cloned the client and the web/infraservers you changed the hostnames and IP addresses and that they're correct in each client's /etc/hosts file? Can you ping all the workload VMs from the clients? It looks like you have a conflict with webserver2. Do you use a different version of Java JDK there?
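
Ping only shows that a host is up. A TCP probe can also tell a port with nothing accepting connections apart from a host that cannot be reached at all. Here is a minimal sketch of such a probe. It is not part of the SPECvirt harness and the function name is hypothetical. The ports seen in this thread's logs are 1010 (client1) and 8001 (the RMI registry port in pollme.out).

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect and report "open", "refused", or "unreachable"."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port
        # (no listener, or a firewall rule that rejects the connection).
        return "refused"
    except OSError:
        # Timeouts, name-resolution failures, and routing errors end up here.
        return "unreachable"
```

Running `probe("client1", 1010)` from the machine that reported the error is one way to reproduce it outside the benchmark. A result of "refused" points at the listener or a rejecting firewall, while "unreachable" points at name resolution or the network path.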

Can you run only app/dbserver on client1 (NUM_TILES = 1, NUM_WORKLOADS = 1)? Can you run two of the appservers against one dbserver (NUM_TILES = 2, NUM_WORKLOADS = 1)?

> if I run more than 5 tiles, what is the hostname of the second dbserver for tile2~8?
> dbserver2 or dbserver5? Do I only need to modify the /etc/hosts files of all VMs and
> clients for more than one dbserver and SPECvirt will let them work?

You can call it whichever you want as long as appserver5-8 and client5-8 point to the correct hostname and dbserver alias. Under the helper directory in hosts.clients-5tile, we recommend that you call it dbserver5, which is what most if not all submissions use. Here's how tile 6 would look:

Code: [Select]
# ssh client6 "grep db /etc/hosts"
10.10.1.x dbserver5 dbserver

# ssh appserver6 "grep db /etc/hosts"
10.10.1.x dbserver5 dbserver
100.100.1.x dbserver5-int specdb

# ssh dbserver5 "grep db /etc/hosts"
10.10.1.x dbserver5 dbserver
100.100.1.x dbserver5-int specdb

# ssh dbserver5 "cat /tmp/pollme.out"
Creating RMI listener using RMI Registry port 8001
dbserver5-int/100.100.1.x:8001 ready...
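
The naming convention in this example can be written as a small helper. This is a sketch under a stated assumption: one dbserver serves a group of four tiles and is named after the first tile in its group. That matches dbserver1 in the earlier hosts files and dbserver5 for tiles 5-8 here, but continuing the pattern past tile 8 is an extrapolation. The function and constant names are hypothetical.

```python
TILES_PER_DBSERVER = 4  # assumption, drawn from "appserver5-8 and client5-8"


def dbserver_for_tile(tile):
    """Return the dbserver hostname that serves a tile (tiles are 1-based)."""
    if tile < 1:
        raise ValueError("tile numbers start at 1")
    # Tiles 1-4 map to 1, tiles 5-8 map to 5, and so on.
    first_tile_in_group = ((tile - 1) // TILES_PER_DBSERVER) * TILES_PER_DBSERVER + 1
    return f"dbserver{first_tile_in_group}"
```

With this convention, `dbserver_for_tile(6)` gives `dbserver5`, which is the name used in the tile 6 example above.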

Hope this helps.

Lisa

lroderic

« Reply #18 on: August 25, 2017, 12:58:17 PM »
Also, you don't have to have every tile's hostname and IP address in every /etc/hosts file. You only need the VMs that the particular tile is using. So for client6 the minimum you need is:

Code: [Select]
# External client-to-VM traffic for client
10.10.1.x     infraserver infraserver6
10.10.1.x     webserver webserver6
10.10.1.x     mailserver mailserver6
10.10.1.x     appserver appserver6 specdelivery specemulator
10.10.1.x     dbserver dbserver5
10.10.1.x     batchserver batchserver6
10.10.1.x     client6 specdriver

For appserver6 you need at a minimum:

Code: [Select]
# External client-to-VM traffic for workload VM
10.10.1.x     appserver appserver6 specdelivery specemulator
10.10.1.x     dbserver dbserver5
10.10.1.x     client6 specdriver

# Internal VM-to-VM only traffic on workload VM
100.100.1.x   appserver6-int
100.100.1.x   dbserver5-int specdb
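
If you maintain many tiles, the minimal client file shown above can be generated rather than edited by hand. The sketch below mirrors the client6 layout. The function name is hypothetical, and the caller supplies both the dbserver name and the IP addresses because those are site-specific and are not filled in here.

```python
def minimal_client_hosts(tile, db_host, addresses):
    """Build the minimum /etc/hosts lines for one client.

    `addresses` maps each real hostname (for example "webserver6") to its
    external IP address.
    """
    entries = [
        (f"infraserver{tile}", f"infraserver infraserver{tile}"),
        (f"webserver{tile}", f"webserver webserver{tile}"),
        (f"mailserver{tile}", f"mailserver mailserver{tile}"),
        (f"appserver{tile}", f"appserver appserver{tile} specdelivery specemulator"),
        (db_host, f"dbserver {db_host}"),
        (f"batchserver{tile}", f"batchserver batchserver{tile}"),
        (f"client{tile}", f"client{tile} specdriver"),
    ]
    lines = ["# External client-to-VM traffic for client"]
    for host, names in entries:
        # A missing address raises KeyError instead of emitting a guess.
        lines.append(f"{addresses[host]:<13} {names}")
    return "\n".join(lines) + "\n"
```

Calling `minimal_client_hosts(6, "dbserver5", addresses)` with a complete address map should reproduce the seven client6 entries listed above, with the generic aliases pointing at tile 6's own VMs.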

Finally, it'd probably help you to review a recent submission to see how others have set up their testbeds. Lenovo ran 22 tiles in submission #52 at https://www.spec.org/virt_sc2013/results/res2016q2, so download the supporting TGZ tarball to see changes they made to the VM's OSes (/etc/hosts, /etc/sysconfig/network-scripts/ifcfg-eth*, and so on).

Lisa
« Last Edit: August 25, 2017, 01:16:39 PM by lroderic »

Miles

« Reply #19 on: August 31, 2017, 10:51:52 PM »
Hi
Sorry, the following is my error message:

1. in Clientmgr1_1088.out:
-> 2017-08-25 10:00:13:195 WorkloadScheduler[1164]: [FATAL] Exceeded max allowed overthink time of 72 sec. Please ensure that neither the server or client(s) are overloaded. If server is overloaded, consider reducing the number of SIMULTANEOUS_SESSIONS requested. If client(s) appear overloaded, add more clients.

2. in Clientmgr1_1096.out:

-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->     java.net.ConnectException: Connection refused
-> 2017-08-25 10:00:23:128 SpecwebControl: Stopping remote clients.

lroderic

« Reply #20 on: September 01, 2017, 11:09:11 AM »
Please work on getting one workload working before adding a second. See https://www.spec.org/forums/index.php?topic=80.msg558#msg558.

Miles

« Reply #21 on: September 01, 2017, 12:18:42 PM »
Hi
Yes, it works with 1 tile but fails after adding the second.

The connection is good, but there is always an error in Clientmgr1_1096.out when running 2 or more tiles:
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->     java.net.ConnectException: Connection refused
-> 2017-08-25 10:00:23:128 SpecwebControl: Stopping remote clients.

Thanks.

lroderic

« Reply #22 on: September 01, 2017, 04:02:27 PM »
Are you mounting infraserver2-int on webserver2? After you cloned webserver1 to webserver2, did you run Wafgen to generate the unique datastore for webserver2? After cloning, you need to run Wafgen on every infra/webserver pair. See https://www.spec.org/forums/index.php?topic=15.msg145#msg145 under Extra for webserver and Extra for infraserver.

I'd prefer, though, that you focus on getting one workload working first. Pick appserver https://www.spec.org/forums/index.php?topic=80.msg558#msg558 or webserver (here), and let's concentrate on one to get it going with six tiles.

Lisa

Miles

« Reply #23 on: September 04, 2017, 08:23:09 AM »
Hi
[1]
Yes, I have mounted infraserver2-int on webserver2 and run Wafgen on every infra/webserver pair.
The run always terminates with "connection refused" to client1, even though the connection is working.

100.100.1.7   : wclient1
100.100.1.17 : wclient2

[2]
I modified WORKLOAD_LOAD_LEVEL from 2500 to 1000 and the run completed successfully.

But I think it is a bad modification because I got a low score for the web workload.

Clientmgr1_1096.out
-> 2017-09-04 21:08:10:035 ResultsFile: Invalid Run! Weighted percentage difference (2.62%) for home in Iteration 1 is too high. Expected: 62630 requests, Actual: 42363
-> 2017-09-04 21:08:10:035 ResultsFile: Invalid Run! Weighted percentage difference (4.07%) for search in Iteration 1 is too high. Expected: 97435 requests, Actual: 65965
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (3.81%) for catalog in Iteration 1 is too high. Expected: 90454 requests, Actual: 61038
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (8.13%) for product in Iteration 1 is too high. Expected: 191391 requests, Actual: 128591
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (7.37%) for fileCatalog in Iteration 1 is too high. Expected: 174002 requests, Actual: 117072
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (4.35%) for file in Iteration 1 is too high. Expected: 104396 requests, Actual: 70764
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Weighted percentage difference (2.18%) for download in Iteration 1 is too high. Expected: 52220 requests, Actual: 35405
-> 2017-09-04 21:08:10:036 ResultsFile: Invalid Run! Sum of weighted percentage difference (32.53%) exceeds 1.5% for Iteration 1
-> 2017-09-04 21:08:10:036 SpecwebControl: **** SPECweb2005 benchmark completed
-> RESULT[0][0].home.TIME_GOOD is 0!
-> RESULT[0][0].search.TIME_GOOD is 0!
-> RESULT[0][0].catalog.TIME_GOOD is 0!
-> RESULT[0][0].product.TIME_GOOD is 0!
-> RESULT[0][0].fileCatalog.TIME_GOOD is 0!
-> RESULT[0][0].file.TIME_GOOD is 0!
-> RESULT[0][0].download.TIME_GOOD is 0!
-> RESULT[0][0].home.TIME_GOOD is 0!
-> RESULT[0][0].search.TIME_GOOD is 0!
-> RESULT[0][0].catalog.TIME_GOOD is 0!
-> RESULT[0][0].product.TIME_GOOD is 0!
-> RESULT[0][0].fileCatalog.TIME_GOOD is 0!
-> RESULT[0][0].file.TIME_GOOD is 0!
-> RESULT[0][0].download.TIME_GOOD is 0!
2017-09-04 21:08:12:429 Terminating processes. Please wait...
2017-09-04 21:08:12:430 Killing master procs ...
2017-09-04 21:08:12:430 Done killing procs ...

[3]
Although I have installed 10 GbE on the SUT and client, I didn't set up an SR-IOV environment. Is it necessary?

Thanks.

« Last Edit: September 05, 2017, 02:29:01 AM by Miles »

lroderic

« Reply #24 on: September 05, 2017, 10:55:35 AM »
1. You said you're using 100.100 for the internal vNICs?

Code: [Select]
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.8; nested exception is:
->     java.net.ConnectException: Connection refused

The clients don't use the internal vNICs, only the external vNICs on 10.10, so they need to be on the 10.10 network. Please capture the output of ifconfig on client1, webserver1, and infraserver1 and post it here. Also please post the contents of each VM's /etc/hosts file.

What is wclient? Did you split the web/infraserver workload onto a dedicated client? This is only required when you have 1 GbE, since each client uses about 1.4 Gb/s of bandwidth. With 10 GbE, you don't need to split off web/infraserver.

Have you looked at the Example VM guide and made sure you've run all the steps?

2. The run didn't complete successfully. TIME_GOOD is 0, which means no web transactions happened, which is why your score is 0.

3. SR-IOV isn't necessary. If it was, we'd have instructed you how to set up and use it in the Example VM guide.
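
To see how far a run like the one in Clientmgr1_1096.out fell short, the figures can be recomputed from the log itself. This is a sketch with hypothetical function names. The formula, |expected - actual| divided by the total expected requests across all page types, is inferred from the numbers in that log (it reproduces the 2.62% for home and the 32.53% sum) and is not taken from SPECweb2005 documentation.

```python
import re

# Matches per-page lines such as:
#   ... for home in Iteration 1 is too high. Expected: 62630 requests, Actual: 42363
INVALID_RE = re.compile(
    r"for (\w+) in Iteration \d+ is too high\. "
    r"Expected: (\d+) requests, Actual: (\d+)"
)


def request_shortfall(log_text):
    """Return ({page: weighted % difference}, sum), recomputed from the log."""
    counts = {m.group(1): (int(m.group(2)), int(m.group(3)))
              for m in INVALID_RE.finditer(log_text)}
    total_expected = sum(expected for expected, _ in counts.values())
    if total_expected == 0:
        return {}, 0.0
    weighted = {page: 100.0 * abs(expected - actual) / total_expected
                for page, (expected, actual) in counts.items()}
    return weighted, sum(weighted.values())


def time_good_is_zero(log_text):
    """True if any page type reports "TIME_GOOD is 0!"."""
    return "TIME_GOOD is 0!" in log_text
```

The log states that the sum must not exceed 1.5%, so comparing the second return value of `request_shortfall` against 1.5, together with `time_good_is_zero`, gives a quick pass/fail check before you read the whole file.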

Lisa

Miles

« Reply #25 on: September 05, 2017, 09:56:28 PM »
Hi Lisa
1.
    100.100.xx.xx is for the external vNICs       
    10.10.xx.xx     is for the internal vNICs
    wclient is a client which is dedicated to web workload

    I will attach or POST more details later.

    Thanks.

lroderic

« Reply #26 on: September 06, 2017, 10:21:33 AM »
Hi Miles. I'm sorry I confused which network the internal vNICs use. Can webserver1-int and infraserver1-int ping each other over the internal network? And you're sure SPECpoll is running?

Separating the web workloads to their own dedicated clients can work but is unnecessary if you use 10 GbE.

Lisa

Miles

« Reply #27 on: September 06, 2017, 11:14:40 AM »
Hi Lisa
The internal vNICs use 10.10.1.xx.
Webserver1-int and infraserver1-int can ping each other over the internal network.
I have checked all SPECpoll running in the VMs with pollmecheck.sh.

Both the SUT and the client have a 10 Gb NIC installed.

The symptom is that the run progresses normally through the warm-up step, and the "connection refused" error happens after that.

Thanks.




abond

  • Moderator
  • Newbie
  • *****
  • Posts: 35
  • Karma: +6/-0
« Reply #28 on: September 06, 2017, 01:25:25 PM »
Hey Miles,

I noticed on this thread that you have various "connection refused" messages. What steps do you go through between your run attempts? Often, after an unsuccessful run, leftover processes can prevent the appropriate process from starting correctly. I would recommend that you reboot the client and your workload VMs between each run attempt. It is always good to start everything in a fresh state when trying to debug these kinds of issues. Is this what you are currently doing?

I was also curious about which 1-tile and 2-tile tests you have gotten to work correctly. Is it true that a full 1-tile test has run successfully in the past, but no 2-tile tests have run successfully?

Thanks,
Andy

Miles

« Reply #29 on: September 06, 2017, 10:43:23 PM »
Hi
I rebooted all the clients, wclients, and VMs and re-ran, but got the same failure again.

The run completes successfully with only one tile; runs with 2 or more tiles fail.

According to the following message:
Please ensure that neither the server or client(s) are overloaded. If server is overloaded, consider reducing the number of SIMULTANEOUS_SESSIONS requested. If client(s) appear overloaded, add more clients.

Should client2 and the web clients all be physical machines rather than VMs?

Thanks.
« Last Edit: September 10, 2017, 09:08:27 PM by Miles »