Author Topic: "PrimeControl: terminating run" while running 2 tiles  (Read 49278 times)

lroderic

  • Moderator
  • Full Member
  • *****
  • Posts: 167
  • Karma: +6/-0
Re: "PrimeControl: terminating run" while running 2 tiles
« Reply #30 on: September 11, 2017, 12:14:18 PM »
You don't need physical machines to run the client. Clients run just fine on VMs. Most of the submissions in the last several years use 10 GbE with virtual clients. It still looks like your problem is with the internal connections between appserver and dbserver as well as webserver and infraserver.

Would you please post the primectrl.out log from your successful one-tile run?

Lisa

DavidSchmidt

  • Moderator
  • Newbie
  • *****
  • Posts: 21
  • Karma: +3/-1
« Reply #31 on: September 11, 2017, 06:13:54 PM »
Hi Miles.

I had a couple of questions about your configuration:

1.  When you set up your second webserver/infraserver, did you edit the support_image.props and support_download.props files and change the TILEINDEX value to "1" before you ran Wafgen?

2. Did you confirm that your infraserver2 NFS share was mounted properly on webserver2 before you ran Wafgen?

Errors about incorrect file size are sometimes related to a corrupt dataset.

Miles

  • Jr. Member
  • **
  • Posts: 72
  • Karma: +0/-0
« Reply #32 on: September 12, 2017, 07:28:37 AM »
Hi
Quote: 1. When you set up your second webserver/infraserver, did you edit the support_image.props and support_download.props files and change the TILEINDEX value to "1" before you ran Wafgen?
Yes, I did. I followed the steps in the User Guide.

Quote: 2. Did you confirm that your infraserver2 NFS share was mounted properly on webserver2 before you ran Wafgen?
Yes, I have checked the /etc/exports and /etc/fstab files.

I am out of ideas now, since I have no further clues to check.
I have attached the primectrl.out of my successful 1T4W run.

Thanks.
« Last Edit: September 12, 2017, 07:35:04 AM by Miles »

abond

  • Moderator
  • Newbie
  • *****
  • Posts: 35
  • Karma: +6/-0
« Reply #33 on: September 14, 2017, 10:31:57 PM »
Hey Miles,

Some of the data you have posted suggests that the machine configuration you are running on might not be able to handle the load of a full second tile. Have you looked at the utilization numbers when trying to run a second tile?

One thing you could do to validate that your second tile is set up correctly is to run it as a partial tile. Many of the benchmark publications run the last tile at less than 100% load. You could try running the first tile at full load and the second tile at 10% load to see if the second tile completes OK. If you are getting pushback from one of the subsystems when trying to run the second tile, this might help determine whether that is happening.

Thanks,
Andy

lroderic

« Reply #34 on: September 20, 2017, 03:10:26 PM »
To try Andy's suggestion, in Control.config set

Code:
LOAD_SCALE_FACTORS = "0.1"
POLL_INTERVAL_SEC = 1200

Then rerun a short run. It'll help determine if the subsystems are saturated.

Could you also tell us more about the physical host and the amount of memory you have?

Lisa

Miles

« Reply #35 on: September 20, 2017, 10:53:19 PM »
Hi
Code:
LOAD_SCALE_FACTORS = "0.1,0"
My 3T1W configuration (web workload) runs without the "connection refused" error after the value is changed to "0.1".
Code:
Host CPU0: E5-2683v4@2100 MHz
     CPU1: E5-2683v4@2100 MHz
     (total 64 cores)
     memory : 256GB

Should I try other values of LOAD_SCALE_FACTORS, such as "0.9", "0.8", etc., to determine the best value for completing my run?

Will it be a compliant run?

Thanks.

lroderic

« Reply #36 on: September 21, 2017, 12:30:34 AM »
A run with a non-default setting of LOAD_SCALE_FACTORS is non-compliant. Are you intending to submit your result to SPEC for review? What is the goal of your testing?

What are the virtual memory settings for each VM? Does the total allocated virtual memory for the VMs fit into physical memory (256GB)? How many vCPUs are you using for each VM?
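
As a rough illustration of the memory question above, the check is simple arithmetic: add up what each VM is allocated and compare it with the host's physical memory. The VM names and sizes below are hypothetical placeholders, not values from this thread (only the 256 GB host figure is), and the sketch ignores hypervisor overhead.

```python
from typing import Dict


def memory_headroom_mb(vm_memory_mb: Dict[str, int], physical_gb: int) -> int:
    """Return physical memory minus total VM allocation, in MB.

    A negative result means the VMs are allocated more memory than the
    host physically has (hypervisor overhead is not accounted for).
    """
    return physical_gb * 1024 - sum(vm_memory_mb.values())


# Hypothetical per-VM allocations for one tile; replace with your own settings.
example_vms = {
    "infraserver1": 4096,
    "webserver1": 8192,
    "mailserver1": 8192,
    "appserver1": 8192,
    "dbserver1": 16384,
    "batchserver1": 2048,
}

headroom = memory_headroom_mb(example_vms, physical_gb=256)
print("overcommitted" if headroom < 0 else f"{headroom} MB of headroom")
```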

I would also work with IT to make sure that the 10 GbE network is configured properly. It supports several more tiles than you're running.

To be compliant, you run with as many full tiles as you can, then set the final tile to whatever the SUT can handle using LOAD_SCALE_FACTORS for that last tile. So if you pass at six tiles but fail at seven, use NUM_TILES = 7 and LOAD_SCALE_FACTORS[6] = "0.1" (tile indexes start at 0, so tile seven is index 6) to run one-tenth of tile seven and see if the test passes. If it does, add 0.1 to its value and retest until it fails. That's how you get a score of (for example) 6.7 tiles. See https://www.spec.org/virt_sc2013/docs/SPECvirt_ClientHarnessUserGuide.html#mozTocId87692 and other submissions for how they did this.
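
The stepping procedure above can be sketched as a small loop. This is only an illustration: `run_passes` is a hypothetical callback standing in for "edit Control.config, launch a run, and check whether it passed"; it is not part of the SPECvirt harness.

```python
from typing import Callable, Optional


def find_partial_tile_load(run_passes: Callable[[float], bool],
                           step: float = 0.1) -> Optional[float]:
    """Return the highest load factor for the final tile that still passes.

    run_passes(load) is a hypothetical hook: it should set the final tile's
    load factor, run the benchmark, and return True if the run passed.
    Returns None if even the first step fails.
    """
    best = None
    steps = round(1.0 / step)
    for i in range(1, steps):            # 0.1, 0.2, ... 0.9 when step is 0.1
        load = round(i * step, 2)
        if not run_passes(load):
            break                        # first failure: stop stepping up
        best = load
    return best


# Stand-in for real runs: pretend the SUT handles up to 70% of the last tile.
def fake_run(load: float) -> bool:
    return load <= 0.7


full_tiles = 6
partial = find_partial_tile_load(fake_run)
print(f"{full_tiles + (partial or 0):.1f} tiles")
```

In practice each call to `run_passes` is a full benchmark run, so the loop is really a checklist of manual reruns rather than something to automate blindly.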

Lisa
« Last Edit: September 21, 2017, 12:35:56 AM by lroderic »

Miles

« Reply #37 on: September 21, 2017, 02:23:48 AM »
Hi
Control.config contains both LOAD_SCALE_FACTORS and LOAD_SCALE_FACTORS[].
If I want to modify only the last tile and keep the others at the default, which property should I modify? Only LOAD_SCALE_FACTORS[], or both?
I set LOAD_SCALE_FACTORS[3]="0.1", but the run did not complete successfully.

In my 3T1W configuration, I allocated 35840 MB of memory and 4 vCPUs to each webserver. I allocated that much memory to keep the web VMs from rebooting abnormally due to "out of memory".

Thanks.

lroderic

« Reply #38 on: September 21, 2017, 11:40:08 AM »
What are you trying to do? Submit a test to SPEC for publication (needs to be compliant)? Generate load to do software regression testing (doesn't need to be compliant)? There are several ways to use SPECvirt.

If you just want to generate load against a server, you could use LOAD_SCALE_FACTORS = "0.6" or whatever percent of the tile works for you. In this case, without the "[tile#]", it runs all tiles at 60%.

If you want to use a partial tile, you need both LOAD_SCALE_FACTORS and LOAD_SCALE_FACTORS[tile#]. Look at https://www.spec.org/virt_sc2013/results/res2014q3/virt_sc2013-20140730-00016-perf.html for a 6.7 tile result. In the first table for Performance Summary, you can see tile 7 was run at 70%. You set this in Control.config, but we don't require people to submit Control.config as part of the supporting tarball because we can get the settings from the raw data file. See that at https://www.spec.org/virt_sc2013/results/res2014q3/virt_sc2013-20140730-00016.raw and search for LOAD_SCALE_FACTORS. (Ignore the first few with the -tile# after them, such as LOAD_SCALE_FACTOR-0 and -1....) It shows that in Control.config they used:

Code:
LOAD_SCALE_FACTORS    = "1.0"    # run all tiles at 100%
LOAD_SCALE_FACTORS[6] = "0.7"    # run only 7th tile at 70% using tile index [6]

That's how you run 6.7 tiles.
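
To make the arithmetic concrete, here is a minimal sketch of how the two settings combine into a per-tile load list. It assumes the semantics described above (the bare property sets one value for all tiles, and an indexed property overrides one zero-based tile); the `tile_loads` helper is hypothetical, handles only single-value settings, and is not how the harness itself parses Control.config.

```python
import re
from typing import Dict, List


def tile_loads(config_text: str, num_tiles: int) -> List[float]:
    """Resolve a per-tile load list from LOAD_SCALE_FACTORS lines.

    Assumption: a bare LOAD_SCALE_FACTORS holds one value for all tiles, and
    LOAD_SCALE_FACTORS[i] overrides the tile with zero-based index i.
    """
    default = 1.0
    overrides: Dict[int, float] = {}
    pattern = re.compile(r'^\s*LOAD_SCALE_FACTORS(?:\[(\d+)\])?\s*=\s*"([^"]*)"')
    for line in config_text.splitlines():
        match = pattern.match(line)
        if not match:
            continue                      # unrelated line or blank
        index, value = match.group(1), float(match.group(2))
        if index is None:
            default = value               # applies to every tile
        else:
            overrides[int(index)] = value # applies to one tile only
    return [overrides.get(i, default) for i in range(num_tiles)]


example = '''
LOAD_SCALE_FACTORS    = "1.0"    # run all tiles at 100%
LOAD_SCALE_FACTORS[6] = "0.7"    # run only 7th tile at 70% using tile index [6]
'''

loads = tile_loads(example, num_tiles=7)
print(loads)
print(round(sum(loads), 1), "tiles")
```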

Miles

« Reply #39 on: September 22, 2017, 01:59:39 AM »
Hi
Our objective is to produce a compliant report and evaluate the performance of our servers.

I modified the properties and got the following results:

Code:
NUM_TILES=3
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS = "0.6,0"
The "connection refused" error occurred after polling started, and the run aborted.

Code:
NUM_TILES=3
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS = "0.4,0"
The run completed successfully.

Code:
NUM_TILES=2
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS[1] = "0.1"
LOAD_SCALE_FACTORS = "1.0,0"
The "connection refused" error occurred after polling started, and the run aborted.

I am trying to generate a compliant report. How should I proceed?
Are there any settings in Control.config or other files that I should check?
Thanks.

« Last Edit: September 22, 2017, 04:42:45 AM by Miles »

lroderic

« Reply #40 on: September 22, 2017, 12:41:43 PM »
If you can run at 40% but not 60%, your system is underconfigured. You need to figure out which resources are bottlenecked and increase them if you can. What are the CPU, network, and disk utilization numbers? Are you running top or iostat? It sounds like you have enough memory.
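
If you capture top in batch mode during a run, a few lines of Python can pull the CPU summary out for comparison across VMs. This is a minimal sketch that assumes the older procps summary format ("Cpu(s): 3.1%us, ..."); newer versions of top print the line differently, and the sample line below is made up for illustration.

```python
import re
from typing import Dict


def parse_top_cpu(line: str) -> Dict[str, float]:
    """Parse a top 'Cpu(s):' summary line into {field: percent}.

    Assumes the older procps layout, e.g. 'Cpu(s):  3.1%us,  1.4%sy, ...'.
    """
    return {name: float(pct) for pct, name in re.findall(r"([\d.]+)%(\w+)", line)}


# Hypothetical sample line, not taken from a real run.
sample = "Cpu(s): 85.2%us,  9.1%sy,  0.0%ni,  3.7%id,  1.5%wa,  0.0%hi,  0.5%si,  0.0%st"

fields = parse_top_cpu(sample)
busy = 100.0 - fields["id"]            # everything that is not idle
print(f"busy: {busy:.1f}%  iowait: {fields['wa']:.1f}%")
```

A high `wa` value points at disk, while a low `id` with low `wa` points at CPU.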

Lisa

Miles

« Reply #41 on: September 24, 2017, 10:23:24 PM »
Hi
Code:
NUM_TILES=2
NUM_WORKLOADS=1
LOAD_SCALE_FACTORS[1] = "0.1"
LOAD_SCALE_FACTORS = "1.0,0"

The "Connection refused" error still occurs with this configuration.
Is this setting correct?

Resource usage during warm-up is as follows:
webserver1:
CPU usage: CPU1:55%,CPU2:72%,CPU3:72.3%,CPU4:26.7%
Memory usage:12.8GB(37.2%) of 34.4GB
Network History:Receiving:36.3MiB/s, Sending:75.1MiB/s
[root@webserver Desktop]# top

top - 09:16:38 up 2 days, 18:55,  5 users,  load average: 0.15, 0.22, 0.09
Tasks: 695 total,   1 running, 693 sleeping,   0 stopped,   1 zombie
Cpu(s):  3.1%us,  1.4%sy,  0.0%ni, 93.7%id,  0.0%wa,  1.0%hi,  0.8%si,  0.0%s
Mem:  36117916k total, 35735976k used,   381940k free,   142180k buffers
Swap:  4194296k total,        0k used,  4194296k free, 31392024k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
 5563 root      20   0  345m  35m  15m S  2.7  0.1  59:23.06 gnome-system-mo
 2545 root      20   0  769m 510m 9028 S  2.3  1.4  44:33.68 Xorg           
 3376 root      20   0  302m  15m  10m S  0.7  0.0   0:00.88 gnome-terminal 
12176 apache    20   0  563m  19m 8060 S  0.7  0.1   0:00.26 httpd           
12267 apache    20   0  564m  19m 8064 S  0.7  0.1   0:00.80 httpd           
12414 apache    20   0  563m  19m 8012 S  0.7  0.1   0:00.19 httpd           
12428 apache    20   0  563m  19m 8040 S  0.7  0.1   0:00.29 httpd           
12461 root      20   0 15548 1820 1000 R  0.7  0.0   0:01.39 top             
12492 apache    20   0  563m  19m 8048 S  0.7  0.1   0:00.26 httpd           
12523 apache    20   0  563m  17m 5816 S  0.7  0.0   0:00.16 httpd           
12627 apache    20   0  563m  19m 8016 S  0.7  0.1   0:00.19 httpd           
12640 apache    20   0  563m  19m 8044 S  0.7  0.1   0:00.25 httpd           
12661 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.17 httpd           
12662 apache    20   0  563m  19m 8056 S  0.7  0.1   0:00.20 httpd           
12725 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.14 httpd           
   42 root      20   0     0    0    0 S  0.3  0.0   4:04.61 ata_sff/0       
 1652 root      20   0     0    0    0 S  0.3  0.0   2:03.81 rpciod/3       

webserver2:
CPU usage: CPU1:1%,CPU2:1%,CPU3:2%,CPU4:18.4%
Memory usage:3.6GB(10.4%) of 34.4GB
Network History:Receiving:150.6KiB/s, Sending:8.9MiB/s
[root@webserver Desktop]# top

top - 09:16:38 up 2 days, 18:55,  5 users,  load average: 0.15, 0.22, 0.09
Tasks: 695 total,   1 running, 693 sleeping,   0 stopped,   1 zombie
Cpu(s):  3.1%us,  1.4%sy,  0.0%ni, 93.7%id,  0.0%wa,  1.0%hi,  0.8%si,  0.0%s
Mem:  36117916k total, 35735976k used,   381940k free,   142180k buffers
Swap:  4194296k total,        0k used,  4194296k free, 31392024k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND         
 5563 root      20   0  345m  35m  15m S  2.7  0.1  59:23.06 gnome-system-mo
 2545 root      20   0  769m 510m 9028 S  2.3  1.4  44:33.68 Xorg           
 3376 root      20   0  302m  15m  10m S  0.7  0.0   0:00.88 gnome-terminal 
12176 apache    20   0  563m  19m 8060 S  0.7  0.1   0:00.26 httpd           
12267 apache    20   0  564m  19m 8064 S  0.7  0.1   0:00.80 httpd           
12414 apache    20   0  563m  19m 8012 S  0.7  0.1   0:00.19 httpd           
12428 apache    20   0  563m  19m 8040 S  0.7  0.1   0:00.29 httpd           
12461 root      20   0 15548 1820 1000 R  0.7  0.0   0:01.39 top             
12492 apache    20   0  563m  19m 8048 S  0.7  0.1   0:00.26 httpd           
12523 apache    20   0  563m  17m 5816 S  0.7  0.0   0:00.16 httpd           
12627 apache    20   0  563m  19m 8016 S  0.7  0.1   0:00.19 httpd           
12640 apache    20   0  563m  19m 8044 S  0.7  0.1   0:00.25 httpd           
12661 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.17 httpd           
12662 apache    20   0  563m  19m 8056 S  0.7  0.1   0:00.20 httpd           
12725 apache    20   0  563m  19m 7980 S  0.7  0.1   0:00.14 httpd           
   42 root      20   0     0    0    0 S  0.3  0.0   4:04.61 ata_sff/0       
 1652 root      20   0     0    0    0 S  0.3  0.0   2:03.81 rpciod/3       

Clientmgr1_1096.out
 ...
-> 2017-09-25 10:46:03:824 SpecwebControl: Warming up for 300 seconds.
-> 2017-09-25 10:51:03:833 SpecwebControl: Clearing results.
-> 2017-09-25 10:51:05:538 SpecwebControl: Starting 7200-second runtime.
-> 2017-09-25 10:51:05:549 SpecwebControl: Clearing results.
-> 2017-09-25 10:51:06:261 RemoteLoadGen: [ERROR] Remote exception setting server reset data collection from wclient1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.7; nested exception is:
->    java.net.ConnectException: Connection refused
....
-> 2017-09-25 10:51:16:266 RemoteLoadGen: Warning: RMI exception trying to contact wclient1:1010. Retrying...
-> 2017-09-25 10:51:16:267 RemoteLoadGen: [ERROR] Unable to contact wclient1:1010
-> 2017-09-25 10:51:16:267,0,0,0,0,0,0,0,0,0,2147483647,0,0
-> 2017-09-25 10:51:16:268 SpecwebControl: Finished; collecting statistics.
-> 2017-09-25 10:51:16:269 RemoteLoadGen: [ERROR] Remote exception waiting for getStatistics response from wclient1:1010
-> java.rmi.ConnectException: Connection refused to host: 100.100.1.7; nested exception is:
->    java.net.ConnectException: Connection refused
-> 2017-09-25 10:51:16:278 SpecwebControl: Test Complete.
->
-> *** Test Summary ***
-> 2017-09-25 10:51:16:278 SpecwebControl: [ERROR] doBenchmark() throws Exception java.lang.NullPointerException
-> java.lang.NullPointerException
->    at org.spec.specweb.SpecwebControl.reportResults(SpecwebControl.java:1030)
->    at org.spec.specweb.SpecwebControl.runWorkload(SpecwebControl.java:484)
->    at org.spec.specweb.SpecwebControl.runTests(SpecwebControl.java:689)
->    at org.spec.specweb.SpecwebControl.doBenchmark(SpecwebControl.java:234)
->    at org.spec.specweb.SpecwebControl.access$000(SpecwebControl.java:25)
->    at org.spec.specweb.SpecwebControl$1.run(SpecwebControl.java:764)
-> 2017-09-25 10:51:16:281 SpecwebControl: Terminating run. Please wait...



Thanks.
« Last Edit: September 24, 2017, 11:56:18 PM by Miles »

lroderic

« Reply #42 on: September 25, 2017, 11:24:06 AM »
Miles, I'm reviewing this with David who regularly runs with a dedicated web client. It could be that I'm missing something in Control.config.

Meanwhile, please add the pollmecheck.sh helper script to your runspecvirt.sh command. It writes output to primectrl.out confirming that SPECpoll is up on the workload VMs. Don't rerun just yet - let's hear back from David first.
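
Separately from pollmecheck.sh, a plain TCP probe can show whether anything is listening on the host and port named in the "Connection refused" messages (wclient1:1010 in the Clientmgr log). This is a generic sketch, not part of the SPECvirt kit; it only tests that the port accepts a connection, not that the service behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call, run from the client that reported the error:
#     port_is_open("wclient1", 1010)
```

Running it from each client against each wclient shows quickly whether the refusal is specific to one host pairing.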

Lisa

DavidSchmidt

« Reply #43 on: September 25, 2017, 11:51:59 AM »
Hi Miles. It looks like the issue is actually due to communications between your clients and wclients. You should verify that the mapping between your clients and wclients matches the tile they are using (e.g., the specdriver name for tile 2 should point to client2 in wclient2's hosts file).
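
The hosts-file check described above can be sketched in a few lines. The helper names are hypothetical, the client<N> naming follows the convention used in this thread, and the addresses in the sample are placeholders from the documentation range rather than real testbed values.

```python
from typing import Dict


def parse_hosts(text: str) -> Dict[str, str]:
    """Map every hostname/alias in /etc/hosts-style text to its IP address."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


def specdriver_matches(text: str, tile: int) -> bool:
    """True if 'specdriver' resolves to the same IP as client<tile>."""
    hosts = parse_hosts(text)
    driver = hosts.get("specdriver")
    return driver is not None and driver == hosts.get(f"client{tile}")


# Placeholder excerpt of what wclient2's hosts file might contain.
sample_hosts = """
# tile-2
192.0.2.17   wclient2
192.0.2.8    client1
192.0.2.18   client2 specdriver
"""

print(specdriver_matches(sample_hosts, tile=2))
```

On a real wclient you would pass in the contents of /etc/hosts and the tile number that wclient belongs to.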

You say that your clients are using 10GbE NICs, so there really should be no reason why you cannot run the web workload on the same client as the other workloads. This also has the benefit of simplifying your benchmark harness. Is there some compelling reason you are using two client VMs per tile? I would suggest you try utilizing a single client per tile to see if you encounter the same problems.

One other thing to note is that I have seen better performance on the client side if I use SRIOV virtual functions rather than bridges for the virtual NICs, particularly when you have a lot of vClients on the same bridge. You might want to try using virtual functions.

Miles

« Reply #44 on: September 26, 2017, 05:24:14 AM »
Hi
Quote: Is there some compelling reason you are using two client VMs per tile?
When I used only one client per tile, the "connection refused" failure always occurred. I followed the User's Guide to add a wclient dedicated to the web workload, but that did not resolve the error.

Quote: You might want to try using virtual functions.
Is SRIOV necessary, even for only 2 or 3 tiles?
The symptom also occurs when I set "LOAD_SCALE_FACTORS[1]=0.1". Does that mean there is an incorrect setting in my environment?

Should I provide any other log files to help you fix this error?
Or could you please help by checking my testbed remotely?
I have been stuck on the "Connection Refused" error for a long time.

Thanks.

/etc/hosts of client1
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications

#tile-1
100.100.1.1     infraserver infraserver1
100.100.1.2     webserver webserver1
100.100.1.3     mailserver mailserver1
100.100.1.4     appserver appserver1 specdelivery specemulator
100.100.1.5     dbserver dbserver1 dbserver2 dbserver3 dbserver4
100.100.1.6     batchserver batchserver1
100.100.1.7   wclient1
100.100.1.8   client1 specdriver
#tile-2
100.100.1.11    infraserver2
100.100.1.12    webserver2
100.100.1.13    mailserver2
100.100.1.14    appserver2
100.100.1.16    batchserver2
100.100.1.17   wclient2
100.100.1.18   client2

/etc/hosts of client2
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications
100.100.1.1     infraserver1
100.100.1.2     webserver1
100.100.1.3     mailserver1
100.100.1.4     appserver1
100.100.1.6     batchserver1
100.100.1.11     infraserver infraserver2
100.100.1.12     webserver webserver2
100.100.1.13     mailserver mailserver2
100.100.1.14     appserver appserver2 specdelivery specemulator
100.100.1.5      dbserver dbserver1 dbserver2
100.100.1.16     batchserver batchserver2
100.100.1.7   wclient1
100.100.1.17   wclient2
100.100.1.8   client1
100.100.1.18   client2 specdriver

/etc/hosts of wclient1
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications

100.100.1.1     infraserver infraserver1
100.100.1.2     webserver webserver1
100.100.1.3     mailserver mailserver1
100.100.1.4     appserver appserver1
100.100.1.5     dbserver dbserver1
100.100.1.6     batchserver batchserver1
100.100.1.7   wclient1
100.100.1.8   client1 specdriver
100.100.1.18   client2

/etc/hosts of wclient2
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
##
# Defaults used for the SPECvirt_sc2013 Example VM Setup Guide.

# External VM-to-client communications

#tile-2
100.100.1.11     infraserver infraserver2
100.100.1.12     webserver webserver2
100.100.1.13     mailserver mailserver2
100.100.1.14     appserver appserver2 specdelivery specemulator
100.100.1.5      dbserver dbserver1 dbserver2 dbserver3 dbserver4
100.100.1.16     batchserver batchserver2
100.100.1.17   wclient2

100.100.1.8   client1
100.100.1.18   client2 specdriver
« Last Edit: September 27, 2017, 10:03:37 PM by Miles »