SPECsfs2008_nfs.v3 Result: Hewlett-Packard Company

Tested By	Hewlett-Packard Company
Product Name	BL860c 4-node HA-NFS Cluster
Hardware Available	May 2009
Software Available	December 2009
Date Tested	September 2009
SFS License Number	3
Licensee Locations	Cupertino, CA USA

The HP BL860c 4-node HA-NFS cluster is a high-performance, highly available blade server cluster that can provide combined fileserver, database, and compute services to high-speed Ethernet networks. HP's blade technology provides linear scale-out processing from a single node to many nodes, either within a single blade enclosure or across enclosures. The HP-UX operating system combined with HP's Serviceguard software provides continuous availability of the cluster, with transparent fail-over of nodes in the event of a node failure. Each individual blade node is highly available as well, with redundant network and storage interfaces that fail over transparently while maintaining high I/O throughput. In addition, Double Chip Spare technology is used to assure node resilience in the face of memory module failures. At only 6U high, the c3000 chassis used in this cluster provides a dense and energy-efficient enclosure capable of delivering high I/O throughput. The MSA2324fc arrays providing storage in the benchmark are reliable and dense, and deliver high throughput. Each MSA2324fc holds up to 99 high-performance disk drives in only 8U of rack space. Unlike proprietary NFS solutions, HP's HA-NFS blade cluster solution is flexible: any HP-supported Fibre Channel connected storage can be used, and the blade nodes don't need to be dedicated solely to providing NFS services.

Configuration Bill of Materials

Item No	Qty	Type	Vendor	Model/Name	Description
1	1	Blade Chassis	HP	c3000	6U blade chassis with 6 fans, 6 power supplies, DVD drive, and dual management modules
2	4	Blade Server	HP	BL860c	Integrity BL860c 9140M dual-processor blade server with 48GB of memory
3	8	Disk Drive	HP	384842-B21	HP 72GB 3Gbit 10K RPM SAS 2.5 inch small form factor dual-port hard disk drive
4	8	Disk controller	HP	403619-B21	HP BLc Dual Port 4Gbps QLogic FC PCIe mezzanine card
5	4	Network adapter	HP	NC364m	Quad Port 1GbE BL-c PCIe Adapter
6	2	FC Switch	HP	AJ821A	HP B-Series 8Gbit 24-port SAN switch for BladeSystem c-Class
7	2	Network Switch	HP	455880-B21	HP Virtual Connect Flex-10 10Gb Ethernet Module for the BladeSystem c-Class
8	6	FC Disk Array	HP	MSA2324fc G2	HP StorageWorks dual controller Modular Smart Array with 24 small form factor SAS disk bays
9	18	Disk Storage Tray	HP	MSA70	HP StorageWorks disk storage bay with 25 small form factor SAS disk bays including a dual-domain I/O module for redundancy
10	576	Disk Drive	HP	418371-B21	HP 72GB 3Gbit 15K RPM SAS 2.5 inch small form factor dual-port hard disk drive

Server Software

OS Name and Version	HP-UX 11iv3 0909 Data Center OE
Other Software	none
Filesystem Software	VxFS 5.0

Server Tuning

Name	Value	Description
filecache_min	27	Minimum percentage of memory to use for the filecache. Set using kctune
filecache_max	27	Maximum percentage of memory to use for the filecache. Set using kctune
fs_meta_min	1	Minimum percentage of memory to use for the filesystem metadata pool. Set using kctune
vx_ninode	2400000	Maximum number of inodes to cache. Set using kctune
vxfs_bc_bufhwm	5500000	Maximum amount of memory (in bytes) to use for caching VxFS metadata. Set using kctune
lcpu_attr	1	Enable Hyper-Threading. Set using kctune
fcache_fb_policy	2	Enable synchronous flush-behind on writes. Set using kctune
base_pagesize	8	Set the OS minimum pagesize in kilobytes. Set using kctune
numa_mode	1	Set the OS operating mode to LORA (Locality-Optimized Resource Alignment). Set using kctune
fcache_vhand_ctl	10	Set how aggressively the vhand process flushes dirty pages to disk. Set using kctune
max_q_depth	64	Set the disk device queue depth via scsimgr.
Number of NFS Threads	256	The number of NFS threads has been increased to 256 by editing /etc/default/nfs.
tcp_recv_hiwater_def	1048567	Maximum TCP receive window size in bytes. Set on boot via /etc/rc.config.d/nddconf
tcp_xmit_hiwater_def	1048567	Amount of unsent data that triggers write-side flow control in bytes. Set on boot via /etc/rc.config.d/nddconf
noatime	on	Mount option added to all filesystems to disable updating of access times.
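
As a rough illustration of what the filecache_min and filecache_max settings imply, the sketch below converts the 27 percent value into gigabytes for one blade. It assumes the percentage applies to the full 48GB of installed memory per node; the variable names are illustrative and are not HP-UX tunables.

```python
# Values taken from the tuning table and the bill of materials.
NODE_MEMORY_GB = 48        # memory per BL860c blade
FILECACHE_PERCENT = 27     # filecache_min == filecache_max == 27

# Assumption: the percentage is applied to all installed memory.
filecache_gb = NODE_MEMORY_GB * FILECACHE_PERCENT / 100
print(f"Filecache per node: {filecache_gb:.2f} GB")
```

Because the minimum and maximum are equal, the filecache size is effectively pinned rather than allowed to grow and shrink.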

Server Tuning Notes

The kernel patch PHKL_40441 is needed to set the fcache_vhand_ctl tunable. This patch will be available by mid-December 2009. Device read-ahead was disabled on the MSA2324fc array by using the command "set cache-parameters read-ahead-size disabled" via the array's CLI on each exported LUN.

Disks and Filesystems

Description	Number of Disks	Usable Size
Hardware mirrored boot disks for system use only. There are two 73GB 2.5" 10K RPM SAS disks per node.	8	285.0 GB
96 73GB 2.5" 15K RPM disks per MSA2324fc array (6 total), bound into six 16 disk RAID10 LUNs striped in 64KB chunks. All data filesystems reside on these disks.	576	19.2 TB
Total	584	19.5 TB
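
The disk counts in the table can be cross-checked with a few lines of arithmetic. This is only a sanity check built from the figures stated above; the constant names are illustrative.

```python
# Figures taken from the Disks and Filesystems table.
ARRAYS = 6
LUNS_PER_ARRAY = 6
DISKS_PER_LUN = 16
NODES = 4
BOOT_DISKS_PER_NODE = 2

data_disks = ARRAYS * LUNS_PER_ARRAY * DISKS_PER_LUN   # disks holding data filesystems
boot_disks = NODES * BOOT_DISKS_PER_NODE               # mirrored boot disks
total_disks = data_disks + boot_disks
total_luns = ARRAYS * LUNS_PER_ARRAY

print(data_disks, boot_disks, total_disks, total_luns)
```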

Number of Filesystems	48
Total Exported Capacity	19.1 TB
Filesystem Type	VxFS
Filesystem Creation Options	-b 8192
Filesystem Config	Each set of 12 filesystems is striped in 1MB chunks across 9 LUNs (144 disks) using HP-UX's Logical Volume Manager. There are 4 sets of 12 filesystems in all.
Fileset Size	15810.6 GB

The storage consists of six identically configured HP MSA2324fc disk arrays. Each array consists of a main unit that holds dual controllers and bays for 24 2.5" SAS disks. Three 2U MSA70 SAS drive shelves are daisy-chained from the MSA2324fc to add an additional 72 2.5" SAS drives (24 drives per MSA70). All drives are dual-ported and each MSA70 has two paths back to the MSA2324fc for redundancy (one to each controller). Each redundant MSA2324fc controller has a 4Gbit Fibre Channel connection to a blade chassis FC switch. One controller connects to one FC switch and the other controller connects to the other FC switch for redundancy. Each array controller is the primary owner of three of the array's six LUNs. In the event of a controller failure or FC link failure, LUN ownership is transferred to the other controller. HP's Logical Volume Manager (LVM) is used to combine nine array LUNs into a volume group. Twelve logical volumes of identical size were then created from each volume group by striping across every LUN in the volume group using 1MB chunks. A filesystem was then created on each logical volume (12 per volume group, 48 in all). A volume group and the filesystems on it are owned by a specific node. In case of a node failure, the volume group and filesystems owned by that node are transferred to another node in the cluster. Once the failed node recovers, its volume group and filesystems are migrated back to it. The migration of volume groups and filesystems is transparent to the remote users of the filesystems. All nodes are able to see all of the LUNs in the configuration at the same time, but each node owns only a subset of them.
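
To make the striping concrete, the sketch below maps a byte offset in a logical volume to one of the nine LUNs in its volume group, using plain round-robin striping with 1MB chunks. It is a simplified model for illustration only: the real LVM on-disk layout is not described in this report, and the function name is hypothetical.

```python
STRIPE_BYTES = 1024 * 1024   # 1MB stripe chunk, as stated in the text
LUNS_PER_VG = 9              # nine array LUNs per volume group
TOTAL_LUNS = 36              # six arrays with six LUNs each
LVS_PER_VG = 12              # twelve logical volumes per volume group

def lun_for_offset(offset_bytes):
    """Return (lun_index, offset_within_lun) under simple round-robin striping."""
    chunk = offset_bytes // STRIPE_BYTES    # which 1MB chunk of the volume
    lun_index = chunk % LUNS_PER_VG         # chunks rotate across the 9 LUNs
    stripe_row = chunk // LUNS_PER_VG       # full stripes that precede this chunk
    within_chunk = offset_bytes % STRIPE_BYTES
    return lun_index, stripe_row * STRIPE_BYTES + within_chunk

volume_groups = TOTAL_LUNS // LUNS_PER_VG
filesystems = volume_groups * LVS_PER_VG
print(volume_groups, filesystems)
print(lun_for_offset(10 * STRIPE_BYTES))
```

Under this model, a large sequential transfer touches all nine LUNs (144 disks) of a volume group once it spans 9MB.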

Network Configuration

Item No	Network Type	Number of Ports Used	Notes
1	Jumbo 10 Gigabit Ethernet	2	Traffic enters the cluster over these redundant 10GbE connections. There is one 10GbE connection per Flex-10 switch in the c3000.
2	Jumbo Gigabit Ethernet	32	Each BL860c blade has 8 of these connections: 4 carry data and 4 are hot-standby redundant links.

Network Configuration Notes

There are two Flex-10 10GbE switches configured in the c3000 blade chassis. Each switch has an external 10GbE port active to accept incoming network traffic. Dual switches are used for redundancy, and both are active. The first switch connects internally to four 1GbE built-in ports on each BL860c blade. Two of these blade ports are used to carry network traffic while the other two are used for redundancy in the event of a failure. The second switch connects internally to the PCIe 4-port GbE card on each blade. Again, two ports are used to carry network traffic and the other two are used for redundancy. HP-UX's Auto Port Aggregation (APA) software is used to create highly available (HA) network pairs. One port from the built-in set of ports and one from the 4-port PCIe card are used to form an APA pair. Four pairs in total were created on each blade. Only one port in an APA pair is active; the other acts as a standby in case the first port fails. All interfaces were configured to use jumbo frames (MTU size of 9000 bytes). All clients connected to the cluster through an HP ProCurve 3500yl-48G. The 10GbE uplink ports from the ProCurve were connected to the Flex-10 modules in the c3000, one 10GbE link per Flex-10 switch.

Benchmark Network

An MTU size of 9000 was set for all connections to the benchmark environment (load generators and blade servers). Each load generator connected to the network via one of its on-board 1GbE ports. Each blade server had four APA interfaces configured which combined two physical 1GbE server ports in an active/standby combination, thus resulting in a virtual 1GbE port for each APA. Each APA interface on the blade was then assigned an IP address in a separate IP subnet, so four IP subnets were configured on each blade. The same IP subnets were used across all four blades. The IP addresses for the load generators were chosen so they mapped evenly into the four IP subnets (four load generators per IP subnet). Each load generator sends requests to, and receives responses from, all active server interfaces.
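
The even spread of load generators across subnets can be sketched as follows. The report states only that four load generators map into each of the four subnets; the round-robin rule and the numbering of load generators and subnets used here are assumptions for illustration.

```python
from collections import Counter

LOAD_GENERATORS = 16
SUBNETS = 4

# Assumption: load generators are assigned to subnets round-robin.
assignment = {lg: (lg - 1) % SUBNETS + 1 for lg in range(1, LOAD_GENERATORS + 1)}

# Count how many load generators land in each subnet.
per_subnet = Counter(assignment.values())
print(dict(per_subnet))
```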

Processing Elements

Item No	Qty	Type	Description	Processing Function
1	8	CPU	Intel Itanium Dual-Core Processor Model 9140M, 1.66GHz with 18M L3 cache	NFS protocol, VxFS filesystem, Networking, Serviceguard
2	12	NA	MSA2324fc array controller	RAID, write cache mirroring, disk scrubbing

Memory

Description	Size in GB	Number of Instances	Total GB	Nonvolatile
Blade server main memory	48	4	192	V
Disk array controller's main memory, 1GB per controller	2	6	12	NV
Grand Total Memory Gigabytes			204
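
The grand total follows directly from the two rows of the memory table, as this short check shows. The tuple layout is just a convenience for the example.

```python
# (description, size in GB, number of instances) from the memory table.
memory_rows = [
    ("Blade server main memory", 48, 4),
    ("Disk array controller memory (per array)", 2, 6),
]

# Total per row, then the grand total across rows.
totals = {name: size * count for name, size, count in memory_rows}
grand_total_gb = sum(totals.values())
print(totals, grand_total_gb)
```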

Memory Notes

Each storage array has dual controller units that work as an active-active fail-over pair. Writes are mirrored between the two controllers of a pair. In the event of a power failure, the modified cache contents are backed up to Flash memory using power from a super capacitor. The Flash memory module can retain the contents of the cache indefinitely. In the event of a controller failure, the other controller unit is capable of saving all state that was managed by the first (and vice versa). When one of the controllers has failed, the other controller turns off its write cache and writes directly to disk before acknowledging any write operations. If a super capacitor fails on a controller, the write cache is also disabled.

Stable Storage

NFS stable write and commit operations are not acknowledged until after the MSA2324fc array has acknowledged that the data is stored in stable storage. The MSA2324fc has dual controllers that operate as an active-active fail-over pair. Writes are mirrored between controllers, and a super capacitor plus Flash memory is used to preserve the cache contents indefinitely in the event of power failure. If either a controller or a super capacitor fails, writes are not acknowledged until the written data reaches the disk.

System Under Test Configuration Notes

The system under test consisted of four Integrity BL860c blades, each with 48GB of memory. The blades were configured in an active-active cluster fail-over configuration using HP's Serviceguard HA software, which comes standard with the HP-UX Data Center Operating Environment. Each BL860c contained two dual-port 4Gbit FC PCIe cards and a single four-port PCIe 1GbE card. Each BL860c also had two built-in dual-port 1GbE interfaces (4 ports total). The BL860c blades were contained within a single c3000 chassis configured with two Flex-10 10Gb Ethernet switches and two 8Gbit FC switches. The Flex-10 in I/O bay #1 connected to all of the built-in 1GbE ports on each blade (16 ports total). The Flex-10 in I/O bay #2 connected to all of the 1GbE ports from the 4-port PCIe card on each blade (16 ports total). The 8Gbit FC switch in I/O bay #3 connected to one port on each PCIe dual-port 4Gbit FC card in each blade. The 8Gbit FC switch in I/O bay #4 connected to the other port on each PCIe dual-port 4Gbit FC card in each blade. Six of the external ports on the FC switch in I/O bay #3 connect to the A controller of each MSA2324fc. Six of the external ports on the FC switch in I/O bay #4 connect to the B controller of each MSA2324fc. This fully connected FC configuration allows every FC port on the blade (4 total) to access any LUN on any array. A blade could lose three of its four FC ports and still have full access to all LUNs via the remaining port. Each MSA2324fc connected to the blade chassis is fully redundant.
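
The claim that a blade keeps full LUN access after losing three of its four FC ports can be checked with a small connectivity model. The port and bay labels below are hypothetical names for the example, and the model assumes that LUN ownership can move to whichever array controller is still reachable, as the storage description states.

```python
from itertools import combinations

ARRAYS = range(6)

# Each dual-port FC card has one port on each FC switch (labels are illustrative).
PORT_TO_SWITCH = {
    "card1_port1": "bay3", "card1_port2": "bay4",
    "card2_port1": "bay3", "card2_port2": "bay4",
}
# The bay 3 switch reaches controller A of every array; bay 4 reaches controller B.
SWITCH_TO_CONTROLLER = {"bay3": "A", "bay4": "B"}

def reachable_arrays(working_ports):
    """Arrays reachable through at least one controller from the working ports."""
    controllers = {SWITCH_TO_CONTROLLER[PORT_TO_SWITCH[p]] for p in working_ports}
    # Either controller of an array can serve all of its LUNs after ownership transfer.
    return set(ARRAYS) if controllers else set()

# Try every way of failing three of the four ports.
worst_case = min(
    len(reachable_arrays(set(PORT_TO_SWITCH) - set(failed)))
    for failed in combinations(PORT_TO_SWITCH, 3)
)
print(worst_case)
```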

Other System Notes

Test Environment Bill of Materials

Item No	Qty	Vendor	Model/Name	Description
1	16	HP	Integrity rx1620	1U server with 8GB RAM and the HP-UX 11iv3 operating system

Load Generators

LG Type Name	LG1
BOM Item #	1
Processor Name	Intel Itanium 2 1.3GHz 3MB L3
Processor Speed	1.3 GHz
Number of Processors (chips)	2
Number of Cores/Chip	1
Memory Size	8 GB
Operating System	HP-UX 11iv3 0903
Network Type	Built-in 1GbE

Load Generator (LG) Configuration

Benchmark Parameters

Network Attached Storage Type	NFS V3
Number of Load Generators	16
Number of Processes per LG	48
Biod Max Read Setting	2
Biod Max Write Setting	2
Block Size	AUTO
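
The parameters above determine how many load-generating processes ran in total. The sketch below computes that figure and, using the peak throughput of 134689 Ops/Sec reported in this result, a rough average rate per process. The per-process average is a derived illustration, not a number stated in the report.

```python
LOAD_GENERATORS = 16
PROCESSES_PER_LG = 48
PEAK_OPS_PER_SEC = 134689   # reported SPECsfs2008_nfs.v3 result

total_processes = LOAD_GENERATORS * PROCESSES_PER_LG
# Average operations per second handled by each process at peak load.
ops_per_process = PEAK_OPS_PER_SEC / total_processes
print(total_processes, round(ops_per_process, 1))
```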

Testbed Configuration

LG No	LG Type	Network	Target Filesystems	Notes
1..16	LG1	1	/n1_1,/n1_2,/n1_3,/n1_4,/n1_5,/n1_6,/n1_7,/n1_8,/n1_9,/n1_10,/n1_11,/n1_12,/n2_1,/n2_2,/n2_3,/n2_4,/n2_5,/n2_6,/n2_7,/n2_8,/n2_9,/n2_10,/n2_11,/n2_12,/n3_1,/n3_2,/n3_3,/n3_4,/n3_5,/n3_6,/n3_7,/n3_8,/n3_9,/n3_10,/n3_11,/n3_12,/n4_1,/n4_2,/n4_3,/n4_4,/n4_5,/n4_6,/n4_7,/n4_8,/n4_9,/n4_10,/n4_11,/n4_12	N/A
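
The 48 target filesystem names in the table follow a regular pattern and can be generated rather than typed. The sketch assumes the first number in each name identifies the owning node and the second the filesystem on that node; the report does not state this mapping explicitly.

```python
NODES = 4
FILESYSTEMS_PER_NODE = 12

# Build /n<node>_<fs> for every node and filesystem index, in table order.
targets = [
    f"/n{node}_{fs}"
    for node in range(1, NODES + 1)
    for fs in range(1, FILESYSTEMS_PER_NODE + 1)
]
print(len(targets), targets[0], targets[-1])
```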

Load Generator Configuration Notes

All filesystems were mounted on all clients, which were connected to the same physical and logical network.

Uniform Access Rule Compliance

Other Notes

Config Diagrams

Hewlett-Packard Company	:	BL860c 4-node HA-NFS Cluster
SPECsfs2008_nfs.v3	=	134689 Ops/Sec (Overall Response Time = 2.53 msec)
