Performance is a topic of interest to all users of a product, and we believe you should be able to verify the test claims for yourself with minimal effort. Although AWS has some drawbacks, especially when it comes to networking and the virtualisation costs it imposes, it's a good place to get indicative numbers that you can replicate without having to go out and buy matching hardware.
EC2 Setup
To get consistent behaviour you must be careful to baseline the system to a reasonable set of defaults; otherwise you will run into a lot of issues caused by the way the kernel defaults are configured in AWS.
The following system baselines were used to provide context for the load testing; for production, machines should generally be set up using one of these as a guide.
The outline here covers the OS-level setup. Networking setup is outlined in a later section, and individual machine differences are addressed in each particular test.
AWS config
When setting up for loadtesting there are some settings that should be defaulted for instances:
- Placement groups must be used for anything above low levels of network traffic. All machines should be in the same placement group.
- Tenancy must be dedicated (shared tenancy will throw up all sorts of CPU weirdness). A sketch of launching instances this way follows this list.
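For illustration, a minimal sketch using the AWS CLI (the placement group name and instance type are placeholders; the AMI ID is the baseline one below and is region specific):
# create a cluster placement group
aws ec2 create-placement-group --group-name perf-test --strategy cluster
# launch instances into that group with dedicated tenancy
aws ec2 run-instances --image-id ami-f0091d91 --instance-type c4.8xlarge --count 2 \
    --placement "GroupName=perf-test,Tenancy=dedicated"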
OS Baseline
Amazon Linux AMI 2015.09.1 (HVM), SSD Volume Type - ami-f0091d91
Unlimited file descriptors
cat /proc/sys/fs/file-max
// this should be set to a very large number, e.g.
1626000
// remember to check the security limits
cat /etc/security/limits.conf
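As a guide (the exact value is up to you; it simply needs to be high enough that the process never runs out of descriptors), the relevant limits.conf entries look something like:
# /etc/security/limits.conf
*    soft    nofile    1000000
*    hard    nofile    1000000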
Ensure Jumbo MTU
sudo ifconfig lo mtu 9001 up
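To confirm the MTU actually in effect (eth0 should already default to 9001 on supported instance types), for example:
ip link show eth0 | grep mtu
ip link show lo | grep mtu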
sysctl.conf
[ec2-user@ip-172-31-38-216 ~]$ cat /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
# net.ipv4.tcp_syncookies = 1
# Disable netfilter on bridges.
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
# Controls the default maximum size of a message queue
kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
#kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
#kernel.shmall = 4294967296
# System default settings live in /usr/lib/sysctl.d/00-system.conf.
# To override those settings, enter new settings here,
# or in an /etc/sysctl.d/<name>.conf file
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
fs.file-max=1000000
# Use the full range of ports.
net.ipv4.ip_local_port_range = 1024 65535
# Enables fast recycling of TIME_WAIT sockets.
# (Use with caution according to the kernel documentation!)
#net.ipv4.tcp_tw_recycle = 1
# Allow reuse of sockets in TIME_WAIT state for new connections
# only when it is safe from the network stack’s perspective.
net.ipv4.tcp_tw_reuse = 1
# Increase the number of outstanding syn requests allowed.
# TURN OFF SYNCOOKIES OR THE KERNEL WILL ASSUME IT IS A DDOS AND
# WILL SEND RST ON CONNECTIONS.
# THIS CANNOT BE STOPPED IF THIS IS ENABLED
net.ipv4.tcp_syncookies = 0
# The maximum number of "backlogged sockets". Default is 128.
# DO NOT SET THIS HIGHER AS KERNEL USES A SHORT AND SILENTLY
# TRUNCATES HIGHER VALUES
net.core.somaxconn = 65535
# Handle SYN floods and large numbers of valid HTTPS connections
# DO NOT SET THIS HIGHER AS KERNEL USES A SHORT AND SILENTLY
# TRUNCATES HIGHER VALUES
net.ipv4.tcp_max_syn_backlog = 65535
# Increase the length of the network device input queue
# DO NOT SET THIS HIGHER AS KERNEL USES A SHORT AND SILENTLY
# TRUNCATES HIGHER VALUES
net.core.netdev_max_backlog = 65535
# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
#net.core.optmem_max = 40960
#net.ipv4.tcp_rmem = 4096 87380 16777216
#net.ipv4.tcp_wmem = 4096 87380 16777216
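After editing the file, the settings can be applied and spot-checked without rebooting, for example:
sudo sysctl -p /etc/sysctl.conf
sysctl net.core.somaxconn net.ipv4.tcp_syncookies fs.file-max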
Java Version
[ec2-user@ip-172-31-38-216 ~]$ java -version
openjdk version "1.8.0_65"
OpenJDK Runtime Environment (build 1.8.0_65-b17)
OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)
driver ixgbevf
[ec2-user@ip-172-31-38-216 ~]$ modinfo ixgbevf
filename:
/lib/modules/4.1.10-17.31.amzn1.x86_64/kernel/drivers/amazon/ixgbevf/ixgbevf.ko
version: 2.14.2+amzn
license: GPL
description: Intel(R) 82599 Virtual Function Driver
author: Intel Corporation, <linux.nics@intel.com>
srcversion: 355229834F8D8C535692BEF
alias: pci:v00008086d00001515sv*sd*bc*sc*i*
alias: pci:v00008086d000010EDsv*sd*bc*sc*i*
depends:
intree: Y
vermagic: 4.1.10-17.31.amzn1.x86_64 SMP mod_unload modversions
parm: InterruptThrottleRate:Maximum interrupts per second,
per vector, (956-488281, 0=off, 1=dynamic), default 1 (array of int)
[ec2-user@ip-172-31-38-216 ~]$ ethtool -i eth0
driver: ixgbevf
version: 2.14.2+amzn
firmware-version: N/A
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
Tuning for High Volume messaging
When looking to perform very high volume messaging, one factor you will run into, even with the ixgbevf driver, is that AWS networking is quite limited at the high end.
On Linux, TCP packet delivery from the network card to the kernel is achieved using ksoftirq interrupts. Each of these interrupts is mapped to a single core, so that CPU bounds the number of interrupts that can be serviced (as well as competing with other processing). Each interrupt represents one TX or RX queue in the NIC. The ixgbevf driver provides 2 queues, and therefore the maximum we can get is 4 CPUs to service these queues using smp_affinity.
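As a rough sketch (the IRQ numbers here are examples; the real ones for an instance are found by grepping /proc/interrupts as shown further down), an interrupt is pinned to a core by writing a hexadecimal CPU bitmask to its smp_affinity file as root:
# pin the two eth0 TxRx interrupts to CPUs 1 and 2
echo 2 > /proc/irq/266/smp_affinity   # mask 0x2 = CPU 1
echo 4 > /proc/irq/267/smp_affinity   # mask 0x4 = CPU 2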
When running scenarios that are > 1Gb/s you will start to see CPU usage similar to:
12437 ec2-user 20 0 43.7g 20g 20g S 1291.7 34.9 6:09.18 java
105 root 20 0 0 0 0 R 96.8 0.0 1:50.46 ksoftirqd/24
141 root 20 0 0 0 0 R 96.8 0.0 1:46.90 ksoftirqd/33
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
16 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/4
28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/5
32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/6
36 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/7
40 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/8
44 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/9
49 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/10
12437 ec2-user 20 0 43.7g 20g 20g S 1248.9 34.9 6:46.74 java
105 root 20 0 0 0 0 S 91.1 0.0 1:53.20 ksoftirqd/24
141 root 20 0 0 0 0 R 90.4 0.0 1:49.62 ksoftirqd/33
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
...
From the top output above you can see that the ksoftirqd threads are bottlenecked on single CPUs.
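Another way to see the same thing (assuming the sysstat package is installed) is per-CPU softirq time, where a couple of cores will be pegged in the %soft column while the rest sit idle:
mpstat -P ALL 1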
However, Linux also allows us to use Receive Packet Steering (RPS) to spread receive processing over multiple CPUs.
For more detail see http://highscalability.com/blog/2014/8/18/1-aerospike-server-x-1-amazon-ec2-instance-1-million-tps-for.html.
We can take advantage of this (and also set a lower-cost clocksource) as follows:
[root@ip-10-0-1-32 ~]# grep eth0-TxRx /proc/interrupts | awk -F: '{print $1}'
266
267
[root@ip-10-0-1-32 ~]# echo '0000feff' > /sys/class/net/eth0/queues/rx-0/rps_cpus
[root@ip-10-0-1-32 ~]# echo 'feff0000' > /sys/class/net/eth0/queues/rx-1/rps_cpus
[root@ip-10-0-1-32 ~]# service irqbalance stop
Stopping irqbalance: [ OK ]
[root@ip-10-0-1-32 ~]# echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
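The result can be spot-checked by reading the masks and the clocksource back, and by watching the NET_RX softirq counters spread across CPUs:
cat /sys/class/net/eth0/queues/rx-0/rps_cpus
cat /sys/class/net/eth0/queues/rx-1/rps_cpus
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
watch -n1 'grep NET_RX /proc/softirqs'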
It is worth noting that modern 10Gb cards usually have 8-16 TX/RX queues.
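As a sketch (assuming a NIC and driver that support multiple combined channels), the queue count can be inspected and, where supported, changed with ethtool:
ethtool -l eth0                     # show pre-set maximum and current channel counts
sudo ethtool -L eth0 combined 8     # request 8 combined queues, if the driver allows it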