Websockets
Websockets, unlike HTTP, is a full-duplex protocol. This means that data can be sent in both directions without waiting for a poll request to enable a reply.
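As an illustration, this minimal sketch uses Java's built-in java.net.http.WebSocket client (Java 11+) to show both directions in flight at once; the endpoint URL and message contents are purely illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class DuplexDemo {
    public static void main(String[] args) throws Exception {
        WebSocket ws = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(URI.create("wss://example.org/stream"), new WebSocket.Listener() {
                    @Override
                    public CompletionStage<?> onText(WebSocket socket, CharSequence data, boolean last) {
                        // Server-pushed data arrives here without any request/reply round trip.
                        System.out.println("pushed: " + data);
                        socket.request(1); // signal readiness for the next frame
                        return null;
                    }
                })
                .join();

        // The client can write at any time, independently of inbound traffic.
        ws.sendText("subscribe:prices", true).join();
        Thread.sleep(5_000); // keep receiving pushed updates for a while
        ws.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}
```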
Although Websockets provides a number of advantages, it also has some drawbacks compared to HTTP:
- A fair number of proxies do not support the protocol very well.
- The connection-oriented nature makes the browser -> server path more susceptible to connectivity issues (especially for mobile clients).
- There is no intrinsic pacing in the protocol for data being pushed to the browser, unlike the request -> reply nature of HTTP, which provides rate limiting as a side effect of the client controlling its request rate through the speed of its processing cycle.
While the first two can be managed somewhat through approaches such as HTTPS tunneling for proxies and HTTP fallback/configuration for specific clients, the last point can only really be addressed in the communication library.
It is perhaps counterintuitive to want to rate limit the outbound path when so much focus is placed on performance. However, if we do not, Rubris will quickly overwhelm the browser/client with data, leading to a large build-up in the browser/client OS stack, steadily increasing latency, and blocked TCP streams that severely degrade behaviour.
As a rule of thumb, the browser's consumption speed for websocket data is directly keyed to the amount of processing the browser performs in the data receive path. Even a minimal amount of processing on the messages will cause the browser to struggle to keep up with a stream faster than about 15Mb/s, and for a single user with, say, 1000 subscriptions ticking very quickly we can easily generate in the region of 40-70Mb/s. This is simply too fast for the browser to deal with. The result is a rapidly growing backlog in the receiving OS buffer and continually increasing apparent latencies. The client will also experience very intermittent data receipt and will eventually time out, as the pings become queued behind more and more data in the OS buffer.
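To see where numbers of that order come from: at an assumed 10 updates per second per subscription and an average message of ~750 bytes (both purely illustrative values, not measurements), 1000 subscriptions generate 1000 × 10 × 750 × 8 ≈ 60Mb/s, squarely in that band.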
Note: A reasonable rate limit will therefore actually increase the perceived performance under many circumstances and prevent blocked OS writes, TCP stream downgrades and client disconnects.
Rate Limiting
In order to deal with these problems, Rubris applies a rate limit to WS connections by default, using an approach similar to a token bucket (https://en.wikipedia.org/wiki/Token_bucket) combined with a timer to manage connection reactivation.
Mechanism
Conceptually, each connection has a number of tokens, each of which counts as a byte that can be spent per time period. Once the tokens are exhausted by a send within a period, the connection is placed into a timer to be reactivated when the time to wait elapses. The time to wait will either be a full period, if the number of bytes immediately exceeds the token value on the first call, or a fraction of the period (rounded up to the minimum Timer period) if any time is remaining.
Unlike in the traditional token bucket, the tokens are not refreshed externally; instead, if the next send falls outside the period, the token count is reset and the process begins again.
As our goal is not really to limit outgoing data to a hard cutoff, as with the traditional token bucket, Rubris does not deduct the tokens prior to sending and block data above that level. Instead, the deduction is a post-send function. This permits reasonable flexibility in oversending, allowing elasticity in dealing with bursts and large messages that would otherwise be limited when we do not really want them to be. This elasticity can be up to ~80%, depending on the characteristics of the buffer occupancy.
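A minimal sketch of this post-send accounting is shown below; the class and method names are illustrative and do not reflect Rubris's actual internals.

```java
/**
 * Illustrative post-send token accounting (hypothetical names, not the
 * Rubris API). Tokens are deducted AFTER the write, so a burst or a large
 * message is never blocked up front; the connection is only parked once
 * the budget for the current period has been overrun.
 */
public final class PostSendLimiter {
    private final long tokensPerPeriod; // bytes allowed per period, e.g. 256 * 1024
    private final long periodMillis;    // e.g. 200
    private final long slotMillis;      // reactivation timer granularity, e.g. 100

    private long remainingTokens;
    private long periodStart = System.currentTimeMillis();

    public PostSendLimiter(long tokensPerPeriod, long periodMillis, long slotMillis) {
        this.tokensPerPeriod = tokensPerPeriod;
        this.periodMillis = periodMillis;
        this.slotMillis = slotMillis;
        this.remainingTokens = tokensPerPeriod;
    }

    /**
     * Called after a write has been handed to the OS. Returns 0 if the
     * connection may keep sending, otherwise the milliseconds for which
     * it should be parked in the reactivation timer.
     */
    public long onBytesSent(long bytes) {
        long now = System.currentTimeMillis();
        long elapsed = now - periodStart;
        if (elapsed >= periodMillis) {
            // The send falls outside the period: reset rather than refill externally.
            periodStart = now;
            elapsed = 0;
            remainingTokens = tokensPerPeriod;
        }
        remainingTokens -= bytes; // post-send deduction: oversends are permitted
        if (remainingTokens > 0) {
            return 0; // still within budget for this period
        }
        // Budget overrun: wait out the rest of the period. On a first-call
        // overrun elapsed is 0, giving a full period; otherwise a fraction,
        // floored at the timer's slot granularity.
        return Math.max(periodMillis - elapsed, slotMillis);
    }
}
```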
For example, a single 1Mb message will be sent at once with no limiting if the buffer passes completely to the OS, without penalising the actual send, and will be limited only once if the message needs to be sent in 2 writes. This means the algorithm is keyed to pace greedily with the OS buffer flush rate rather than with a pre-send check. If we limited prior to sending, the message would be limited a minimum of 4 times (with a delay of 200ms, this would take 1 second to send the data).
Similarly, if the buffers are sized to the token bucket level (as they are by default) and are filled to slightly under the limit, then 2 buffers can be sent in each period (provided the OS buffers keep up). As we can see, the algorithm is skewed towards throughput, on the assumption that bursts will not continue long enough to block the client.
It is important to realise that the primary goal is not to rate limit the stream accurately to the exact specified value, but to approximate a smoothing limit that seeks a constant throughput to the client while preventing blocking writes and TCP issues.
Blocked writes at the OS level are expensive to deal with: they require registering for write interest on the socket and waiting for the OS to inform us that the outbound buffer is empty, plus a timer period. We should aim to avoid this as much as possible; if you see recurrent blocked writes in the logs, you should examine the rate limit settings and extend the limit period or reduce the token size.
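For context, the standard Java NIO shape of that expensive path looks roughly like this (a generic sketch, not Rubris code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

final class WriteHelper {
    /**
     * Attempts a non-blocking write. If the OS send buffer cannot absorb the
     * whole payload, we must register write interest and wait for the selector
     * to tell us the socket has drained -- the expensive path described above.
     */
    static boolean writeOrRegister(SocketChannel channel, SelectionKey key, ByteBuffer buffer)
            throws IOException {
        channel.write(buffer);
        if (buffer.hasRemaining()) {
            key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
            return false; // blocked: retry only when OP_WRITE fires
        }
        key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        return true;
    }
}
```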
The Rate limiter is not currently adaptive to blocked send counts, although this may be added in the future.
Example numbers
For example, with a rate limit of 256k per 200ms period, an outbound buffer size of 256k and a Timer period of 100ms, we would expect a smooth throughput of approximately 15Mb/s if the buffers were slightly under 50% full on each write: we could send 2 complete buffers, and only after the third would we be paused for 100ms (as we would always have a fraction of the 200ms remaining, which is rounded to a single Timer period). If each buffer was filled to just under the limit, the connection would be limited after the second send. If we had many buffers with fairly low occupancy, the oversend elasticity would be small before the connection was limited.
The exact oversend is related to the frequency of writes and the percentage of the buffer filled on each write. For example, the same behaviour but with the tokens restricted to 128k per period would result in a throughput of ~5-7Mb/s, as each buffer write would overrun the tokens and so the connection would be limited after every write.
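The arithmetic behind both figures can be sketched as follows; it assumes writes of roughly 128k and about 100ms of writing per cycle, and ignores everything else (a back-of-the-envelope estimate, not a measurement).

```java
public class ThroughputEstimate {
    public static void main(String[] args) {
        double write = 128 * 1024; // bytes: ~50% of a 256k buffer

        // 256k tokens per 200ms: three writes (~384k) exhaust the tokens,
        // then a 100ms pause -> ~384k per ~200ms cycle.
        double withFullTokens = (3 * write * 8) / 0.2 / 1_000_000;
        System.out.printf("256k tokens: ~%.1f Mb/s%n", withFullTokens); // ~15.7

        // 128k tokens per 200ms: every write overruns the tokens, so each
        // ~128k write is followed by a pause, giving a cycle of ~150-200ms.
        double slow = (write * 8) / 0.2 / 1_000_000;  // ~5.2
        double fast = (write * 8) / 0.15 / 1_000_000; // ~7.0
        System.out.printf("128k tokens: ~%.1f-%.1f Mb/s%n", slow, fast);
    }
}
```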
If your average fill size is 50% and you want to limit the rate to 10-12Mb/s, you would reduce the token level to just over 50% or increase the duration. Generally it is better to reduce the token size, as otherwise pauses become larger than we really want.
As the BlockedWriteTimer is used to reactivate connections that have been limited, it is also important to understand how the configured slot size, which governs the pause granularity, interacts with this process.
The slot size is the minimum time period for which a connection can be suspended. If the limiter period is set to 200ms and the Timer slot size is 100ms, then the minimum wait time is 100ms and the maximum is 200ms. If the limiter period is set to 1 second, the minimum is 100ms and the maximum is 1 second (defined as the period minus the time elapsed between the first write and the last write in the period). However, if the slot is 500ms, then the minimum wait period is 500ms and, for the first example, the maximum will also be 500ms. In general the slot time (which defaults to 100ms) should not be altered, and especially not lowered: the threads driving the timers will become very busy, occupying CPU time that could otherwise be devoted to processing, and inserts into the timer become much more likely to compete with the timer thread for the slot lock, causing thread pauses in the processing.
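To make the rounding concrete, here is a hypothetical helper (not a Rubris API) that reproduces the minimum and maximum waits described above:

```java
public class SlotRounding {
    // A parked connection can only wake on a slot boundary, so the wait is
    // rounded up to a whole number of slots and is never below one slot.
    static long effectiveWait(long requestedMillis, long slotMillis) {
        long slots = Math.max(1, (requestedMillis + slotMillis - 1) / slotMillis);
        return slots * slotMillis;
    }

    public static void main(String[] args) {
        System.out.println(effectiveWait(200, 100)); // 200: max for a 200ms limiter with 100ms slots
        System.out.println(effectiveWait(60, 100));  // 100: the minimum is always one slot
        System.out.println(effectiveWait(150, 500)); // 500: a 500ms slot forces both min and max to 500ms
    }
}
```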
Increasing the slot value will reduce this activity and lock contention, but will obviously raise the floor for the wait time. In that case the limiter's time period should also be increased.
However, once we increase both the time and the amount of data, we are effectively back to unthrottled behaviour, which is much more likely to overrun the OS buffer and cause a far more expensive blocked socket.
The defaults assume minimal processing. If the work you are doing in the browser is slow, or interacts directly with the UI components, then it is better to decrease the tokens or increase the period. Reducing the tokens will cause more frequent limiting if you have a constantly full buffer data shape; increasing the period will allow more data per period but lengthen the pause time. Generally, if the pause time rises above 300-400ms you will start to see jitter in the browser updates rather than a smooth flow of data paced to the browser.
Default settings
The default is 256k (the default buffer size) per 200ms.
In practice this does not particularly limit throughput. For example, using 4-5 CPUs with 1000 clients and 2 modules, the default settings will allow a throughput of ~10Gb/s if the buffers are mostly filled on each push.
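As a rough cross-check (ignoring the oversend elasticity): 256k of tokens per 200ms caps a single connection at about 10.5Mb/s, so 1000 fully loaded connections amount to roughly 10.5Gb/s in aggregate, consistent with the figure above.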
Rate limiting can be disabled (see the config). However, unless you have clients that can deal with very high rates, it is recommended that you do not disable it. In most normal circumstances one would expect very few limits to be breached and the functionality to be mostly unused.
The JMX bean for the Writer in each module provides a count of connections that have been limited and blocked, which can be used to gauge the behaviour of the application for tuning.