Papers on SCTP Network Performance
SCTP Performance in Data Center Environments
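This paper was published by Intel Corporation in October 2014; the author is Krishna Kant.
Introduction
The paper first gives a brief introduction to SCTP, then runs a black-box comparison of the Linux implementations of TCP and SCTP and their performance. It next measures how much SCTP performance improves once the corresponding optimizations are applied. It closes with a discussion of SACK and a summary.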
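SCTP features and data center requirements
The main features of TCP are:
- stream-oriented, in-order delivery;
- a greedy scheme for ramping up the sending rate;
- AIMD (additive increase, multiplicative decrease: the window grows slowly and shrinks sharply) for congestion control;
- decades of development have made it quite stable;
- some drawbacks: no integrity check, exposure to DoS attacks, and so on.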
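The AIMD behaviour listed above can be illustrated with a small sketch. It is not taken from the paper: the function names, the increase of one segment per RTT and the halving factor are assumptions chosen for illustration, and slow start is left out.

```python
def aimd_step(cwnd: float, loss: bool, add: float = 1.0,
              mult: float = 0.5, floor: float = 1.0) -> float:
    """Return the congestion window (in segments) after one RTT."""
    if loss:
        # Multiplicative decrease: shrink sharply, never below the floor.
        return max(floor, cwnd * mult)
    # Additive increase: grow slowly by a fixed amount.
    return cwnd + add


def simulate(loss_per_rtt, start: float = 1.0):
    """Apply aimd_step once per RTT and record the window after each RTT."""
    cwnd = start
    trace = []
    for loss in loss_per_rtt:
        cwnd = aimd_step(cwnd, loss)
        trace.append(cwnd)
    return trace


trace = simulate([False, False, False, True, False])
print(trace)  # [2.0, 3.0, 4.0, 2.0, 3.0]
```

The single loss in the fourth RTT halves the window, after which linear growth resumes.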
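Some features of SCTP:
- Multi-streaming. One association can carry multiple streams (logical channels). Flow control and congestion control still operate per association.
- Flexible ordering. Ordering is preserved within each stream, not across the whole association.
- Multi-homing: support for multi-homed hosts.
- DoS protection.
- Robust association establishment: CRC checksum and a heartbeat mechanism.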
- Each SCTP operation (data send, heartbeat send, connection init, …) is sent as a “chunk” with its own header to identify such things as type, size, and other parameters.
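To make the chunk idea concrete, here is a minimal sketch of how one DATA chunk could be laid out behind the SCTP common header. The field order and sizes follow RFC 4960 as far as I know it; the helper names `common_header` and `data_chunk` are made up for this example, and the checksum field is left at zero rather than computed.

```python
import struct

DATA_CHUNK_TYPE = 0   # chunk type value for DATA
FLAG_BEGIN_END = 0x03  # B and E bits set: an unfragmented user message


def common_header(src_port: int, dst_port: int, verification_tag: int,
                  checksum: int = 0) -> bytes:
    """12-byte common header: ports, verification tag, checksum."""
    return struct.pack("!HHII", src_port, dst_port, verification_tag, checksum)


def data_chunk(tsn: int, stream_id: int, stream_seq: int, payload: bytes,
               ppid: int = 0, flags: int = FLAG_BEGIN_END) -> bytes:
    """DATA chunk: 16-byte chunk header, payload, padding to 4 bytes."""
    length = 16 + len(payload)  # the length field excludes padding
    header = struct.pack("!BBHIHHI", DATA_CHUNK_TYPE, flags, length,
                         tsn, stream_id, stream_seq, ppid)
    padding = b"\x00" * (-length % 4)
    return header + payload + padding


packet = common_header(5000, 5001, 0x12345678) + data_chunk(1, 0, 0, b"hello")
print(len(packet))  # 36
```

Note that the stream identifier sits inside the chunk, not in the common header, so a receiver only learns which stream a message belongs to after parsing the chunk.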
Some requirements of data centers:
- much higher data rates
- much smaller and less variable round-trip times (RTTs)
- higher installed capacity and hence less chance of severe congestion
- low to very low end-to-end latency requirements
- unique quality of service (QoS) needs
Given these requirements, what is needed is:
- a low protocol processing overhead is more important than improvements in achievable throughput under sustained packet losses
- achieving low communication latency is more important than using the available BW most effectively
- a crucial performance metric for data center transport is number of CPU cycles per transfer (or CPU utilization for a given throughput)
- packet losses should be actively avoided, rather than tolerated
- delay based congestion control is much preferred in a data center than a loss based congestion control
- a data center demands a much higher level of availability, diagnosability and robustness
- fairness property is less important than the ability to provide different applications bandwidths
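RDMA can reduce memory copies, but an effective implementation of RDMA becomes very difficult on top of a byte stream abstraction. This is why SCTP is considered.
SCTP vs. TCP performance comparison
There are two main SCTP implementations: LK-SCTP, and KAME, an open-source implementation. The paper mainly uses the former for its test comparisons.
Test environment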
- two 2.8 GHz Pentium IV machines (HT disabled) with 512 KB second level cache (no level 3 cache)
- Red Hat 9.0 with a 2.6 kernel
- one or more Intel Gb NICs
- One machine was used as a server and the other as a client
- Multi-streaming tests were done using a small traffic generator that we cobbled up
- iPerf sends back to back messages of a given size
Test configuration
- checksum offload
- transport segmentation offload (TSO)
Checksum calculation is very CPU intensive. In terms of CPU cycles, CRC-32 increases the protocol processing cost by 24% on the send side and a whopping 42% on the receive side. Clearly the CRC-32 code needs to be taken out of the software path; dedicated hardware can take over this function.
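A bitwise sketch shows why a software CRC is expensive: every payload byte costs eight shift-and-XOR steps. The notes only say "CRC-32"; the variant below is CRC32c (Castagnoli polynomial), which to my knowledge is the checksum RFC 3309 specifies for SCTP, so treat the choice of polynomial as an assumption about what the paper measured. Production code would use lookup tables or a hardware instruction instead of this loop.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c(data: bytes) -> int:
    """Bit-at-a-time CRC32c over data (slow on purpose, for illustration)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):  # eight shift/XOR steps per input byte
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


print(hex(crc32c(b"123456789")))  # 0xe3069283, the standard check value
```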
TSO for SCTP would have to be a lot more complex than that for TCP and will clearly require new hardware. TSO was therefore disabled for TCP as well.
Comparison metrics
- Average CPU cycles per instruction (CPI)
- Path-length or number of instructions per transfer (PL)
- Number of cache misses per instruction in the highest level cache (MPI)
- CPU utilization
- Achieved throughput.
First comparison
Key points of the test method:
- a single connection running over the Gb NIC
- pushing 8 KB packets as fast as possible under zero packet drops
- The receive window size was set to 64 KB (more than the small RTT of 56 us requires)
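A rough bandwidth-delay product calculation shows why 64 KB is more than enough here. The sketch assumes the link runs at exactly 1 Gb/s and ignores header overhead; the function name is made up for this example.

```python
def bandwidth_delay_product_bytes(link_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link busy for one RTT."""
    return link_bits_per_s * rtt_s / 8


bdp = bandwidth_delay_product_bytes(1e9, 56e-6)  # 1 Gb/s link, 56 us RTT
window = 64 * 1024

print(round(bdp))                 # 7000 bytes in flight fill the pipe
print(round(window / bdp, 1))     # 9.4, the window is about nine times that
```

Under these assumptions the receive window cannot be the throughput bottleneck in the first comparison.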
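Main results:
- For both send and receive, SCTP reaches almost the same throughput as TCP;
- SCTP CPU utilization on the send side is about 2.1X that of TCP;
- cache misses on the send side are lower than for TCP;
- the number of instructions executed on the send side is about 3.7X that of TCP;
- the overall CPI on the send side is 60% of TCP's;
- the receive side shows the same overall pattern, though somewhat improved, because the receive side does less work.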
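The send-side numbers can be cross-checked against each other, because cycles per transfer is the product of path length (instructions per transfer) and CPI. The sketch below uses only the ratios quoted above; the function name is an assumption, and the comparison with CPU utilization holds only because throughput is roughly equal for the two protocols.

```python
def relative_cycles_per_transfer(cpi_ratio: float, pl_ratio: float) -> float:
    """Cycles per transfer relative to TCP: (CPI ratio) * (path-length ratio)."""
    return cpi_ratio * pl_ratio


# SCTP relative to TCP on the send side: CPI at 60%, path length at 3.7X.
ratio = relative_cycles_per_transfer(0.60, 3.7)
print(round(ratio, 2))  # 2.22
```

The result, about 2.2X, is close to the reported 2.1X CPU utilization, so the three figures are consistent: SCTP executes many more instructions, but they are cheaper on average.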
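Explanation of the results: inefficient implementation of data chunking, chunk bundling, maintaining several linked data structures, SACK processing, etc.
Second comparison
The main changes to the test method:
- 64-byte messages are transferred instead of 8 KB;
- the window size is set to 64 KB and to 128 KB in turn.
Main results:
- With the default 64 KB window, TCP throughput is slightly better than SCTP's; this is contrary to expectation;
- With a 128 KB window, TCP's throughput advantage over SCTP is larger still; this is attributed to fewer data-structure operations.
In a data center environment low latency matters far more than high throughput, so turning the NO-DELAY setting on is the right choice. By default, whenever the window allows an MTU to be sent, SCTP will build a packet from the available application messages instead of waiting for more to arrive.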
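The "build a packet from whatever is queued" behaviour can be sketched as a greedy bundling loop. This is an illustration, not LK-SCTP code: the function names are hypothetical, the header sizes assume IPv4 without options plus the 12-byte SCTP common header and 16-byte DATA chunk header, and window checks and fragmentation of oversized messages are ignored.

```python
from collections import deque

IP_HEADER = 20           # IPv4 header without options (assumption)
SCTP_COMMON_HEADER = 12
DATA_CHUNK_HEADER = 16


def padded(n: int) -> int:
    """Round a chunk length up to a multiple of 4 bytes."""
    return (n + 3) & ~3


def build_packet(queue: deque, mtu: int = 1500) -> list:
    """Bundle the messages already queued into one packet, without waiting."""
    room = mtu - IP_HEADER - SCTP_COMMON_HEADER
    bundled = []
    while queue and padded(DATA_CHUNK_HEADER + len(queue[0])) <= room:
        msg = queue.popleft()
        room -= padded(DATA_CHUNK_HEADER + len(msg))
        bundled.append(msg)
    return bundled


queue = deque([b"x" * 64] * 30)   # thirty 64-byte application messages
first = build_packet(queue)
print(len(first), len(queue))     # 18 12
```

With only one message queued the same call returns a packet carrying a single chunk, which is the low-latency case the second comparison exercises.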
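Multi-streaming performance
Test environment
- 1.28 KB messages are used, so that the single NIC does not become the bottleneck
- a DP (dual processor) configuration is used, so that the CPU does not become the bottleneck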
- a single NIC with one connection (or association)
Test results
- Overall, single-stream throughput is on par between TCP and SCTP, but SCTP with multiple streams over a single connection does worse;
- Comparing 2 associations with 1 stream each against 1 association with 2 streams:
- throughput of 1 association with 2 streams is 28% lower than that of 2 associations with 1 stream each
- CPU utilization of 1 association with 2 streams is likewise 28% lower
- the send and receive sides show the same pattern
In effect, the streams are about the same weight as associations; furthermore, they are also unable to drive the CPU to 100% utilization. The root cause is related to locking and synchronization.
A closer look reveals problems in both the SCTP implementation and the specification.
- The sendmsg() implementation locks the socket at the beginning of the function and unlocks it when the message is delivered to the IP layer. This problem can be alleviated by a change in the TCB (transmission control block) structure along with finer granularity locking.
- A more serious issue is on the receive end: since the stream id is not known until the arriving SCTP packet has been processed and the chunks removed, there is little scope for simultaneous processing of both streams.
SCTP performance enhancements
Implementation issues
- Both of these structures are dynamically allocated & freed by LK-SCTP
- Each chunk is managed via two other data structures
- a total of 3 memory to memory (M2M) copies before the data appears on the wire.
Countermeasures
- Avoid dynamic memory allocation/deallocation in favor of pre-allocation or use of ring buffers
- Use chunk bundling only when appropriate
- Cut down on M2M copies for large messages
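The first countermeasure, pre-allocation with a ring buffer, can be sketched as follows. This is a generic illustration rather than anything from LK-SCTP or the paper: the class name `ChunkRing` and its methods are hypothetical, and a real kernel implementation would manage fixed slots in C instead of Python bytearrays.

```python
class ChunkRing:
    """Fixed pool of pre-allocated buffers, reused in FIFO order."""

    def __init__(self, slots: int, slot_size: int):
        # All memory is allocated once, up front.
        self._bufs = [bytearray(slot_size) for _ in range(slots)]
        self._lens = [0] * slots
        self._head = 0    # index of the oldest occupied slot
        self._count = 0   # number of occupied slots

    def put(self, payload: bytes) -> bool:
        """Copy payload into the next free slot; False if full or too large."""
        if self._count == len(self._bufs) or len(payload) > len(self._bufs[0]):
            return False
        tail = (self._head + self._count) % len(self._bufs)
        self._bufs[tail][:len(payload)] = payload  # same-length overwrite
        self._lens[tail] = len(payload)
        self._count += 1
        return True

    def get(self):
        """Return a view of the oldest payload (no copy), or None if empty."""
        if self._count == 0:
            return None
        view = memoryview(self._bufs[self._head])[:self._lens[self._head]]
        self._head = (self._head + 1) % len(self._bufs)
        self._count -= 1
        return view


ring = ChunkRing(slots=2, slot_size=1500)
ring.put(b"first")
ring.put(b"second")
print(ring.put(b"third"))    # False: the ring is full, nothing is allocated
print(bytes(ring.get()))     # b'first'
```

Refusing a `put` when the ring is full trades a dynamic allocation for back-pressure on the sender; whether that trade is acceptable depends on the workload.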
Specification issues
- one SACK per packet
- the frequency of SACKs in SCTP becomes too high and contributes to very substantial overhead
- the maximum burst size (MBS). While the intention of MBS is clearly rate control, specifying it as a constant protocol parameter or embedding a complex dynamic algorithm in the transport layer is not a desirable approach.
- transmission control block (TCB).
- the size of the association structure is an order of magnitude bigger at around 5KB.
- Large TCB sizes are undesirable both in terms of processing complexity and in terms of caching efficiency
Performance under Errors
A reduction in SACK frequency is detrimental to throughput performance at high drop rates, but is desirable at lower drop rates.
SCTP performance and security
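The main contribution of this paper is its introduction to the SCTP protocol format, together with simple tests of the new features. The main points:
- Overview of protocol format. Figures are used to present the SCTP protocol format.
- SCTP common header format
- 12 control chunk types and 1 data chunk type, given as a list of chunk types
- four-way association initiation, with the four-way handshake shown as a figure
- SCTP state diagram
- The tests in this paper are much simpler; SCTP performance is not tested in depth.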
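The four-way association initiation mentioned above can be written down as a message sequence plus the client-side state transitions. The chunk and state names follow RFC 4960 as I understand it; the table layout and the `run_client` helper are made up for this sketch, and timers, retransmission and error paths are omitted.

```python
# Who sends which chunk, in order.
HANDSHAKE = [
    ("client", "INIT"),
    ("server", "INIT ACK"),      # carries a state cookie; server keeps no state yet
    ("client", "COOKIE ECHO"),   # client returns the cookie
    ("server", "COOKIE ACK"),
]

# Client-side state machine for association setup.
CLIENT_TRANSITIONS = {
    ("CLOSED", "send INIT"): "COOKIE-WAIT",
    ("COOKIE-WAIT", "recv INIT ACK"): "COOKIE-ECHOED",   # client sends COOKIE ECHO
    ("COOKIE-ECHOED", "recv COOKIE ACK"): "ESTABLISHED",
}


def run_client(events, state: str = "CLOSED") -> str:
    """Follow the client through the given events; KeyError on an invalid one."""
    for event in events:
        state = CLIENT_TRANSITIONS[(state, event)]
    return state


print(run_client(["send INIT", "recv INIT ACK", "recv COOKIE ACK"]))  # ESTABLISHED
```

Because the server allocates no association state until it sees a valid COOKIE ECHO, a flood of INIT chunks costs it little, which is the DoS protection listed among the SCTP features.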
Portable and Performant Userspace SCTP Stack
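This paper mainly discusses implementing the SCTP protocol in user space. In addition, it surveys existing protocol work, including the state of SCTP support on several platforms such as Linux, iOS and Windows.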