Monday, November 9, 2009

Hardware Architecture for Conference Servers

The ATCA Summit http://www.advancedtcasummit.com/ (October 27-29, 2009) was a rare opportunity to think about and discuss the importance of hardware for our industry. As the communication industry becomes more software-driven, major industry events have focused on applications and solutions, and I rarely see good in-depth sessions on hardware. The ATCA Summit provided a refreshing new angle on communication technology.

First of all, ATCA stands for Advanced Telecom Computing Architecture and is a standard developed by PICMG – a group of hardware vendors with a great track record for defining solid hardware architectures: PCI, CompactPCI, MicroTCA, and ATCA. The ATCA Summit is the annual meeting of the ATCA community, or ecosystem, which includes vendors making chassis, blades, fans, power supplies, and other components that can be used as a tool kit to build a server quickly. Time-to-market is definitely an important reason companies turn to ATCA instead of developing their own hardware, but equally important is that this telecom-grade (carrier-grade) hardware architecture provides very high scalability, redundancy, and reliability.

So, how does ATCA relate to visual communications? As visual communication becomes more pervasive and business critical, both service providers (offering video services) and large enterprises (running their own video networks) start asking for more scalability and reliability in the video infrastructure. The core of the video infrastructure is the conference server (MCU), and the hardware architecture used in that network element has a direct impact on the ability to support large video networks. HD video compression is very resource-intensive: raw HD video is about 1.5 gigabits per second, and modern H.264 compression technology can get it down to under 1 megabit per second. This roughly 1,500-fold compression requires powerful chips (DSPs) that generate a lot of heat; therefore, the conference server hardware must provide efficient power (think AC and DC power supplies) and cooling (think fans). But even in compressed form, video still uses a lot of network bandwidth, and the conference server is the place where all video streams converge. Therefore, conference servers must have high input and output capabilities (think Gigabit Ethernet). Finally, some sort of blade architecture is required to allow for scalability, and server performance heavily depends on the way these blades are connected. The server component that connects the blades is referred to as the ‘backplane’, although it does not need to be physically in the back of the server. The ATCA architecture was built from the ground up to meet these requirements. It was created with telecom applications in mind and therefore has high input/output capacity, great power management and cooling, and many mechanisms for high reliability.
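To put the compression numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes 1080p video at 30 frames per second with 24 bits per pixel; the exact figures vary with resolution, frame rate, and codec settings, but the order of magnitude is what matters.

```python
# Back-of-the-envelope numbers for raw vs. H.264-compressed HD video.
# Assumes 1080p at 30 frames/second and 24 bits per pixel (illustrative
# values only; actual figures depend on resolution, frame rate, and codec settings).

width, height = 1920, 1080        # pixels
bits_per_pixel = 24               # 8 bits per RGB channel
frames_per_second = 30

raw_bps = width * height * bits_per_pixel * frames_per_second
compressed_bps = 1_000_000        # ~1 Mbit/s target after H.264 compression

print(f"Raw HD video:        {raw_bps / 1e9:.2f} Gbit/s")
print(f"Compressed (H.264):  {compressed_bps / 1e6:.0f} Mbit/s")
print(f"Compression ratio:   ~{raw_bps // compressed_bps}x")
```

The result is about 1.49 Gbit/s raw versus 1 Mbit/s compressed, which is where the roughly 1,500-fold figure comes from.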

The highlight of the ATCA Summit is always the Best of Show award. This year, Polycom RMX 4000 won Best of Show in the infrastructure product category, and I had the pleasure of receiving the award. I posted a picture from the award ceremony here http://www.flickr.com/photos/20518315@N00/4080968072/. Subsequently, I presented in the session ‘The Users Talk Back’ and addressed the unique hardware functions in RMX 4000 that led to this award (http://www.flickr.com/photos/20518315@N00/4059010146/).

So, why did Polycom RMX 4000 win? I think it is mostly the elegant engineering design and the pragmatic decisions about how to leverage a standard hardware architecture to achieve unprecedented reliability. It starts with a high-throughput, low-overhead backplane (which we call ‘fabric switch’) that allows free flow of video across blades. This allows conferences to use resources from any of the blades. To illustrate the importance of this point, let’s briefly compare RMX 4000 to Tandberg MSE 8000, which combines 9 blades into a chassis but does not have a high-throughput backplane. Since video cannot flow freely among blades in MSE 8000, conferences are restricted to the resources available on just one of the 9 blades. For example, if blade 1 supports 20 ports but 15 of them are already in use, you can only create a 5-party conference on that blade. If you need to start a 6-party conference, you cannot use blade 1 and have to look for another blade – let’s say blade 2 – that has 6 free ports. The 5 ports on blade 1 will stay idle until there is a conference of 5 or fewer participants. In fact, the ‘flat capacity’ software that is running on top of this hardware leads to even worse resource utilization on MSE 8000, but this article is about hardware, so I am not going into that subject (it is discussed in detail here http://videonetworker.blogspot.com/2009/08/curious-story-of-resource-management-in.html). The bottom line is that, with RMX 4000, you can connect 5 participants to one blade and the sixth participant to another blade, without even noticing it.
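Here is a minimal sketch of the difference between the two allocation models. It is not actual MCU code; the blade capacities and port counts are hypothetical, chosen to match the example above.

```python
# Minimal sketch (not actual MCU code) contrasting per-blade port allocation
# with a pooled model enabled by a high-throughput backplane.
# Blade capacities and usage numbers are hypothetical.

blades = {"blade1": {"total": 20, "used": 15},
          "blade2": {"total": 20, "used": 0}}

def allocate_per_blade(conference_size):
    """Per-blade model: the whole conference must fit on a single blade."""
    for name, b in blades.items():
        if b["total"] - b["used"] >= conference_size:
            return {name: conference_size}
    return None  # no single blade has enough free ports

def allocate_pooled(conference_size):
    """Pooled model: a conference may draw ports from several blades."""
    plan, remaining = {}, conference_size
    for name, b in blades.items():
        take = min(b["total"] - b["used"], remaining)
        if take:
            plan[name] = take
            remaining -= take
    return plan if remaining == 0 else None

# A 6-party conference: the per-blade model skips blade1's 5 idle ports,
# while the pooled model uses 5 ports on blade1 and 1 port on blade2.
print(allocate_per_blade(6))  # {'blade2': 6}
print(allocate_pooled(6))     # {'blade1': 5, 'blade2': 1}
```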

Additional reliability can be gained by using DC power and full power supply redundancy. Direct Current (DC) power is used internally in all electronic equipment. However, power comes as Alternating Current (AC) over the power grid, because AC can easily be stepped up to high voltages with transformers, which keeps transmission losses over long distances low. Once power reaches the data center, it makes sense to convert it once to DC and feed it to all servers, and that is why service providers and large enterprises running their own data centers like DC power. The alternative approach – providing AC power to each server and having each server convert it to DC – results in high conversion power loss, is basically a waste of energy, and should only be used if DC power is not available. RMX 4000 supports both AC and DC power, but I am much more excited about the new DC power option. Each DC power supply is rated at 1.5 kW and can power the entire RMX 4000. Best practice is to connect one DC power supply to the data center’s main power line and the second one to the battery array. Data centers have huge battery arrays that can keep them running even if the primary power line is down for hours or even days.
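A rough calculation shows why the single central conversion matters. The efficiency figures and server count below are assumptions for illustration only, not measured data center numbers.

```python
# Illustrative comparison of AC-to-DC conversion loss: one central conversion
# for the data center vs. a separate conversion inside every server.
# Efficiency figures and server count are assumptions for illustration only.

servers = 100
load_per_server_w = 1000          # watts of DC power each server actually needs
central_efficiency = 0.95         # one large, efficient rectifier plant
per_server_efficiency = 0.85      # many smaller AC/DC supplies, one per server

# AC power drawn from the grid in each scenario.
central_draw = servers * load_per_server_w / central_efficiency
per_server_draw = servers * load_per_server_w / per_server_efficiency

print(f"Central AC->DC conversion: {central_draw / 1000:.1f} kW from the grid")
print(f"Per-server conversion:     {per_server_draw / 1000:.1f} kW from the grid")
print(f"Extra loss with per-server conversion: "
      f"{(per_server_draw - central_draw) / 1000:.1f} kW")
```

With these assumed numbers, per-server conversion burns roughly 12 kW more than a single central conversion, which is exactly the waste the DC option avoids.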

Reliability issues may arise from mixing media (video/audio) and signaling/management traffic, and therefore RMX 4000 completely separates these two types of traffic internally. This architectural approach also benefits security, since attacks against servers are usually about getting control of the signaling in order to manipulate the media. By clearly separating the two, RMX 4000 makes hijacking the server from the outside impossible. Note that hijacking of voice conference servers is a major problem for voice service providers (I wrote about that here http://videonetworker.blogspot.com/2009/04/conferencing-service-providers-meet-at.html). As visual communication becomes more pervasive and business critical, similar issues can be expected in this space as well, and RMX 4000 is designed for that more dangerous future.
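To illustrate the principle (not the RMX implementation), here is a hypothetical sketch of binding media and signaling to separate network interfaces, so traffic arriving on the media-facing interface can never reach the control plane. The interface addresses and ports are made up for the example.

```python
# Hypothetical sketch of separating media and signaling/management traffic
# by binding each to its own network interface. Not RMX code; the IP
# addresses and ports are illustrative assumptions.
import socket

MEDIA_IFACE_IP = "10.0.1.10"       # hypothetical media (RTP) interface
MGMT_IFACE_IP = "192.168.100.10"   # hypothetical signaling/management interface

def open_media_socket(port=49152):
    """RTP media: UDP, bound only to the media-facing interface."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((MEDIA_IFACE_IP, port))
    return s

def open_signaling_socket(port=5060):
    """SIP signaling: TCP, bound only to the management-facing interface."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((MGMT_IFACE_IP, port))
    s.listen()
    return s
```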

Finally, if a component in the conference server fails, it is critical that it can be replaced without disconnecting all calls and shutting down the server, thus preserving server-level reliability. All critical components in RMX 4000 are therefore hot swappable. This includes the four media blades (in the front of the chassis, hosting the video processing DSPs), the RTM LAN modules (on the back of the chassis, connecting to the IP network), the RTM ISDN modules (also on the back, connecting to the ISDN network), the power supplies, and the fans. Each of these components can be removed and replaced with a new one while the RMX 4000 server is running.


I will discuss the topics of network-based redundancy and reliability in a separate article. Stay tuned!