Making real-time, multichannel video processing a reality: There's an easier way

With the processing requirements being identical for all video streams, it has been shown that it is advantageous to build fully symmetric processing systems.

Processing power ... system power dissipation ... data bandwidth ... data connectivity ... video latency. These all become critically important components of system development when designing a system to withstand the rigors of a mobile embedded system - and manage multiple video channels in a real-time environment. But the key might prove surprising: It’s all in the processor(s).

To accommodate remote surveillance, vehicle autonomy, and situational awareness, there is a growing requirement for real-time, multichannel, remote video connections in the military sector. It is now relatively easy to connect desktop computers or even laptops with a video stream. However, once one attempts to connect multiple remote video streams in a low-power, non-equipment-friendly environment, it becomes painfully obvious that what appears simple in the office accelerates into a significantly larger challenge in an embedded environment with myriad interrelated problems, each affecting the others’ performance or cost.

As the channel count increases, some of the more prominent challenges include: having enough total processing throughput without dissipating too much power for the environment; having enough bandwidth, and therefore performance, on each individual channel to always maintain acceptable video quality; and moving all the data over a network connection while still guaranteeing minimal latency on all channels and recovery from possible loss of video. However, a dedicated processing scheme solves the issues of power, data bandwidth, and data connectivity and latency, and makes possible the video capabilities so essential to successful field operations.

Processing power and power dissipation

To understand the system requirements and possible performance/power tradeoffs, it is necessary to develop a metric based on a single-processor system. The most obvious issue to examine is that of processor bandwidth. To gauge the total processing power required to handle up to five video streams, a simple test was run on a commercially available computer containing a dual-core, 3.16 GHz Pentium-equivalent processor, 4 GB of DDR2 SDRAM at 666 MHz, and a GbE connection. The system was running Fedora 9 Linux along with GStreamer 0.10.20 video management software. Table 1 shows the data collected using multiple remote computers to stream the video to and from the target machine.

Table 1: Measured PC processor load with 3,000 kbps H.264 streams

The total system power was estimated at 44 W without disks, fans, or power supply inefficiencies. Since vehicle autonomy applications such as supply transport and tactical ground video often require at least four video streams, one for each side of the vehicle, a full dual-core PC drawing about 57 W would be necessary. For optimal Mean Time Between Failure (MTBF), and therefore field survivability, complexity and power dissipation must be minimized. Therefore, a simpler solution with less power dissipation is needed.

Data bandwidth

Inadequate memory bandwidth often results in non-real-time video or missing video streams. Either scenario could be devastating for remote ground support or tactical awareness. In an application where a real-time tactical view is required from a vehicle such as a Humvee or a tank, the typical video processing flow can split into two logical sections as depicted by Figure 1.

Figure 1: Tactical video acquisition showing the processing steps for both camera and display

To provide video processing as detailed in Figure 1, it is critical to have adequate aggregate internal memory bandwidth to manage all the data. If a single 640 x 480 video stream is acquired at 2 bytes per pixel and 30 frames per second, the bandwidth required to move the image once is 18.4 MBps. Assuming the video encoding algorithm is running mostly from internal processor cache and the software is structured so that data copies are minimized, the data throughput requirement for a single channel would be approximately 249 MBps. This number includes:

  • Buffering the incoming frames for de-interlacing – 18.4 MBps
  • Copying the data for a simple “weave” de-interlace – 18.4 MBps
  • Reading and writing the data for compression – 209 MBps for a typical algorithm implementation
  • Writing the data for data transmission – 1 MBps (assumes a compressed result of less than 8 Mbps)
  • Ethernet protocol buffering, packetizing, etc. – in most systems this could be two or more copies – 2 MBps

This data throughput requirement does not include any bandwidth allocation for program accesses, data encryption, issues with cache page size or line size, cache flushing, physical memory access inefficiencies, or stream scaling.
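The per-channel budget above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python, assuming MBps means 10^6 bytes per second; the dictionary keys are illustrative names, and the 209, 1, and 2 MBps entries are simply the article's quoted figures rather than computed values.

```python
# Per-channel memory traffic budget, using the figures quoted in the article.
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 2
FRAMES_PER_SECOND = 30


def raw_stream_mbps(width, height, bytes_per_pixel, fps):
    """Bandwidth in MBps (10**6 bytes/s) needed to move every frame once."""
    return width * height * bytes_per_pixel * fps / 1e6


frame_pass = raw_stream_mbps(WIDTH, HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SECOND)

# Key names are illustrative; the last three values are quoted, not derived.
budget_mbps = {
    "deinterlace_buffering": frame_pass,   # buffer incoming frames
    "weave_copy": frame_pass,              # copy for "weave" de-interlace
    "compression_read_write": 209.0,       # typical algorithm implementation
    "transmission_write": 1.0,             # compressed output
    "ethernet_buffering": 2.0,             # two or more protocol copies
}
total_mbps = sum(budget_mbps.values())

if __name__ == "__main__":
    for name, value in budget_mbps.items():
        print(f"{name:24s}{value:8.1f} MBps")
    print(f"{'total':24s}{total_mbps:8.1f} MBps")
```

Compression dominates the total, which is why the discussion that follows focuses on moving that work into dedicated hardware.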

Since many high-performance graphics chips contain hardware assist for compressing and decompressing video, one could drop 209 MBps from the aggregate for four channels. However, even after making that adjustment, it is not possible to get the total processing system power, given in the previous example, down to a reasonable embedded target of 10 to 15 W.

However, if one were to have an independent hardware video compression/decompression engine and a separate memory interface for each video channel, the processing requirements would be greatly reduced, and the available memory bandwidth would scale with the number of processors. This leads one quickly to the conclusion that using a low-power embedded processor, such as the Freescale i.MX27, for each channel would be a more efficient solution. In fact, when building a system with four processors running simultaneously, the processor load is less than 12 percent for each processor for the same configuration settings that were used for the PC.
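To illustrate the scaling argument, the sketch below compares the load on a single shared memory interface with the load on each dedicated interface. It is a rough model under stated assumptions: channels are spread evenly, the per-channel total is the sum of the itemized figures (248.8 MBps), and a hardware codec is assumed to remove the full 209 MBps of compression traffic from the processor's memory interface. The function names are hypothetical.

```python
# Figures quoted in the article (MBps per channel).
PER_CHANNEL_TOTAL_MBPS = 248.8   # sum of the itemized per-channel list
COMPRESSION_MBPS = 209.0         # read/write traffic for compression


def per_channel_demand(hardware_codec: bool) -> float:
    """Memory traffic for one channel.

    Assumption: a hardware codec takes the compression traffic off the
    processor's memory interface entirely.
    """
    if hardware_codec:
        return PER_CHANNEL_TOTAL_MBPS - COMPRESSION_MBPS
    return PER_CHANNEL_TOTAL_MBPS


def demand_per_interface(channels: int, interfaces: int,
                         hardware_codec: bool) -> float:
    """Average load on each memory interface, channels spread evenly."""
    return channels * per_channel_demand(hardware_codec) / interfaces


# One PC: four channels share a single memory interface, software codec.
shared = demand_per_interface(4, 1, hardware_codec=False)
# Four embedded processors: one channel and one interface each, hardware codec.
dedicated = demand_per_interface(4, 4, hardware_codec=True)

if __name__ == "__main__":
    print(f"shared interface:    {shared:7.1f} MBps")
    print(f"dedicated interface: {dedicated:7.1f} MBps each")
```

Under these assumptions the per-interface load in the dedicated design stays constant as channels are added, while the shared design grows linearly with channel count.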

Data connectivity and video latency

Assuming the encoding/decoding of channels is done on individual processors, it is still necessary to make sure that the data streams are combined in a logical fashion, that the aggregate network bandwidth is below the requirements of the remote transmission media, and that the system maintains a low latency.

Managing the source and destination of the data for the channels is relatively easy since the streams are from independent processors. Each stream can be assigned to an Ethernet address and the data streams connected to each other by a UDP connection. This can be managed through a Web page interface and can be easily reconfigured to meet the changing demands of the theater of engagement. Since the bandwidth requirement for a single channel is relatively small, 2 to 4 Mbps, the combined bandwidth can be kept to a manageable number (8 to 16 Mbps for four channels, for example).
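One way to picture the per-stream addressing is as a small routing table with one entry per processor. The sketch below is illustrative only: the channel names, IP addresses (taken from the 192.0.2.0/24 documentation range), UDP port, and per-channel bitrates are all hypothetical values, not settings from the system described here.

```python
from dataclasses import dataclass
from typing import Tuple

Endpoint = Tuple[str, int]  # (IP address, UDP port)


@dataclass(frozen=True)
class Stream:
    channel: str
    source: Endpoint        # encoding processor
    destination: Endpoint   # remote display or recorder
    bitrate_mbps: float


# Hypothetical configuration: one encoder per side of the vehicle.
STREAMS = [
    Stream("front", ("192.0.2.11", 5004), ("192.0.2.100", 5004), 2.0),
    Stream("rear",  ("192.0.2.12", 5004), ("192.0.2.100", 5006), 2.0),
    Stream("left",  ("192.0.2.13", 5004), ("192.0.2.101", 5004), 2.5),
    Stream("right", ("192.0.2.14", 5004), ("192.0.2.101", 5006), 2.5),
]


def aggregate_mbps(streams) -> float:
    """Combined network bandwidth of all configured streams."""
    return sum(s.bitrate_mbps for s in streams)


def reroute(streams, channel: str, destination: Endpoint):
    """Return a new table with one channel pointed at a new destination."""
    return [
        Stream(s.channel, s.source, destination, s.bitrate_mbps)
        if s.channel == channel else s
        for s in streams
    ]


if __name__ == "__main__":
    print(f"aggregate: {aggregate_mbps(STREAMS):.1f} Mbps")
    for s in reroute(STREAMS, "front", ("192.0.2.102", 5004)):
        print(s.channel, "->", s.destination)
```

Because each stream is an independent entry, redirecting one channel leaves the others untouched, which is the property that makes reconfiguration from a management interface straightforward.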

When independent processors are being used, the bandwidth requirements scale linearly with the number of channels being encoded. If the bandwidth requirement approaches the actual maximum throughput of the Ethernet connection, the traffic from any single connection could adversely affect the latency of the other channels. In more extreme cases, it will cause frame loss on other channels. The threshold of this effect is difficult to predict precisely since it is related to aggregate bandwidth for the wire, data buffer sizes, compression ratios, and a large number of other factors. Radio communications, such as those between a remote vehicle and a command center, are often the bottleneck for moving data. Therefore, careful examination of the application's latency, bandwidth, and connectivity requirements will drive the selection of the video encoding scheme.
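A first-pass check of link loading can be sketched as follows. Since the real threshold depends on buffer sizes, compression ratios, and other factors, the 50 percent headroom used here is purely a hypothetical planning margin, and the 100 Mbps and 10 Mbps link capacities are example values rather than figures from this system.

```python
def link_utilization(channel_bitrates_mbps, link_capacity_mbps):
    """Fraction of the link consumed by the combined video streams."""
    return sum(channel_bitrates_mbps) / link_capacity_mbps


def max_channels(per_channel_mbps, link_capacity_mbps, headroom=0.5):
    """Largest channel count that keeps utilization at or below `headroom`.

    `headroom` is a hypothetical safety margin, not a measured threshold.
    """
    return int((link_capacity_mbps * headroom) // per_channel_mbps)


if __name__ == "__main__":
    # Four channels at the upper 4 Mbps figure on an example 100 Mbps link.
    print(f"utilization: {link_utilization([4.0] * 4, 100.0):.0%}")
    # The same streams over an example 10 Mbps radio link.
    print("channels on wired link:", max_channels(4.0, 100.0))
    print("channels on radio link:", max_channels(4.0, 10.0))
```

The contrast between the two example links shows why the radio segment, not the onboard Ethernet, tends to dictate the encoding bitrate.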

Multiprocessor remedy

With the processing requirements being identical for all video streams, it has been shown that it is advantageous to build fully symmetric processing systems. Video bandwidth is optimized, as each processor has its own dedicated memory and is handling only a single video stream. Connectivity is simplified because each stream can be treated as a separate Ethernet address. Finally, latency is minimized because each processor is handling only one major task with the assistance of dedicated hardware. Also, each processor can be identical and, therefore, run the same software. This further reduces design cost and the complexity of managing the deployed system.

To validate the multiprocessor technology, the power consumption of the Beyond Electronics CPU-iMX27-VME video/audio processing board was measured. The result, approximately 11 W with four processors streaming video and the onboard Ethernet switch managing the data traffic through a single Ethernet connection, stands in sharp contrast to the 57 W of the commercial system described earlier. When packaged in a conduction-cooled VMEbus form factor, this multiprocessor system is clearly suited for operation within the constraints of most tactical mobile environments.
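The power comparison can be put on a per-channel basis with a short calculation. This sketch uses only the two figures quoted in the article (57 W and approximately 11 W) and assumes, for illustration, that power is shared equally across four channels in both systems.

```python
PC_POWER_W = 57.0      # commercial dual-core PC, as quoted
BOARD_POWER_W = 11.0   # four-processor board, approximate measured value
CHANNELS = 4

# Assumption: power divides evenly across the channels in each system.
pc_per_channel_w = PC_POWER_W / CHANNELS
board_per_channel_w = BOARD_POWER_W / CHANNELS
power_reduction = 1.0 - BOARD_POWER_W / PC_POWER_W

if __name__ == "__main__":
    print(f"PC:    {pc_per_channel_w:.2f} W per channel")
    print(f"Board: {board_per_channel_w:.2f} W per channel")
    print(f"Reduction in total power: {power_reduction:.0%}")
```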

George Schreck is Chief Technical Officer with Beyond Electronics Corporation, where his current responsibilities include product development, product marketing, and corporate management. His experience includes 24 years of product design and marketing of high-reliability, embedded systems. He holds a Bachelor of Science from Lock Haven State University with additional studies in Physics and Computer Science at Lycoming College and Centenary College. He can be reached at gschreck@beyondelectronics.us.

Beyond Electronics Corporation 919-231-8000 www.beyondelectronics.us