Published Articles - (Issue 38 of Widescreen Review)


Data Compression For Film: Psychoacoustics Considerations, SDDS® and Dolby® Digital - Part 2
By Perry Sun


Part 1 of this series of articles on data compression (or data reduction) technologies for digital multichannel audio, published in Issue 37, was devoted to elucidating the relevant digital audio concepts, defining the necessity for measures to reduce the amount of data needed for film sound and explaining how the theatrical version of Digital Theater Systems® (DTS®) operates. The purpose of this series is to provide some information on the inner workings of the three digital film sound formats currently in use, as an aid in formulating judgments about how one of these systems performs with respect to the others. Then, in subsequent installments, we will explore data compression systems for consumer applications, such as DTS Coherent Acoustics (for DTS Digital Surround), MPEG audio and Meridian Lossless Packing (MLP).

It has been widely believed that the ability of a particular sound format to convey digital audio at reduced data rates and without noticeable degradation is closely allied to the degree of data reduction. However, it is just not as simple as that!

The reality is that all of the digital sound companies have diligently ensured that the end result of their products is film audio of exemplary fidelity, with any possible data compression-related distortion kept to a minimum and unlikely to be noticed by the typical moviegoer (one could arguably say the same for professional audio experts). Because the degree of data reduction varies among the formats, the methods used to achieve high quality audio at low data loads vary in complexity as well.

In this article, we will explore the measures used with Sony Dynamic Digital Sound® (SDDS®) and Dolby® Digital to achieve reduced audio data requirements for film applications. We'll also find out why judging their performance through direct listening comparisons is not trivial, and consider audio data compression for digital cinema.


Figure 1 - (A) Auditory sensitivity threshold (black) and loudness relative to 1kHz at 40dB (gray). Redrawn from reference 7. (B) Frequency masking at two amplitudes of a masking signal at 400Hz. Redrawn from reference 5.

Psychoacoustics Considerations
Data reduction measures for SDDS, Dolby Digital and several others, including MPEG audio, reduce the data needed to convey acoustic signals by attempting to mimic human patterns of hearing (also referred to as psychoacoustics). By capitalizing on well-documented characteristics of how we perceive sound, these algorithms are then programmed to retain data relevant to sounds we can hear and discard those that are considered imperceptible. Such an approach to audio data reduction is known as perceptual coding.

There are three aspects of psychoacoustics that are relevant to perceptual coding algorithms. The first is the variation of auditory sensitivity with frequency. As is evident from the graph in Figure 1 A, humans do not perceive sound equally throughout the audible frequency range, approximately 20Hz to 20kHz (20,000Hz). The black curve shows the threshold of audibility. The gray curve is the equal-loudness curve, or the sound pressure levels (SPLs) required at various frequencies to produce the same apparent loudness as a 1kHz signal at 40dB SPL.
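To give a feel for the shape of the threshold curve in Figure 1 A, the sketch below evaluates a widely used closed-form approximation of the threshold of hearing in quiet, usually attributed to Terhardt. The formula and the function name are assumptions brought in for illustration; they are not taken from the figure, and nothing here implies that ATRAC or AC-3 use this particular expression.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate threshold of hearing in quiet, in dB SPL.

    Illustrative assumption: a common closed-form fit (attributed to
    Terhardt), not the curve used by any specific codec.
    """
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# The ear is most sensitive around 3-4kHz and much less so at the extremes.
for freq in (100, 1000, 3300, 16000):
    print(f"{freq:>6} Hz: {threshold_in_quiet_db(freq):6.1f} dB SPL")
```

A perceptual coder can treat anything that falls below such a curve as inaudible, regardless of what else is playing.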

The second pertinent aspect of psychoacoustics is known as masking, and occurs when one sound renders another sound inaudible. Masking can manifest itself across frequencies or over time. A very simple example is the sound of a bird chirping being masked by an airplane taking off simultaneously. Two masking curves are shown in Figure 1 B, centered around a masking signal at 400Hz. Any sound that falls below the curve will be obscured by the 400Hz signal. It should be apparent that masking is most effective for sounds that are closest to each other in frequency, and is dependent on the amplitude of the masking signal. Masking is also optimal for sounds happening closest to each other in time, although it is possible for a sound to be masked by another which occurred just before (forward masking), or to a more limited extent, just after it (backward masking).

The third characteristic of our auditory systems that is utilized by perceptual coders is the spectral resolution of sounds. The human hearing mechanism "detects" sounds by their frequencies through the basilar membrane of the inner ear, which splits the sound into its spectral components. Our ability to discriminate between audio signals at various frequencies is limited to specific regions, known as critical bands. There are about 30 critical bands spanning the audible frequency range, each about one-third octave in width (an octave corresponds to a doubling of frequency; a critical band centered at 180Hz would have a width of approximately 60Hz). Therefore, the width of the critical band increases with frequency, with the majority of these bands residing below about 5kHz.


Figure 2 - Bit allocation, using models of masking (solid line) and auditory threshold (dashed line). Dark gray bars are spectral coefficients to which bits are allocated with higher priority than the coefficients denoted by the light gray bars.

SDDS: Coding On The Basis Of Hearing
SDDS stores its digital data on film, at both edges of the print along the sprocket holes. A dedicated soundhead illuminates the soundtracks at both edges with red LEDs (Light Emitting Diodes), and the pattern of tiny opaque and transparent blocks on the tracks, representing 0s and 1s of the binary data, is imaged using two CCD (Charge-Coupled Device) cameras. The stream of binary digits from the soundhead is then decoded into up to eight channels. SDDS is the only format that offers up to eight-channel capability, with the addition of two screen channels over the conventional 5.1 channel configuration. [The exception is the DTS special venue eight-channel system. - Editor] There are also additional channels containing "extra" data at each edge of the film, to be used in situations where reading the data is difficult at either edge (a likely occurrence, due to wear and tear after repetitive playback).

The digital audio encoding-decoding algorithm (or codec) for SDDS is ATRAC, or Adaptive TRansform Acoustic Coding, and was originally developed by Sony for their consumer audio MiniDisc format. ATRAC is a perceptual coder.

ATRAC operates by first dividing the audio signal into three sub-bands, 0-5.5kHz, 5.5-11kHz and 11-22kHz. As is the case for the apt-X data compression algorithm for DTS (see Part 1 in Issue 37), quadrature mirror filters (QMFs) are used for the signal division. As a first and primary approach to mimic human perception of sound, the sub-bands are transformed from the time domain into the frequency domain, using what is known as the Modified Discrete Cosine Transform (MDCT), so that individual spectral components (or binary number coefficients) of the audio signal can be resolved. This process is fully reversible, so the original time domain signal can be reconstructed. Data compression algorithms that work with audio in the frequency domain are known as transform coders. (In contrast, apt-X operates on audio signals in the time domain, and is a waveform coder.)
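The reversibility of the MDCT can be checked with a small experiment. The sketch below is a textbook MDCT with a sine window and 50 percent overlapping blocks, written for clarity rather than speed. The window choice, the block size of 16 samples and all function names are assumptions for illustration, not ATRAC's actual filter bank. Each block on its own comes back with aliasing, but adding the overlapping halves of neighboring blocks cancels it.

```python
import math

def sine_window(block_len):
    # Sine window; satisfies w[n]^2 + w[n + N]^2 = 1 for N = block_len / 2.
    return [math.sin(math.pi * (n + 0.5) / block_len) for n in range(block_len)]

def mdct(block):
    """2N time samples in, N spectral coefficients out."""
    half = len(block) // 2
    return [sum(x * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
                for n, x in enumerate(block))
            for k in range(half)]

def imdct(coeffs):
    """N coefficients in, 2N time samples out (still containing aliasing)."""
    half = len(coeffs)
    return [2.0 / half * sum(c * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
                             for k, c in enumerate(coeffs))
            for n in range(2 * half)]

def round_trip(signal, half):
    """Transform overlapping blocks, invert them, and overlap-add the results."""
    window = sine_window(2 * half)
    padded = [0.0] * half + list(signal) + [0.0] * half
    padded += [0.0] * (-len(padded) % half)   # pad to a whole number of half-blocks
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * half + 1, half):
        block = [padded[start + n] * window[n] for n in range(2 * half)]
        restored = imdct(mdct(block))
        for n in range(2 * half):
            out[start + n] += restored[n] * window[n]
    return out[half:half + len(signal)]

signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
rebuilt = round_trip(signal, half=8)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
print(f"largest reconstruction error: {max_error:.2e}")
```

The error printed is at the level of floating-point rounding. In a real coder the loss comes from re-quantizing the coefficients, not from the transform.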

MDCT works with blocks of digital audio samples, and the number of spectral coefficients yielded is equal to half the number of samples in the block. Usually, blocks consisting of 11.6ms (milliseconds or 0.0116 seconds) of digital data are used within each sub-band to provide for a total of 256 frequency coefficients (based on ATRAC's sampling rate of 44.1kHz). However, this block size can be a problem with transients in the audio signal that occur on a considerably shorter timescale, resulting in increased quantization error (see Part 1) prior to the transient in the reconstructed time domain signal. To minimize this artifact (known as pre-echo), when a transient is detected, the block size is reduced to 1.45ms for the 11-22kHz sub-band and 2.9ms for the others, so that the timescale for any audible quantization error is reduced to the point of being obscured by backward masking from the onset of the transient.
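The block durations quoted above follow directly from the sampling rate. The short sketch below converts block lengths in samples to milliseconds at 44.1kHz and applies the "half as many coefficients as samples" rule. The sample counts of 512, 128 and 64 are inferred from the durations in the text, and the calculation ignores any sample-rate reduction inside the sub-band filters, so treat it as a simplification.

```python
SAMPLE_RATE_HZ = 44_100  # ATRAC sampling rate quoted in the text

def block_duration_ms(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of a block of samples, in milliseconds."""
    return 1000.0 * samples / sample_rate_hz

def mdct_coefficients(samples):
    """The MDCT yields half as many coefficients as samples in the block."""
    return samples // 2

# Assumed block lengths: long block, short block (lower bands), short block (11-22kHz).
for samples in (512, 128, 64):
    print(f"{samples:>4} samples = {block_duration_ms(samples):5.2f} ms, "
          f"{mdct_coefficients(samples)} coefficients")
```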

The spectral coefficients from the MDCT are then grouped into bands. The width of the band increases with frequency, so in essence, these bands are intended to mimic the critical bands characteristic of human hearing. The spectral binary number coefficients in each band are then re-quantized, by expressing them as a smaller binary number with a common word length and a common scaling factor (which determines the quantizing step size; see Part 1). This is also known as floating-point coding, and will be explained later with Dolby Digital. A bit allocation procedure is used to determine the word length for each of the bands, based on models of auditory sensitivity threshold and frequency masking. The basic idea of how this works is shown in Figure 2. The masking curve (solid line) is calculated about each band. Band coefficients which are greater than or equal to the sensitivity threshold (dashed line) and the masking curve are assigned word lengths to allow for precise quantization (dark gray bars), while coefficients deemed to be insensitive or obscured by masking are assigned small or zero word lengths (light gray bars). ATRAC limits the total number of bits available to be allocated.
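The bit allocation idea can be sketched as a simple greedy loop. This is a hypothetical illustration under stated assumptions, not Sony's actual procedure. Each band has a signal level, a masking level and a threshold-in-quiet level (all in dB, all made-up numbers here), each additional bit of word length is assumed to lower quantization noise by about 6dB, and the budget is counted in one bit per band rather than per coefficient.

```python
DB_PER_BIT = 6.02  # rule of thumb: one more bit lowers quantization noise by about 6dB

def allocate_word_lengths(level_db, mask_db, quiet_db, bit_budget, max_bits=16):
    """Hand out bits, one at a time, to the band whose noise is most audible."""
    # A band's noise is inaudible once it is below BOTH curves, so use the higher one.
    threshold_db = [max(m, q) for m, q in zip(mask_db, quiet_db)]
    bits = [0] * len(level_db)
    for _ in range(bit_budget):
        # Audible excess: how far each band's noise still sits above its threshold.
        excess = [lvl - DB_PER_BIT * b - thr
                  for lvl, b, thr in zip(level_db, bits, threshold_db)]
        candidates = [i for i in range(len(bits))
                      if bits[i] < max_bits and excess[i] > 0]
        if not candidates:
            break  # all remaining noise is masked or below the hearing threshold
        worst = max(candidates, key=lambda i: excess[i])
        bits[worst] += 1
    return bits

levels = [60, 45, 30, 20]   # hypothetical band levels, dB
mask = [20, 40, 35, 10]     # hypothetical masking curve, dB
quiet = [10, 5, 5, 25]      # hypothetical threshold in quiet, dB

print(allocate_word_lengths(levels, mask, quiet, bit_budget=12))  # [7, 1, 0, 0]
print(allocate_word_lengths(levels, mask, quiet, bit_budget=4))   # [4, 0, 0, 0]
```

The last two bands receive no bits at all because they lie below the masking curve or the sensitivity threshold, which corresponds to the light gray bars in Figure 2. With the smaller budget, the loudest unmasked band takes everything that is available.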


Figure 3 - Block diagram for ATRAC

A diagram summarizing the processes in ATRAC is shown in Figure 3. The input, PCM audio at 20-bit resolution and 44.1kHz sampling rate (solid gray arrow), is split into three sub-bands (QMF). Then, each of the sub-bands is transformed into its frequency coefficients (F), with a transient detector (T) that reduces the block size if necessary. These coefficients are then grouped into bands and undergo floating-point coding (C) by re-quantizing spectral coefficients (Q) with word lengths based on bit allocation (A). The output (black dashed arrow), consisting of the spectral coefficients, plus auxiliary data (block size, word length, scaling factor), is 20 percent of the input data, resulting in a 5:1 compression ratio. The coding is performed on each of the channels separately. The decoding consists of simply reversing the encoding steps.

Dolby Digital: Capitalizing On Efficiency
Dolby Digital stores its data on film, in the spacing between the sprocket holes. Similar to SDDS, the optical digital data consists of an array of tiny transparent and opaque blocks, illuminated by a series of red LEDs, and read into the sound processor using a CCD camera and electronics to convert video into digital data. The space available for data is very limited; a bit rate of about 560 kilobits (thousand bits) per second was determined to be practical for accurately and reliably reading the data. After taking into account ancillary data needed to be reserved for error correction (for splices and dirt accumulation), the bit rate available for coding multichannel audio is only 320 kilobits per second. As noted in Part 1, six channels of uncompressed PCM audio (20-bit/48kHz) requires a bit rate of 5.76 megabits (million bits) per second; so a data reduction ratio of 18:1 is needed.
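The 18:1 figure is plain arithmetic on the numbers quoted above, as the short sketch below shows. The function name is ours; the channel count, word length, sampling rate and available bit rate are the ones given in the text.

```python
def pcm_bit_rate(channels, bits_per_sample, sample_rate_hz):
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return channels * bits_per_sample * sample_rate_hz

uncompressed = pcm_bit_rate(channels=6, bits_per_sample=20, sample_rate_hz=48_000)
available = 320_000  # bits per second left for audio after error correction

ratio = uncompressed / available
print(f"uncompressed: {uncompressed / 1e6:.2f} megabits per second")  # 5.76
print(f"required reduction: {ratio:.0f}:1")                            # 18:1
```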

The codec for Dolby Digital is AC-3®, originally envisioned for accommodating low-bandwidth requirements for DTV (Digital Television). In addition to digital film sound, AC-3 is used for several consumer formats including LaserDisc, DVD, digital satellite broadcast and DTV. As is the case for ATRAC, AC-3 accomplishes data reduction by discarding bits corresponding to components of the audio that are considered inaudible to humans. However, AC-3 also utilizes a number of other coding strategies, so that only five to seven percent of the original data is needed to convey 5.1 channels of high fidelity digital audio.

The central philosophy behind AC-3 is that all channels should be compressed together as an ensemble, where the total bits that can be accommodated by the media (which in this case is film) is distributed among the channels. The input to the AC-3 encoder is six-channel PCM audio (16 to 24-bit resolution and 48kHz sampling rate). The first step is to transform each of the channels from the time to the frequency domain, using Time Domain Aliasing Cancellation (TDAC). Blocks of 512 samples, or 10.7ms of audio, are normally used to yield 256 spectral coefficients. However, when a transient signal is detected, the block size is reduced to 5.4ms duration to minimize pre-echo.

Each of the spectral coefficients from all of the channels is then converted from a fixed-point binary number to floating-point notation, consisting of two smaller binary numbers: a mantissa and an exponent. The mantissa is a fractional amount of the fixed-point number, and the exponent is a scaling factor, by which the mantissa is multiplied to obtain the fixed-point number. The word length of the mantissa determines the resolution, and the exponent determines the quantizing step size of the frequency component. Expressing spectral coefficients in floating-point form is advantageous because it allows for floating-point coding opportunities. (As mentioned previously, ATRAC uses floating-point coding, by grouping spectral coefficients into bands, converting them to floating-point notation, and then assigning a single exponent to each band.)
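A minimal sketch of the mantissa and exponent idea follows. It assumes coefficients are fractions between -1 and 1, that the exponent counts how many times the value can be doubled while staying below 1 (so the scaling factor is a power of two), and that one mantissa bit is spent on the sign. The function names and the `max_exponent` limit are hypothetical, and the actual AC-3 bit formats are not reproduced here.

```python
def to_floating_point(value, mantissa_bits, max_exponent=24):
    """Split a fraction (-1 < value < 1) into (exponent, quantized mantissa)."""
    exponent = 0
    # Each doubling that keeps the magnitude below 1 removes a leading zero.
    while exponent < max_exponent and abs(value) * 2 ** (exponent + 1) < 1.0:
        exponent += 1
    steps = 2 ** (mantissa_bits - 1)          # one bit is spent on the sign
    code = round(value * 2 ** exponent * steps)
    code = max(-steps, min(steps - 1, code))  # keep the code inside the word length
    return exponent, code / steps

def from_floating_point(exponent, mantissa):
    """Rebuild the coefficient: mantissa times the power-of-two scaling factor."""
    return mantissa * 2.0 ** -exponent

coefficient = 0.0123
for word_length in (4, 8, 12):
    exponent, mantissa = to_floating_point(coefficient, word_length)
    rebuilt = from_floating_point(exponent, mantissa)
    print(f"{word_length:>2}-bit mantissa: exponent={exponent}, "
          f"mantissa={mantissa:.6f}, error={abs(rebuilt - coefficient):.2e}")
```

The exponent stays the same in all three cases. Only the mantissa word length changes, and with it the size of the quantization error. That is the knob the bit allocation routine turns.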

AC-3 uses a variety of strategies to code with floating-point numbers. If the audio signal is steady over time, then exponent information can be repeated over several blocks, up to about 64ms in duration or six blocks. Dolby also determined that if the difference between exponents in adjacent frequencies were to be coded instead of the actual values of the exponents, only about two-bit resolution would be required.
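The saving from coding exponent differences can be illustrated with a toy example. The exponent values below are invented, and the helper names are ours. The point is only that neighboring exponents tend to differ by small steps, so the differences span a much smaller range than the exponents themselves. The exact AC-3 exponent formats are not modeled.

```python
def delta_encode(exponents):
    """Keep the first exponent; replace the rest by differences from their neighbor."""
    deltas = [b - a for a, b in zip(exponents, exponents[1:])]
    return exponents[0], deltas

def delta_decode(first, deltas):
    """Rebuild the exponents by accumulating the differences."""
    out = [first]
    for step in deltas:
        out.append(out[-1] + step)
    return out

def bits_for_range(lowest, highest):
    """Bits needed to label every integer from lowest to highest."""
    return (highest - lowest).bit_length()

exponents = [3, 4, 4, 5, 6, 6, 5, 7, 8, 8]  # hypothetical exponents along the spectrum
first, deltas = delta_encode(exponents)

print(deltas)                                   # [1, 0, 1, 1, 0, -1, 2, 1, 0]
print(bits_for_range(0, max(exponents)))        # 4 bits per absolute exponent
print(bits_for_range(min(deltas), max(deltas))) # 2 bits per difference
```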

For further data reduction, bit allocation measures are used. The set of spectral coefficient exponents spanning the frequency range is a representation of the signal power along the spectrum. This set is grouped into bands, whose width increases with frequency, similar to the critical bands.

Each band has a common exponent. The word length of the mantissas within each of the bands is then determined by a bit allocation routine, which is based on a predicted masking curve over the entire spectral range. This masking curve is determined for each frequency band. If a band exponent lies above or below the masking curve (as calculated from a model), the value of the curve at that band frequency is accordingly incremented or decremented. The results from each of the bands are then combined to obtain the predicted masking curve. After making sure that all parts of this curve exceed the threshold of human auditory sensitivity, the mantissa for each frequency component (not for each band) is re-quantized, with resolution corresponding to the extent to which its exponent exceeds the predicted masking value.

At this point, you may be wondering if ATRAC and AC-3 are similar. In fact, they are! Actually, Sony has to license some of their coding procedures from Dolby Laboratories.

After the mantissas have been re-quantized, a count of the number of bits consumed for all of the channels is performed. If the total number of bits available has not been exceeded, then the mantissas can be quantized with greater accuracy. However, if the total has been surpassed, then two measures can be invoked. The first is to just decrease the resolution of the mantissas. Up until now, we have only been considering data reduction in AC-3 for each of the channels independently. A second way of meeting the total bit requirement is a technique known to Dolby as coupling. The basic idea is that mantissa information for frequency bands across multiple channels is combined into a single coupling channel, based on the average signal power. For each band, the ratio between the signal power in the coupling channel and in each separate channel (known as the coupling coefficient) is substituted for the mantissa and exponent in each channel, which in turn requires fewer bits. Then, the original spectral coefficients for each channel are recovered upon decoding, by multiplying the mantissas from the coupling channel by the appropriate coupling coefficients. Coupling occurs only for frequency bands above 10kHz.
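A rough sketch of the coupling idea, for a single frequency band, is given below. It is a simplification under our own assumptions. The coupling channel is taken as the plain average of the channels, and each coupling coefficient is the square root of the ratio of a channel's band power to the coupling channel's band power, so that it can be applied as a gain. The function names and sample values are hypothetical, and the real AC-3 procedure differs in detail.

```python
import math

def band_power(coeffs):
    """Average power of the spectral coefficients in one band."""
    return sum(c * c for c in coeffs) / len(coeffs)

def couple(channels):
    """Merge one band of several channels into a coupling channel plus one gain each."""
    size = len(channels[0])
    coupling = [sum(ch[i] for ch in channels) / len(channels) for i in range(size)]
    reference = band_power(coupling)
    coefficients = [math.sqrt(band_power(ch) / reference) if reference > 0 else 0.0
                    for ch in channels]
    return coupling, coefficients

def decouple(coupling, coefficients):
    """Decoder side: scale the shared coupling channel by each channel's coefficient."""
    return [[c * gain for c in coupling] for gain in coefficients]

left = [0.40, -0.20, 0.10, 0.30]    # hypothetical high-frequency band, left channel
right = [0.10, -0.15, 0.05, 0.20]   # same band, right channel

coupling, coefficients = couple([left, right])
decoded = decouple(coupling, coefficients)

for name, original, rebuilt in zip(("left", "right"), (left, right), decoded):
    print(f"{name}: power {band_power(original):.4f} -> {band_power(rebuilt):.4f}")
```

In this sketch each decoded channel keeps its original power in the band, while the fine detail within the band is shared between channels. Instead of a full set of mantissas and exponents per channel, only one set plus one coefficient per channel has to be stored.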




Figure 4 - Block diagram for AC-3

A block diagram summarizing the processes for AC-3 is shown in Figure 4. Input is six-channel PCM (solid black arrows), which is then transformed into the frequency domain (F), with block size determined by the transient detector (T). The binary number spectral coefficients are converted to floating-point and coded (C) by expressing exponents (gray dotted arrows) with the lowest bit resolution necessary and re-quantizing mantissas (Q; black dotted arrows) via bit allocation (A). The output compressed data stream (black dashed arrow) consists of the mantissa and exponent information from all of the channels, plus auxiliary data for exponent coding, coupling coefficients (gray dashed arrow), bit allocation, etc. It should be noted that several other coding strategies than those just described are implemented to achieve low data rates; AC-3 is by far the most complex of the three codecs used in digital film sound. The steps for decoding the data are essentially the reverse of those for encoding, requiring the auxiliary data for parameters and information on reconstructing the original channels.

Summary And Considerations
Perceptual coders achieve data reduction by discarding bits considered to be imperceptible to human hearing. This method of reducing data relies on three principles of psychoacoustics: the variation of human auditory sensitivity and perceived loudness with frequency, time and frequency masking of sounds, and discrimination between bands of frequencies.

The data compression algorithm for SDDS is ATRAC, which transforms audio from the time to the frequency domain, and employs a bit allocation procedure by assigning bits to spectral components within frequency bands in relation to their masking and auditory sensitivity thresholds. Dolby's AC-3 for Dolby Digital similarly transforms audio signals to the frequency domain. A floating-point conversion is performed on the spectral coefficients, which allows for several data reduction opportunities including masking/auditory sensitivity-driven bit allocation and coupling between channels.

You may think that one way to distinguish between the sonic performance for these formats would be to listen to the same film in DTS, SDDS and Dolby Digital and formulate judgments. But it's not that simple. The typical moviegoer would have to listen to the film in different auditoriums, on different sound systems. The variability between B-chains (equalization, amplification, speakers, room acoustics) would more than likely account for any perceived differences, rather than the formats themselves. Even if you had the ability to directly compare digital sound formats on the same audio system, you would need to know what kinds of coding artifacts to listen for and what types of audio signals are likely to tax the limits of certain codecs (such as transients and multitone signals). But most importantly, distinctions between formats cannot really be made by just comparing them with one another. You need to have access to the original uncompressed sound mix, as well as encoders for each of the formats, so that virtually every possible variable in making such comparisons can be controlled. [Widescreen Review is in the process of conducting such a controlled experiment comparing the sound quality of Dolby Digital AC-3 and DTS Coherent Acoustics encoding and decoding. - Editor]

Will digital audio data compression remain a necessity for digital cinema? The premiere public demonstrations of Star Wars: Episode I - The Phantom Menace utilized six channels of PCM audio at 24-bit/44.1kHz. For the approximately two-hour playing time, a total of 5.7GB (gigabytes) would be required. In comparison to the 360 or so GB required to store the video data, the need for audio data reduction would not be an issue. However, 360GB is considered too large for practical means of data delivery and storage. If the video data were to be compressed substantially to the 45 or so GB currently being discussed, some compression of audio might be justified, if a gain of a few GB could improve the image quality. Data compression for audio could also be a viable consideration if more than 5.1 channels were to be used in the future, soundtracks for multiple languages implemented, and the digital audio resolution and/or sampling rate increased to match evolving standards in professional sound recording. With the excellence of what today's digital sound can offer with restricted data rates, the future of digital audio compression for film should not be ruled out.

[...]ations, such as Meridian Lossless Packing (MLP) and Sony/Philips Direct Stream Digital (DSD).
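The 5.7GB storage figure for the digital cinema demonstration can be reproduced from the quoted audio format. The sketch below assumes exactly two hours of running time and decimal gigabytes (one billion bytes). The 360GB and 45GB video figures are the approximate ones given in the text.

```python
def pcm_storage_gb(channels, bits_per_sample, sample_rate_hz, seconds):
    """Storage for uncompressed PCM audio, in decimal gigabytes."""
    total_bits = channels * bits_per_sample * sample_rate_hz * seconds
    return total_bits / 8 / 1e9

audio_gb = pcm_storage_gb(channels=6, bits_per_sample=24,
                          sample_rate_hz=44_100, seconds=2 * 3600)
print(f"audio: {audio_gb:.1f} GB")  # 5.7 GB

# Audio as a share of the total for uncompressed and compressed video.
for video_gb in (360, 45):
    share = 100 * audio_gb / (audio_gb + video_gb)
    print(f"with {video_gb} GB of video, audio is {share:.1f}% of the total")
```

Against compressed video, the uncompressed audio becomes a much larger slice of the total, which is the point at which some audio data reduction might begin to pay off.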

References
1. Davis, Mark F. "The AC-3 Multichannel Coder," 95th AES Convention, preprint 3774.
2. Davis, Mark F. "The Big Squeeze: The Theory And Practice Of Dolby Digital," Audio, (July 1997).
3. Davis, Mark F., and Craig C. Todd. "AC-3: Operation, Bitstream Syntax, And Features," 97th AES Convention, preprint 3910.
4. Flemming, Howard. "Sony Dynamic Digital Sound: The Technology," Widescreen Review, Issue 7 (February/March 1994), 69-72.
5. Holman, Tomlinson. Sound For Film And Television. Boston: Focal Press, 1997.
6. Todd, C. C., Davidson, G. A., Davis, M. F., Fielder, L. D., Link, B. D., and S. Vernon. "AC-3: Flexible Perceptual Coding For Audio Transmission And Storage," 96th AES Convention, preprint 3796.
7. Tsutsui, K., Suzuki, H., Shimoyoshi, O., Sonohara, M., Akagiri, K., and R. M. Heddle. "ATRAC: Adaptive Transform Acoustic Coding For MiniDisc," 93rd AES Convention, preprint 3456.
8. Watkinson, John. The Art Of Digital Audio. 2nd ed. Oxford: Focal Press, 1994.
9. Weinberg, David J. "The Dolby Stereo Digital Film Sound System," Widescreen Review, Issue 8 (April/May 1994), 53-63.

Perry Sun is the Movie Sound Editor for Widescreen Review, and also the editor of eFilmNetwork.com. Perry can be contacted via e-mail at perry@widescreenreview.com.

Widescreen Review® Magazine
27645 Commerce Center Drive
Temecula, CA 92590
Phone: 951 676 4914 • Fax: 951 693 2960

Copyright © 1995 - 2005 www.WidescreenReview.com
All Rights Reserved