DTS Coherent Acoustics® The Future Of Audio Part Two: The Sonics Of Bit Rate Reduction
By Mike Smyth And Stephen Smyth


DTS Technology
This is the second article in an exclusive three-part series on the technology of DTS Coherent Acoustics®. DTS Coherent Acoustics is a variable low to high bit rate solution to the delivery of discrete 5.1 multichannel digital audio in consumer and professional applications such as LaserDisc, DVD, CD, DAT, Digital VCR and HD platforms. DTS Coherent Acoustics is a competing codec to that developed by Dolby Laboratories and marketed as AC-3®, which we covered extensively during its pre-introduction period last year and continue to cover on an ongoing basis. DTS Coherent Acoustics had its first trade and consumer introduction at the Stereophile High End Hi-Fi '95 Consumer Show in Los Angeles, followed by a showing at the June 1995 CES High End Specialty and Home Theatre Show. As has been the norm, Widescreen Review presents this informative series of technology articles so that our readers can be well informed on the critical standards issues impacting the future of audio. - Gary Reber, Editor

Introduction
In the first article linear PCM was introduced as the standard method of representing high quality digital audio. In comparison to the analog signal the digital signal requires considerably more bandwidth for reproduction or transmission. It is generally felt that the current consumer PCM standard (16-bit word length and 44.1 kHz sampling rate) still does not match the quality of an analog system, and there is a desire to increase both the word length (to 20-bit) and sampling rate (to 96 kHz) in order to achieve "real HiFi" quality. The higher bit rates required will further aggravate the problem of economically delivering this data to the consumer, especially in multichannel formats.

The purpose behind the development of DTS Coherent Acoustics® was to enable mastering quality digital audio to be delivered to the home on existing and proposed new media platforms. In addition it will be suitable for discrete multichannel formats, while simultaneously having extensions to the new proposed higher quality digital audio standards.

Within the DTS Coherent Acoustics framework is a digital audio compression methodology which operates directly on the linear PCM data in order to reduce the bit rate, without affecting the fidelity of the audio signal itself. The bit rate reduction is what allows, for example, multiple channels of higher quality audio to be delivered on a standard CD.

Introduction To Bit Rate Reduction
One of the conclusions drawn from an examination of the method used to digitize audio signals was that linear PCM requires a very high bit rate to achieve a quality comparable to the original analog signal. The reason, as explained in the last article, is that the PCM process is relatively simple, and cannot take into account either the characteristics of the audio signal itself or the way in which it is perceived by humans. In other words PCM is suitable for any sort of digital signal, audio or not, and makes no assumptions about how the signal is received or heard, or by whom.

The a priori reasoning behind digital audio compression is that, for the same fidelity, a lower data rate can be achieved if the PCM signal is re-quantized (coded) so as to take into account the sonic nature of the signal, and the fact that its quality is assessed by humans listening to it.

Redundant Information In Audio Signals
The sonic nature of the signal refers to inherent characteristics of audio signals such as music or speech. If these features are indeed inherent in all audio signals then it should not be necessary to explicitly convey this information during reproduction or transmission.

A simple example of such a characteristic is the tendency for audio signals to predominate at the lower frequencies. If it was known for example that for all audio signals the high frequency components above 10kHz were always less than half the magnitude of the lower frequency components below 10kHz, a simplistic coder could exploit this inherent characteristic of the audio signal by splitting the signal into two frequency bands and digitizing the higher frequencies at exactly half the resolution of the lower frequencies. In other words the coding system, by taking into account a characteristic of the signal, could reduce the bit-rate needed to reproduce the signal, and yet retain the same fidelity.

The term "objective redundancy" is used to refer collectively to all of these inherent characteristics within audio signals. The very name implies that any data used to convey these characteristics is not actually required and may be discarded, without losing any information whatsoever. The ability to recognize and extract redundant data from within the digital audio signal will therefore lead to a reduction in the bit-rate needed to reproduce the signal.

Irrelevant Information In Audio Signals
Since recorded audio signals are intended to be listened to, mostly by humans, this can also be exploited in order to reduce the bit-rate. If it can be shown that it is not humanly possible to perceive certain audio signals or components within an audio signal, then it should not be necessary to convey these signals, or components of them, and the same perceived fidelity would still be retained.

For example, if it was known that humans could not hear any frequency above 16kHz, then it would be pointless to transmit any frequency components above this. The signal would have been tailored to match the auditory characteristics of the human ear without any audible loss.

The term "perceptual irrelevancy" is commonly used to refer to those signals or signal components which cannot be heard by humans, and which may therefore be discarded with no audible effect. In contrast with redundant data, the removal of perceptually irrelevant data leads to a modification of the original signal, and once modified the signal can never be exactly recovered.

Dynamic Aspects Of Redundancy And Irrelevancy
In both of the simple examples referred to above the implication was that the redundant and irrelevant information was static and did not change with time. Unfortunately this is not the case and both redundancy and irrelevancy are very dynamic, changing dramatically from moment to moment. Much of the computational burden involved in digital audio compression is in simply identifying the rapidly changing redundant and irrelevant data.

Exploiting Redundancy And Irrelevancy
A reduction in bit-rate can be achieved by removing to any extent the objectively redundant and perceptually (or subjectively) irrelevant data from a digital audio signal. The only information transmitted would then be non-redundant and perceptually important, and the final bit-rate will be determined by the degree to which the redundant and irrelevant data has been identified and removed from the PCM data stream.

In typical digital audio signals the redundant element of the data can approach 10 bits of the original 16-bit samples. By removing just this redundant part of the data a compression ratio of approximately 2.5:1 could be achieved.

In contrast, early theoretical studies at AT&T Bell Laboratories in the late 1980s indicated that on average only approximately 2.2 bits per sample were required in order to code the perceptually relevant parts of most signals. Given an original signal of 16-bit linear PCM this implies a theoretical compression ratio of about 7:1.
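The two ratios quoted above follow from simple arithmetic on the word length. The short sketch below reproduces them; the function name is illustrative only, and the redundancy case uses the upper limit of 10 redundant bits, which is why it comes out slightly above the quoted 2.5:1.

```python
PCM_BITS = 16  # word length of the original linear PCM samples


def compression_ratio(bits_kept: float, original_bits: float = PCM_BITS) -> float:
    """Ratio of the original word length to the word length actually coded."""
    return original_bits / bits_kept


# Redundancy only: up to about 10 of the 16 bits are redundant, so about 6 remain.
lossless_ratio = compression_ratio(PCM_BITS - 10)

# Perceptual coding: about 2.2 bits per sample carry the relevant information.
perceptual_ratio = compression_ratio(2.2)

print(f"redundancy removal:  {lossless_ratio:.1f}:1")
print(f"irrelevancy removal: {perceptual_ratio:.1f}:1")
```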

A cursory look at these figures suggests that the removal of irrelevancy should have top priority since it is more likely to achieve a greater reduction in bit-rate. However, since the removal of irrelevant data is destructive, a well designed compression scheme should seek to remove redundancy first (which is not destructive) and only remove irrelevant data as a secondary operation.

Lossless And Lossy Audio Compression
An audio compression scheme that only removes redundancy from the signal is referred to as "lossless," since no real information in the signal is actually discarded and the signal can be reconstructed exactly during playback. The philosophical appeal of lossless compression is very strong to audiophiles, and the concept has begun to command more attention as new data delivery media are developed which may be more suitable for this form of audio compression. A brief explanation of lossless compression is given next in this article along with some of the practical delivery problems that its use entails.

Most commercial digital audio compression algorithms today, such as are used in DCC and MiniDisc, try to remove primarily irrelevancy, and are referred to as "lossy," since information that was in the original signal has been discarded.

The coding framework of DTS Coherent Acoustics utilizes both lossless and lossy compression techniques, and can operate in either mode. Due to the greater coding gains that can be achieved by lossy compression compared to lossless, the focus of most of the current article is on the techniques utilized by lossy compression algorithms.

Lossless Compression
Many people will already be familiar with lossless compression systems for non-audio digital data, particularly those schemes which operate primarily on digital text data. These systems have been used for increasing the throughput on modems (e.g. PKZIP) and more recently for increasing the storage capacity of hard drives on PCs (e.g. DoubleSpace). The algorithms used in these systems are able to increase throughput or storage capacity by as much as a factor of four by analyzing characteristics inherent in the text data, and at first glance would seem applicable for use with digital audio data.

However as mentioned earlier, a lossless compression algorithm must be able to exploit the inherent characteristics of the digital data in order to remove redundancy and hence compress the data. A compression algorithm that is trying to identify textual redundancies will probably not be suitable for identifying sonic redundancies, and hence the degree of digital audio compression achieved by using PKZIP is likely to be small.

Textual Data
Lossless audio compression techniques can be most easily illustrated by considering the compression of text-based data. The standard ASCII method of encoding text is to assign an 8-bit number to the letters of the alphabet, including all the punctuation symbols etc. This process is hence similar to the encoding of an audio signal using 16-bit linear PCM. Lossless compression is possible because in normal English each letter does not have the same probability of being used. The letter "E" is used more frequently than "Z." By analyzing large amounts of text it is possible to rank all of the ASCII characters in order of frequency of use. This ranking is then used to encode the characters using a different algorithm which assigns fewer bits to the most frequently used characters, and more bits to those characters used less often. This means that in a normal piece of English text the total number of bits used to encode the sentences will be less than that used by the ASCII code, but with no information being lost.
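The idea of giving fewer bits to the more frequent characters can be sketched with Huffman coding, one well-known way of building such a variable-length code. The article does not say which algorithm the commercial text compressors use, so Huffman coding is an assumption chosen for illustration, and the sample sentence and function names are hypothetical.

```python
import heapq
from collections import Counter


def build_code(text: str) -> dict[str, str]:
    """Build a Huffman code: frequent characters receive shorter bit strings."""
    counts = Counter(text)
    if len(counts) == 1:  # degenerate case: only one distinct character
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {char: code_so_far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)  # two least frequent groups
        w2, _, b = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in a.items()}
        merged.update({ch: "1" + code for ch, code in b.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text: str, code: dict[str, str]) -> str:
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: dict[str, str]) -> str:
    lookup = {bit_string: ch for ch, bit_string in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:  # prefix-free code: first match is the symbol
            out.append(lookup[current])
            current = ""
    return "".join(out)


text = "the letter e is used more frequently than the letter z"
code = build_code(text)
bits = encode(text, code)
print(len(text) * 8, "bits as 8-bit characters,", len(bits), "bits after coding")
```

Decoding the bit string returns the original sentence exactly, which is what makes the scheme lossless. In this sketch the code table is built from the message itself; a practical system would also have to store or agree on that table.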

Lossless Problems
The main problem with using lossless schemes will be very obvious to most PC users who have tried to compress graphical and textual data. The amount of compression varies according to the input data. Some text files compress easily, others compress very little if at all. Compressing digital audio files using PKZIP for example gives very poor results.

If the compressed data has to be subsequently transmitted in a fixed bandwidth then the original data going in to the compression algorithm has to slow down and speed up according to the degree of compression being achieved. Transmitting a fax illustrates this perfectly for text and graphical based data. The page feeds into the fax at a rate that is dependent on the amount of compression currently possible. Blank pages are transmitted quickly, pictures or text are transmitted more slowly.

The same problems exist for lossless compression of digital audio data. As explained earlier, lossless compression is achieved whenever redundant information in the PCM data is removed prior to storage or transmission of the digital audio. The degree of redundancy, however, is related to certain time-varying characteristics inherent in the original audio signal, e.g. a non-flat spectrum. Since these characteristics will change dramatically for different types of sounds, the amount of redundant information will also change for different sounds. Therefore, for typical audio passages which contain a wide variation of sounds over time, lossless compression of the original PCM data will result in a "variable rate" digital audio data stream, i.e. the amount of data per unit time will vary.

Pure sine waves and white noise are two common audio signals which cause a lossless coder to operate at either extreme of its variable rate capacity. Sinusoidal signals are highly redundant and can be transmitted or reproduced at very low bit rates compared to the original PCM. On the other hand full scale white noise, by definition, contains no redundant information. Any attempt to compress this signal results in a bit rate that simply equals that of the original PCM. In general, music and audio signals contain degrees of redundant information which fall between these limits, and average compression ratios of approximately 2:1 are commonly achieved when operating on 16-bit linear PCM.
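The two extremes can be observed with any lossless compressor. The sketch below uses Python's general-purpose zlib module as a stand-in, which is an assumption made for convenience: it is not an audio coder, and the sine wave here is chosen so that its period repeats exactly, which flatters the result. The test signals and helper names are illustrative.

```python
import math
import random
import struct
import zlib


def pcm16_bytes(samples):
    """Pack integer samples as little-endian 16-bit linear PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)


def ratio(samples):
    """Compression ratio achieved by zlib on the packed PCM data."""
    raw = pcm16_bytes(samples)
    return len(raw) / len(zlib.compress(raw, 9))


# One second of a sine whose period is exactly 100 samples (441 Hz at 44.1 kHz).
period = [round(32767 * math.sin(2 * math.pi * k / 100)) for k in range(100)]
sine = period * 441

# One second of full-scale white noise (uniformly distributed 16-bit samples).
rng = random.Random(0)
noise = [rng.randint(-32768, 32767) for _ in range(44100)]

sine_ratio = ratio(sine)
noise_ratio = ratio(noise)
print(f"sine:  {sine_ratio:.1f}:1")
print(f"noise: {noise_ratio:.2f}:1")
```

The sine compresses many times over while the noise stays at essentially 1:1, so a coder fed a mixture of such material would produce a data rate that swings between these limits.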

For most real-time audio applications such as telephony, broadcasting or CD playback this variable rate process is not practical. Currently these applications all operate on fixed bandwidths, and therefore cannot be used economically with lossless compression schemes. However, variable rate digital media, such as the new Digital Video Discs (Toshiba's DVD and Sony/Philips MMCD) or the ATM (Asynchronous Transfer Mode) telephony network, are much more favorable to the use of lossless compression techniques, and could afford the advocates of lossless compression the first opportunity to introduce it to the consumer.



Lossy Compression
As noted above lossy digital audio compression algorithms attempt to remove irrelevant data from PCM audio signals.

Which data is irrelevant can be determined explicitly by using an accurate perceptual auditory model, which in turn requires a thorough knowledge of how humans hear. This model should be able to highlight those parts of the original audio signal that are perceptually relevant and irrelevant.

Although the explicit use of a human auditory perceptual model to drive the irrelevancy extraction processes in digital audio compression is relatively recent, the same irrelevancy extraction processes have been utilized implicitly in speech and audio coding fields for the past 30 years. Techniques such as delta modulation, adaptive PCM (APCM), differential PCM (DPCM) and adaptive noise shaping are just a few traditional coding techniques which all rely on the removal of irrelevant information for their success.

Perceptual Modeling
1. Noise Mask Threshold

One of the principal methods used to extract irrelevancy and hence reduce the bit-rate of digital audio applies a "masking threshold" model to human hearing. In simple terms this model assumes that some sounds are masked by others, and cannot be heard. For example, conversation is found to be more difficult in a noisy environment. The noise tends to "mask out" parts of the speech, so that they cannot be heard. In other words, the human ear has a threshold of hearing that is dependent on the signal applied to it.

On more detailed analysis the ability of some sounds to mask others is found to be dependent on amplitude, frequency and the purity of the tones within the sound. Loud sounds are more effective at masking than quiet sounds, and the effectiveness of the masking is reduced as the frequency difference between the masking sound and the masked sound increases. Also, the ability of tones to mask noise is less than that of noise masking tones. Figure 1 illustrates the noise mask threshold calculated for a 1kHz tone.

By examining an audio signal closely in the frequency domain it is possible to work out which parts of the signal lie below the masking threshold and therefore cannot be heard by humans. These parts are deemed irrelevant and can be discarded without audibly affecting the signal. This is the fundamental basis of lossy perceptual coding.

Another way of describing the masking threshold is in terms of a noise threshold. Signals that are below the threshold cannot be heard because they lie below the noise floor. Signals above the masking threshold need only be quantized to the accuracy that the noise floor dictates. Quantizing the signal more accurately would not audibly increase the quality of the signal.

The masking threshold is thus used to remove irrelevant information from the audio signal, and also determines the accuracy with which the remaining parts of the signal are quantized. In general it is found that the remaining audible parts of the signal can be quantized to considerably less resolution than 16-bit linear PCM.
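The trade between word length and noise floor can be checked numerically. The sketch below quantizes a near-full-scale sine at two word lengths and measures the resulting signal-to-noise ratio; the test tone, the sample rate and the function names are assumptions made for illustration, and a plain uniform quantizer is used rather than anything specific to a particular coder.

```python
import math


def quantize(x: float, bits: int) -> float:
    """Uniformly quantize x in [-1, 1) to the given word length."""
    levels = 2 ** (bits - 1)
    return round(x * levels) / levels


def snr_db(bits: int, n: int = 44100) -> float:
    """Signal-to-quantization-noise ratio for a near-full-scale 997 Hz sine."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = 0.99 * math.sin(2 * math.pi * 997 * i / 44100)
        err = quantize(x, bits) - x
        signal_power += x * x
        noise_power += err * err
    return 10 * math.log10(signal_power / noise_power)


snr16 = snr_db(16)
snr8 = snr_db(8)
print(f"16-bit: {snr16:.1f} dB, 8-bit: {snr8:.1f} dB")
```

Each bit removed raises the noise floor by roughly 6 dB, so halving the word length costs close to 48 dB. Whether that is audible depends entirely on where the masking threshold lies.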

In practice the audio signal is first analyzed in terms of frequency using a high resolution time-to-frequency linear transform operating on a windowed block of time-domain digital audio samples. The linear frequency coefficients are then further transformed into linear "bark" coefficients which more closely approximate the way in which humans perceive frequencies, and which resemble a standard 1/3 octave scale. The masking ability of each of the bark coefficients is then calculated and, by summing over each of these individual masks, the overall frequency-dependent mask threshold is built up for the block of audio data. Figure 2 illustrates the noise mask threshold for a complex multi-tone audio signal.

This mask threshold now becomes the new quantization noise floor for the frequency transformed digital audio data. Any frequencies in the signal which fall below the threshold are discarded, and the remaining frequency coefficients are re-quantized to the accuracy demanded by the mask threshold at that point on the curve. The next block of time-domain data is then frequency transformed and a new mask threshold calculated for this block. The mask threshold is therefore dynamic and changes in response to the input signal.
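The frequency-to-bark mapping described above can be sketched as follows. The article gives no formula, so the widely used Zwicker and Terhardt approximation is assumed here, and the grouping of transform bins into whole-numbered bark bands is an illustrative simplification rather than the method of any particular coder.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Zwicker-Terhardt approximation of the Bark scale (assumed formula)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def group_bins_by_bark(n_bins: int, sample_rate: float) -> dict[int, list[int]]:
    """Map each linear-frequency bin of a one-sided spectrum to a Bark band."""
    bands: dict[int, list[int]] = {}
    for k in range(n_bins):
        f = k * (sample_rate / 2) / (n_bins - 1)  # bin centre frequency in Hz
        bands.setdefault(int(hz_to_bark(f)), []).append(k)
    return bands


# 513 bins would correspond to a 1024-point transform at 44.1 kHz.
bands = group_bins_by_bark(n_bins=513, sample_rate=44100)
for band, bins in sorted(bands.items()):
    print(f"bark band {band:2d}: {len(bins):3d} bins")
```

The low bands contain only a few bins each while the top bands contain well over a hundred, which reflects the ear's much coarser frequency resolution at high frequencies.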

2. Absolute Threshold Of Hearing

The sensitivity of the ear to sounds varies with frequency, being most sensitive around 3kHz and less sensitive at lower and higher frequencies. The absolute threshold of hearing at any particular frequency will vary between individuals, but the average value is well documented and can be used as a fixed lower limit in the calculation of the noise masking threshold described above. This must be used with some caution however since the strict use of absolute values for noise masking thresholds would mean that the playback level of the audio signal would also need to be fixed. This would present a problem in consumer systems where the playback level is highly variable.
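The average threshold curve is often approximated by a closed-form expression. The sketch below uses Terhardt's approximation, which is an assumption brought in for illustration and is not taken from this article; it returns the threshold in dB SPL, so it presumes a known playback level, which is exactly the caution raised above.

```python
import math


def threshold_in_quiet_db_spl(f_hz: float) -> float:
    """Terhardt's approximation of the absolute threshold of hearing (assumed)."""
    f = f_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


freqs = range(100, 16000, 100)
most_sensitive = min(freqs, key=threshold_in_quiet_db_spl)

for f in (100, 1000, most_sensitive, 10000, 15000):
    print(f"{f:6d} Hz: {threshold_in_quiet_db_spl(f):6.1f} dB SPL")
```

The curve dips to its minimum a little above 3kHz and rises steeply toward both ends of the audio band, in line with the description above.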

3. Temporal Masking

The thresholds referred to above have dealt with signals that are coincident in time but which vary in frequency. Temporal masking refers to the ability of one signal to mask another which occurs either before or after it in time. In general post-masking is much more effective than pre-masking. For sharp transient signals the post-masking effect can last for up to 500 milliseconds, during which time the noise threshold at all frequencies can be increased with little or no audible effect.

4. Interchannel Masking

When more than one audio channel is heard at one time through multiple loudspeakers (e.g. two-channel stereo) it is possible that sounds reproduced from one loudspeaker will mask sounds coming from another. The process is very similar to noise masking within a single channel, and is referred to as interchannel masking.

The main difficulty in attempting to exploit this irrelevancy is that it requires that the playback environment be known beforehand and that it does not change. For example if only one channel were played in isolation the masking effect of the other channels would not be present, and noise in the single channel would become audible. The same problem would also occur if the listener were not equidistant between the speakers.

As a result of the difficulties in controlling the playback environment, interchannel masking is of little practical interest.

5. Localization

The high degree of correlation or similarity between the left and right channels of a two-channel stereo signal can also be used to reduce the overall bit rate of a stereo signal. It has been found that the perception of a stereo image, or the ability to localize a phantom sound, is strongly dependent on frequencies below around 2kHz, and only weakly dependent on higher frequencies.

This implies that for stereo imaging (using two or more audio channels) it may not be necessary to accurately reproduce the higher frequencies in all the channels in order to maintain an acceptable level of localization. In other words, the high frequencies in all the channels may not be particularly relevant to our ability to perceive stereo images, and could be removed.

However most of the recent research into this aspect of perceptual irrelevancy has been targeted at finding "acceptable" levels of imaging performance. It is not yet known to what degree these high frequencies may be removed without producing any audible effect whatsoever. As a result audiophile quality coding, such as is possible with DTS Coherent Acoustics, should not exploit this irrelevancy.

Removal Of Perceptual Irrelevancy
Since the perceptual models described above identify irrelevant information in the frequency domain, it is more efficient to transform the time-domain linear PCM audio data into the frequency-domain in order to remove the irrelevant data. Once the irrelevancy has been removed the relevant data is transformed back to the time-domain for playback.

Re-quantization
Recalling the noise mask threshold model, frequency components that lie below the threshold cannot be heard and may be removed, while the threshold also determines the level of quantization noise of the remaining components.

When the time-domain PCM samples are transformed to the frequency-domain, the noise floor in the frequency domain is at the 16-bit level (i.e. approximately 96 dB below full scale).

The re-quantization is essentially a mapping process that converts the original 16-bit resolution frequency coefficients to new values that have a much lower resolution and hence higher noise floor. As explained in Part One in Issue 14, the noise floor is dependent on the length of the PCM word which represents the amplitude of the frequency coefficient. Since the new perceptually inaudible noise levels are known (from the noise mask threshold), the new "bit allocation" for each frequency coefficient is determined directly from these noise levels. In practice, the 16-bit coefficients are usually re-quantized to between 1 and 10 bits per coefficient in order to comply with the perceptual mask thresholds. Figure 3 shows the increase in the noise floor following re-quantization in the frequency domain. The noise mask threshold is also shown, illustrating that the new noise floor is still well below the level of perception.
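A bit allocation of this kind can be sketched with the familiar rule of roughly 6 dB of noise floor per bit. The example below is a simplification under stated assumptions: each coefficient is taken to be scaled to its own level before quantization, the levels and thresholds are invented values in dB relative to full scale, and the function names are hypothetical.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB of noise floor per bit


def noise_floor_db(bits: int) -> float:
    """Approximate quantization noise floor relative to full scale."""
    return -DB_PER_BIT * bits


def bits_for_coefficient(level_db: float, mask_db: float) -> int:
    """Bits needed to keep quantization noise at or below the mask threshold.

    Coefficients at or below the threshold are inaudible and get no bits.
    """
    if level_db <= mask_db:
        return 0
    return math.ceil((level_db - mask_db) / DB_PER_BIT)


# Invented coefficient levels and mask thresholds (dB relative to full scale).
levels_db = [-6.0, -20.0, -45.0, -70.0]
mask_db = [-40.0, -38.0, -50.0, -60.0]

allocation = [bits_for_coefficient(l, m) for l, m in zip(levels_db, mask_db)]
print("16-bit noise floor:", round(noise_floor_db(16), 1), "dB")
print("bit allocation:", allocation)
```

With these values the four coefficients receive 6, 3, 1 and 0 bits in place of 16 each, and the last is discarded outright because it lies below the threshold.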

In real-time systems some simplifications are necessary since the calculations required to determine the mask threshold, whilst being quite simple, are computationally intensive. Essentially the mask threshold is only calculated at certain perceptually critical frequencies of the spectrum and the threshold value applied to a larger group of frequency coefficients clustered around this value.

In practical systems the dynamic noise mask threshold uses, as a lower limit, the absolute threshold of hearing.

The removal of temporal irrelevancy involves lowering the quantization resolution (thereby raising the noise floor) after the transient has occurred. The resolution is allowed to increase gradually as the temporal masking effect diminishes with time.

Joint-Stereo Coding
A "joint stereo" compression scheme can exploit stereo imaging, or localization irrelevancy, by maintaining the independence of the left and right channels at the lower frequencies, and averaging the left plus right signal at the higher frequencies. Theoretically this reduces the data rate by almost 50 percent because only one set of mid-to-high frequency coefficients is retained rather than independent left and right coefficients. If more than two channels are used in playback (e.g. 5.1 channel surround) the reduction in the data rate may be even more substantial.

This scheme can be improved by including an independent intensity envelope for each audio channel that describes the original mid-to-high frequency amplitude profile. This is used to modify the single joint-stereo signal for each channel so as to retain a semblance of the original spectrum.
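The joint-stereo idea, including the intensity envelope, can be sketched on a handful of frequency coefficients. This is a hypothetical illustration and not the method of any particular coder: the coefficient values and function names are invented, and the envelope is reduced to a single energy-matching scale factor per channel rather than a full amplitude profile.

```python
import math


def joint_stereo_encode(left, right, join):
    """Keep L and R separate below index `join`; above it keep one averaged
    signal plus one intensity scale factor per channel."""
    joint = [(l + r) / 2 for l, r in zip(left[join:], right[join:])]

    def scale(channel):
        # Scale factor that restores the channel's original high-band energy.
        e_joint = sum(c * c for c in joint)
        e_chan = sum(c * c for c in channel[join:])
        return math.sqrt(e_chan / e_joint) if e_joint > 0 else 0.0

    return {
        "left_low": left[:join],
        "right_low": right[:join],
        "joint_high": joint,
        "scale": (scale(left), scale(right)),
    }


def joint_stereo_decode(enc):
    """Rebuild both channels; the high band differs only by the scale factors."""
    scale_left, scale_right = enc["scale"]
    left = enc["left_low"] + [scale_left * c for c in enc["joint_high"]]
    right = enc["right_low"] + [scale_right * c for c in enc["joint_high"]]
    return left, right


left = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1]
right = [0.9, 0.7, 0.5, 0.2, 0.1, 0.05]
enc = joint_stereo_encode(left, right, join=3)
dec_left, dec_right = joint_stereo_decode(enc)

sent = len(enc["left_low"]) + len(enc["right_low"]) + len(enc["joint_high"])
print("coefficients sent:", sent, "instead of", len(left) + len(right))
```

Only the level of the shared high band differs between the decoded channels, so any phase or timing difference between left and right above the join frequency is lost.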

It should be noted that this scheme removes any unique phase information in the channels thus jointly coded, making them incoherent at the point where the averaging begins. In real systems the frequency at which the averaging (joining) begins is normally signal dependent, but for any significant compression benefit to be realized averaging usually begins between 3kHz and 11kHz.

In passing it should also be noted that if there is little or no correlation between the left and right channels, or if the two channels have been derived from a 4:2:4 matrix surround encoder, the use of joint-stereo coding can produce highly audible artifacts. The loss of phase information of the signal above the 'join' will cause steering problems for any matrixed surround stereo signal.

Conclusion
In the first article the need to improve the performance of linear PCM digital audio systems was discussed. Current proposals advocate an increase in the sampling rate to approximately 96kHz, and an increase in sample resolution from 16 to 20 bits. Furthermore there is also a growing demand for multi-channel audio playback, and to maintain the highest quality these channels would need to be discrete.

One drawback of both of these proposals is that the bit-rate required could increase by almost a factor of eight, and that current (or even proposed) digital audio playback systems are not capable of delivering these high data rates economically.
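The size of the increase can be estimated from the PCM bit rate, which is simply sampling rate times word length times channel count. The sketch below counts 5.1 as six full-bandwidth channels, which is an assumption made for simplicity since the low-frequency channel needs far less; the function name is illustrative.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bit rate of uncompressed linear PCM in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


cd_rate = pcm_bit_rate(44_100, 16, 2)        # current two-channel CD standard
proposed_rate = pcm_bit_rate(96_000, 20, 6)  # 96 kHz, 20-bit, six channels
increase = proposed_rate / cd_rate

print(f"CD:       {cd_rate / 1e6:.2f} Mbit/s")
print(f"proposed: {proposed_rate / 1e6:.2f} Mbit/s")
print(f"increase: {increase:.1f} times")
```

Counting only the five main channels gives a factor of a little under seven, so the precise figure depends on how the low-frequency channel is treated.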

A solution to this data-rate problem involves the use of bit-rate reduction techniques which remove redundant and irrelevant data from the original PCM audio signal. The removal of redundancy does not necessarily cause a loss of information, but the amount of reduction is limited. Furthermore lossless compression schemes intrinsically operate at a variable rate, which on its own limits their current applicability.

Irrelevant data can be removed by lossy compression schemes. Recent techniques employ a psychoacoustic model of the ear to explicitly calculate the level of inaudible noise that can be injected into the original signal. In multichannel systems irrelevant localization data can also be removed to further reduce the bit-rate. The removal of irrelevant data produces substantial reductions in the data-rate, but unfortunately these are destructive processes in that the irrelevant information is permanently lost.

DTS Coherent Acoustics has been designed to solve this data-rate problem, by deploying a combination of lossless and lossy audio compression algorithms to ensure perceptually transparent operation. The next article focuses on the computational mechanisms which must be employed by Coherent Acoustics to remove the redundant and irrelevant data.


Mike Smyth and Stephen Smyth are principals in AlgoRhythmic Technology. Stephen Smyth is the designer of the DTS algorithm both for the DTS theatrical system and the distinctly different DTS Coherent Acoustics consumer/professional system. DTS Technology is a joint venture between Digital Theater Systems, AlgoRhythmic Technology, Steven Spielberg and Universal/MCA.

Widescreen Review® Magazine
27645 Commerce Center Drive
Temecula, CA 92590
Phone: 951 676 4914 • Fax: 951 693 2960

Copyright © 1995 - 2005 www.WidescreenReview.com
All Rights Reserved