Source: https://www.embedded.com/print/4015932

Understand the fundamentals of speech algorithms in an embedded system
Nitin Jain, MindTree Consulting - February 06, 2006

An enormous number of algorithms are in use today in various electronic systems. Integrating and evaluating a DSP algorithm within a system is tricky enough to bring programmers to their knees. To try to simplify this complex technology, let's first start with a relatively simple example.
The audio frequency spectrum, which stretches to 40 kHz, is divided into two bands. The speech components occupy the lower part of the spectrum, from 5 Hz to 7 kHz, while the audio components occupy the remaining upper portion (Fig. 1).

Speech processing mainly involves compression/decompression, recognition, conditioning, and enhancement algorithms. Signal-processing algorithms depend on system resources such as available memory and clock cycles. Because these resources relate directly to system cost, they're often at a premium.

Measuring an algorithm's complexity is the first step in analyzing it. This includes looking at the clock cycles required and determining the algorithm's processing load, which can vary with the processor employed. The memory requirements, however, do not change with the processor.

Most DSP algorithms work on collections of samples, better known as frames (Fig. 1). This introduces an inevitable delay due to frame collection that’s in addition to the processing delay. Note that the International Telecommunication Union (ITU) standardizes the acceptable delay for an algorithm.

Fig. 1. Looking at the audio spectrum, basic telephone-quality speech occurs up to 4 kHz. High-quality speech reaches 7 kHz, followed by CD-quality audio.

An algorithm's processing load is typically represented in millions of clocks per second (MCPS), which is the number of clocks/s from the core that an algorithm would need. Assume an algorithm that processes a frame of 64 samples at 8 kHz and requires 300,000 clocks to process each frame. The time required to collect the frame would be 64/8000, or 8 ms; in other words, 125 frames could be collected in 1 second. To process all the frames, the algorithm would consume 300,000 × 125 = 37,500,000 clocks/s, represented as 37.5 MCPS. Simplifying, the MCPS equation is:

MCPS = (clocks required to execute one frame × sampling frequency / frame size) / 1,000,000
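A quick way to sanity-check this arithmetic is to code the formula directly. The sketch below uses the numbers from the worked example above and is illustrative only; for a real algorithm, the clocks-per-frame figure would come from profiling on the target.

#include <stdio.h>

/* MCPS = (clocks per frame * sampling frequency / frame size) / 1,000,000 */
static double mcps(double clocks_per_frame, double sample_rate_hz,
                   double frame_size_samples)
{
    return (clocks_per_frame * sample_rate_hz / frame_size_samples) / 1e6;
}

int main(void)
{
    /* 300,000 clocks per 64-sample frame at 8 kHz -> 37.5 MCPS */
    printf("MCPS = %.1f\n", mcps(300000.0, 8000.0, 64.0));
    return 0;
}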

Note that there’s another common term used for measuring an algorithm’s processing load—MIPS (million instructions/s). The calculation of MIPS for an algorithm can be tricky. If the processor effectively executes one instruction in one cycle, the MCPS and MIPS ratings for that processor are the same. Analog Devices’ BlackFin is one such processor. Otherwise, if the processor takes more than one cycle to execute an instruction, a ration exists between the MCPS and MIPS ratings. For example, ARM7TDMI processor effectively requires 1.9 cycles/instruction.

The memory considerations for any algorithm are typically separated into code (read-only) and data (read-write) memory. The required amounts can be found by compiling the source code. Note that algorithms perform at their best when using the fastest memory, which is usually memory internal to the core.

Before integration
The time to start integrating and evaluating any speech algorithm on an embedded system is when the system is in a predictable or stable state. "Stable" means that the audio front-end's interrupt structure is consistent; in other words, no bytes are lost and a decent amplitude level is maintained. It's also best to have all the statistics of the system's memory and clock available.

Integrating an algorithm on an existing system is somewhat easier. If the system is in the development phase, it's recommended to test the audio front-end thoroughly before integrating or evaluating any algorithms. Within the system, you must verify that no interrupts conflict with one another. If such an issue were to exist, debugging can be a painful experience.

In a system that will incorporate audio/speech algorithms, robust audio firmware is a must. It must give the algorithms the maximum time and accurate data to perform efficiently. One common mistake is to interrupt the core upon each sample's arrival. If the algorithm operates only on frames of a fixed number of samples, the per-sample interrupts are redundant. DMAs and internal FIFOs can collect samples and interrupt the core only after collecting a complete frame.
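The sketch below illustrates this frame-based collection scheme. The ping-pong buffer layout and the names (FRAME_SIZE, dma_buffer, dma_frame_complete_isr) are assumptions for illustration; a real driver would program the part's DMA controller registers.

#include <stdint.h>

#define FRAME_SIZE 64

static volatile int16_t dma_buffer[2][FRAME_SIZE];  /* ping-pong frame buffers */
static volatile int     ready_half = -1;            /* which half holds a full frame */

/* Called by the DMA controller once per completed frame, not once per sample. */
void dma_frame_complete_isr(int completed_half)
{
    ready_half = completed_half;    /* hand the full frame to the main loop */
}

/* Main-loop side: run the algorithm only when a complete frame is available. */
void poll_and_process(void (*process_frame)(const int16_t *samples, int count))
{
    if (ready_half >= 0) {
        int half = ready_half;
        ready_half = -1;
        process_frame((const int16_t *)dma_buffer[half], FRAME_SIZE);
    }
}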

Example algorithms
When developing any telecommunications system, engineers often begin by testing voice quality with the typical pulse-code modulation (PCM) codec, known as G.711. This narrowband codec restricts the sample amplitude to 8-bit precision and produces a 64-kbit/s throughput. The encoder and decoder work on each data sample using a lightweight algorithm with little complexity and almost no processing delay. This gives designers the option to play with it and verify the system.
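To give a feel for how light the per-sample work is, here is a hedged sketch of mu-law companding, the idea behind G.711's encoder. It uses the continuous companding formula rather than the piecewise-linear segment encoding of the actual standard, so it illustrates the principle and the per-sample cost, not bit-exact G.711.

#include <math.h>
#include <stdint.h>

#define MU 255.0

/* Compress one 16-bit linear sample to an 8-bit code (continuous mu-law curve). */
uint8_t mulaw_compress(int16_t pcm)
{
    double x = (double)pcm / 32768.0;                   /* normalize to [-1, 1) */
    double y = copysign(log1p(MU * fabs(x)) / log1p(MU), x);
    return (uint8_t)lrint((y + 1.0) * 127.5);           /* map [-1, 1] to 0..255 */
}

/* Expand an 8-bit code back to a 16-bit linear sample. */
int16_t mulaw_expand(uint8_t code)
{
    double y = (double)code / 127.5 - 1.0;
    double x = copysign((pow(1.0 + MU, fabs(y)) - 1.0) / MU, y);
    return (int16_t)lrint(x * 32767.0);
}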

Checking the signal levels, adjusting the hardware codec gains, synchronizing the near- and far-end interrupts, verifying the DMA function, or any other experiment can be accomplished using this basic telephony standard. During this process, don't be surprised to find that the received compressed data arrives in bit-reversed form; a simple "bit-reversing" routine will bring it back to the expected state.

Any wideband speech codec could serve as an example of a speech algorithm that's heavy in terms of memory and clock consumption. One example is sub-band ADPCM (adaptive differential PCM), or G.722, which operates on data sampled at 16 kHz and thus covers the entire speech spectrum. It retains the unvoiced frequency components, those between 4 and 7 kHz, that give speech its high-quality, natural character.
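Here is a minimal sketch of the kind of "bit-reversing" fix-up mentioned above, assuming each received byte arrives with its bits in the opposite (LSB-first) order; whether this is needed, and at what width, depends on the serial interface in use.

#include <stdint.h>

/* Reverse the bit order within one byte. */
static uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0) >> 4) | ((b & 0x0F) << 4));  /* swap nibbles */
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));  /* swap bit pairs */
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));  /* swap adjacent bits */
    return b;
}

/* Fix an entire received buffer of compressed data in place. */
void fix_bit_order(uint8_t *buf, int len)
{
    for (int i = 0; i < len; i++)
        buf[i] = reverse_bits8(buf[i]);
}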

Before any codec is integrated into a system, I recommend that the designer do careful testing. While G.711 encoding and decoding can be tested on a sample-by-sample basis, codecs that involve filters and other frequency-domain algorithms are tested differently, using a stream of at least a few thousand samples. Codec verification engages the engineer in unit testing with ITU vectors, signal-level testing, and interoperability testing with other available codecs. Interoperability issues related to arranging the encoded data into 16-bit words before transmission, and to mismatched signal levels, aren't new to system-integration engineers.
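The word-packing issue mentioned above is easy to picture in code. In this hedged sketch, two 8-bit G.711 codes are packed into one 16-bit word before transmission; whether the first sample lands in the high or the low byte is exactly the kind of convention mismatch that surfaces during interoperability testing.

#include <stdint.h>

/* Pack pairs of 8-bit codes into 16-bit words, first code in the high byte.
   (The receiving side must agree on this byte order.) */
void pack_codes(const uint8_t *codes, int n_codes, uint16_t *words)
{
    for (int i = 0; i + 1 < n_codes; i += 2)
        words[i / 2] = (uint16_t)(((uint16_t)codes[i] << 8) | codes[i + 1]);
}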

Algorithms that require lots of memory and clock cycles have a bigger impact on the system than those discussed so far. The more compute-intensive algorithms include echo cancellers, noise suppressors, and Viterbi algorithms. Evaluating their performance is not as easy as it is for the speech codecs.

Generally, any telecom system that involves a hands-free or speaker mode employs an acoustic echo canceller. This prevents the second party from hearing his own voice as an echo. If the system is operated in a noisy environment, a noise-control algorithm may also be needed. The echo canceller-noise reducer (EC-NR) combination demands lots of memory and clocks from the system. Both time- and frequency-domain techniques can solve the acoustic echo problem, with frequency-domain techniques proven to be more efficient at lower computational cost (Table 1).

Table 1

A frequency-domain technique uses an adaptive FIR filter that updates its coefficients only when it finds that the residual echo error is larger than a threshold. Subtracting the estimated echo from the input signal gives that error. The far-end signal is used as the reference from which these algorithms estimate the echo, so providing a proper reference is essential for good echo estimation and cancellation.
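To make the adaptation logic concrete, here is a hedged time-domain NLMS sketch of the update just described (Table 1 favors frequency-domain implementations, but the threshold-gated coefficient update is easiest to see in the time domain). The tap count, step size, and error threshold are assumed tuning values, not figures from the article.

#include <math.h>

#define TAPS          576      /* e.g. a 72-ms tail at 8 kHz (see the Table 2 discussion) */
#define STEP          0.5f     /* NLMS step size (assumed) */
#define ERR_THRESHOLD 1e-4f    /* adapt only when residual echo is significant */

/* x[] holds the most recent far-end (reference) samples, w[] the FIR
   coefficients, d is the near-end microphone sample.  Returns the
   echo-cancelled output. */
float ec_process_sample(const float x[TAPS], float w[TAPS], float d)
{
    float y = 0.0f, power = 1e-6f;

    /* Estimate the echo from the far-end reference through the FIR filter. */
    for (int i = 0; i < TAPS; i++) {
        y     += w[i] * x[i];
        power += x[i] * x[i];
    }

    float e = d - y;    /* residual error = input minus estimated echo */

    /* Update the coefficients only when the residual exceeds the threshold. */
    if (fabsf(e) > ERR_THRESHOLD) {
        float g = STEP * e / power;           /* normalized LMS update */
        for (int i = 0; i < TAPS; i++)
            w[i] += g * x[i];
    }
    return e;
}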

Another factor, echo tail length, is the echo reverberation time measured in milliseconds. Simply put, it's the time the echo takes to form. The filter length is found by multiplying the echo tail length by the sampling frequency (Table 2).


Table 2

One of the basic requisites for an echo-canceller (EC) implementation is to support data sampled at up to at least 16 kHz, to ensure that wideband speech is covered. Integrating an EC with wideband speech codecs requires some attention. Because the filter length depends on the sampling frequency, a filter that cancels echo up to 72 ms on data sampled at 8 kHz will effectively cover only half that span when applied to 16-kHz sampled data. And compared with 8 kHz, collecting a frame takes only half the time. Hence, engineers find integrating a half-effective EC with wideband codecs doubly challenging. Designers often raise the core frequency to manage the EC efficiently on a system with a 16-kHz sampling rate.
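The filter-length arithmetic behind that statement is straightforward, as the small sketch below shows for the 72-ms example.

#include <stdio.h>

/* Filter length (taps) = echo tail length (ms) * sampling frequency (Hz) / 1000 */
int main(void)
{
    int tail_ms = 72;
    int rates[] = { 8000, 16000 };

    for (int i = 0; i < 2; i++)
        printf("%d ms at %d Hz -> %d taps\n",
               tail_ms, rates[i], tail_ms * rates[i] / 1000);
    /* 72 ms at 8 kHz needs 576 taps; covering 72 ms at 16 kHz needs 1152. */
    return 0;
}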

Noise-reduction techniques have been used for many years. Depending on the application, an approach is chosen, implemented, and applied. For example, a technique could treat noise as more stationary than human speech: the algorithm models the noise and then subtracts it from the input signal. A reduction of 10 to 30 dB is significant for some applications. A common application that uses EC plus noise reduction is a handset placed in speaker mode in a noisy environment, or hands-free mode enabled in a car (Fig. 2).

Fig. 2. A basic system employs some of the speech compression/decompression and speech-enhancement algorithms.
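Here is a hedged sketch of the "model the noise, then subtract it" approach described above. It assumes the magnitude spectrum of the current frame is already available (from an FFT computed elsewhere in the system); the noise estimate is a slow running average updated during non-speech frames, and the smoothing and floor constants are assumed tuning values.

#define NBINS 129      /* e.g. a 256-point FFT, bins 0..128 */
#define ALPHA 0.98f    /* noise-estimate smoothing during silence */
#define FLOOR 0.05f    /* spectral floor, limits "musical noise" artifacts */

/* mag[]: magnitude spectrum of the current frame (modified in place).
   noise_est[]: running noise model.  is_speech: simple VAD decision. */
void spectral_subtract(float mag[NBINS], float noise_est[NBINS], int is_speech)
{
    for (int k = 0; k < NBINS; k++) {
        if (!is_speech)   /* noise is assumed more stationary than speech */
            noise_est[k] = ALPHA * noise_est[k] + (1.0f - ALPHA) * mag[k];

        float clean = mag[k] - noise_est[k];       /* subtract the noise model */
        float limit = FLOOR * mag[k];
        mag[k] = (clean > limit) ? clean : limit;  /* never go below the floor */
    }
}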

The EC tail-length requirement for the hands-free application is about 50 ms, and the required NR level can vary from 12 to 25 dB, depending on the noise attributes and the expected voice quality. Generally, the higher the noise reduction, the more the speech quality is put at risk. Hence, a level can be selected dynamically to give a reasonable reduction while still maintaining proper voice quality.

The EC-NR combination can require up to 15 or 20 Kbytes of system memory. Processing each 64-sample frame can consume from 1.5 to 3.0 Mclocks, depending on the processor. Evaluating the performance of this combination can be tricky. The steps include tuning the hardware codec gains; finding the correct microphone and speaker placement; synchronizing the far- and near-end speech and interrupts; finding audio hardware with linear attributes; and testing various EC tail lengths and noise-reduction levels to achieve the best echo cancellation and noise reduction.

It's important to consider worst cases when evaluating the complexity of any algorithm. An algorithm's execution time can vary from frame to frame. This data dependency arises because a processor might take more time to multiply two samples of higher amplitude than to multiply samples of lower amplitude.

Adaptive algorithms can be deceptive: if you observe the cycles consumed for only a few frames, you may catch frames in which the filter coefficients were never updated. Adaptation of the filter can take several thousand additional cycles, which must be considered. A word of caution: don't rely on a single measurement. Experimenting with a variety of vectors will help increase the accuracy of MCPS and performance measurements.
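One practical way to avoid this trap is to record the worst-case cycle count over many frames and many test vectors, as the hedged sketch below does. read_cycle_counter() stands in for whatever cycle counter the target provides; it and algorithm_process_frame() are placeholders, not real APIs.

#include <stdint.h>

extern uint32_t read_cycle_counter(void);                       /* platform-specific */
extern void algorithm_process_frame(const int16_t *in, int n);

/* Returns the worst observed clocks-per-frame, the number to feed into
   the MCPS formula given earlier. */
uint32_t measure_worst_case(const int16_t *vectors, int n_frames, int frame_size)
{
    uint32_t worst = 0;

    for (int f = 0; f < n_frames; f++) {
        uint32_t start = read_cycle_counter();
        algorithm_process_frame(&vectors[(long)f * frame_size], frame_size);
        uint32_t used = read_cycle_counter() - start;

        if (used > worst)
            worst = used;   /* keep the worst frame, including adaptation frames */
    }
    return worst;
}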

Collecting the bits and pieces
The algorithms discussed are good enough to build a basic telephony system. When the system has more than one algorithm, the sequence in which the algorithms are called is key. A few speech algorithms (such as noise suppressors) introduce non-linearities into their output, and this can hamper the performance of other algorithms. That kind of algorithm must be placed as the last module in the speech-enhancement chain.
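A minimal sketch of that ordering on the uplink (microphone) path: the linear, adaptive echo canceller runs first, the non-linear noise suppressor comes last in the enhancement chain, and only then is the frame handed to the speech encoder. The function names are illustrative, not from any particular library.

extern void echo_cancel(short *frame, int n);     /* linear, adaptive stage */
extern void noise_suppress(short *frame, int n);  /* non-linear stage */
extern void speech_encode(const short *frame, int n);

void uplink_process_frame(short *frame, int n)
{
    echo_cancel(frame, n);     /* needs an unmodified, linear signal to adapt well */
    noise_suppress(frame, n);  /* non-linear; placed last in the enhancement chain */
    speech_encode(frame, n);   /* e.g. G.711 or G.722 */
}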

About the author
Nitin Jain currently works in the research and development group of MindTree Consulting, in Bangalore, India. He holds an engineering degree in electronics and communications. Jain can be reached at nitin_jain@mindtree.com.
