Supply chain: Technology

Steve Yurick describes recent advances in voice-directed warehouse technology that can have a significant effect on distribution efficiency.


Voice-directed warehouse applications are a proven solution for improving distribution efficiency in industries ranging from grocery and foodservice distribution to apparel and industrial supply. A typical voice application combines a voice-directed workflow, in which the system provides audio prompts telling users what to do, with speech recognition technology that understands a user’s spoken responses. Voice applications seamlessly integrate with other warehouse systems to enable hands- and eyes-free operations that drive new levels of associate productivity and accuracy across picking and other warehouse tasks.

Underlying every voice application is a sophisticated speech recognition platform, designed to ensure users are understood quickly and accurately, every time they speak. High recognition accuracy means users do not have to repeat their responses, which maximizes efficiency and user acceptance of voice as a tool that helps them do their jobs better.

Over the past decade, the underlying speech recognition technology used in warehouse voice applications has undergone a significant evolution due in large part to developments in consumer and non-industrial markets, everything from GPS and smartphones to medical transcription and call centers. Although warehouse voice applications have different needs from these other applications, the advances in the wider speech recognition space are having a direct impact on advances in the warehouse market. Rather than adopting consumer speech recognition technologies, the ideal recognition solution for the distribution center incorporates a combination of technologies that provide the best-possible speech recognition accuracy and ease of use in a noisy environment with a widely diverse user population.

Before describing the speech technology landscape, it is useful to describe the requirements for effective speech recognition in a warehouse or distribution center. Most importantly, warehouse applications require far better recognition accuracy than a call center or other consumer application. Acceptable recognition for a game-player would not be acceptable for a voice picker. If a warehouse worker using a voice picking system has to repeat him- or herself, or if the speech recognizer mis-recognizes commands, the user’s productivity suffers and acceptance and adoption of the voice application is jeopardized.

Speech recognition challenges in the warehouse or distribution center
Every five-percentage-point difference in recognition accuracy (99 percent versus 94 percent, for example) translates into 2-6 minutes of productivity benefit per user per day, depending on pick rates and other factors. While the individual time benefit of better accuracy is small, the cumulative benefit for operations with large numbers of users is significant. Conversely, the effect of superior recognition accuracy on individual user satisfaction is significant, but more difficult to quantify or monetize.
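To make the cumulative effect concrete, the short sketch below multiplies the 2-6 minute range quoted above by a head count. The function name and the 100-user figure are illustrative assumptions, not numbers from any particular operation.

```python
def daily_minutes_saved(users, low_per_user=2.0, high_per_user=6.0):
    """Return (low, high) total minutes saved per day across all users.

    The 2-6 minute defaults come from the range quoted in the article;
    the user count is whatever the operation actually has.
    """
    return users * low_per_user, users * high_per_user


# Hypothetical operation with 100 voice users.
low, high = daily_minutes_saved(users=100)
print(f"{low / 60:.1f} to {high / 60:.1f} hours saved per day")
```

For 100 users this works out to roughly 3.3 to 10 hours of recovered working time per day.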

Two additional factors compound the recognition challenge in the warehouse: non-standard accents and variable background noise profiles. In any country, pronunciations differ widely from region to region, and individual user accents, speech patterns, and speech impediments each add another level of complexity. Unlike a call-center application, which may eventually default to a live person if a user cannot be understood by the recognizer, a warehouse system has to work for all users all the time; there is no fall-back.

Similarly, background noise in any given warehouse can be extreme due to blowers and fans, conveyors, forklifts, and other static noise. The level of background noise varies from area to area and may also change quickly; when a pallet is dropped or conveyors are turned on, the noise variance is dramatic. Any recognition technology used in a warehouse must provide outstanding accuracy across diverse user speech patterns amid loud and highly variable background noise patterns.

Underlying every speech recognition system are mathematical algorithms (typically referred to as “engines”) which translate user speech into data by matching the characteristics of digitized audio against a pre-defined model. There are two basic methods for doing this: word-based recognition, where the voice database is based on whole words, and phonetic-based recognition, where the voice database is based on phonemes, the sound components that make up words.

Many early warehouse voice systems utilized the word-based recognition approach. Speech recognition technology was still developing and the techniques for the more rigorous statistical modeling framework required for phonetic-based recognition had not yet matured. Over the past decade the bulk of speech recognition R&D has been focused on phonetic-based systems. As a result, phonetic-based engines have matured to the point where they are suitable for industrial applications.

The word-based method has some distinct advantages and disadvantages. A major disadvantage is that it requires every word that is to be recognized to be “trained” by each user before they can use the system. Since a typical warehouse application consists of 100-200 words that the user can speak, this can be time-consuming. Training times with typical word-based recognizers range from 20-40 minutes per user. Additionally, a word-based system requires users to speak their words in a consistent manner since there is less tolerance for pronunciation variations than with the phonetic-based method.

On the plus side, the word-based method handles heavy accents, non-standard pronunciations, and speech impediments especially well because there is no pre-built dictionary in the system that defines one or more pronunciations for each word. In a word-based system, the pronunciations are exclusively defined by how the user pronounces the word during training. Since a user is in complete control of the pronunciations recorded, he or she may choose a completely different pronunciation of a word.

Like the word-based method, the phonetic-based method has pros and cons. By mathematically modeling sub-word sounds, one needs to train only those distinct sounds in a language (usually around 45) in order to recognize large vocabularies of words and phrases. The phonetic-based method is used for systems requiring reduced or no voice training time (such as consumer-facing call center systems) or applications that require large vocabularies. This method also handles continuous speech recognition well, so users can talk naturally, without the need to pause between words or commands.
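The difference in training burden can be illustrated with a toy comparison. In the sketch below, the vocabulary, the ARPAbet-style phoneme spellings, and the function names are all illustrative assumptions; real recognizers use statistical acoustic models rather than simple symbol sets. The point is only that new words add new units to train in a word-based system, while a phonetic system can often spell them from sounds it already models.

```python
# Toy vocabulary with ARPAbet-style phoneme spellings (illustrative only).
LEXICON = {
    "ready":  ["R", "EH", "D", "IY"],
    "repeat": ["R", "IH", "P", "IY", "T"],
    "three":  ["TH", "R", "IY"],
    "tree":   ["T", "R", "IY"],
    "eight":  ["EY", "T"],
    "date":   ["D", "EY", "T"],
}

# Hypothetical words added to the application later.
NEW_WORDS = {
    "deep":  ["D", "IY", "P"],
    "tread": ["T", "R", "EH", "D"],
    "pit":   ["P", "IH", "T"],
}


def phoneme_inventory(lexicon):
    """Collect the distinct phonemes used anywhere in a lexicon."""
    return {p for phones in lexicon.values() for p in phones}


def new_units_needed(lexicon, new_words):
    """Return (word_based, phonetic) counts of new units to be trained."""
    word_based = len([w for w in new_words if w not in lexicon])
    phonetic = len(phoneme_inventory(new_words) - phoneme_inventory(lexicon))
    return word_based, phonetic


print(new_units_needed(LEXICON, NEW_WORDS))  # (3, 0)
```

Here the word-based approach needs three new per-user templates, while the phonetic approach needs none because every sound in the new words is already in the inventory.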

Phonetic recognizers may still require user training to achieve acceptable recognition rates for voice picking or other warehouse applications. First-generation phonetic recognizers (used in the warehouse) required as much user training as word-based systems.

Until recently, warehouse voice systems all used speaker-dependent technology, in which the voice engine is trained to recognize each user’s speech patterns. Many of these first-generation speech recognizers were originally built, tuned and optimized for a specific voice-only hardware platform (a closely coupled hardware/software system). Although these proprietary recognizers may now be used on general-purpose hardware devices, they may not perform as well as they did on the special-purpose hardware for which they were originally designed.

Many first-generation speaker-dependent systems required users to record a second voice template after starting to work with the system. People usually speak very clearly and deliberately when they initially create their voice template, but as they get comfortable working with voice, they speed up and revert to their usual, natural speech patterns: words run together, word endings are dropped, and careful articulation is abandoned. When that happens, the recognizer can no longer match what users say against the templates they built, and users have to perform a second 20-40 minute training process.

To eliminate the need to retrain, five years ago Lucas Systems introduced the concept of adaptive voice modeling, in which the speech software automatically adapts the user’s voice template in the course of use – the recognizer is continually “training” as the user works. As users start working faster, slurring words together and cutting off the ends of words, the recognizer keeps up. Rather than degrading, recognition accuracy actually improves with use, even when a user’s pronunciation changes slightly due to fatigue, which often occurs at the end of a shift. The ability to provide consistently high recognition rates despite changes in the user’s voice has a significant benefit in long-term user satisfaction.
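The article does not describe how the adaptation is implemented, so the following is only a generic sketch of the idea, not Lucas Systems' method. It assumes a voice template can be represented as a short list of numeric features and that each confidently recognized utterance nudges the template toward what the user actually said, using a simple exponential moving average. The feature values, the `rate` parameter, and both function names are hypothetical.

```python
def adapt_template(template, observed, rate=0.1):
    """Move each template feature a fraction `rate` toward the observed one."""
    if len(template) != len(observed):
        raise ValueError("template and observation must have the same length")
    return [(1 - rate) * t + rate * o for t, o in zip(template, observed)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical features: careful enrollment speech vs. faster natural speech.
template = [1.0, 0.0, 0.5]
natural = [0.6, 0.4, 0.9]

before = distance(template, natural)
for _ in range(20):  # twenty recognized utterances during normal work
    template = adapt_template(template, natural)
after = distance(template, natural)

print(f"mismatch before: {before:.3f}, after: {after:.3f}")
```

With these assumptions the mismatch between the stored template and the user's natural speech shrinks with every utterance, which is the behavior the retraining step used to provide in one large block.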

Eliminating user training
The ideal solution does away with training altogether so that anyone can use the voice system without creating a voice model; that is the intent of the speaker-independent systems used in automated customer service applications and other consumer-oriented products.

Speaker-independent systems have dominated mass-user applications because it was unrealistic to ask every person calling a voice-directed customer service phone line to train the system. While previous speaker-independent systems (all of which are based on phonetic recognition technology) provided acceptable recognition accuracy for consumer-facing applications, those applications always had a fall-back to a live operator. They did not provide the accuracy required for warehouse applications, especially in noisy environments. In a warehouse application, poor recognition degrades user productivity and frustrates users, which undermines acceptance of the technology.

Another major component of a warehouse speech recognition platform is audio pre-processing, which addresses background noise before the audio signal is provided to the computer-based recognizer. First-generation warehouse voice systems typically relied solely on ‘noise-reducing’ microphones. These microphones (which have improved significantly over the years and are still required for warehouse applications) include a dual microphone in the boom, one facing towards the user and one facing away, to capture and filter background noise. Newer warehouse voice applications added a noise sampling function on the mobile computer. The recognizer would take a sample of current background noise (while the user does not speak) in order to set appropriate noise-cancellation parameters and provide a higher quality audio sample for the speech recognition engine. Since this is not an automatic process, users might perform noise sampling several times in the course of a shift if they have recognition issues as the background noise levels in the warehouse change.

A better solution is to include audio processing technology that can automatically and continuously monitor and adjust to changing background noise levels. To do this efficiently requires far more computer processing power than was available in previous-generation voice terminals. The algorithms for processing audio input have advanced dramatically, similar to the advances in speech recognition engines. Powerful mobile computers can now efficiently support higher audio sampling rates, enabling advanced audio pre-processing and sound adaptation. The final piece of the puzzle integrates the noise reduction approach with the speech recognition algorithms. The speech platform can effectively handle all types of warehouse noises, from steady-state noises such as conveyors to noise spikes such as pallets dropped onto concrete.
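One common way to track background noise continuously is an asymmetric smoother: the noise estimate rises slowly and falls quickly, so a brief spike barely moves it while a sustained change is followed. The sketch below is a generic illustration under that assumption, not the algorithm used in any particular product; the per-frame energy values, the `rise` and `fall` rates, and the function name are all hypothetical.

```python
def track_noise_floor(frame_energies, rise=0.05, fall=0.5):
    """Return a running noise-floor estimate, one value per audio frame.

    The estimate moves slowly upward (rate `rise`) and quickly downward
    (rate `fall`), so short bursts are mostly ignored.
    """
    if not frame_energies:
        return []
    floor = frame_energies[0]
    estimates = []
    for energy in frame_energies:
        rate = fall if energy < floor else rise
        floor += rate * (energy - floor)
        estimates.append(floor)
    return estimates


# Hypothetical energies: quiet aisle, one dropped pallet, quiet again,
# then a conveyor is switched on and stays on.
energies = [1.0] * 5 + [50.0] + [1.0] * 5 + [10.0] * 100
estimates = track_noise_floor(energies)

print(f"after spike: {estimates[5]:.2f}")         # small bump only
print(f"after conveyor start: {estimates[-1]:.2f}")  # close to new level
```

In this example the single loud frame lifts the estimate only slightly, while the steady conveyor noise pulls it up to nearly the new level within about a hundred frames, without the user having to request a fresh noise sample.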

Lucas Systems’ Jennifer VoicePlus uses the broadest possible range of recognition technology. Recently the company developed a speech platform, Serenade, designed to support warehouse-specific audio capabilities independent of the underlying speech engine. Serenade’s engine-independent design allows warehouses and distribution centers to take advantage of advancing speech recognition technology. This new technology incorporates continuous noise adaptation, multiple simultaneous recognition approaches, and adaptive voice modeling.

This unique bundle of technologies provides optimal accuracy in noisy environments, maximum flexibility to adapt to different languages and atypical speech patterns, and minimal speech training. The ability to combine the power of phonetic-based recognition with the flexibility of word-based recognition is now a reality. Advanced phonetic-based recognizers provide the best-possible recognition accuracy across the broadest spectrum of users without the need for individual speech training.

Minimal (or zero) training
Even with an initial enrollment process of five minutes, distribution centers using this new technology save 20 minutes or more in initial training per user, achieve high accuracy levels immediately across all users (even challenging users), and eliminate the need to retrain users once they are comfortable with the system.

For a distribution center or warehouse with just 24 users, a 20-minute saving per user adds up to one eight-hour work day. Perhaps more importantly, with industry-best recognition rates, users express confidence in the system from day one and concentrate on job performance rather than technology.
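The eight-hour figure is simple arithmetic and can be checked, or rerun for a different head count, with a few lines. The function name is illustrative; the 24 users and 20 minutes are the article's own numbers.

```python
def training_hours_saved(users, minutes_saved_per_user=20):
    """Total training time saved, in hours, across all users."""
    return users * minutes_saved_per_user / 60


print(training_hours_saved(24))  # 8.0
```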


Steve Yurick is Director of Speech and Mobile Technologies at Lucas Systems, Inc., the industry leader in voice-directed applications enabling hands- and eyes-free warehouse operations. Tens of thousands of users use the technology daily at companies like Cardinal Health, C&S Wholesale Grocers, CVS/pharmacy, Kraft Nabisco, and OfficeMax.