SpeechKit Recognition API v3, gRPC: Recognizer
A set of methods for voice recognition.
Call | Description |
---|---|
RecognizeStreaming | Expects audio in real time. |
Calls Recognizer
RecognizeStreaming
Expects audio in real time.
rpc RecognizeStreaming (stream StreamingRequest) returns (stream StreamingResponse)
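The call is bidirectional streaming: the client sends a stream of StreamingRequest messages (session options first, then audio) and reads StreamingResponse messages back. Below is a minimal Python sketch, assuming the stubs generated from the v3 proto files (the `yandex.cloud.ai.stt.v3` package with `stt_pb2` and `stt_service_pb2_grpc` modules) and the `stt.api.cloud.yandex.net:443` endpoint; the IAM token, file path, and chunk size are placeholders.

```python
import grpc

from yandex.cloud.ai.stt.v3 import stt_pb2, stt_service_pb2_grpc

CHUNK_SIZE = 4000  # bytes of audio per StreamingRequest (placeholder value)


def requests(audio_path, options):
    # The first message of the stream must carry the session options.
    yield stt_pb2.StreamingRequest(session_options=options)
    # All following messages carry audio data.
    with open(audio_path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=chunk))


def recognize(audio_path, options, iam_token):
    channel = grpc.secure_channel(
        "stt.api.cloud.yandex.net:443", grpc.ssl_channel_credentials()
    )
    stub = stt_service_pb2_grpc.RecognizerStub(channel)
    responses = stub.RecognizeStreaming(
        requests(audio_path, options),
        metadata=(("authorization", f"Bearer {iam_token}"),),
    )
    for response in responses:
        # See the StreamingResponse section for the possible events.
        print(response.WhichOneof("Event"))
```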
StreamingRequest
Field | Description |
---|---|
Event | oneof: session_options, chunk, silence_chunk or eou |
session_options | StreamingOptions Session options. Must be the first message from the user. |
chunk | AudioChunk Chunk with audio data. |
silence_chunk | SilenceChunk Chunk with silence. |
eou | Eou Request to end the current utterance. Works only with the external EOU classifier. |
StreamingOptions
Field | Description |
---|---|
recognition_model | RecognitionModelOptions Configuration for the speech recognition model. |
eou_classifier | EouClassifierOptions Configuration for the end-of-utterance detection model. |
RecognitionModelOptions
Field | Description |
---|---|
model | string Reserved for future, do not use. |
audio_format | AudioFormatOptions Specifies the input audio format. |
text_normalization | TextNormalizationOptions Text normalization options. |
language_restriction | LanguageRestrictionOptions Possible languages in audio. |
audio_processing_type | enum AudioProcessingType How to process the audio data (in real time, after all the data is received, etc.). Default is REAL_TIME. |
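A sketch of minimal session options for real-time recognition of raw PCM audio, assuming the same generated `stt_pb2` stubs as above; the enum value names (LINEAR16_PCM, REAL_TIME) come from the generated code and should be checked against your proto version. Normalization, language restriction, and EOU options are shown in the following sections.

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

# Raw 16-bit PCM, 8 kHz, mono, processed as it arrives (REAL_TIME is the default).
options = stt_pb2.StreamingOptions(
    recognition_model=stt_pb2.RecognitionModelOptions(
        audio_format=stt_pb2.AudioFormatOptions(
            raw_audio=stt_pb2.RawAudio(
                audio_encoding=stt_pb2.RawAudio.LINEAR16_PCM,
                sample_rate_hertz=8000,
                audio_channel_count=1,
            )
        ),
        audio_processing_type=stt_pb2.RecognitionModelOptions.REAL_TIME,
    )
)
```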
AudioFormatOptions
Field | Description |
---|---|
AudioFormat | oneof: raw_audio or container_audio |
raw_audio | RawAudio Audio without container. |
container_audio | ContainerAudio Audio wrapped in a container. |
RawAudio
Field | Description |
---|---|
audio_encoding | enum AudioEncoding Type of audio encoding. |
sample_rate_hertz | int64 PCM sample rate |
audio_channel_count | int64 PCM channel count. Currently only single channel audio is supported in real-time recognition. |
ContainerAudio
Field | Description |
---|---|
container_audio_type | enum ContainerAudioType Type of audio container. |
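If the audio is already wrapped in a container, container_audio is used instead of raw_audio. A sketch under the same stub assumptions; OGG_OPUS is assumed to be one of the generated ContainerAudioType values, so verify it against your proto version.

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

# Audio wrapped in an OGG container with the Opus codec instead of raw PCM.
audio_format = stt_pb2.AudioFormatOptions(
    container_audio=stt_pb2.ContainerAudio(
        container_audio_type=stt_pb2.ContainerAudio.OGG_OPUS,
    )
)
```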
TextNormalizationOptions
Field | Description |
---|---|
text_normalization | enum TextNormalization Text normalization mode. |
profanity_filter | bool Profanity filter (default: false). |
literature_text | bool Rewrite text in literature style (default: false). |
phone_formatting_mode | enum PhoneFormattingMode Phone formatting mode. |
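A sketch of normalization options under the same stub assumptions; TEXT_NORMALIZATION_ENABLED is the enum value name in the generated code (an assumption to verify), and the flag values shown are illustrative.

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

text_normalization = stt_pb2.TextNormalizationOptions(
    # Produce normalized output text (numerals, punctuation, etc.).
    text_normalization=stt_pb2.TextNormalizationOptions.TEXT_NORMALIZATION_ENABLED,
    profanity_filter=True,   # mask profanities in the output
    literature_text=False,   # keep the literary-style rewrite off (the default)
)
```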
LanguageRestrictionOptions
Field | Description |
---|---|
restriction_type | enum LanguageRestrictionType Language restriction type. |
language_code[] | string List of language codes. |
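A sketch of a whitelist restriction under the same stub assumptions; WHITELIST is assumed to be a value of LanguageRestrictionType, and the language codes are illustrative.

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

# Restrict recognition to Russian and English only.
language_restriction = stt_pb2.LanguageRestrictionOptions(
    restriction_type=stt_pb2.LanguageRestrictionOptions.WHITELIST,
    language_code=["ru-RU", "en-US"],
)
```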
EouClassifierOptions
Field | Description |
---|---|
Classifier | oneof: default_classifier or external_classifier Type of EOU classifier. |
default_classifier | DefaultEouClassifier EOU classifier provided by SpeechKit. Default. |
external_classifier | ExternalEouClassifier EOU is enforced by external messages from user. |
DefaultEouClassifier
Field | Description |
---|---|
type | enum EouSensitivity EOU sensitivity. Currently there are two levels: a faster one with more errors and a more conservative one (the default). |
max_pause_between_words_hint_ms | int64 Hint for the maximum pause between words. The EOU detector can use this information to distinguish between the end of an utterance and slow speech. |
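A sketch of enabling the built-in EOU classifier under the same stub assumptions; the DEFAULT sensitivity value and the 1500 ms hint are illustrative assumptions, not recommended settings.

```python
from yandex.cloud.ai.stt.v3 import stt_pb2

eou_classifier = stt_pb2.EouClassifierOptions(
    default_classifier=stt_pb2.DefaultEouClassifier(
        type=stt_pb2.DefaultEouClassifier.DEFAULT,  # conservative sensitivity (assumed value name)
        max_pause_between_words_hint_ms=1500,       # tolerate slow speech with pauses up to ~1.5 s
    )
)
```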
ExternalEouClassifier
Empty
AudioChunk
Field | Description |
---|---|
data | bytes Bytes with audio data. |
SilenceChunk
Field | Description |
---|---|
duration_ms | int64 Duration of silence chunk in ms. |
Eou
Empty
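With the external classifier selected in the session options, the client decides where utterances end by sending Eou messages. A sketch of such a request sequence under the same stub assumptions; the 300 ms silence chunk is illustrative.

```python
from yandex.cloud.ai.stt.v3 import stt_pb2


def external_eou_requests(options, utterances):
    """Yield requests for a session where the client marks utterance boundaries.

    `utterances` is an iterable of byte strings, one per utterance.
    """
    yield stt_pb2.StreamingRequest(session_options=options)
    for audio in utterances:
        yield stt_pb2.StreamingRequest(chunk=stt_pb2.AudioChunk(data=audio))
        # Optionally account for a pause without sending actual audio bytes.
        yield stt_pb2.StreamingRequest(
            silence_chunk=stt_pb2.SilenceChunk(duration_ms=300)
        )
        # Force the end of the current utterance; only honored when
        # external_classifier was selected in the session options.
        yield stt_pb2.StreamingRequest(eou=stt_pb2.Eou())
```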
StreamingResponse
Field | Description |
---|---|
session_uuid | SessionUuid Session identifier. |
audio_cursors | AudioCursors Progress of the streaming session: how much data has been received, final and partial times, etc. |
response_wall_time_ms | int64 Wall-clock time on the server side: the time when the server wrote the results to the stream. |
Event | oneof: partial, final, eou_update, final_refinement or status_code |
partial | AlternativeUpdate Partial results. The server sends them regularly once enough audio data has been received from the user. This is the current text estimation from final_time_ms to partial_time_ms; it may change as new data arrives. |
final | AlternativeUpdate Final results: the recognition is fixed up to final_time_ms. For now, a final is sent only when an EOU event has been triggered; this may change in future releases. |
eou_update | EouUpdate After the EOU classifier triggers, the server sends a final and then an EouUpdate with the EOU time. Before the eou_update, a final with the same time is sent; there may be several finals before an eou_update. |
final_refinement | FinalRefinement For each final, if normalization is enabled, the server sends the normalized text (or the result of some other advanced post-processing). Final normalization introduces additional latency. |
status_code | StatusCode Status messages sent by the server at a fixed interval (keep-alive). |
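A sketch of dispatching on the Event oneof of StreamingResponse, under the same stub assumptions; the guards against empty alternative lists are defensive and purely illustrative.

```python
def handle_responses(responses):
    for r in responses:
        event = r.WhichOneof("Event")
        if event == "partial" and r.partial.alternatives:
            print("partial:", r.partial.alternatives[0].text)
        elif event == "final" and r.final.alternatives:
            print("final:  ", r.final.alternatives[0].text)
        elif event == "final_refinement":
            refined = r.final_refinement.normalized_text.alternatives
            if refined:
                print("refined:", refined[0].text)
        elif event == "eou_update":
            print("eou at:", r.eou_update.time_ms, "ms")
        elif event == "status_code":
            # Keep-alive status message; nothing to do here.
            pass
```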
SessionUuid
Field | Description |
---|---|
uuid | string Internal session identifier. |
user_request_id | string User session identifier. |
AudioCursors
Field | Description |
---|---|
received_data_ms | int64 Amount of audio data received by the server. This cursor advances after each audio chunk is received. |
reset_time_ms | int64 Input stream reset data. |
partial_time_ms | int64 How much audio has been processed, including trimmed silence. This cursor advances after the server has received enough data to update the recognition results. |
final_time_ms | int64 Time of the last final. This cursor advances when the server decides that the recognition from the start of the audio up to final_time_ms will no longer change; usually this event is followed by EOU detection (but this may change in the future). |
final_index | int64 Index of the last final the server sent. Incremented after each new final. |
eou_time_ms | int64 Estimated time of the EOU. This cursor is updated after each new EOU is sent. For the external classifier it equals received_data_ms at the moment the EOU event arrives; for the internal classifier it is an estimate. The time is not exact and has the same guarantees as word timings. |
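The cursors can be combined to track how much of the stream is already fixed and how much is still being re-estimated. A sketch under the same stub assumptions:

```python
def report_progress(response):
    c = response.audio_cursors
    # Audio up to final_time_ms will not change any more; the window between
    # final_time_ms and partial_time_ms is still being re-estimated.
    unstable_ms = c.partial_time_ms - c.final_time_ms
    pending_ms = c.received_data_ms - c.partial_time_ms
    print(
        f"received {c.received_data_ms} ms, "
        f"finalized up to {c.final_time_ms} ms (final #{c.final_index}), "
        f"{unstable_ms} ms unstable, {pending_ms} ms awaiting processing"
    )
```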
AlternativeUpdate
Field | Description |
---|---|
alternatives[] | Alternative List of hypotheses for the time frame. |
channel_tag | string Tag used to distinguish audio channels. |
Alternative
Field | Description |
---|---|
words[] | Word Words in the time frame. |
text | string Text in the time frame. |
start_time_ms | int64 Start of the time frame. |
end_time_ms | int64 End of the time frame. |
confidence | double The hypothesis confidence. Currently not used. |
languages[] | LanguageEstimation Distribution over possible languages. |
Word
Field | Description |
---|---|
text | string Word text. |
start_time_ms | int64 Estimation of word start time in ms. |
end_time_ms | int64 Estimation of word end time in ms. |
LanguageEstimation
Field | Description |
---|---|
language_code | string Language code in ISO 639-1 format. |
probability | double Estimation of language probability. |
EouUpdate
Field | Description |
---|---|
time_ms | int64 EOU estimated time. |
FinalRefinement
Field | Description |
---|---|
final_index | int64 Index of the final for which the server sends additional information. |
Type | oneof: normalized_text Type of refinement. |
normalized_text | AlternativeUpdate Normalized text instead of the raw one. |
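Refinements arrive separately from the finals they refine; final_index is the key that pairs them. A sketch collecting the raw and normalized text per final, under the same stub assumptions:

```python
def collect_normalized(responses):
    finals = {}      # final_index -> raw final text
    normalized = {}  # final_index -> normalized text
    for r in responses:
        event = r.WhichOneof("Event")
        if event == "final" and r.final.alternatives:
            finals[r.audio_cursors.final_index] = r.final.alternatives[0].text
        elif event == "final_refinement":
            ref = r.final_refinement
            if ref.WhichOneof("Type") == "normalized_text" and ref.normalized_text.alternatives:
                normalized[ref.final_index] = ref.normalized_text.alternatives[0].text
    return finals, normalized
```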
StatusCode
Field | Description |
---|---|
code_type | enum CodeType Code type. |
message | string Human readable message. |