Commit 5451562 by danbev · unverified · 1 Parent(s): fc04dc0

examples : add stereo to mono conversion in read_audio_data (#3266)

This commit adds a conversion from stereo to mono in the
`read_audio_data` function of `common-whisper.cpp`.

The motivation for this change is that prior to commit
7d3da68f792018e81a758881e081154d1cbe6b6f ("examples : use miniaudio for
direct decoding flac, mp3, ogg and wav (#2759)"), there was a step that
read stereo int16 data into pcm16 (448512 samples), converted it to
mono (224256 samples), and then also converted it to stereo in `pcmf32s`.

The middle step here seems to have been missed when rewriting the code to
use Miniaudio, which caused issues when transcribing stereo audio files.

For example, currently using the audio sample in the linked issue the
output is:
```console
[00:00:00.000 --> 00:00:03.000] (speaker 1) Sous-titres réalisés para la communauté d'Amara.org
```

And with the change in this commit the output is:
```console
[00:00:00.000 --> 00:00:01.500] (speaker 1) *sonnerie de téléphone*
[00:00:01.500 --> 00:00:07.000] (speaker 1) Salut jeune homme !
[00:00:07.000 --> 00:00:08.500] (speaker 0) C'est vrai que je te dérange ?
[00:00:08.500 --> 00:00:10.500] (speaker 1) Ah pas du tout, pas du tout, pas du tout !
[00:00:10.500 --> 00:00:12.500] (speaker 1) J'étais en train de...
[00:00:12.500 --> 00:00:14.500] (speaker 1) de préparer un courrier
```

Resolves: https://github.com/ggml-org/whisper.cpp/issues/3092

Files changed (1)
  1. examples/common-whisper.cpp +14 -7
--- a/examples/common-whisper.cpp
+++ b/examples/common-whisper.cpp
@@ -112,13 +112,20 @@ bool read_audio_data(const std::string & fname, std::vector<float>& pcmf32, std:
     }
 
     if (stereo) {
-        pcmf32s.resize(2);
-        pcmf32s[0].resize(frame_count);
-        pcmf32s[1].resize(frame_count);
-        for (uint64_t i = 0; i < frame_count; i++) {
-            pcmf32s[0][i] = pcmf32[2*i];
-            pcmf32s[1][i] = pcmf32[2*i + 1];
-        }
+        std::vector<float> stereo_data = pcmf32;
+        pcmf32.resize(frame_count);
+
+        for (uint64_t i = 0; i < frame_count; i++) {
+            pcmf32[i] = (stereo_data[2*i] + stereo_data[2*i + 1]);
+        }
+
+        pcmf32s.resize(2);
+        pcmf32s[0].resize(frame_count);
+        pcmf32s[1].resize(frame_count);
+        for (uint64_t i = 0; i < frame_count; i++) {
+            pcmf32s[0][i] = stereo_data[2*i];
+            pcmf32s[1][i] = stereo_data[2*i + 1];
+        }
     }
 
     ma_decoder_uninit(&decoder);
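
For readers who want to try the stereo branch outside of `read_audio_data`, the following is a minimal standalone sketch of the same idea: take an interleaved stereo buffer (L0, R0, L1, R1, ...), produce a mono mix, and keep a per-channel copy. The names `split_audio` and `split_interleaved_stereo` are hypothetical and are not part of whisper.cpp. Note one deliberate difference: the commit adds the two channels, whereas this sketch halves the sum so that a full-scale signal stays within [-1, 1]. That averaging is an assumption of the sketch, not what the commit does.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical container, not part of whisper.cpp: holds the mono mix
// and the two de-interleaved channels.
struct split_audio {
    std::vector<float>              mono;
    std::vector<std::vector<float>> channels;
};

// Hypothetical helper, not part of whisper.cpp: splits interleaved stereo
// samples (L0, R0, L1, R1, ...) into a mono mix and per-channel buffers.
static split_audio split_interleaved_stereo(const std::vector<float> & interleaved) {
    // One frame is one left sample plus one right sample.
    const uint64_t frame_count = interleaved.size() / 2;

    split_audio out;
    out.mono.resize(frame_count);
    out.channels.assign(2, std::vector<float>(frame_count));

    for (uint64_t i = 0; i < frame_count; i++) {
        const float left  = interleaved[2*i];
        const float right = interleaved[2*i + 1];

        // Assumption of this sketch: average the channels. The commit
        // itself stores left + right without the 0.5 factor.
        out.mono[i]        = 0.5f * (left + right);
        out.channels[0][i] = left;
        out.channels[1][i] = right;
    }

    return out;
}
```

With the sample counts quoted in the commit message, an interleaved buffer of 448512 samples yields a mono buffer of 224256 samples and two channel buffers of 224256 samples each.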