This week, I wanted to finish what I started during input devices week by using a microphone to send sound input to the cloud over a WebSocket server.
I found this tutorial on YouTube, which had source code for getting my WebSocket server up and running, as well as guidelines for connecting to WiFi and using I2S.
I modified the setup and loop code slightly while debugging because the sound playback was not initially working. Setup involved connecting to WiFi and the WebSocket server and applying the I2S configuration; the loop reads audio data from the microphone and sends it to the server.
void setup() {
  // Set up Serial Monitor
  Serial.begin(115200);
  Serial.println(" ");
  delay(1000);

  connectWiFi();
  connectWSServer();

  // Set up I2S
  i2s_install();
  i2s_setpin();
  i2s_start(I2S_PORT);
  delay(500);
}

void loop() {
  // Get I2S data and place in data buffer
  size_t bytesIn = 0;
  esp_err_t result = i2s_read(I2S_PORT, &sBuffer, bufferLen, &bytesIn, portMAX_DELAY);

  // Forward the raw audio bytes to the WebSocket server
  if (result == ESP_OK && isWebSocketConnected) {
    client.sendBinary((const char*)sBuffer, bytesIn);
  }
}
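setup() above calls four helpers that I haven't shown: connectWiFi(), connectWSServer(), i2s_install(), and i2s_setpin(). Here is a rough sketch of what they look like, following the tutorial's pattern; the pin numbers, WiFi credentials, and server address below are placeholders rather than values from the tutorial, so swap in whatever matches your own wiring and network.

#include <driver/i2s.h>
#include <WiFi.h>
#include <ArduinoWebsockets.h>

using namespace websockets;

// Placeholder pin assignments -- match these to your own wiring
#define I2S_WS   25   // word select (LRCL on the mic breakout)
#define I2S_SD   33   // serial data (DOUT)
#define I2S_SCK  26   // serial clock (BCLK)
#define I2S_PORT I2S_NUM_0
#define bufferLen 1024

int16_t sBuffer[bufferLen];

WebsocketsClient client;
bool isWebSocketConnected = false;

void connectWiFi() {
  WiFi.begin("YOUR_SSID", "YOUR_PASSWORD");  // placeholder credentials
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println("WiFi connected");
}

void connectWSServer() {
  // Placeholder address -- point this at the machine running the Node server
  isWebSocketConnected = client.connect("ws://192.168.1.100:8888");
}

void i2s_install() {
  // 16-bit mono at 16 kHz, so the bytes we stream out match the
  // LINEAR16 / 16000 Hz config the server hands to the STT API
  const i2s_config_t i2s_config = {
    .mode = i2s_mode_t(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = 16000,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = 0,
    .dma_buf_count = 8,
    .dma_buf_len = 64,
    .use_apll = false
  };
  i2s_driver_install(I2S_PORT, &i2s_config, 0, NULL);
}

void i2s_setpin() {
  const i2s_pin_config_t pin_config = {
    .bck_io_num = I2S_SCK,
    .ws_io_num = I2S_WS,
    .data_out_num = -1,  // receive only, no output pin
    .data_in_num = I2S_SD
  };
  i2s_set_pin(I2S_PORT, &pin_config);
}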
From here, I wanted to test the Google speech-to-text API. I didn't have to touch the code I flashed onto the ESP32-S3, since the code from the tutorial was already streaming the audio data to the WebSocket server; instead, I worked on adapting the code on the server side. Unlike the demo in the tutorial, I didn't need to send the audio data back to a PCM player; I had to call the Google STT API on the data stream instead. Here is what my server code looked like in the end, writing audio data to the speech recognizer stream instead of to the PCM player.
const WebSocket = require("ws");
const speech = require("@google-cloud/speech");

// Creates a Speech-to-Text client
const speechClient = new speech.SpeechClient();

const WS_PORT = process.env.WS_PORT || 8888;
const wsServer = new WebSocket.Server({ port: WS_PORT }, () =>
  console.log(`WS server is listening at ws://localhost:${WS_PORT}`)
);

// Speech-to-text request config; this must match the I2S settings
// on the ESP32 (16-bit PCM at 16 kHz)
const encoding = "LINEAR16";
const sampleRateHertz = 16000;
const languageCode = "en-US";

const request = {
  config: {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
  },
  interimResults: false, // If you want interim results, set this to true
};

// Create a recognize stream and print transcriptions as they arrive
const recognizeStream = speechClient
  .streamingRecognize(request)
  .on("error", console.error)
  .on("data", (data) =>
    process.stdout.write(
      data.results[0] && data.results[0].alternatives[0]
        ? `Transcription: ${data.results[0].alternatives[0].transcript}\n`
        : "\n\nReached transcription time limit, press Ctrl+C\n"
    )
  );

// Array of connected websocket clients
let connectedClients = [];

wsServer.on("connection", (ws, req) => {
  console.log("Connected");

  // Add new connected client
  connectedClients.push(ws);

  // The ESP32 is the only sender, so every incoming message is an audio
  // chunk; write it to the recognizer stream exactly once
  ws.on("message", (data) => {
    recognizeStream.write(data);
  });

  // Drop clients that disconnect
  ws.on("close", () => {
    connectedClients = connectedClients.filter((client) => client !== ws);
  });
});
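A couple of practical notes on running this: the server needs the ws and @google-cloud/speech packages installed (npm install ws @google-cloud/speech), and the Speech client picks up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable, which should point at a service account key for a project with the Speech-to-Text API enabled. Also, Google limits a single streaming recognition session to roughly five minutes of audio, which is what the "Reached transcription time limit" branch in the code above is reporting.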