Convert Speech To Text Using Azure AI Speech In Angular 16 And .NET 6 Web API

Introduction

Azure AI Speech is a cloud-based service provided by Microsoft as part of its Azure Cognitive Services. It enables developers to integrate speech processing capabilities into their applications, services, and devices. The Azure AI Speech service offers various speech capabilities, including speech recognition, text-to-speech, speech translation, and speaker recognition.

In this post, we will walk through all the steps to record audio from the microphone in an Angular 16 application in WebM format and send it to a .NET 6 Web API. The backend API converts the WebM file to WAV with the help of the NAudio library and FFmpeg, then uses the Azure AI Speech service to convert the WAV audio file to plain text and return it to the Angular app.

Here’s an overview of the key features of Azure AI Speech service:

1. Speech-to-Text (STT)

  • Real-Time Speech Recognition: Converts spoken audio into text in real-time. Useful for applications like voice assistants, transcription services, and more.
  • Batch Transcription: Allows for the processing of pre-recorded audio files in batch mode.
  • Customization: Enables customization of the speech recognition models to better understand domain-specific terminology.
  • Different Languages and Dialects: Supports a wide range of languages and dialects.

2. Text-to-Speech (TTS)

  • Realistic Voices: Converts text into lifelike spoken audio in various languages.
  • Custom Voice: Allows for the creation of a unique voice font for your brand.
  • Style Control: Adjusts the speaking style of the voice to suit different scenarios or emotions.

3. Speech Translation

  • Real-Time Translation: Provides real-time translation of spoken language into another spoken language.
  • Wide Range of Languages: Supports many languages and dialects.

4. Speaker Recognition

  • Speaker Verification: Confirms whether a given piece of audio matches a specific speaker's voice.
  • Speaker Identification: Identifies who is speaking from a group of enrolled speakers.
  • Voice Enrollment: Process of registering a user's voice for later recognition.

5. Speech Analytics

  • Sentiment Analysis: Analyzes spoken language to determine the speaker's sentiment.
  • Keyword Spotting: Detects specific words or phrases in spoken language.

Use Cases

  • Voice Assistants and Bots: Enhance customer service with voice-enabled assistants.
  • Transcription Services: Automatically transcribe audio from meetings, lectures, or interviews.
  • Accessibility: Make applications more accessible with voice interfaces.
  • Language Learning: Help in language learning with speech recognition and translation.
  • Security: Use speaker recognition for biometric authentication.

Azure AI Speech continues to evolve, and Microsoft constantly adds new features and capabilities to improve the service and extend its functionality.

Create an Azure AI Speech in Azure portal

In the Azure portal, we can open the Azure AI services blade and select the Speech service.

Azure AI Services

Choose an existing resource group or create a new one. We can choose the Free F0 plan for testing purposes. Please note that only one free-tier Speech resource is allowed per subscription.

Create Speech Services
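
If you prefer the command line, the same resource can also be created with the Azure CLI. This is a sketch; the resource and group names are placeholders, and the --yes flag accepts the service terms.

az cognitiveservices account create --name my-speech-resource --resource-group my-rg --kind SpeechServices --sku F0 --location eastus --yes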

After creating the speech resource, we can go to the Keys and Endpoint tab and get the key. We will be using this key later in our .NET 6 Web API project.

Keys and Endpoints
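
The key can also be retrieved from the command line (again, the names are placeholders):

az cognitiveservices account keys list --name my-speech-resource --resource-group my-rg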

Create .NET 6 Web API with Visual Studio 2022

We can create a .NET 6 Web API project with Visual Studio 2022.

Please add the NuGet packages below to the project.

  • Microsoft.CognitiveServices.Speech

  • NAudio
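
If you prefer the command line, the same packages can be added with the dotnet CLI:

dotnet add package Microsoft.CognitiveServices.Speech
dotnet add package NAudio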

When the user provides audio input from Angular, the audio byte array is passed to the .NET Web API. By default, the recorded audio is in WebM format. We use the NAudio library (together with FFmpeg) to convert WebM to WAV, because Microsoft Cognitive Services Speech currently recognizes WAV files, not WebM.

We can create a Helper class now.

Helper.cs

using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech;
using NAudio.Wave;
using System.Diagnostics;

namespace SpeechToText.NET6;

public static class Helper
{
    private static readonly string subscriptionKey = "<your-speech-key>"; // key from the Keys and Endpoint tab
    private static readonly string serviceRegion = "eastus";

    public static async Task<string> ConvertAudioToTextAsync(string audioFilePath, string lang)
    {
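        // Note: RecognizeOnceAsync transcribes a single utterance and stops at the
        // first long pause; for longer recordings, continuous recognition is needed.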
        var config = SpeechConfig.FromSubscription(subscriptionKey, serviceRegion);
        var audioConfig = AudioConfig.FromWavFileInput(audioFilePath);
        if (lang != null)
        {
            config.SpeechRecognitionLanguage = lang;
        }
        try
        {
            using var recognizer = new SpeechRecognizer(config, audioConfig);
            var result = await recognizer.RecognizeOnceAsync();

            if (result.Reason == ResultReason.RecognizedSpeech)
            {
                return result.Text;
            }
            else if (result.Reason == ResultReason.NoMatch)
            {
                return "No speech could be recognized.";
            }
            else if (result.Reason == ResultReason.Canceled)
            {
                var cancellation = CancellationDetails.FromResult(result);
                // Inspect cancellation.Reason (e.g. CancellationReason.Error) for details.
                return $"CANCELED: Reason={cancellation.Reason}";
            }
            else
            {
                return "Recognition failed.";
            }
        }
        catch (Exception ex)
        {
            return ex.Message;
        }

    }
    public static void ConvertWebmToWav(string webmInput, string wavOutput)
    {
        string rawOutput = "temp.raw";

        // Step 1: Convert webm to raw PCM
        var processStartInfo = new ProcessStartInfo
        {
            FileName = "ffmpeg",
            Arguments = $"-i {webmInput} -f s16le -ac 2 -ar 44100 -y {rawOutput}",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(processStartInfo))
        {
            process?.WaitForExit();
        }

        // Step 2: Convert raw PCM to wav using NAudio
        var waveFormat = new WaveFormat(44100, 2);

        using (var fileStream = File.OpenRead(rawOutput))
        using (var reader = new RawSourceWaveStream(fileStream, waveFormat))
        using (var writer = new WaveFileWriter(wavOutput, reader.WaveFormat))
        {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, bytesRead);
            }
        }

        // Cleanup
        try
        {
            System.IO.File.Delete(rawOutput);
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Failed to delete {rawOutput}. Reason: {ex.Message}");
        }
    }
}

We have added two static methods to this class: ConvertWebmToWav and ConvertAudioToTextAsync.

Inside the ConvertWebmToWav method, we use the NAudio library and the ffmpeg executable to convert the WebM file to WAV format.

FFmpeg is a powerful and versatile open-source software suite for handling multimedia files and streams. It provides a wide range of tools and libraries for converting audio and video formats, processing multimedia content, and streaming. FFmpeg is popular for its high performance, compatibility with many formats, and extensive feature set.

You can download FFmpeg.exe from the site below.

https://www.videohelp.com/software/ffmpeg
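
Note that the helper starts FFmpeg with FileName = "ffmpeg", so ffmpeg.exe must either be on the system PATH or placed next to the Web API executable. For reference, the first conversion step effectively runs a command like this (the file names here match the ones used in the code):

ffmpeg -i recorded-audio.webm -f s16le -ac 2 -ar 44100 -y temp.raw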

Inside the ConvertAudioToTextAsync method, the audio file path is passed to the AudioConfig.FromWavFileInput method, and the audio is converted to text.

We need an OutputResult model to hold the converted text and return it to the Angular application.

OutputResult.cs

namespace SpeechToText.NET6;

public class OutputResult
{
    public string? Result { get; set; }
}
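
Note that ASP.NET Core's default System.Text.Json settings serialize property names in camelCase, so the Result property reaches the client as result. A successful response body will look roughly like this (the text is just an example):

{ "result": "hello, this is a test" }

This is why the OutputResult interface on the Angular side declares result in lowercase.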

Create an AudioController under the Controllers folder.

AudioController.cs

using Microsoft.AspNetCore.Mvc;

namespace SpeechToText.NET6.Controllers;

[Route("api/[controller]")]
[ApiController]
public class AudioController : ControllerBase
{
    private readonly string _audioSavePath = Path.Combine(Directory.GetCurrentDirectory(), "AudioFiles");

    [HttpPost("upload")]
    public async Task<ActionResult<OutputResult>> Upload([FromQuery] string lang)
    {
        // FirstOrDefault avoids an exception when no file is attached to the request.
        var audio = Request.Form.Files.FirstOrDefault();
        if (audio == null || audio.Length == 0)
            return BadRequest("No audio file provided.");

        if (!Directory.Exists(_audioSavePath))
        {
            Directory.CreateDirectory(_audioSavePath);
        }

        var sourceFilePath = Path.Combine(_audioSavePath, audio.FileName);
        var destFilePath = Path.Combine(_audioSavePath, "converted-audio.wav");

        using (var stream = new FileStream(sourceFilePath, FileMode.Create))
        {
            await audio.CopyToAsync(stream);
        }

        Helper.ConvertWebmToWav(sourceFilePath, destFilePath);

        string text = await Helper.ConvertAudioToTextAsync(destFilePath, lang);
        OutputResult result = new()
        {
            Result = text
        };

        return result;
    }
}

We pass the audio file as FormData from Angular. In the controller's upload method, we save this file locally and call Helper.ConvertWebmToWav, followed by Helper.ConvertAudioToTextAsync.

We can change the Program.cs file by adding CORS entries so that our Angular application can consume the Web API endpoints.

Program.cs

builder.Services.AddCors(options =>
{
    options.AddDefaultPolicy(
        policy =>
        {
            policy.WithOrigins("https://localhost:4200")
                  .AllowAnyHeader()
                  .AllowAnyMethod();
        });
});


.......

app.UseCors();
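
For context, here is a minimal sketch of how the full Program.cs could look with these entries in place, assuming the default Web API template with no extra middleware:

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();
builder.Services.AddCors(options =>
{
    options.AddDefaultPolicy(
        policy =>
        {
            policy.WithOrigins("https://localhost:4200")
                  .AllowAnyHeader()
                  .AllowAnyMethod();
        });
});

var app = builder.Build();

app.UseHttpsRedirection();

// CORS must be registered before the endpoints are mapped.
app.UseCors();

app.MapControllers();

app.Run();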

We have completed the .NET 6 Web API coding part. We can create an Angular 16 application now.

ng new SpeechToTextAngular16

Choose default routing and styling options and continue.

After a few minutes, the Angular application will be created.

Create a loader service.

ng g s loading

loading.service.ts

import { Injectable } from '@angular/core';
import { BehaviorSubject } from 'rxjs';

@Injectable({
  providedIn: 'root'
})
export class LoadingService {
  private _isLoading = new BehaviorSubject<boolean>(false);

  get isLoading$() {
    return this._isLoading.asObservable();
  }

  showLoader() {
    this._isLoading.next(true);
  }

  hideLoader() {
    this._isLoading.next(false);
  }
}

We can create a loader component. This component will be used to show the loader while background processing is going on.

ng g c loading

Replace component files with the code below.

loading.component.ts

import { Component } from '@angular/core';
import { LoadingService } from '../loading.service';

@Component({
  selector: 'app-loading',
  templateUrl: './loading.component.html',
  styleUrls: ['./loading.component.css']
})
export class LoadingComponent {
  isLoading$ = this.loadingService.isLoading$;

  constructor(private loadingService: LoadingService) { }
}

loading.component.html

<div *ngIf="isLoading$ | async" class="loader-overlay">
    <div class="small progress">
        <div></div>
    </div>
</div>

loading.component.css

.loader-overlay {
  position: fixed;
  top: 0;
  left: 0;
  right: 0;
  bottom: 0;
  display: flex;
  align-items: center;
  justify-content: center;
}

.progress {
    position: relative;
    width: 5em;
    height: 5em;
    margin: 0 0.5em;
    font-size: 12px;
    text-indent: 999em;
    overflow: hidden;
    -webkit-animation: progress_ani 1s infinite steps(8);
    animation: progress_ani 1s infinite steps(8);
  background: none;
}

.small.progress {
    font-size: 8px;
}

.progress:after,
.progress:before,
.progress > div:after,
.progress > div:before {
    content: "";
    position: absolute;
    top: 0;
    left: 2.25em;
    width: 0.5em;
    height: 1.5em;
    border-radius: 0.2em;
    background: #eee;
    box-shadow: 0 3.5em #eee;
    -webkit-transform-origin: 50% 2.5em;
    transform-origin: 50% 2.5em;
}

.progress:before {
    background: #555;
}

.progress:after {
    -webkit-transform: rotate(-45deg);
    transform: rotate(-45deg);
    background: #777;
}

.progress > div:before {
    -webkit-transform: rotate(-90deg);
    transform: rotate(-90deg);
    background: #999;
}

.progress > div:after {
    -webkit-transform: rotate(-135deg);
    transform: rotate(-135deg);
    background: #bbb;
}

@-webkit-keyframes progress_ani {
    to {
        -webkit-transform: rotate(1turn);
        transform: rotate(1turn);
    }
}

@keyframes progress_ani {
    to {
        -webkit-transform: rotate(1turn);
        transform: rotate(1turn);
    }
}

Modify app.module with the code below.

app.module.ts

import { NgModule } from '@angular/core';
import { BrowserModule } from '@angular/platform-browser';

import { AppComponent } from './app.component';
import { LoadingComponent } from './loading/loading.component';
import { HttpClientModule } from '@angular/common/http';
import { FormsModule } from '@angular/forms';

@NgModule({
  declarations: [
    AppComponent,
    LoadingComponent
  ],
  imports: [
    BrowserModule,
    HttpClientModule,
    FormsModule, 
  ],
  providers: [],
  bootstrap: [AppComponent]
})
export class AppModule { }

Modify the main app component class, HTML, and CSS files with the code given below.

app.component.ts

import { ChangeDetectorRef, Component, OnInit } from '@angular/core';
import { HttpClient } from '@angular/common/http';
import { LoadingService } from './loading.service';

@Component({
  selector: 'app-root',
  templateUrl: './app.component.html',
  styleUrls: ['./app.component.css']
})
export class AppComponent implements OnInit {
  constructor(private http: HttpClient, private cd: ChangeDetectorRef, private loadingService: LoadingService) {
    this.selectedLang = this.languageOptions[0].value;
  }

  mediaRecorder?: MediaRecorder;
  audioChunks: Blob[] = [];
  downloadLink?: string;
  showMicrophone?: boolean;
  showStop?: boolean;
  convertedMessage?: string;
  showMessage!: boolean;

  selectedLang!: string;
  languageOptions = [
    { value: 'en-US', label: 'English' },
    { value: 'hi-IN', label: 'Hindi' },
    { value: 'ml-IN', label: 'Malayalam' },
    { value: 'ta-IN', label: 'Tamil' },
    { value: 'te-IN', label: 'Telugu' },
    { value: 'kn-IN', label: 'Kannada' }
  ];

  async startRecording() {
    this.showMicrophone = false;
    this.showStop = true;
    this.convertedMessage = "";

    try {
      this.downloadLink = "";
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

      this.mediaRecorder = new MediaRecorder(stream);
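      // Most browsers record WebM/Opus by default, which matches the
      // 'audio/webm;codecs=opus' blob type used when uploading below.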
      this.audioChunks = [];

      this.mediaRecorder.ondataavailable = (event: BlobEvent) => {
        this.audioChunks.push(event.data);
      };

      this.mediaRecorder.onstop = () => {
        this.onRecordingStopped();
      };

      this.mediaRecorder.start();
    } catch (error) {
      console.error("Error accessing the microphone:", error);
    }
  }

  onRecordingStopped() {
    const audioBlob = new Blob(this.audioChunks, { type: 'audio/webm;codecs=opus' });
    const audioUrl = URL.createObjectURL(audioBlob);
    this.downloadLink = audioUrl; // set download link after recording stops
    this.uploadAudio();
    this.cd.detectChanges();
  }

  stopRecording() {
    if (this.mediaRecorder) {
      this.showStop = false;
      this.showMicrophone = true;
      this.mediaRecorder.stop();
    }
  }

  uploadAudio() {
    if (!this.downloadLink) {
      console.error("No audio data available to upload.");
      return;
    }
    this.loadingService.showLoader();
    const audioBlob = new Blob(this.audioChunks, { type: 'audio/webm;codecs=opus' });
    const formData = new FormData();
    formData.append('audio', audioBlob, 'recorded-audio.webm');
    this.http.post<OutputResult>('https://localhost:5000/api/audio/upload?lang=' + this.selectedLang, formData)
      .subscribe({
        next: (res) => {
          console.log('Upload successful', res);
          this.convertedMessage = res.result;
          this.loadingService.hideLoader();
          this.cd.detectChanges();
        },
        error: (err) => {
          console.error('Upload error:', err);
          this.loadingService.hideLoader();
          this.convertedMessage = err.message;
          this.cd.detectChanges();
        },
        complete: () => console.info('Request completed')
      });
  }

  ngOnInit(): void {
    this.showMicrophone = true;
  }

  showHideButtons() {
    this.showMessage = false;
  }

}

export interface OutputResult {
  result: string;
}

app.component.html

<div class="content" role="main">
  <div class="row justify-content-center pt-2 pb-2">
    <select [(ngModel)]="selectedLang" class="dropdown">
      <option *ngFor="let lang of languageOptions" [value]="lang.value">
        {{ lang.label }}
      </option>
    </select>

    <button title="Click here to start recording action" class="btn-mic w-auto" (click)="startRecording()"
      *ngIf="showMicrophone" style="margin-right: 10px;">
      <img src="../../assets/mic.png" />
    </button>
    <button title="Click here to stop recording" class="btn-stop w-auto" (click)="stopRecording()" *ngIf="showStop"
      style="margin-right: 10px;">
      <img src="../../assets/stop.png" />
    </button>
  </div>
  <p>Converted Speech:</p>
  <input [(ngModel)]="convertedMessage" [disabled]="true" class="textbox" />
</div>
    
<div class="container mt-3" style="max-width:1330px;padding-top:30px;" role="main">
  <app-loading></app-loading>
</div>

app.component.css

.content {
  display: flex;
  margin: 10px auto 10px;
  padding: 0 16px;
  max-width: 960px;
  flex-direction: column;
  align-items: center;
}

.btn-mic,
.btn-stop {
  background: none;
  border: none;
  padding: 0;
  cursor: pointer;
}

.btn-mic img,
.btn-stop img {
  width: 30px;
  opacity: 0.5;
}

.dropdown {
  height: 30px;
  margin: 10px;
  width: 120px;
}

.textbox {
  width: 500px;
  margin: -10px;
  height: 150px;
  text-align: center;
  padding: 0px;
}

We have completed the Angular coding as well. We can now run both the Web API and the Angular application.

Currently, the application supports six languages. More can be added by extending the languageOptions array with additional locale codes (for example, fr-FR for French).

We can test the application by capturing audio. After clicking the stop button, the audio is sent from Angular to the .NET 6 Web API for processing.

After some time, you can see the processed result. This time, I chose Malayalam as the audio language.

You can record audio again with English as the audio language.

Again, we can choose another language, Hindi, and record a voice.

Conclusion

In this post, we have seen all the steps to capture audio in Angular and send it to a .NET 6 Web API. By default, the audio format is WebM, and we used the NAudio library along with ffmpeg.exe to convert WebM to WAV. After converting to a WAV file, we used the Microsoft Cognitive Services Speech library to convert the audio to text. We then sent the converted text back to Angular and displayed it in the UI. I also tried a few ways to convert the WebM format to WAV in TypeScript itself. Although the WAV file was created successfully, the Azure Cognitive Service for Speech did not convert the speech to text; it reported a missing-header error. If you have tried any other way to convert speech to text, please leave a message here.