How to Avoid Passing Sensitive Information to an LLM (OpenAI)

In my previous article, Passing An Audio File To LLM, I explained how to pass an audio file to an LLM. In this article, I'm extending that work by addressing the use case of sensitive information.

Let's say an audio file contains your bank account number, secure ID, PIN, passcode, date of birth, or any other information that has to be kept secure. You will often find this kind of information when dealing with customer-facing audio calls, especially in the finance sector. These details, also known as PII (Personally Identifiable Information), are very sensitive, and it is not safe to store them on just any server. Hence, one should be very careful while dealing with such data.

Now, when it comes to using such data in generative AI-based applications, we need a way to remove this information from the data before passing it to the LLM, and that is exactly what this article is about.

In this article, I'll show you a quick way to redact such sensitive information from an audio file and save the result, so that the updated audio file can be transcribed and sent to the LLM.
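To make the idea concrete before we bring in AssemblyAI, here is a minimal, illustrative sketch of text-level redaction using regular expressions. This is not how AssemblyAI does it (the service uses trained models), and regexes alone are not reliable enough for production PII handling; the placeholder labels are my own:

```python
import re

def redact_text(text: str) -> str:
    """Illustrative only: mask a few common PII patterns with placeholders."""
    # Phone numbers like 425-555-0123 or (425) 555-0123
    text = re.sub(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", "[PHONE]", text)
    # Email addresses
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    # SSN-like patterns (123-45-6789)
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)
    return text

print(redact_text("Call me at 425-555-0123 or mail mary@contoso.com"))
# -> Call me at [PHONE] or mail [EMAIL]
```

The rest of the article does this properly, on the audio itself, using AssemblyAI's PII redaction.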

High-level steps

To execute the solution end-to-end, we need the following components/libraries:

Redaction And Transcription

  • For redaction and transcript generation, we will be using AssemblyAI

Embedding Generator

  • For generating the embeddings, we will be using OpenAIEmbeddings

Vector Database

  • Chroma will be used as an in-memory database for storing the vectors

Large Language Model

  • OpenAI as LLM

All of these are wrapped under a library called LangChain, so we will be relying heavily on that too.

First of all, we need to grab the keys as shown below:

Get An OpenAI API Key

To get the OpenAI key, go to https://openai.com/, log in, and generate an API key from the account dashboard:

Get an OpenAI Key

Get An AssemblyAI API Key

To get the AssemblyAI key, go to AssemblyAI | Account, log in, and copy the API key from the account page:

Get an AssemblyAI API Key

Install Packages

Install the following packages, for example with pip:

pip install assemblyai openai sentence-transformers langchain chromadb tiktoken

Import Required Packages

Once the dependencies are installed, import the packages below:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AssemblyAIAudioTranscriptLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
import assemblyai

Redact Sensitive Information

Next, we will read an audio file and transcribe it using the lines of code below. You can take any audio file; in my case, I downloaded one from the Azure Cognitive Services GitHub repository.

# Transcribe the audio with PII redaction enabled
transcript = assemblyai.Transcriber().transcribe(
    "Voice.wav",
    config=assemblyai.TranscriptionConfig(
        redact_pii=True,         # redact PII in the transcript text
        redact_pii_audio=True,   # also produce a redacted version of the audio
        redact_pii_policies=[
            assemblyai.PIIRedactionPolicy.phone_number
        ],
    ),
)

# Save the audio file with the redacted portions removed
transcript.save_redacted_audio("RedactedAudio.wav")

Make sure to define at least one redaction policy. The PIIRedactionPolicy enum offers several more you can add, such as person_name, email_address, credit_card_number, and date_of_birth.

You can find more information about the available policies in the AssemblyAI PII redaction documentation.

Transcribe Audio

Now that we have removed the sensitive information, it is straightforward to extract text from this audio:

doc = AssemblyAIAudioTranscriptLoader(file_path="RedactedAudio.wav").load()

Here is what doc contains:

[Document(page_content=" Hello, thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso.

And what's the best callback number in case we get disconnected? I only have a cell phone, so I can give you that. Yeah, that'll be fine. Sure. So it's. And then. Got it. So to confirm it's. Yes, that's right. Excellent. Let's get some additional information for your application.

{'text': 'thank', 'start': 1220, 'end': 1406, 'confidence': 0.99995}, …, {'text': 'day.', 'start': 184252, 'end': 184340, 'confidence': 0.93062}], 'utterances': None, 'confidence': 0.9500014150943394, 'audio_duration': 185.0, 'webhook_status_code': None, 'webhook_auth': False, 'summary': None, 'auto_highlights_result': None, 'content_safety_labels': None, 'iab_categories_result': None, 'chapters': None, 'sentiment_analysis_results': None, 'entities': None})]

Let's trim the metadata down to just the fields we want using the code below:

doc[0].metadata = {"audio_url": doc[0].metadata["audio_url"]}
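Trimming matters because vector stores like Chroma accept only simple scalar metadata values (strings, numbers, booleans), and the AssemblyAI loader attaches rich nested structures. Instead of hard-coding one key, you could keep every scalar field with a small filter; a sketch, with the sample dictionary being a made-up stand-in for real transcript metadata:

```python
def keep_scalar_metadata(metadata: dict) -> dict:
    """Keep only values Chroma can store as metadata: str, int, float, bool."""
    return {k: v for k, v in metadata.items()
            if isinstance(v, (str, int, float, bool))}

# Hypothetical metadata: nested values like word lists get dropped
sample = {"audio_url": "RedactedAudio.wav", "confidence": 0.95,
          "words": [{"text": "hi"}]}
print(keep_scalar_metadata(sample))
# -> {'audio_url': 'RedactedAudio.wav', 'confidence': 0.95}
```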

Chunk The Text

I'm using a chunk size of 700 here, but you can change this number:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=0)
texts = text_splitter.split_documents(doc)
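Conceptually, the splitter cuts long text into pieces no larger than chunk_size, with RecursiveCharacterTextSplitter preferring natural boundaries such as paragraphs and sentences. A simplified, dependency-free sketch of the basic fixed-width idea (the real splitter is smarter about where it cuts):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Naive fixed-width chunking; real splitters prefer separator boundaries."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1500-character text at chunk_size=700 yields two full chunks plus a remainder
chunks = chunk_text("a" * 1500, chunk_size=700, chunk_overlap=0)
print([len(c) for c in chunks])
# -> [700, 700, 100]
```

A non-zero chunk_overlap makes consecutive chunks share trailing text, which helps retrieval when an answer straddles a chunk boundary.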

Generate Embeddings And Save To Database

In this step, we will generate embeddings for the above chunks, store them in Chroma, and instantiate the chat model:

db = Chroma.from_documents(texts, OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
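Behind the scenes, the vector store answers a query by embedding it and finding the stored chunk vectors that are most similar, typically by cosine similarity. A minimal sketch of that comparison, using tiny made-up 2-dimensional vectors in place of real OpenAI embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three chunks and a query vector
chunks = {"enrollment": [0.9, 0.1], "billing": [0.1, 0.9], "greeting": [0.7, 0.3]}
query = [1.0, 0.0]

best = max(chunks, key=lambda name: cosine(chunks[name], query))
print(best)
# -> enrollment
```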

Query And Get The Response

This step creates a QA chain; passing the query to it returns the answer.

chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3})
)
query = "What this audio file is all about?"
chain({"query": query})

Here is the output:

{'query': 'What this audio file is all about?', 'result': 'The audio file is a conversation between a customer named Mary Rondo and a representative from Contoso. Mary is calling to enroll herself in health insurance with Contoso. The representative asks Mary for her full name, callback number, Social Security number, and email address to complete the enrollment process.'}
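A note on the search_type="mmr" setting used above: Maximal Marginal Relevance does not simply return the chunks most similar to the query; it greedily trades relevance against diversity so the retrieved chunks are not near-duplicates of each other. A compact sketch of the selection rule, with made-up similarity scores standing in for real embedding similarities:

```python
def mmr(query_sim: list[float], doc_sims: list[list[float]],
        k: int = 2, lam: float = 0.5) -> list[int]:
    """Greedy MMR: maximize lam * relevance - (1 - lam) * redundancy.

    query_sim: similarity of each doc to the query.
    doc_sims:  doc_sims[i][j] = similarity between docs i and j.
    """
    selected: list[int] = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is less relevant but different,
# so MMR picks it over the redundant doc 1.
query_sim = [0.9, 0.85, 0.7]
doc_sims = [[1.0, 0.95, 0.2],
            [0.95, 1.0, 0.2],
            [0.2, 0.2, 1.0]]
print(mmr(query_sim, doc_sims, k=2))
# -> [0, 2]
```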

You can see how easy it was to read the audio, redact sensitive information, transcribe it, pass it on to the LLM, and get our questions answered.