Tip To Improve The Response Time From LLM - OpenAI & Azure OpenAI

Semantic Kernel, a powerful tool for integrating large language models into your applications, now supports streaming responses. In this blog post, we'll explore how to use this feature to obtain streamed results from LLMs served through Azure OpenAI and OpenAI.

Why Do Streamed Responses Matter?

When working with language models, especially in conversational scenarios, streaming responses offer several advantages:

  • Real-time Interaction: Streaming allows you to receive partial responses as they become available, enabling more interactive and dynamic conversations.
  • Reduced Latency: Instead of waiting for the entire response, you can start processing and displaying content incrementally, reducing overall latency.
  • Efficient Resource Usage: Streaming conserves memory and resources by handling data in smaller chunks.
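The idea is easiest to see with a toy async generator that stands in for an LLM (the names here are illustrative, not part of Semantic Kernel):

```python
import asyncio

async def fake_llm_stream():
    # Simulated LLM that yields its answer in small chunks, like a streamed response.
    for chunk in ["Rain ", "falls ", "softly."]:
        await asyncio.sleep(0)  # stand-in for network latency between chunks
        yield chunk

async def consume():
    pieces = []
    async for chunk in fake_llm_stream():
        # Each chunk is usable the moment it arrives -- no waiting for the full reply.
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    return "".join(pieces)

full_reply = asyncio.run(consume())
```

This is exactly the consumption pattern we'll use against the real services below: iterate with async for and render each chunk immediately.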

How to Use Streaming Responses

I've published a complete video on how to build this in Python, and it can be found here.

Below, I'm sharing just the code snippets for your use.

Install Dependencies

pip install semantic-kernel==0.9.6b1

Read configuration values and Instantiate Kernel

import semantic_kernel as sk
from dotenv import dotenv_values

config = dotenv_values("Configuration.env")
kernel = sk.Kernel()
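For reference, a Configuration.env along these lines would satisfy the keys read in the snippets below (the key names are the ones this post uses; the values are placeholders for your own):

```
AZURE_CHAT_MODEL=<your-deployment-name>
AZURE_API_BASE=https://<your-resource>.openai.azure.com/
AZURE_API_KEY=<your-api-key>
```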

Create an LLM Service Object

Azure OpenAI

from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion
azure_completion_service = AzureChatCompletion(
    service_id="aoai_chat",
    deployment_name=config["AZURE_CHAT_MODEL"],
    endpoint=config["AZURE_API_BASE"],
    api_key=config["AZURE_API_KEY"])

OpenAI

from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
# Fill in your own OpenAI values below.
openai_chat_service = OpenAIChatCompletion(
    service_id="",
    ai_model_id="",
    api_key="",
    org_id=""
)
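If you'd rather not hard-code the OpenAI values, they can be pulled from environment variables instead (a standard-library sketch; the variable names here are my own convention, not something Semantic Kernel mandates):

```python
import os

# Illustrative environment-variable names; use whatever your secret store provides.
openai_config = {
    "service_id": "openai_chat",
    "ai_model_id": os.environ.get("OPENAI_CHAT_MODEL", "gpt-3.5-turbo"),
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
    "org_id": os.environ.get("OPENAI_ORG_ID", ""),
}
```

These values can then be passed straight into the constructor, e.g. OpenAIChatCompletion(**openai_config).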

Define Execution Settings for the LLM

Azure OpenAI

from semantic_kernel.connectors.ai.open_ai.prompt_execution_settings.azure_chat_prompt_execution_settings import AzureChatPromptExecutionSettings
# The service_id must match the one used when creating the service ("aoai_chat").
oai_prompt_execution_settings = AzureChatPromptExecutionSettings(
    service_id="aoai_chat",
    max_tokens=150
)

OpenAI

from semantic_kernel.connectors.ai.open_ai.prompt_execution_settings.open_ai_prompt_execution_settings import OpenAIChatPromptExecutionSettings
# Chat settings, since we are calling a chat-completion service; the service_id
# should match the one used when creating the service.
openai_prompt_execution_settings = OpenAIChatPromptExecutionSettings(
    service_id="",
    max_tokens=150
)

Make a Call to the LLM

Azure OpenAI

import asyncio
from semantic_kernel.contents import ChatHistory

async def main():
    chat = ChatHistory()
    chat.add_system_message("You are an AI assistant that can create amazing poems.")
    chat.add_user_message("Tell me a poem on a rainy day")
    stream = azure_completion_service.complete_chat_stream(
        chat_history=chat,
        settings=oai_prompt_execution_settings)
    # Print each chunk as soon as it arrives.
    async for text in stream:
        print(str(text[0]), end="")

# In a notebook you can simply "await main()"; in a script, run the event loop:
asyncio.run(main())

OpenAI

import asyncio
from semantic_kernel.contents import ChatHistory

async def main():
    chat = ChatHistory()
    chat.add_system_message("You are an AI assistant that can create amazing poems.")
    chat.add_user_message("Tell me a poem on a rainy day")
    stream = openai_chat_service.complete_chat_stream(
        chat_history=chat,
        settings=openai_prompt_execution_settings)
    # Print each chunk as soon as it arrives.
    async for text in stream:
        print(str(text[0]), end="")

# In a notebook you can simply "await main()"; in a script, run the event loop:
asyncio.run(main())

Final output


Conclusion

Streaming responses enhance the user experience, making interactions smoother and more efficient. Whether you’re building chatbots, virtual assistants, or any other AI-powered application, consider leveraging Semantic Kernel’s streaming capabilities.

Happy streaming!

