Additional Tokenizer Support in ML.NET

Tokenization

Tokenization is a fundamental component in the preprocessing of natural language text for AI models. Tokenizers are responsible for breaking down a string of text into smaller, more manageable parts, often referred to as tokens. This process is crucial for understanding costs and managing context when using services like Azure OpenAI, as well as for providing inputs to self-hosted or local models.

The Microsoft.ML.Tokenizers package offers an open-source, cross-platform tokenization library. With the release of ML.NET 4.0, this library has been significantly enhanced with refined APIs, support for new tokenizers, and better integration with existing libraries.

Key enhancements in ML.NET 4.0

  1. Refined APIs and Functionality: Improved user experience with more intuitive and flexible APIs.
  2. Tiktoken Support: Added support for the Tiktoken tokenizer, expanding the range of models that can be efficiently tokenized.
  3. Llama Model Tokenizer: New tokenizer support specifically designed for the Llama model.
  4. CodeGen Tokenizer: Compatible with models such as codegen-350M-mono and phi-2, making it easier to work with code generation models.
  5. Enhanced Encoding Methods: New EncodeToIds overloads that accept Span<char> instances, allowing for customized normalization and pre-tokenization (see the sketch after this list).
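
As a minimal sketch of that last item, the snippet below encodes a slice of a character buffer without allocating an intermediate string. The buffer contents and slice bounds are illustrative, and it assumes the span-based EncodeToIds overload described above accepts the slice directly.

Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");
// A larger buffer of which only the first 13 characters should be tokenized (illustrative data).
char[] buffer = "Hello, World! -- trailing text we do not want to encode".ToCharArray();
Span<char> slice = buffer.AsSpan(0, 13); // "Hello, World!"
// Encode the slice directly; no substring allocation is needed.
IReadOnlyList<int> sliceIds = tokenizer.EncodeToIds(slice);
Console.WriteLine($"sliceIds = {{{string.Join(", ", sliceIds)}}}");
// Expected to match the IDs for "Hello, World!" in the example below: {9906, 11, 4435, 0}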

Additionally, the ML.NET team has collaborated closely with the DeepDev TokenizerLib and SharpToken communities to ensure comprehensive coverage of various tokenization scenarios. Users of DeepDev or SharpToken are encouraged to migrate to Microsoft.ML.Tokenizers, with a detailed migration guide available for assistance.

Tiktoken Text Tokenizer

The Tiktoken Text Tokenizer is designed to work with models like GPT-4. It efficiently breaks down text into tokens that the model can process, ensuring that text data is accurately and quickly prepared for AI applications.

Example in C#

Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");
string text = "Hello, World!";
// Encode to IDs.
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {9906, 11, 4435, 0}
// Decode IDs to text.
string? decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!
// Get token count.
int idsCount = tokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 4
// Full encoding.
EncodingResult result = tokenizer.Encode(text);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Tokens)}'}}");
// result.Tokens = {'Hello', ',', ' World', '!'}
Console.WriteLine($"result.Offsets = {{{string.Join(", ", result.Offsets)}}}");
// result.Offsets = {(0, 5), (5, 1), (6, 6), (12, 1)}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Ids)}}}");
// result.Ids = {9906, 11, 4435, 0}
// Encode up to number of tokens limit.
int index1 = tokenizer.IndexOfTokenCount(text, maxTokenCount: 1, out string processedText1, out int tokenCount1);
Console.WriteLine($"processedText1 = {processedText1}");
// processedText1 = Hello, World!
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 1
Console.WriteLine($"index1 = {index1}");
// index1 = 5
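// Encode from the end up to one token.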
int index2 = tokenizer.LastIndexOfTokenCount(text, maxTokenCount: 1, out string processedText2, out int tokenCount2);
Console.WriteLine($"processedText2 = {processedText2}");
// processedText2 = Hello, World!
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 12
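
The last two calls are useful for fitting text into a token budget: IndexOfTokenCount returns the character index up to which at most maxTokenCount tokens fit from the start of the text, while LastIndexOfTokenCount works backward from the end. As a minimal sketch (the TrimToTokenBudget helper below is hypothetical, not part of the library), a prompt can be truncated to a budget like this:

// Hypothetical helper: keep only as much leading text as fits within 'budget' tokens.
static string TrimToTokenBudget(Tokenizer tokenizer, string text, int budget)
{
    int index = tokenizer.IndexOfTokenCount(text, maxTokenCount: budget, out string processedText, out _);
    // 'index' refers to the processed text, so slice that string rather than the raw input.
    return processedText.Substring(0, index);
}

Console.WriteLine(TrimToTokenBudget(tokenizer, "Hello, World!", budget: 3));
// Hello, World   (the trailing '!' would be the fourth token)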

Llama Text Tokenizer

The Llama Text Tokenizer is designed for the Llama family of models, which are used for a wide range of natural language processing tasks. It is based on SentencePiece-style tokenization: spaces are represented with the ▁ marker and a beginning-of-sequence token (<s>) is prepended, which is why "Hello, World!" encodes to five IDs below rather than the four produced by the Tiktoken tokenizer.

Example in C#

// Create the Tokenizer by streaming the tokenizer model file from the remote location.
using HttpClient httpClient = new();
string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-llamaTokenizer/resolve/main/llamaTokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
Tokenizer llamaTokenizer = Tokenizer.CreateLlama(remoteStream);
string text = "Hello, World!";
// Encode to IDs.
IReadOnlyList<int> encodedIds = llamaTokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {1, 15043, 29892, 2787, 29991}
// Decode IDs to text.
string? decodedText = llamaTokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!
// Get token count.
int idsCount = llamaTokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 5
// Full encoding.
EncodingResult result = llamaTokenizer.Encode(text);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Tokens)}'}}");
// result.Tokens = {'<s>', '▁Hello', ',', '▁World', '!'}
Console.WriteLine($"result.Offsets = {{{string.Join(", ", result.Offsets)}}}");
// result.Offsets = {(0, 0), (0, 6), (6, 1), (7, 6), (13, 1)}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Ids)}}}");
// result.Ids = {1, 15043, 29892, 2787, 29991}
// Encode up to two tokens.
int index1 = llamaTokenizer.IndexOfTokenCount(text, maxTokenCount: 2, out string processedText1, out int tokenCount1);
Console.WriteLine($"processedText1 = {processedText1}");
// processedText1 = ▁Hello,▁World!
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 2
Console.WriteLine($"index1 = {index1}");
// index1 = 6
// Encode from end up to one token.
int index2 = llamaTokenizer.LastIndexOfTokenCount(text, maxTokenCount: 1, out string processedText2, out int tokenCount2);
Console.WriteLine($"processedText2 = {processedText2}");
// processedText2 = ▁Hello,▁World!
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 13

Conclusion

The additional tokenizer support in ML.NET 4.0 greatly enhances its capability to process and manage natural language text. These improvements, including support for new tokenizers and refined APIs, give developers powerful tools for handling text data more effectively.
