Tokenizer

Introduction

Large language models process text using tokens, which are common sequences of characters found in a body of text. The models learn the statistical relationships between these tokens and excel at producing the next token in a sequence.

It's important to note that the exact tokenization process varies between models. Newer models like MaaS-3.5 and MaaS-4 use a different tokenizer than previous models, and will produce different tokens for the same input text.

Tokenization

Different models use different tokenizers, so the same input text can be split into different tokens depending on the model. The resulting tokens are determined by the tokenizer's vocabulary, which is derived from the model's training data, and by the specific implementation of the tokenizer.
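For instance, the same string yields different token IDs (and sometimes a different number of tokens) under different encodings. A minimal sketch, assuming tiktoken is installed and recent enough to include o200k_base:

import tiktoken

text = "tiktoken is great!"
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    # print the encoding name, token count, and token IDs for comparison
    print(name, len(ids), ids)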

Below, we provide different ways to count tokens with tiktoken.

For example, given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly 3/4 of a word (so 100 tokens ~= 75 words).
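The split above can be reproduced directly with tiktoken. A minimal sketch:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "tiktoken is great!"
ids = enc.encode(text)
print(ids)  # e.g., [83, 1609, 5963, 374, 2294, 0]
print([enc.decode([i]) for i in ids])  # ['t', 'ik', 'token', ' is', ' great', '!']
print(len(text) / len(ids))  # 3.0 characters per token for this string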

Encodings

Encodings specify how text is converted into tokens. Different models use different encodings. tiktoken supports the following encodings used by OpenAI models:

Encoding name        Models
o200k_base           MaaS-4o
cl100k_base          MaaS-4, MaaS-3.5-turbo, MaaS-embedding-ada-002, MaaS-embedding-3-small, MaaS-embedding-3-large
p50k_base            Codex models, text-davinci-002, text-davinci-003
r50k_base (or gpt2)  GPT-3 models like davinci
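tiktoken can also look up the encoding for a model by name. Note that tiktoken maps the underlying OpenAI model names (e.g., "gpt-4", "gpt-4o"), not the MaaS aliases used above:

import tiktoken

print(tiktoken.encoding_for_model("gpt-4o").name)  # o200k_base
print(tiktoken.encoding_for_model("gpt-4").name)  # cl100k_base
print(tiktoken.encoding_for_model("text-davinci-003").name)  # p50k_base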

Tokenization without vision

For models without vision (non-multimodal models), tokens can be counted with tiktoken directly; the encoding for each model can be found in the Encodings table above.

import tiktoken


def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
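For example, counting the tokens of a short, hypothetical conversation:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many tokens is this conversation?"},
]
print(num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"))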

If you want more detailed information about the tokenization process, you can refer to the tiktoken documentation.

Tokenization with vision

For models with vision (multimodal models) like MaaS-4-turbo or MaaS-4o, the tokenization process is split into two parts: text and image. The text part can be tokenized with tiktoken directly, and the image part is metered with the image token rules described below.

Vision token count: image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block.

  • All images with detail: low cost a fixed 85 tokens each.

  • Images with detail: high are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then they are scaled so that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of; each of those squares costs 170 tokens, and another 85 tokens are always added to the final total.

Here are some examples demonstrating the above.

  • A 1024 x 1024 square image in detail: high mode costs 765 tokens. 1024 is less than 2048, so there is no initial resize. The shortest side is 1024, so we scale the image down to 768 x 768. Four 512px square tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.

  • A 2048 x 4096 image in detail: high mode costs 1105 tokens. We scale down the image to 1024 x 2048 to fit within the 2048 square. The shortest side is 1024, so we further scale down to 768 x 1536. Six 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105.

  • A 4096 x 8192 image in detail: low mode costs 85 tokens. Regardless of input size, low-detail images have a fixed cost.
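These numbers can be checked with a few lines of arithmetic. A minimal sketch of the high-detail formula (high_detail_cost is a hypothetical helper; the full implementation follows below):

from math import ceil

def high_detail_cost(width, height):
    # Step 1: fit within a 2048 x 2048 square, keeping the aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale so the shortest side is 768px
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: 170 tokens per 512px tile, plus a fixed 85 tokens
    tiles = ceil(width / 512) * ceil(height / 512)
    return 85 + 170 * tiles

print(high_detail_cost(1024, 1024))  # 765
print(high_detail_cost(2048, 4096))  # 1105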

from math import ceil
from PIL import Image
import requests
from io import BytesIO
import base64
# num_tokens_from_messages is the function from the previous section,
# assumed to be saved in a module named token_calculation.py
from token_calculation import num_tokens_from_messages


def base64_from_url(url):
    """Fetch an image from a URL and return it as a base64-encoded string."""
    return base64.b64encode(requests.get(url).content).decode()


def calculate_token_cost(image_url, detail_level='auto', prompt="What's in this image?"):
    # Decide whether the input is a base64-encoded data URI or a plain URL
    if image_url.startswith("data:image"):
        # Convert the base64-encoded image back into a PIL image
        image_data = image_url.split(",")[1]
        img = Image.open(BytesIO(base64.b64decode(image_data)))
    else:
        response = requests.get(image_url)
        img = Image.open(BytesIO(response.content))
    width, height = img.size

    # Low-detail images have a fixed cost regardless of size
    if detail_level == 'low':
        return 85

    # 'auto' is treated as 'high' in this sketch

    # Step 1: scale to fit within a 2048 x 2048 square, keeping the aspect ratio
    if width > 2048 or height > 2048:
        aspect_ratio = width / height
        if aspect_ratio > 1:
            width, height = 2048, int(2048 / aspect_ratio)
        else:
            width, height = int(2048 * aspect_ratio), 2048

    # Step 2: scale so that the shortest side is 768px
    if width >= height and height > 768:
        width, height = int((768 / height) * width), 768
    elif height > width and width > 768:
        width, height = 768, int((768 / width) * height)

    # Step 3: count 512px tiles; each costs 170 tokens, plus a fixed 85 tokens
    tiles_width = ceil(width / 512)
    tiles_height = ceil(height / 512)
    total_tokens = 85 + 170 * (tiles_width * tiles_height)

    # Count the text part with tiktoken. num_tokens_from_messages expects
    # string content, so only the text portion of the message is passed here;
    # the image cost is already accounted for in total_tokens above.
    messages = [{"role": "user", "content": prompt}]
    prompt_tokens = num_tokens_from_messages(messages=messages, model="gpt-4-32k-0613")

    return total_tokens + prompt_tokens


# Using a base64-encoded image:
# img = base64_from_url("https://zhuhq.oss-cn-beijing.aliyuncs.com/sdasd.jpeg")
# img = f'data:image/jpeg;base64,{img}'
# print(calculate_token_cost(img))

# Using an image URL:
print(calculate_token_cost("https://zhuhq.oss-cn-beijing.aliyuncs.com/sdasd.jpeg"))

For more detailed information about the tokenization process, you can refer to the tiktoken documentation.