In today’s AI workflows, image captioning isn’t just about labeling objects in a photo — it’s about generating meaningful, human-like descriptions. Whether you’re building accessibility tools, enhancing visual search, or powering social media content, your captions need to be sharp, contextual, and natural.
That’s where CLIP + GPT-4 comes in — a powerful combo that marries visual understanding with natural language generation, all without traditional supervised training pipelines.
📌 What is Image Captioning?
At its core, image captioning is the task of generating a descriptive sentence about an image. Think of it as answering the question:
“What’s going on in this picture?”
Traditional models like Show-and-Tell or CNN+LSTM-based architectures require:
- Large, labeled datasets (e.g., MS COCO)
- Heavy training
- Specialized tuning for each domain
With CLIP + GPT-4, you can skip most of that.
🤝 Why Combine CLIP with GPT-4?
🔹 What is CLIP?
CLIP (Contrastive Language-Image Pretraining), developed by OpenAI, is a multimodal model that learns to match images and text in a shared embedding space. It doesn’t generate text — it understands image semantics and matches them to natural language.
🔹 What is GPT-4?
GPT-4 is a large language model that excels at generating human-like text, including descriptions, narratives, instructions, and even poetry.
🧠 The Synergy
By combining CLIP’s vision understanding with GPT-4’s text generation power, you get a zero-shot image captioning system that can describe almost anything — without fine-tuning or manually labeled data.
🛠️ How It Works: CLIP + GPT-4 for Image Captioning
Here’s a breakdown of the typical pipeline:
1. Encode the image with CLIP’s vision encoder
→ Outputs a feature vector representing the image.
2. Use CLIP’s text encoder to match possible captions
→ Useful for ranking candidate descriptions.
3. Inject the image context into a GPT-4 prompt
→ For example:
“Describe this image in one sentence. The image shows: [CLIP-top concepts].”
4. Let GPT-4 generate the final caption
→ Output is natural, nuanced, and tailored.
You can also feed in CLIP-ranked tags or keywords as part of a prompt engineering strategy.
💻 Hands-On: Building a CLIP + GPT-4 Image Captioning Pipeline
Let’s sketch out a minimal working pipeline (this assumes access to CLIP via open_clip and GPT-4 via the OpenAI API):
1. Install Requirements
pip install openai open-clip-torch torchvision
2. Load CLIP and Encode the Image
import torch
import open_clip
from PIL import Image

# Load the CLIP model and its matching preprocessing transform
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

# Load and preprocess the image (add a batch dimension)
image = preprocess(Image.open("your_image.jpg").convert("RGB")).unsqueeze(0)

# Encode the image into a feature vector
with torch.no_grad():
    image_features = model.encode_image(image)
3. Generate Semantic Tags with CLIP
CLIP works great at ranking concepts. Try generating top tags:
labels = ["a cat", "a sunset", "a mountain", "a person surfing", "a city skyline"]
text_tokens = tokenizer(labels)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)

# Normalize, then compute cosine similarity between image and label embeddings
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze()

# Keep the three best-matching labels as tags
top_idx = similarity.topk(3).indices.tolist()
top_tags = [labels[i] for i in top_idx]
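To sanity-check the ranking before prompting GPT-4, you can turn the similarity scores into probabilities and print them. This is a small optional addition to the snippet above; the 100× scaling roughly mirrors CLIP's learned logit scale and is an assumption here, not part of the original snippet:

```python
# Optional: inspect how strongly CLIP associates the image with each label
probs = (100.0 * similarity).softmax(dim=-1)
for i in top_idx:
    print(f"{labels[i]}: {probs[i].item():.1%}")
```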
4. Prompt GPT-4 with Visual Context
from openai import OpenAI

client = OpenAI(api_key="your_api_key")  # requires openai>=1.0; or set OPENAI_API_KEY

prompt = f"""You're a captioning assistant. Describe the following image based on the visual hints:
Image contains: {', '.join(top_tags)}.
Write a natural, engaging caption.
"""

# Ask GPT-4 to turn the CLIP tags into a fluent caption
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)

caption = response.choices[0].message.content.strip()
print("Generated Caption:", caption)
🔍 Real-World Applications
| Use Case | How CLIP + GPT-4 Helps |
| --- | --- |
| Accessibility | Generate detailed alt-text for blind users |
| Social Media Tools | Auto-caption user photos with context |
| Visual Search Engines | Index and retrieve images by generated text |
| Education | Summarize images in slides or learning apps |
| E-commerce | Describe product images for SEO-rich listings |
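In practice, the main thing that changes across these use cases is the prompt. Here is a short sketch of a hypothetical `build_prompt` helper; the template wording is illustrative, not a recommendation from any particular library:

```python
def build_prompt(tags, use_case="social"):
    """Build a GPT-4 prompt from CLIP tags, tuned to a target use case (hypothetical helper)."""
    templates = {
        "alt_text": "Write concise, factual alt-text for an image containing: {tags}.",
        "social": "Write a short, engaging social media caption for an image containing: {tags}.",
        "ecommerce": "Write an SEO-friendly product description for an image containing: {tags}.",
    }
    return templates[use_case].format(tags=", ".join(tags))

# Reuse the same CLIP tags for different downstream outputs
print(build_prompt(top_tags, "alt_text"))
print(build_prompt(top_tags, "ecommerce"))
```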
⚙️ Why This Beats Traditional Captioning
| Feature | Traditional CNN-RNN | CLIP + GPT-4 |
| --- | --- | --- |
| Needs Labeled Data | ✅ Yes | ❌ No |
| Generalizes to New Domains | ❌ Poor | ✅ Strong |
| Multilingual Friendly | ❌ Limited | ✅ Yes (GPT-4 is multilingual) |
| Output Fluency | 😐 Sometimes robotic | 😎 Human-like |
🧠 Tips for Better Captions
- Use diverse concept lists for CLIP keyword ranking.
- Prompt GPT-4 with tone/style preferences (e.g. “Make it funny” or “Keep it formal”).
- For batch use, cache CLIP outputs and reuse for different prompt strategies.
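For the last tip, here is a minimal caching sketch, assuming images are identified by file path and the features fit in memory (swap in an on-disk store for larger batches):

```python
# Cache CLIP image features so different prompt strategies can reuse them
feature_cache = {}

def get_image_features(path):
    """Encode an image with CLIP once, then serve repeat requests from the cache."""
    if path not in feature_cache:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            feats = model.encode_image(img)
        feature_cache[path] = feats / feats.norm(dim=-1, keepdim=True)
    return feature_cache[path]

# The cached features can now back multiple prompt strategies without re-encoding
features = get_image_features("your_image.jpg")
```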
🔚 Final Thoughts
CLIP + GPT-4 is a powerful, flexible image captioning setup that doesn’t require training data, scales well, and generates captions that sound like they were written by humans — because they basically are.
If you want captions that are more than just “a man holding a phone,” and you’re tired of building image classifiers from scratch, this zero-shot combo is your next best friend.
✅ Try the live CLIP + GPT-4 image captioning demo on Hugging Face here.