Introduction
Imagine asking an AI a complex coding or math question and having it reason step by step before answering. That’s exactly what Alibaba’s new Qwen3 models promise: a “hybrid” way of thinking that can slow down for tough problems. This guide is a friendly, plain-English introduction to Qwen3, explaining its main ideas (like Mixture-of-Experts and thinking mode) and why they matter.
We’ll use analogies and even real benchmark examples to show how Qwen3 performs and compare it to other top models like GPT-4o and Google’s Gemini. By the end, you’ll have a solid grasp of what makes Qwen3 special, and even know how to try it yourself for coding, reasoning, or multilingual tasks.
![Qwen3: Think Deeper, Act Faster with Open Models]()
What is Qwen3?
Qwen3 is the latest series of open-source large language models (LLMs) from Alibaba Cloud’s Qwen team. A “large language model” is just a neural network trained on huge amounts of text, so it can understand and generate human-like writing (imagine a very smart digital librarian or tutor). The Qwen3 family comes in eight models: two Mixture-of-Experts (MoE) models and six dense models. The MoE models are named Qwen3-235B-A22B (235 billion parameters, with 22B active at inference) and Qwen3-30B-A3B (30B total, 3B active). The six dense models are Qwen3-32B, 14B, 8B, 4B, 1.7B, and 0.6B. All are released under an open Apache 2.0 license, meaning anyone can download and run them freely.
A dense model is like a single generalist expert: it tries to handle every question with one big brain. A Mixture-of-Experts (MoE) model is different – imagine a kitchen of specialist chefs. When a question comes in, only the few chefs (experts) who know that topic wake up and cook, instead of everyone working at once. This means Qwen3’s MoE models can pack huge “knowledge” while only using a fraction of the total compute for each task. In other words, the model only uses as much compute as needed for each query, making it very efficient.
Here’s an analogy: a dense model is like one super-chef who can make any dish, while a Mixture-of-Experts model is like a restaurant with specialists. If you want noodles, the noodle expert and perhaps the sauce expert jump in – the others wait. This lets the restaurant serve many cuisines without having every chef busy all the time. In AI terms, Qwen3’s MoE approach is what allows Alibaba to scale up to those gigantic parameter counts without paying the full cost of every query.
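If you like seeing ideas in code, here is a toy sketch of the top-k routing at the heart of MoE layers. It is purely illustrative – the expert count, the gate, and `k` below are invented for clarity and are not Qwen3’s actual architecture:

```python
import numpy as np

def moe_layer(x, experts, gate_weights, k=2):
    """Toy Mixture-of-Experts: route input x to the top-k experts only.

    x: input vector; experts: list of callables (the 'chefs');
    gate_weights: matrix scoring how relevant each expert is to x.
    """
    scores = gate_weights @ x            # one relevance score per expert
    top_k = np.argsort(scores)[-k:]      # wake only the k best-matching experts
    weights = np.exp(scores[top_k])
    weights /= weights.sum()             # softmax over just the chosen experts
    # Only k experts actually run; the rest stay idle, saving compute.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
dim, n_experts = 8, 4
experts = [lambda v, W=rng.standard_normal((dim, dim)): W @ v for _ in range(n_experts)]
gate = rng.standard_normal((n_experts, dim))
print(moe_layer(rng.standard_normal(dim), experts, gate))
```

In a real MoE transformer the gate is learned and routing happens per token, but the budget logic is the same: only a few of the many experts do work for any given input.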
Thinking vs. Non-Thinking Modes: Two Ways to Answer
One of the most exciting features of Qwen3 is that it can switch between two modes of reasoning: a fast “non-thinking” mode and a slower “thinking” mode. Think of it like two gears.
- Non-Thinking Mode: This is the “blitz” or “instant” mode. The model answers as quickly as possible, giving you a response in one shot. It’s great for simple questions or casual chat when you want an answer right away. For example, if you ask “What’s the capital of France?”, you’d get the immediate answer “Paris” without any extra steps, because a quick answer is fine for such questions.
- Thinking Mode: This is the “step-by-step” mode. Here, Qwen3 takes its time to reason through the problem before answering. It might even produce intermediate steps or use more tokens internally to break the problem down. This is ideal for harder problems – say, complex math, tricky coding bugs, or logic puzzles. In this mode, you’re essentially allowing the model to “show its work” before giving the final answer.
This hybrid design gives you control over how much “thinking” the AI does. For tough tasks, you can switch into thinking mode (for example, by using a `/think` command or button), and the model will spend more computational budget reasoning carefully. For easy questions or chats, stick with non-thinking mode to get instant answers. Internally, Qwen3 smoothly adjusts how much compute it uses. As Alibaba’s team explains, this lets the model scale its performance with the allocated “thinking budget”. The chart later in this article shows how performance climbs as more thinking tokens are allowed.
In short, you can think of Qwen3 as having two brains: a quick-answer brain and a deep-thought brain. This flexibility means you can ask it “think mode on” for a detailed solution, or “fast mode” when you just need a speedy reply. It’s a novel idea in consumer AI that tries to give users the best of both worlds.
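In code, the gear shift is literally one flag. Here is a minimal sketch using the real `enable_thinking` argument (the full, runnable example in the Develop with Qwen3 section below shows the complete workflow; Qwen3-0.6B is used here only because it’s the smallest download):

```python
from transformers import AutoTokenizer

# Any Qwen3 checkpoint works; 0.6B is just the smallest to fetch.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
messages = [{"role": "user", "content": "What's the capital of France?"}]

# Deep-thought gear: the model is prompted to reason first
# (it may emit a <think>...</think> block before the answer).
slow_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Quick-answer gear: same question, no reasoning block, instant reply.
fast_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```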
How Qwen3 Learned to Think: Training and Pre-training
Under the hood, Qwen3 has been trained on an enormous amount of text. The Qwen team reports that the initial pre-training dataset was about 36 trillion tokens of text, roughly double what they used for Qwen2.5. (A token is like a word-piece; 36 trillion is astronomically large.) This training data came from across the web and many documents in 119 languages and dialects. In plain terms, Qwen3 has “read” an enormous multilingual library, which helps it understand and generate text in many languages.
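If “token” feels abstract, you can count tokens yourself. A quick sketch using the Qwen3 tokenizer (the sizes share a tokenizer, so the smallest checkpoint suffices):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
text = "Qwen3 has read an enormous multilingual library."
ids = tok(text).input_ids
print(len(ids), "tokens:", tok.convert_ids_to_tokens(ids))
# The 36-trillion-token pre-training corpus is a few trillion times this sentence.
```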
The training was done in stages, each focusing on different skills. In Stage 1, the model saw over 30 trillion tokens (4K context length) of general data – this built its basic language understanding. In Stage 2, another 5 trillion tokens of knowledge-intensive data (like science, math, coding) were added, sharpening its technical reasoning skills. Finally, Stage 3 used specially curated long-text data so the model could handle 32K token documents. You can imagine this like schooling: first elementary general knowledge, then specialised higher-level subjects, and then training on reading long textbooks. Thanks to this process, even the small Qwen3 models perform much better than expected – for example, Qwen3-4B (dense) now does as well on math and coding as older 32B models did.
After pre-training, Qwen3 underwent extra fine-tuning to improve reasoning. The team used a four-stage pipeline: they fine-tuned the model on long “chain-of-thought” examples (math and coding problems worked out step-by-step), then used reinforcement learning to reinforce good reasoning. They then blended this reasoning model with regular instruction data so the model could also answer straight questions. The end result is that Qwen3 can both “think out loud” when needed and also keep answers concise. In short, the model was explicitly taught to reason step-by-step on hard problems, which is why the thinking mode works so well in practice.
Qwen3 in Action: Benchmarks and Performance
How does Qwen3 actually perform on tasks? Very impressively, especially in coding, math, and multilingual problems. Let’s look at a few benchmark highlights and comparisons:
- Coding and Programming: Qwen3 shines on coding tests. For instance, on ArenaHard – a set of 500 challenging software/math problems – the big MoE model Qwen3-235B-A22B scored 95.6%, beating OpenAI’s o1 (about 92.1%) and DeepSeek R1 (93.2%). It even nearly matches Google’s Gemini 2.5-Pro (96.4%) on that benchmark. The table below shows that Qwen3’s models (especially the MoE ones) top the charts on coding benchmarks compared to other AIs:
![Coding and Programming]()
Image © Qwen3. Coding benchmark results: Qwen3’s MoE models (blue highlights) have higher scores on coding tests (ArenaHard, LiveCodeBench, Codeforces Elo) than competitors like DeepSeek-R1, OpenAI o1, Google’s Gemini 2.5-Pro, and even GPT-4o.
Even the smaller MoE, Qwen3-30B-A3B, outperforms much bigger dense models. In many coding contests, Qwen3-30B-A3B (with 3B active params) beats the older QwQ-32B (dense) by a large margin, despite activating roughly a tenth of the parameters. In practice, this means hobbyists running the 30B MoE on a high-end GPU can get performance comparable to a much larger model, saving time and money.
- Reasoning and Math: Qwen3’s thinking mode makes a huge difference in hard problems. The chart below illustrates how giving the model more “thinking” budget boosts accuracy on math and coding tasks:
![Hybrid Thinking Modes]()
Image © Qwen3. Pass@1 vs. thinking budget: Qwen3 with thinking mode (blue line) dramatically outperforms the baseline (red dashed) as more tokens are allowed. On AIME math contests and LiveCodeBench coding problems, accuracy jumps from ~40% to 80%+ as the model is allowed to think more.
With thinking mode turned on, Qwen3 goes from solving ~40% of AIME-2024 math problems to 85%+ as the token budget rises. This means it can tackle very difficult questions by breaking them down. Even without thinking mode, Qwen3 is strong: for example, on the graduate-level science benchmark GPQA, Qwen3-30B-A3B scores about 65.8%, compared to GPT-4o’s 46.0%. Its smaller 4B model also rivals GPT-4o on an 8-language instruction-following test (MultiIF – Qwen3-4B 66.3% vs. GPT-4o 65.6%).
![Reasoning benchmarks]()
Image © Qwen3. Reasoning benchmarks: Qwen3-30B-A3B (blue) and Qwen3-4B (highlighted) outperform other models like GPT-4o, DeepSeek-V3, and Gemini on tasks such as AIME’24, AIME’25, LiveCodeBench, GPQA, and the multilingual MultiIF.
In summary, across coding, math, and reasoning tasks, Qwen3 (especially the MoE models) sits at the top of open-model leaderboards. It often matches or even surpasses proprietary models like GPT-4o and Gemini on these benchmarks – a big deal for open-source AI.
- Multilingual and General Abilities: Qwen3’s training on 119 languages pays off. It can understand and generate text in dozens of languages and dialects. For example, on a multilingual instruction-following test covering 8 languages (MultiIF), even the 4B model achieves about 66.3% accuracy, slightly above GPT-4o’s ~65.6%. This means developers worldwide can use Qwen3 effectively in their native language. On top of raw tests, Qwen3 has also been fine-tuned for natural conversation. It excels at creative writing, role-play, and multi-turn dialogue, aiming to follow instructions and keep chats engaging. In other words, it’s not just a calculator – it’s designed to be a more human-like and helpful AI partner.
Getting Started with Qwen3
The great news is Qwen3 is open for anyone to try. The models and demos are publicly released on platforms like Hugging Face, ModelScope, and even Kaggle. You can go to the Hugging Face Qwen3 collection and find all the model checkpoints (look for names starting with `Qwen3-`). There’s also a Qwen Chat website (chat.qwen.ai) and mobile app where you can chat with Qwen3 directly and switch on “Thinking Mode” with a button.
If you want to run Qwen3 on your own machine or server, there are many tools. For example, the Ollama tool lets you start Qwen3 easily – you can run something like `ollama run qwen3:30b-a3b` to chat with the 30B MoE model locally. There are also frameworks like vLLM or SGLang for fast GPU serving, and lighter-weight libraries like llama.cpp or KTransformers for running smaller models (e.g. Qwen3-4B) on a PC. Even LM Studio and MLX support Qwen3 models.
Here are some quick ways hobbyists and developers can experiment with Qwen3:
- Hugging Face: Browse the Qwen3 model collection and run the hosted demos online.
- Qwen Chat (Web): Chat with Qwen3 in your browser. Use the “Thinking Mode” toggle to compare fast vs. deep answers.
- Ollama: If you have a beefy GPU, install Ollama, then run `ollama pull qwen3:30b-a3b` followed by `ollama run qwen3:30b-a3b` to chat locally (see the Python sketch after this list).
- llama.cpp or KTransformers: Good for running smaller Qwen3 models (4B or 1.7B) on a PC or even a CPU.
- APIs/Frameworks: For power users, use SGLang or vLLM to serve Qwen3-235B across multiple GPUs in the cloud.
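As a concrete example of the Ollama route, once the model is pulled you can also script it from Python with the `ollama` client package. A minimal sketch (depending on the client version, `response.message.content` works as well):

```python
import ollama  # pip install ollama

# Chat with the locally served 30B MoE model.
response = ollama.chat(
    model="qwen3:30b-a3b",
    messages=[{"role": "user", "content": "In one line: what is Mixture-of-Experts?"}],
)
print(response["message"]["content"])
```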
No matter which model or tool you choose, you’ll notice that Qwen3 is quite capable even at smaller scales. Its open-weight license and optimised design make it friendly for experimentation. Pro tip: always try switching into `/think` mode (or its interface equivalent) when solving tough problems – you’ll often see much better results.
Develop with Qwen3
Below is a simple guide to using Qwen3 with different frameworks. First, here is a standard example of using Qwen3-30B-A3B with Hugging Face transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switch between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
To disable thinking, you just need to change the `enable_thinking` argument, like the following:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # True is the default value for enable_thinking.
)
For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` to create an OpenAI-compatible API endpoint.
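Once such an endpoint is up, any OpenAI-style client can talk to it. A minimal sketch, assuming the server listens on localhost:8000 and serves Qwen3-30B-A3B (adjust the base URL and model name to your deployment):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local, OpenAI-compatible Qwen3 endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Explain Mixture-of-Experts in one sentence. /no_think"}],
)
print(completion.choices[0].message.content)
```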
For local development, you can use Ollama by running the simple command `ollama run qwen3:30b-a3b` to play with the model, or use LM Studio, llama.cpp, or KTransformers to build locally.
Advanced Usages
We provide a soft switch mechanism that allows users to dynamically control the model’s behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model’s thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of a multi-turn conversation:
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-30B-A3B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
Agentic Usages
Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
To define the available tools, you can use an MCP configuration file, use Qwen-Agent’s built-in tools, or integrate other tools yourself.
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-30B-A3B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: when the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: when the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation: bot.run yields the response list incrementally,
# so the loop keeps only the final, complete set of messages.
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
Conclusion
Qwen3 represents an exciting new step in AI: a family of open-source models that not only scales up in size but also lets you control how it thinks. Its dual-mode design (thinking vs. non-thinking) is like giving the AI a conscious choice of gears, which is pretty novel in consumer AI. For beginners, the takeaway is that Qwen3 is both powerful and accessible. It was trained on a mind-boggling amount of data (36T tokens, 119 languages) and holds its own against giants like GPT-4o and Gemini on coding and reasoning tasks. Yet, because it’s open-source, you can download it or run it in your projects without special restrictions. Whether you want a coding assistant, a math tutor, or a multilingual chatbot, Qwen3 provides the tools to build it.
So give Qwen3 a try! Ask it to “explain a formula” in thinking mode, or have a quick chat in non-thinking mode. Play with both the big models if you have the hardware, or start with a smaller one on your laptop. Thanks to platforms like Hugging Face and tools like Ollama, it’s never been easier to experiment with cutting-edge AI at home. Think of Qwen3 as your new AI lab partner – a conversational assistant that can take a problem, mull it over, and help you solve it step by step, or just quickly answer trivia depending on your needs.
Qwen3 bridges high-end AI capability and user-friendly control. With its open weights and innovative thinking mode, it’s like having a smarter, more thoughtful chat buddy by your side. Happy experimenting!
Reference
Qwen Team: “Qwen3: Think Deeper, Act Faster” (https://qwenlm.github.io/blog/qwen3/)