What are the best Ollama models in 2025? Ollama now offers over 100 open-source AI models for local deployment, ranging from tiny 270M parameter models to massive 671B reasoning systems. The most popular choices are Llama 3.1 8B for general use (108M+ downloads), DeepSeek-R1 for advanced reasoning (75M+ downloads), and Gemma 3 for efficient multimodal tasks (28M+ downloads). This guide covers every model available on Ollama, helping you choose the right one for your specific needs.

Running AI locally has become essential for developers, researchers, and businesses who need privacy, cost control, and offline capability. With Ollama making local deployment as simple as running a single command, the only question remaining is which model to choose.

This comprehensive guide examines every model in the Ollama library, providing the technical details, performance characteristics, and practical recommendations you need to make informed decisions.

What Is Ollama and Why Does Model Selection Matter?

Ollama is an open-source platform that simplifies running large language models locally on your hardware. Instead of sending data to cloud APIs like OpenAI or Anthropic, you download models once and run them entirely on your machine. Your data never leaves your device.

The platform handles the complexity of model quantization, memory management, and optimization automatically. You run ollama run llama3.1 and start chatting within minutes.

Model selection matters because each model has different strengths:

Parameter count affects capability and memory requirements
Training focus determines whether models excel at code, reasoning, or conversation
Quantization level trades quality for speed and memory efficiency
Context window limits how much text the model can process at once
Architecture type (dense vs. Mixture-of-Experts) impacts efficiency and specialization

Choosing the wrong model wastes hardware resources or leaves performance on the table. This guide helps you match models to your actual needs.

The Meta Llama Family: The Foundation of Local AI

Meta's Llama models form the backbone of local AI. They are the most widely used, best supported, and most thoroughly tested models available. The December 2024 release of Llama 3.3 changed the conversation around open-source large language models, delivering performance comparable to much larger models at a fraction of the computational cost.

Llama 3.3 (70B Parameters)

Llama 3.3 is Meta's latest flagship model, released in December 2024. It offers performance comparable to the much larger Llama 3.1 405B while requiring only 43GB of storage, representing a major advancement in efficient model design.

Key Specifications:

Parameters: 70 billion
Context Window: 128K tokens
Size: 43GB
Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
Downloads: 2.9 million
Training Data: 15 trillion tokens from public sources (7x larger than Llama 2)
License: Llama 3.3 Community License

Architecture Details:

Llama 3.3 is an auto-regressive language model using an optimized transformer architecture with several key innovations:

Grouped-Query Attention (GQA): Improves inference scalability and efficiency
128K Vocabulary Tokenizer: Encodes language more efficiently than previous versions
Supervised Fine-Tuning (SFT): Aligns model behavior with human preferences
Reinforcement Learning with Human Feedback (RLHF): Ensures helpfulness and safety

Benchmark Performance:

Benchmark	Llama 3.3 70B	Comparison
MMLU Chat (0-shot, CoT)	86.0	Matches Llama 3.1 70B, competitive with Amazon Nova Pro (85.9)
MMLU PRO (5-shot, CoT)	68.9	Improved over Llama 3.1 70B
GPQA Diamond (0-shot, CoT)	50.5	Better than Llama 3.1 70B (48.0)
HumanEval (0-shot)	88.4	Near Llama 3.1 405B (89.0)
MBPP EvalPlus	87.6	Slight improvement over Llama 3.1 70B (86.0)
MATH (0-shot, CoT)	77.0	Major improvement over Llama 3.1 70B (67.8)
MGSM (0-shot)	91.1	Substantial improvement over Llama 3.1 70B (86.9)
IFEval	92.1	Excellent instruction-following

Inference Performance:

Achieves 276 tokens/second on Groq hardware (25 tokens/second faster than Llama 3.1 70B)
NVIDIA TensorRT-LLM with speculative decoding achieves up to 3.55x throughput speedup on HGX H200

Cost Efficiency:

Input tokens: $0.10 per million (vs. $1.00 for Llama 3.1 405B)
Output tokens: $0.40 per million (vs. $1.80 for Llama 3.1 405B)

Best For: Users who need maximum capability and have RTX 4090 or Apple Silicon with 64GB+ memory. This model approaches GPT-4 quality for many tasks while running locally.

Hardware Requirements: Minimum 64GB RAM or 24GB VRAM with CPU offloading. Runs well on M2 Max or M3 Max MacBooks.

Llama 3.2 (1B and 3B Parameters)

Llama 3.2 represents Meta's push into efficient, edge-deployable models. Released in September 2024, these are designed for devices with limited resources and represent a new era of on-device AI.

Key Specifications:

Parameters: 1B (1.3GB) or 3B (2.0GB)
Context Window: 128K tokens
Languages: 8 officially supported (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)
Training Data: Up to 9 trillion tokens from public sources
Downloads: 51 million

Architecture & Training Innovation:

The 1B and 3B models were created using two innovative techniques:

Pruning: Started with Llama 3.1 8B and systematically removed less critical network components
Knowledge Distillation: Used logits from Llama 3.1 8B and 70B as token-level targets during training

This approach allowed Meta to create models that retain much of the capability of larger models in a fraction of the size.

Edge Deployment Features:

Compatible with Qualcomm, MediaTek, and ARM processors
Designed for mobile and IoT applications
Instantaneous local processing without cloud latency
Complete data privacy with no cloud transmission
Works on devices with as little as 8GB RAM

Benchmark Performance:

Benchmark	Llama 3.2 1B	Llama 3.2 3B	Gemma 2B IT	Phi 3.5-mini IT
MMLU (5-shot)	-	63.4	57.8	69.0
IFEval	-	77.4	61.9	59.2
TLDR9 (summarization)	16.8	19.0	-	-
BFCL V2 (tool use)	25.7	67.0	-	-

The 3B model outperforms Gemma 2 2.6B and Phi 3.5-mini on instruction-following, summarization, and tool use benchmarks. Notably, Llama 3.2 3B significantly outperformed the original GPT-4 on the MATH benchmark.

Best For: Mobile development, IoT applications, and situations where you need AI on resource-constrained devices. Also excellent for rapid prototyping when speed matters more than maximum quality.

Hardware Requirements: Runs on any modern hardware. Even laptops with 8GB RAM handle these models comfortably.

Llama 3.2-Vision (11B and 90B Parameters)

The vision variants are Meta's first Llama models to support vision tasks, adding powerful image understanding capabilities to the Llama architecture through a novel adapter-based approach.

Key Specifications:

Parameters: 11B (7.8GB) or 90B (55GB)
Context Window: 128K tokens
Capabilities: Image reasoning, captioning, visual question answering, OCR, object detection
Input: Text and images (up to 1120x1120 pixels)
Training Data: 6 billion image-text pairs

Architecture Innovation:

Llama 3.2 Vision introduces a unique architecture combining:

Base Language Model: Llama 3.1 8B (for 11B) or Llama 3.1 70B (for 90B)
Vision Encoder: Separately trained image processing component
Cross-Attention Adapters: Connect image representations to the language model

The key innovation is that the language model parameters remained frozen during vision adapter training, preserving all text-only capabilities. This means Llama 3.2-Vision serves as a drop-in replacement for Llama 3.1 for text tasks.

Image Processing Capabilities:

High-resolution support up to 1120x1120 pixels
Object classification and identification
Image-to-text transcription (including handwriting) via OCR
Chart and graph understanding
Document analysis and data extraction
Contextual visual Q&A
Image comparison

Grouped-Query Attention (GQA): All models support GQA for faster inference, particularly beneficial for the larger 90B model.

Best For: Applications requiring image analysis: document processing, visual content moderation, image-based research assistance, visual accessibility tools. The 11B variant is the sweet spot for most users.

Limitations: Image+text combinations only support English. Text-only tasks support the full 8-language set.

Hardware Requirements:

11B: 16GB VRAM or 32GB RAM
90B: 64GB+ VRAM or distributed setup

Llama 3.1 (8B, 70B, and 405B Parameters)

Llama 3.1 remains the workhorse of local AI, with the 8B version being the most downloaded model on Ollama at over 108 million downloads. The 405B variant was the first openly available model to rival GPT-4 and Claude 3 Opus in capability.

Key Specifications:

Sizes: 8B (4.9GB), 70B (43GB), 405B (243GB)
Context Window: 128K tokens
Capabilities: Tool use, multilingual, long-form summarization, coding
Training Data: 15 trillion tokens

Architectural Improvements over Llama 3:

Extended context window from 8K to 128K tokens
Improved tokenizer efficiency
Enhanced multilingual capabilities
Native tool use and function calling support

Best For:

8B: Everyday professional work, document summarization, code generation, content drafting. The best balance of capability and accessibility.
70B: Complex analysis, detailed reasoning, high-stakes professional applications
405B: Research and enterprise applications requiring maximum capability. First open model to truly compete with GPT-4.

Hardware Requirements:

8B: 8GB VRAM or 16GB RAM
70B: 64GB RAM or distributed GPU setup
405B: Multiple high-end GPUs or specialized infrastructure (typically 8x 80GB GPUs)

Llama 3 (8B and 70B Parameters)

The previous generation remains useful for applications optimized for its architecture, though Llama 3.1 is recommended for new projects.

Key Specifications:

Sizes: 8B (4.7GB), 70B (40GB)
Context Window: 8K tokens
Downloads: 13.2 million

Best For: Legacy compatibility or when the shorter 8K context window is sufficient. Generally recommend Llama 3.1 for new projects due to the larger context window and improved capabilities.

Llama 2 (7B, 13B, and 70B Parameters)

The foundation that started the open-source AI revolution in July 2023.

Key Specifications:

Sizes: 7B (3.8GB), 13B (7.4GB), 70B (39GB)
Context Window: 4K tokens
Training: 2 trillion tokens
Downloads: 4.9 million

Best For: Research comparisons, fine-tuning base models, or applications where you have existing Llama 2 infrastructure. Historical significance as the model that democratized large language model access.

Llama 2 Uncensored (7B and 70B Parameters)

A variant of Llama 2 with safety guardrails removed, created using Eric Hartford's uncensoring methodology.

Key Specifications:

Sizes: 7B (3.8GB), 70B (39GB)
Context Window: 2K tokens
Downloads: 1.5 million

Best For: Research purposes, creative writing without restrictions, or applications where you need the model to engage with topics the standard version refuses.

Caution: Use responsibly. The lack of guardrails means the model will attempt to comply with any request. Not recommended for production applications without additional safety measures.

DeepSeek Models: The Reasoning Revolution

DeepSeek has emerged as a major force in open-source AI, particularly with reasoning-focused models. Their January 2025 release of DeepSeek-R1 demonstrated that reinforcement learning alone could produce emergent reasoning capabilities rivaling frontier closed-source models—at a fraction of the training cost.

DeepSeek-R1 (1.5B to 671B Parameters)

DeepSeek-R1 is a family of open reasoning models that approach the performance of OpenAI's o1 and Google's Gemini 2.5 Pro. The full model represents one of the most significant open-source AI releases of 2025.

Key Specifications:

Sizes: 1.5B (1.1GB), 7B (4.7GB), 8B (5.2GB), 14B (9.0GB), 32B (20GB), 70B (43GB), 671B (404GB)
Context Window: 128K-160K tokens
Downloads: 75.2 million
License: MIT (fully permissive)
Training Cost: Approximately $5.6 million (significantly lower than competing models)

Architecture: Mixture of Experts (MoE)

DeepSeek-R1 leverages a sophisticated MoE framework:

Total Parameters: 671 billion
Active Parameters: Only 37 billion per inference (5.5% activation rate)
Experts per Layer: 256 routed experts
Selected Experts: 8 per query (typically 2-4 for simpler tasks)

Key architectural innovations include:

Multi-Head Latent Attention (MLA): Dramatically reduces KV cache size, a common bottleneck in transformers. Enables faster inference and longer text generation.
Expert Routing Mechanism: Lightweight gating network assigns probability distributions over experts. Top-ranked experts process queries in parallel.
Multi-Token Prediction (MTP): Improves generation efficiency.

Revolutionary Training Methodology:

DeepSeek-R1's training process represents a breakthrough in AI development:

DeepSeek-R1-Zero: Trained via large-scale reinforcement learning (RL) without supervised fine-tuning. Remarkably, powerful reasoning behaviors emerged naturally from pure RL.
Group Relative Policy Optimization (GRPO): A novel RL algorithm from the DeepSeekMath paper. Built on PPO (Proximal Policy Optimization), GRPO enhances mathematical reasoning while reducing memory consumption.
Multi-Stage Pipeline:
- Stage 1: Pure RL to discover reasoning patterns
- Stage 2: Supervised Fine-Tuning (SFT) on synthesized reasoning data
- Stage 3: Second RL phase for helpfulness and harmlessness

Distilled Models:

The smaller models (1.5B-70B) are distilled from the full 671B model, demonstrating that reasoning patterns from larger models can effectively transfer to smaller ones. This makes advanced reasoning accessible on consumer hardware.

Benchmark Performance:

DeepSeek-R1 matches or exceeds frontier models on reasoning benchmarks:

Competitive with OpenAI o1 on mathematical reasoning
Approaches GPT-4 Turbo on code generation
Exceeds many closed models on logic and scientific analysis

Hardware Requirements:

7B-8B: 8GB VRAM
14B: 12GB VRAM
32B: 24GB VRAM
70B: 48GB+ VRAM or large RAM
671B: Minimum 800GB HBM in FP8 format; requires 64-way expert parallelism across multiple GPUs

Best For: Mathematical reasoning, programming challenges, logical problem-solving, scientific analysis. The 14B-32B range offers the best balance of capability and hardware requirements for most users.

DeepSeek-Coder (1.3B to 33B Parameters)

A coding-focused model trained on 87% code and 13% natural language, optimized for programming tasks.

Key Specifications:

Sizes: 1.3B (776MB), 6.7B (3.8GB), 33B (19GB)
Context Window: 16K tokens
Training: 2 trillion tokens
Downloads: 2.4 million

Best For: Code completion, code generation, programming assistance, and technical documentation. Excellent for developers who need a dedicated coding assistant.

DeepSeek-Coder-V2 (16B and 236B Parameters)

An advanced Mixture-of-Experts coding model that achieves GPT-4 Turbo-level performance on code tasks—the first open model to reach this milestone.

Key Specifications:

Sizes: 16B (8.9GB), 236B (133GB)
Active Parameters: 2.4B (16B model), 21B (236B model)
Context Window: Up to 160K tokens
Architecture: Mixture-of-Experts with Multi-Head Latent Attention
Programming Languages: 338 supported
Training Data: 10.2 trillion tokens (60% code, 10% mathematics)
Downloads: 1.3 million

Architecture Innovations:

MoE Efficiency: The 236B model uses only 21B active parameters per inference, achieving high performance without prohibitive compute costs.
Multi-Head Latent Attention (MLA): Reduces KV cache size dramatically, enabling faster inference and longer context handling.

Benchmark Performance:

Benchmark	DeepSeek-Coder-V2 236B	Comparison
HumanEval	90.2%	New state-of-the-art
MBPP	76.2%	New state-of-the-art
MATH	75.7%	Near GPT-4o (76.6%)

Hardware Requirements:

16B (Lite): Single GPU with 40GB VRAM in BF16
236B (Full): 8x 80GB GPUs for BF16 inference

Best For: Professional development environments, code review automation, and complex programming tasks requiring maximum accuracy.

Google Gemma Family: Efficiency Meets Capability

Google's Gemma models leverage technology from the Gemini family in compact, efficient packages. The March 2025 release of Gemma 3 established new standards for what's possible on a single GPU.

Gemma 3 (270M to 27B Parameters)

Gemma 3 is Google's latest and most capable model family that runs on a single GPU, bringing Gemini-class capabilities to local deployment.

Key Specifications:

Sizes: 270M (text only), 1B, 4B, 12B, 27B
Context Window: 32K tokens (1B), 128K tokens (4B and larger)
Languages: 35+ out-of-the-box, 140+ pretrained support
Multimodal: 4B and larger process both text and images
Downloads: 28.9 million
Training Data: 14T tokens (27B), 12T tokens (12B), 4T tokens (4B), 2T tokens (1B)

Architecture Innovations:

Gemma 3 introduces several architectural improvements:

Interleaved Attention Blocks: Each block contains 5 local attention layers (sliding window of 1024) and 1 global attention layer. This captures both short and long-range dependencies efficiently.
Enhanced Positional Encoding: Upgraded RoPE (Rotary Positional Embedding) with base frequency increased from 10K to 1M for global layers, maintaining 10K for local layers.
Improved Normalization: QK-norm for stable attention scores, replacing soft-capping from Gemma 2. Uses Grouped-Query Attention (GQA) with both post-norm and pre-norm RMSNorm.
Memory Efficiency: Architectural changes reduce KV cache overhead during long-context inference compared to global-only attention.

Vision Integration (4B+):

Vision Encoder: Based on SigLIP for processing images
Pan & Scan Algorithm: Adaptively crops and resizes images to handle different aspect ratios
Fixed Processing Size: Vision encoder operates on 896x896 square images

Benchmark Performance:

Benchmark	Gemma 3 27B	Notes
MMLU-Pro	67.5	Strong general knowledge
LiveCodeBench	29.7	Competitive coding
Bird-SQL	54.4	Database queries
GPQA Diamond	42.4	Graduate-level reasoning
MATH	69.0	Mathematical ability
FACTS Grounding	74.9	Factual accuracy
MMMU	64.9	Multimodal understanding
LM Arena Elo	1338	Top 10 overall (March 2025)

The 27B model outperforms Llama3-405B, DeepSeek-V3, and o3-mini in preliminary human preference evaluations on LMArena.

Additional Features:

Function calling and structured output support
Official quantized versions available
Runs efficiently on workstations, laptops, and smartphones

Hardware Requirements:

270M-1B: Any modern hardware
4B: 6GB VRAM
12B: 12GB VRAM
27B: 20GB+ VRAM

Best For: Multilingual applications, multimodal projects, and situations where you need strong performance with reasonable hardware. The 12B variant is particularly efficient for its capability level.

Gemma 2 (2B, 9B, and 27B Parameters)

The previous generation remains excellent for many applications, offering proven reliability and broad compatibility.

Key Specifications:

Sizes: 2B (1.6GB), 9B (5.4GB), 27B (16GB)
Context Window: 8K tokens
Downloads: 12.3 million

The 27B variant delivers "performance surpassing models more than twice its size" according to Google's benchmarks.

Best For: Creative text generation, chatbots, content summarization, NLP research, and language learning applications where Gemma 3's longer context isn't needed.

Gemma (2B and 7B Parameters)

The original Gemma release from February 2024, lightweight but capable.

Key Specifications:

Sizes: 2B (1.7GB), 7B (5.0GB)
Context Window: 8K tokens
Training: Web documents, code, mathematics

Best For: Edge deployments, resource-constrained environments, and applications needing a small but capable model with Google's quality standards.

CodeGemma (2B and 7B Parameters)

Google's code-specialized variant optimized for IDE integration and code completion.

Key Specifications:

Sizes: 2B (1.6GB), 7B (5.0GB)
Context Window: 8K tokens
Languages: Python, JavaScript, Java, Kotlin, C++, C#, Rust, Go, and others
Training: 500 billion tokens including code and mathematics
Fill-in-the-Middle: Supported for code completion

Best For: IDE integration, code completion, fill-in-the-middle tasks, and coding assistant applications.

Alibaba Qwen Family: Multilingual Excellence

Qwen models from Alibaba excel at multilingual tasks and offer excellent performance across the capability spectrum. The April 2025 release of Qwen3 introduced revolutionary hybrid reasoning capabilities.

Qwen3 (0.6B to 235B Parameters)

The latest Qwen generation provides both dense and Mixture-of-Experts variants with groundbreaking hybrid reasoning modes.

Key Specifications:

Dense Models: 0.6B, 1.7B, 4B, 8B (default), 14B, 32B
MoE Models: 30B-A3B (30B total, 3B active), 235B-A22B (235B total, 22B active)
Context Window: 32K-128K tokens
Languages: 119 languages and dialects
Training Data: 36 trillion tokens
License: Apache 2.0

Architecture: Dense and MoE Variants

Dense models use traditional transformer architecture where all parameters contribute during inference.

MoE models feature:

128 expert FFNs per layer
8 experts selected per token
Extended 128K context support

Revolutionary Hybrid Reasoning Modes:

Qwen3's most significant innovation is unifying two reasoning approaches in one model:

Thinking Mode: The model reasons step-by-step before delivering answers. Ideal for complex problems requiring deeper thought.
Non-Thinking Mode: Quick, near-instant responses for simpler questions where speed matters more than depth.

This eliminates the need to switch between chat-optimized models (like GPT-4o) and dedicated reasoning models (like QwQ-32B). Users can even set a "thinking budget" to balance computational effort against response speed.

Training Process:

Three-stage pretraining:

Stage 1: 30+ trillion tokens at 4K context for basic language skills
Stage 2: Additional 5 trillion tokens emphasizing STEM, coding, and reasoning
Stage 3: High-quality long-context data extending to 32K tokens

Benchmark Performance:

The flagship Qwen3-235B-A22B competes with:

OpenAI o1 and o3-mini
DeepSeek-R1
Google Gemini-2.5-Pro
Grok-3

Remarkably, Qwen3-30B-A3B outperforms QwQ-32B despite having 10x fewer activated parameters. Even Qwen3-4B rivals Qwen2.5-72B-Instruct performance.

Best For: Multilingual applications, agent development, creative writing, role-playing, and multi-turn dialogue systems. Excellent for applications that need both quick responses and deep reasoning.

Qwen3-Coder (30B and 480B Parameters)

Alibaba's latest coding models optimized for agentic and coding tasks.

Key Specifications:

Sizes: 30B (19GB), 480B (varies)
Optimization: Long code contexts
Downloads: 1.6 million

Best For: Complex software development, large codebase navigation, and autonomous coding agents.

Qwen3-VL (2B to 235B Parameters)

The most powerful vision-language model in the Qwen family.

Key Specifications:

Size Range: 2B to 235B
Capabilities: Visual understanding, document analysis, multimodal reasoning
Downloads: 881K

Best For: Document processing, visual question answering, and applications requiring both image and text understanding.

Qwen2.5-Coder (0.5B to 32B Parameters)

The state-of-the-art open-source coding model, matching GPT-4o on code repair benchmarks.

Key Specifications:

Sizes: 0.5B (398MB), 1.5B, 3B, 7B, 14B, 32B (20GB)
Context Window: 128K tokens
Programming Languages: 92 supported
Training Data: 5.5 trillion tokens
Downloads: 9.5 million

Architecture:

Built on Qwen2.5 architecture with:

32B Model: 5,120 hidden size, 40 query heads, 8 key-value heads, 27,648 intermediate size

Benchmark Performance:

Benchmark	Qwen2.5-Coder 32B	Notes
Aider (code repair)	73.7	Comparable to GPT-4o, 4th overall
MdEval (multi-language repair)	75.2	#1 among open-source
McEval (40+ languages)	65.9	Excellent cross-language support

The model achieves state-of-the-art performance across 10+ benchmarks including code generation, completion, reasoning, and repair.

Best For: Professional development, code generation, code reasoning, and code fixing tasks. The best open-source coding model available.

Qwen2 (0.5B to 72B Parameters)

The previous generation with excellent multilingual support for 29 languages.

Key Specifications:

Sizes: 0.5B (352MB), 1.5B (935MB), 7B (4.4GB), 72B (41GB)
Context Window: 32K-128K tokens
Languages: 29 including major European, Asian, and Middle Eastern languages

Best For: Multilingual chatbots, translation, and cross-lingual applications.

CodeQwen (7B Parameters)

An earlier code-specialized Qwen model with exceptional context length.

Key Specifications:

Size: 7B (4.2GB)
Context Window: 64K tokens
Training: 3 trillion tokens of code data
Languages: 92 coding languages

Best For: Long-context code understanding, Text-to-SQL, and bug fixing.

Mistral AI Models: French Excellence

Mistral AI, based in Paris, has produced some of the most efficient and capable open-source models. Their innovative use of Mixture-of-Experts and Sliding Window Attention has influenced the entire field.

Mistral (7B Parameters)

The original Mistral model that proved smaller models could outperform much larger ones through architectural innovation.

Key Specifications:

Size: 7B (4.4GB)
Context Window: 32K tokens
License: Apache 2.0
Downloads: 23.6 million

Architecture Innovations:

Sliding Window Attention: Trained with 8K context, fixed cache size, theoretical attention span of 128K tokens
Grouped Query Attention (GQA): Faster inference and smaller cache
Byte-fallback BPE Tokenizer: No out-of-vocabulary tokens

Outperforms Llama 2 13B on all benchmarks and approaches CodeLlama 7B on code tasks.

Hardware Requirements: 24GB RAM and single GPU

Best For: General-purpose applications, chatbots, and situations where you need reliable performance with moderate resources.

Mixtral 8x7B and 8x22B (47B and 141B Total Parameters)

Mistral's groundbreaking Mixture-of-Experts models that use only a fraction of their parameters for each inference.

Key Specifications:

Specification	Mixtral 8x7B	Mixtral 8x22B
Total Parameters	47B	141B
Active Parameters	13B	39B
Size	26GB	80GB
Context Window	32K tokens	64K tokens
Downloads	1.6 million	-

Architecture:

Mixtral shares Mistral 7B's architecture with one key difference: each layer contains 8 feedforward blocks (experts) instead of one. A router network selects which 2 experts process each token.

Key features:

Sliding Window Attention with broader context support
Grouped Query Attention for efficient inference
Byte-fallback BPE Tokenizer

Performance:

8x7B: Outperforms Llama 2 70B on most benchmarks with 6x faster inference. Matches or outperforms GPT-3.5 on standard benchmarks.
8x22B: Outperforms ChatGPT 3.5 on MMLU and WinoGrande. Achieves 90.8% on GSM8K (math) and 44.6% on MATH.

Resource Requirements:

8x7B: 64GB RAM, dual GPUs recommended
8x22B: ~90GB VRAM in half-precision, 5.3x slower than 7B, 2.1x slower than 8x7B

Languages: English, French, Italian, German, Spanish (native fluency)

Best For: Applications requiring high capability with better efficiency than pure dense models. Excellent for multilingual European applications.

Microsoft Phi Family: Small But Mighty

Microsoft's Phi models prove that careful training on high-quality synthetic data can create remarkably capable small models. The Phi series represents a different philosophy: quality over quantity in training data.

Phi-4 (14B Parameters)

The latest Phi model, released in December 2024, trained on synthetic datasets and high-quality filtered data with a focus on reasoning.

Key Specifications:

Size: 14B (9.1GB)
Context Window: 16K tokens
Focus: Reasoning and logic
Training Data: 16 billion tokens (8.3 billion unique)
Downloads: 6.7 million

Training Innovation: Synthetic Data First

Phi-4 represents a paradigm shift in training methodology:

Synthetic Data Generation: GPT-4o rewrote web text, code, scientific papers, and books as exercises, discussions, Q&A pairs, and structured reasoning tasks.
Feedback Loop: GPT-4o critiqued its own outputs and generated improvements.
50 Dataset Types: Different seeds and multi-stage prompting procedures covering diverse topics, skills, and interaction types. Total: ~400B unweighted tokens.

Phi-4 substantially surpasses its teacher model (GPT-4) on STEM-focused QA capabilities, demonstrating that synthetic data can produce emergent capabilities beyond the teacher.

Architecture:

Dense decoder-only Transformer with minimal changes from Phi-3:

Modified RoPE base frequency for 32K context support
Optimized for memory/compute-constrained environments

Best For: Edge deployment, real-time applications, and situations requiring strong reasoning in a compact package.

Phi-4-Reasoning (14B Parameters)

A fine-tuned variant specifically optimized for complex reasoning tasks through supervised fine-tuning and reinforcement learning.

Key Specifications:

Size: 14B (11GB)
Context Window: 32K tokens
Training: SFT + Reinforcement Learning
RL Training: Only ~6,400 math-focused problems
Downloads: 916K

Training Approach:

Curated Prompts: 1.4M prompts focused on "boundary" cases at the edge of Phi-4's baseline capabilities. Emphasized multi-step reasoning over factual recall.
Synthetic Responses: Generated using o3-mini in high-reasoning mode.
Structured Reasoning: Special <think> and </think> tokens separate intermediate reasoning from final answers, promoting transparency and coherence.

Benchmark Performance:

Despite only 14B parameters, Phi-4-Reasoning:

Outperforms DeepSeek-R1 Distill Llama 70B (5x larger)
Approaches full DeepSeek-R1 (671B) on AIME 2025
Excels on GPQA-Diamond (graduate-level science)
Strong on LiveCodeBench (competitive coding)
Generalizes to NP-hard problems (3SAT, TSP)

Best For: Mathematical reasoning, scientific analysis, complex problem-solving, and coding tasks. Exceptional reasoning capability for its size.

Phi-3 (3.8B and 14B Parameters)

The previous generation with excellent efficiency and the first Phi model to achieve widespread adoption.

Key Specifications:

Sizes: Mini 3.8B (2.2GB), Medium 14B (7.9GB)
Context Window: 128K tokens
Training: 3.3 trillion tokens

Best For: Quick prototyping, mobile applications, and situations where Phi-4 is too resource-intensive.

Phi-2 (2.7B Parameters)

Microsoft's earlier small model demonstrating that 2.7B parameters can achieve remarkable capability.

Key Specifications:

Size: 2.7B (1.6GB)
Context Window: 2K tokens
Capabilities: Common-sense reasoning, language understanding

Best For: Extremely constrained environments, quick experiments, and applications where even Phi-3 is too large.

Coding-Specialized Models

Beyond the coding variants of general models, Ollama offers several dedicated code models optimized specifically for software development tasks.

CodeLlama (7B to 70B Parameters)

Meta's code-specialized version of Llama 2, offering specialized variants for different use cases.

Key Specifications:

Sizes: 7B (3.8GB), 13B (7.4GB), 34B (19GB), 70B (39GB)
Context Window: 16K (2K for 70B)
Languages: Python, C++, Java, PHP, TypeScript, C#, Bash
Variants: Base, Instruct, Python-specialized
Training: 500B tokens (1T for 70B)

Fill-in-the-Middle (FIM) Support:

Important: Infilling is only available in 7B and 13B base models. The 34B and 70B models were trained without the infilling objective.

Use the <FILL_ME> token for code completion in the middle of files.

Performance:

34B: 53.7% HumanEval, 56.2% MBPP (comparable to ChatGPT at release)
70B: Highest capability but 2K context limitation
7B/13B: Best for real-time completion due to low latency

Best For: Code completion, generation, review, and fill-in-the-middle tasks. Choose size based on your latency requirements.

StarCoder2 (3B, 7B, and 15B Parameters)

Next-generation open code models from BigCode with full transparency about training data.

Key Specifications:

Sizes: 3B (1.7GB), 7B (4.0GB), 15B (9.1GB)
Context Window: 16K tokens
Languages: 17 (3B, 7B) to 600+ (15B)
Training: 3-4 trillion tokens from The Stack v2
License: BigCode OpenRAIL-M v1

Architecture:

Grouped Query Attention
Sliding window attention (4,096 tokens)
Fill-in-the-Middle objective training

Training Data (The Stack v2):

67.5 terabytes of code data (4x larger than original StarCoder)
Full transparency with SoftWare Heritage persistent IDentifiers (SWHIDs)

Performance:

Model	Comparison
StarCoder2-3B	Matches original StarCoder-15B
StarCoder2-15B	Matches 33B+ models, outperforms CodeLlama-34B

StarCoder2-15B outperforms DeepSeekCoder-33B on math, code reasoning, and low-resource languages.

Note: StarCoder2-7B underperforms relative to 3B and 15B for unknown reasons.

Best For: Code completion, generation, and applications where transparency about training data matters (compliance, licensing concerns).

WizardCoder (7B and 33B Parameters)

State-of-the-art code generation using innovative Evol-Instruct techniques.

Key Specifications:

Sizes: 7B (3.8GB), 33B (19GB)
Context Window: 16K tokens
Base: Code Llama and DeepSeek Coder

Best For: Advanced code generation tasks requiring high accuracy.

Stable Code 3B (3B Parameters)

Stability AI's efficient code completion model optimized for real-time IDE use.

Key Specifications:

Size: 3B (1.6GB)
Context Window: 16K tokens
Languages: 18 programming languages
Feature: Fill-in-the-Middle capability

Best For: IDE integration, real-time code completion, and applications requiring fast inference.

Granite Code (3B to 34B Parameters)

IBM's enterprise-focused decoder-only code models with strong compliance and licensing guarantees.

Key Specifications:

Sizes: 3B (2.0GB), 8B (4.6GB), 20B (12GB), 34B (19GB)
Context Window: 8K-128K tokens
Capabilities: Code generation, explanation, fixing
Variants: Base, Instruct, Accelerator

Architecture:

Transformer decoder with pre-normalization
Multi-Query Attention for efficient inference
GELU activation in MLP blocks
LayerNorm for activation normalization

Two-Phase Training:

Phase 1: 3-4 trillion tokens from 116 programming languages
Phase 2: 500 billion additional tokens mixing code and natural language for improved reasoning

34B Model Creation (Depth Upscaling):

IBM created the 34B model through depth upscaling:

Remove final 8 layers from first 20B checkpoint
Remove first 8 layers from second 20B checkpoint
Merge to create 88-layer model
Continue training on 1.4T tokens

Performance:

Granite models consistently outperform equivalent-size CodeLlama. Even Granite-3B-Code-Instruct surpasses CodeLlama-34B-Instruct.

Enterprise Features:

Training data collected per IBM AI ethics principles
IBM legal team guidance for trustworthy enterprise use
Available on watsonx.ai and RHEL AI
Accelerator versions for reduced latency

Best For: Enterprise environments requiring IBM support, compliance guarantees, and licensing clarity.

Magicoder (7B Parameters)

Code models trained using the innovative OSS-Instruct methodology for reduced training bias.

Key Specifications:

Size: 7B (3.8GB)
Context Window: 16K tokens
Training: 75K synthetic instructions generated from open-source code

Best For: Diverse, realistic code generation with reduced training bias compared to models trained on curated instruction sets.

SQL-Specialized Models

For database work, these specialized models convert natural language to SQL with high accuracy.

SQLCoder (7B and 15B Parameters)

Fine-tuned on StarCoder specifically for SQL generation, slightly outperforming GPT-3.5-turbo on natural language to SQL tasks.

Key Specifications:

Sizes: 7B (4.1GB), 15B (9.0GB)
Context Window: 8K-32K tokens

Best For: Database querying, business intelligence, and SQL generation from natural language descriptions.

DuckDB-NSQL (7B Parameters)

Specialized for DuckDB SQL generation, optimized for analytics workloads.

Key Specifications:

Size: 7B (3.8GB)
Context Window: 16K tokens
Base: Llama-2 7B with SQL-specific training

Best For: DuckDB-specific applications, analytics workloads, and data engineering tasks.

Vision-Language Models

These models combine text and image understanding for multimodal applications.

LLaVA (7B to 34B Parameters)

Large Language and Vision Assistant, one of the most influential open-source vision-language models.

Key Specifications:

Sizes: 7B (4.7GB), 13B (8.0GB), 34B (20GB)
Context Window: 4K-32K tokens
Capabilities: Visual reasoning, OCR, image captioning
Downloads: 12.3 million

Architecture (LLaVA 1.6/LLaVA-NeXT):

Vision Encoder: CLIP-ViT-L
Vision-Language Connector: MLP (upgraded from linear projection in v1.5)
Resolution: Up to 672x672 (4x more pixels than v1.5)
Aspect Ratios: Supports 672x672, 336x1344, 1344x336

Key Improvements in v1.6:

Enhanced OCR: Replaced TextCaps with DocVQA and SynDog-EN training data
Chart Understanding: Added ChartQA, DVQA, AI2D for diagram comprehension
Better Visual Reasoning: Improved zero-shot performance

Training Efficiency:

LLaVA 1.6 maintains minimalist design:

32 GPUs for ~1 day
1.3M training samples
Reuses pretrained connector from v1.5
100-1000x lower compute cost than competing models

Performance: Catches up to Gemini Pro and outperforms Qwen-VL-Plus on selected benchmarks.

2025 Development (LLaVA-Mini):

LLaVA-Mini achieves comparable performance using only 1 vision token instead of 576 (0.17% of original), offering 77% FLOPs reduction and significantly lower GPU memory.

Best For: General visual understanding, document analysis, and multimodal conversations.

LLaVA-Llama3 (8B Parameters)

LLaVA fine-tuned from Llama 3 Instruct with improved benchmark scores.

Key Specifications:

Size: 8B (5.5GB)
Context Window: 8K tokens
Downloads: 2.1 million

Best For: Users who want LLaVA capabilities with Llama 3's improved language understanding.

BakLLaVA (7B Parameters)

Mistral 7B augmented with LLaVA architecture, combining Mistral's efficiency with vision capabilities.

Key Specifications:

Size: 7B (4.7GB)
Context Window: 32K tokens
Downloads: 373K

Best For: Visual understanding with Mistral's efficient architecture and longer context.

MiniCPM-V (8B Parameters)

Efficient multimodal model from OpenBMB, designed to run on edge devices while outperforming much larger models.

Key Specifications:

Size: 8B (5.5GB)
Architecture: SigLip-400M + Qwen2-7B
Context Window: 32K tokens
Resolution: Up to 1.8 million pixels (e.g., 1344x1344)
Languages: English, Chinese, German, French, Italian, Korean

Token Efficiency:

MiniCPM-V 2.6 produces only 640 tokens when processing a 1.8M pixel image—75% fewer than most models. This improves:

Inference speed
First-token latency
Memory usage
Power consumption

Performance:

OpenCompass: 65.2 (surpasses GPT-4o mini and Claude 3.5 Sonnet on single-image)
State-of-the-art on OCRBench, surpassing GPT-4V and Gemini 1.5 Pro
Supports real-time video understanding on edge devices like iPad

2025 Updates:

MiniCPM-o 2.6 (January 2025): Adds real-time speech-to-speech conversation and multimodal live streaming. OpenCompass: 70.2
MiniCPM-V 4.5: Outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B in vision-language capabilities

Best For: High-resolution image analysis, OCR-heavy applications, edge deployment, and efficient multimodal deployment.

Moondream (1.8B Parameters)

Tiny vision-language model designed specifically for edge deployment.

Key Specifications:

Size: 1.86B (1.7GB), under 5GB memory required
Context Window: ~1000 tokens
Base: SigLIP + Phi-1.5 weights
Downloads: 472K

Capabilities:

Visual question answering
Image captioning
Object detection (including document layout detection)
UI understanding
Zero-shot detection (outperforms o3-mini, SmolVLM 2.0, Claude 3 Opus)

2025 Improvements (June 2025 Release):

Reinforcement Learning fine-tuning across 55 vision-language tasks
Better OCR for documents and tables
ScreenSpot [email protected]: 60.3 (up from 53.3)
DocVQA: 79.3 (up from 76.5)
TextVQA: 76.3 (up from 74.6)
CountBenchQA: 86.4 (up from 80)
COCO object detection: 51.2 (up from 30.5)

Edge Deployment:

Runs on Raspberry Pi and single-board computers
Much faster than larger models like Qwen2.5-VL on edge hardware
Designed for quick multimodal tasks on constrained hardware

Best For: Edge devices, mobile applications, and situations requiring vision capabilities with minimal resources.

Embedding Models

Embedding models convert text to numerical vectors for semantic search, retrieval-augmented generation (RAG), and similarity matching.

nomic-embed-text

High-performance text embedding that surpasses OpenAI's ada-002 and text-embedding-3-small, with full reproducibility.

Key Specifications:

Parameters: 137 million
Size: 274MB
Context Window: 8192 tokens (industry-leading for open models)
Downloads: 48.7 million
License: Apache 2.0

Architecture (nomic-bert-2048):

BERT base with key modifications:

Rotary Positional Embeddings (RoPE): Replaces absolute encodings, enables context extrapolation
SwiGLU Activation: Replaces GeLU for improved performance
Flash Attention: Optimized attention computation
Vocabulary Size: Multiple of 64 for efficiency

Training Pipeline:

Stage 1: Self-supervised MLM objective (BERT-style)
Stage 2: Contrastive training with web-scale unsupervised data
Stage 3: Contrastive fine-tuning with 1.6M curated paired samples

Training data: 235 million text pairs (fully disclosed)

Performance:

Outperforms text-embedding-ada-002 on short-context MTEB
Outperforms text-embedding-3-small on MTEB
Outperforms jina-embeddings-v2-base-en on long context (LoCo, Jina benchmarks)
Best performing 100M parameter class unsupervised model

Nomic Embed v1.5:

Adds Matryoshka Representation Learning for adjustable embedding dimensions
Now multimodal: nomic-embed-vision-v1.5 aligned to text embedding space

Best For: Semantic search, similarity matching, RAG applications, and any task requiring high-quality text embeddings.

mxbai-embed-large

State-of-the-art large embedding model from mixedbread.ai.

Key Specifications:

Size: 335M (670MB)
Context Window: 512 tokens
Downloads: 6 million

Achieves top performance among BERT-large models on MTEB benchmark, outperforming OpenAI's commercial embedding.

Best For: High-accuracy embedding applications where quality matters more than speed or context length.

BGE-M3

Versatile multilingual embedding model from BAAI (Beijing Academy of Artificial Intelligence), supporting three retrieval methods in one model.

Key Specifications:

Size: 567M (1.2GB)
Context Window: 8K tokens
Languages: 100+ languages
Capabilities: Dense, multi-vector, and sparse retrieval
Downloads: 3 million

M3 = Multi-Multi-Multi:

Multi-linguality: 100+ languages
Multi-granularity: Up to 8192 tokens input
Multi-functionality: Three retrieval methods unified

Architecture:

Based on XLM-RoBERTa-large (24 layers, 1024 hidden, 16 heads) with RetroMAE enhancements. Core model: ~550M parameters.

Three Retrieval Methods:

Dense Retrieval: Normalized [CLS] token hidden state
Sparse Retrieval: Linear layer + ReLU on hidden states (outperforms BM25)
Multi-vector (ColBERT-style): Fine-grained query-passage interactions

Hybrid Scoring: s_rank = w1·s_dense + w2·s_lex + w3·s_mul

Training:

Pre-trained on ~1.2 billion unsupervised text pairs
Fine-tuned on English, Chinese, and multilingual retrieval datasets
Novel self-knowledge distillation approach

Performance:

MIRACL (18 languages): nDCG@10 = 70.0 (highest among multilingual embedders)
Outperforms mE5 (~65.4 average)
Sparse representations outperform BM25 across all tested languages

2025 Updates:

Available on NVIDIA NIM and IONOS Cloud for production deployment
BGE-VL released for multimodal embedding (MIT license)

Best For: Multilingual retrieval, cross-lingual search, applications requiring variable text lengths, and hybrid retrieval systems.

all-minilm

Lightweight embedding model for resource-constrained environments.

Key Specifications:

Sizes: 46MB and 67MB variants
Context Window: 512 tokens
Downloads: 2.1 million

Best For: Quick prototyping, edge deployment, and applications where embedding model size matters.

Snowflake Arctic Embed

Retrieval-optimized embeddings from Snowflake, designed for production RAG pipelines.

Key Specifications:

Sizes: 22M (46MB), 33M (67MB), 110M (219MB), 137M (274MB), 335M (669MB)
Context Window: 512-2K tokens

Best For: Retrieval-focused applications, search systems, and production RAG pipelines.

Enterprise and Specialized Models

Command R (35B Parameters)

Cohere's model optimized for RAG and tool integration, designed for enterprise-scale deployments.

Key Specifications:

Size: 35B (19GB)
Context Window: 128K tokens
Languages: 10+ languages (English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Chinese, Arabic)

Architecture:

Auto-regressive transformer with:

Supervised Fine-Tuning (SFT)
Preference training for human alignment

RAG Capabilities:

Command R is specifically designed for retrieval-augmented generation:

Grounded summarization with source citations
Verifiable outputs with grounding spans
Optimized for working with Cohere's Embed and Rerank models

Tool Use:

Single-step: Multiple tools called simultaneously in one step
Multi-step: Sequential tool calls using previous results
Enables dynamic actions based on external information

Best For: Enterprise RAG applications, tool-using agents, and production chatbots requiring high throughput.

Command R+ (104B Parameters)

Cohere's flagship model with advanced multi-step reasoning and RAG capabilities.

Key Specifications:

Size: 104B
Context Window: 128K tokens
Languages: 10+ languages

August 2024 Update:

50% higher throughput
25% lower latencies
Same hardware footprint

Best For: Complex RAG workflows, multi-step tool use, and enterprise applications requiring maximum capability.

Aya (8B and 35B Parameters)

Cohere's multilingual model supporting 23 languages with strong cross-lingual performance.

Key Specifications:

Sizes: 8B (4.8GB), 35B (20GB)
Context Window: 8K tokens
Downloads: 213K

Best For: Multilingual applications requiring strong cross-lingual performance.

Solar (10.7B Parameters)

Upstage's efficient model using innovative Depth Up-Scaling, outperforming models up to 30B parameters.

Key Specifications:

Size: 10.7B (6.1GB)
Context Window: 4K tokens
Base: Llama 2 architecture with Mistral weights

Outperforms Mixtral 8x7B on H6 benchmarks despite having fewer total parameters.

Best For: Single-turn conversations and applications where efficiency matters.

Nemotron (70B Parameters)

NVIDIA-customized Llama 3.1 for enhanced response quality, optimized for the NVIDIA ecosystem.

Key Specifications:

Size: 70B (43GB)
Context Window: 128K tokens
Training: RLHF with REINFORCE algorithm

Best For: Enterprise applications requiring NVIDIA ecosystem integration and high-quality responses.

InternLM2 (1.8B to 20B Parameters)

Shanghai AI Lab's model with outstanding reasoning and tool utilization capabilities.

Key Specifications:

Sizes: 1.8B (1.1GB), 7B (4.5GB), 20B (11GB)
Context Window: 32K-256K tokens

Best For: Mathematical reasoning, tool utilization, and web browsing applications.

Yi (6B to 34B Parameters)

01.ai's bilingual English-Chinese models trained on 3 trillion tokens.

Key Specifications:

Sizes: 6B (3.5GB), 9B (5.0GB), 34B (19GB)
Context Window: 4K tokens
Training: 3 trillion tokens

Best For: English-Chinese bilingual applications and research.

Community and Fine-Tuned Models

OpenHermes (7B Parameters)

Teknium's state-of-the-art fine-tune on Mistral using carefully curated open datasets.

Key Specifications:

Size: 7B (4.1GB)
Context Window: 32K tokens
Training: 1,000,000 entries primarily from GPT-4

Training Data:

~1 million dialogue entries, primarily GPT-4 generated
7-14% programming instructions
Converted to ShareGPT format, then ChatML via axolotl
Extensive filtering of public datasets

Training Approach:

Supervised fine-tuning on multi-turn conversations
Preference data rated by GPT-4
Distilled Direct Preference Optimization (dDPO)

Interesting Finding: Including 7-14% code instructions boosted non-code benchmarks (TruthfulQA, AGIEval, GPT4All) while slightly reducing BigBench.

Performance:

GPT4All average: 73.12
AGIEval: 43.07
TruthfulQA: 53.04
HumanEval pass@1: 50.7%
Matches larger 70B models on certain benchmarks

Best For: Multi-turn conversations, coding tasks, and applications requiring strong instruction-following.

Dolphin-Mixtral (8x7B and 8x22B)

Uncensored fine-tune of Mixtral optimized for coding and unrestricted responses.

Key Specifications:

Sizes: 8x7B (26GB), 8x22B (80GB)
Context Window: 32K-64K tokens
Downloads: 799K

Best For: Uncensored coding assistance and creative applications.

Zephyr (7B and 141B Parameters)

HuggingFace's helpful assistant models, optimized for user assistance.

Key Specifications:

Sizes: 7B (4.1GB), 141B (80GB)
Context Window: 32K-64K tokens
Downloads: 338K

Best For: Helpful, conversational applications prioritizing user assistance.

OpenChat (7B Parameters)

C-RLFT trained model that surpasses ChatGPT on various benchmarks.

Key Specifications:

Size: 7B (4.1GB)
Context Window: 8K tokens
Downloads: 253K

Best For: Chat applications requiring strong open-source performance.

Nous-Hermes 2 (10.7B and 34B Parameters)

Nous Research's scientific and coding-focused models.

Key Specifications:

Sizes: 10.7B (6.1GB), 34B (19GB)
Context Window: 4K tokens
Downloads: 196K

Best For: Scientific discussion, coding tasks, and research applications.

Samantha-Mistral (7B Parameters)

Eric Hartford's companion assistant trained on philosophy and psychology.

Key Specifications:

Size: 7B (4.1GB)
Context Window: 32K tokens
Downloads: 159K

Best For: Conversational AI emphasizing personal development and relationship coaching.

Vicuna (7B to 33B Parameters)

LMSYS's chat assistant trained on ShareGPT conversations.

Key Specifications:

Sizes: 7B (3.8GB), 13B (7.4GB), 33B (18GB)
Context Window: 2K-16K tokens

Best For: General chat applications and fine-tuning experiments.

Orca-Mini (3B to 70B Parameters)

Llama-based models trained using Orca methodology for learning complex reasoning patterns.

Key Specifications:

Sizes: 3B (2.0GB), 7B (3.8GB), 13B (7.4GB), 70B (39GB)
Context Window: Various

Best For: Entry-level hardware deployments and learning complex reasoning patterns.

Neural Chat (7B Parameters)

Intel's Mistral-based model for high-performance chatbots, optimized for Intel hardware.

Key Specifications:

Size: 7B (4.1GB)
Context Window: 32K tokens
Downloads: 198K

Best For: Chatbot applications optimized for Intel hardware.

TinyLlama (1.1B Parameters)

Compact Llama trained on 3 trillion tokens, demonstrating that tiny models can be surprisingly capable.

Key Specifications:

Size: 1.1B (638MB)
Context Window: 2K tokens
Downloads: 3.2 million

Best For: Extremely constrained environments and minimal footprint deployments.

EverythingLM (13B Parameters)

Uncensored Llama 2 with extended 16K context.

Key Specifications:

Size: 13B (7.4GB)
Context Window: 16K tokens
Downloads: 91K

Best For: Extended context applications without content restrictions.

Notux (8x7B Parameters)

Optimized Mixtral variant with improved fine-tuning.

Key Specifications:

Size: 8x7B (26GB)
Context Window: 32K tokens

Best For: Users wanting improved Mixtral performance through fine-tuning.

XWinLM (7B and 13B Parameters)

Llama 2-based model with competitive benchmark performance.

Key Specifications:

Sizes: 7B (3.8GB), 13B (7.4GB)
Context Window: 4K tokens
Downloads: 143K

Best For: General chat and alternative to base Llama 2.

Domain-Specific Models

Meditron (7B and 70B Parameters)

Medical-specialized model from EPFL, designed for healthcare applications.

Key Specifications:

Sizes: 7B (3.8GB), 70B (39GB)
Context Window: 2K-4K tokens

Outperforms Llama 2, GPT-3.5, and Flan-PaLM on many medical reasoning tasks.

Best For: Medical question answering, differential diagnosis support, and health information (with appropriate clinical oversight).

Important: Not a substitute for professional medical advice. Requires clinical oversight for any healthcare applications.

MedLlama2 (7B Parameters)

Llama 2 fine-tuned on MedQA dataset for medical question-answering.

Key Specifications:

Size: 7B (3.8GB)
Context Window: 4K tokens
Downloads: 114K

Best For: Medical question-answering and research (not for clinical use).

Wizard-Math (7B to 70B Parameters)

Mathematical reasoning specialist optimized for problem-solving and computational tasks.

Key Specifications:

Sizes: 7B (4.1GB), 13B (7.4GB), 70B (39GB)
Context Window: 2K-32K tokens
Downloads: 164K

Best For: Mathematical problem-solving, tutoring applications, and computational reasoning.

FunctionGemma (270M Parameters)

Google's Gemma 3 variant fine-tuned for function calling, enabling reliable tool use in agents.

Key Specifications:

Size: 270M
Specialization: Tool and function calling
Downloads: 13K

Best For: Agent development and applications requiring reliable function calling.

Multilingual Models

StableLM2 (1.6B and 12B Parameters)

Stability AI's multilingual model optimized for European languages.

Key Specifications:

Sizes: 1.6B (983MB), 12B (7.0GB)
Context Window: 4K tokens
Languages: English, Spanish, German, Italian, French, Portuguese, Dutch
Downloads: 179K

Best For: Multilingual European applications with moderate resource requirements.

Falcon (7B to 180B Parameters)

Technology Innovation Institute's multilingual models with massive scale options.

Key Specifications:

Sizes: 7B (4.2GB), 40B (24GB), 180B (101GB)
Context Window: 2K tokens

The 180B variant performs between GPT-3.5 and GPT-4 levels on many benchmarks.

Best For: High-capability multilingual applications and research.

Hardware Requirements Quick Reference

Model Category	VRAM Needed	RAM Alternative	Best GPUs
1-3B models	4GB	8GB	Any modern GPU
7-8B models	8GB	16GB	RTX 3060, RTX 4060
13-14B models	12GB	24GB	RTX 3060 12GB, RTX 4070
32-34B models	24GB	48GB	RTX 4090, A6000
70B models	48GB+	64GB+	Multiple GPUs, Apple Silicon
100B+ models	Specialized	128GB+	Enterprise infrastructure

Apple Silicon Recommendations:

M1/M2 (16GB): 7-8B models comfortably
M2 Pro/M3 Pro (32GB): Up to 32B models, 70B with slow speed
M3 Max (128GB): 70B models at usable speeds

Quantization Impact:

Q4 (4-bit): 75% size reduction, minimal quality loss
Q8 (8-bit): Higher quality, more memory
Q2-Q3: Maximum compression, noticeable quality degradation
Recommended: Q4_K_M for best balance

How to Choose the Right Model

For General Chat and Assistance

Budget hardware: Llama 3.2 3B, Phi-3 Mini, Gemma 2 2B
Standard hardware: Llama 3.1 8B, Mistral 7B, Gemma 3 12B
High-end hardware: Llama 3.3 70B, Qwen3 32B

For Coding and Development

Quick completions: Stable Code 3B, CodeGemma 2B
General coding: Qwen2.5-Coder 7B, DeepSeek-Coder 6.7B
Maximum quality: Qwen2.5-Coder 32B, DeepSeek-Coder-V2 16B

For Reasoning and Analysis

Efficient reasoning: Phi-4-Reasoning, DeepSeek-R1 14B
Maximum capability: DeepSeek-R1 70B, Qwen3 32B

For Image Understanding

Lightweight: Moondream, LLaVA 7B
Balanced: MiniCPM-V, Gemma 3 12B
Maximum capability: Llama 3.2-Vision 90B, Qwen3-VL

For Multilingual Applications

European languages: Mixtral 8x7B, StableLM2
Asian languages: Qwen3, Yi
100+ languages: BGE-M3, Qwen2

For RAG and Search

Standard embedding: nomic-embed-text, all-minilm
High-quality embedding: mxbai-embed-large, BGE-M3
RAG systems: Command R with your embedding choice

Getting Started with Ollama

Installing Ollama and running your first model takes just a few minutes:

Install Ollama: Download from ollama.com for Windows, Mac, or Linux
Pull a model: ollama pull llama3.1
Start chatting: ollama run llama3.1

For integration with applications, Ollama provides a REST API at localhost:11434:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?"
}'

Using Local AI on Practical Web Tools

If you want to experience local AI without any setup, try our AI Chat feature. It connects to your local Ollama installation, providing a polished interface while keeping all processing on your machine. Your prompts never touch our servers, maintaining complete privacy.

The interface works with any Ollama model. Simply select your preferred model and start chatting. Combined with our privacy-focused file conversion tools, you can build complete local workflows without sending sensitive data to the cloud.

Frequently Asked Questions

What is the best Ollama model for beginners?

Start with Llama 3.1 8B. It runs on most hardware (8GB VRAM or 16GB RAM), provides excellent quality across diverse tasks, and has the largest community support. Once comfortable, explore specialized models based on your specific needs.

How much VRAM do I need for Ollama?

For 7-8B models, 8GB VRAM is sufficient. For 13-14B models, aim for 12GB. For 32B+ models, you need 24GB or more. Alternatively, models can run in system RAM at reduced speed, roughly doubling the memory requirement.

What is the fastest Ollama model?

The fastest capable model is Llama 3.2 1B or Phi-3 Mini, generating 100+ tokens per second on modest hardware. For usable quality, Llama 3.1 8B at 40-70 tokens per second on modern GPUs offers the best speed/quality balance.

Which Ollama model is best for coding?

Qwen2.5-Coder 32B offers the best quality, matching GPT-4o on code repair benchmarks. For smaller hardware, Qwen2.5-Coder 7B or DeepSeek-Coder 6.7B provide excellent results. StarCoder2 15B offers transparency about training data.

Can Ollama models process images?

Yes. Llama 3.2-Vision, LLaVA, MiniCPM-V, BakLLaVA, Moondream, and Gemma 3 (4B+) all process images. MiniCPM-V and LLaVA 1.6 offer the best image understanding for their size.

What is the difference between quantization levels?

Q4 uses 4 bits per parameter, reducing model size by 75% with minimal quality loss. Q8 uses 8 bits for higher quality but more memory. Q2-Q3 saves more memory but noticeably degrades quality. For most uses, Q4_K_M is the sweet spot.

How do I choose between Llama, Mistral, and Qwen?

Llama has the largest ecosystem and broadest support. Mistral offers excellent efficiency and European language performance. Qwen excels at multilingual tasks (especially Asian languages) and provides strong coding variants. Try each for your specific task.

Are these models safe to use?

Most models include safety training. However, "uncensored" variants (Llama 2 Uncensored, Dolphin-Mixtral) have guardrails removed and should be used responsibly. Always implement appropriate safeguards for production applications.

How do Ollama models compare to ChatGPT?

Llama 3.1 70B and DeepSeek-R1 70B approach GPT-4 quality for many tasks. For everyday use, Llama 3.1 8B competes with GPT-3.5. The gap has narrowed significantly, though frontier models still lead on the most complex reasoning.

Can I fine-tune Ollama models?

Ollama itself runs pre-existing models. For fine-tuning, use the base models from HuggingFace with tools like Axolotl or PEFT, then import the fine-tuned weights into Ollama.

What is the best model for mathematical reasoning?

DeepSeek-R1 and Phi-4-Reasoning lead in mathematical reasoning. Phi-4-Reasoning is remarkable for its size, matching much larger models on math olympiad problems. For maximum capability, DeepSeek-R1 70B or the full 671B model approach frontier performance.

Which models have the longest context windows?

Qwen3 supports up to 256K tokens. Llama 3.1/3.2/3.3 and DeepSeek-R1 support 128K. Gemma 3 (4B+) supports 128K. For embedding, BGE-M3 and nomic-embed-text support 8K tokens.

Conclusion

The Ollama model library offers something for every use case, from tiny 270M parameter edge models to massive 671B reasoning systems. The key is matching model capabilities to your actual needs rather than always choosing the largest option.

For most users, starting with Llama 3.1 8B provides an excellent foundation. As you identify specific needs—whether coding, reasoning, multilingual support, or image understanding—explore the specialized models in those categories.

Local AI has reached a maturity where quality rivals cloud APIs for many tasks, while offering complete privacy, zero ongoing costs, and offline capability. With Ollama making deployment trivial, the only barrier is choosing your first model.

Start experimenting today with our AI Chat feature, which connects seamlessly to your local Ollama installation for a polished, private AI experience.

Model information current as of December 2025. Download counts and specifications updated regularly by Ollama. Always check ollama.com/library for the latest models and versions.