Kimi-VL-A3B-Thinking

Kimi-VL-A3B-Thinking is a multimodal LLM that takes text and images as input and generates detailed, step-by-step reasoning in its answers.

Platform: Replicate

Multimodal Reasoning, Vision-Language Processing, Step-by-Step Text Generation

115 runs

L40S

Commercial

🚀Function Overview

A multimodal large language model specialized in complex reasoning tasks that processes images and text to generate detailed responses with explicit thinking processes.

Key Features

•Processes image and text inputs together
•Generates text outputs with step-by-step reasoning
•Handles long-context inputs of up to 128K tokens
•Preserves native image resolution through the MoonViT encoder
•Activates only 2.8B parameters at runtime, which keeps inference efficient
•Supports Flash-Attention 2 and multiple precision formats

Use Cases

•Solving complex math problems requiring visual interpretation
•Analyzing images with detailed contextual reasoning
•Building AI agents for multimodal environments
•Summarizing multi-page documents and academic papers
•Video analysis through frame-by-frame processing

⚙️Input Parameters

prompt (string): Text prompt for the model
image (string): Optional image input
top_p (number): Top-p sampling probability
temperature (number): Sampling temperature
max_length_tokens (integer): Maximum number of tokens to generate
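To make the two sampling parameters more concrete, the following is a minimal, generic sketch of how temperature and top_p are commonly combined when picking the next token. It is an illustration under stated assumptions, not the model's actual decoding code: the function names and the example logits are hypothetical, and real implementations work on tensors rather than Python lists.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def nucleus_candidates(logits, temperature=0.6, top_p=1.0):
    """Return the token indices that survive top-p (nucleus) filtering."""
    probs = softmax_with_temperature(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    return kept


def sample_token(logits, temperature=0.6, top_p=1.0, rng=random):
    """Sample one token index from the filtered candidates."""
    probs = softmax_with_temperature(logits, temperature)
    kept = nucleus_candidates(logits, temperature, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1]  # made-up scores for three tokens
    print(nucleus_candidates(example_logits, temperature=1.0, top_p=0.5))
    print(sample_token(example_logits, temperature=0.6, top_p=1.0))
```

With top_p set to 1, as in the example input on this page, no candidates are filtered out and only temperature shapes the distribution.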

💡Usage Examples

Example 1

Input Parameters

{
  "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
  "top_p": 1,
  "prompt": "Where am I?",
  "temperature": 0.6,
  "max_length_tokens": 2048
}
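The input above could be sent from Python roughly as sketched below. This assumes the third-party `replicate` client package is installed and a `REPLICATE_API_TOKEN` environment variable is set. The model reference is a placeholder that must be copied from the model's Replicate page, and the helper names and the validation ranges are assumptions made for illustration, not documented limits.

```python
def build_input(prompt, image=None, top_p=1.0, temperature=0.6,
                max_length_tokens=2048):
    """Assemble the input dictionary using the parameter names listed above."""
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    # The ranges below are assumptions for this sketch, not documented limits.
    if not 0 < top_p <= 1:
        raise ValueError("top_p is assumed to lie in (0, 1]")
    if temperature < 0:
        raise ValueError("temperature is assumed to be non-negative")
    if not isinstance(max_length_tokens, int) or max_length_tokens <= 0:
        raise ValueError("max_length_tokens must be a positive integer")

    payload = {
        "prompt": prompt,
        "top_p": top_p,
        "temperature": temperature,
        "max_length_tokens": max_length_tokens,
    }
    if image is not None:  # the image parameter is optional
        payload["image"] = image
    return payload


def run_model(model_ref, payload):
    """Call the model through the Replicate client and return the text output."""
    import replicate  # third-party package: pip install replicate

    output = replicate.run(model_ref, input=payload)
    # Text models may return a single string or a sequence of string chunks.
    if isinstance(output, str):
        return output
    return "".join(str(chunk) for chunk in output)


if __name__ == "__main__":
    payload = build_input(
        "Where am I?",
        image="https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
    )
    print(payload)
    # Placeholder reference: replace with the identifier shown on Replicate.
    # text = run_model("owner/model-name", payload)
```

Keeping payload construction separate from the network call makes it possible to check parameters locally before spending a run.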

Output Results

◁think▷Okay, let's see. The user is asking "Where am I?" and provided an image of a cyberpunk bar scene. The image has a lot of neon lights, futuristic elements, and people sitting at a bar. There's a sign that says "The Heist" and some text about sitting Jackie. Also, there's a mini-map in the corner, which might indicate it's from a video game. First, I need to figure out which game this is. The cyberpunk aesthetic with the bar setting and the mini-map suggests it could be Cyberpunk 2077. The Heist location is a known area in that game. The characters' outfits and the overall environment match the game's style. The user might be in-game, so the answer should confirm that they're in Cyberpunk 2077, specifically The Heist bar. I should mention the game's title and the location's name. Also, note the HUD elements as part of the game's interface. Make sure to keep it concise but informative.◁/think▷

You are in **Cyberpunk 2077**, a role-playing video game set in a dystopian future version of Night City. The image shows you inside **The Heist**, a bar nightclub popular in the game. The cyberpunk aesthetic, neon lighting, and HUD elements (mini-map, objectives "Sit Jackie") align with the game's signature style. This location is known for its vibrant atmosphere and interactions with characters Jackie Welles, a key figure in the early story.<|im_end|>
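The output above wraps the model's reasoning in ◁think▷ and ◁/think▷ markers and ends with an <|im_end|> token. A small helper like the following could separate the reasoning from the final answer before showing it to a user. It is a sketch based only on the markers visible in this example; the function name is hypothetical and other runs may format the output differently.

```python
THINK_OPEN = "◁think▷"
THINK_CLOSE = "◁/think▷"
END_TOKEN = "<|im_end|>"


def split_reasoning(raw_output):
    """Return (reasoning, answer) from a raw model output string."""
    text = raw_output.replace(END_TOKEN, "").strip()
    start = text.find(THINK_OPEN)
    end = text.find(THINK_CLOSE)
    # Only split when both markers are present and in the expected order.
    if start == -1 or end == -1 or end < start:
        return "", text
    reasoning = text[start + len(THINK_OPEN):end].strip()
    # Anything outside the thinking block is treated as the answer.
    answer = (text[:start] + text[end + len(THINK_CLOSE):]).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "◁think▷Okay, let's see.◁/think▷You are in a bar.<|im_end|>"
    reasoning, answer = split_reasoning(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

If the markers are missing, the helper returns an empty reasoning string and treats the whole text as the answer.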


Technical Specifications

Hardware Type: L40S
Run Count: 115
Commercial Use: Supported
Platform: Replicate

