Kimi-VL-A3B-Thinking

Kimi-VL-A3B-Thinking is a multimodal LLM that takes both text and images as input and generates detailed, step-by-step reasoning.

Platform: Replicate
Multimodal Reasoning, Vision-Language Processing, Step-by-Step Text Generation
115 runs
L40S
Commercial

🚀Function Overview

A multimodal large language model specialized in complex reasoning tasks that processes images and text to generate detailed responses with explicit thinking processes.

Key Features

  • Processes images and text inputs simultaneously
  • Generates text outputs with step-by-step reasoning
  • Handles long-context inputs up to 128K tokens
  • Maintains native image resolution using MoonViT encoder
  • Efficient with only 2.8B activated parameters at runtime
  • Supports Flash-Attention 2 and multiple precision formats

Use Cases

  • Solving complex math problems requiring visual interpretation
  • Analyzing images with detailed contextual reasoning
  • Building AI agents for multimodal environments
  • Summarizing multi-page documents and academic papers
  • Video analysis through frame-by-frame processing

⚙️Input Parameters

  • prompt (string): Text prompt for the model
  • image (string): Optional image input
  • top_p (number): Top-p sampling probability
  • temperature (number): Sampling temperature
  • max_length_tokens (integer): Maximum number of tokens to generate
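
The parameters above can be assembled into a request payload before it is sent anywhere. The following is a minimal sketch, not the platform's own client code: `build_input` is a hypothetical helper, the default values are copied from Example 1 rather than from any documented defaults, and the range checks reflect common sampling conventions, not limits stated on this page.

```python
from typing import Any, Dict, Optional


def build_input(
    prompt: str,
    image: Optional[str] = None,
    top_p: float = 1.0,            # assumption: value taken from Example 1
    temperature: float = 0.6,      # assumption: value taken from Example 1
    max_length_tokens: int = 2048, # assumption: value taken from Example 1
) -> Dict[str, Any]:
    """Build an input payload using the five parameter names listed above."""
    if not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    # The bounds below are conventional sanity checks, not documented limits.
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p is expected to be in (0, 1]")
    if temperature < 0.0:
        raise ValueError("temperature is expected to be non-negative")
    if max_length_tokens < 1:
        raise ValueError("max_length_tokens is expected to be positive")

    payload: Dict[str, Any] = {
        "prompt": prompt,
        "top_p": top_p,
        "temperature": temperature,
        "max_length_tokens": max_length_tokens,
    }
    # The image is optional, so it is only included when supplied.
    if image is not None:
        payload["image"] = image
    return payload


text_only = build_input("Where am I?")
```

Leaving `image` out of the payload when it is not supplied keeps text-only requests free of a null field, which some APIs reject.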

💡Usage Examples

Example 1

Input Parameters

{
  "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
  "top_p": 1,
  "prompt": "Where am I?",
  "temperature": 0.6,
  "max_length_tokens": 2048
}
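
As a rough illustration, the input above could be sent with the Replicate Python client (`pip install replicate`, with `REPLICATE_API_TOKEN` set in the environment). This is a sketch under assumptions: the page does not state the model identifier, so `MODEL_REF` is a hypothetical placeholder to be replaced with the real `owner/name` reference, and the output is assumed to arrive as either a string or a sequence of text chunks.

```python
from typing import Any, Dict, Optional

# Hypothetical placeholder: this page does not give the model identifier.
MODEL_REF = "owner/model-name"

# The input parameters from Example 1.
EXAMPLE_INPUT: Dict[str, Any] = {
    "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
    "top_p": 1,
    "prompt": "Where am I?",
    "temperature": 0.6,
    "max_length_tokens": 2048,
}


def run_example(model_ref: str = MODEL_REF,
                payload: Optional[Dict[str, Any]] = None) -> str:
    """Run the model and return the full completion as one string."""
    # Imported lazily so this module loads without the third-party package.
    import replicate

    output = replicate.run(model_ref, input=payload or EXAMPLE_INPUT)
    # Assumption: text output is a string or an iterable of text chunks.
    if isinstance(output, str):
        return output
    return "".join(str(chunk) for chunk in output)


# Usage (requires network access, an API token, and a real model reference):
#   print(run_example("owner/real-model-name"))
```

Joining the chunks matters because streamed output arrives in small pieces, often one word at a time.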

Output Results

◁think▷Okay, let's see. The user is asking "Where am I?" and provided an image of a cyberpunk bar scene. The image has a lot of neon lights, futuristic elements, and people sitting at a bar. There's a sign that says "The Heist" and some text about sitting next to Jackie. Also, there's a mini-map in the corner, which might indicate it's from a video game. First, I need to figure out which game this is. The cyberpunk aesthetic with the bar setting and the mini-map suggests it could be Cyberpunk 2077. The Heist as a location is a known area in that game. The characters' outfits and the overall environment match the game's style. The user might be in-game, so the answer should confirm that they're in Cyberpunk 2077, specifically at The Heist bar. I should mention the game's title and the location's name. Also, note the HUD elements as part of the game's interface. Make sure to keep it concise but informative.◁/think▷

You are in **Cyberpunk 2077**, a role-playing video game set in a dystopian future version of Night City. The image shows you inside **The Heist**, a bar or nightclub popular in the game. The cyberpunk aesthetic, neon lighting, and HUD elements (mini-map, objectives like "Sit next to Jackie") align with the game's signature style. This location is known for its vibrant atmosphere and interactions with characters like Jackie Welles, a key figure in the early story.<|im_end|>
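
The output in this example wraps the model's reasoning in `◁think▷` and `◁/think▷` markers and ends with an `<|im_end|>` token. A minimal sketch for separating the reasoning from the final answer follows. The function name `split_reasoning` is hypothetical, and the marker strings are taken only from this one example, so other outputs may differ.

```python
from typing import Tuple

# Marker strings as they appear in the example output above.
THINK_OPEN = "◁think▷"
THINK_CLOSE = "◁/think▷"
END_TOKEN = "<|im_end|>"


def split_reasoning(raw: str) -> Tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw completion string."""
    text = raw.replace(END_TOKEN, "").strip()
    start = text.find(THINK_OPEN)
    end = text.find(THINK_CLOSE)
    # Without a well-formed thinking block, treat everything as the answer.
    if start == -1 or end == -1 or end < start:
        return "", text
    reasoning = text[start + len(THINK_OPEN):end].strip()
    answer = text[end + len(THINK_CLOSE):].strip()
    return reasoning, answer


sample = (
    "◁think▷The HUD suggests a video game.◁/think▷"
    "You are in **Cyberpunk 2077**.<|im_end|>"
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

Falling back to the whole text as the answer keeps the helper usable if a response is truncated before the closing marker.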

Technical Specifications

Hardware Type: L40S
Run Count: 115
Commercial Use: Supported
Platform: Replicate

Related Keywords

Multimodal Reasoning, Vision-Language Processing, Step-by-Step Text Generation, Complex Reasoning Tasks, Solving Complex Math Problems, Analyzing Images, Building AI Agents, Summarizing Multi-Page Documents