KIMI-VL TECHNICAL REPORT

Kimi Team

ABSTRACT

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities, all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https:/

Figure 1: Comparison between Kimi-VL-Thinking and frontier open-source VLMs, including short-thinking VLMs (e.g., the Gemma-3 and Qwen2.5-VL series) and long-thinking VLMs (QVQ-72B-Preview), on the MathVision benchmark (Pass@1 vs. activated parameters). Our model achieves strong multimodal reasoning with just 2.8B activated LLM parameters.

arXiv:2504.07491v2 [cs.CV] 15 Apr 2025
Figure 2: Highlights of Kimi-VL performance across a wide range of benchmarks, including general benchmarks (MMMU, MMBench), OCR (InfoVQA), multi-image (BLINK), long video (LongVideoBench, Video-MME), long document (MMLongBench-Doc), and agent (ScreenSpot-Pro and OSWorld). Detailed results are presented in Table 3.

1 Introduction

With the rapid advancement of artificial intelligence, human expectations for AI assistants have transcended traditional language-only interactions, increasingly aligning with the inherently multimodal nature of our world. To better understand and interact with these expectations, new generations of natively multimodal models, such as GPT-4o (OpenAI et al. 2024) and Google Gemini (Gemini Team et al. 2024), have emerged with the capability to seamlessly perceive and interpret visual inputs alongside language processing. Most recently, advanced multimodal models, pioneered by the OpenAI o1 series (OpenAI 2024) and Kimi k1.5 (K. Team et al. 2025), have further pushed these boundaries by incorporating deeper and longer reasoning on multimodal inputs, thereby tackling more complex problems in the multimodal domain.

Nevertheless, development of large VLMs in the open-source community has significantly lagged behind their language-only counterparts, particularly in aspects of scalability, computational efficiency, and advanced reasoning capabilities. While the language-only model DeepSeek R1 (DeepSeek-AI, D. Guo, et al. 2025) has already leveraged the efficient and more scalable mixture-of-experts (MoE) architecture and facilitated sophisticated long chain-of-thought (CoT) reasoning, most recent open-source VLMs, e.g., Qwen2.5-VL (Bai et al. 2025) and Gemma-3 (Gemma Team et al. 2025), continue to rely on dense architectures and do not support long-CoT reasoning. Early explorations into MoE-based vision-language models, such as DeepSeek-VL2 (Zhiyu Wu et al. 2024) and Aria (D. Li et al. 2024), exhibit limitations in other crucial dimensions. Architecturally, both models still adopt relatively traditional fixed-size vision encoders, hindering their adaptability to diverse visual inputs. From a capability perspective, DeepSeek-VL2 supports only a limited context length (4K), while Aria falls short in fine-grained visual tasks. Additionally, neither of them supports long-thinking abilities. Consequently, there remains a pressing need for an open-source VLM that effectively integrates structural innovation, stable capabilities, and enhanced reasoning through long-thinking.

In light of this, we present Kimi-VL, a vision-language model for the open-source community. Structurally, Kimi-VL consists of our Moonlight (J. Liu et al. 2025a) MoE language model with only 2.8B activated (16B total) parameters, paired with a 400M native-resolution MoonViT vision encoder. In terms of capability, as illustrated in Figure 2, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.). Specifically, it features the following exciting abilities:

1) Kimi-VL is smart: it has comparable text ability to efficient pure-text LLMs; even without long thinking, Kimi-VL is already competitive in multimodal reasoning and multi-turn agent benchmarks, e.g., MMMU, MathVista, and OSWorld.

2) Kimi-VL processes long: it effectively tackles long-context understanding on various multimodal inputs within its 128K context window, far ahead of similar-scale competitors on long video benchmarks and MMLongBench-Doc.

3) Kimi-VL perceives clearly: it shows all-round competitive ability over existing efficient dense and MoE VLMs in various vision-language scenarios: visual perception, visual world knowledge, OCR, high-resolution OS screenshots, etc.

Figure 3: The model architecture of Kimi-VL and Kimi-VL-Thinking, consisting of a MoonViT that allows native-resolution images, an MLP projector, and a Mixture-of-Experts (MoE) language decoder.

Furthermore, with long-CoT activation and reinforcement learning (RL), we introduce the long-thinking version of Kimi-VL, Kimi-VL-Thinking, which further substantially improves performance on more complex multimodal reasoning scenarios. Despite its small scale, Kimi-VL-Thinking offers compelling performance on hard reasoning benchmarks (e.g., MMMU, MathVision, MathVista), outperforming many state-of-the-art VLMs with even larger sizes.
2 Approach

2.1 Model Architecture

The architecture of Kimi-VL consists of three parts: a native-resolution vision encoder (MoonViT), an MLP projector, and an MoE language model, as depicted in Figure 3. We introduce each part in this section.

MoonViT: A Native-resolution Vision Encoder

We design MoonViT, the vision encoder of Kimi-VL, to natively process images at their varying resolutions, eliminating the need for complex sub-image splitting and splicing operations as employed in LLaVA-OneVision (B. Li et al. 2024). We incorporate the packing method from NaViT (Dehghani et al. 2023), where images are divided into patches, flattened, and sequentially concatenated into 1D sequences. These preprocessing operations enable MoonViT to share the same core computation operators and optimizations as a language model, such as the variable-length sequence attention mechanism supported by FlashAttention (Dao et al. 2022), ensuring non-compromised training throughput for images of varying resolutions.
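To make the packing scheme concrete, the following sketch (our own illustration, not the released implementation) shows how images of different resolutions can be cut into patches, flattened, and concatenated into a single 1D sequence, with cumulative sequence lengths recorded for a variable-length attention kernel; the function and variable names are ours.

```python
# Illustrative sketch of NaViT-style patch packing, assuming square patches and a
# downstream variable-length attention kernel such as FlashAttention's varlen interface.
import torch

def pack_images(images, patch_size=14):
    """Pack images of different resolutions into one 1D patch sequence.

    images: list of (C, H, W) tensors whose H and W are multiples of patch_size.
    Returns (packed_patches, cu_seqlens), where packed_patches has shape
    (total_patches, C * patch_size * patch_size) and cu_seqlens marks the
    boundary between images, as expected by variable-length attention.
    """
    flattened, seqlens = [], []
    for img in images:
        c, _, _ = img.shape
        # (C, H, W) -> (num_patches, C*p*p): one row per patch, scanned in raster order.
        patches = (
            img.unfold(1, patch_size, patch_size)   # (C, H/p, W, p)
               .unfold(2, patch_size, patch_size)   # (C, H/p, W/p, p, p)
               .permute(1, 2, 0, 3, 4)
               .reshape(-1, c * patch_size * patch_size)
        )
        flattened.append(patches)
        seqlens.append(patches.shape[0])

    packed = torch.cat(flattened, dim=0)
    cu_seqlens = torch.tensor([0] + seqlens).cumsum(0).to(torch.int32)
    return packed, cu_seqlens

imgs = [torch.randn(3, 224, 308), torch.randn(3, 1008, 672), torch.randn(3, 56, 28)]
packed, cu_seqlens = pack_images(imgs)
# `packed` is embedded and fed through the ViT blocks; `cu_seqlens` tells the varlen
# attention kernel where each image ends, so patches of different images never mix.
print(packed.shape, cu_seqlens.tolist())
```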
Figure 4: The pre-training stages of Kimi-VL consume a total of 4.4T tokens after text-only pre-training of its language model. To preserve text abilities, all stages that update the language model are joint training stages.

MoonViT is initialized from and continually pre-trained on SigLIP-SO-400M (Zhai et al. 2023), which originally employs learnable fixed-size absolute positional embeddings to encode spatial information. While we interpolate these original position embeddings to better preserve SigLIP's capabilities, the interpolated embeddings become increasingly inadequate as image resolution increases. To address this limitation, we incorporate 2D rotary positional embedding (RoPE) (J. Su et al. 2023) across the height and width dimensions, which improves the representation of fine-grained positional information, especially in high-resolution images. These two positional embedding approaches work together to encode spatial information for our model and seamlessly integrate with the flattening and packing procedures. This integration enables MoonViT to efficiently process images of varying resolutions within the same batch. The resulting continuous image features are then forwarded to the MLP projector and, ultimately, to the MoE language model for subsequent training stages.
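The sketch below is our own minimal formulation of 2D RoPE for packed patches, not MoonViT's exact layout: half of the rotation pairs in each head dimension are rotated by the patch's row index and the other half by its column index, so spatial position survives the flattening step.

```python
# Illustrative 2D RoPE sketch (assumed frequency layout; the real implementation may differ).
import torch

def rope_angles(pos, n_freq, base=10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(n_freq).float() / n_freq))
    return torch.outer(pos.float(), inv_freq)               # (n, n_freq)

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_2d_rope(x, rows, cols):
    """x: (n_patches, head_dim); rows/cols: (n_patches,) integer patch coordinates."""
    d4 = x.shape[-1] // 4
    # Half of the rotation pairs encode the row index, the other half the column index.
    ang = torch.cat([rope_angles(rows, d4), rope_angles(cols, d4)], dim=-1)  # (n, head_dim/2)
    cos = torch.cat([ang.cos(), ang.cos()], dim=-1)
    sin = torch.cat([ang.sin(), ang.sin()], dim=-1)
    return x * cos + rotate_half(x) * sin

# Patch (row, col) coordinates are kept per token, so the same call works for packed
# sequences that mix images of different resolutions.
q = torch.randn(6, 64)
rows = torch.tensor([0, 0, 0, 1, 1, 1])
cols = torch.tensor([0, 1, 2, 0, 1, 2])
q_rot = apply_2d_rope(q, rows, cols)
```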
MLP Projector

We employ a two-layer MLP to bridge the vision encoder (MoonViT) and the LLM. Specifically, we first use a pixel shuffle operation to compress the spatial dimension of the image features extracted by MoonViT, performing 2x2 downsampling in the spatial domain and correspondingly expanding the channel dimension. We then feed the pixel-shuffled features into a two-layer MLP to project them into the dimension of the LLM embeddings.
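A minimal sketch of this projector is shown below (our own illustration; the hidden widths are placeholders, with 1152 chosen only because it is the SigLIP-SO-400M feature size and 2048 standing in for the LLM embedding width).

```python
# Illustrative projector sketch: 2x2 space-to-depth "pixel shuffle" followed by a
# two-layer MLP into the LLM embedding dimension.
import torch
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, vit_dim=1152, llm_dim=2048, shuffle=2):
        super().__init__()
        self.shuffle = shuffle
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * shuffle * shuffle, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats, h, w):
        """feats: (h*w, vit_dim) patch features of one image laid out row-major."""
        s = self.shuffle
        x = feats.view(h, w, -1)
        # Group each 2x2 neighbourhood of patches into a single token with 4x channels.
        x = x.view(h // s, s, w // s, s, -1).permute(0, 2, 1, 3, 4)
        x = x.reshape((h // s) * (w // s), -1)        # (h*w/4, vit_dim*4)
        return self.mlp(x)                            # (h*w/4, llm_dim)

proj = Projector()
tokens = proj(torch.randn(16 * 24, 1152), h=16, w=24)   # 384 patches -> 96 LLM tokens
print(tokens.shape)
```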
Mixture-of-Experts (MoE) Language Model

The language model of Kimi-VL utilizes our Moonlight model (J. Liu et al. 2025a), an MoE language model with 2.8B activated parameters, 16B total parameters, and an architecture similar to DeepSeek-V3 (DeepSeek-AI, A. Liu, et al. 2025). For our implementation, we initialize from an intermediate checkpoint in Moonlight's pre-training stage, one that has processed 5.2T tokens of pure text data and activated an 8192-token (8K) context length. We then continue pre-training it using a joint recipe of multimodal and text-only data totaling 2.3T tokens, as detailed in Sec. 2.3.

2.2 Muon Optimizer

We use an enhanced Muon optimizer (J. Liu et al. 2025b) for model optimization. Compared to the original Muon optimizer (Jordan et al. 2024), we add weight decay and carefully adjust the per-parameter update scale. Additionally, we develop a distributed implementation of Muon following the ZeRO-1 (Rajbhandari et al. 2020) optimization strategy, which achieves optimal memory efficiency and reduced communication overhead while preserving the algorithm's mathematical properties. This enhanced Muon optimizer is used throughout the entire training process to optimize all model parameters, including the vision encoder, the projector, and the language model.
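For orientation, the following is a heavily simplified, hedged sketch of a Muon-style update with decoupled weight decay, based on the public descriptions of Muon (Jordan et al. 2024) and Moonlight (J. Liu et al. 2025b); the iteration coefficients, learning rate, and the exact per-parameter scaling rule here are assumptions for illustration, not the authors' verified settings.

```python
# Simplified Muon-style step for a single 2D weight matrix (illustrative only).
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D momentum/gradient matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic iteration coefficients (assumed)
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, momentum, lr=2e-2, mu=0.95, weight_decay=0.1):
    momentum.mul_(mu).add_(grad)                # heavy-ball momentum
    update = newton_schulz(momentum)
    # Per-parameter scale adjustment to keep the update RMS comparable to AdamW (assumed form).
    scale = 0.2 * max(param.shape[0], param.shape[1]) ** 0.5
    param.mul_(1 - lr * weight_decay)           # decoupled weight decay
    param.add_(update, alpha=-lr * scale)
    return param, momentum

W = torch.randn(256, 512)
m = torch.zeros_like(W)
W, m = muon_step(W, torch.randn_like(W), m)
```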
2.3 Pre-Training Stages

As illustrated in Figure 4 and Table 1, after loading the intermediate language model discussed above, Kimi-VL's pre-training comprises a total of 4 stages consuming 4.4T tokens overall: first, standalone ViT training to establish a robust native-resolution visual encoder, followed by three joint training stages (pre-training, cooldown, and long-context activation) that simultaneously enhance the model's language and multimodal capabilities. The details are as follows.

ViT Training Stages

The MoonViT is trained on image-text pairs, where the text components consist of a variety of targets: image alt texts, synthetic captions, grounding bboxes, and OCR texts. The training incorporates two objectives: a SigLIP (Zhai et al. 2023) loss $\mathcal{L}_{\mathrm{siglip}}$ (a variant of contrastive loss) and a cross-entropy loss $\mathcal{L}_{\mathrm{caption}}$ for caption generation conditioned on input images. Following CoCa's approach (J. Yu et al. 2022), the final loss function is formulated as $\mathcal{L} = \mathcal{L}_{\mathrm{siglip}} + \lambda \mathcal{L}_{\mathrm{caption}}$, where $\lambda = 2$. Specifically, the image and text encoders compute the contrastive loss, while the text decoder performs next-token prediction (NTP) conditioned on features from the image encoder. To accelerate training, we initialized both encoders with SigLIP SO-400M (Zhai et al. 2023) weights and implemented a progressive resolution sampling strategy to gradually allow larger sizes; the text decoder is initialized from a tiny decoder-only language model. During training, we observed an emergence in the caption loss while scaling up OCR data, indicating that the text decoder had developed some OCR capabilities. After training the ViT in the CoCa-alike stage with 2T tokens, we align MoonViT to the MoE language model using another 0.1T tokens, where only MoonViT and the MLP projector are updated. This alignment stage significantly reduces the initial perplexity of MoonViT embeddings in the language model, allowing a smoother joint pre-training stage as follows.
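The sketch below illustrates this combined objective in Python (our own simplification, not the training code): a pairwise sigmoid contrastive loss over image/text embeddings plus a next-token caption loss, weighted by lambda = 2. The temperature and bias constants are stand-ins; in SigLIP they are learnable.

```python
# Illustrative sketch of L = L_siglip + 2 * L_caption.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid contrastive loss (Zhai et al. 2023) over a batch of pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b                 # (B, B)
    labels = 2 * torch.eye(logits.shape[0]) - 1          # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

def caption_loss(decoder_logits, caption_ids):
    """Next-token prediction on captions; the decoder is conditioned on image features upstream."""
    return F.cross_entropy(
        decoder_logits[:, :-1].reshape(-1, decoder_logits.shape[-1]),
        caption_ids[:, 1:].reshape(-1),
        ignore_index=-100,
    )

B, D, T, V = 8, 512, 32, 32000
total = siglip_loss(torch.randn(B, D), torch.randn(B, D)) \
        + 2.0 * caption_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)))
```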
Table 1: Overview of training stages: data composition, token volumes, sequence lengths, and trainable components.

| Stage | ViT Training | Joint Pre-training | Joint Cooldown | Joint Long-context |
| --- | --- | --- | --- | --- |
| Data | Alt text + synthetic captions, grounding, OCR | Text, knowledge, interleaving, video, agent | High-quality text, high-quality multimodal, academic sources | Long text, long video, long document |
| Tokens | 2T + 0.1T | 1.4T | 0.6T | 0.3T |
| Sequence length | 8192 | 8192 | 8192 | 32768-131072 |
| Training | ViT | ViT & LLM | ViT & LLM | ViT & LLM |

Table 2: Needle-in-a-Haystack (NIAH) test on text/video haystacks, where needles are uniformly distributed at various positions within the haystack. We report recall accuracy across different haystack lengths up to 131,072 tokens (128K).

| Haystack length | (0, 2048] | (2048, 4096] | (4096, 8192] | (8192, 16384] | (16384, 32768] | (32768, 65536] | (65536, 131072] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Text haystack | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 87.0 |
| Video haystack | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 91.7 |

Joint Pre-training Stage

In the joint pre-training stage, we train the model with a combination of pure text data (sampled from the same distribution as the initial language model) and a variety of multimodal data (as discussed in Sec. 3.1). We continue training from the loaded LLM checkpoint using the same learning rate scheduler, consuming an additional 1.4T tokens. The initial steps utilize solely language data, after which the proportion of multimodal data gradually increases. Through this progressive approach and the previous alignment stage, we observe that joint pre-training preserves the model's language capabilities while successfully integrating visual comprehension abilities.
Joint Cooldown Stage

The stage following the pre-training stage is a multimodal cooldown phase, where the model is continually trained with high-quality language and multimodal datasets to ensure superior performance. For the language part, through empirical investigation, we observe that the incorporation of synthetic data during the cooling phase yields significant performance improvements, particularly in mathematical reasoning, knowledge-based tasks, and code generation. The general text components of the cooldown dataset are curated from high-fidelity subsets of the pre-training corpus. For the math, knowledge, and code domains, we employ a hybrid approach: utilizing selected pre-training subsets while augmenting them with synthetically generated content. Specifically, we leverage existing mathematical knowledge and code corpora as source material to generate question-answer (QA) pairs through a proprietary language model, implementing rejection sampling techniques to maintain quality standards (Yue, Qu, et al. 2023; D. Su et al. 2024). These synthesized QA pairs undergo comprehensive validation before being integrated into the cooldown dataset. For the multimodal part, in addition to the two strategies employed in text cooldown data preparation, i.e., question-answer synthesis and high-quality subset replay, to allow more comprehensive visual-centric perception and understanding (B. Li et al. 2024; Tong et al. 2024; J. Guo et al. 2024), we filter and rewrite a variety of academic visual or vision-language data sources into QA pairs. Unlike in the post-training stages, these language and multimodal QA pairs in the cooldown stage are included only for activating specific abilities and thereby facilitating learning from high-quality data; we therefore keep their ratio at a low portion to avoid overfitting to these QA patterns. The joint cooldown stage significantly improves both the language and multimodal abilities of the model.
Joint Long-context Activation Stage

In the final pre-training stage, we extend the context length of the model from 8192 (8K) to 131072 (128K) tokens, with the base frequency of its RoPE (J. Su et al. 2023) embeddings reset from 50,000 to 800,000. The joint long-context stage is conducted in two sub-stages, where each one extends the model's context length by four times. For data composition, we filter and upsample the ratio of long data to 25% in each sub-stage, while using the remaining 75% of tokens to replay shorter data from the previous stage; our exploration confirms that this composition allows the model to effectively learn long-context understanding while maintaining short-context ability.

Figure 5: The post-training stages of Kimi-VL and Kimi-VL-Thinking, including two stages of joint SFT in 32K and 128K context, and further long-CoT SFT and RL stages to activate and enhance long thinking abilities.

To allow the model to activate long-context abilities on both pure-text and multimodal inputs, the long data used in Kimi-VL's long-context activation consists of not only long text, but also long multimodal data, including long interleaved data, long videos, and long documents. Similar to the cooldown data, we also synthesize a small portion of QA pairs to augment the learning efficiency of long-context activation. After the long-context activation, the model can pass needle-in-a-haystack (NIAH) evaluations with either a long pure-text or a long video haystack, proving its versatile long-context ability. We provide the NIAH recall accuracy across various ranges of context length up to 128K in Table 2.
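The following short sketch (our own illustration, under the assumption of standard RoPE frequency spacing) shows the effect of resetting the base from 50,000 to 800,000: with the larger base, the lowest-frequency dimensions sweep a comparably small fraction of a full rotation over the 16x longer 128K window, so distant positions remain distinguishable.

```python
# Illustrative comparison of RoPE inverse frequencies for the two base values.
import torch

def rope_inv_freq(base, dim=128):
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

for name, base, ctx in [("base=50,000 @ 8K", 50_000, 8_192),
                        ("base=800,000 @ 128K", 800_000, 131_072)]:
    inv_freq = rope_inv_freq(base)
    # Fraction of a full period swept by the slowest-rotating dimension over the window.
    cycles = (inv_freq[-1] * ctx / (2 * torch.pi)).item()
    print(f"{name}: slowest dimension completes {cycles:.3f} cycles")
```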
2.4 Post-Training Stages

Joint Supervised Fine-tuning (SFT)

In this phase, we fine-tune the base model of Kimi-VL with instruction-based fine-tuning to enhance its ability to follow instructions and engage in dialogue, culminating in the creation of the interactive Kimi-VL model. This is achieved by employing the ChatML format (OpenAI, 2024), which allows for targeted instruction optimization while maintaining architectural consistency with Kimi-VL. We optimize the language model, MLP projector, and vision encoder using a mixture of pure-text and vision-language SFT data, which will be described in Sec. 3.2. Supervision is applied only to answers and special tokens, with system and user prompts being masked. The model is exposed to a curated set of multimodal instruction-response pairs, where explicit dialogue role tagging, structured injection of visual embeddings, and preservation of cross-modal positional relationships are ensured through format-aware packing. Additionally, to guarantee the model's comprehensive proficiency in dialogue, we incorporate a mix of multimodal data and the pure-text dialogue data used in Moonlight, ensuring its versatility across various dialogue scenarios.
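The sketch below (our own, using a stand-in tokenizer) illustrates ChatML-style formatting with this loss-masking rule: only assistant answer tokens and their closing special tokens receive supervision, while system and user tokens are masked with -100. The exact chat template and special tokens of the released model may differ.

```python
# Illustrative ChatML formatting with answer-only supervision.
IGNORE_INDEX = -100

def build_sft_example(messages, tokenizer):
    input_ids, labels = [], []
    for msg in messages:
        header = tokenizer.encode(f"<|im_start|>{msg['role']}\n", add_special_tokens=False)
        body = tokenizer.encode(msg["content"], add_special_tokens=False)
        footer = tokenizer.encode("<|im_end|>\n", add_special_tokens=False)
        tokens = header + body + footer
        input_ids += tokens
        if msg["role"] == "assistant":
            # Supervise the answer and its closing special tokens, but not the role header.
            labels += [IGNORE_INDEX] * len(header) + body + footer
        else:
            labels += [IGNORE_INDEX] * len(tokens)
    return input_ids, labels

messages = [
    {"role": "system", "content": "You are Kimi-VL, a helpful multimodal assistant."},
    {"role": "user", "content": "<image>\nWhat is shown in this chart?"},
    {"role": "assistant", "content": "The chart shows quarterly revenue growth."},
]
# input_ids, labels = build_sft_example(messages, tokenizer)  # tokenizer is hypothetical here
```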
We first train the model at a sequence length of 32K tokens for 1 epoch, followed by another epoch at a sequence length of 128K tokens. In the first stage (32K), the learning rate decays from 2e-5 to 2e-6, before it is re-warmed up to 1e-5 in the second stage (128K) and finally decays to 1e-6. To improve training efficiency, we pack multiple training examples into each single training sequence.
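A minimal sketch of such example packing is given below (our own, greedy first-fit variant): tokenized examples are concatenated until the 32K or 128K budget is reached, so little of the window is spent on padding. Cross-example attention masking is assumed to be handled separately, e.g., via per-example sequence boundaries.

```python
# Illustrative greedy packing of SFT examples into fixed-length training sequences.
def pack_examples(examples, max_len=32_768):
    """examples: list of (input_ids, labels) pairs; ids and labels have equal length."""
    sequences, cur_ids, cur_labels = [], [], []
    for ids, labels in examples:
        if cur_ids and len(cur_ids) + len(ids) > max_len:
            sequences.append((cur_ids, cur_labels))
            cur_ids, cur_labels = [], []
        cur_ids += ids
        cur_labels += labels
    if cur_ids:
        sequences.append((cur_ids, cur_labels))
    return sequences
```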
Long-CoT Supervised Fine-Tuning

With the refined RL prompt set, we employ prompt engineering to construct a small yet high-quality long-CoT warmup dataset, containing accurately verified reasoning paths for both text and image inputs. This approach resembles rejection sampling (RS) but focuses on generating long-CoT reasoning paths through prompt engineering. The resulting warmup dataset is designed to encapsulate key cognitive processes that are fundamental to human-like reasoning, such as planning, where the model systematically outlines steps before execution; evaluation, involving critical assessment of intermediate steps; reflection, enabling the model to reconsider and refine its approach; and exploration, encouraging consideration of alternative solutions. By performing a lightweight SFT on this warm-up dataset, we effectively prime the model to internalize these multimodal reasoning strategies. As a result, the fine-tuned long-CoT model demonstrates an improved capability to generate more detailed and logically coherent responses, which enhances its performance across diverse reasoning tasks.
Reinforcement Learning

To further advance the model's reasoning abilities, we then train the model with reinforcement learning (RL), enabling the model to autonomously generate structured CoT rationales. Specifically, similar to Kimi k1.5 (K. Team et al. 2025), we adopt a variant of online policy mirror descent as our RL algorithm, which iteratively refines the policy model $\pi_\theta$ to improve its problem-solving accuracy. During the $i$-th training iteration, we treat the current model $\pi_{\theta_i}$ as a reference policy model and optimize the following objective, regularized by relative entropy to stabilize policy updates:

$$\max_{\theta}\; \mathbb{E}_{(x,\,y^{*})\sim\mathcal{D}}\Big[\,\mathbb{E}_{(y,\,z)\sim\pi_{\theta}}\big[r(x,y,y^{*})\big] \;-\; \tau\,\mathrm{KL}\big(\pi_{\theta}(x)\,\big\|\,\pi_{\theta_i}(x)\big)\Big], \qquad (1)$$

where $r$ is a reward model that justifies the correctness of the proposed answer $y$ for the given problem $x$, by assigning a value $r(x, y, y^{*}) \in \{0, 1\}$ based on the ground truth $y^{*}$, and $\tau \geq 0$ is a parameter controlling the degree of regularization.
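The sketch below is our own simplification of how the objective in Eq. (1) could be estimated on a sampled batch, including an optional length-based shaping term of the kind described in the next paragraph; it is not the authors' training code, and the single-sample KL estimator and constants are assumptions.

```python
# Illustrative batch estimate of the regularized objective in Eq. (1).
import torch

def rl_objective(correct, logp_cur, logp_ref, response_len, tau=0.1,
                 len_penalty=0.0, max_len=8192):
    """Per-sample tensors over sampled (x, y, z) rollouts.

    correct:      0/1 correctness reward r(x, y, y*) from the verifier.
    logp_cur/ref: sequence log-probabilities under pi_theta and the reference pi_theta_i.
    """
    # Optional length-based shaping to discourage overly long reasoning chains.
    reward = correct - len_penalty * (response_len.float() / max_len)
    # Single-sample KL estimate: log pi_theta(y|x) - log pi_theta_i(y|x).
    kl_est = logp_cur - logp_ref
    return (reward - tau * kl_est).mean()

obj = rl_objective(
    correct=torch.tensor([1.0, 0.0, 1.0]),
    logp_cur=torch.tensor([-120.0, -340.0, -95.0]),
    logp_ref=torch.tensor([-125.0, -330.0, -96.0]),
    response_len=torch.tensor([800, 4000, 600]),
    len_penalty=0.2,
)
```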
Each training iteration begins by sampling a problem batch from the dataset $\mathcal{D}$, and the model parameters are updated to $\theta_{i+1}$ using the policy gradient derived from (1), with the optimized policy model subsequently assuming the role of reference policy for the subsequent iteration. To enhance RL training efficiency, we implement a length-based reward to penalize excessively long responses, mitigating the overthinking problem where the model generates redundant reasoning chains. Besides, we employ two sampling strategies, curriculum sampling and prioritized sampling, which leverage difficulty labels and per-instance success rates to focus training effort on the most pedagogically valuable examples, thereby optimizing the learning trajectory and improving training efficiency.
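The following sketch (our own illustration, not the authors' sampler) shows one simple way such strategies can be combined: a curriculum stage gates problems by difficulty label, and prioritized sampling weights problems by one minus their running success rate so that unsolved problems are revisited more often.

```python
# Illustrative curriculum + prioritized sampling.
import random

def sample_batch(problems, success_rate, stage, batch_size=8):
    """problems: dicts with 'id' and 'difficulty' in {'easy','medium','hard'}.
    success_rate: dict id -> running pass rate in [0, 1]."""
    allowed = {0: {"easy"}, 1: {"easy", "medium"}, 2: {"easy", "medium", "hard"}}[stage]
    pool = [p for p in problems if p["difficulty"] in allowed]
    # Prioritize low-success problems; keep a small floor so solved ones still reappear.
    weights = [max(1.0 - success_rate.get(p["id"], 0.0), 0.05) for p in pool]
    return random.choices(pool, weights=weights, k=batch_size)

problems = [{"id": i, "difficulty": d} for i, d in enumerate(["easy", "medium", "hard"] * 4)]
batch = sample_batch(problems, success_rate={0: 0.9, 1: 0.2}, stage=1)
```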
Through large-scale reinforcement learning training, we can derive a model that harnesses the strengths of both basic prompt-based CoT reasoning and sophisticated planning-enhanced CoT approaches. During inference, the model maintains standard autoregressive sequence generation, eliminating the deployment complexities associated with specialized planning algorithms that require parallel computation. Simultaneously, the model develops essential meta-reasoning abilities, including error detection, backtracking, and iterative solution refinement, by effectively utilizing the complete history of explored reasoning paths as contextual information. With endogenous learning from its complete reasoning trace history, the model can effectively encode planned search procedures into its parametric knowledge.
81、tual information.With endogenous learning from its completereasoning trace history,the model can effectively encode planned search procedures into its parametric knowledge.2.5InfrastructureStorage We utilize S3(Amazon Web Services 2023)compatible object storage from cloud service vendors to store ou
82、rvisual-text data.To minimize the time between data preparation and model training,we store visual data in its originalformat and have developed an effi cient and flexible data loading system.This system provides several key benefi ts:Supports on-the-fly data shuffling,mixing,tokenization,loss maski
83、ng and packing during training,allowing us toadjust data proportions as needed;Enables random augmentation of both visual and text data,while preserving the correctness of 2D coordinate andorientation information during transformations;Ensures reproducibility by strictly controlling random states an
84、d other states across different data loader workers,guaranteeing that any interrupted training can be resumed seamlesslythe data sequence after resumption remainsidentical to an uninterrupted run;Delivers high-performance data loading:through multiple caching strategies,our system reliably supports
85、trainingon large scale clusters while maintaining controlled request rates and throughput to the object storage.Additionally,to ensure consistent dataset quality control,we developed a centralized platform for data registration,visualization,compiling statistics,synchronizing data across cloud stora
Parallelism

We adopt a 4D parallelism strategy, combining Data Parallelism (S. Li et al. 2020), Expert Parallelism (Fedus et al. 2022), Pipeline Parallelism (Y. Huang et al. 2019; Narayanan et al. 2021), and Context Parallelism (Jacobs et al. 2023; H. Liu et al. 2023), to accelerate the training of Kimi-VL. After optimizing the parallel strategies, the resulting training throughput of our model is around 60% higher than that of a 7B dense VLM (e.g., VLMs based on Qwen2.5-7B).

- Data Parallelism (DP). DP replicates the model across multiple devices, each processing different micro-batches. This setup allows larger effective batch sizes by simply increasing the number of devices.
- Expert Parallelism (EP). EP distributes expert modules in the MoE layer across multiple devices. When combined with DP, experts on a given device can handle tokens from different DP groups, enhancing computational efficiency.
- Pipeline Parallelism (PP). PP splits the model into multiple layer-based stages. To minimize pipeline bubbles, we allocate the Vision Tower (VT) and several decoder layers to the first stage, place the output layer and additional decoder layers in the last stage, and distribute the remaining decoder layers evenly across intermediate stages based on their time overhead.
- Context Parallelism (CP). CP addresses long-sequence training by splitting sequences across different CP ranks in conjunction with flash attention (Dao et al. 2022). This substantially reduces peak memory usage and relieves the memory pressure from attention computations.

Beyond these four parallel strategies, we incorporate ZeRO-1 (Rajbhandari et al. 2020) and Selective Checkpointing Activation (T. Chen et al. 2016; Korthikanti et al. 2022) to further optimize memory usage. ZeRO-1 reduces optimizer state overhead by using a distributed optimizer while avoiding extra communication costs. Selective Checkpointing Activation trades time for space by recomputing only those layers that have low time overhead but high memory consumption, striking a balance between computation efficiency and memory demands. For extremely long sequences, we expand recomputation to a broader set of layers to prevent out-of-memory errors.
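As a concrete illustration of selective activation checkpointing (our own sketch, not the training framework), cheap-to-recompute layers can be wrapped in torch.utils.checkpoint while the rest keep their activations; widening the set of checkpointed layers corresponds to the out-of-memory fallback described above.

```python
# Illustrative selective activation checkpointing over a stack of blocks.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SelectiveBlockStack(nn.Module):
    def __init__(self, blocks, recompute_every=2):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.recompute_every = recompute_every   # shrink this (checkpoint more) for very long sequences

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i % self.recompute_every == 0:
                # Activations of this block are dropped and recomputed during backward.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

blocks = [nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
          for _ in range(8)]
model = SelectiveBlockStack(blocks)
out = model(torch.randn(4, 512, requires_grad=True))
```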
Figure 6: Manuscript reasoning visualization. Kimi-VL-Thinking demonstrates the ability to perform historical and scientific inference by analyzing handwritten manuscripts step by step. In this example, our model identifies the author as Albert Einstein based on handwriting style, content analysis, and language cues. It reasons that the manuscripts relate to gravitational field equations, consistent with Einstein's contributions to general relativity.

3 Data Construction

3.1 Pre-Training Data

Our multimodal pre-training corpus is designed to provide high-quality data that enables models to process and understand information from multiple modalities, including text, images, and videos. To this end, we have curated high-quality data from six categories (caption, interleaving, OCR, knowledge, video, and agent) to form the corpus. When constructing our training corpus, we developed several multimodal data processing pipelines to ensure data quality, encompassing filtering, synthesis, and deduplication.
Establishing an effective multimodal data strategy is crucial during the joint training of vision and language, as it both preserves the capabilities of the language model and facilitates alignment of knowledge across diverse modalities. We provide a detailed description of these sources in this section, organized into the following categories.

Caption Data

Our caption data provides the model with fundamental modality alignment and a broad range of world knowledge. By incorporating caption data, the multimodal LLM gains wider world knowledge with high learning efficiency. We have integrated various open-source Chinese and English caption datasets (Schuhmann et al. 2022; Gadre et al. 2024) and also collected substantial in-house caption data from multiple sources. However, throughout the training process, we strictly limit the proportion of synthetic caption data to mitigate the risk of hallucination stemming from insufficient real-world knowledge.

For general caption data, we follow a rigorous quality control pipeline that avoids duplication and maintains high image-text correlation. We also vary image resolution during pre-training to ensure that the vision tower remains effective when processing images of both high and low resolution.
Image-text Interleaving Data

During the pre-training phase, the model benefits from interleaving data in many aspects. For example, multi-image comprehension ability can be boosted by interleaving data; interleaving data always provides detailed knowledge for the given image; and the ability to learn from longer multimodal contexts can also be gained from interleaving data. What's more, we also find that interleaving data contributes positively to maintaining the model's language abilities. Thus, image-text interleaving data is an important part of our training corpus. Our multimodal corpus incorporates open-source interleaved datasets (Zhu et al. 2024; Laurençon et al. 2024) and also includes large-scale in-house data constructed from resources like textbooks, webpages, and tutorials. Further, we find that synthesizing interleaving data helps the multimodal LLM retain text knowledge. To ensure that each image's knowledge is sufficiently studied, for all the interleaving data, in addition to the standard filtering, deduplication, and other quality-control pipelines, we also integrate a data reordering procedure to keep all the images and text in the correct order.
OCR Data

Optical Character Recognition (OCR) is a widely adopted technique that converts text from images into an editable format. In our model, a robust OCR capability is deemed essential for better aligning the model with human values. Accordingly, our OCR data sources are diverse, ranging from open-source to in-house datasets, encompassing both clean and augmented images, and spanning single-page and multi-page inputs.

In addition to the publicly available data, we have developed a substantial volume of in-house OCR datasets, covering multilingual text, dense text layouts, web-based content, and handwritten samples. Furthermore, following the principles outlined in OCR 2.0 (Wei et al. 2024), our model is also equipped to handle a variety of optical image types, including figures, tables, geometry diagrams, mermaid plots, and natural scene text. We apply extensive data augmentation techniques, such as rotation, distortion, color adjustments, and noise addition, to enhance the model's robustness. As a result, our model achieves a high level of proficiency in OCR tasks.

In addition to single-page OCR data, we collect and convert a large volume of in-house multi-page OCR data to activate the model's understanding of long documents in the real world. With the help of these data, our model is not only capable of performing accurate OCR on a single image but can also comprehend an entire academic paper or a scanned book.
Knowledge Data

The concept of multimodal knowledge data is analogous to the previously mentioned text pre-training data, except that here we focus on assembling a comprehensive repository of human knowledge from diverse sources to further enhance the model's capabilities. For example, carefully curated geometry data in our dataset is vital for developing visual reasoning skills, ensuring the model can interpret the abstract diagrams created by humans.

Our knowledge corpus adheres to a standardized taxonomy to balance content across various categories, ensuring diversity in data sources. Similar to text-only corpora, which gather knowledge from textbooks, research papers, and other academic materials, multimodal knowledge data employs both a layout parser and an OCR model to process content from these sources, while we also include filtered data from internet-based and other external resources. Because a significant portion of our knowledge corpus is sourced from internet-based materials, infographics can cause the model to focus solely on OCR-based information. In such cases, relying exclusively on a basic OCR pipeline may limit training effectiveness. To address this, we have developed an additional pipeline that better captures the purely textual information embedded within images.
130、the purelytextual information embedded within images.Agent Data For agent tasks,the models grounding and planning capabilities have been signifi cantly enhanced.In addition to utilizing publicly available data,a platform has been established to effi ciently manage and executevirtual machine environm
131、ents in bulk.Within these virtual environments,heuristic methods were employed tocollect screenshots and corresponding action data.This data was then processed into dense grounding formats andcontinuous trajectory formats.The design of the Action Space was categorized according to Desktop,Mobile,and
132、 Webenvironments.Furthermore,icon data was collected to strengthen the models understanding of the meanings of iconswithin software graphical user interfaces(GUIs).To enhance the models planning ability for solving multi-step desktoptasks,a set of computer-use trajectories was collected from human a
133、nnotators,each accompanied by synthesizedChain-of-Thought(Aguvis(Yiheng Xu et al.2024).These multi-step agent demonstrations equip Kimi-VL with thecapability to complete real-world desktop tasks(on both Ubuntu and Windows).Video Data In addition to image-only and image-text interleaved data,we also
Video Data

In addition to image-only and image-text interleaved data, we also incorporate large-scale video data during the pre-training, cooldown, and long-context activation stages to enable two essential abilities of our model: first, to understand a long-context sequence dominated by images (e.g., hour-long videos) in addition to long text; second, to perceive fine-grained spatio-temporal correspondence in short video clips.

Our video data are sourced from diverse resources, including open-source datasets as well as in-house web-scale video data, and span videos of varying durations. Similarly, to ensure sufficient generalization ability, our video data cover a wide range of scenes and diverse tasks, such as video description and video grounding, among others. For long videos, we carefully design a pipeline to produce dense captions. As with the caption data, we strictly limit the proportion of synthetic dense video description data to reduce the risk of hallucinations.
Text Data

Our text pre-training corpus directly utilizes the data in Moonlight (J. Liu et al. 2025a), which is designed to provide comprehensive and high-quality data for training large language models (LLMs). It encompasses five domains: English, Chinese, Code, Mathematics & Reasoning, and Knowledge. We employ sophisticated filtering and quality control mechanisms for each domain to ensure the highest quality training data. For all pre-training data, we conducted rigorous individual validation for each data source to assess its specific contribution to the overall training recipe. This systematic evaluation ensures the quality and effectiveness of our diverse data composition. To optimize the overall composition of our training corpus, the sampling strategy for different document types is empirically determined through extensive experimentation. We conduct isolated evaluations to identify document subsets that contribute most significantly to the model's knowledge acquisition capabilities. These high-value subsets are upsampled in the final training corpus. However, to maintain data diversity and ensure model generalization, we carefully preserve a balanced representation of other document types at appropriate ratios. This data-driven approach helps us optimize the trade-off between focused knowledge acquisition and broad generalization capabilities.
Table 3: Performance of Kimi-VL against proprietary and open-source efficient VLMs; the performance of GPT-4o is also listed in gray for reference. Top and second-best models are in boldface and underline, respectively. Some results of competing models are unavailable due to limitations of model ability on specific tasks or model context length. *GPT-4o and GPT-4o-mini results use Omniparser without UIA, according to Bonatti et al. 2024.

| Category | Benchmark (Metric) | GPT-4o | GPT-4o-mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Architecture | - | - | Dense | Dense | Dense | MoE | MoE |
| | #Act. Params (LLM+VT) | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| | #Total Params | - | - | 8B | 11B | 12B | 28B | 16B |
| College-level | MMMU_val (Pass@1) | 69.1 | 60.0 | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| College-level | VideoMMMU (Pass@1) | 61.2 | - | 47.4 | 41.8 | 57.2 | 44.4 | 52.6 |
| College-level | MMVU_val (Pass@1) | 67.4 | 61.6 | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| General | MMBench-EN-v1.1 (Acc) | 83.1 | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | 83.1 |
| General | MMStar (Acc) | 64.7 | 54.8 | 63.9 | 49.8 | 56.1 | 55.5 | 61.3 |
| General | MMVet (Pass@1) | 69.1 | 66.9 | 67.1 | 57.6 | 64.9 | 60.0 | 66.7 |
| General | RealWorldQA (Acc) | 75.4 | 67.1 | 68.5 | 63.3 | 59.1 | 68.4 | 68.1 |
| General | AI2D (Acc) | 84.6 | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | 84.9 |
| Multi-image | BLINK (Acc) | 68.0 | 53.6 | 56.4 | 39.8 | 50.3 | - | 57.3 |
| Math | MathVista (Pass@1) | 63.8 | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | 68.7 |
| Math | MathVision (Pass@1) | 30.4 | - | 25.1 | 13.6 | 32.1 | 17.3 | 21.4 |
| OCR | InfoVQA (Acc) | 80.7 | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | 83.2 |
| OCR | OCRBench (Acc) | 815 | 785 | 864 | 753 | 702 | 811 | 867 |
| OS Agent | ScreenSpot-V2 (Acc) | 18.1 | - | 86.8 | - | - | - | 92.8 |
| OS Agent | ScreenSpot-Pro (Acc) | 0.8 | - | 29.0 | - | - | - | 34.5 |
| OS Agent | OSWorld (Pass@1) | 5.03 | - | 2.5 | - | - | - | 8.22 |
| OS Agent | WindowsAgentArena (Pass@1)* | 9.4 | 2.7 | 3.4 | - | - | - | 10.4 |
| Long Document | MMLongBench-Doc (Acc) | 42.8 | 29.0 | 29.6 | 13.8 | 21.3 | - | 35.1 |
| Long Video | Video-MME (w/o sub. / w/ sub.) | 71.9/77.2 | 64.8/68.9 | 65.1/71.6 | 46.0/49.5 | 58.2/62.1 | - | 67.8/72.6 |
| Long Video | MLVU_MCQ (Acc) | 64.6 | 48.1 | 70.2 | 44.4 | 52.3 | - | 74.2 |
| Long Video | LongVideoBench_val | 66.7 | 58.2 | 56.0 | 45.5 | 51.5 | - | 64.5 |
| Video Perception | EgoSchema_full | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | 78.5 |
| Video Perception | VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | 37.4 |
| Video Perception | TOMATO | 37.7 | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | 31.7 |
Figure 7: Kimi-VL exhibits strong visual reasoning capabilities by grounding visual content in spatial, contextual, and cultural knowledge. It accurately identifies matching urban locations based on structural and layout features, interprets scenes from video games like Cyberpunk 2077 using stylistic cues, and recognizes real-world landmarks such as the Rogers Centre in Toronto.
3.2 Instruction Data

At this stage, the data is primarily aimed at enhancing the model's conversational abilities and instruction-following capabilities. To cover as many scenarios as possible, we enrich the data across different domains. For non-reasoning tasks, including chart interpretation, agent grounding, OCR, image-grounded conversations, question answering, writing, and text processing, we initially construct a seed dataset through human annotation. This seed dataset is used to train a seed model. Subsequently, we collect a diverse set of prompts and employ the seed model to generate multiple responses to each prompt. Annotators then rank these responses and refine the top-ranked response to produce the final version. For reasoning tasks like visual coding, visual reasoning, and math/science problems, where rule-based and model-based verifications are more accurate and efficient than human judgment, we utilize rejection sampling to expand the SFT dataset. The complete vanilla SFT dataset comprises approximately a 1:1 ratio of text tokens to image tokens.
3.3 Reasoning Data

Figure 8: Kimi-VL demonstrates its capability to perform symbolic reasoning and geometric inference by solving a circle geometry problem step by step. The model analyzes given conditions, applies geometric theorems such as the inscribed angle theorem and properties of triangle angles, and accurately derives the target angle.

Our reasoning data is meticulously constructed for the activation and enhancement of the model's multimodal reasoning capabilities during both the long-CoT supervised fine-tuning and reinforcement learning stages. Through a generation pipeline that resembles rejection sampling (RS) combined with prompt engineering, we collect and synthesize a substantial amount of high-quality long-CoT data. Specifically, we first assemble a collection of QA data with ground-truth annotations that require multi-step reasoning, such as mathematical problem solving and domain-specific VQA. Subsequently, we sample multiple detailed reasoning trajectories for each question by leveraging a powerful long-CoT model, Kimi k1.5 (K. Team et al. 2025), with curated reasoning prompts. In rejection sampling, we feed the true labels and model predictions into an off-the-shelf reward model for judgment. Wrong chain-of-thought responses are filtered out according to this model evaluation as well as some rule-based rewards, thus improving the reasoning data quality.
4 Evaluation

We begin by presenting our comprehensive model and conducting a comparative analysis with leading state-of-the-art (SoTA) solutions. Following this introduction, we proceed to assess various sub-capabilities of the model through detailed performance evaluations. This part examines how effectively the model handles different tasks and scenarios, providing insights into its strengths and limitations across diverse functional domains.

4.1 Comparison to the State-of-the-Art Models

Table 3 presents a comprehensive evaluation of Kimi-VL against state-of-the-art vision-language models across multiple benchmarks. Despite having a more parameter-efficient architecture (2.8B + 0.4B activated parameters) than larger models such as GPT-4o, Llama-3.2-11B-Inst., and Gemma3-12B-IT, Kimi-VL demonstrates competitive or superior performance in several key areas. Our model employs a Mixture-of-Experts (MoE) architecture similar to DeepSeek-VL2, but outperforms it on most benchmarks with significantly fewer parameters (activated: 2.8B vs. 4.5B; total: 16B vs. 28B); it also outperforms Qwen2.5-VL-7B (actually 8.3B) on 19 out of 24 benchmarks, even though the latter has 2.59x more activated parameters. The following sections analyze performance across specific domains, revealing Kimi-VL's strengths in OCR, math, agent tasks, long-form content understanding, and multi-image and video perception.

4.1.1 College-level Academic Problems

Our Kimi-VL model demonstrates competitive performance on college-level academic benchmarks. On the MMMU validation set, it achieves a score of 57.0%, which outperforms DeepSeek-VL2 (51.1%) and is comparable to Qwen2.5-VL-7B (58.6%) and even Gemma-3-12B-IT (59.6%), despite having significantly fewer activated parameters. On video college-level problems, it significantly outperforms Qwen2.5-VL-7B and DeepSeek-VL2 and is only behind the much larger Gemma-3-12B-IT, demonstrating reasonable university-level understanding capabilities compared to larger models. These results indicate that Kimi-VL effectively balances parameter efficiency with academic reasoning abilities.
Figure 9: Diverse OCR visualization. Kimi-VL demonstrates strong OCR capabilities across varied content types, including structured financial tables, complex mathematical formulas, and handwritten Chinese text. The model accurately parses tabular data into markdown, converts formulas to LaTeX, and transcribes handwritten paragraphs with contextual understanding, showcasing its versatility in multimodal text extraction and interpretation.
4.1.2 General Visual Ability

Kimi-VL exhibits strong general visual understanding capabilities across multiple benchmarks. On MMBench-EN-v1.1, it achieves 83.1% accuracy, outperforming all efficient VLMs in comparison and performing on par with GPT-4o. For AI2D, our model achieves 84.9% and surpasses all compared models, including GPT-4o (84.6%). On MMVet, Kimi-VL scores 66.7% and ties closely with Qwen2.5-VL-7B (67.1%) and GPT-4o-mini (66.9%). For RealWorldQA, it achieves 68.1%, outperforming Gemma3-12B (59.1%) and approaching Qwen2.5-VL-7B (68.5%). These results demonstrate that our model maintains robust general visual understanding despite its compact architecture.

In multi-image reasoning tasks, Kimi-VL shows promising capabilities with a score of 57.3% on the BLINK benchmark. This performance surpasses Qwen2.5-VL-7B (56.4%), GPT-4o-mini (53.6%), Gemma3-12B-IT (50.3%), and Llama3.2-11B-Inst. (39.8%). The ability to reason across multiple images requires understanding spatial and temporal relationships between visual elements, which our model handles effectively with fewer parameters than most competitors.

4.1.3 Mathematical Reasoning

With its relatively small scale, Kimi-VL also demonstrates strong mathematical reasoning capabilities, particularly on the MathVista benchmark, where it achieves 68.7%, outperforming all compared models including GPT-4o (63.8%) and Qwen2.5-VL-7B (68.2%). This indicates our model's exceptional ability to understand and solve mathematical problems presented in visual contexts. On the more challenging MathVision benchmark, due to limited activated parameters,
[Figure 10 panels, Steps 1-12: each panel pairs a Chrome screenshot with the model's Thought, Action, and Toolcall, transcribed below in the order they appear in the figure layout.]

Instruction: Can you enable the Do Not Track feature in Chrome to enhance my online privacy?

Thought: ... To enable the Do Not Track feature in Chrome, I need to access Chrome's settings. Looking at the interface, I can see the vertical ellipsis menu button in the top-right corner of the browser window. This menu button is the standard way to access Chrome's settings and other features.
Action: Click on the vertical ellipsis menu button in the top-right corner of the Chrome window to open the browser menu.
Toolcall: click(x=0.884, y=0.144)

Thought: ... I can see that Security is one of the options in the Privacy and security section. This makes sense as Do Not Track is a security feature that helps protect user privacy by preventing websites from tracking browsing activity.
Action: Click on the Security option in the Privacy and security section to access more detailed security settings.
Toolcall: click(x=0.522, y=0.884)

Thought: ... The next logical step would be to access Chrome's settings, where privacy options like Do Not Track are typically located. Looking at the menu that's currently open, I can see the Settings option at the bottom of the menu. This would be the most direct path to access privacy settings.
Action: Click on the Settings option in the Chrome menu to access the browser's settings page.
Toolcall: click(x=0.744, y=0.851)

Thought: ... Now, I need to find where the Do Not Track setting is located. Based on my knowledge of Chrome's settings, this feature is typically found in the Privacy and security section. I can see that Privacy and security is one of the options in the left sidebar menu.
Action: Click on the Privacy and security option in the left sidebar menu.
Toolcall: click(x=0.386, y=0.311)

Thought: ... Now we need to find the Do Not Track setting. Looking at the current screen, I can see that Do Not Track is not immediately visible in the list of options. However, I notice that there's a Security option with a lock icon that might contain privacy-related settings including Do Not Track.
Action: Click on the Security option in the Privacy and security section to access more detailed security settings.
Toolcall: click(x=0.522, y=0.884)

Thought: ... To find the Do Not Track setting, I need to scroll down in the Security section. The current view doesn't show all available options, and the Do Not Track setting is likely further down the page. Scrolling down will reveal more settings options that aren't currently visible.
Action: Scroll down in the Security settings page to reveal more options.
Toolcall: scroll(-5)

Thought: ... The Manage HTTPS/SSL certificates and settings option is visible in the list. This is likely where the Do Not Track setting would be located, as certificate management is related to privacy and security features.
Action: Click on the Manage HTTPS/SSL certificates and settings option to access certificate management settings.
Toolcall: click(x=0.527, y=0.883)

Thought: ... Looking at the available options, Third-party cookies is visible in the list. This option controls whether third-party cookies are blocked in Incognito mode. While this isn't exactly Do Not Track, it is related to privacy controls.
Action: Click on the Third-party cookies option in the Privacy and security section to access its settings.
Toolcall: click(x=0.542, y=0.772)

Thought: ... Since the current page doesn't show the Do Not Track setting, the logical next action would be to continue navigating through the settings. The back button in the browser would allow the user to return to the previous page and continue their search for the Do Not Track setting.
Action: Click the back button in the browser to return to the previous page.
Toolcall: click(x=0.486, y=0.242)

Thought: ... I notice there's an Advanced section in the current view that contains a Send a Do Not Track request with your browsing traffic option. This is exactly what we need to enable the Do Not Track feature. The toggle switch for this option is currently off.
Action: Click the toggle switch next to Send a Do Not Track request with your browsing traffic to enable the feature.
Toolcall: click(x=0.808, y=0.871)

Thought: ... The dialog explains what Do Not Track does and has two buttons: Cancel and Confirm. To complete the task of enabling this feature, we need to confirm the action by clicking the Confirm button.
Action: Click the Confirm button to enable the Do Not Track feature.
Toolcall: click(x=0.708, y=0.669)

Thought: ... I notice that we're currently on the Manage certificates page, which is not where the Do Not Track setting would be located. To enable the Do Not Track feature, we need to navigate to the appropriate settings section.
Action: Click on the Privacy and security option in the left sidebar menu to expand its submenu.
Toolcall: click(x=0.365, y=0.317)

Figure 10: Kimi-VL is capable of following multi-step reasoning processes to complete complex GUI tasks. In this example, it successfully enables the “Do Not Track” feature in the Chrome browser to enhance online privacy. The agent interprets each screen, identifies relevant UI elements, and performs the appropriate actions sequentially with clear thoughts, actions, and API calls.
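The toolcalls in Figure 10 pass what appear to be screen coordinates normalized to [0, 1]. As a rough illustration of how such calls could be turned into concrete UI actions, the sketch below parses toolcall strings and rescales the coordinates to a target resolution; the grammar, helper names, and Action structure are assumptions inferred from the figure, not the report's actual agent interface.

```python
# Hypothetical sketch: parsing Kimi-VL-style toolcalls such as
# "click(x=0.884,y=0.144)" or "scroll(-5)" and mapping the normalized
# coordinates onto a concrete screen resolution. The toolcall grammar here
# is inferred from Figure 10 and is an assumption, not the official spec.
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    kind: str                                # "click" or "scroll"
    point: Optional[Tuple[int, int]] = None  # pixel coordinates for clicks
    amount: Optional[int] = None             # scroll amount (sign = direction)

CLICK_RE = re.compile(r"click\(x=([\d.]+),\s*y=([\d.]+)\)")
SCROLL_RE = re.compile(r"scroll\((-?\d+)\)")

def parse_toolcall(call: str, screen_w: int, screen_h: int) -> Action:
    """Convert a toolcall string into an executable Action in pixel space."""
    m = CLICK_RE.fullmatch(call.strip())
    if m:
        x_norm, y_norm = float(m.group(1)), float(m.group(2))
        return Action("click", point=(round(x_norm * screen_w),
                                      round(y_norm * screen_h)))
    m = SCROLL_RE.fullmatch(call.strip())
    if m:
        return Action("scroll", amount=int(m.group(1)))
    raise ValueError(f"Unrecognized toolcall: {call}")

# Example: the first toolcall from Figure 10 on a 1920x1080 screen.
print(parse_toolcall("click(x=0.884,y=0.144)", 1920, 1080))
# Action(kind='click', point=(1697, 156), amount=None)
```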
Instruction: Please split this video into scenes, providing start time, end time, and detailed descriptions for each scene.

Response:
(00:00:00-00:00:15) The scene opens with a dark room illuminated by a single light source, where a person
203、 is seen cooking food.The atmosphere is mysterious and intriguing,with the dim lighting and the steam rising from the cooking pot creating a sense of anticipation.At 00:00:05,text appears on the screen,reading THE NORTH FACE PRESENTS,which sets the stage for the theme of the video.The person continu
204、es to cook,and at 00:00:13,the scene briefly transitions to a black screen,building suspense before the next segment.The overall mood is mysterious and intriguing,aligning with the theme of the video.(00:00:15-00:00:36)The scene opens with a close-up of an elderly persons face,revealing their furrow
205、ed brow and weathered skin,which suggests a life of hardship and resilience(00:00:15-00:00:20).The camera then shifts to their yellow jacket,highlighting the texture and details,and a prayer wheel comes into view,held by the person(00:00:20-00:00:27).The focus then moves to the prayer wheel,capturin
206、g its intricate details and the serene expression of the elderly person as they spin it(00:00:27-00:00:36).The atmosphere is calm and contemplative,with the dim lighting and the sound of spinning prayer wheels creating a sense of spiritual reflection.The scene introduces a sense of cultural and spir
207、itual depth,emphasizing themes of contemplation,spirituality,and the passage of time.(00:00:36-00:00:46)The scene opens with a breathtaking aerial view of majestic snow-capped mountains,setting the stage for a theme of natural grandeur and adventure.At 00:00:36,the camera captures the serene and awe
208、-inspiring landscape,which transitions to a title card at 00:00:38,introducing the film as A NEW FILM BY SHERPAS CINEMA.The title card remains on screen until 00:00:45,after which the camera continues to showcase the stunning mountain range.The scene then shifts to a close-up of a persons eye at 00:
209、00:46,subtly hinting at the personal stories and human elements that will unfold within this vast and beautiful setting.The overall mood is one of awe and inspiration,emphasizing the scale and beauty of the natural world.(00:00:46-00:01:09)The scene opens with a close-up of an eye,which then transit
210、ions to a detailed view of a prayer wheel,emphasizing the intricate designs and the serene atmosphere of the setting(00:00:47-00:00:56).The camera then shifts to a dynamic shot of a wave,capturing its power and beauty as it crashes(00:00:56-00:01:00).Following this,the scene moves to an underwater v
211、iew,showcasing the serene and mysterious environment beneath the surface(00:01:00-00:01:03).The camera then pans to a majestic mountain range,highlighting the grandeur and scale of the natural world(00:01:03-00:01:09).The mood throughout the scene is one of awe and tranquility,blending elements of n
212、ature,spirituality,and the profound connection between the individual and the vastness of the world around them.(00:01:09-00:01:27)The scene opens with an interior shot of a room filled with numerous candles arranged on a table,creating a serene and contemplative atmosphere.The camera then transitio
213、ns to a close-up of an elderly persons face,their furrowed brow and weathered skin hinting at a life of experience and wisdom.The focus shifts to the intricate details of their yellow jacket,emphasizing the texture and craftsmanship.The scene then moves to a close-up of a prayer wheel,held gently in
214、 their hands,with the dim lighting and soft focus adding to the sense of reverence and tranquility.The camera lingers on the prayer wheel,capturing its intricate designs and the serene expression of the elderly person as they spin it,before panning out to reveal the majestic mountain range in the ba
215、ckground.The scene then cuts to a close-up of a rock surface,its rugged texture and natural patterns standing in contrast to the previous serene and spiritual elements.The camera then shifts to a view of clouds and treetops,with the mountain peak peeking through,before moving to a close-up of snow-c
216、overed tree branches,highlighting the beauty and harshness of the natural environment.The scene then transitions to an aerial shot of a river,its winding path cutting through the landscape,before finally focusing on a close-up of a hand holding a prayer wheel.Throughout the scene,the themes of spiri
217、tuality,tradition,and the beauty of nature are woven together,creating a sense of awe and reflection.(00:01:27-00:01:39)The scene opens with a close-up of a person wearing a yellow jacket,focusing on the intricate details of their clothing,which suggests a sense of preparation and readiness.At 00:01
218、:29,the camera shifts to a close-up of an eye,capturing the intensity and focus of the individual.The scene then transitions to an aerial view of a breathtaking snow-covered mountain range at 00:01:34,emphasizing the vastness and grandeur of the landscape.The camera slowly pans across the mountains,
219、highlighting their rugged beauty and the serene,untouched nature of the environment.The overall mood of the scene is one of anticipation and awe,as the viewer is introduced to the challenging and majestic setting that lies ahead.The theme of preparation and the awe-inspiring nature of the landscape
220、are prominently featured,setting the stage for what is to come.(00:01:39-00:02:03)The scene opens with a group of climbers ascending a steep,snow-covered mountain,their movements slow and deliberate as they navigate the treacherous terrain(00:01:40-00:01:46).The camera then shifts to a close-up of a
221、 hand turning the pages of a book,the soft rustle of the pages providing a moment of quiet introspection(00:01:46-00:01:50).Next,the scene transitions to an aerial view of a snow-covered mountain range,the vast expanse emphasizing the scale and beauty of the environment(00:01:50-00:01:56).A skier is
222、 then shown descending a steep,snowy slope,the camera following their swift and agile movements as they carve through the powder(00:01:56-00:02:03).The sequence concludes with a close-up of a snow-covered mountain ridge,the pristine white landscape highlighting the raw power and majesty of nature(00
223、:02:03).Throughout the scene,the theme of human determination and the awe-inspiring beauty of nature is prominently featured,capturing the essence of adventure and the challenge of conquering the elements.(00:02:03-00:02:17)The scene opens with a close-up of a prayer wheel,its intricate details high
224、lighted by the soft,diffused lighting.At 00:02:03,the camera then transitions to a breathtaking view of a snow-capped mountain peak,emphasizing the grandeur and majesty of the natural world.The serene and contemplative mood is maintained as the scene shifts to a close-up of a vintage watch at 00:02:
225、05,its hands frozen in time.This is followed by a shot of a person in a tent,illuminated by a warm,yellow light,suggesting a moment of quiet reflection or preparation.At 00:02:08,the scene cuts to a climbers harness,with carabiners attached,symbolizing readiness and the technical aspects of the clim
226、b.The tension builds as two climbers are shown on a snowy ridge at 00:02:09,their silhouettes stark against the bright snow.The climax of the scene is reached at 00:02:10,with a skier launching off a cliff,the camera capturing the breathtaking moment of the jump and the spray of snow as they descend
227、.The sequence of skiing shots from 00:02:10 to 00:02:17 showcases the skiers skill and the dynamic,exhilarating nature of the sport,while also highlighting the inherent dangers and the thrill of the adventure.The editing effectively weaves together themes of preparation,reflection,and the intense,aw
228、e-inspiring moments of a mountain climb.(00:02:17-00:02:42)The scene opens with a skier in a red jacket performing a mid-air trick,showcasing their skill and agility against the backdrop of a snowy mountain slope(00:02:17-00:02:20).The camera then transitions to a skier in a blue jacket,who is captu
229、red mid-air as they soar through a cloud of snow,emphasizing the dynamic and thrilling nature of the sport(00:02:20-00:02:23).The focus shifts to a skier in a red helmet,who is seen navigating through a dense forest of snow-covered trees,highlighting the technical aspects of the descent(00:02:23-00:
230、02:26).The action intensifies with a skier in a red jacket,who is shown launching off a cliff and then landing smoothly on a steep,snowy slope,demonstrating the precision and control required in such maneuvers(00:02:26-00:02:29).The scene then transitions to a snowboarder in a red jacket,who is capt
231、ured mid-air as they perform a trick,further emphasizing the excitement and challenge of the sport(00:02:29-00:02:33).The camera then shifts to a breathtaking view of a bird soaring through a cloudy sky,symbolizing freedom and the vastness of the natural world(00:02:33-00:02:42).The editing seamless
232、ly weaves together these moments of action and tranquility,creating a narrative that explores the themes of skill,freedom,and the connection between humans and nature.(00:02:42-00:03:05)The scene opens with a skier performing a mid-air trick against a stunning sunset backdrop,capturing the thrill an
233、d freedom of the sport(00:02:42-00:02:46).The camera then transitions to a serene shot of incense burning,symbolizing a moment of reflection and spirituality(00:02:46-00:02:52).This is followed by a breathtaking view of a snow-covered mountain range under a pink sky,emphasizing the awe-inspiring bea
234、uty of nature(00:02:52-00:02:59).The scene then shifts to a surfer riding a wave,highlighting the dynamic and exhilarating aspects of water sports(00:02:59-00:03:04).The overall mood of the scene is a blend of thrill,reflection,and the majesty of nature,with each shot seamlessly transitioning to the
235、 next,creating a cohesive and visually captivating sequence.(00:03:05-00:03:27)The scene begins with a serene shot of a forest,where sunlight filters through the trees,creating a peaceful and introspective atmosphere.At 00:03:06,the camera shifts to a temple,where people are seen walking,adding a se
236、nse of cultural and spiritual depth to the setting.The temple is adorned with prayer flags,which flutter gently in the breeze,symbolizing hope and aspiration.At 00:03:08,the focus narrows to a close-up of prayer flags,their vibrant colors and intricate designs standing out against the backdrop of th
237、e temple.The scene then transitions to a forest floor covered in fallen leaves,evoking a sense of the passage of time and the beauty of natures cycles.At 00:03:10,the camera captures the texture and patterns of the leaves,emphasizing the intricate details of the natural world.The scene continues wit
238、h a shot of a cracked,dry lakebed,stretching out to the horizon under a clear blue sky,which adds a sense of vastness and isolation.At 00:03:11,the title INTO THE MIND appears on the screen,setting the theme for the sequence.The camera then zooms in on the cracked earth,highlighting the textures and
239、 patterns of the ground,before fading to black at 00:03:15.The credits roll,listing the names of the cast and crew,and the scene concludes with a black screen at 00:03:27.(00:03:27-00:03:37)The scene opens with a black screen displaying the credits,acknowledging the contributions of various individu
240、als and organizations involved in the making of the film.The mood is neutral and informative,setting the stage for the conclusion of the narrative.As the credits roll,the screen transitions to a dark,rocky interior,likely a cave or a similar natural formation,with a wooden structure partially visibl
e. This shift in setting suggests a change in the visual style, possibly indicating a new chapter or a different aspect of the story. The credits continue to display on the screen, providing a final overview of the production team and their roles. The scene then fades to black, marking the end of the film.

Figure 11: Video scene splitting. Kimi-VL processes a long-form video by segmenting it into coherent scenes and providing detailed start/end timestamps along with fine-grained natural language descriptions for each scene.
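Responses like the one transcribed above follow a simple "(HH:MM:SS-HH:MM:SS) description" pattern, which makes them straightforward to post-process. Below is a minimal sketch, assuming exactly that format; the function and field names are illustrative only and not part of any official tooling.

```python
# Minimal sketch: turning a scene-splitting response of the form
# "(00:00:00-00:00:15) The scene opens ..." into structured records.
# Assumes the "(HH:MM:SS-HH:MM:SS)" prefix format seen in Figure 11.
import re
from typing import Dict, List

SCENE_RE = re.compile(r"\((\d{2}:\d{2}:\d{2})-(\d{2}:\d{2}:\d{2})\)")

def to_seconds(ts: str) -> int:
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

def parse_scenes(response: str) -> List[Dict]:
    """Split a timestamped response into (start, end, description) records."""
    scenes = []
    matches = list(SCENE_RE.finditer(response))
    for i, m in enumerate(matches):
        desc_end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        scenes.append({
            "start_s": to_seconds(m.group(1)),
            "end_s": to_seconds(m.group(2)),
            "description": response[m.end():desc_end].strip(),
        })
    return scenes

example = ("(00:00:00-00:00:15)The scene opens with a dark room. "
           "(00:00:15-00:00:36)A close-up of an elderly person's face.")
for scene in parse_scenes(example):
    print(scene["start_s"], scene["end_s"], scene["description"][:40])
```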
On the more challenging MathVision benchmark, due to limited activated parameters, Kimi-VL outperforms DeepSeek-VL2 and Llama-3.2-11B-Inst., but lags behind Qwen2.5-VL-7B and Gemma-3-12B-IT. Nevertheless, through RL and test-time scaling, Kimi-VL-Thinking has significantly improved and is already on par with 30B-level VLMs (see Table 4). These results highlight our model's effectiveness in combining visual perception with mathematical problem-solving, an essential capability for real-world applications.

4.1.4 Document Understanding and OCR
Kimi-VL excels in document understanding and OCR tasks across all benchmarks in this category. On InfoVQA, it achieves 83.2% accuracy, outperforming GPT-4o (80.7%) and DeepSeek-VL2 (78.1%). For OCRBench, our model scores 86.7%, surpassing all other models including GPT-4o-mini (78.5%) and DeepSeek-VL2 (81.1%). These results demonstrate that our model has exceptional text recognition and document understanding capabilities, making it especially suitable for applications involving document processing and information extraction.
Instruction: There is an old Chinese saying: “Give a man a fish, and you feed him for a day; teach a man to fish, and you feed him for a lifetime.” In this representation, the author puts forward a further requirement. Find it, then explain it in detail.

Response: The requirement is to “Teach him the taste of fish and make him hungry.” This implies that in addition to teaching someone how to perform a task or acquire a skill, it is also important to inspire and motivate them to continue learning and improving on their own. By making them hungry for more knowledge or experience, they will be more likely to seek out new challenges and opportunities for growth, which can lead to greater success and fulfillment in the long run.

Figure 12: Catching and understanding key details from an hour-long video course. Kimi-VL demonstrates its ability to comprehend and interpret instructional video content by analyzing frame sequences and extracting conceptual progression over time. In this case, the model identifies a deepening of the traditional saying “Teach a man to fish, and you feed him for a lifetime” into a more nuanced idea: “Teach him the taste of fish and make him hungry.”
4.1.5 Agent Grounding and Multi-turn Agent Interaction
In agent-based tasks, Kimi-VL demonstrates remarkable performance. On single-step grounding, our model shows strong accuracy, with 92.0% on ScreenSpot-V2 and 34.5% on the extremely difficult ScreenSpot-Pro (on 4K screens), proving its strong agent grounding abilities. More importantly, it also shows strong multi-turn agent interaction abilities: on OSWorld, Kimi-VL reaches 8.22%, outperforming GPT-4o (5.03%) and other capable open-source models; on WindowsAgentArena, our model achieves 10.4%, also surpassing GPT-4o (9.4%) and others. These results highlight Kimi-VL's exceptional ability to understand and interact with operating system interfaces, suggesting strong potential for applications in automated UI navigation and task execution.
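On ScreenSpot-style grounding benchmarks, a prediction is typically counted as correct when the predicted point falls inside the target element's bounding box. The snippet below sketches that scoring rule under the assumption of normalized coordinates; it is an illustration, not the evaluation code behind the numbers above.

```python
# Hedged sketch of ScreenSpot-style grounding accuracy: a predicted point
# (normalized x, y) counts as a hit if it lies inside the ground-truth
# bounding box of the target UI element. Illustrative only; the official
# benchmark scripts may differ in details such as coordinate scaling.
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y), normalized to [0, 1]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def is_hit(pred: Point, box: Box) -> bool:
    x, y = pred
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def grounding_accuracy(preds: List[Point], boxes: List[Box]) -> float:
    hits = sum(is_hit(p, b) for p, b in zip(preds, boxes))
    return hits / len(boxes) if boxes else 0.0

# Toy example: one hit out of two predictions -> 0.5 accuracy.
print(grounding_accuracy([(0.52, 0.88), (0.10, 0.10)],
                         [(0.45, 0.80, 0.60, 0.95), (0.70, 0.70, 0.90, 0.90)]))
```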
4.1.6 Long Document and Long Video Understanding
Kimi-VL demonstrates competitive performance in long-form content understanding. On MMLongBench-Doc, a challenging benchmark with question answering over documents of up to 100+ pages, it achieves 35.1%, outperforming GPT-4o-mini (29.0%) and Qwen2.5-VL-7B (29.6%), behind only GPT-4o (42.8%). For long video understanding, on Video-MME, our model outperforms all efficient VLMs and especially leads on the fairer w/o subtitle setting, where models have to find answers from the video frames instead of exploiting the input subtitles; on the w/ subtitle setting, it also reaches an extraordinary 72.6% accuracy.
| Benchmark (Metric) | GPT-4o | GPT-4o-mini | Qwen2.5-VL-72B | Qwen2.5-VL-7B | Gemma-3-27B | Gemma-3-12B | o1-1217 | QVQ-72B-Preview | Kimi-k1.5 | Kimi-VL-Thinking-A3B |
|---|---|---|---|---|---|---|---|---|---|---|
| MathVision (full) (Pass@1) | 30.4 | - | 38.1 | 25.1 | 35.5 | 32.1 | - | 35.9 | 38.6 | 36.8 |
| MathVista (mini) (Pass@1) | 63.8 | 56.7 | 74.8 | 68.2 | 62.3 | 56.4 | 71.0 | 71.4 | 74.9 | 71.3 |
| MMMU (val) (Pass@1) | 69.1 | 60.0 | 74.8 | 58.6 | 64.8 | 59.6 | 77.3 | 70.3 | 70.0 | 61.7 |

(Columns GPT-4o through Gemma-3-12B are non-thinking models; o1-1217 through Kimi-VL-Thinking-A3B are thinking models.)

Table 4: Performance of Kimi-VL-Thinking against various open-source and proprietary models across different benchmarks. The metrics evaluated include MathVista (mini), MMMU (val), and MathVision (full), with results expressed in terms of Pass@1. Kimi-VL-Thinking outperforms the non-thinking models in most cases, showcasing the enhanced reasoning and processing capabilities of the “thinking” variant across different domains and scales.
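For reference, the Pass@1 numbers above are the standard single-sample success rate; the report does not spell out its estimator, but the commonly used definition (with the general Pass@k estimator shown alongside) is:

```latex
% Common definition of Pass@k; Pass@1 reduces to the average per-problem
% fraction of correct samples. Shown for reference only, not taken from the report.
\[
\text{Pass@}k \;=\; \frac{1}{N}\sum_{i=1}^{N}\left[1-\frac{\binom{n_i-c_i}{k}}{\binom{n_i}{k}}\right],
\qquad
\text{Pass@}1 \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{c_i}{n_i},
\]
```

where, for each of the N problems, n_i answers are sampled and c_i of them are correct.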
[Figure 13 data: test-time accuracy (%) versus max thinking token length.]

| Max thinking length (k tokens) | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| MathVision | 18.7 | 22.6 | 29.0 | 34.0 | 36.8 |
| MathVista | 66.7 | 69.0 | 70.9 | 70.6 | 71.3 |
| MMMU | 49.2 | 52.4 | 56.2 | 60.1 | 61.7 |

Figure 13: Test-time accuracy when scaling the max thinking token length of our Kimi-VL-Thinking model.

On the MCQ subset of MLVU, Kimi-VL achieves an impressive, state-of-the-art 74.2% score, surpassing both GPT-4o (64.6%) and Qwen2.5-VL-7B (70.2%). For LongVideoBench, it scores 64.5%, outperforming all compared models except GPT-4o (66.7%). These results demonstrate Kimi-VL's strong capability to understand long-form PDFs and videos.

4.1.7 Egocentric and Fine-grained Video Perception
Kimi-VL also shows strong performance in more nuanced video perception tasks. On the EgoSchema full set (hidden test set), it achieves 78.5%, significantly outperforming GPT-4o (72.2%) and Qwen2.5-VL-7B (65.0%). For VSI-Bench, a very challenging benchmark that requires understanding spatial relationships and correspondences of multiple objects in a video, our model scores 37.4%, surpassing GPT-4o (34.0%) and Qwen2.5-VL-7B (34.2%). In TOMATO, which examines fine-grained temporal perception of VLMs, Kimi-VL reaches 31.7%, outperforming Qwen2.5-VL-7B (27.6%) and GPT-4o-mini (28.8%). These results demonstrate our model's strong capability to understand dynamic visual content, track objects over time, and interpret complex actions in video sequences, making it well-suited for applications requiring temporal visual understanding.

4.2 A Reasoning Extension of Kimi-VL
Furthermore, we conduct a reasoning extension to empower Kimi-VL to reason with CoT and present a long-thinking version of the
model, Kimi-VL-Thinking, through long-CoT activation and reinforcement learning. We validate its superior performance on several image benchmarks, as shown in Table 4.

Kimi-VL-Thinking significantly improves over the base Kimi-VL model, with gains of 2.6% on MathVista, 4.7% on MMMU, and 15.4% on MathVision, demonstrating its capability to leverage test-time computation for deeper reasoning and better handling of complex multimodal queries. In Table 4, Kimi-VL-Thinking further outperforms or rivals state-of-the-art thinking and non-thinking models: achieving 71.3% on MathVista, outperforming GPT-4o (63.8%) and GPT-4o-mini (56.7%); scoring 61.7% on MMMU, surpassing GPT-4o-mini (60.0%) and Qwen2.5-VL-7B (58.6%); and reaching 36.8% on MathVision, exceeding GPT-4o (30.4%), Gemma-3-27B-IT (35.5%), and even QVQ-72B (35.9%). While marginally behind some larger-scale models on select benchmarks, Kimi-VL-Thinking accomplishes these results with only 3B activated parameters, orders of magnitude fewer than its counterparts, underscoring its strong efficiency and effectiveness in multimodal reasoning.

Our Kimi-VL-Thinking model also exhibits strong test-time scaling properties, as shown in Figure 13. Specifically, increasing the max thinking token length at inference time consistently improves test-time accuracy across all three benchmarks. For example, on MathVision, accuracy rises steadily from 18.7% at 1k tokens to 36.8% at 16k tokens, and a similar upward trend is also observed on MMMU, indicating that the model is able to utilize longer reasoning chains for better performance. However, not all benchmarks benefit equally from longer thinking lengths. On MathVista, performance saturates early, with accuracy reaching 70.9% at 4k tokens and no further significant gains observed as the token length increases to 16k. This suggests that, for this task, the necessary reasoning depth is already captured within a relatively short context, and additional computation does not yield further improvements.
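A simple way to run this kind of sweep, sketched here with hypothetical generate_with_budget and is_correct placeholders since the report does not describe its evaluation harness, is to cap the thinking budget at several values and score each run:

```python
# Hypothetical sketch of a test-time scaling sweep: evaluate the same benchmark
# while capping the maximum number of "thinking" tokens the model may emit.
# generate_with_budget() and is_correct() are placeholders, not a real API.
from typing import Callable, Dict, List, Tuple

def accuracy_vs_budget(
    problems: List[Tuple[str, str]],                    # (question, reference answer)
    generate_with_budget: Callable[[str, int], str],    # question, max_think_tokens -> answer
    is_correct: Callable[[str, str], bool],             # predicted, reference -> bool
    budgets: Tuple[int, ...] = (1_000, 2_000, 4_000, 8_000, 16_000),
) -> Dict[int, float]:
    """Return benchmark accuracy for each maximum thinking-token budget."""
    results = {}
    for budget in budgets:
        hits = sum(
            is_correct(generate_with_budget(question, budget), reference)
            for question, reference in problems
        )
        results[budget] = hits / len(problems)
    return results

# Usage sketch: plotting budgets against the returned accuracies would yield a
# curve analogous to Figure 13, e.g. early saturation on MathVista.
```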
5 Conclusion, Limitation, and Future Work
We introduce Kimi-VL, a VLM designed with a balanced approach to cover both multimodal and text-only pre-training/post-training, underpinned by an MoE-based architecture for scalable efficiency. Its 128K extended context window enables precise retrieval in lengthy texts and videos, while the native-resolution encoder MoonViT helps maintain high accuracy with low computational overhead in ultra-high-resolution visual tasks. Additionally, Kimi-VL-Thinking facilitates effective long-chain reasoning in complex image and video inference. Overall, Kimi-VL demonstrates robust adaptability and efficiency across multimodal, long-context, and high-resolution tasks, indicating substantial potential for future research and industrial applications.

However, Kimi-VL still faces several challenges:

1. Although the current model size performs effectively for many standard tasks, it remains too limited to address highly specialized or domain-specific problems, or problems that are strongly dependent on language abilities, restricting Kimi-VL's ability to handle extremely complex scenarios.
2. While the reasoning capability is already strong for typical use cases, it has yet to reach its theoretical upper bound, particularly for intricate tasks requiring multi-step inference or deeper contextual understanding.
3. Despite providing a 128K extended context window, due to the limited parameters in its attention layers (comparable only to a 3B model), its long-context ability is still insufficient for certain advanced applications that involve extremely long sequences or high-volume contextual information.

In the future, we will tackle these challenges by scaling up the model size, expanding pre-training data, and enhancing post-training algorithms. Our next steps include optimizing Kimi-VL and releasing larger versions, as well as refining post-training and test-time scaling mechanisms for a better thinking model. These efforts will pave the way for more advanced applications in both research and industry.

References
Amazon Web Services. Amazon Simple Storage Service (Amazon S3). Web. Available at: https:/ (visited on 12/15/2023).
Bai, Shuai et al. Qwen2.5-VL Technical Report. 2025. arXiv:2502.13923 [cs.CV]. URL: https://arxiv.
277、org/abs/2502.13923.Bonatti,Rogerio et al.Windows Agent Arena:Evaluating Multi-Modal OS Agents at Scale.2024.arXiv:2409.08264cs.AI.URL:https:/arxiv.org/abs/2409.08264.Chen,Lin et al.“Are We on the Right Way for Evaluating Large Vision-Language Models?”In:arXiv preprintarXiv:2403.20330(2024).Chen,Tian
278、qi et al.Training Deep Nets with Sublinear Memory Cost.2016.arXiv:1604.06174 cs.LG.URL:https:/arxiv.org/abs/1604.06174.Cheng,Kanzhi et al.“Seeclick:Harnessing gui grounding for advanced visual gui agents”.In:arXiv preprintarXiv:2401.10935(2024).Dao,Tri et al.FlashAttention:Fast and Memory-Efficient
279、Exact Attention with IO-Awareness.2022.arXiv:2205.14135cs.LG.URL:https:/arxiv.org/abs/2205.14135.DeepSeek-AI,Daya Guo,et al.DeepSeek-R1:Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.2025.arXiv:2501.12948 cs.CL.URL:https:/arxiv.org/abs/2501.12948.DeepSeek-AI,Aixin Liu,et al.De
epSeek-V3 Technical Report. 2025. arXiv:2412.19437 [cs.CL]. URL: https://arxiv.org/abs/2412.19437.
Dehghani, Mostafa et al. Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. 2023. arXiv:2307.06304 [cs.CV]. URL: https://arxiv.org/abs/2307.06304.
Fedus, William, Barret
281、Zoph,and Noam Shazeer.Switch Transformers:Scaling to Trillion Parameter Models withSimple and Efficient Sparsity.2022.arXiv:2101.03961 cs.LG.URL:https:/arxiv.org/abs/2101.03961.Fu,Chaoyou et al.“Video-MME:The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs inVideo Analysis”.In:arXi
282、v:2405.21075(2024).Fu,Xingyu et al.“Blink:Multimodal large language models can see but not perceive”.In:European Conference onComputer Vision.Springer.2024,pp.148166.Gadre,Samir Yitzhak et al.“Datacomp:In search of the next generation of multimodal datasets”.In:Advances inNeural Information Processi
283、ng Systems 36(2024).Grauman,Kristen et al.“Ego4d:Around the world in 3,000 hours of egocentric video”.In:Proceedings of the IEEE/CVFconference on computer vision and pattern recognition.2022,pp.1899519012.Guo,Jarvis et al.MAmmoTH-VL:Eliciting Multimodal Reasoning with Instruction Tuning at Scale.202
284、4.arXiv:2412.05237 cs.CL.URL:https:/arxiv.org/abs/2412.05237.Hu,Kairui et al.“Video-MMMU:Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos”.In:arXiv preprint arXiv:2501.13826(2025).Huang,Yanping et al.GPipe:Efficient Training of Giant Neural Networks using Pipeline Paralleli
285、sm.2019.arXiv:1811.06965 cs.CV.URL:https:/arxiv.org/abs/1811.06965.Jacobs,Sam Ade et al.DeepSpeed Ulysses:System Optimizations for Enabling Training of Extreme Long SequenceTransformer Models.2023.arXiv:2309.14509 cs.LG.URL:https:/arxiv.org/abs/2309.14509.Jordan,Keller et al.Muon:An optimizer for hi
286、dden layers in neural networks.2024.URL:https:/kellerjordan.github.io/posts/muon/.Kembhavi,Aniruddha et al.“A diagram is worth a dozen images”.In:European conference on computer vision.Springer.2016,pp.235251.Korthikanti,Vijay et al.Reducing Activation Recomputation in Large Transformer Models.2022.
287、arXiv:2205.05198cs.LG.URL:https:/arxiv.org/abs/2205.05198.Laurenon,Hugo et al.“Obelics:An open web-scale fi ltered dataset of interleaved image-text documents”.In:Advancesin Neural Information Processing Systems 36(2024).Li,Bo et al.LLaVA-OneVision:Easy Visual Task Transfer.2024.arXiv:2408.03326 cs.
288、CV.URL:https:/arxiv.org/abs/2408.03326.Li,Dongxu et al.Aria:An Open Multimodal Native Mixture-of-Experts Model.2024.arXiv:2410.05993 cs.CV.URL:https:/arxiv.org/abs/2410.05993.Li,Kaixin et al.“ScreenSpot-Pro:GUI Grounding for Professional High-Resolution Computer Use”.In:Workshop onReasoning and Plan
289、ning for Large Language Models.2025.Li,Shen et al.PyTorch Distributed:Experiences on Accelerating Data Parallel Training.2020.arXiv:2006.15704cs.DC.URL:https:/arxiv.org/abs/2006.15704.Liu,Hao,Matei Zaharia,and Pieter Abbeel.Ring Attention with Blockwise Transformers for Near-Infinite Context.2023.ar
Xiv:2310.01889 [cs.CL]. URL: https://arxiv.org/abs/2310.01889.
Liu, Jingyuan et al. “Muon is Scalable for LLM Training”. In: arXiv preprint arXiv:2502.16982 (2025).
Liu, Yuan et al. “MMBench: Is Your Multi-modal Model an All-around Player
291、?”In:arXiv:2307.06281(2023).Liu,Yuliang et al.“On the hidden mystery of ocr in large multimodal models”.In:arXiv e-prints(2023),arXiv2305.Lu,Pan et al.“Mathvista:Evaluating mathematical reasoning of foundation models in visual contexts”.In:arXivpreprint arXiv:2310.02255(2023).Mangalam,Karttikeya,Rai
292、ymbek Akshulakov,and Jitendra Malik.“Egoschema:A diagnostic benchmark for verylong-form video language understanding”.In:Advances in Neural Information Processing Systems 36(2023),pp.4621246244.Mathew,Minesh et al.“Infographicvqa”.In:Proceedings of the IEEE/CVF Winter Conference on Applications ofCo
mputer Vision. 2022, pp. 1697–1706.
Narayanan, Deepak et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. 2021. arXiv:2104.04473 [cs.CL]. URL: https://arxiv.org/abs/2104.04473.
OpenAI. “Learning to reason with LLMs”. In: (2024). URL: https:/
OpenAI et al. GPT-4o System Card. 2024. arXiv:2410.2
294、1276 cs.CL.URL:https:/arxiv.org/abs/2410.21276.Rajbhandari,Samyam et al.“Zero:Memory optimizations toward training trillion parameter models”.In:SC20:International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE.2020,pp.116.Schuhmann,Christoph et al.“Laion-5b:An open l
arge-scale dataset for training next generation image-text models”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 25278–25294.
Shangguan, Ziyao et al. “TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models”. In: International Conf
296、erence on Learning Representations.2025.URL:https:/ et al.“Nemotron-CC:Transforming Common Crawl into a Refi ned Long-Horizon Pretraining Dataset”.In:arXiv preprint arXiv:2412.02595(2024).Su,Jianlin et al.RoFormer:Enhanced Transformer with Rotary Position Embedding.2023.arXiv:2104.09864cs.CL.URL:htt
297、ps:/arxiv.org/abs/2104.09864.Team,Gemini et al.Gemini 1.5:Unlocking multimodal understanding across millions of tokens of context.2024.arXiv:2403.05530 cs.CL.URL:https:/arxiv.org/abs/2403.05530.Team,Gemma et al.Gemma 3 Technical Report.2025.arXiv:2503.19786 cs.CL.URL:https:/arxiv.org/abs/2503.19786.
298、Team,Kimi et al.“Kimi k1.5:Scaling reinforcement learning with llms”.In:arXiv preprint arXiv:2501.12599(2025).Tong,Shengbang et al.Cambrian-1:A Fully Open,Vision-Centric Exploration of Multimodal LLMs.2024.arXiv:2406.16860 cs.CV.URL:https:/arxiv.org/abs/2406.16860.Wang,Ke et al.“Measuring multimodal
299、 mathematical reasoning with math-vision dataset”.In:arXiv preprintarXiv:2402.14804(2024).Wei,Haoran et al.“General OCR Theory:Towards OCR-2.0 via a Unifi ed End-to-end Model”.In:arXiv preprintarXiv:2409.01704(2024).Wu,Haoning et al.“Longvideobench:A benchmark for long-context interleaved video-lang
300、uage understanding”.In:Advances in Neural Information Processing Systems 37(2024),pp.2882828857.Wu,Zhiyong et al.“Os-atlas:A foundation action model for generalist gui agents”.In:arXiv preprint arXiv:2410.23218(2024).Wu,Zhiyu et al.DeepSeek-VL2:Mixture-of-Experts Vision-Language Models for Advanced
301、Multimodal Understanding.2024.arXiv:2412.10302 cs.CV.URL:https:/arxiv.org/abs/2412.10302.x.ai.“Grok-1.5 Vision Preview”.In:(2024).URL:https:/x.ai/news/grok-1.5v.Xie,Tianbao et al.“Osworld:Benchmarking multimodal agents for open-ended tasks in real computer environments”.In:Advances in Neural Informa
302、tion Processing Systems 37(2024),pp.5204052094.Xu,Yiheng et al.Aguvis:Unified Pure Vision Agents for Autonomous GUI Interaction.2024.arXiv:2412.04454cs.CL.Yang,Jihan et al.“Thinking in space:How multimodal large language models see,remember,and recall spaces”.In:arXiv preprint arXiv:2412.14171(2024)
303、.Yu,Jiahui et al.CoCa:Contrastive Captioners are Image-Text Foundation Models.2022.arXiv:2205.01917 cs.CV.URL:https:/arxiv.org/abs/2205.01917.Yu,Weihao et al.“Mm-vet:Evaluating large multimodal models for integrated capabilities”.In:International conferenceon machine learning.PMLR.2024.Yue,Xiang,Yua
304、nsheng Ni,et al.“Mmmu:A massive multi-discipline multimodal understanding and reasoningbenchmark for expert agi”.In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2024,pp.95569567.Yue,Xiang,Xingwei Qu,et al.“Mammoth:Building math generalist models through hybrid in
305、struction tuning”.In:arXiv preprint arXiv:2309.05653(2023).Zhai,Xiaohua et al.Sigmoid Loss for Language Image Pre-Training.2023.arXiv:2303.15343 cs.CV.URL:https:/arxiv.org/abs/2303.15343.Zhao,Yilun et al.“MMVU:Measuring Expert-Level Multi-Discipline Video Understanding”.In:arXiv preprintarXiv:2501.1
2380 (2025).
Zhou, Junjie et al. “Mlvu: A comprehensive benchmark for multi-task long video understanding”. In: arXiv preprint arXiv:2406.04264 (2024).
Zhu, Wanrong et al. “Multimodal c4: An open, billion-scale corpus of images interleaved with text”. In: Advances in Neural Information Processing Systems 36 (2024).

Appendix

A Contributions

Core Contributors
Bohong Yin, Bowei Xing, Cheng Chen, Chu Wei, Dehao Zhang, Dongliang Wang, Haoning Wu, Haotian Yao, Haoyu Lu, Hao Yang, Lin Sui, Xinyuan Wang#, Xinyu Zhou, Yang Li, Y. Charles, Yiping Bao, Yimin Chen, Yuxin Wu, Zaida Zhou, Zhaowei Li, Zhiqi Huang, Zhilin Yang, Ziwei Chen

Contributors
Angang Du, Bowen Qu, Bowen Wang#, Chenlin Zhang, Chenzhuang Du, Congcong Wang, Dikang Du, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Zhang, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng#, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu#, Tianhui Song, Tongtong Bai, Weiran He, Wei Song, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingzhe Wu, Xingcheng Yao, Xinxing Zu, Yangyang Hu, Yan Zhong, Yanru Chen, Yibo Miao, Yejie Wang, Yibo Liu, Yidao Qin, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yulun Du, Yuzhi Wang, Yuzi Yan, Zhejun Jiang, Zheng Zhang, Zihao Huang, Zijia Zhao, Zongyu Lin

*Project lead(s). #The University of Hong Kong, Moonshot.ai
The listing of authors is in alphabetical order based on their first names.

B Evaluation Details

B.1 Image Benchmark
MMMU (Yue, Ni, et al. 2024) encompasses a carefully curated collection of 11.5K multimodal questions sourced from college exams, quizzes, and textbooks. These questions span six major academic fields: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.

MMBench-EN-v1.1 (Yuan
312、 Liu et al.2023)is a fi ne-grained benchmark that contains 2974 multiple-choice questions,covering 20 ability dimensions.It incorporate perception and reasoning as the top-level ability dimensions in its abilitytaxonomy,leading to different levels of evaluation in various ability dimensions.MMStar(L
313、in Chen et al.2024)is an elite vision-indispensable multimodal benchmark comprising 1,500 challengesamples meticulously selected by humans.It is designed to benchmark 6 core capabilities and 18 detailed axes,aimingto evaluate the multimodal capacities of LVLMs with a carefully balanced and purifi ed
314、 selection of samples.MMVet(W.Yu et al.2024)is designed based on the insight that the intriguing ability to solve complicated tasks isoften achieved by a generalist model being able to integrate different core vision-language capabilities.It defi nes 6 coreVL capabilities and examines the 16 integra
315、tions of interest derived from the capability combination.RealWorldQA(x.ai 2024)is a benchmark designed to evaluate the real-world spatial understanding capabilities ofmultimodal models.It assesses how well the models comprehend physical environments.The benchmark consists ofover 700 images,each acc
316、ompanied by a question and a verifi able answer,and these images are drawn from variousreal-world scenarios.AI2D(Kembhavi et al.2016)is a dataset of over 5000 grade school science diagrams with over 150000 rich annotations,their ground truth syntactic parses,and more than 15000 corresponding multipl
317、e choice questions.MathVision(K.Wang et al.2024)is a carefully curated collection of 3,040 high-quality mathematical problems withvisual contexts that are sourced from real math competitions.It covers 16 distinct mathematical disciplines and isgraded across 5 levels of diffi culty.This dataset offer
318、s a comprehensive and diverse set of challenges,making it idealfor evaluating the mathematical reasoning abilities of LMMs.MathVista(P.Lu et al.2023)is a benchmark that integrates challenges from a variety of mathematical and visualtasks,demanding participants to exhibit fi ne-grained,deep visual un
319、derstanding along with compositional reasoning tosuccessfully complete the tasks.BLINK(X.Fu et al.2024)is a benchmark designed to evaluate multi-image visual cognition,encompassing tasksrelated to depth relationships,feature matching,digital forensics,and spatiotemporal reasoning.It features a diver
se set of multi-image perceptual similarity tasks, validated through standardized protocols.

InfoVQA (Mathew et al. 2022) is a dataset specifically designed to assess models' capabilities in interpreting and reasoning with complex infographics that integrate text, graphics, and visual elements. Model performance on this dataset is evaluated using the ANLS metric on the test set.

OCRBench (Yuliang Liu et al. 2023) evaluates the OCR capabilities of MLLMs across five tasks: text recognition, scene text VQA, document VQA, key information extraction, and handwritten math expression recognition. The benchmark is scored out of a maximum of 1000 points.
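ANLS (Average Normalized Levenshtein Similarity) is the standard InfoVQA metric. The sketch below implements the usual formulation with a 0.5 threshold as a reference-style illustration; it is not the official evaluation script.

```python
# Sketch of ANLS (Average Normalized Levenshtein Similarity), the metric used
# for InfoVQA. Standard formulation: per question, take the best similarity
# against any ground-truth answer and zero it out below a 0.5 threshold.
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions: List[str], gold_answers: List[List[str]], tau: float = 0.5) -> float:
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0

print(anls(["83.2%"], [["83.2%", "83.2"]]))  # exact match -> 1.0
```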
B.2 Video and Long Document Benchmark
VideoMMMU (K. Hu et al. 2025) is a video benchmark designed to evaluate the college-level knowledge acquisition capabilities of large multimodal models. It consists of 300 expert-level videos and 900 human-annotated questions. The videos sp
323、an six diverse academic disciplines:Art,Humanities,Medicine,Business,Science,and Engineering.Thequestions are structured to align with three cognitive stages:Perception,Comprehension,and Adaptation.MMVU(Y.Zhao et al.2025)is a video benchmark designed to evaluate the expert-level video understanding
324、ability.The benchmark contains 3,000 expert-annotated questions over 1,529 videos,which span 27 subjects from four coredisciplines:Science,Healthcare,Humanities&Social Sciences,and Engineering.Video-MME(C.Fu et al.2024)is a video benchmark that consists of 900 manually selected videos(totaling 254ho
325、urs length),and 2,700 QA pairs.The videos,varying in duration,are categorized into 30 fi ne-grained classes acrosssix diverse domains:Knowledge,Film&Television,Sports Competition,Artistic Performance,Life Record,andMultilingual content.Evaluations are conducted under two different settings:with and
326、without subtitles.22KIMI-VL TECHNICALREPORTMLVU(J.Zhou et al.2024)is designed to evaluate the model performance in comprehending long videos from multipleaspects.It consists of 1,730 videos along with 3,102 corresponding question-answer pairs(2,593 in dev set and 509 intest set).Videos of this bench
327、mark are collected from multiple scenarios,including Sport,Ego-centric,Life Record,Tutorial,etc.The close-ended task set of MLVU comprises 7 different tasks:Action Order,Action Count,TopicReasoning,Anomaly Recognition,Plot QA,Ego Reasoning,and Needle QA.LongVideoBench(H.Wu et al.2024)is a video ques
328、tion-answering benchmark designed to evaluate the long-formmultimodal perception and relation capability of large multimodal models.The benchmark includes 3,763 web-collected videos spanning various lengths and themes,along with their corresponding subtitles.It includes 6,678human-annotated multiple
329、-choice questions,distributed across 17 fi ne-grained categories,which accesses differentaspects of video-language understanding.EgoSchema(Mangalam et al.2023)is a video benchmark designed to evaluate the long-form video understandingcapabilities within the ego-centric scenario.Derived from Ego4D(Gr
330、auman et al.2022),the benchmark comprises over5,031 multiple choice question-answer pairs spanning more than 250 hours real-world videos with a semi-automaticdata pipeline.VSI-Bench(Yang et al.2024)is designed to evaluate the visual-spatial comprehensive capabilities of large multimodalmodels.It con
331、sists of over 5,000 question-answer pairs across around 290 real indoor-scene videos.TOMATO(Shangguan et al.2025)is a video benchmark comprises 1,484 human-annotated question-answer pairs and1,417 videos.TOMATO focuses on evaluating the temporal reasoning capabilities of large multimodal models,incl
uding action counting, direction prediction, rotation analysis, shape & trend detection, velocity & frequency estimation, and visual cue interpretation.

B.3 Agent Benchmark
ScreenSpot V2 (Zhiyong Wu et al. 2024) is an enhanced version of the ScreenSpot (K. Cheng et al. 2024) benchmark, which focuses on evaluating the performance of GUI grounding models across multiple platforms, including web, desktop, and mobile interfaces. This updated version addresses several issues identified in the original ScreenSpot dataset, such as incorrect or ambiguous annotations, spelling mistakes, and mislabeled bounding boxes.

ScreenSpot Pro (K. Li et al. 2025) is a benchmark for evaluating GUI grounding in high-resolution, complex UI environmen