DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu*1, Wen Liu*1, Bo Zhang*1, Bingxuan Wang1, Kai Dong1, Bo Liu1, Jingxiang Sun1, Tongzheng Ren1, Zhuoshu Li1, Hao Yang1, Yaofeng Sun1, Chengqi Deng1, Hanwei Xu1, Zhenda Xie1, Chong Ruan1
1DeepSeek-AI

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:

Data Construction: We strive to ensure our data is diverse, scalable and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content (expert knowledge, textbooks), aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. Fine-tuning with this dataset substantially improves the model's user experience in practical applications.

Model Architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) within a fixed token budget, while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.

Training Strategy: We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. Starting with a focus on text, we gradually adjust the ratio to facilitate a balanced integration of both modalities.

The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.

*Equal contribution. Work done during the internship at DeepSeek-AI. Project lead.

arXiv:2403.05525v2 [cs.AI] 11 Mar 2024

Contents
1 Introduction
2 Data Construction
2.1 Vision-Language pretraining Data
2.2 Supervised Fine-tuning Data
3 Approach
3.1 Architecture
3.2 Training Pipelines
3.2.1 Stage 1: Training Vision-Language Adaptor
3.2.2 Stage 2: Joint Vision-Language pretraining
3.2.3 Stage 3: Supervised Fine-tuning
3.3 Hyperparameters and Infrastructures
4 Evaluation
4.1 Public Multimodal Benchmarks Evaluation
4.2 Public Language Benchmarks Evaluation
4.3 Human Evaluation
4.4 Ablation Study
5 Conclusion, Limitation, and Future Work
A Appendix

1. Introduction
The remarkable success of large language mode
11、ls(LLMs)(Anthropic,2023;Google,2023;OpenAI,2022,2023a)has fueled the demand for a versatile interface that can handle multiplemodalities beyond language.In response to this growing demand,we have seen an emergenceof Large Multimodal Models(LMMs)like GPT-4V(OpenAI,2023b)and Gemini(Team et al.,2023),w
12、hich serve as versatile assistants capable of comprehending and acting upon instructionsthat span vision and language.These models exhibit considerable promise in executing complex,diverse real-world tasks,enabling more natural and human-like interactions.Recently,there has been a surge of open-sour
13、ce large multimodal models aimed at narrowingthe gap with proprietary counterparts.Substantial strides have been made,especially inbenchmark performance,yet a significant divide persists between the majority of open-sourcemodels and state-of-the-art closed-source models(Bai et al.,2023;Bavishi et al
14、.,2023;OpenAI,2023b;Team et al.,2023)when it comes to real-world performance and user experience.Itremains challenging for the open-source community to develop models with robust generalmultimodal capabilities for real-world applications.The performance gap between the most open-source models and th
15、e proprietary models islargely pronounced in real-world scenarios,primarily due to the following reasons:Many open-source solutions allocate a significant proportion of computational resourcesto the instruction tuning phase.However,the experience of training powerful languagemodels underscores the i
16、mportance of extensive pretraining in the development of generalintelligence.To imbue multimodal models with rich world knowledge,there should bean emphasis on comprehensive pretraining that leverages a broad spectrum of vision-language data.A common practice is to amalgamate various academic datase
17、ts during instruction tuning.While such an approach may yield good benchmark results,it often falls short in providingan authentic real-world usage experience.In terms of model architecture,prior works mostly adapt a vision transformer,typicallytext-aligned,to a pre-trained language model.However,mo
st of these models operate on a relatively low resolution, e.g., 336 x 336 or 448 x 448. The intricacies of complex real-world scenarios, such as optical character recognition or tiny object discernment, demand high-resolution processing capability. While some models (01-ai, 2024; Lin et al., 2023a; Sun et al., 2023; Wa
19、ng et al.,2023b)havebegun to exploit pretraining,they often overlook the preservation of language skills.Often,there is a degradation of language capability after prolonged multimodal training.Sincewe aim for a generalist that possesses strong capabilities in both modalities,there shouldbe a trainin
20、g strategy that well preserves the language capability when developing thenew modality ability.In light of these,we present DeepSeek-VL,an open-source large multimodal model,whichis built upon the DeepSeek language model series.We develop the model in the pursuit ofadept performance in real-world sc
21、enarios,which involves extensive pretraining,careful datacuration based on a use case taxonomy,model architecture design for high-resolution processing,and a training strategy that balances the multi-modalities.On top of these,we develop atraining methodology that steers the model scaling,from 1B to
7B. These comprehensive explorations bring a significant performance advantage in practical settings, compared to other large multimodal models (LMMs) of similar size.

Prompt: Which algorithm does the image describe?
Response: The image shows a flowchart of an algorithm named CLIP. The algorithm is described as a new zero-shot image-to-image transfer method that uses text descriptions to predict the classes of a target dataset. In the flowchart, we see three main steps: 1. Contrastive pre-training: in this stage, the model is trained to minimize the difference between two encoders, one for images and one for text, which is achieved by projecting image and text embeddings into the same latent space. 2. Create dataset classifier from label text: this step involves extracting feature vectors from the given label text and feeding them, together with the image embeddings, into a classifier. 3. Use for zero-shot prediction: finally, the model learned in the previous steps is used for zero-shot prediction, which involves passing the image embedding and the extracted feature vectors to the classifier to predict which class the image belongs to. The whole process revolves around the core idea of learning a model that understands the relationship between images and text. The key of this approach is to leverage large amounts of unlabeled data and improve the model's generalization ability through contrastive pre-training. At the bottom of the flowchart, an example shows how CLIP is used for zero-shot prediction: a photo of a dog and a phrase about dogs are used to predict which breed the dog in the photo belongs to. Overall, the flowchart clearly depicts the structure of the CLIP algorithm and how it works, highlighting its potential for handling large amounts of unlabeled data and its effectiveness on zero-shot tasks.

Prompt: What art style is evident in this image?
Response: The art style evident in this image is Impressionism. This can be det
25、ermined by several visual clues present:Brushwork:The painting exhibits a loose,visible brushstroke technique which is characteristic of Impressionist art.The strokes are not smooth or blended but rather they create a sense of movement and texture.Color Palette:The use of bright,vibrant colors that
26、seem to capture the light and atmosphere of the scene without precise detail is typical of Impressionist painters.The skys blue with white clouds,the warm tones on the cliffs,and the varied hues used for the boats all contribute to this impressionistic approach.Subject Matter:The subject matter itse
27、lfa coastal landscape with figures and boatsis common in Impressionist paintings,which often depicted scenes from daily life and nature.Lack of Detail:There is a lack of fine detail in the depiction of objects such as the boats and figures,which suggests an emphasis on capturing the overall mood and
28、 impression of the scene rather than meticulous realism.Light and Shadow:The play of light and shadow is another hallmark of Impressionist works.In this painting,we see how the artist has captured the interplay of sunlight on the water and rocks,creating a dynamic effect.These elements together stro
29、ngly suggest that the artwork belongs to the Impressionist genre.PromptWhich bathroom is the bedroom A closer to?ResponseBedroom A is closer to Bathroom A.This can be determined by observing the floor plan provided in the image.The layout of the rooms and bathrooms are clearly marked with labels,all
30、owing us to identify their relative positions.Bedroom A is located on the left side of the floor plan,while Bathroom A is situated directly below it,towards the bottom-left corner of the image.This positioning indicates that there is no other bathroom between them,making Bathroom A the closest one f
or Bedroom A.

Prompt: Describe the image.
Response: This image is a vivid welcome screen of the official Minecraft website. The background is a vast blue sky dotted with fluffy white clouds. Below it is a vibrant pixel-art world featuring a village, a lush forest, and a winding path. At the center of the image, a green button draws attention, with "Get Minecraft" written on it in bold white letters. This button seems to invite viewers to join the game and start their Minecraft adventure. At the top of the image, a passage of text warmly welcomes visitors to the official Minecraft website. It reads: "Welcome to the official Minecraft website! Join our community, learn how to play the game, join one of our largest communities, and start creating today!" This text conveys the game's inclusiveness and community spirit, encouraging new players to explore and create. Overall, the image captures the essence of the Minecraft world: its endless possibilities and the fun of interacting with others.

Prompt: You are a home robot. Now you are at the kitchen, this is what you can see from there. To fetch something from the fridge, what should you do?
Response: To fetch something from the fridge, I would follow these steps: Approach the
33、 refrigerator:The refrigerator is located on the right side of the kitchen island.Its a stainless steel model with a freezer compartment at the bottom and an ice maker dispenser above it.Open the refrigerator door:To open the refrigerator door,I would use my hands to grasp the handle or pull bar loc
34、ated on the front edge of the refrigerator door.This action should be done carefully to avoid any accidental bumps into other items in the kitchen.Locate the desired item:Once inside the refrigerator,I would scan through the shelves and compartments for the specific item I am looking for.The refrige
35、rator has multiple shelves and drawers that are likely organized by type of food or drink.Retrieve the item:After locating the desired item,I would reach out and grab it using one hand while ensuring not to disturb the organization of the contents.If the item is placed on a shelf or drawer,I would l
36、ift it up gently to remove it without knocking anything over.Close the refrigerator door:After retrieving the item,I would close the refrigerator door securely using the handle or pull bar.This ensures that the cold air remains trapped within the appliance and maintains its temperature efficiency.Re
37、turn to the task at hand:With the item retrieved from the fridge,I can now proceed with whatever activity I was doing before,such as cooking or preparing a meal.PromptCan you translate the flowchart to a python code?ResponseThe flowchart provided is a simple decision-making process that compares two
38、 variables,x and y,to determine which one has the greater value.The steps are as follows:1.Start with an input for variable x and get its next input.2.Repeat step 1 for variable y.3.Compare x and y using the x y condition.If true,it means X is greater than y,so put the maximum of these two values in
39、to the output.If false,it means x is not greater than y,so put the maximum of x and y into the output.4.Finally,end the program by putting the max value into the output.Here is how you can translate this flowchart into Python code:In this Python code,we define a function called max_of_twothat takes
two arguments, x and y, and returns the maximum of these two values. We then call this function with the appropriate inputs and print the result.

Prompt: Please generate the latex code of the formula in the image.
Response: A_n = a_0 \left[ 1 + \frac{3}{4} \sum_{k=1}^{n} \left( \frac{4}{9} \right)^{k} \right]

Prompt: Tell me the main contri
41、bution of this paper.ResponseThe main contribution of this paper,as stated in the abstract,is to develop a cognitive model that integrates individual differences into human-automated vehicle interaction.The authors aim to provide insights into how humans interact with new technologies and propose a
42、framework for understanding these interactions.They also discuss potential pitfalls associated with using a one-size-fits-all model and explain how modeling individual differences can help design effective user interfaces and interventions.Additionally,they conduct a simulation study to demonstrate
43、possible benefits from incorporating individual differences into the design process.This conclusion is drawn directly from the text under the ABSTRACT section,where it outlines the purpose and contributions of the research.PromptWhich direction of paper is the most likely source of the table in the
44、picture.ResponseThe image shows a graph with various lines representing different models validation log perplexity over the length of the validation sequence,which is measured in x103.The title C4 language modeling(large model)suggests that this graph pertains to a study or analysis related to large
45、-scale natural language processing tasks,specifically focusing on language modeling using a C4 dataset.To determine the most likely direction of paper from which the table originated,we should look for visual clues in the image:1.Graph Style:The graph has a clear and professional appearance,suggesti
46、ng it could be part of an academic paper or technical report.2.Data Representation:The use of multiple lines with markers such as squares,diamonds,and circles indicates comparisons between different models or methods.This kind of data representation is common in research papers where different exper
47、imental results are presented side by side for comparison.3.Title and Content:The title provides context about the nature of the content it mentions language modeling and large model,indicating that the paper deals with computational linguistics or machine learning topics.4.Legend:There is a legend
48、at the bottom left that matches the colors of the lines to their respective models.This is typical in scientific graphs to help readers understand what each line represents without having to refer back to the text.5.Axis Labels:The axes have labels that are consistent with scientific notation(Valida
49、tion sequence length and Validation log perplexity).These labels suggest that the paper is intended for an audience familiar with these terms,which would include researchers or students in fields like computer science or engineering.6.Font and Formatting:The font used for the title and axis labels a
50、ppears to be sans-serif,which is often used in modern publications due to its readability and clean design.Based on these observations,the most likely source of the table in the picture is a publication from a field related to artificial intelligence,machine learning,or computational linguistics.It
51、could be a journal article,conference proceedings,a thesis,or even a presentation slide from a seminar or workshop.The style and content align well with scholarly work in those domains.Figure 1|DeepSeek-VL possesses general multimodal understanding capabilities,capable ofprocessing logical diagrams,
52、web pages,formula recognition,scientific literature,natural images,and embodied intelligence in complex scenarios.4DeepSeek-VLs pretraining dataset is compiled from a variety of sources,including but notlimited to Common Crawl,Web Code,E-books,Educational Materials,and arXiv Articles.Thiscollection
53、thoroughly encompasses real-world scenarios such as web screenshots,PDFs,OCR,charts,and knowledge-based content(expertise,textbooks),aiming for a broad and practicalrepresentation while remaining scalable.While our pretraining data encompasses a wide array of world knowledge,we meticulouslycurate ou
54、r instruction-tuning dataset to reflect real-world usage scenarios.To achieve this,wemanually gather authentic test cases for GPT-4V and Gemini from the Internet.These cases havebeen systematically organized into a comprehensive taxonomy.We use this structured taxonomyto choose prompts for each test
55、 image,ensuring a practical and relevant instruction tuningdataset.This taxonomy is also used to create an evaluation dataset that effectively assessesreal-world performance.The visual module is designed to optimize the utilization of high-resolution visual inputswhile remaining within a fixed token
budget to manage inference costs effectively. As such, we employ a hybrid vision encoder, which combines a text-aligned encoder for coarse semantic extraction at 384 x 384 resolution with a high-resolution encoder that captures detailed visual information at 1024 x 1024 resolution. By fusing these two encoders, our hybrid approach efficiently condenses a 1024 x 1024 resolution image (which suffices in most use cases) into 576 tokens. This token count strikes a balance between rich visual representation and token economy, making it feasible for both text-image interleaving and multi-turn inference scenarios. During t
he pretraining of multimodal models, a common challenge encountered is the potential degradation of language capabilities when the training process is overly reliant on vision-language data. Our research reveals that maintaining a significant proportion of language data (specifically, at least 70%) is essenti
59、al to preserve the integrity of language knowledgewithin the model.This balance is critical for achieving a robust multimodal capability that doesnot compromise language performance.Moreover,we introduce a novel“modality warm-up”strategy.This approach carefully adjusts the ratio of modalities during
training, gradually incorporating more vision-language data. The careful tuning of the modality ratio along with the warm-up strategy results in a balanced performance of both modalities. When iterating on our model, we conduct experiments on a small scale before scaling to a larger model size. However, a smaller model, e.g., a 1B model, cannot demonstrate reasonable performance on benchmarks (Schaeffer et al., 2024) or faithfully reflect the model's performance. We adopt two approaches to address this. First, we modify the evaluation protocol from multiple choice to comparing the perplexity of options. Also, to prevent the inst
62、ruction following abilityfrom becoming the bottleneck,we mix a small proportion of instruction tuning data during thepretraining phase.In this way,we can achieve reasonable performance using the 1B model andmore accurately measure the impact of each iteration during the experiment.Through extensive
63、evaluations of general vision and language benchmarks,the DeepSeek-VLfamily showcases superior user experiences in real-world applications and achieves state-of-the-art or competitive performance across a wide range of visual-language benchmarks at thesame model size,while maintaining robust languag
e-centric performance. To foster innovation and enable a wide range of applications, we have made two versions of our model, 1.3B and 7B, publicly accessible, in the hope of facilitating the needs of varying computational capabilities.

2. Data Construction
A diverse and large dataset is the most important in
65、gredient of visual language model training.Our dataset can be divided into two parts:Vision-Language pretraining Data and Vision-Language Supervised Fine-Tuning Data.VL pretraining Data is composed of visual-text datafrom various sources,aimed at enhancing the models fundamental cross-modal understa
66、ndingcapabilities;while VL Supervised Fine-Tuning Data has a relatively smaller size and aims toteach the model to complete specific downstream tasks.By design,VL pretraining Data isused to warm up the vision-language adaptor in training stage 1 and jointly pretrain the vision-language model in stag
67、e 2,and VL Supervised Fine-Tuning Data is exploited in training stage 3,i.e.,vision language supervised fine-tuning.2.1.Vision-Language pretraining DataThe pretraining dataset utilized in our study encompasses a diverse range of publicly accessiblesources,in addition to a selection of proprietary da
68、ta.We provide a comprehensive overview ofthe data sources employed during the joint vision and language pretraining stage in Table 1.Such a dataset can facilitate LLMs comprehension of the entities portrayed in the images.Furthermore,we present a detailed breakdown of the complete dataset,which is o
69、rganizedinto the following categories:Interleaved image-text data enable the models to have a better capability for in-contextlearning of multi-modality inputs,and we utilize three public datasets MMC4(Zhu et al.,2024),Wiki(Burns et al.,2023),Wikihow(Yang et al.,2021)and Epub textbooks.Image caption
70、 data come from three high-quality image-text paired datasets:Capsfusion(Yuet al.,2023a),TaiSu(Liu et al.,2022b)and Detailed Caption(echo840,2024).Table and chart data enable the models to learn the capability for general table and chartimage understanding.It encompasses a diverse range of public da
71、ta sources,includingChart2text(Kantharaj et al.,2022),Geo170K(Gao et al.,2023),Unichart(Masry et al.,2023),Ureader(Ye et al.,2023),M-paper(Hu et al.,2023),ScienceQA(Lu et al.,2022b),ScreenQA(Hsiaoet al.,2022),SciGraphQA-295K(Li and Tajbakhsh,2023),Paper2figure100k(Rodriguez et al.,2023),Widget Capti
72、oning(Li et al.,2020),Screen2words(Wang et al.,2021),and Refexp(Maoet al.,2016).Web Code data empowers models with the capability to reconstruct code from graphicalinterfacesorvisualplots.LeveragingWebsight(HuggingFaceM4,2024)forUIInverseRendering,we adopted a strategy akin to that used in MATCHA(Li
73、u et al.,2022a)for visual plots inverserendering.This involved the processing of approximately 1.46 million Jupyter notebooks fromthe Stack dataset(Kocetkov et al.,2023).By extracting these notebooks and collating all diagramsalong with their corresponding preceding code segments,we succeeded in cur
74、ating a collectionfeaturing 2 million pairs of images and codes.For better data quality,we filter 1.1 millioninstances,each comprising a singular image coupled with a minimum of 5 lines of code,toconstitute our primary training dataset.Document Optical Character Recognition(OCR)data facilitates the
Document Optical Character Recognition (OCR) data facilitates the recognition of optical characters at the document level, even in challenging real-world scenarios. To the best of our knowledge, there is currently no publicly available large-scale dataset encompassing both English and Chinese documents. Despite the existence of the publicly accessible small-scale dataset
Latex-OCR (Blecher, 2024), we additionally constructed a comprehensive English and Chinese document OCR dataset, summarized alongside the other pretraining sources in Table 1.

Table 1 | Summary of datasets used in the joint vision and language pretraining stage.
Category | Dataset | Ratio
Interleaved image-text | MMC4 (Zhu et al., 2024), Wikipedia EN&CN (Foundation), Wikihow (Yang et al., 2021), in-house PDF and Epub textbooks | 13.1%
Image caption | Capsfusion (Yu et al., 2023a), TaiSu (Liu et al., 2022b), Detailed Caption (echo840, 2024) | 11.1%
Table and chart | Chart2text (Kantharaj et al., 2022), Geo170K (Gao et al., 2023), Ureader (Ye et al., 2023), Unichart (Masry et al., 2023), M-paper (Hu et al., 2023), ScienceQA (Lu et al., 2022b), ScreenQA (Hsiao et al., 2022), SciGraphQA-295K (Li and Tajbakhsh, 2023), Paper2figure100k (Rodriguez et al., 2023), Widget Captioning (Li et al., 2020), Screen2words (Wang et al., 2021), Refexp (Mao et al., 2016) | 2.1%
Web Code | Websight (HuggingFaceM4, 2024), python plots scraped from GitHub notebooks | 0.4%
Scene text OCR | ArT (Chng et al., 2019), MLT-17 (Nayef et al., 2017), LSVT (Sun et al., 2019), UberText (Zhang et al., 2017), Coco-text (Veit et al., 2016), RCTW-17 (Shi et al., 2017), ReCTS (Zhang et al., 2019), TextOCR (Singh et al., 2021), OpenVINO (Krylov et al., 2021), HierText (Long et al., 2022) | 1.2%
Document OCR | arXiv rendered markdown (Blecher et al., 2023) | 2.1%
Text-only corpus | DeepSeek-LLM 2T text corpus (DeepSeek-AI, 2024) | 70.0%

Our document OCR dataset is comprised of two parts: 1) arXiv Articles: We collected source code and compiled PDFs from 1.4 million arXiv articles. Utilizing pre-processing tools from Nougat (Blecher et al., 2023), we rendered these articles into pair
81、ed images and texts;2):E-books and Educational Materials:We cleaned 860K English and 180K Chinese e-books fromAnnas Archive(Annas Archive,2024)alongside millions of K-12 education exam questions.Subsequently,we employed HTML rendering tools(Kulkarni and Truelsen)to convert theseHTML files with diffe
rent templates into paired image and text formats.
Scene text OCR data augment the capability of the model to recognize and extract text from images in which the text is integrated into the environment.

Table 2 | Summary of data used in our joint vision and language supervised fine-tuning stage.
Class | Dataset | Ratio
In-house Data | SFT data based on taxonomy (Figure 3) | 10.5%
General Multi-modality | ShareGPT4V (Chen et al., 2023), LAION-GPTV (LAION, 2023), LVIS-Instruct4V (Wang et al., 2023a), textOCR-GPT4V (Carter, 2024), LLaVA1.6-GPT4V (Liu et al., 2024a), IconQA (Lu et al., 2021) | 35.5%
Table and chart | Ureader (Ye et al., 2023), Geo170K (Gao et al., 2023), ScienceQA (Lu et al., 2022b) | 4.1%
Web Code | Screen-to-code (Abi, 2024), ScreenQA (Hsiao et al., 2022) | 2.0%
Text-only SFT | DeepSeek-LLM (DeepSeek-AI, 2024) | 47.9%

The scene text OCR dataset is composed of multiple public datasets, including ArT (Chng et al., 2019), MLT-17 (Nayef et al., 2017), LSVT (Sun et al., 2019), UberText (Zha
85、ng et al.,2017),Coco-text(Veit et al.,2016),RCTW-17(Shi et al.,2017),ReCTS(Zhang et al.,2019),TextOCR(Singh et al.,2021),OpenVINO(Krylov et al.,2021)andHierText(Long et al.,2022).Text-only corpus serves to maintain proficiency in language-centric tasks.In this study,weemploy the same text corpus wit
86、h DeepSeek-LLM(DeepSeek-AI,2024).2.2.Supervised Fine-tuning DataThe supervised fine-tuning datasets utilized in our study encompass a diverse range of multi-modality and language data sources,including well-known open-source shared gpt4v datasetssuch as ShareGPT4V(Chen et al.,2023),LAION-GPTV(LAION,
87、2023),LVIS-Instruct4V(Wangetal.,2023a),textOCR-GPT4V(Carter,2024),LLaVA1.6-GPT4V(Liuetal.,2024a)andIconQA(Luet al.,2021).Additionally,we incorporate partial table and chart data extracted from pretrainingdatasets such as Ureader(Ye et al.,2023),ScreenQA(Hsiao et al.,2022),Geo170K(Gao et al.,2023),an
88、d ScienceQA(Lu et al.,2022b).Moreover,we integrate the UI Code dataset obtainedfrom Screen-to-code(Abi,2024)tasks.To enhance the quality of our multi-modality SFT data,we have also curated a portion of high-quality in-house multi-modality SFT data,some ofwhich are in the Chinese language.Our in-hous
89、e instruction-tuning dataset is meticulouslydesigned to reflect real-world usage scenarios and cover a wide range of tasks.We start bycollectingadiversesetofauthentictestcasesforGPT-4VandGeminifromvariousonlinesources.These test cases are then carefully analyzed and organized into a comprehensive ta
90、xonomy,which encompasses multiple categories,such as recognition,conversion,analysis,reasoning,evaluation,and safety,as detailed in Table 3.This structured taxonomy serves as a guidelinefor selecting representative prompts for each test image,ensuring that our instruction-tuningdataset is both pract
91、ical and relevant to real-world applications.Moreover,this taxonomy isalso employed to construct a balanced and comprehensive evaluation dataset,which allows usto effectively assess the models performance across different tasks and categories.By followingthis systematic approach,we ensure that the c
92、ategories covered by our in-house multi-modalitySFT data are well-aligned with the taxonomy and representative of real-world usage scenarios.8Main CategoryDescriptionSecondary CategoryTertiary CategoryRecognitionThis part of the use cases mainly examines the under-standing and description ability of
93、 large models for im-age content,which does not require high knowledgereserve and reasoning ability of the model,and sometasks can be completed using traditional machine learn-ing models.Global DescriptionTheme Description,Event/Behavior Description,Lo-cation/Scene Description,Emotion/Mood Descripti
94、on,Style Recognition,Food Recognition,OthersLocal DescriptionPointing Description,Position Description,PersonRecognition,Object Attribute Description,Logo Recog-nition,Counting,Currency RecognitionOCR and TranscriptionPrinted Text Transcription,Handwritten Text Transcrip-tion,Specified Format Transc
95、ription,Specified LanguageTranscriptionConversionThis type of use case requires the model to be able todescribe and recognize image content,and use specificknowledge(e.g.,code knowledge,prompt engineeringknowledge)to convert image content into another form.Image to CodeUI to Code,Chart to Code,Photo
96、 to SVG/p64 Encoding,Formula to Code,Flowchart to CodeImage to TextImage to Prompt,Text Summary,Image-based Creation,Text InterpretationAnalysisThis type of use case requires the model to use specificknowledge and logical ability to make reasonable anal-ysis and understanding based on image content,
97、anddescribe the image according to instructions.Data Chart AnalysisGraph Interpretation,Table InterpretationProfessional Chart AnalysisCircuit Diagram,Flowchart,Map,Music Score,FinancialChart,Floor Plan,OthersProfessional Image AnalysisSensor Image,Biological and Medical Image,VoiceprintImage,Point
98、Cloud ImageEncyclopedia Knowledge Analy-sisArt and Culture Knowledge,Natural EnvironmentKnowledge,Food/Clothing/Housing/TransportationRelated Knowledge,Entertainment Related Knowledge,Historical KnowledgeCommonsenseRea-soningThis type of use case mainly tests the models under-standing and mastery of
99、 common sense in life,whichrequires reasoning based on the interpretation and anal-ysis of image content combined with common sense.Relationship ReasoningInterpersonal Relationship,Spatial Relationship,SizeRelationship,Species RelationshipFunction ReasoningHardware Function Reasoning,Software Functi
100、on Rea-soningEnvironment ReasoningEnvironment State Analysis,Environment-based Behav-ior Reasoning,Embodied IntelligenceAnomaly ReasoningIdentifying Anomalies in Images,Defect Detection,Ac-cident JudgmentHumor Reasoning-Other Commonsense ReasoningState Reasoning,Cause Reasoning,Attribute Compari-son
101、,Optical Illusion,Fun Games,Intention Interpreta-tion,Behavior PredictionLogical ReasoningThis type of use case requires the model to combine theunderstanding of images,comprehensively use domainknowledge and logical reasoning ability to completecorresponding tasks.Mathematical ReasoningAlgebraandOp
102、eration,PlaneGeometry,SolidGeometryOther Logical ReasoningPhysics,Chemistry,Biology,Code,IQ QuestionsEvaluationThis type of use case requires the model to evaluate theimage content according to specific criteria.-Reality Evaluation,Similarity Evaluation,Aesthetic Eval-uation,Open-ended Evaluation,Im
103、provement Sugges-tionsMulti-graphThis type of use case examines the models ability toanalyze and understand multiple images.Temporal Sequence Understand-ingEvent Prediction,Image Sequencing,Behavior AnalysisMulti-graph ComparisonAttribute Comparison,Image-Text Matching,FindingAssociations,Spotting D
104、ifferences,Image Discrimina-tionSafetyThis type of use case examines the models performancein terms of safety.-Suggestive Questioning,Counterfactual Questioning,Prompt InjectionTable 3|Our taxonomy for the in-house SFT data.The categories covered by our high-qualityin-house multi-modality SFT data a
105、re comprehensively represented in this taxonomy.9Furthermore,we include the text-only SFT data employed in DeepSeek-LLM(DeepSeek-AI,2024)as part of our joint vision and language SFT data.3.Approach3.1.ArchitectureOur system contains three modules:a hybrid vision encoder,a vision adaptor,and a langua
106、gemodel.We introduce each part in this section.Hybrid Vision Encoder.We employ SigLIP as the vision encoder to extract high-levelsemantic feature representations from visual inputs.However,we observe that a single SigLIPencoder struggles to address all real-world questions comprehensively.Vision enc
oders in the CLIP family, including SigLIP, are primarily designed for semantic visual representations but are challenged by ambiguous encoding, resulting in visually distinct images being encoded as similar due to what is referred to as CLIP-blind pairs (Tong et al., 2024). Meanwhile, the CLIP family of models is limited by its relatively low-resolution inputs (e.g., 224 x 224, 336 x 336, 384 x 384, 512 x 512), which hinders their ability to handle tasks requiring more detailed low-level features like dense OCR and visual grounding tasks. To address these limitations, recent studies (Lin et al., 2023b; Tong et al., 2024; Wei et al., 2023) have advocated for the integration of additional vision-only self-supervised encoders to enhance the visual grounding capabilities of multi-modality models. Building upon these motivations, we additionally utilize a vision-only encoder based on the SAM-B (Kirillov et al., 2023), a
110、pre-trained ViTDet(Li et al.,2022)image encoder to process low-level features,whichaccepts high-resolution 1024 x 1024 image inputs.In addition to the SAM-B encoder,weretain the SigLIP-L vision encoder with low-resolution 384 x 384 image inputs.Consequently,our hybrid vision encoder combines the SAM
111、-B and SigLIP-L encoders,efficiently encodinghigh-resolution 1024 x 1024 images while preserving both semantic and detailed information.Specifically,a high-resolution SAM-B vision encoder first resizes the image into 1024 x 1024 andresults in a 64 x 64 x 256 feature map.In the case of a high-resolut
ion feature map of size 64 x 64 x 256 generated by SAM-B, the VL Adaptor initially interpolates it into a size of 96 x 96 x 256. Subsequently, it employs two convolutional layers, each with a stride of 2, producing a feature map of 24 x 24 x 1024, and reshapes it to 576 x 1024. Alongside this, the low-resolution feature map of size 576 x 1024 generated by SigLIP-L is concatenated with the high-resolution features, resulting in 576 visual tokens with 2048 dimensions. These visual tokens possess a substantial capacity for enhancing high-level semantic visual recognition and low-level visual grounding tasks. They then undergo GeLU activation and are directed through an embedding layer to establish a connection with the language model.

Vision-Language Adaptor. We employ a two-layer hybrid MLP to bridge the vision encoder and the LLM. Initially, distinct single-layer MLPs are used to process high-resolution features and low-resolution features separately. Subsequently, these features are concatenated along their dimensions and then transformed into the LLM's input space through another layer of MLP.
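To make the tensor bookkeeping above concrete, the following is a minimal PyTorch sketch of such a hybrid adaptor, written only from the shapes described in this section. The module and layer names, the intermediate convolution widths, and the final LLM embedding width are illustrative assumptions, not the released implementation; the per-branch projections are sized so that the concatenated features have 2048 dimensions, as stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionAdaptor(nn.Module):
    """Sketch: fuse SAM-B (B, 256, 64, 64) and SigLIP-L (B, 576, 1024) features
    into 576 visual tokens in the LLM embedding space. Widths are assumptions."""

    def __init__(self, llm_dim: int = 2048):
        super().__init__()
        # Two stride-2 convolutions turn the interpolated 96x96 map into 24x24 (576 positions).
        self.downsample = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),   # 96x96 -> 48x48
            nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),  # 48x48 -> 24x24
        )
        # Distinct single-layer MLPs for the high- and low-resolution branches.
        self.high_proj = nn.Linear(1024, llm_dim // 2)
        self.low_proj = nn.Linear(1024, llm_dim // 2)
        # Final projection into the LLM input space after concatenation.
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, sam_feat: torch.Tensor, siglip_feat: torch.Tensor) -> torch.Tensor:
        # sam_feat: (B, 256, 64, 64); siglip_feat: (B, 576, 1024)
        x = F.interpolate(sam_feat, size=(96, 96), mode="bilinear", align_corners=False)
        x = self.downsample(x)                       # (B, 1024, 24, 24)
        x = x.flatten(2).transpose(1, 2)             # (B, 576, 1024)
        fused = torch.cat([self.high_proj(x), self.low_proj(siglip_feat)], dim=-1)  # (B, 576, 2048)
        return self.out_proj(F.gelu(fused))          # 576 visual tokens for the LLM

if __name__ == "__main__":
    adaptor = HybridVisionAdaptor()
    tokens = adaptor(torch.randn(1, 256, 64, 64), torch.randn(1, 576, 1024))
    print(tokens.shape)  # torch.Size([1, 576, 2048])
```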
Language Model. Our language model is built upon DeepSeek LLM (DeepSeek-AI, 2024), whose micro design largely follows that of LLaMA (Touvron et al., 2023a,b), adopting a Pre-Norm structure with the RMSNorm (Zhang and Sennrich, 2019) function and using SwiGLU (Shazeer, 2020) as the activation function for the Feed-Forward Network (FFN), with an intermediate layer dimension of 8/3 times the model dimension. It also incorporates Rotary Embedding (Su et al., 2024) for positional encoding and uses the same tokenizer as DeepSeek-LLM. We introduce a family of DeepSeek-VL models. Given our objective of conducting joint pretraining with multimodal and language data, we select an intermediate checkpoint from DeepSeek's pretrained models to continue pretraining. Specifically, the DeepSeek-VL-1B model is constructed based on the DeepSeek-LLM-1B model, which underwent training with an approximate corpus of 500 billion text tokens, and the DeepSeek-VL-7B model is developed leveraging the DeepSeek-LLM-7B model trained with an estimated 2 trillion text tokens.

Prompt: Is the cyclist on the left or right side of the woman's handbag?
Response: The cyclist is on the left side of the woman's handbag. This can be determined by observing the relative positions of the objects in the image: 1. The woman is holding a handbag that is positioned to her right side, as we view the image. 2. There is a cyclist depicted on the road behind the woman and to her left when looking at the image. 3. The cyclist appears to be moving away from the viewpoint, given their direction and the motion blur effect on their wheels. 4. The cyclist is further down the road compared to the woman, which supports the conclusion that they are on the left side of the handbag based on our perspective. These visual clues confirm that the cyclist is indeed on the left side of the woman's handbag.

Figure 2 | Visualization results. DeepSeek-VL is capable of capturing tiny objects and giving organized explanations.
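Returning to the language backbone described before Figure 2: the following is a minimal, self-contained sketch of a SwiGLU feed-forward block with the 8/3 intermediate expansion mentioned above. It illustrates the standard LLaMA-style design rather than the exact DeepSeek LLM code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Minimal SwiGLU feed-forward block with an ~8/3 * d_model intermediate width
    (a sketch of the LLaMA-style design referenced above, not the released code)."""

    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(8 * d_model / 3)          # "intermediate layer dimension of 8/3"
        self.gate = nn.Linear(d_model, hidden, bias=False)
        self.up = nn.Linear(d_model, hidden, bias=False)
        self.down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: swish-gated linear unit followed by a down projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))

if __name__ == "__main__":
    ffn = SwiGLUFFN(d_model=2048)
    print(ffn(torch.randn(2, 16, 2048)).shape)  # torch.Size([2, 16, 2048])
```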
[Figure 3: three-stage training pipeline. Each stage shows the hybrid vision encoder (SAM-B + SigLIP-L), the vision-language adaptor, and the DeepSeek LLM. Stage 1 (Training VL Adaptor) uses image-text pairs, Stage 2 (Joint VL Pre-training) uses interleaved VL + pure language sequences, and Stage 3 (Supervised Finetuning) uses VL chat data + pure language chat data.]

Figure 3 | Our training pipelines consist of three stages. Stage 1 involves training the Vision-Language (VL) adaptor while keeping the hybrid vision encoder and language model fixed. Stage 2 is the crucial part of the joint vision and language pretraining, where both VL adaptor and language model are trainable. Stage 3 is the supervised fine-tuning phase, during which the low-resolution vision encoder SigLIP-L, VL adaptor, and language model will be trained.

3.2. Training Pipelines
We train our DeepSeek-VL in three consecutive stages as shown in Figure
125、 3:vision-languageadaptor warmup,joint vision-language pretraining,and supervised fine-tuning.We currentlyfocus on visual understanding capabilities and only calculate the next token prediction loss onthe language part.3.2.1.Stage 1:Training Vision-Language AdaptorThe primary objective of this stage
126、 is to establish a conceptual link between visual and linguisticelements within the embedding space,thereby facilitating the comprehensive understandingof depicted entities in the images by the Large Language Model(LLM).Consistent with priorresearch conducted by LLaVA(Liu et al.,2024b)and Instruct-B
127、LIP(Dai et al.,2023),we adopt asimilar approach in which both the vision encoder and the LLM remain frozen during this stage,while solely allowing the trainable parameters within the vision-language adaptor.We utilize adataset comprising 1.25 million image-text paired captions obtained from ShareGPT
128、4V,alongwith 2.5 million Document OCR rendering pairs to train the VL adaptor.Nevertheless,compared to Large Language Models(LLMs),vision-language adaptors(e.g.,a 2-layer MLP)have a significantly smaller parameter capacity.This limitation in model capacityrestricts the capabilities that can be learn
129、ed during this stage.A natural question arises:Can thelaw of data scaling be effective at this stage?To address this question,we conducted a simpleexperiment in Table 8.The results demonstrate that expanding the data scale at this stage doesnot provide benefits and may even lead to inferior performa
nce. Consequently, we proceed to unfreeze the Large Language Model (LLM) and investigate efficient vision-language pretraining approaches during stage 2.

3.2.2. Stage 2: Joint Vision-Language pretraining
In this stage, we explore effective pretraining strategies, which can be considered as an additional stage
131、to enable Large Language Models(LLMs)to comprehend multimodal inputs.We keep thevision encoder frozen and optimize the language model and VL adaptor.Initially,we attempt to directly train the LLM with multimodal data.However,we findwhile the metrics for multimodal performance incrementally improved,
132、there is a stark andsevere decline in language metrics as illustrated in Figure 4(Multimodal:Language=100%:0%),.This underscores the inherent challenge in directly conducting multimodal pretraining on thefoundation of an LLM,revealing a critical trade-off between enhancing multimodal abilities andpr
133、eserving linguistic proficiency.We hypothesize that the observed phenomenon stems from two primary factors:firstly,themajority of multimodal corpora,are overly simplistic and exhibit a significant divergence fromthe complexity and distribution of linguistic data.Secondly,there appears to be a compet
134、itivedynamic between multimodal and linguistic modalities,leading to what can be described ascatastrophic forgetting of language capabilities within the LLM.Joint Language-multimodal Training To address this challenge,we devise a straightforwardyet effective joint language-multimodal training strate
135、gy.During training,we not only engagein multimodal data training but also incorporate a large proportion of language data intothe training.This approach aims to balance the training focus,mitigating the adverse effectsobserved.We conduct experiments on the DeepSeek-VL 1B model in Figure 4 to explore
136、 theimpact of varying the modality mixing ratios.The analysis of the graph yields several key conclusions:(1).Integrating language datasignificantly alleviates the decline in language capabilities,demonstrating a substantial improve-ment in the models linguistic performance.(2).The inclusion of lang
137、uage data does not lead toa significant loss in multimodal performance,indicating that the model retains its multimodalprocessing abilities.(3).The performance of different modalities is strongly correlated withtheir respective proportions in the training dataset,substantiating the competitive relat
ionship between the two modalities. Ultimately, we opt for a training ratio of language to multimodal data of roughly 7:3 for our final model. This ratio enables the model to maintain its language capabilities while simultaneously achieving better pretraining on multimodal data, effectively balancing the development of both language and multimodal proficiencies.
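The following sketch illustrates one way such a language/multimodal mixture with a modality warm-up could be implemented. The linear warm-up schedule, the batch-level mixing, and all function names are assumptions for illustration; the paper specifies only the final roughly 7:3 language-to-multimodal ratio and a gradual increase of vision-language data.

```python
import random

def modality_ratio(step: int, total_steps: int,
                   start_mm: float = 0.0, final_mm: float = 0.3,
                   warmup_frac: float = 0.2) -> float:
    """Fraction of multimodal samples at a given step: linearly warmed up from start_mm
    to final_mm over the first warmup_frac of training, then held fixed (illustrative)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    progress = min(1.0, step / warmup_steps)
    return start_mm + progress * (final_mm - start_mm)

def sample_batch(language_pool, multimodal_pool, batch_size: int, mm_frac: float, rng=random):
    """Draw a mixed batch with roughly mm_frac multimodal samples and the rest text-only."""
    n_mm = round(batch_size * mm_frac)
    batch = [rng.choice(multimodal_pool) for _ in range(n_mm)]
    batch += [rng.choice(language_pool) for _ in range(batch_size - n_mm)]
    rng.shuffle(batch)
    return batch

if __name__ == "__main__":
    lang = [{"text": f"doc {i}"} for i in range(1000)]
    mm = [{"image": f"img {i}", "text": f"caption {i}"} for i in range(1000)]
    for step in (0, 500, 5000):
        frac = modality_ratio(step, total_steps=10000)
        print(step, frac, sum("image" in x for x in sample_batch(lang, mm, 32, frac)))
```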
Scaling Vision-Language Pretraining Nevertheless, the pretraining stage of the model incurs a substantial computational cost, and performing iterations on the 7B model requires an excessive amount of computing power and time. One suitable strategy i
140、nvolves conducting experimentson a smaller model,specifically the 1.3B model,and subsequently scaling it up to the 7B model.Fortunately,we have observed that a significant portion of the outcomes obtained from the 1.3Bmodels can be effectively transferred to the 7B model through the utilization of S
FT (e.g., the encoder design). However, during the stage 2 training phase, we have encountered considerable fluctuations in the generative metrics of the 1.3B model, rendering it challenging to supervise the training process effectively. This has been discussed in Schaeffer et al. (2024): sharp and unpredictable changes might be induced by the researcher's choice of measurement, even though the model family's per-token error rate changes smoothly, continuously and predictably with increasing scale. Subsequent experiments have led us to identify the root causes of this issue: the limited capacity of the 1.3B model and the absence of SFT data within the training dataset, both of which hinder the model's ability to accurately follow instructions. Even when the model possesses knowledge of the correct options, it struggles to generate them precisely.

[Figure 4: six panels plotting metric vs. training step for five modality mixing ratios (Multimodal:Language = 10%:90%, 25%:75%, 60%:40%, 75%:25%, 100%:0%). Panels: SeedBench, MMBench and MMBench_CN (MC-PPL accuracy), MMLU (LM-PPL accuracy), HellaSwag (accuracy), and Pile-test (PPL).]

Figure 4 | Comparative performance results for different modality fusion ratios in training stage 2. An excessively large proportion of multimodal data (multimodal:language = 100%:0%) leads to significant forgetting of language capabilities in LLMs. A suitable ratio (language:multimodal = 70%:30%) can effectively mitigate the issue of language forgetting while simultaneously enhancing the model's multimodal abilities.

To mitigate these challenges, we adopt a dual-pronged approach. Firstly, we employ the Multi-choice PPL methodology to monitor the model's progress. This involves inputting not only the prompt and image into the network but also all the answers associated with the question. Subsequently, we calculate the PPL for each answer position (e.g., A, B, C, D) and select the option deemed correct by the model as the final answer. Secondly, we introduce SFT data into the training dataset at a minimal proportion, allowing the model to acquire some proficiency in following instructions. The combination of these two approaches ensures the maintenance of stable training metrics for the 1.3B model and brings better performance after stage 3.
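A minimal sketch of this Multi-choice PPL protocol is shown below, assuming a Hugging Face-style causal language model and tokenizer. Image inputs, the exact prompt template, and batching are omitted for brevity and would need to be added for a vision-language model; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_nll(model, tokenizer, prompt: str, option: str, device: str = "cpu") -> float:
    """Mean per-token negative log-likelihood (i.e., log-PPL) of `option` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    option_ids = tokenizer(option, add_special_tokens=False, return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    logits = model(input_ids=input_ids).logits
    # Score only the option tokens: the logit at position t predicts token t+1.
    start = prompt_ids.shape[1]
    option_logits = logits[:, start - 1 : -1, :]
    nll = F.cross_entropy(option_logits.reshape(-1, option_logits.size(-1)),
                          option_ids.reshape(-1), reduction="mean")
    return nll.item()

def multi_choice_ppl_answer(model, tokenizer, prompt: str, options: dict) -> str:
    """Pick the option label (e.g., {'A': ..., 'B': ...}) with the lowest PPL under the model."""
    scores = {label: option_nll(model, tokenizer, prompt, f" {label}. {text}")
              for label, text in options.items()}
    return min(scores, key=scores.get)
```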
3.2.3. Stage 3: Supervised Fine-tuning
In this phase, we finetune the pretrained DeepSeek-VL model with instruction-based fine-tuning to bolster its ability to follow instructions and engage in dialogue, culminating in the creation of the interactive DeepSeek-VL-Chat model. We optimize the language model, VL adaptor, and hybrid vision encoder with the vision-language SFT data as shown in Table 2; SAM-B remains frozen due to the limited GPU memory. We only supervise answers and special tokens and mask the system and user prompts. To guarantee the model's comprehensive proficiency in dialogue, we utilize a blend of multimodal data and the pure text dialogue data used in DeepSeek-LLM. This approach ensures the model's versatility across various dialogue scenarios.

Prompt: Explain the code, step by step.
Response (DeepSeek-VL): The code provided is a Python function that calculates the longest palindrome substring of a given string. The function takes in two parameters, which are not defined within this snippet but presumably represent the input strings to be processed. Here's a step-by-step explanation of how the code works: 1. It defines a helper function called longestPalIndrome with two arguments, self (which is likely a reference to an object or class instance) and s. This function returns a value, which we can assume is the length of the longest palindrome substring found in s. 2. The main function starts by checking if the length of s is less than 1. If true, it immediately returns s as there is no palindrome substring for empty or single character strings. 3. Next, it initializes variables Max_Len and Max_Str to store the maximum length of a palindrome substring found so far. 4. A loop (for i in range(len(s)-1):) iterates over all possible starting points of substrings in s. For each starting point i, another nested loop (for j in range(i+1, len(s))) checks whether the substring from index i to j forms a palindrome. The condition if j-i+1 > Max_Len and s[i:j] == s[j:i:-1] ensures that the substring is both longer than the current maximum length and reads the same backward as forward. If this condition is met, it updates Max_Len and Max_Str with the new maximum length and substring. 5. After iterating through all possible substrings, the function returns Max_Str, which contains the longest palindrome substring found. 6. Finally, the line return Max_Str indicates that the function will return the result of its computation. This function uses dynamic programming principles to find the longest palindromic subsequence efficiently by storing intermediate results and avoiding redundant calculations.

Figure 5 | Visualization results. DeepSeek-VL can understand Python code and provide detailed and organized explanations.
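A minimal sketch of this answer-only supervision is given below, assuming the conventional ignore index of -100 used by common cross-entropy implementations and a generic role-tagged chat format; the actual chat template and special tokens of DeepSeek-VL are not reproduced here.

```python
IGNORE_INDEX = -100  # label value excluded from the cross-entropy loss

def build_sft_labels(tokenizer, messages, eos_token_id):
    """Tokenize a role-tagged chat and supervise only assistant answers (plus their
    end-of-turn token); system and user prompt tokens are masked out of the loss.
    `messages` is a list of {"role": "system"|"user"|"assistant", "content": str}."""
    input_ids, labels = [], []
    for msg in messages:
        ids = tokenizer(msg["content"], add_special_tokens=False).input_ids
        if msg["role"] == "assistant":
            ids = ids + [eos_token_id]           # close the assistant turn
            input_ids += ids
            labels += ids                        # supervise answer + special token
        else:
            input_ids += ids
            labels += [IGNORE_INDEX] * len(ids)  # mask system and user prompts
    return input_ids, labels
```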
3.3. Hyperparameters and Infrastructures
The detailed hyperparameters of all stages are illustrated in Table 4. We train and evaluate our DeepSeek-VL with HAI-LLM (High-flyer, 2023), a lightweight and efficient distributed training framework. Since we use visual encoders to convert images into embedding vectors and then treat image embe
162、ddings and text embeddings uniformly,we can easily adapt pipeline paral-lelism to VL model training:all we need to do is to view visual encoders and text embedding asa single module and take it as the first layer of the resulting model.This very first layer has acomplicated model structure and precl
163、udes standard tensor parallelism technique,but luckilyit requires relatively small computation compared to upper standard transformer blocks.Wetherefore simply recompute the visual encoder forward pass in all tensor parallel ranks.Theexistence of visual encoders also leads to non-uniform execution t
164、ime across model layers,sowe re-divide model layers between pipeline parallelism ranks to achieve better load balance andthroughput.The upper layers of DeepSeek-VL are exactly the same as those in DeepSeek-LLM.With such minor modification,we can now perform canonical 3D parallelism techniques asin M
165、egatron(Korthikanti et al.,2023;Narayanan et al.,2021;Shoeybi et al.,2019)and overlapcomputation and communication as in DeepSeek-LLM(DeepSeek-AI,2024).DeepSeek-VL-7B consumed 5 days on a cluster of 64 nodes,each comprising 8 Nvidia A100 GPUs,whileDeepSeek-VL-1B consumed 7 days on a setup involving
16 nodes.

Table 4 | Detailed hyperparameters of our DeepSeek-VL.
Hyperparameter | DeepSeek-VL-1B (SigLIP): Stage 1 / Stage 2 / Stage 3 | DeepSeek-VL-7B (SigLIP+SAM): Stage 1 / Stage 2 / Stage 3
Learning rate | 1.0e-3 / 3e-5 / 2.0e-5 | 1.0e-3 / 4.2e-5 / 2.0e-5
LR scheduler | Cosine / Step / Cosine | Cosine / Step / Cosine
Weight decay | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
Gradient clip | 1.0 / 1.0 / 1.0 | 1.0 / 1.0 / 1.0
Optimizer | AdamW (beta1=0.9, beta2=0.95) | AdamW (beta1=0.9, beta2=0.95)
Warm-up steps | 128 / 2000 / 256 | 128 / 2000 / 256
Training steps | 15000 / 96000 / 10000 | 15000 / 42000 / 10000
Batch size | 256 / 1024 / 256 | 256 / 2304 / 256
Sequence length | 512 / 4096 / 4096 | 512 / 4096 / 4096
Sequence packing |  | 
Pipeline parallelism |  | 

4. Evaluation
4.1. Public Multimodal Benchmark
168、s EvaluationWe evaluate our models on a series of public benchmarks:Multimodal comprehensive understanding datasets:MMMU(Yue et al.,2023),CM-MMU(Zhang et al.,2024),MMBench(Liu et al.,2023a),MMBench-CN(Liu et al.,2023a),SeedBench(Li et al.,2023a)and MMV(Yu et al.,2023b).We compare DeepSeek-VL withcom
169、petitors on MMB/MMC-dev as current official test download link is no longer active.Chart/table understanding datasets:OCRBench(Liu et al.,2023b);Hallucination datasets:POPE(Li et al.,2023b);Scientific problem datasets:ScienceQA(Lu et al.,2022a)and MathVista(Lu et al.,2023).We apply generation-based
170、evaluation with greedy decoding.The generation-based evalua-tion here refers to letting the model generate free texts and parsing results from generated texts.The comparative results,as illustrated in Table 5,show that DeepSeek-VL-7B surpasses mostopen-source models of similar size across a wide ran
171、ge of benchmarks.DeepSeek-VL outperforms open-source models of similar size in benchmarks such as MMB,MMC,and SEEDbench,even approaching proprietary models(DeepSeek-VL vs.GPT-4V=70.4vs.71.6 on seedbench),demonstrating its powerful natural image comprehension capability.Themodel also surpasses all op
172、en-source models in mathematical logic,but still lags significantlybehind proprietary models like GPT-4V(36.1 vs.47.8 on MathVista).This difference could beattributed to the variance in base model sizes.Furthermore,as shown in Table 6,DeepSeek-VL-1.3B significantly outperforms modelsof comparable si
173、ze.It demonstrates superior performance compared to leading open-sourcemodels in the MMB benchmark test,while utilizing only close to half the parameters(1.3B vs.2.7B),indicating its robust natural image comprehension capability.DeepSeek-VL-1.3B evenachieves comparable results to 7B open-source mode
ls on MathVista, further validating the powerful logical understanding capabilities of the DeepSeek-VL family.

Table 5 | The comparison between different multi-modal models. The top half are proprietary models, while the bottom are open-source models.
Model | LLM | MMMU | CMMMU | MMB | MMC | SEED | OCRB | POPE | MathV | MMVet
Close-source LMMs:
Gemini Pro | Unk | 48.9 | - | 75.2 | 74.0 | 70.7 | 659 | - | 45.2 | 59.2
GPT-4V | Unk | 56.8 | 42.5 | 75.0 | 74.7 | 71.6 | 659 | - | 47.8 | 49.9
Qwen-VL-Plus | Unk | 45.2 | 39.5 | 66.2 | 69.6 | 72.7 | - | - | 43.3 | 55.7
Qwen-VL-MAX | Unk | 51.4 | - | 78.1 | 76.4 | 72.7 | - | - | 51.0 | 61.8
Open-source 13B LMMs:
LLaVA-1.5 | 13B | 36.4 | - | 68.2 | 61.9 | 68.2 | 331 | 85.9 | 26.4 | 38.3
VILA | 13B | - | - | 70.3 | 64.3 | - | - | 84.2 | - | 38.8
LLaVA-Next | 13B | 36.2 | - | 70.0 | 64.4 | 71.9 | - | 86.7 | 35.3 | 48.4
Open-source 7B LMMs:
EMU2-Chat | 7B | 36.3 | 23.8 | 63.6 | 45.9 | 68.9 | - | - | 30.0 | 31.0
Qwen-VL-Chat | 7B | 37.0 | - | 60.6 | 56.7 | 64.8 | - | - | 33.8 | 47.3
CogVLM | 7B | 37.3 | 24.8 | 63.7 | 53.8 | 68.8 | - | - | 34.7 | 54.5
LLaVA-Next | 7B | 35.8 | - | 67.4 | 60.0 | 70.2 | - | 86.5 | 34.6 | 43.9
Yi-VL | 6B | 37.8 | 35.8 | 68.2 | 68.9 | 67.6 | - | - | 28.0 | 31.1
DeepSeek-VL (ours) | 7B | 36.6 | 37.9 | 73.2 | 72.8 | 70.4 | 456 | 88.1 | 36.1 | 41.5

Table 6 | The comparison between tiny multi-modal models.
Model | LLM | MMMU | CMMMU | MMB | MMC | SEED | OCRB | POPE | MathV | MMVet
MobileVLM | 1.4B | - | - | 53.2 | - | - | - | 84.5 | - | -
MobileVLM | 2.7B | - | - | 59.6 | - | - | - | 84.9 | - | -
MobileVLM V2 | 1.4B | - | - | 59.6 | - | - | - | 84.3 | - | -
MobileVLM V2 | 2.7B | - | - | 63.2 | - | - | - | 84.7 | - | -
LLaVA-Phi | 2.7B | - | - | 59.5 | - | - | - | 85.0 | - | 28.9
DeepSeek-VL (ours) | 1.3B | 32.2 | 27.4 | 64.6 | 61.3 | 66.7 | 409 | 87.6 | 31.1 | 34.8

4.2. Public
178、ge Benchmarks EvaluationWe evaluate our models on the following public language benchmarks:Multi-subject multiple-choice datasets including MMLU(Hendrycks et al.,2020).Language understanding and reasoning datasets including HellaSwag(Zellers et al.,2019).Language modeling datasets including Pile(Gao
179、 et al.,2020).Math datasets including GSM8K(Cobbe et al.,2021).Code datasets including MBPP(Austin et al.,2021).Standardized exams including AGIEval(Zhong et al.,2023).We apply perplexity-based evaluation to datasets that require answers to be chosen fromseveral options.These datasets include HellaS
wag and MMLU. The perplexity-based evaluation here refers to calculating the perplexity of each option and selecting the lowest one as the model prediction. Perplexity-based evaluation helps to distinguish subtle probability differences between model predictions and avoids the discontinuity of exact-match style evaluation. We apply generation-based evaluation with greedy decoding for GSM8K and AGIEval. The generation-based evaluation here refers to letting the model generate free text and parsing results from the generated text. We apply language-modeling-based evaluation for Pile-test, which means calculating the bits-per-byte on the test corpus. The results are illustrated in Table 7.

Table 7 | The performance on language benchmarks.
Benchmark | DeepSeek-VL 1B Chat (Encoder: SigLIP) | DeepSeek-VL 7B Chat (Encoder: SigLIP+SAM) | DeepSeek-LLM 7B Chat (Encoder: None)
HellaSwag | 56.0 | 68.4 | 68.5
MMLU | 32.5 | 52.4 | 49.4
GSM8K | 18.0 | 55.0 | 63.0
MBPP | 10.0 | 35.2 | -
AGIEval | 14.0 | 27.8 | 19.3
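For reference, the bits-per-byte metric used for Pile-test can be computed from the summed token-level negative log-likelihood; a standard formulation (stated here for convenience, not quoted from the paper) is:

\[
\mathrm{BPB} = \frac{1}{N_{\mathrm{bytes}}\,\ln 2}\sum_{t=1}^{N_{\mathrm{tokens}}} -\log p_\theta\left(x_t \mid x_{<t}\right)
\]

where the sum runs over all tokens of the test corpus and N_bytes is its size in UTF-8 bytes, making the metric independent of the tokenizer.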
It can be observed that across the majority of language benchmarks, DeepSeek-VL performs comparably to, or even surpasses, DeepSeek-7B. For instance, it achieves scores of 68.4 vs. 68.5 on HellaSwag, which serves as a general benchmark for evaluating general language ability. DeepSeek-VL outperforms DeepSeek-7B on metrics such as MMLU and AGIEval, indicating that multimodal training methods may even aid in language tasks. Nevertheless, DeepSeek-VL-7B shows a certain degree of decline in mathematics (GSM8K), which suggests that despite efforts to promote harmony between vision and language modaliti
185、es,there still exists a competitiverelationship between them.This could be attributed to the limited model capacity(7B),andlarger models might alleviate this issue significantly.Overall,DeepSeek-VL strives to achievethe goal of minimizing declines in language capability while addressing these challe
186、nges.4.3.Human EvaluationTo further explore the capabilities of our DeepSeek-VL,we independently construct a datasetfor manual evaluation.This dataset comprises 100 questions,divided into seven categories,each encompassing specific tasks.These categories and tasks are same as our taxonomy forthe in-
187、house SFT data,as shown in Table 3.This approach ensures that the tasks we test areuniversal and encompass the majority of use cases for multimodal models.Moreover,based on the categories and tasks described in existing reports,we collect similarimage materials and developed prompts.The sources for
188、these image materials include royalty-free image communities and photographs taken by the researchers.This methodical collectionand prompt formulation process ensures our dataset is both comprehensive and representativeof real-world multimodal model applications.We compare our DeepSeek-VL-7B with In
189、ternLM-XComposer2-VL,CogVLM and GPT-4V as shown in Figure 6(and we also provide visualization results in Appendix A).GPT-4Vdemonstrates exceptional performance across most dimensions.All open-source models are stillfar behind GPT-4V in logical reasoning,highlighting the necessity of scaling up the s
190、ize of LargeLanguage Models(LLMs).DeepSeek-VL-7B achieves better results in overall performance,reaching outcomes close to GPT-4V in Recognition,Conversion,and Commonsense Reasoning.184.555.714.094.215.223.213.753.754.654.764.095.266.961.433.752.55.657.016.824.746.524.293.133.756.37.147.734.476.745.
191、368.133.75TOTAL SCORERECOGNITIONCONVERSIONANALYSISCOMMONSENSELOGICALMULTI-IMAGESEVALUATIONInternLM-XComposer2-VLCogVLM-17BDeepSeek-VL-7BGPT4VFigure 6|Human evaluation results on InternLM-XComposer2-VL(Dong et al.,2024),CogVLM(Wang et al.,2023b),DeepSeek-VL and GPT-4V(OpenAI,2023b).Figure 7|GPT-4V-ba
192、sed Evaluation Results of DeepSeek-VL vs.Other Models:The chart depictsresults from a GPT-4V-based assessment across 99 test samples,demonstrating DeepSeek-VLsfavorable outcomes against both open-source and proprietary models.In addition,we conduct a comparative assessment using GPT-4V to evaluate t
193、he performanceof DeepSeek-VL against other models across a set of 99 test samples designed for humanevaluation.Following(Zheng et al.,2024),we show GPT-4V the question and the answersfrom two different models and ask GPT-4V to determine which one is better or declare atie.The results indicate a pref
194、erence for DeepSeek-VLs responses in the majority of cases,asGPT-4V tends to rate the quality of DeepSeek-VLs answers more favorably.As illustrated inFigure 7,DeepSeek-VL is judged to be superior in over 60%of instances when compared to open-source multimodal models,including Fuyu-8B,CogVLM-17B,and
195、InternLM-XComposer2-VL.Moreover,in comparison with other proprietary models,such as GPT-4V itself,DeepSeek-VLdemonstrates comparably exceptional performance.4.4.Ablation StudyScale Up Projector Training We expand the dataset for stage 1(projector warmup)and sub-sequently apply supervised fine-tuning
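A hedged sketch of this pairwise GPT-4V-as-judge protocol is given below; the judge prompt wording and the call_judge hook are illustrative placeholders (the exact judging prompt is not published in this section), and only the better/worse/tie decision logic follows the description above.

```python
import random
from collections import Counter
from typing import Callable

def judge_pair(question: str, answer_a: str, answer_b: str,
               call_judge: Callable[[str], str]) -> str:
    """Ask a judge model (e.g., GPT-4V) which answer is better, or to declare a tie.

    `call_judge` is any function that sends a prompt to the judge and returns its reply.
    """
    # Randomize the presentation order to reduce position bias.
    swap = random.random() < 0.5
    first, second = (answer_b, answer_a) if swap else (answer_a, answer_b)
    prompt = (
        "You are given a question and two candidate answers.\n"
        f"Question: {question}\n"
        f"Answer 1: {first}\n"
        f"Answer 2: {second}\n"
        "Reply with exactly one of: 'Answer 1', 'Answer 2', or 'Tie'."
    )
    verdict = call_judge(prompt).strip().lower()
    if "tie" in verdict:
        return "tie"
    picked_first = "1" in verdict
    return ("b" if picked_first else "a") if swap else ("a" if picked_first else "b")

def win_rate(samples, call_judge) -> Counter:
    """samples: iterable of (question, answer_a, answer_b) triples."""
    return Counter(judge_pair(q, a, b, call_judge) for q, a, b in samples)
```

Aggregating such pairwise verdicts over the 99 samples yields the win/tie/loss proportions of the kind visualized in Figure 7.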
4.4. Ablation Study

Scale Up Projector Training. We expand the dataset for stage 1 (projector warmup) and subsequently apply supervised fine-tuning. The results, shown in Table 8, demonstrate that augmenting the training data volume does not enhance performance at this stage. This implies that the projector's capacity is inherently constrained, rendering it incapable of capturing the extensive knowledge necessary for multimodal tasks.

Stage 1 Training Steps   MMB    MMC    SEED   POPE   MMMU   Average
2K                       59.0   54.0   61.8   82.3   30.3   57.5
8K                       58.0   45.0   58.5   84.9   29.2   55.1
20K                      56.0   52.3   59.0   81.7   28.6   55.5
80K                      58.1   55.0   58.6   78.6   27.9   55.6

Table 8 | Comparative direct-SFT performance results when scaling up stage 1 data. The results demonstrate that expanding the data scale at this stage does not yield benefits, or even results in worse performance.

Training Stage. In Table 9, we examine the contribution of each stage to the model's performance. It is evident that combining stage 1, stage 2, and stage 3 yields significantly better results across all metrics than combining stage 1 and stage 3 alone, demonstrating the effectiveness of multimodal pretraining. Additionally, the combination of stage 2 and stage 3 still slightly lags behind the combination of all three stages, indicating that the vision-language adaptor warmup stage remains meaningful.

Stage 1   Stage 2   Stage 3   MMB    MMC    SEED   POPE   MMMU   Average
✓         -         ✓         59.4   54.2   61.4   82.5   29.2   57.4
-         ✓         ✓         63.4   60.5   65.9   87.1   31.8   61.7
✓         ✓         ✓         64.3   61.3   66.7   87.6   32.2   62.4

Table 9 | Analysis of model performance across training stages.

Modality Group Training. When mixing language and multimodal data, we observe that directly blending them at the batch level significantly reduces training efficiency. This inefficiency arises because each batch's gradient backpropagation waits for the slowest sample to complete; as a result, the predominantly faster-to-process pure-language data ends up waiting for the multimodal samples to finish, decreasing overall training efficiency. To address this issue, we experiment with grouping the data by modality at each global step, sampling the two modalities separately. This approach organizes the training data so that each batch is composed either entirely of language data or entirely of multimodal data, with the modality varying across training steps, rather than mixing the two within the same batch. The results, shown in Figure 8, indicate that this method does not compromise the model's performance while improving training efficiency by 20%. This strategy effectively circumvents the bottleneck caused by the disparate processing times of the two modalities, optimizing the training workflow.

Figure 8 | Comparative analysis of modality group training on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks. Grouping batches by modality consistently surpasses the non-grouped approach on the language task while simultaneously preserving performance on the multimodal tasks in training stage 2 (Multimodal : Language = 60% : 40%).
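A minimal sketch of such modality-grouped batching is given below; the 60%/40% mixing ratio comes from the caption above, while the index-list interface and sampling details are hypothetical placeholders rather than the actual DeepSeek-VL training code.

```python
import random
from typing import Iterator, List, Sequence

def modality_grouped_batches(
    language_indices: Sequence[int],
    multimodal_indices: Sequence[int],
    batch_size: int,
    multimodal_ratio: float = 0.6,   # Multimodal : Language = 60% : 40%, as in stage 2
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield batches that each contain ONLY language or ONLY multimodal samples.

    Mixing modalities inside a batch makes the gradient step wait for the slowest
    (multimodal) samples; grouping them per global step avoids that straggler effect.
    """
    rng = random.Random(seed)
    lang = list(language_indices)
    mm = list(multimodal_indices)
    rng.shuffle(lang)
    rng.shuffle(mm)
    while lang or mm:
        # Choose which modality this global step trains on, respecting the target ratio.
        pick_mm = mm and (not lang or rng.random() < multimodal_ratio)
        pool = mm if pick_mm else lang
        batch, pool[:] = pool[:batch_size], pool[batch_size:]
        yield batch
```

Each yielded batch contains indices of a single modality, so every optimizer step processes samples with similar cost; over a long run the fraction of multimodal steps approaches the configured ratio.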
Modality Warmup. Considering that our approach involves multimodal training on the foundation of a language model, directly mixing in multimodal data at a fixed proportion from the outset can destabilize the model. To counteract this issue, we propose a simple yet effective modality warm-up strategy: initially, we set the language data ratio to 1, and then gradually decrease it to the target ratio used for the final model training (e.g., 0.7).

Our experiments, as illustrated in Figure 9, demonstrate that this strategy effectively prevents a significant decline in language capability at the beginning of training, while also yielding comparatively superior outcomes in the final phases for both the language and multimodal domains. This gradual adaptation enables the model to adjust more seamlessly to the incorporation of multimodal data, thereby improving overall training stability and performance.

Figure 9 | Comparative performance results on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks for modality warmup. Modality warmup consistently matches or surpasses the performance of the approach without modality warmup across all evaluated tasks in training stage 2 (Multimodal : Language = 60% : 40%).
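The warm-up itself is just a schedule on the language-data ratio. A minimal sketch under the setting described above (start at 1.0 and anneal to a target such as 0.7) is given below; the linear ramp shape and the warm-up length are illustrative assumptions, not the exact schedule used for DeepSeek-VL.

```python
def language_ratio(step: int, warmup_steps: int = 2000, target_ratio: float = 0.7) -> float:
    """Fraction of pure-language data to sample at a given training step.

    Starts at 1.0 (language only) and decays linearly to `target_ratio`;
    the ramp shape and length are placeholders, only the endpoints follow the text.
    """
    if step >= warmup_steps:
        return target_ratio
    progress = step / warmup_steps
    return 1.0 + (target_ratio - 1.0) * progress

# The multimodal share at a step is simply 1 - language_ratio(step).
```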
Vision Encoder Selection. To better acquire and utilize image information, we compare the training loss of different vision encoders under our training settings, except that the number of stage 2 training steps is reduced to 8,000 for efficiency. As illustrated in Figure 10, incorporating a vision-only self-supervised encoder significantly improves the training loss. To process high-resolution images more effectively, our research ultimately adopts a hybrid vision encoder strategy, combining SigLIP with SAM in our model's implementation.

Figure 10 | Comparative analysis of different vision encoders (CLIP, SigLIP, SigLIP + DINO, SigLIP + SAM) on training losses in stage 2.

Vision-Language Adaptor Design. To improve the efficiency of extracting information from the visual encoder while adhering to the current token-length constraints, adjustments can be made to the vision-language adaptor in two main ways: the method used to combine visual features and the design of the MLP adaptor.

Previous studies (Tong et al., 2024) have indicated that combining visual features along the sequence dimension can lead to better model performance, although this comes with the trade-off of increased computational requirements due to a longer sequence of visual feature tokens. As demonstrated in the top section of Table 10, reducing the sequence length by stacking visual features along the image's width or height dimension before sequence concatenation, in order to keep the sequence length constant, does not achieve better results than simply merging them along the embedding dimension on most metrics. In terms of the adaptor architecture, employing separate MLP adaptors for each vision feature encoder allows for more precise adjustment to the specific values and distribution patterns of the visual features, facilitating smoother model training; conversely, using a shared MLP adaptor for the different vision encoders contributes to adequate feature fusion. We adopt a mixed strategy and report stable and improved performance, as outlined in the lower section of Table 10.

Architecture                 MMB    MMC    SEED   POPE   ScienceQA   MMMU   OCRB   Average
Sequence Concatenation:
Token Pooling-W              61.2   59.6   61.6   86.5   57.7        31.6   304    55.5
Token Pooling-H              59.9   58.3   61.6   83.8   55.0        32.0   291    54.2
Embedding Concatenation:
Hybrid MLP                   61.7   60.1   62.9   87.8   56.6        31.3   309    55.9
Shared MLP                   62.0   58.9   62.5   86.6   54.7        30.2   318    55.2
Separate MLP                 57.5   58.7   63.1   86.5   56.6        29.0   299    54.5

Table 10 | Comparison of different adaptor architectures using SigLIP and SAM as the hybrid vision encoder; hybrid MLPs are used for the sequence-concatenation experiments. Bolded entries represent the best results, while underlined entries denote the second-best results. For calculating the average score, we divide OCRBench by the total number of questions.
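To make the two design axes concrete, the sketch below implements an embedding-dimension concatenation adaptor with the mixed MLP strategy (separate per-encoder MLPs followed by a shared fusion MLP); the encoder names, feature dimensions, and module layout are illustrative assumptions, not the exact DeepSeek-VL implementation.

```python
import torch
import torch.nn as nn

class HybridVisionAdaptor(nn.Module):
    """Fuse features from two vision encoders (e.g., SigLIP + SAM) for an LLM.

    Per-encoder ("separate") MLPs first adjust each feature distribution, then the
    features are concatenated along the embedding dimension and a shared MLP projects
    them to the LLM hidden size (the mixed strategy described in the text).
    All sizes below are placeholders.
    """

    def __init__(self, dim_a: int = 1024, dim_b: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.proj_a = nn.Sequential(nn.Linear(dim_a, llm_dim), nn.GELU())   # e.g., SigLIP branch
        self.proj_b = nn.Sequential(nn.Linear(dim_b, llm_dim), nn.GELU())   # e.g., SAM branch
        self.shared = nn.Linear(2 * llm_dim, llm_dim)                       # shared fusion MLP

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (batch, num_tokens, dim); both branches keep the same token
        # count, so concatenating on the embedding axis leaves the sequence length
        # (and hence the LLM token budget) unchanged.
        fused = torch.cat([self.proj_a(feats_a), self.proj_b(feats_b)], dim=-1)
        return self.shared(fused)

# tokens = HybridVisionAdaptor()(siglip_feats, sam_feats)  # -> (batch, num_tokens, llm_dim)
```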
5. Conclusion, Limitation, and Future Work

In this technical report, we have introduced DeepSeek-VL, a series of Multimodal Large Language Models available at scales of 1.3B and 6.7B parameters. The report has unveiled the limitations inherent in the predominant projector-based pretraining methodologies, setting the stage for the innovative approach adopted by DeepSeek-VL. By prioritizing a joint vision and language (VL) pretraining phase, DeepSeek-VL transcends traditional models by ensuring that the integration of multimodal data does not compromise the linguistic capabilities of the Large Language Model (LLM). This is achieved through a strategic warm-up data ratio and the introduction of a hybrid vision encoder, which together enable the efficient processing of high-resolution images without losing sight of semantic richness.

The incorporation of a hybrid vision encoder, capable of handling 1024 x 1024 images within a constrained token budget, underscores our commitment to preserving nuanced detail and semantic integrity across diverse tasks. As a result, DeepSeek-VL emerges as a pioneering model that not only meets but exceeds the standards set by generalist models in its class. It showcases exceptional performance across a wide range of visually-centric benchmarks while sustaining formidable proficiency in language-centric evaluations.

In making DeepSeek-VL publicly available, we aim to catalyze further innovation and exploration within the research community, providing a robust foundation upon which future studies can build. This gesture of openness is intended to facilitate the collective advancement of our understanding of, and capabilities in, handling multimodal data.

Looking ahead, we are excited to announce plans to scale DeepSeek-VL up to larger sizes, incorporating Mixture of Experts (MoE) technology. This forthcoming expansion promises to further enhance the model's efficiency and effectiveness, opening up new horizons for research and application in the field of AI.

References

01-ai. Yi-34B vision language model. https://huggingface.co/01-ai/Yi-VL-34B, 2024.

Abi. Screenshot to code. https:/

Anna's Archive. Anna's archive. https://annas-arch
228、ive.org/,2024.Anthropic.Introducing Claude,2023.URLhttps:/ al.Program synthesis with large language models.arXiv preprint arXiv:2108.07732,2021.J.Bai,S.Bai,S.Yang,S.Wang,S.Tan,P.Wang,J.Lin,C.Zhou,and J.Zhou.Qwen-vl:Aversatile vision-language model for understanding,localization,text reading,and beyo
229、nd.arXiv preprint arXiv:2308.12966,2023.R.Bavishi,E.Elsen,C.Hawthorne,M.Nye,A.Odena,A.Somani,and S.Ta srlar.Introducingour multimodal models,2023.URLhttps:/www.adept.ai/blog/fuyu-8b.L.Blecher.Latex-ocr.GitHub repository,2024.URLhttps:/ R.Stojnic.Nougat:Neural optical understanding foracademic docume
230、nts.arXiv preprint arXiv:2308.13418,2023.A.Burns,K.Srinivasan,J.Ainslie,G.Brown,B.A.Plummer,K.Saenko,J.Ni,and M.Guo.Asuite of generative tasks for multi-level multimodal webpage understanding.InThe 2023Conference on Empirical Methods in Natural Language Processing(EMNLP),2023.URLhttps:/ D.Lin.Shareg
231、pt4v:Improvinglarge multi-modal models with better captions.arXiv preprint arXiv:2311.12793,2023.C.K.Chng,Y.Liu,Y.Sun,C.C.Ng,C.Luo,Z.Ni,C.Fang,S.Zhang,J.Han,E.Ding,et al.Icdar2019 robust reading challenge on arbitrary-shaped text-rrc-art.In2019 InternationalConference on Document Analysis and Recogn
232、ition(ICDAR),pages 15711576.IEEE,2019.K.Cobbe,V.Kosaraju,M.Bavarian,M.Chen,H.Jun,L.Kaiser,M.Plappert,J.Tworek,J.Hilton,R.Nakano,et al.Training verifiers to solve math word problems.arXiv preprintarXiv:2110.14168,2021.W.Dai,J.Li,D.Li,A.M.H.Tiong,J.Zhao,W.Wang,B.Li,P.Fung,and S.Hoi.Instructblip:Toward
233、s general-purpose vision-language models with instruction tuning,2023.DeepSeek-AI.Deepseek llm:Scaling open-source language models with longtermism.arXivpreprint arXiv:2401.02954,2024.URLhttps:/ al.Internlm-xcomposer2:Mastering free-form text-image composition and comprehensionin vision-language lar
234、ge model.arXiv preprint arXiv:2401.16420,2024.echo840.Detailed caption dataset.https:/huggingface.co/datasets/echo840/Detailed_Caption,2024.W.Foundation.Wikimedia downloads.URLhttps:/dumps.wikimedia.org.J.Gao,R.Pi,J.Zhang,J.Ye,W.Zhong,Y.Wang,L.Hong,J.Han,H.Xu,Z.Li,et al.G-llava:Solving geometric pro
235、blem with multi-modal large language model.arXiv preprintarXiv:2312.11370,2023.L.Gao,S.Biderman,S.Black,L.Golding,T.Hoppe,C.Foster,J.Phang,H.He,A.Thite,N.Nabeshima,et al.The Pile:An 800GB dataset of diverse text for language modeling.arXivpreprint arXiv:2101.00027,2020.Google.An important next step
on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/. D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020. High-flyer. Hai-llm: An efficient and lightweight training tool for large models, 2023. URL https://www.hi
237、gh- al.Screenqa:Large-scale question-answer pairs over mobileapp screenshots.arXiv preprint arXiv:2209.08199,2022.A.Hu,Y.Shi,H.Xu,J.Ye,Q.Ye,M.Yan,C.Li,Q.Qian,J.Zhang,and F.Huang.mplug-paperowl:Scientific diagram analysis with the multimodal large language model.arXivpreprint arXiv:2311.18248,2023.Hu
238、ggingFaceM4.Websight dataset.https:/huggingface.co/datasets/HuggingFaceM4/WebSight,2024.24S.Kantharaj,R.T.Leong,X.Lin,A.Masry,M.Thakkar,E.Hoque,and S.Joty.Chart-to-text:A large-scale benchmark for chart summarization.In S.Muresan,P.Nakov,andA.Villavicencio,editors,Proceedings of the 60th Annual Meet
239、ing of the Association forComputational Linguistics(Volume 1:Long Papers),pages 40054023,Dublin,Ireland,May2022.Association for Computational Linguistics.doi:10.18653/v1/2022.acl-long.277.URLhttps:/aclanthology.org/2022.acl-long.277.A.Kirillov,E.Mintun,N.Ravi,H.Mao,C.Rolland,L.Gustafson,T.Xiao,S.Whi
240、tehead,A.C.Berg,W.-Y.Lo,et al.Segment anything.arXiv preprint arXiv:2304.02643,2023.D.Kocetkov,R.Li,L.B.Allal,J.Li,C.Mou,C.M.Ferrandis,Y.Jernite,M.Mitchell,S.Hughes,T.Wolf,D.Bahdanau,L.von Werra,and H.de Vries.The stack:3 tb of permissively licensedsource code.In Transactions on Machine Learning Res
241、earch,2023.V.A.Korthikanti,J.Casper,S.Lym,L.McAfee,M.Andersch,M.Shoeybi,and B.Catanzaro.Reducing activation recomputation in large transformer models.Proceedings of MachineLearning and Systems,5,2023.I.Krylov,S.Nosov,and V.Sovrasov.Open images v5 text annotation and yet another mask textspotter.In A
242、sian Conference on Machine Learning,pages 379389.PMLR,2021.A.Kulkarni and J.Truelsen.wkhtmltopdf.https:/wkhtmltopdf.org/.Project maintainedby Ashish Kulkarni,originally created by Jakob Truelsen.Accessed:2024-02-22.LAION.Gpt-4v dataset.https:/huggingface.co/datasets/laion/gpt4v-dataset,2023.B.Li,R.W
243、ang,G.Wang,Y.Ge,Y.Ge,and Y.Shan.Seed-bench:Benchmarking multimodal llmswith generative comprehension.arXiv preprint arXiv:2307.16125,2023a.S.Li and N.Tajbakhsh.Scigraphqa:A large-scale synthetic multi-turn question-answeringdataset for scientific graphs,2023.Y.Li,G.Li,L.He,J.Zheng,H.Li,and Z.Guan.Wi
244、dget captioning:Generating natural languagedescription for mobile user interface elements.arXiv preprint arXiv:2010.04295,2020.Y.Li,H.Mao,R.Girshick,and K.He.Exploring plain vision transformer backbones for objectdetection.In European Conference on Computer Vision,pages 280296.Springer,2022.Y.Li,Y.D
245、u,K.Zhou,J.Wang,W.X.Zhao,and J.-R.Wen.Evaluating object hallucination inlarge vision-language models.arXiv preprint arXiv:2305.10355,2023b.J.Lin,H.Yin,W.Ping,Y.Lu,P.Molchanov,A.Tao,H.Mao,J.Kautz,M.Shoeybi,and S.Han.Vila:On pre-training for visual language models.arXiv preprint arXiv:2312.07533,2023a
246、.Z.Lin,C.Liu,R.Zhang,P.Gao,L.Qiu,H.Xiao,H.Qiu,C.Lin,W.Shao,K.Chen,et al.Sphinx:The joint mixing of weights,tasks,and visual embeddings for multi-modal large languagemodels.arXiv preprint arXiv:2311.07575,2023b.F.Liu,F.Piccinno,S.Krichene,C.Pang,K.Lee,M.Joshi,Y.Altun,N.Collier,andJ.M.Eisenschlos.Matc
247、ha:Enhancing visual language pretraining with math reasoning and chart derendering.arXiv preprint arXiv:2212.09662,2022a.H.Liu,C.Li,Y.Li,B.Li,Y.Zhang,S.Shen,and Y.J.Lee.Llava-next:Improved reasoning,ocr,and world knowledge,January 2024a.URLhttps:/llava-vl.github.io/blog/2024-01-30-llava-next/.25H.Li
248、u,C.Li,Q.Wu,and Y.J.Lee.Visual instruction tuning.Advances in neural informationprocessing systems,36,2024b.Y.Liu,G.Zhu,B.Zhu,Q.Song,G.Ge,H.Chen,G.Qiao,R.Peng,L.Wu,and J.Wang.Taisu:A166m large-scale high-quality dataset for chinese vision-language pre-training.In S.Koyejo,S.Mohamed,A.Agarwal,D.Belgr
249、ave,K.Cho,and A.Oh,editors,Advances in NeuralInformation Processing Systems,volume 35,pages 1670516717.Curran Associates,Inc.,2022b.URLhttps:/proceedings.neurips.cc/paper_files/paper/2022/file/6a386d703b50f1cf1f61ab02a15967bb-Paper-Datasets_and_Benchmarks.pdf.Y.Liu,H.Duan,Y.Zhang,B.Li,S.Zhang,W.Zhao
250、,Y.Yuan,J.Wang,C.He,Z.Liu,et al.Mmbench:Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281,2023a.Y.Liu,Z.Li,H.Li,W.Yu,M.Huang,D.Peng,M.Liu,M.Chen,C.Li,L.Jin,et al.On the hiddenmystery of ocr in large multimodal models.arXiv preprint arXiv:2305.07895,2023b.S.Long,S.Qin,D.P
251、anteleev,A.Bissacco,Y.Fujii,and M.Raptis.Towards end-to-end unifiedscene text detection and layout analysis.InProceedings of the IEEE/CVF Conference onComputer Vision and Pattern Recognition,2022.P.Lu,L.Qiu,J.Chen,T.Xia,Y.Zhao,W.Zhang,Z.Yu,X.Liang,and S.-C.Zhu.Iconqa:Anew benchmark for abstract diag
252、ram understanding and visual language reasoning.arXivpreprint arXiv:2110.13214,2021.P.Lu,S.Mishra,T.Xia,L.Qiu,K.-W.Chang,S.-C.Zhu,O.Tafjord,P.Clark,and A.Kalyan.Learn to explain:Multimodal reasoning via thought chains for science question answering.In The 36th Conference on Neural Information Proces
253、sing Systems(NeurIPS),2022a.P.Lu,S.Mishra,T.Xia,L.Qiu,K.-W.Chang,S.-C.Zhu,O.Tafjord,P.Clark,and A.Kalyan.Learn to explain:Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems,35:25072521,2022b.P.Lu,H.Bansal,T.Xia,J.Liu,C.Li,H.Hajish
254、irzi,H.Cheng,K.-W.Chang,M.Galley,and J.Gao.Mathvista:Evaluating mathematical reasoning of foundation models in visual contexts.arXivpreprint arXiv:2310.02255,2023.J.Mao,J.Huang,A.Toshev,O.Camburu,A.L.Yuille,and K.Murphy.Generation andcomprehension of unambiguous object descriptions.InProceedings of
255、the IEEE conferenceon computer vision and pattern recognition,pages 1120,2016.A.Masry,P.Kavehzadeh,X.L.Do,E.Hoque,and S.Joty.Unichart:A universal vision-languagepretrained model for chart comprehension and reasoning.arXiv preprint arXiv:2305.14761,2023.D.Narayanan,M.Shoeybi,J.Casper,P.LeGresley,M.Pa
256、twary,V.Korthikanti,D.Vainbrand,P.Kashinkunti,J.Bernauer,B.Catanzaro,et al.Efficient large-scale language model trainingon gpu clusters using megatron-lm.InProceedings of the International Conference for HighPerformance Computing,Networking,Storage and Analysis,pages 115,2021.N.Nayef,F.Yin,I.Bizid,H
257、.Choi,Y.Feng,D.Karatzas,Z.Luo,U.Pal,C.Rigaud,J.Chazalon,et al.Icdar2017 robust reading challenge on multi-lingual scene text detection and scriptidentification-rrc-mlt.In2017 14th IAPR international conference on document analysis andrecognition(ICDAR),volume 1,pages 14541459.IEEE,2017.26OpenAI.Chat
258、gpt:Optimizing language models for dialogue.2022.URLhttps:/ technical report.arXiv,2023a.R.OpenAI.Gpt-4v(ision)system card.2023b.J.A.Rodriguez,D.Vazquez,I.Laradji,M.Pedersoli,and P.Rodriguez.Ocr-vqgan:Tam-ing text-within-image generation.InProceedings of the IEEE/CVF Winter Conference onApplications
259、 of Computer Vision,pages 36893698,2023.R.Schaeffer,B.Miranda,and S.Koyejo.Are emergent abilities of large language models amirage?Advances in Neural Information Processing Systems,36,2024.N.Shazeer.Glu variants improve transformer.arXiv preprint arXiv:2002.05202,2020.B.Shi,C.Yao,M.Liao,M.Yang,P.Xu,
260、L.Cui,S.Belongie,S.Lu,andX.Bai.Icdar2017competitionon reading chinese text in the wild(rctw-17).In2017 14th iapr international conference ondocument analysis and recognition(ICDAR),volume 1,pages 14291434.IEEE,2017.M.Shoeybi,M.Patwary,R.Puri,P.LeGresley,J.Casper,and B.Catanzaro.Megatron-lm:Training
261、multi-billion parameter language models using model parallelism.arXiv preprintarXiv:1909.08053,2019.A.Singh,G.Pang,M.Toh,J.Huang,W.Galuba,and T.Hassner.Textocr:Towards large-scale end-to-end reasoning for arbitrary-shaped scene text.In Proceedings of the IEEE/CVFconference on computer vision and pat
262、tern recognition,pages 88028812,2021.J.Su,M.Ahmed,Y.Lu,S.Pan,W.Bo,and Y.Liu.Roformer:Enhanced transformer with rotaryposition embedding.Neurocomputing,568:127063,2024.Q.Sun,Q.Yu,Y.Cui,F.Zhang,X.Zhang,Y.Wang,H.Gao,J.Liu,T.Huang,and X.Wang.Generative pretraining in multimodality.arXiv preprint arXiv:2
263、307.05222,2023.Y.Sun,Z.Ni,C.-K.Chng,Y.Liu,C.Luo,C.C.Ng,J.Han,E.Ding,J.Liu,D.Karatzas,et al.Icdar 2019 competition on large-scale street view text with partial labeling-rrc-lsvt.In2019International Conference on Document Analysis and Recognition(ICDAR),pages 15571562.IEEE,2019.G.Team,R.Anil,S.Borgeau
264、d,Y.Wu,J.-B.Alayrac,J.Yu,R.Soricut,J.Schalkwyk,A.M.Dai,A.Hauth,et al.Gemini:a family of highly capable multimodal models.arXiv preprintarXiv:2312.11805,2023.S.Tong,Z.Liu,Y.Zhai,Y.Ma,Y.LeCun,and S.Xie.Eyes wide shut?exploring the visualshortcomings of multimodal llms.arXiv preprint arXiv:2401.06209,2
265、024.H.Touvron,T.Lavril,G.Izacard,X.Martinet,M.-A.Lachaux,T.Lacroix,B.Rozire,N.Goyal,E.Hambro,F.Azhar,et al.LLaMA:Open and efficient foundation language models.arXivpreprint arXiv:2302.13971,2023a.H.Touvron,L.Martin,K.Stone,P.Albert,A.Almahairi,Y.Babaei,N.Bashlykov,S.Batra,P.Bhargava,S.Bhosale,D.Bike
266、l,L.Blecher,C.Canton-Ferrer,M.Chen,G.Cucurull,D.Esiobu,J.Fernandes,J.Fu,W.Fu,B.Fuller,C.Gao,V.Goswami,N.Goyal,A.Hartshorn,S.Hosseini,R.Hou,H.Inan,M.Kardas,V.Kerkez,M.Khabsa,I.Kloumann,A.Korenev,P.S.Koura,M.Lachaux,T.Lavril,J.Lee,D.Liskovich,Y.Lu,Y.Mao,X.Martinet,T.Mihaylov,P.Mishra,27I.Molybog,Y.Nie
267、,A.Poulton,J.Reizenstein,R.Rungta,K.Saladi,A.Schelten,R.Silva,E.M.Smith,R.Subramanian,X.E.Tan,B.Tang,R.Taylor,A.Williams,J.X.Kuan,P.Xu,Z.Yan,I.Zarov,Y.Zhang,A.Fan,M.Kambadur,S.Narang,A.Rodriguez,R.Stojnic,S.Edunov,andT.Scialom.Llama 2:Open foundation and fine-tuned chat models.CoRR,abs/2307.09288,20
268、23b.doi:10.48550/arXiv.2307.09288.URLhttps:/doi.org/10.48550/arXiv.2307.09288.A.Veit,T.Matera,L.Neumann,J.Matas,and S.Belongie.Coco-text:Dataset and benchmark fortext detection and recognition in natural images.arXiv preprint arXiv:1601.07140,2016.B.Wang,G.Li,X.Zhou,Z.Chen,T.Grossman,and Y.Li.Screen
269、2words:Automatic mobileui summarization with multimodal learning.InThe 34th Annual ACM Symposium on UserInterface Software and Technology,pages 498510,2021.J.Wang,L.Meng,Z.Weng,B.He,Z.Wu,and Y.-G.Jiang.To see is to believe:Prompting gpt-4vfor better visual instruction tuning.arXiv preprint arXiv:231
270、1.07574,2023a.W.Wang,Q.Lv,W.Yu,W.Hong,J.Qi,Y.Wang,J.Ji,Z.Yang,L.Zhao,X.Song,et al.Cogvlm:Visual expert for pretrained language models.arXiv preprint arXiv:2311.03079,2023b.H.Wei,L.Kong,J.Chen,L.Zhao,Z.Ge,J.Yang,J.Sun,C.Han,and X.Zhang.Vary:Scalingup the vision vocabulary for large vision-language mo
271、dels.arXiv preprint arXiv:2312.06109,2023.Y.Yang,A.Panagopoulou,Q.Lyu,L.Zhang,M.Yatskar,and C.Callison-Burch.Visual goal-stepinference using wikihow.arXiv preprint arXiv:2104.05845,2021.J.Ye,A.Hu,H.Xu,Q.Ye,M.Yan,G.Xu,C.Li,J.Tian,Q.Qian,J.Zhang,et al.Ureader:Universal ocr-free visually-situated langu
272、age understanding with multimodal large languagemodel.arXiv preprint arXiv:2310.05126,2023.Q.Yu,Q.Sun,X.Zhang,Y.Cui,F.Zhang,Y.Cao,X.Wang,and J.Liu.Capsfusion:Rethinkingimage-text data at scale.arXiv preprint arXiv:2310.20550,2023a.W.Yu,Z.Yang,L.Li,J.Wang,K.Lin,Z.Liu,X.Wang,and L.Wang.Mm-vet:Evaluati
273、ng largemultimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490,2023b.X.Yue,Y.Ni,K.Zhang,T.Zheng,R.Liu,G.Zhang,S.Stevens,D.Jiang,W.Ren,Y.Sun,et al.Mmmu:A massive multi-discipline multimodal understanding and reasoning benchmark forexpert agi.arXiv preprint arXiv:2311.16502,2023
274、.R.Zellers,A.Holtzman,Y.Bisk,A.Farhadi,and Y.Choi.HellaSwag:Can a machine really finishyour sentence?In A.Korhonen,D.R.Traum,and L.Mrquez,editors,Proceedings of the 57thConference of the Association for Computational Linguistics,ACL 2019,Florence,Italy,July28-August 2,2019,Volume 1:Long Papers,pages
275、 47914800.Association for ComputationalLinguistics,2019.doi:10.18653/v1/p19-1472.URLhttps:/doi.org/10.18653/v1/p19-1472.B.Zhang and R.Sennrich.Root mean square layer normalization.Advances in NeuralInformation Processing Systems,32,2019.G.Zhang,X.Du,B.Chen,Y.Liang,T.Luo,T.Zheng,K.Zhu,Y.Cheng,C.Xu,S.
276、Guo,et al.Cmmmu:A chinese massive multi-discipline multimodal understanding benchmark.arXivpreprint arXiv:2401.11944,2024.28R.Zhang,Y.Zhou,Q.Jiang,Q.Song,N.Li,K.Zhou,L.Wang,D.Wang,M.Liao,M.Yang,et al.Icdar 2019 robust reading challenge on reading chinese text on signboard.In2019international confere
277、nce on document analysis and recognition(ICDAR),pages 15771581.IEEE,2019.Y.Zhang,L.Gueguen,I.Zharkov,P.Zhang,K.Seifert,and B.Kadlec.Uber-text:A large-scale dataset for optical character recognition from street-level imagery.InSUNw:SceneUnderstanding Workshop-CVPR 2017,Hawaii,U.S.A.,2017.URLhttp:/sun
278、w.csail.mit.edu/abstract/uberText.pdf.L.Zheng,W.-L.Chiang,Y.Sheng,S.Zhuang,Z.Wu,Y.Zhuang,Z.Lin,Z.Li,D.Li,E.Xing,et al.Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural InformationProcessing Systems,36,2024.W.Zhong,R.Cui,Y.Guo,Y.Liang,S.Lu,Y.Wang,A.Saied,W.Chen,and N.Duan.AGIE
279、val:Ahuman-centric benchmark for evaluating foundation models.CoRR,abs/2304.06364,2023.doi:10.48550/arXiv.2304.06364.URLhttps:/doi.org/10.48550/arXiv.2304.06364.W.Zhu,J.Hessel,A.Awadalla,S.Y.Gadre,J.Dodge,A.Fang,Y.Yu,L.Schmidt,W.Y.Wang,and Y.Choi.Multimodal c4:An open,billion-scale corpus of images
interleaved with text. Advances in Neural Information Processing Systems, 36, 2024.

A. Appendix

Prompt: Convert this flowchart into a Python program.
Response (Open-source Model 1): sorry, as a language model AI, I cannot process image into any other content.
Response (Open-source Model 2): Sure! Here is a Python code snippet that can replicate the flowchart:
Response (DeepSeek-VL): This flowchart depicts a number-guessing game written in the Python programming language. The goal of the game is for the player to guess a random number between 1 and 10; if the guess is correct, the game ends and displays "You got it!". If the player's guess is above or below it, the game prompts "Too high!" or "Too low!", and the player can try again. Following the flowchart, we can implement this game as the following Python code: In this program, we use the random.randint() function to generate a random integer between 1 and 10 as the answer. We then enter a loop that runs until the player's guess equals the answer. After each guess, we check whether the guess is within range and whether it equals the answer. If it does, we print "You got it!"; if not, we print "Too high!" or "Too low!" and let the player guess again. Once the player's guess equals the answer, the game ends.

Figure 11 | Visualization results. DeepSeek-VL can understand children's programming diagrams from the real world and provide detailed and organized explanations.
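The Python code referred to in the responses above appears only as an image in the original figure; a minimal sketch of the game exactly as the DeepSeek-VL response describes it (random.randint, a loop, and "Too high!"/"Too low!" feedback) might look like this:

```python
import random

def guessing_game() -> None:
    """Number-guessing game as described in the Figure 11 response."""
    answer = random.randint(1, 10)          # random integer between 1 and 10
    while True:
        guess = int(input("Guess a number between 1 and 10: "))
        if guess == answer:
            print("You got it!")
            break                            # the game ends once the guess is correct
        elif guess > answer:
            print("Too high!")
        else:
            print("Too low!")

if __name__ == "__main__":
    guessing_game()
```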
Prompt: Explain the image.
Response (DeepSeek-VL): The image is a line graph that compares the performance of two different learning rate schedulers, namely the Multi Step Learning Rate Scheduler and the Cosine Learning Rate Scheduler. The x-axis represents the number of processed tokens in billions, ranging from 0 to 100 billion. The y-axis represents the training loss, which ranges from 2.0 to 3.0. The blue line represents the Multi Step Learning Rate Scheduler with three steps of increasing rates: (80% + 10% + 10%). This line starts at a high point near 3.0 on the y-axis when the number of processed tokens is 0 billion, indicating a higher initial loss. As the number of processed tokens increases, this line shows a steep decline in training loss, suggesting rapid improvement in model performance as more data is processed. However, after approximately 40 billion processed tokens, the line begins to level off, indicating diminishing returns on further processing. The orange line represents the Cosine Learning Rate Scheduler. It also starts at a high point near 3.0 when the number of processed tokens is 0 billion but has a much smoother descent than the blue line. This suggests a slower but steady decrease in training loss as more tokens are processed. After around 60 billion processed tokens, this line levels off as well, similar to the blue line. Both lines show a general trend of decreasing training loss as the number of processed tokens increases, indicating improved model performance over time. However, the Multi Step Learning Rate Scheduler appears to achieve lower losses more quickly compared to the Cosine Learning Rate Scheduler, especially in the early stages of processing tokens.
Response (Open-source Model 1): Line 1 expresses the number of tokens, following a nonlinear trend. Line 2 starts at value 80, then is decreasing and increases by 10% until reaching a value where it then increases by 10%, before it flattens out. Line 3 also starts in increasing part, then has some constant decreases between different points, and after that express an increasing part.
Response (Open-source Model 2): The image displays a line graph comparing the training losses of two different learning rates schedulers: the Multi Step Learning Rate Scheduler (80% + 10% = 10%) and the Cosine Learning Rate Scheduler. The y-axis represents the training loss in terms of billions, ranging from approximately 2.2 to 3.0. The x-axis represents the number of processed tokens, with a range from 0 to 100. The Multi Step Learning Rate Scheduler's line is blue, showing a sharp decline initially and then stabilizing. The Cosine Learning Rate Scheduler's line is orange, showing a more gradual decline.

Prompt: Help me find the bug.
Response (Open-source Model 1): The function name in the code is find_max_in_sequence, but it has not actually been called as the condition of the if-statement checks for not sequence instead of if not sequence. The actual line where the function should be called is missing from the provided code.
Response (Open-source Model 2): The bug in the code is that the condition for returning None is set to if not sequence, which is incorrect. The correct condition should be if not max_value, to ensure that the function returns None when the sequence is not found or when the max value is not found within the sequence.
Response (DeepSeek-VL): The image shows a piece of code written in Python, which is intended to find the maximum value in a sequence. The function definition starts with def, indicating that this is a new function being defined. The name of the function is find_max_in_sequence. The first line checks if the input is not a sequence; if it is, the function returns None. This is an important check because functions should only operate on sequences (like lists or tuples) and not other types of data structures. If the input is not a sequence, the function will return no result, which can be useful for handling invalid inputs. Next, there's a statement defining what a sequence is: max_value = sequence[1]. This seems incorrect as it assigns the index 1 to max_value instead of the actual element at position 1 in the sequence. Then we have a loop that iterates over each number in the sequence. It uses an if condition to check whether the current number (num) is greater than the current maximum value (max_value). If true, it updates max_value to be equal to num. This part of the code appears correct, as it correctly compares each number against the current maximum value and updates the maximum when necessary. Finally, after the loop ends, the function returns the updated max_value, which contains the maximum value found in the sequence. Based on the visual clues provided by the code snippet, the bug lies in the assignment of max_value, where it incorrectly assigns the index 1 rather than the value at that index in the sequence. The corrected version would look like this: max_value = sequence[0]. This error could potentially lead to unexpected results if the sequence has more than one element, as the function might fail to find the maximum value due to the wrong assignment of max_value.

Figure 12 | Visualization results. DeepSeek-VL has strong understanding capabilities for code and charts in the real world.
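The code in this figure is likewise shown only as an image. Based on the details the responses mention (the if not sequence guard, the max_value = sequence[1] initialization, and the comparison loop), a plausible reconstruction of the snippet and the one-line fix DeepSeek-VL proposes is:

```python
def find_max_in_sequence(sequence):
    """Return the maximum value in a sequence, or None for an empty input."""
    if not sequence:
        return None
    max_value = sequence[1]   # BUG identified in Figure 12: should start from sequence[0]
    for num in sequence:
        if num > max_value:
            max_value = num
    return max_value

# Fix suggested by DeepSeek-VL: initialize from the first element instead.
# max_value = sequence[0]
```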
Prompt: What coin is that?
Response (DeepSeek-VL): The coin is a 10 Baht coin from Thailand.
Response (Open-source Model 1): The coin is an Asian coin, which features a cityscape on it.
Response (Open-source Model 2): answering does not require reading text in the image

Prompt: Based on this image, compose a seven-character quatrain.
Response (Open-source Model 1): Snowy mountains, vast and grey, shroud the cold sun; the frosted forest rustles with a clear sound.
Response (Open-source Model 2): Bare trees tower high, golden tones kiss the white snow, the mountains whisper ancient stories, and the embrace of daylight begins to fade.
Response (DeepSeek-VL): Dusk over the mountains and rivers mirrors the slanting sun; snow blankets the woods and the chill lingers long. The lone tree's shadow slants while the traveler has yet to return; a streak of lingering sunset glow shines upon a grieving heart.

Prompt: Which company is the character in the image from?
Response (Open-source Model 1): The character in the image comes from the game company "Genshin Impact".
Response (Open-source Model 2): The character shown in this image is from Final Fantasy XIV, a game by the Japanese company Square Enix.
Response (DeepSeek-VL): The character in the image is from miHoYo. The company is known for its mobile game Genshin Impact, and the character design style in that game is similar to the style of the character in the image. In addition, the character's clothing and weapon design also match the characteristics of miHoYo's games.

Figure 13 | Visualization results. DeepSeek-VL possesses extensive knowledge of the real world.

Prompt: Convert it to markdown table.
Response (Open-source Model 1): [table output shown only as an image in the original figure]
Response (Open-source Model 2): [table output shown only as an image in the original figure]
Response (DeepSeek-VL): Here is the image content converted into a markdown table: [table shown only as an image in the original figure] Please note that the A column in the original table seems to be empty or not fully visible in the image provided.

Figure 14 | Visualization results. DeepSeek-VL is capable of accurately reading the contents of real-world tables.