Seedream 3.0 Technical Report

ByteDance Seed

Abstract

We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT and a VLM-based reward model with scaling, thereby achieving outputs that align well with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular text rendering of complicated Chinese characters, which is important for professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.

Official Page: https://

Figure 1 Seedream 3.0 demonstrates outstanding performance across all evaluation aspects. Due to missing data, the Portrait result of Imagen 3 and the overall result of Seedream 2.0 are represented by the average values of the other models. (Legend: Seedream 2.0, Imagen 3, Ideogram 3.0, Midjourney v6.1, FLUX1.1 Pro, Seedream 3.0.) In addition, Seedream 3.0 ranks first on the Artificial Analysis Text to Image Model Leaderboard with an Arena ELO score of 1158 at 17.0K appearances at the time of publication.¹

¹https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard

arXiv:2504.11346v2 [cs.CV] 16 Apr 2025

Figure 2 Seedream 3.0 visualization.

Contents

1 Introduction
2 Technical Details
  2.1 Data
  2.2 Model Pre-training
    2.2.1 Model Architectures
    2.2.2 Model Training Details
  2.3 Model Post-training
    2.3.1 Aesthetic Caption
    2.3.2 Model Training Details
    2.3.3 Reward Model Scaling
  2.4 Model Acceleration
3 Model Performance
  3.1 Artificial Analysis Arena
  3.2 Comprehensive Evaluation
    3.2.1 Human Evaluation
    3.2.2 Automatic Evaluation
  3.3 Text Rendering
  3.4 Photorealistic Portrait
  3.5 Comparison with GPT-4o
    3.5.1 Dense Text Rendering
    3.5.2 Image Editing
    3.5.3 Generation Quality
4 Conclusion
A Contributions and Acknowledgments
  A.1 Core Contributors
  A.2 Contributors

1 Introduction

Recent advances in diffusion models [3, 8, 10, 18, 21] have reshaped the landscape of image generation, propelling generative capabilities
to unprecedented heights. Recently, the introduction of Seedream 2.0 marked a significant milestone in bilingual text-to-image generation, demonstrating superior performance in capturing Chinese linguistic nuances and cultural semantics. However, our comprehensive evaluation identifies several critical
challenges that may impede its wide commercial application.

Alignment with complicated prompts: Prompt following can be further enhanced, especially in numerical precision and multi-object spatial relationships.

Fine-grained typographic generation: Seedream 2.0 is still limited in generating high-fidelity small-size text characters, multi-line contextual compositions, and intricate typographic details.

Suboptimal visual aesthetics and fidelity: Capturing nuanced aesthetic qualities, such as the beauty of cinematic scenes and the texture of portraits, remains challenging.

Limited image resolutions: Fundamental models restrict native output to small resolutions (e.g., 512×512 px), necessitating reliance on post-processing super-resolution pipelines.

Our methodology introduces four key technical improvements. First, at the data stratum, we approximately doubled the dataset size with improved quality by using a new dynamic sampling mechanism, which is built on two orthogonal axes: image cluster distribution and textual semantic coherence. Second, we incorporate a number of efficient training approaches in the pre-training stage, including i) mixed-resolution training, ii) a cross-modality RoPE, iii) a representation alignment loss, and iv) resolution-aware timestep sampling. This allows for better scalability and generalizability, resulting in better visual-language alignment. Third, in post-training, we utilize diverse aesthetic captions in SFT, and a VLM-based reward model to further enhance the model's overall performance. Finally, in model acceleration, we encourage stable sampling via consistent noise expectation, effectively reducing the number of function evaluations (NFE) during inference.

Compared to Seedream 2.0, Seedream 3.0 shows significant advances in multiple dimensions:

Comprehensive capability enhancement: Demonstrates strong user preference and significant advancements in key capabilities, including text-image alignment, compositional structure, aesthetic quality, and text rendering.

Enhanced text rendering performance: Achieves significantly enhanced text rendering, particularly excelling in generating small-size text characters in both Chinese and English, as well as high-aesthetic long-text layouts. Seedream 3.0 represents a pioneering solution for the challenges of small-text generation and aesthetically pleasing long-text composition, outperforming human-designed templates from platforms like Canva in graphic design output.

Aesthetic improvement: Substantial improvement in image aesthetic quality, delivering exceptional performance in cinematic scenarios and enhanced realism in portrait generation.

Native high-resolution output: Offers native support for 2K resolution output, eliminating the need for post-processing. It is also compatible with higher resolutions and adaptable to diverse aspect ratios.

Efficient inference cost: With several model acceleration techniques, Seedream 3.0 reduces its inference cost considerably and generates a 1K-resolution image in only 3.0 seconds (without PE), which is much faster than other commercial models.

Seedream 3.0 was integrated into multiple platforms in early April 2025, including Doubao and Jimeng. We fervently hope that Seedream 3.0 can become a practical tool to improve productivity in all aspects of work and daily life.

2 Technical Details

2.1 Data

In Seedream 2.0, we employed a stringent data filtering strategy that systematically excluded image data exhibiting minor artifacts, including watermarks, overlaid text, subtitles, and mosaic patterns. This strict filtering protocol significantly limited the amount of data used in training, especially considering that such affected samples constituted a substantial portion of the original dataset (approximately 35% of the total collection). To address this limitation, Seedream 3.0 introduces a defect-aware training paradigm. This paradigm includes a specialized defect detector trained on 15,000 manually annotated samples selected by an active learning engine. The detector locates defect areas through bounding box predictions. When the total area of the detected defects is less than 20% of the image space (a configurable threshold), we retain these previously excluded samples while applying mask latent space optimization. Specifically, during the diffusion loss calculation in the latent representation space, we employ a spatial attention mask mechanism to exclude feature gradients from the identified defect areas. This approach expands the effective training dataset by 21.7% while maintaining model stability.

To optimize the data distribution, we propose a dual-axis collaborative data sampling framework that jointly optimizes along the dimensions of visual morphology and semantic distribution. In the visual modality, we continue to use hierarchical clustering methods to ensure a balanced representation
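The mask latent space optimization just described can be sketched as follows; the latent-grid layout, bounding-box format, and all names are illustrative assumptions rather than the paper's implementation:

```python
# Sketch of the defect-aware masked loss (Section 2.1). Hypothetical
# layout: a latent grid of h x w positions and defect bounding boxes
# (x0, y0, x1, y1) with exclusive upper bounds.

DEFECT_AREA_THRESHOLD = 0.20  # the configurable 20% threshold from the text

def defect_mask(h, w, boxes):
    """Binary keep-mask: 1 contributes to the loss, 0 is a defect area."""
    mask = [[1.0] * w for _ in range(h)]
    for x0, y0, x1, y1 in boxes:
        for i in range(y0, min(y1, h)):
            for j in range(x0, min(x1, w)):
                mask[i][j] = 0.0
    defect_ratio = 1.0 - sum(map(sum, mask)) / (h * w)
    return mask, defect_ratio

def masked_mse(pred, target, mask):
    """Diffusion MSE over non-defect latent positions only."""
    num = den = 0.0
    for pr, tr, mr in zip(pred, target, mask):
        for p, t, m in zip(pr, tr, mr):
            num += m * (p - t) ** 2
            den += m
    return num / max(den, 1.0)

# A previously excluded sample is retained only while its defect area
# stays under the threshold.
mask, ratio = defect_mask(4, 4, [(0, 0, 2, 1)])  # 2 of 16 cells defective
keep_sample = ratio < DEFECT_AREA_THRESHOLD
```

With such a mask, gradients from the boxed regions simply never enter the loss, matching the spirit of excluding feature gradients from identified defect areas.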
of different visual patterns. On the textual semantic level, we achieve semantic balance through term frequency and inverse document frequency (TF-IDF [19]), effectively addressing the long-tail distribution problem of descriptive texts. To further enhance the coordination of the data ecosystem, we have developed a cross-modal retrieval system that establishes a joint embedding space for image-text pairs. This system achieves state-of-the-art performance across all benchmark tests. The retrieval-enhanced framework dynamically optimizes the dataset through the following methods: (1) injecting expert knowledge via targeted concept retrieval; (2) performing distribution calibration through similarity-weighted sampling; (3) utilizing retrieved neighboring pairs for cross-modal enhancement.

2.2 Model Pre-training

2.2.1 Model Architectures

Our core architecture design inherits from Seedream 2.0 [4], which adopts an MMDiT [3]
to process the image and text tokens and capture the relationship between the two modalities. We have increased the total number of parameters in our base model and introduced several improvements in Seedream 3.0, leading to enhanced scalability, generalizability, and visual-language alignment.

Mixed-resolution Training. Transformers [23] natively support variable-length token sequences as input, which has also proved effective in ViT-based visual recognition tasks [2]. In Seedream 3.0, we adopt mixed-resolution training by packing images of different aspect ratios and resolutions together at each training stage. Specifically, we first pre-train our model at an average resolution of 256² (with various aspect ratios) and then finetune it on higher-resolution images (from 512² to 2048²). We also adopt a size embedding as an additional condition to make the model aware of the target resolution. Mixed-resolution training significantly increases data diversity and improves the generalizability of our model on unseen resolutions.

Cross-modality RoPE. In Seedream 2.0, we introduced Scaling RoPE to enable our model to better generalize to untrained aspect ratios and resolutions. In Seedream 3.0, we extend this technique to a Cross-modality RoPE, which further enhances the alignment of visual and text tokens. We treat the text tokens as 2D tokens with the shape (1, L) and apply a 2D RoPE [22] to them. The column-wise position IDs of the text tokens are assigned consecutively after the corresponding image tokens. The Cross-modality RoPE
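As a sketch, the position-ID layout just described might look like the following; the exact indexing scheme is our assumption inferred from the text, not a published design:

```python
# Sketch of cross-modality RoPE position IDs: image tokens carry 2D
# (row, col) IDs over an H x W latent grid, and text tokens are treated
# as a 1 x L strip whose column IDs continue after the image columns.

def image_position_ids(h, w):
    """(row, col) IDs for an h x w grid of image tokens, row-major."""
    return [(r, c) for r in range(h) for c in range(w)]

def text_position_ids(length, image_w):
    """Text tokens as a 1 x L strip; columns follow the image columns."""
    return [(0, image_w + i) for i in range(length)]

img_ids = image_position_ids(2, 3)         # 6 image tokens
txt_ids = text_position_ids(4, image_w=3)  # -> (0,3), (0,4), (0,5), (0,6)
```

Because the text strip continues the image's column axis, relative offsets between any image token and any text token remain well defined under 2D RoPE.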
effectively models the intra-modality and cross-modality relationships, which are crucial for improving visual-text alignment and text rendering accuracy.

Figure 3 The comparison of the effects at different stages (Pretrain, CT, SFT, RLHF, PE). Example prompts (translated from Chinese): "A red-haired witch riding a broomstick, a black and white striped cat sitting on the broomstick, Japanese cartoon style"; "Shot on grainy film, a bright red dahlia covers half of the face of a black female model wearing pearl earrings"; "A poodle wearing a baseball cap holding a dictionary with the word bonez written on a blackboard".

2.2.2 Model Training Details

Training Objectives. In Seedream 3.0, we adopt the flow matching [12, 13] training objective, as well as a representation alignment loss (REPA [25]):

L = E_{(x0, C)∼D, t∼p(t;D), xt∼pt(xt|x0)} ‖ v(xt, t; C) − dxt/dt ‖₂² + λ L_REPA,   (1)

where we use the linear interpolant xt = (1 − t) x0 + t ε, ε ∼ N(0, I), following common practice [3, 13]. The representation alignment loss is computed as the cosine distance between an intermediate feature of our MMDiT and the feature of a pre-trained vision encoder, DINOv2-L [16], with the loss weight λ = 0.5. We find that introducing the representation alignment objective
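A minimal sketch of Equation (1) over flat Python lists; the feature vectors stand in for the MMDiT intermediate feature and the DINOv2-L feature, and every name here is hypothetical:

```python
# Flow-matching + REPA objective (Equation 1). For the linear interpolant
# x_t = (1 - t) * x0 + t * eps, the velocity target is dx_t/dt = eps - x0;
# the REPA term is a cosine distance with weight lambda = 0.5.
import math

LAMBDA_REPA = 0.5

def interpolate(x0, eps, t):
    return [(1 - t) * a + t * b for a, b in zip(x0, eps)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def objective(v_pred, x0, eps, feat_mmdit, feat_dino):
    target = [b - a for a, b in zip(x0, eps)]  # dx_t/dt for this interpolant
    flow_matching = sum((p - t) ** 2 for p, t in zip(v_pred, target))
    return flow_matching + LAMBDA_REPA * cosine_distance(feat_mmdit, feat_dino)

# A perfect velocity prediction with perfectly aligned features -> zero loss.
x0, eps = [1.0, 0.0], [0.0, 1.0]
zero = objective([-1.0, 1.0], x0, eps, [1.0, 0.0], [2.0, 0.0])
```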
can accelerate the convergence of large-scale text-to-image generation.

Resolution-aware Timestep Sampling. As shown in Equation (1), the timesteps are sampled from a distribution p(t; D) that is adaptive to the dataset D. Similar to [3], we design the distribution of timesteps by first sampling from the logit-normal distribution and then performing timestep shifting based on the training resolution. Generally speaking, when training on higher resolutions, we shift the distribution to increase the sampling probability at lower SNRs. During training, we compute the average resolution of the dataset D to determine the shifted timestep distribution. During inference, we compute the shift factor based on the desired resolution and aspect ratio.

Figure 4 Some examples of detailed captions that incorporate aesthetic terms. (Translated from Chinese:) (a) "Traditional Chinese painting style, flower-and-bird painting combining ink and color with delicate brushwork. Ink-wash diffusion effects, freehand (xieyi) technique. Ink black, pale purple, deep green. Traditional Chinese composition, flowing lines combined with loose brushstrokes, light and natural. The picture depicts grapes and vines: plump, translucent grapes, naturally curving branches, lush leaves; branches and grapes occupy the center, with vertical calligraphy in the upper right corner. Natural, tranquil, traditional atmosphere." (b) "Cartoon-style marketing poster titled 'Summer Fun Season'. Two cartoon characters sit on chairs by a lake against blue sky, white clouds, and the lake; on the right is a tent decorated with string lights, beside drinks, snacks, and shopping bags, creating a relaxed summer mood. Mainly blue and green with yellow accents, bright and fresh overall. The title sits top-center in white handwritten script with yellow line decorations; above it is an English title in yellow handwriting, below it the promotional slogan in yellow on white; the organizer's logo is at the top center, and event information, a slogan, and an e-commerce logo are at the bottom." (c) "Documentary photography style, eye-level view: a person in a gray coat and face mask holds up a cardboard sign reading '400 YEARS', its edges marked with red graffiti; the background shows a blurred protest crowd holding 'BLACK LIVES MATTER' signs in front of a modern glass-and-metal building. Medium shot, shallow depth of field, blurred background, centered composition, natural light, strong social atmosphere."

2.3 Model Post-training

Similar to Seedream 2.0 [4], our post-training process consists of the following stages: Continuing Training (CT), Supervised Fine-Tuning (SFT), Human Feedback Alignment (RLHF), and Prompt Engineering (PE). We omitted the Refiner stage because our model is capable of directly generating images at any resolution within the range from 512² to 2048². The comparison of the effects at the different stages is shown in Figure 3.

2.3.1 Aesthetic Caption

We
have specifically trained multiple versions of the caption models for the data in the CT and SFT stages. As shown in Figure 4, these caption models provide accurate descriptions in professional domains such as aesthetics, style, and layout. This ensures that the model can respond more effectively to relevant prompts, thereby improving the model's controllability and its performance after prompt engineering.

2.3.2 Model Training Details

To ensure that the model achieves favorable performance across different resolutions, we apply a resolution-balancing strategy to the data during the training process. This approach guarantees adequate sampling of training data at different resolutions, thereby enhancing the model's ability to follow prompts in various scenarios.

2.3.3 Reward Model Scaling

Different from our previous Seedream 2.0, which employed CLIP as the reward model, we now utilize Vision-Language Models (VLMs) as the reward modeling framework. This change leverages VLMs' superior foundational capabilities and reward-scaling potential. Inspired by generative reward modeling (RM) techniques in large language models (LLMs), we explicitly formulate instructions as queries and derive rewards from the normalized
probability of the "Yes" response token. This approach effectively harnesses the knowledge embedded in pretrained LLMs while naturally benefiting from LLM scaling effects to enhance reward quality. We systematically scale the reward model from 1B to 20B parameters. Empirical results reveal the emergence
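This "Yes"-token reward can be sketched as follows; the tiny logit dictionaries stand in for a real VLM's output distribution and are purely illustrative:

```python
# Generative reward modeling sketch: phrase the instruction as a yes/no
# query and use the normalized probability of the "Yes" token as reward.
import math

def yes_probability(vocab_logits):
    """Softmax probability of 'Yes' over a full (toy) vocabulary."""
    z = max(vocab_logits.values())
    exp = {tok: math.exp(v - z) for tok, v in vocab_logits.items()}
    return exp["Yes"] / sum(exp.values())

def reward(logits):
    """One common variant: normalize over the Yes/No pair only."""
    p_yes, p_no = math.exp(logits["Yes"]), math.exp(logits["No"])
    return p_yes / (p_yes + p_no)

r = reward({"Yes": 2.0, "No": 0.0})  # equals sigmoid(2.0)
```

Normalizing over the Yes/No pair makes the reward a smooth score in (0, 1), convenient for RLHF-style preference optimization.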
of reward model scaling, indicating that increased reward model capacity correlates with improved reward modeling performance.

2.4 Model Acceleration

Our acceleration framework builds upon Hyper-SD [17] and RayFlow [20]. We rethink the diffusion process by enabling each sample to follow its own adaptive generative trajectory, rather than forcing all samples through a shared path that converges to a standard Gaussian prior. In conventional diffusion models, all samples are progressively transformed into isotropic Gaussian noise, resulting in overlapping trajectories in probability space. This overlap increases randomness, reduces controllability, and introduces instability during the reverse process. Instead, we guide each data point toward an instance-specific target distribution, enabling trajectory customization per sample. This significantly reduces path collisions and improves both generation stability and sample diversity.

Consistent Noise Expectation for Stable Sampling. To ensure smooth and consistent transitions during sampling, we introduce a unified noise expectation vector, estimated from a pretrained model. This expectation serves as a global reference for all timesteps, aligning the denoising process across time. By maintaining consistent expectations, we make it possible to compress the number of sampling steps without degrading image quality. Theoretical analysis further shows that our design maximizes the probability of the forward-backward path from data to noise and back, which leads to improved sampling stability and more reliable reconstructions.

Learning to Sample Important Timesteps. In addition to redesigning the generative path, we focus on improving training efficiency. Standard training procedures for diffusion models sample timesteps uniformly, which introduces high variance in the loss and wastes computation on uninformative steps. To address this, we introduce an importance sampling mechanism that learns to focus on the most critical timesteps during training. We achieve this by combining Stochastic Stein Discrepancy (SSD) [6] with a neural network that learns a data-dependent distribution over timesteps. This network predicts which time indices contribute most to reducing the training loss, allowing us to prioritize them during optimization. The result is faster convergence and more efficient use of training resources.

Our framework supports efficient few-step sampling without compromising
generation quality. It follows an iterative denoising schedule with far fewer steps than unaccelerated baselines. Despite this reduction, our method achieves results that match or surpass baselines requiring 50 function evaluations (the Number of Function Evaluations, NFE) across key aspects including aesthetic quality, text-image alignment, and structural fidelity. These results demonstrate the effectiveness of our trajectory design and noise consistency mechanisms in enabling high-quality synthesis with minimal computational cost. For other acceleration methods, such as quantization, we directly follow the solution of Seedream 2.0.

3 Model Performance

In a publicly conducted evaluation, Seedream 3.0 ranks first among top-tier text-to-image models globally, such as GPT-4o [15], Imagen 3 [5], Midjourney v6.1 [14], FLUX1.1 Pro [11], Ideogram 3.0 [9], and others. We further conduct rigorous expert evaluations to assess Seedream 3.0, both manually and through automated means. The results demonstrate marked improvements in Seedream 3.0 across all key performance indicators compared to the previous version, alongside superior performance against industry-leading counterparts. Notably, Seedream 3.0 achieves exceptional capabilities in two aspects: dense text rendering and photorealistic human portrait generation. See Sections 3.3 and 3.4 for detailed explanations of these two aspects, respectively. In addition, we provide a systematic comparative analysis with GPT-4o [15] in Section 3.5, exploring the capability boundaries of the two models in different fields. The overall results are presented in Figure 1.

3.1 Artificial Analysis Arena

Artificial Analysis [1] is a leading benchmarking platform for AI models, specifically focused on image and video generation. It offers dynamic leaderboards that evaluate models based on key metrics such as output quality, generation speed, and cost, providing an objective comparison of state-of-the-art AI systems. The Text-to-Image leaderboard allows users to anonymously compare the generated images from different models. This ensures fairness, as users vote on images generated using identical prompts without knowing which models produced them. Models are ranked using an ELO scoring system, which reflects user preferences to some extent.

Seedream 3.0 participated in the Artificial Analysis ranking and secured the top position overall, outperforming GPT-4o and establishing a substantial lead over other models, including Recraft V3, HiDream, Reve Image, Imagen 3 (v002), FLUX1.1 Pro, and Midjourney v6.1. Additionally, it demonstrates the best performance across most sub-dimensions, including Style categories such as General & Photorealistic, Anime, Cartoon & Illustration, and Traditional Art, as well as Subject categories such as People: Portraits, People: Groups & Activities, Fantasy, Futuristic, and Physical Spaces.

Figure 5 Results from Artificial Analysis Arena.

3.2 Comprehensive Evaluation

3.2.1 Human Evaluation

A larger evaluation benchmark is established to conduct a more comprehensive evaluation of Seedream 3.0 in
different scenarios. This benchmark, named Bench-377, is made up of 377 prompts. In addition to examining basic dimensions such as text-to-image alignment, structure plausibility, and aesthetic sense, the design of the prompts also takes usage scenarios into account. We consider five main scenarios: cinematic, arts, entertainment, aesthetic design, and practical design. We propose the practical design category because Seedream 3.0 has proved helpful in assisting routine work and study; for example, it can provide support in tasks such as icon arrangement in slides and illustration design in handwritten newsletters.

A systematic evaluation of text-to-image models by human experts was performed based on Bench-377. The evaluation is carried out using three basic criteria: text-image alignment, structural correctness, and aesthetic quality. The specific results for the five usage scenarios are presented in Figure 6. Seedream 3.0 significantly outperforms Seedream 2.0 and competing models on text-image alignment and structural fidelity. Notably, it achieves an overall score higher than that of Midjourney in terms of aesthetic performance. Moreover, it is notably superior to Midjourney in the design category, though it lags slightly behind in categories such as art. While Imagen 3 also demonstrates competent performance in text-image alignment and structure, it underperforms in aesthetic evaluation. Midjourney exhibits superior aesthetic capabilities but shows limited proficiency in functional alignment and structural fidelity.

Figure 6 Human evaluation results. (Charts over Overall, Cinematic, Arts, Entertainment, Aes-Design, and Prac-Design for Alignment, Structure, and Aesthetics; models: Seedream 3.0, Seedream 2.0, Imagen 3, Ideogram 3.0, FLUX1.1 Pro, Midjourney v6.1.)

Table 1 Preference evaluation with different metrics.

Metric         | FLUX1.1 | Ideogram 2.0 | MJ v6.1 | Imagen 3 | Seedream 2.0 | Seedream 3.0
EvalMuse       | 0.617   | 0.632        | 0.583   | 0.680    | 0.684        | 0.694
HPSv2          | 0.2946  | 0.2932       | 0.2850  | 0.2951   | 0.2994       | 0.3011
MPS            | 13.11   | 13.01        | 13.67   | 13.33    | 13.61        | 13.93
Internal-Align | 27.75   | 27.92        | 28.93   | 28.75    | 29.05        | 30.16
Internal-Aes   | 25.15   | 26.40        | 27.07   | 26.72    | 26.97        | 27.68

Figures 7, 8, 9, and 10 illustrate how enhanced fundamental capabilities facilitate the generation of diverse scenarios. Improved text-to-image alignment enables more precise representation of user intentions. For example, the lively depiction of micro-expressions improves the portrayal of a movie's atmosphere. Precise understanding and expression of complex descriptions and specialized terms, such as "three-view", effectively fulfill users' design requirements. These capabilities are fundamentally supported by enhanced structural stability and aesthetic quality. For example, the integrity of limbs in dynamic motions and the detailed presentation of small objects, as well as improved capabilities in color, lighting, texture, and composition, are all instrumental to the high availability of Seedream 3.0.

3.2.2 Automatic Evaluation

In accordance with the automatic evaluation of the previous version, we assess the text-to-image generation model based on two criteria: text-image alignment and image quality. Seedream 3.0 consistently ranks first across all benchmarks.

For automatic evaluation of text-to-image alignment, we mainly focus on EvalMuse [7], which exhibits relatively good consistency with human evaluations
across multiple benchmarks. Seedream 3.0 outperforms the other models, as shown in Table 1. Further analysis along fine-grained dimensions shows that, compared to Seedream 2.0, Seedream 3.0 improves in most dimensions, especially in terms of objects, activities, locations, food, and space. To align with previously reported results, Ideogram 2.0 is included in the assessment here and in subsequent chapters.

For image quality evaluation, we reuse two external metrics, HPSv2 [24] and MPS [26], and two internal evaluation models, Internal-Align and Internal-Aes. Seedream 3.0 ranks first on all metrics, as shown in Table 1. In the aesthetic evaluation, which includes MPS and our in-house aesthetic evaluation models, Seedream 3.0 outperforms Midjourney, which Seedream 2.0 did not in previous assessments. At the same time, on the HPSv2 index, Seedream 3.0 exceeds 0.3 for the first time, indicating that our model has excellent consistency with human preferences.

Figure 7 Alignment Comparison (Seedream 3.0, Seedream 2.0, Midjourney v6.1, Ideogram 3.0, FLUX-1.1 Pro). Prompt: Two boys are in the haunted house. The boy in the front looks frightened, while the boy behind appears calm.

Figure 8 Structure Comparison (Seedream 3.0, Seedream 2.0, Midjourney v6.1, Ideogram 3.0, FLUX-1.1 Pro). Prompt: Two 14-year-old boys, dressed in Y2K style, perform a one-handed ground move on stage as part of a breakdancing routine. Warning: these images may cause discomfort.

Figure 9 Aesthetic Comparison (Seedream 3.0, Seedream 2.0, Midjourney v6.1, Imagen 3, FLUX-1.1 Pro). Prompt: A girl, one eye is purple, and the hair on that side is blue. The other eye is blue, and the hair on that side is purple. realistic.

Figure 10 Design Comparison (Seedream 3.0, Seedream 2.0, Midjourney v6.1, Ideogram 3.0, Imagen 3). Top Prompt: Sticker Series Design: Sticker 1: A monkey is grinning with the text "Happy" below. Sticker 2: The monkey wears sunglasses with the text "Cool" below. Sticker 3: The monkey is holding a flower with a shy expression, with the text "Shy" below. Sticker 4: The monkey looks surprised, with the text "Surprise" below. Bottom Prompt: Chibi character, girl, full body, street dance, three-view drawing.

3.3 Text Rendering

Seedream 2.0's text rendering, particularly for Chinese characters, has garnered widespread acclaim from users. In Seedream 3.0, we have further optimized this capability and conducted thorough evaluations.

Figure 11 Text Rendering Evaluation. (Bar charts of accuracy rate, hit rate, and availability rate; Chinese text rendering compares Seedream 3.0, Seedream 2.0, MiracleVision 5.0, and Kolors 1.5; English text rendering compares Seedream 3.0, Seedream 2.0, Recraft V3, Ideogram 2.0, Imagen 3, FLUX1.1 Pro, and Midjourney v6.1.)

Figure 12 Text Rendering comparisons (Seedream 3.0, Seedream 2.0, Midjourney v6.1, Ideogram 2.0, FLUX-1.1 Pro, Recraft V3, GPT-4o). Prompt: A captivating and vibrant image, 3D render, featuring seven colorful, ornate felt mugs, each adorned with a heart and displaying bold text representing the days of the week: "lunes", "martes", "miércoles", "jueves", "viernes", "sábado", "domingo". These lively mugs are filled with whimsical felt smoke, and they elegantly float in a dreamy, enchanting atmosphere. The diverse array of floating flowers adds depth and dimension to the scene, while the soft baby blue background harmoniously complements the design. fashion, illustration, typography, 3d render, painting.

Our text evaluation benchmark comprises 180 Chinese prompts and 180 English prompts, covering a diverse range of categories, including logo designs, posters, electronic displays, printed text, and handwritten text. One perception-based metric, availability rate, and two statistics-based metrics, text accuracy rate and hit rate, are employed to evaluate text rendering capability. The availability rate refers to the proportion of images deemed acceptable when text rendering is generally correct, taking into account the integration of text with other content and
the overall aesthetic quality.

Figure 13 Text Rendering by Seedream 3.0.

The objective metrics are defined as follows. Text accuracy rate is defined as:

Ra = (1 − Ne / N) × 100%
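Both objective metrics can be sketched in code; the Levenshtein routine and the position-wise count for Nc are our illustrative assumptions (the report does not specify the matching procedure):

```python
# Text accuracy rate Ra = (1 - Ne/N) * 100% and hit rate Rh = Nc/N * 100%.
def edit_distance(a, b):
    """Minimum edit (Levenshtein) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def accuracy_rate(rendered, target):
    """Ra, clipped at 0 when the edit distance exceeds the target length."""
    return max(0.0, 1.0 - edit_distance(rendered, target) / len(target)) * 100

def hit_rate(rendered, target):
    """Rh with Nc counted position-wise (a simplification)."""
    nc = sum(r == t for r, t in zip(rendered, target))
    return nc / len(target) * 100

ra = accuracy_rate("Seedraem", "Seedream")  # two substitutions -> Ne = 2
rh = hit_rate("Seedraem", "Seedream")
```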
94、xt.Text hit rateis defined as:Rh=NcN 100%where Ncrepresents the number of characters correctly rendered in the output.Figure 11 demonstrates that Seedream 3.0 achieves superior text rendering performance compared to existingmodels,including its predecessor(Seedream 2.0).The system achieves a 94%text
95、 availability rate forboth Chinese and English characters,effectively eliminating text rendering as a limiting factor in imagegeneration.Notably,Chinese text availability shows an improvement of 16%over Seedream 2.0.The nearlyequivalent values of availability and hit rates further indicate minimal o
96、ccurrence of layout or medium-relatedrendering errors.These results validate the effectiveness of our native text rendering approach compared topost-processing composition methods and external plugin solutions.In addition to the overall improvement in availability rate,it is crucial to highlight the
exceptional performance of Seedream 3.0 in rendering dense text. Dense text, characterized by long passages with a high density of small characters, such as greetings with numerous words, has posed a challenge for previous models. In contrast, Seedream 3.0 shows significant advancements in handling such fine characters. As illustrated in Figures 12 and 13, Seedream 3.0 excels in both the precision of small-character generation and the naturalness of text layout. For comparison, GPT-4o, another model known for its dense text rendering capabilities, will be evaluated in the following sections.

3.4 Photorealistic Portrait

The overly synthetic appearance of AI-generated images, especially in portraits, has long been a criticism of text-to-image models. Issues like overly smooth skin and an oily texture make the generated images appear artificial. To comprehensively assess Seedream 3.0's performance in this area, we construct a portrait evaluation set comprising 100 prompts. These prompts focus on various aspects of portrait generation, including expressions, postures, angles, hair features, skin texture, clothing, and accessories. The evaluation follows an Elo battle approach, where participants are asked to select their preferred portraits generated by different models and justify their choice. The evaluation criteria focus on two primary dimensions: realism and emotion. Competitors include Seedream 3.0, Seedream 2.0, Midjourney v6.1, FLUX-Pro 1.1, and the recently updated Ideogram 3.0, known for its photorealistic generation.

Figure 14 Photorealistic Portrait Evaluation. (Elo ratings: Midjourney v6.1 1198±19.7, Seedream 3.0 1193±11.1, Ideogram 3.0 1086±13.0, Seedream 2.0 1000±0.0, FLUX1.1 Pro 936±15.8.)

To ensure a fair comparison, multiple rounds of image generation are performed for Midjourney v6.1 to ensure a
generation are performed for Midjourney v6.1 to ensure a realistic result, avoiding those that are overly artistic or abstract.

After a public evaluation involving over 50,000 battle rounds, we obtain the results shown in Figure 14. Note that some model variants are not displayed. Seedream 3.0 and Midjourney v6.1 both rank first, significantly outperforming the other models. Examples in Figure 15 demonstrate that Seedream 3.0 effectively eliminates the artificial appearance. In portrait generation, the skin textures now exhibit realistic features such as wrinkles, fine facial hair, and scars, closely resembling natural human skin. Meanwhile, Seedream 3.0 can still generate flawless skin textures when prompted. Additionally, while the texture of portraits generated by Midjourney v6.1 appears slightly inferior to that of Seedream 3.0, it excels at conveying emotional expressions, which contributes to its high ranking. Future versions will aim to further enhance both aspects.

Figure 15 Realistic Portrait comparisons. Models shown: Seedream 3.0, Seedream 2.0, Midjourney v6.1, Ideogram 3.0, FLUX-1.1 Pro.

Figure 16 Human Portraits from Seedream 3.0 with higher resolution. High resolution provides enhanced texture and clarity.

We especially highlight that Seedream 3.0 can directly generate images at higher resolutions, such as 2048×2048, further enhancing portrait texture. Some examples from Seedream 3.0 can be found in Figure 16. The quality of generated portraits shows promising progress toward professional photography standards, bringing significant new possibilities for applications.

3.5 Comparison with GPT-4o

Recently, GPT-4o has introduced an impressive image generation function, which features exceptionally powerful multi-modal capabilities. Due to the absence of an API for large-scale image generation, a systematic evaluation has not yet been conducted. Nevertheless, a comparative analysis of selected cases reveals that GPT-4o and Seedream 3.0 each exhibit distinct strengths and weaknesses across different scenarios.

3.5.1 Dense Text Rendering

GPT-4o [15] presents impressive text rendering capabilities, as evidenced by multiple examples. We generate comparable cases for comparison, as shown in Figure 17. GPT-4o excels in the accuracy of rendering small English characters and certain LaTeX symbols. However, it exhibits notable limitations in rendering Chinese fonts. In contrast, Seedream 3.0 handles dense Chinese text generation with ease and outperforms GPT-4o in terms of typesetting and aesthetic composition.

3.5.2 Image Editing

Image editing tasks bridge generation with real-world images, attracting significant attention for practical usage. GPT-4o can perform editing operations on given images based on prompt descriptions. SeedEdit, derived from Seedream, also supports such capabilities. Additionally, Gemini-2.0 has recently demonstrated strong multi-modal image generation, particularly in interleaved generation and multi-round editing. This study focuses on comparing the single-round image generation capabilities of these models, as shown in Figure 18. We demonstrate that SeedEdit exhibits better ID preservation and prompt-following abilities.

Figure 17 Comparisons of Text Rendering. Top: Seedream 3.0; bottom: GPT-4o. Zoom in for a better view.

Figure 18 Comparisons of Image Editing. From left to right: the original image, SeedEdit 1.6, GPT-4o, and Gemini-2.0. Top prompt: 換個藍紫色短發 ("change to short blue-purple hair"). Bottom prompt: 變成彩色圖片 ("turn it into a color image").

These three
models exhibit distinct characteristics. GPT-4o excels at fulfilling a wide range of editing requirements but tends to struggle with preserving the original image, particularly regarding IP and ID consistency. Gemini-2.0 maintains the original image at the pixel level, but often produces issues with color naturalness and image quality. SeedEdit 1.6 provides balanced performance, effectively addressing typical editing needs while maintaining a relatively high availability rate. However, it still faces limitations when handling more complex tasks, such as multi-image reference and multi-round editing. These areas will be improved in future versions.

Figure 19 Comparisons of Text Editing. From left to right: the original image, SeedEdit, and GPT-4o. Top prompt: 不要文字 ("remove the text"). Middle prompt: 小熊的身前擺了一個小木牌,上面雕刻著Merry Christmas ("a small wooden sign stands in front of the bear, engraved with 'Merry Christmas'"). Bottom prompt: 把字改成彩色毛絨材質 ("change the text to a colorful plush material").

We primarily compare the performance of SeedEdit and GPT-4o on text-related editing tasks. Text editing is inherently challenging, as it requires not only text rendering but also the recognition and understanding of characters within images. The ability to handle text editing tasks marks a significant advancement in controllable image generation, particularly for real images. Figure 19 illustrates examples of tasks such as text writing, removal, and modification. SeedEdit inherits the text-related capabilities of Seedream 3.0, delivering satisfying results. It can detect text in images accurately, allowing for precise deletion or modification. Additionally, when adding text, SeedEdit considers the layout and integrates the text seamlessly into the original image. In contrast, while GPT-4o can fulfill text editing requirements, it fails to preserve the original image, limiting its practical use.

3.5.3 Generation Quality

Generation quality, including color, texture, clarity, and aesthetic appeal,
is a critical factor in assessing text-to-image models. Seedream models have consistently demonstrated strong performance in these areas, while GPT-4o shows some shortcomings. As shown in Figure 20, images generated by GPT-4o tend to have a dark yellowish hue and exhibit significant noise, which notably impacts the usability of the generated images in various scenarios.

Figure 20 Image Quality Comparisons. Left: Seedream 3.0; right: GPT-4o.

4 Conclusion

In this paper, we have introduced Seedream 3.0, which employs several innovative strategies to address existing challenges, including limited image resolutions, complex attribute adherence, fine-grained typography generation, and suboptimal visual aesthetics and fidelity. Through system-level upgrades in data construction, model pre-training, post-training, and model acceleration, Seedream 3.0 achieves comprehensive improvements in multiple aspects over our previous version. Seedream 3.0 provides native high-resolution output, comprehensive capability, superior text rendering quality, enhanced visual appeal, and extremely fast generation. With its integration into platforms such as Doubao and Jimeng, Seedream 3.0 exhibits strong potential to become a powerful productivity tool across various work and daily life scenarios.

References

[1] artificialanalysis.ai. Artificial Analysis. https://artificialanalysis.ai/text-to-image/arena?tab=Leaderboard, 2025.

[2] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan
Puigcerver, Robert Geirhos, Ibrahim M. Alabdulmohsin, et al. Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36:2252–2274, 2023.

[3] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

[4] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native Chinese-English bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025.

[5] Google. Imagen 3. https://labs.google/fx/tools/image-fx, 2025.

[6] Jackson Gorham, Anant Raj, and Lester Mackey. Stochastic Stein discrepancies. Advances in Neural Information Processing Systems, 33:17931–17942, 2020.

[7] Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, and Chongyi Li. EvalMuse-40K: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation, 2024. URL https://arxiv.org/abs/2412.18150.

[8] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.

[9] Ideogram. Ideogram. https://about.ideogram.ai/2.0, 2024.

[10] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. NeurIPS, 35:26565–26577
, 2022.

[11] Black Forest Labs. FLUX. https://

[12] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[13] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.

[14] Midjourney. Midjourney v6.1. https://

[15] OpenAI. GPT-4o. https://

[16] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[17] Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2025.

[18] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

[19] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

[20] Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, and Xuefeng Xiao. RayFlow: Instance-aware diffusion acceleration via adaptive flow trajectories. arXiv preprint arXiv:2503.07699, 2025.

[21] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek
Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.

[22] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[24] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.

[25] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

[26] Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024.

Appendix

A Contributions and Acknowledgments

All contributors of Seedream are listed in alphabetical order by their last names.

A.1 Core Contributors

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Weilin Huang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Jianchao Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao.

A.2 Contributors

Haoshen Chen, Kaixi Chen, Xiaojing Dong, Jing Fang, Yongde Ge, Meng Guo, Shucheng Guo, Bibo He, Lurui Jin, Bo Li, Hao Li, Huixia Li, Jiashi Li, Ying Li, Yiying Li, Yameng Li, Heng Lin, Feng Ling, Shu Liu, Zuxi Liu, Yanzuo Lu, Wei Lu, Tongtong Ou, Keer Qin, Yinuo Wang, Yonghui Wu, Yao Yao, Fengxuan Zhao, Wenliang Zhao, Wenjia Zhu.