
3-1 A Generative Image-to-Text Model and Its Applications in the Multimodal Domain.pdf

Uploaded by: 云閑 | ID: 102334 | 2021-01-01 | PDF | 19 pages | 2.73 MB

Part of the collection: DataFunSummit: 2022 NLP Summit guest speaker slide decks


GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Principal Researcher
Microsoft Cloud & AI

  - A cartoon illustration of a pikachu and a mouse talking to each other.
  - a cell phone screen shows the time of 1:44 wednesday, november 4.
  - A rollpack sign that says $14.88 on it
  - A white background with the numbers 49a785 and c392z6.
  - A text that says marlow processes are stochastic processes, traditionally in discrete or continuous time, that have the marlow property.
  - A gold and brown sign that says university of colorado 1876.

GIT: A Generative Image-to-text Transformer

  [Figure: an image encoder feeds a text decoder (tokenize & embed, multi-head self-attention, feed forward). Panels: (a) Pre-training/captioning, with the example caption "a tecsun radio with the time of 12:54."; (b) VQA, with Q: "what time is it?" and A: "12:54"; (c) Video, where Frame 1 through Frame 6 each pass through the image encoder and receive temporal embedding 1 through temporal embedding 6.]

  Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang; GIT: A Generative Image-to-text Transformer for Vision and Language; arxiv

  - one image encoder (Florence/CoSwin) + one text decoder
  - pretrain on 0.8 billion image-text pairs

Relation with existing approaches

  - Novel object captioning (nocaps)
      - Existing approaches: tags as extra input; object detector/classifier/CLIP
      - Ours: no such dependency
  - Scene-text related tasks
      - Existing approaches: OCR text as extra input
      - Ours: no such dependency

vs Flamingo/Coca

  - GIT (ours): smaller model size/fewer data, better performance

  Model                 Size   Data             COCO    nocaps  TextVQA  VizWiz-QA  VATEX  YouCook2
  Flamingo (Deepmind)   80B    2.3B+27M/video   138.1   -       54.1     65.4       84.2   118.6
  Coca (Google)         2.1B   4.8B             143.6   120.6   -        -          -      -
  GIT (ours)            0.7B   0.8B             144.8   123.4   59.8     67.5       93.8   129.8

Data & model scaling

  - Data: 4M, 14M, 800M
  - Model: CLIP/ViT-B, CLIP/ViT-L, Florence/CoSwin

Results on image/video captioning/QA

  - New SOTA on 12 image/video captioning and question answering tasks
  - *: evaluated on servers
  - Prior SOTA: reported in technical reports or publications.

  [Chart labels: Coca; Flamingo; Flamingo: 84.2; Human performance: 125.5]

Results on image classification

  - Method
      - Traditional approaches: vocabulary pre-defined
      - Our proposal: label name as caption; vocabulary-free
  - Any out-of-vocabulary prediction? 0.026%

Results on scene text recognition

  - Method
      - GIT-TextCaps: same TextCaps model; if contains keyword
      - GIT-MJST: scene text as caption; exact match

  [Example label: "motion" from MJ/ST]

GIT predicted captions

  - Recognizing and describing scene texts
  - Diverse entities and concepts

Open-ended VQA

  - Scene text understanding
  - Question answering
  - Open-ended vocabulary

Conclusion

  - GIT, a generative image-to-text transformer model
  - 0.7B parameters; 0.8B image-text pairs
  - New SOTA on 12 vision-language tasks
  - Surpass human performance on TextCaps for the first time
  - New scheme of classification as generation
  - Vocabulary-free in both training & inference
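
The slides name three checks for generated text without spelling them out: "exact match" and "if contains keyword" for scene text recognition, and an out-of-vocabulary rate for classification as generation. The following is a minimal sketch of how such checks might be computed. The function names, the lowercase alphanumeric tokenization, and the sample strings are assumptions made for illustration and are not taken from the talk.

```python
import re


def words(text):
    """Split text into lowercase alphanumeric tokens (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction, target):
    """True when the generated text equals the target after normalization."""
    return words(prediction) == words(target)


def contains_keyword(caption, keyword):
    """True when the keyword appears in the caption as a whole-word sequence."""
    cap, key = words(caption), words(keyword)
    if not key:
        return False
    return any(cap[i:i + len(key)] == key for i in range(len(cap) - len(key) + 1))


def out_of_vocabulary_rate(predictions, label_names):
    """Fraction of generated labels that match no known label name."""
    if not predictions:
        return 0.0
    vocab = {tuple(words(name)) for name in label_names}
    misses = sum(tuple(words(p)) not in vocab for p in predictions)
    return misses / len(predictions)


if __name__ == "__main__":
    # Hypothetical outputs, for illustration only.
    print(exact_match("Motion", "motion"))
    print(contains_keyword("a sign that says motion.", "motion"))
    print(out_of_vocabulary_rate(
        ["golden retriever", "tabby cat", "gold retriever"],
        ["golden retriever", "tabby cat"],
    ))
```

Matching on whole tokens rather than substrings keeps a caption containing "emotion" from counting as a hit for the keyword "motion"; whether the evaluation in the talk made the same choice is not stated.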


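This talk introduces GIT, a generative image-to-text transformer built on multi-head self-attention and feed-forward layers. The model sets a new state of the art on 12 image/video captioning and question answering tasks and also performs well on scene text recognition. Compared with Flamingo and Coca, GIT uses a smaller model and less training data yet performs better on several tasks. Its predicted captions cover diverse entities and concepts, which enables open-vocabulary visual question answering.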
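
To make the "smaller model, less data" comparison concrete, the sketch below works out the ratios from the figures on the comparison slide (model size and image-text pairs in billions, plus the COCO score as listed there). The dictionary layout and the function name `compare_to_git` are illustrative assumptions. Flamingo's data figure is left out because the slide gives it as "2.3B+27M/video", which does not reduce to a single number.

```python
# Figures copied from the comparison slide; sizes and pair counts are in billions.
REPORTED = {
    "Flamingo": {"params_b": 80.0, "pairs_b": None, "coco": 138.1},
    "Coca": {"params_b": 2.1, "pairs_b": 4.8, "coco": 143.6},
    "GIT": {"params_b": 0.7, "pairs_b": 0.8, "coco": 144.8},
}


def compare_to_git(name):
    """Return how much larger the named model is than GIT and GIT's COCO margin."""
    other, git = REPORTED[name], REPORTED["GIT"]
    result = {
        "size_ratio": round(other["params_b"] / git["params_b"], 1),
        "coco_gain": round(git["coco"] - other["coco"], 1),
    }
    # Only compare data volume when the slide gives a single comparable number.
    if other["pairs_b"] is not None:
        result["data_ratio"] = round(other["pairs_b"] / git["pairs_b"], 1)
    return result


if __name__ == "__main__":
    for model in ("Flamingo", "Coca"):
        print(model, compare_to_git(model))
```

On these figures Coca is 3 times larger and trained on 6 times as many pairs, while GIT's listed COCO score is 1.2 points higher.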
