Contents: 01 Background, 02 Unstructured Feedback, 03 Structured Feedback, 04 Future Work

01 Background

Background: The fashion domain carries huge economic value.

Background: Numerous online clothing data are available on the Internet. Precise image retrieval that meets the user's search intent is a key challenge.

Background: Conventional paradigms for item search take either text or image as the input query to search for items of interest. Text query: "a blue overcoat with a lapel collar and a belt around the waist". Image query + unstructured feedback: "I want the dress to be black and more professional."

Background: Flexible image retrieval allows users to use a reference image plus modification feedback (structured feedback) to search for items.

Background: Application: dialog-based fashion search / conversational fashion search. At the beginning, the recommended fashion product image may not be the desired one. Based on this reference image, the user typically refines the retrieval by providing feedback that describes the relative difference between the currently retrieved reference image and his/her desired one.

02 Structured Feedback

Task: A query image can be described by its associated attributes, x = {a_1, ..., a_{i1}, ..., a_{i2}, ..., a_N}; the target image can be described by y = {a_1, ..., a'_{i1}, ..., a'_{i2}, ..., a_N}, where a_{i1} and a_{i2} are the to-be-manipulated attributes (attribute manipulation).

Related Work:
Feature Fusion-based: Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search, CVPR 2017.
Feature Substitution-based: Efficient Multi-Attribute Similarity Learning Towards Attribute-based Fashion Search, WACV 2018; Automatic Spatially-aware Fashion Concept Discovery, ICCV 2017; Learning Attribute Representations with Localization for Flexible Fashion Search, CVPR 2018.

Fusion-based Method: Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search, CVPR 2017.
Pipeline: Attribute Representation Learning, Image Representation Learning, Representation Fusion, Fashion Search. Fusion-based methods learn the latent representation of the target item by directly fusing the visual features of the query image with the semantic features of the wanted attribute(s). Notation: the original image representation, the prototype attribute representation, a binary indicator vector marking which attribute values to add or remove, a memory matrix of attribute prototypes, and the resulting manipulated representation (attribute manipulation).
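The fusion step can be sketched as follows; the dimensions, the variable names, and the +1/-1 indicator convention are illustrative assumptions, not the exact formulation of the CVPR 2017 memory network.

```python
import numpy as np

# Minimal sketch of fusion-based attribute manipulation (hypothetical shapes).
# The memory matrix M stores one prototype embedding per attribute value;
# an indicator vector t selects which prototypes to remove (-1) or add (+1).

D, V = 128, 10                     # feature dim, number of attribute values
rng = np.random.default_rng(0)
M = rng.normal(size=(V, D))        # memory matrix of prototype attribute representations
x = rng.normal(size=D)             # original image representation

t = np.zeros(V)
t[3] = -1.0                        # remove the unwanted attribute value
t[7] = +1.0                        # add the desired attribute value

x_manip = x + t @ M                # manipulated representation via fusion

print(x_manip.shape)               # (128,)
```

The manipulated representation can then be matched against gallery features for fashion search.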
Substitution-based Method: Learning Attribute Representations with Localization for Flexible Fashion Search, CVPR 2018. (Class activation maps: Learning Deep Features for Discriminative Localization, CVPR 2016: 2921-2929.)
Pipeline: Attribute Localization, Attribute Representation Learning, Optimization. Substitution-based methods characterize the query image with multiple attributes, so attribute manipulation can be conducted by replacing the unwished attribute features with the desired ones. Class Activation Mapping is used for localizing the discriminative image regions.
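The substitution scheme can be sketched as below; the per-attribute block layout and all names are hypothetical, and the prototype is simulated here by averaging random features with the same attribute value.

```python
import numpy as np

# Sketch of substitution-based attribute manipulation (hypothetical layout):
# the image representation is a concatenation of per-attribute feature blocks,
# and manipulation replaces the unwished block with the prototype of the
# desired value.

D = 32                                   # per-attribute feature dim
rng = np.random.default_rng(2)
query_blocks = {"color": rng.normal(size=D),
                "sleeve": rng.normal(size=D)}

# Prototype for the desired value, averaged over (simulated) training features.
train_feats_blue = rng.normal(loc=1.0, size=(100, D))
proto_blue = train_feats_blue.mean(axis=0)

# Substitute the "color" block to manipulate that attribute.
target_blocks = dict(query_blocks)
target_blocks["color"] = proto_blue
target_repr = np.concatenate([target_blocks["color"], target_blocks["sleeve"]])
```

The unchanged blocks keep the query's remaining attributes intact, which is the point of substitution over fusion.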
After the training, the features extracted from the training images with the same attribute value are averaged and used for attribute manipulation.

Motivation: Existing methods ignore the potential of Generative Adversarial Networks (GANs) in enhancing the visual understanding of target items. We aim to boost the performance of content-based fashion search with attribute manipulation by directly generating the target item image. (Illustration: manipulating the color attribute to "blue" with a GAN produces a generated prototype image whose feature-space neighbors are similar items.)

Method (the proposed AMGAN): Prototype Image Generation, then Metric Learning for Fashion Search.

Semantic Discriminative Learning: using the ground-truth value of the attribute and the generated prototype image, make the discriminator learn to accurately classify the attributes, and encourage the generator to synthesize the prototype image with the correct attribute manipulation.
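A minimal sketch of this semantic discriminative objective, with hypothetical logits and labels rather than the exact AMGAN losses: the discriminator is trained with a cross-entropy attribute-classification loss on real images, while the generator is pushed so that the generated prototype is classified as the desired attribute value.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable log-softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

num_values = 5
real_logits = np.array([0.1, 2.0, 0.3, 0.0, -1.0])  # D's prediction on a real image
fake_logits = np.array([0.5, 0.2, 1.7, 0.1, 0.0])   # D's prediction on the generated prototype

d_loss = cross_entropy(real_logits, label=1)  # classify the ground-truth attribute value
g_loss = cross_entropy(fake_logits, label=2)  # desired value after attribute manipulation
```

Minimizing `g_loss` with respect to the generator encourages prototypes that carry the correct manipulated attribute.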
13、shion Search|The proposed AMGAN.Adversarial Metric Learning(Pair-Based)Maximize the similaritybetween the positive pairMinimize the similaritybetween the negative pairEncourage the generator to produce similar to the positive image to fool the learned metricSimilarity Probability:+:shares the sameat
The proposed AMGAN: Adversarial Metric Learning (triplet-based). Encourage the manipulated query to be more similar to the positive image than to the negative one, via a relative similarity probability.

Dataset: DARN, with 213,636 images, 9 attributes, and 179 possible values.
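The pair-based and triplet-based metric-learning objectives above can be sketched with a sigmoid similarity probability and a softmax relative similarity probability; all tensors are random placeholders and the losses are generic stand-ins, not the exact AMGAN objectives.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
q  = rng.normal(size=64)   # manipulated/composed query representation
xp = rng.normal(size=64)   # positive: shares the desired attribute values
xn = rng.normal(size=64)   # negative image representation

# Pair-based: similarity probability from the inner product;
# maximize it for the positive pair, minimize it for the negative pair.
p_pos = sigmoid(q @ xp)
p_neg = sigmoid(q @ xn)
pair_loss = -np.log(p_pos) - np.log(1.0 - p_neg)

# Triplet-based: relative similarity probability via softmax;
# encourage q to be more similar to xp than to xn.
p_rel = softmax(np.array([q @ xp, q @ xn]))[0]
triplet_loss = -np.log(p_rel)
```

In the adversarial setting, the generator would additionally be trained to push its output toward a high similarity probability under the learned metric.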
Examples include online-offline image pairs and attribute/value examples of DARN. (Junshi Huang, Rogério Schmidt Feris, Qiang Chen, Shuicheng Yan: Cross-Domain Image Retrieval with a Dual Attribute-Aware Ranking Network. ICCV 2015: 1062-1070.)

Dataset: Shopping100k, with 101,021 images, 12 attributes, and 151 possible values; attribute and value examples, and samples from Shopping100k. (Kenan E. Ak, Joo-Hwee Lim, Jo Yew Tham, Ashraf A. Kassim: Efficient Multi-attribute Similarity Learning Towards Attribute-Based Fashion Search. WACV 2018: 1671-1679.)
Model Comparison. Fig. 1: Overall performance comparison on Shopping100k and DARN: (a) Top-K, (b) NDCG@K, (c) MRR@K on Shopping100k; (d) Top-K, (e) NDCG@K, (f) MRR@K on DARN. Symbols denote statistical significance at p < 0.05.

Algorithm: For testing, we rank the gallery images by jointly evaluating their cosine similarities to both the local-wise and the global-wise composed query representations.

Dataset overview. Adapting existing datasets: MIT-States (Phillip Isola et al., CVPR 2015), Birds-to-Words (Maxwell Forbes et al., EMNLP 2019), Shoes (Xiaoxiao Guo et al., NeurIPS 2018), Fashion200k (Xintong Han et al., ICCV 2017). Creating new datasets: CSS (Nam Vo et al., CVPR 2019), CIRR (Zheyuan Liu et al., ICCV 2021), FashionIQ (released at an ICCV 2019 workshop).
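The test-time ranking described under "Algorithm" can be sketched as follows; the feature dimensions, the equal-weight sum of the two similarities, and all names are assumptions rather than the exact CLVC-Net procedure.

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Sketch of test-time retrieval: gallery images are ranked by jointly
# evaluating cosine similarity to both the local-wise and the global-wise
# composed query representations.

rng = np.random.default_rng(3)
q_local, q_global = rng.normal(size=64), rng.normal(size=64)
gallery = rng.normal(size=(5, 64))            # 5 candidate image features

scores = np.array([cosine(q_local, g) + cosine(q_global, g) for g in gallery])
ranking = np.argsort(-scores)                 # best match first
```

Top-K, NDCG@K, and MRR@K can then be computed directly from `ranking` against the ground-truth target index.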
Dataset: MIT-States. Contains around 60k images; each image comes with an object/noun label and a state/adjective label (such as "red tomato" or "new camera").
Examples of training triplets derived from MIT-States: (a) unripe banana to ripe banana ("replace unripe with ripe"); (b) cluttered bag to empty bag ("replace cluttered with empty"). (Phillip Isola, Joseph J. Lim, Edward H. Adelson: Discovering states and transformations in image collections. CVPR 2015: 1383-1391.)

Dataset: Birds-to-Words. A dataset for relative captioning, consisting of 3,347 image pairs annotated with 16,067 paragraphs describing the differences between pairs of images. (Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, Serge J. Belongie: Neural Naturalist: Generating Fine-Grained Image Comparisons. EMNLP/IJCNLP (1) 2019: 708-717.)

Dataset: Fashion200k. More than 200k image-text pairs, crawled from online shopping websites; stop words, symbols, and words that occur fewer than 5 times were removed. Example of a training triplet for CTI-IR: blue one shoulder dress to black one shoulder dress ("replace blue with black").
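The Fashion200k-style text cleaning described above can be sketched as follows; the stop-word list and the toy corpus are hypothetical, and only the rule itself (drop stop words, symbols, and words occurring fewer than 5 times) comes from the slide.

```python
from collections import Counter

# Sketch of caption cleaning: drop stop words, symbols, and rare words.

STOP_WORDS = {"a", "the", "with", "and"}        # assumed stop-word list

corpus = ["blue one shoulder dress"] * 5 + ["the black! one shoulder dress"] * 5

def tokenize(caption):
    return [w.strip("!,.").lower() for w in caption.split()]

counts = Counter(w for c in corpus for w in tokenize(c))

def clean(caption):
    return [w for w in tokenize(caption)
            if w not in STOP_WORDS and w.isalpha() and counts[w] >= 5]

print(clean("the black! one shoulder dress"))   # ['black', 'one', 'shoulder', 'dress']
```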
(Xintong Han, Zuxuan Wu, Phoenix X. Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, Larry S. Davis: Automatic Spatially-Aware Fashion Concept Discovery. ICCV 2017: 1472-1480.)

Dataset: Shoes. A dataset for relative captioning, collected in the scenario of a shopping chat session between a shopping assistant and a customer, with 10,751 captions and one caption per pair of images (annotated through an AMT interface). (Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, Rogério Schmidt Feris: Dialog-based Interactive Image Retrieval. NeurIPS 2018: 676-686.)
Dataset: FashionIQ (released at an ICCV 2019 workshop). The dataset contains 77,684 diverse fashion images (dresses, shirts, and tops & tees), side information in the form of textual descriptions and product meta-data, attribute labels, and large-scale, high-quality relative captions collected from human annotators. (Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, Rogério Feris: Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback. CVPR 2021: 11307-11317.)

Dataset: CSS. The same scenes are rendered as 2D and 3D images, using the CLEVR toolkit to generate synthesized images of objects with different Color, Shape, and Size (CSS). Three types of modification texts: adding, removing, or changing object attributes; 16K triplets for training and 16K triplets for testing (used as training triplets for CTI-IR). (Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays: Composing Text and Image for Image Retrieval - an Empirical Odyssey. CVPR 2019: 6439-6448.)

Dataset limitations of previous work: non-complex images within narrow domains, which contain many false negatives.
Example of narrow domains and false negatives: for a reference image with the relative caption "are black with a colorful floral print", many potential target images are equally valid matches (false negatives).

Dataset: CIRR. The Compose Image Retrieval on Real-life images (CIRR) dataset uses the popular NLVR dataset for natural language visual reasoning as the source of images, with over 36,000 pairs collected through a multi-stage annotation process. (Zheyuan Liu, Cristian Rodriguez Opazo, Damien Teney, Stephen Gould: Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. ICCV 2021: 2105-2114.)

Model Comparison: Our method consistently surpasses all the baselines on all three datasets, which reflects the superiority of our CLVC-Net (performance comparison on FashionIQ, Shoes, and Fashion200k).

Case Study: Illustration of CTI-IR results obtained by our CLVC-Net on the three datasets, including failure cases (green boxes mark the target items).

Demo
Conclusion:
We are the first to unify the global-wise and local-wise compositions with mutual enhancement in the context of CTI-IR.
We devise two affine-transformation-based attentive composition modules, towards fine-grained multi-modal composition from both angles.
Extensive experiments conducted on three real-world datasets validate the superiority of our model.
(Haokun Wen, Xuemeng Song, Xin Yang, Yibing Zhan, Liqiang Nie: Comprehensive Linguistic-Visual Composition Network for Image Retrieval. SIGIR 2021: 1369-1378.)

04 Future Work

Future Work: Pre-training techniques, e.g., using CLIP-based features (Baldrati et al., MMAsia '21) or using OSCAR as the composition module (Liu et al., ICCV 2021).

Future Work: Limited annotated samples. Case 1 from FashionIQ: a reference image with the modification text "has small straps, more plain and more revealing" and its target image, alongside many potential target images. Case 2 from Shoes: a reference image with the modification text "are black with a colorful floral print" and its target image, alongside many potential target images.

Thank you very much for watching.