Multi-modal fashion search - early work

Multimodal search requires intermodal representations of visual and textual fashion attributes that can be mixed and matched to form the user's desired product, together with a mechanism to indicate when a visual and a textual fashion attribute refer to the same concept. With a neural network, [Laenen 18] induces a common multimodal space for visual and textual fashion attributes in which their inner product measures their semantic similarity. A multimodal retrieval model is then built on top of these intermodal representations and ranks images by their relevance to a multimodal query. The model retrieves images that both exhibit the attributes of the query image and satisfy the query text.
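
A minimal sketch of this idea, assuming off-the-shelf CNN image features (2048-d) and word-vector text features (300-d): two learned linear maps project both modalities into a shared space whose inner product serves as the similarity score. The class name and dimensions are illustrative, not taken from [Laenen 18].

```python
import torch
import torch.nn as nn

class CommonSpace(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, shared_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)  # maps visual attribute features
        self.txt_proj = nn.Linear(txt_dim, shared_dim)  # maps textual attribute features

    def score(self, img_feat, txt_feat):
        # Inner product in the shared space is used as the cross-modal similarity.
        return (self.img_proj(img_feat) * self.txt_proj(txt_feat)).sum(dim=-1)

model = CommonSpace()
sim = model.score(torch.randn(4, 2048), torch.randn(4, 300))  # one similarity per pair
```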

Retrieved products should not only be relevant to the submitted textual query, but also match user preferences across both textual and visual modalities. To achieve this, [Guo 18] first leverages also_view and buy_after_viewing products to construct visual and textual latent spaces, which are expected to preserve the visual and semantic similarity of products, respectively. A translation-based search model (TranSearch) is then proposed to (1) learn a multimodal latent space on top of the pre-trained visual and textual latent spaces, and (2) map users, queries and products into this space for direct matching. The TranSearch model is trained with a comparative learning strategy, so that the multimodal latent space is oriented towards personalized ranking during training.
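
As a rough illustration of this translation-style matching, the sketch below adds a user embedding to an encoded query and scores products by closeness in a fused latent space, trained by comparing a purchased item against a sampled negative. The `TranSearchSketch` name, dimensions, fusion layer and pairwise loss are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranSearchSketch(nn.Module):
    def __init__(self, n_users, vis_dim=2048, txt_dim=300, dim=128):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.query_enc = nn.Linear(txt_dim, dim)
        self.item_fuse = nn.Linear(vis_dim + txt_dim, dim)  # fuse pre-trained visual/textual spaces

    def score(self, user, query_feat, item_vis, item_txt):
        # "Translate" the user by the query, then match against the fused item point.
        target = self.user_emb(user) + self.query_enc(query_feat)
        item = self.item_fuse(torch.cat([item_vis, item_txt], dim=-1))
        return -((target - item) ** 2).sum(dim=-1)  # closer item => higher score

def pairwise_loss(pos_score, neg_score):
    # Comparative learning: the purchased item should outscore a sampled negative.
    return -F.logsigmoid(pos_score - neg_score).mean()

model = TranSearchSketch(n_users=100)
s_pos = model.score(torch.tensor([0]), torch.randn(1, 300), torch.randn(1, 2048), torch.randn(1, 300))
s_neg = model.score(torch.tensor([0]), torch.randn(1, 300), torch.randn(1, 2048), torch.randn(1, 300))
loss = pairwise_loss(s_pos, s_neg)
```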

The latent meaning of learned feature vectors hinders both the explanation of retrieval results and the integration of user feedback. Online shopping websites, however, organize fashion items into hierarchical structures based on product taxonomy and domain knowledge, revealing how humans perceive the relatedness among products. [Liao 18] presents techniques for organizing such fashion hierarchies to facilitate reasoning about search results and user intent. An EI (Exclusive & Independent) tree cooperates with deep models for end-to-end multimodal learning: it organizes concepts into multiple semantic levels and augments the tree structure with exclusive as well as independent constraints. From the EI tree, an explicit hierarchical function is learnt to characterize the semantic similarities among products, facilitating an interpretable retrieval scheme that integrates concept-level feedback.
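
The toy code below only illustrates the flavour of hierarchy-based similarity: sibling concepts under an exclusive parent compete with each other, and two products are considered more similar the deeper their concept paths agree. The actual EI tree in [Liao 18] is learned jointly with the deep model; the class, paths and scoring rule here are made up for illustration.

```python
class ConceptNode:
    def __init__(self, name, children=None, exclusive=True):
        self.name = name
        self.children = children or []   # exclusive=True: a product picks at most one child
        self.exclusive = exclusive

def hierarchical_similarity(path_a, path_b):
    """Paths are concept names from root to leaf, e.g. ['clothing', 'dress', 'off-shoulder']."""
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(path_a), len(path_b))

print(hierarchical_similarity(['clothing', 'dress', 'off-shoulder'],
                              ['clothing', 'dress', 'wrap']))  # same category, different style
```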

Rich attributes associated with fashion items, e.g., an off-shoulder dress or black skinny jeans, describe the semantics of items in a human-interpretable way, yet have been largely ignored in existing outfit-compatibility research. Given a corpus of matched pairs of items, [Yang 19] not only predicts the compatibility score of unseen pairs, but also learns interpretable patterns that lead to a good match, e.g., a white T-shirt matches black trousers. The Attribute-based Interpretable Compatibility (AIC) method consists of: (1) a tree-based module that extracts decision rules for matching prediction, (2) an embedding module that learns a vector representation for each rule by accounting for the semantics of the attributes in the rule, and (3) a joint modeling module that unifies the visual and rule embeddings to predict the matching score.
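
A conceptual sketch of the rule-extraction step only, assuming one-hot attribute features and a shallow scikit-learn decision tree whose paths read as matching rules; the toy data and feature names are invented for illustration, and the embedding and joint-modeling modules of AIC are omitted.

```python
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

# Toy pairs: columns = [top_white, top_tshirt, bottom_black, bottom_trouser]
X = np.array([[1, 1, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0],
              [0, 0, 0, 0]])
y = np.array([1, 0, 1, 0])  # 1 = compatible pair

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Each root-to-leaf path is a human-readable matching rule over attributes.
print(export_text(tree, feature_names=['top_white', 'top_tshirt', 'bottom_black', 'bottom_trouser']))
```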

Unlike the general domain, fashion matching pays much more attention to fine-grained information in fashion images and texts. Pioneering approaches detect regions of interest (RoIs) in images and use RoI embeddings as image representations. In general, RoIs tend to capture "object-level" information in fashion images, whereas fashion texts are prone to describe more detailed information, e.g., styles and attributes. RoIs are thus not fine-grained enough for fashion text and image matching. To this end, FashionBERT [Gao 20] leverages image patches as image features. With a pre-trained BERT model as the backbone network, high-level representations of texts and images are learnt, while an adaptive loss trades off the multitask learning objectives (i.e., text/image matching and cross-modal retrieval).
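
The patch-as-token idea can be sketched as below: an image is cut into non-overlapping patches that are flattened and projected to the BERT hidden size, ready to be concatenated with text token embeddings. Patch size, projection and the downstream BERT call are assumptions here, not FashionBERT's exact preprocessing.

```python
import torch
import torch.nn as nn

def image_to_patch_tokens(image, patch=32, hidden=768):
    # image: (B, 3, H, W) -> non-overlapping patches -> linear projection to BERT hidden size
    B, C, H, W = image.shape
    patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    proj = nn.Linear(C * patch * patch, hidden)
    return proj(patches)  # (B, num_patches, hidden), to be concatenated with text tokens

tokens = image_to_patch_tokens(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 49, 768])
```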

[Laenen 18] Web Search of Fashion Items with Multimodal Querying

[Guo 18] Multi-modal Preference Modeling for Product Search

[Liao 18] Interpretable Multimodal Retrieval for Fashion Products

[Yang 19] Interpretable Fashion Matching with Rich Attributes

[Gao 20] FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
