Multi-modal fashion search - early work
Multimodal search requires intermodal representations of visual and textual fashion attributes that can be mixed and matched to describe the user's desired product, together with a mechanism to indicate when a visual and a textual attribute express the same concept. [Laenen 18] uses a neural network to induce a common multimodal space for visual and textual fashion attributes in which their inner product measures their semantic similarity. A multimodal retrieval model built on these intermodal representations then ranks images by their relevance to a multimodal query, retrieving images that both exhibit the required attributes of the query image and satisfy the query text.
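As a rough illustration of this setup (not the authors' code), the sketch below projects visual and textual attribute features into a shared space and ranks candidate images by inner-product similarity to a multimodal query. The feature dimensions, the plain linear projections, and scoring by summing visual and textual similarities are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermodalSpace(nn.Module):
    """Project visual and textual attribute features into one common space."""
    def __init__(self, img_dim=2048, txt_dim=300, common_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)  # visual attribute -> common space
        self.txt_proj = nn.Linear(txt_dim, common_dim)  # textual attribute -> common space

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

def similarity(a, b):
    # inner product in the common space as a proxy for "same fashion concept"
    return (a * b).sum(dim=-1)

model = IntermodalSpace()
query_img = torch.randn(1, 2048)     # visual features of the query image
query_txt = torch.randn(1, 300)      # textual features of the modifying query text
candidates = torch.randn(100, 2048)  # visual features of candidate catalogue images

v_q, t_q = model(query_img, query_txt)
v_c = F.normalize(model.img_proj(candidates), dim=-1)
# a candidate is relevant if it matches the query image's attributes
# and also satisfies the query text
scores = similarity(v_c, v_q) + similarity(v_c, t_q)
ranking = scores.argsort(descending=True)
```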
Retrieved products should not only be relevant to the submitted textual query but also match the user's preferences in both the textual and the visual modality. To this end, [Guo 18] first leverages also_view and buy_after_viewing products to construct visual and textual latent spaces, which are expected to preserve the visual and semantic similarity of products, respectively. A translation-based search model (TranSearch) is then proposed to 1) learn a multi-modal latent space on top of the pre-trained visual and textual latent spaces, and 2) map users, queries and products into this space for direct matching. TranSearch is trained with a comparative learning strategy, so that the multi-modal latent space is oriented toward personalized ranking during training.
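The sketch below illustrates the translation idea under simple assumptions: a user embedding plus an encoded query is "translated" toward a product vector fused from pre-trained visual and textual features, and a pairwise loss pushes the purchased product above a sampled negative. The fusion layer and the BPR-style loss are stand-ins, not TranSearch's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TranSearchSketch(nn.Module):
    def __init__(self, n_users, vis_dim=512, txt_dim=512, dim=128):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.query_enc = nn.Linear(txt_dim, dim)           # query text -> latent space
        # fuse pre-trained visual and textual product vectors into one space
        self.item_fuse = nn.Linear(vis_dim + txt_dim, dim)

    def score(self, user, query_feat, item_vis, item_txt):
        u = self.user_emb(user)
        q = self.query_enc(query_feat)
        i = self.item_fuse(torch.cat([item_vis, item_txt], dim=-1))
        # "translate" the (user, query) pair and match it against the product
        return -((u + q - i) ** 2).sum(dim=-1)

def pairwise_loss(model, user, query, pos_vis, pos_txt, neg_vis, neg_txt):
    # comparative (pairwise) learning: the purchased product should score
    # higher than a sampled negative for the same user and query
    pos = model.score(user, query, pos_vis, pos_txt)
    neg = model.score(user, query, neg_vis, neg_txt)
    return -F.logsigmoid(pos - neg).mean()
```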
The latent nature of learned feature vectors hinders both the explanation of retrieval results and the integration of user feedback. Online shopping websites, however, organize fashion items into hierarchical structures based on product taxonomy and domain knowledge, which reveal how humans perceive relatedness among products. [Liao 18] presents techniques for organizing fashion hierarchies to facilitate reasoning about search results and user intent. An EI (Exclusive & Independent) tree cooperates with deep models for end-to-end multimodal learning: it organizes concepts into multiple semantic levels and augments the tree structure with exclusive as well as independent constraints. From the EI tree, an explicit hierarchical function is learnt to characterize semantic similarities among products, enabling an interpretable retrieval scheme that integrates concept-level feedback.
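A minimal sketch of the exclusive/independent idea, assuming a tiny hypothetical taxonomy: sibling concepts under one node compete through a softmax (exclusive), different subtrees get separate heads (independent), and a simple agreement score over the per-group distributions plays the role of the learnt hierarchical similarity function. None of this is the paper's exact objective.

```python
import torch
import torch.nn as nn

# hypothetical concept groups: each group = the children of one tree node
concept_groups = {
    "category": ["dress", "jeans", "t-shirt"],       # exclusive within the group
    "sleeve":   ["long", "short", "off-shoulder"],
    "color":    ["black", "white", "red"],
}

class EITreeHead(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # one classifier per group -> independence across subtrees
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, len(labels))
            for name, labels in concept_groups.items()
        })

    def forward(self, feat):
        # softmax inside each group -> exclusiveness among sibling concepts
        return {name: head(feat).softmax(dim=-1) for name, head in self.heads.items()}

def hierarchical_similarity(pred_a, pred_b):
    # semantic similarity of two items = agreement of their concept
    # distributions, accumulated over the groups (levels) of the tree
    return sum((pred_a[g] * pred_b[g]).sum(dim=-1) for g in pred_a)

head = EITreeHead()
item_a, item_b = torch.randn(1, 512), torch.randn(1, 512)
sim = hierarchical_similarity(head(item_a), head(item_b))
```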
Rich attributes associated with fashion items, e.g., off-shoulder dress and black skinny jeans, describe the semantics of items in a human-interpretable way, yet have been largely ignored in existing outfit-compatibility research. Given a corpus of matched pairs of items, [Yang 19] can not only predict the compatibility score of unseen pairs but also learn interpretable patterns that lead to a good match, e.g., a white T-shirt matches black trousers. The Attribute-based Interpretable Compatibility (AIC) method consists of: 1) a tree-based module that extracts decision rules for matching prediction, 2) an embedding module that learns a vector representation for each rule by accounting for the attribute semantics within the rule, and 3) a joint modeling module that unifies the visual and rule embeddings to predict the matching score.
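The sketch below mirrors the three AIC components under stated assumptions: a few hand-written attribute-pair rules stand in for the rules a tree ensemble would extract, each rule gets an embedding, and a joint scorer mixes rule and visual signals into a compatibility score. Rules, dimensions, and the fusion are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

# hypothetical extracted rules: each fires when both attributes are observed
rules = [("white", "black"), ("t-shirt", "trousers"), ("skinny", "off-shoulder")]

class AICSketch(nn.Module):
    def __init__(self, vis_dim=512, rule_dim=64):
        super().__init__()
        self.rule_emb = nn.Embedding(len(rules), rule_dim)  # one vector per rule
        self.vis_proj = nn.Linear(2 * vis_dim, rule_dim)    # pair of item images
        self.scorer = nn.Linear(rule_dim, 1)

    def forward(self, attrs_a, attrs_b, vis_a, vis_b):
        # rule activations: 1.0 if the attribute pair is present on the item pair
        fired = torch.tensor([
            float(a in attrs_a and b in attrs_b) for (a, b) in rules
        ])
        rule_vec = fired @ self.rule_emb.weight              # sum of fired-rule embeddings
        vis_vec = self.vis_proj(torch.cat([vis_a, vis_b], dim=-1))
        return torch.sigmoid(self.scorer(rule_vec + vis_vec))  # compatibility score

model = AICSketch()
score = model({"white", "t-shirt"}, {"black", "trousers"},
              torch.randn(512), torch.randn(512))
```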
Unlike the general domain, fashion matching pays much more attention to fine-grained information in fashion images and texts. Pioneering approaches detect regions of interest (RoIs) in images and use RoI embeddings as image representations. RoIs tend to capture "object-level" information in fashion images, whereas fashion texts tend to describe more detailed information, e.g., styles and attributes; RoIs are thus not fine-grained enough for fashion text-image matching. To this end, FashionBERT [Gao 20] uses image patches as visual features. With a pre-trained BERT model as the backbone network, high-level representations of texts and images are learnt, while an adaptive loss trades off the multiple learning tasks (i.e., text/image matching and cross-modal retrieval).
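A minimal sketch of the two ideas, assuming standard components: the image is cut into fixed-size patches that become "visual tokens" alongside the text tokens of a BERT-style encoder, and an uncertainty-based weighting stands in for the adaptive loss that balances the tasks. The patch size, encoder depth, and weighting scheme are assumptions, not FashionBERT's exact configuration.

```python
import torch
import torch.nn as nn

class FashionBERTSketch(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, patch=32):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # each image patch becomes one "visual token" via a strided projection
        self.patch_emb = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # BERT stand-in
        self.match_head = nn.Linear(dim, 2)           # text/image matching head
        # learnable log-variances for adaptive multi-task weighting (assumption)
        self.log_vars = nn.Parameter(torch.zeros(2))

    def forward(self, token_ids, image):
        t = self.token_emb(token_ids)                          # (B, L, dim)
        p = self.patch_emb(image).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        h = self.encoder(torch.cat([t, p], dim=1))             # joint text+patch encoding
        return self.match_head(h[:, 0])                        # classify on the first token

    def adaptive_loss(self, losses):
        # trade off the tasks by learned uncertainty instead of fixed weights
        return sum(torch.exp(-lv) * l + lv for lv, l in zip(self.log_vars, losses))
```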