Image-Text Matching

Importance and significance:

Image-text matching has received considerable interest since it associates different modalities and improves the understanding of images and natural language.

Image-text matching is an emerging task that matches instances from one modality with instances from another modality. It bridges vision and language and has the potential to improve the performance of other multimodal applications.

As a fundamental task of multimodal interaction, image-text matching, which focuses on measuring the semantic similarity between an image and a text, has attracted extensive research attention.

Task:

It aims to retrieve semantically related images based on the given text query, and vice versa.

The key point of image-text matching is how to accurately measure the similarity between visual and textual inputs.

The key challenge in image-text matching lies in learning a correspondence between image and text that accurately reflects the similarity of image-text pairs.

Existing methods:

①:one-to-one approaches

One-to-one approaches learn the correspondence between the whole image and text without external object detection tools. The general framework of global correspondence learning methods is to jointly project the whole image and text into a common latent space, where corresponding images and texts are mapped to similar representations. Techniques for common-space projection range from designing specific networks [23] to adding constraints, such as triplet loss [29], adversarial loss [27], and classification loss [15].

  1) Existing one-to-one approaches typically project the image and text into a latent common space where semantic relationships between different modalities can be measured through distance computation. Earlier works adopt various neural networks to improve the feature representations so that semantically related data are pulled close together and unrelated data are pushed apart, e.g., multimodal convolutional neural networks (m-CNNs) [15], multimodal recurrent neural networks (m-RNNs) [16], recurrent residual fusion (RRF) [17], etc.

  2) Others focus on the optimization objective [8, 9, 17, 18]. For example, [8, 9] apply a ranking loss that forces, for each text query, semantically related images to be ranked higher than semantically unrelated ones, and likewise for each image query.

In these works, all data are processed uniformly, regardless of whether they are neighbors of each other.

②:many-to-many approaches

Many-to-many approaches learn latent alignments between objects in the image and words in the text, which requires external object detection tools pre-trained on large-scale datasets. This branch of image-text matching learns local region-word correspondences, which are used to infer the global similarity of image-text pairs. Some researchers focus on learning local correspondence between salient regions and keywords.

One simple way is to utilize the aggregate similarity of all fragments of image and text [12].

Cross-attention or co-attention, which involves multiple steps of attending to image regions based on the text or attending to words based on the image [17, 18], can also be applied. However, existing strategies require computationally demanding pairwise similarity computation between all image-text pairs with complex methods at the test stage, which limits their efficiency in real-world application scenarios.

----------------------------------------------------------------------------------------------

Loss function

Text-image matching has been one of the most popular of these tasks. Most methods involve two phases: 1) training: two neural networks (one image encoder and one text encoder) are learned end-to-end, mapping texts and images into a joint space where vectors (either texts or images) with similar meanings are close to each other; 2) inference: a query in modality A, after being encoded into a vector, is matched against all vector representations of items in modality B via nearest-neighbor search. As the embedding space is learned through jointly modeling vision and language, it is often referred to as Visual Semantic Embeddings (VSE).
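
As a minimal illustration of the inference phase, the sketch below performs nearest-neighbor search in the joint space. The encoders and the pre-computed gallery embeddings are hypothetical, and cosine similarity on L2-normalized vectors is assumed:

```python
import torch

def retrieve(query_vec, gallery_vecs, top_k=5):
    """Nearest-neighbor search in the joint embedding space.

    query_vec:    (D,)  L2-normalized embedding of the query (text or image).
    gallery_vecs: (N, D) L2-normalized embeddings of all items in the other modality.
    Returns indices of the top_k most similar gallery items.
    """
    sims = gallery_vecs @ query_vec          # cosine similarity, since vectors are normalized
    return sims.topk(top_k).indices

# Usage with hypothetical pre-computed embeddings:
# text_vec = text_encoder("a dog chasing a ball")        # (D,)
# image_ids = retrieve(text_vec, image_gallery_vecs)     # indices of matching images
```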


Loss function. Faghri et al. (2018) brought the most notable improvement to the loss function used for training VSE. They proposed a max-margin triplet ranking loss that emphasizes the hardest negative sample within a mini-batch. We, however, point out that the max-margin loss is very sensitive to label noise and encoder performance, and also easily overfits. Through experiments, we show that it only achieves the best performance under a careful selection of model architecture and dataset.

Before Faghri et al. (2018), a pairwise ranking loss was usually adopted for text-image model training. The only difference is that, instead of using only the hardest negative sample, it sums over all negative samples (we thus refer to it as the sum-margin loss). Though the sum-margin loss yields stable and consistent performance under all dataset and architecture conditions, it does not make use of information from hard samples but treats all samples equally by summing the margins up.

During training, a margin-based triplet ranking loss is adopted to cluster positive pairs and push negative pairs away from each other.

1) In this paper, we propose a tradeoff: a kNN-margin loss that sums over the k hardest samples within a mini-batch (see the sketch below). It 1) makes sufficient use of hard samples and 2) is robust across different model architectures and datasets.
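
The three losses can be put side by side in a short sketch. This is an illustration under the usual VSE setup (a mini-batch similarity matrix whose diagonal entries are the positive pairs), not the authors' released code:

```python
import torch

def triplet_margins(sims, margin=0.2):
    """sims: (B, B) image-text similarity matrix; sims[i, i] is the positive pair."""
    pos = sims.diag().view(-1, 1)                        # (B, 1) positive similarities
    cost_s = (margin + sims - pos).clamp(min=0)          # caption-retrieval margins
    cost_im = (margin + sims - pos.t()).clamp(min=0)     # image-retrieval margins
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return cost_s.masked_fill(mask, 0.0), cost_im.masked_fill(mask, 0.0)

def sum_margin_loss(sims, margin=0.2):
    cost_s, cost_im = triplet_margins(sims, margin)
    return cost_s.sum() + cost_im.sum()                  # sum over all negatives

def max_margin_loss(sims, margin=0.2):
    cost_s, cost_im = triplet_margins(sims, margin)
    return cost_s.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()   # hardest negative only

def knn_margin_loss(sims, margin=0.2, k=3):
    cost_s, cost_im = triplet_margins(sims, margin)
    return (cost_s.topk(k, dim=1)[0].sum()               # k hardest negatives per image query
            + cost_im.topk(k, dim=0)[0].sum())           # k hardest negatives per caption query
```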

2) Inverted Softmax (IS): the main idea of IS is to estimate the confidence of each retrieved candidate by normalizing its similarity score over all queries rather than over the gallery, which down-weights "hub" items that are close to many queries.
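
A sketch of the inverted-softmax idea at inference time, assuming a query-gallery similarity matrix has already been computed; the inverse temperature `beta` is an assumed hyperparameter:

```python
import torch

def inverted_softmax(sims, beta=30.0):
    """sims: (Q, G) query-gallery similarity matrix (e.g., cosine similarities).

    Each gallery item's score is renormalized over all queries, so items that are
    similar to many queries (hubs) are penalized when ranking each query's row.
    """
    exp_s = torch.exp(beta * (sims - sims.max()))      # subtract global max for numerical stability
    z_gallery = exp_s.sum(dim=0, keepdim=True)         # (1, G) normalizer over queries
    return exp_s / z_gallery                           # re-scored (Q, G) matrix used for ranking
```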


① projecting the image and text into a common space

Existing approaches have achieved much progress by projecting the image and text into a common space where data with different semantics can be distinguished. However, they treat all data points uniformly and ignore the fact that data within a neighborhood are hard to distinguish because of their similar visual appearance or similar syntactic structure.

② A neighbor-aware network for image-text matching: an intra-attention module and a neighbor-aware ranking loss

However, data within a neighborhood are hard to distinguish, and this is ignored by existing methods. This is mainly because data within a neighborhood are similar at the content level (e.g., images with similar visual appearance, texts with similar syntactic structure) rather than at the semantic level, so data points that are similar in content but different in semantics become neighbors. Therefore, more attention should be paid to data within a neighborhood in order to learn more discriminative features, since such data differ only in subtle parts; this helps to distinguish data with different semantics more effectively.

To address this problem, we propose a neighbor-aware image-text matching network, which uses an intra-attention module and a neighbor-aware ranking loss to jointly distinguish data with different semantics and, more importantly, to distinguish semantically unrelated data within a neighborhood.

Through a detailed comparison of each data point with its semantically unrelated neighbors, intra-attention is used to learn image and text representations separately and to amplify the subtle differences between them. The neighbor-aware ranking loss emphasizes neighbors and exploits the amplified differences to distinguish them explicitly. The intra-attention module is designed to use global information as a reference and to assign different importance to different features.
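
One way such an intra-attention reweighting could look is sketched below; this is an illustration that uses the mean-pooled feature as the global reference, not the paper's exact module:

```python
import torch
import torch.nn as nn

class IntraAttention(nn.Module):
    """Reweight fragment features (regions or words) using the global feature as reference."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, fragments):                                    # fragments: (B, N, D)
        global_ref = fragments.mean(dim=1, keepdim=True)             # (B, 1, D) global reference
        ref = global_ref.expand_as(fragments)                        # (B, N, D)
        scores = self.gate(torch.cat([fragments, ref], dim=-1))      # (B, N, 1) importance per fragment
        weights = torch.softmax(scores, dim=1)
        return (weights * fragments).sum(dim=1)                      # (B, D) attended representation
```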

③ Graph modeling: Graph Structured Network for Image-Text Matching (GSMN) ---- learning fine-grained correspondence.

The GSMN explicitly models objects, relations, and attributes as a structured phrase, which not only allows learning the correspondences of objects, relations, and attributes separately, but also benefits the learning of fine-grained correspondence for the structured phrase.

This is achieved by node-level matching and structure-level matching.

The node-level matching associates each node with its relevant nodes from another modality, where the node can be the object, relation, or attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations at structure-level matching.

However, existing works only learn coarse correspondence based on object co-occurrence statistics, while failing to learn fine-grained correspondence of structured object, relation, and attribute. As a result, they suffer from two limitations: (1) it is hard to learn correspondences of the relation and attribute as they are overwhelmed by object correspondence. (2) objects are prone to correspond to wrong categories without the guidance of descriptive relation and attribute.

Given a textual graph G1 = (V1, E1) of a text, and a visual graph G2 = (V2, E2) of an image, our goal is to match two graphs to learn fine-grained correspondence, producing similarity g(G1, G2) as global similarity of an image-text pair.

Concretely, we first compute the similarities between visual and textual nodes. The similarity value measures how well each visual node corresponds to each textual node.
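
A minimal sketch of this node-level similarity computation, assuming cosine similarity between node features (the tensor names are hypothetical):

```python
import torch
import torch.nn.functional as F

def node_similarity(textual_nodes, visual_nodes):
    """textual_nodes: (T, D) node features of the textual graph (objects/relations/attributes).
    visual_nodes:  (V, D) node features of the visual graph.
    Returns a (V, T) matrix; entry (i, j) measures how well visual node i corresponds to textual node j.
    """
    t = F.normalize(textual_nodes, dim=-1)
    v = F.normalize(visual_nodes, dim=-1)
    return v @ t.t()
```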

④ basic: Attention (cross attention)

Stacked Cross Attention expects two inputs: a set of image features V = {v_1, …, v_k}, v_i ∈ ℝ^D, such that each image feature encodes a region in an image; and a set of word features E = {e_1, …, e_n}, e_i ∈ ℝ^D, in which each word feature encodes a word in a sentence. The output is a similarity score, which measures the similarity of an image-sentence pair.

Stacked Cross Attention attends differentially to image regions and words using both as context to each other while inferring the similarity.
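
A sketch of one attention direction (attending to image regions with respect to each word) under a cosine-similarity assumption; the full Stacked Cross Attention pipeline also normalizes/thresholds the similarities and runs the reverse direction, which is omitted here, and the inverse temperature `lambda_` is an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def attend_regions_to_words(V, E, lambda_=9.0):
    """V: (k, D) image-region features; E: (n, D) word features.
    For each word, compute an attended image vector as a weighted sum of regions,
    then pool word-level relevances into a single image-sentence score."""
    v = F.normalize(V, dim=-1)
    e = F.normalize(E, dim=-1)
    s = e @ v.t()                                   # (n, k) word-region cosine similarities
    alpha = torch.softmax(lambda_ * s, dim=1)       # attention over regions for each word
    attended = alpha @ V                            # (n, D) attended image vector per word
    relevance = F.cosine_similarity(E, attended, dim=-1)   # (n,) relevance of each word to the image
    return relevance.mean()                         # one simple way to pool into a global score
```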

⑤ Visual Semantic Reasoning for Image-Text Matching

Our work also belongs to this direction of learning joint space for image and sentence with an emphasis on improving image representations.

Our goal is to infer the similarity between a full sentence and a whole image by mapping image regions and the text descriptions into a common embedding space.

A region relationship reasoning model is used to enhance the region-based representation by considering the semantic correlations between image regions.

Then a fully-connected relationship graph Gr = (V, E) is built, where V is the set of detected regions and the edge set E is described by the affinity matrix R. R is obtained by calculating the affinity edge of each pair of regions using Eq. 2. That means there will be an edge with a high affinity score connecting two image regions if they have strong semantic relationships and are highly correlated.
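
A sketch of the reasoning step, assuming the affinity matrix R has already been built from pairwise region affinities (the paper's Eq. 2 is not reproduced); a single GCN-style layer propagates information along the region graph, and the class name is hypothetical:

```python
import torch
import torch.nn as nn

class RegionReasoning(nn.Module):
    """One GCN-style reasoning layer over the fully-connected region graph."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, regions, R):
        """regions: (k, D) region features; R: (k, k) affinity matrix (edge weights)."""
        A = torch.softmax(R, dim=-1)                # row-normalize affinities
        enhanced = A @ self.W(regions)              # aggregate features from related regions
        return regions + enhanced                   # residual connection keeps the original features
```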

⑥ Learning Fragment Self-Attention Embeddings

In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations in images or texts by self-attention mechanism, and aggregate fragment information into visual and textual embeddings.

The self-attention layers are built to model subtle and fine-grained fragment relations in the image and text respectively, and each consists of a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer.

Consequently, the fragment self-attention mechanism can discover the fragment relations and identify the semantically salient regions in images or words in sentences, and capture their interaction more accurately.

Instead of exhaustively computing similarities of all pairs of image regions and words in the sentence, we consider learning embeddings for images and texts which independently project the two heterogeneous data modalities into a joint space. Thus, the similarity between image and text can be directly compared on the learned embeddings.

With the self-attention mechanism, each output fragment can attend to all input fragments, and the distance between any two fragments is just one. Thus, our model does not consider any specific order of image regions.
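
A sketch of a fragment self-attention block in the transformer style described above, built from standard PyTorch modules rather than the authors' code; the final mean pooling is a simplifying assumption for aggregation:

```python
import torch
import torch.nn as nn

class FragmentSelfAttention(nn.Module):
    """Self-attention over image regions or words; every fragment attends to all others."""
    def __init__(self, dim, num_heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, fragments):                   # (B, N, D) region or word features
        attended, _ = self.attn(fragments, fragments, fragments)
        x = self.norm1(fragments + attended)        # residual + layer norm
        x = self.norm2(x + self.ff(x))              # position-wise feed-forward sub-layer
        return x.mean(dim=1)                        # aggregate fragments (the paper may pool differently)
```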


Several metrics currently considered useful:

1、Re-rank: the gap is defined by the norm of the difference vector c_i − u_i, reflecting how the discrepancy between the visual and textual feature spaces affects the metric's performance, as shown in Eq. (3). If there is a small gap between the cross-modal and single-modal similarities, it means the metric handles the discrepancy between the visual and textual feature spaces better and can recast the cross-modal representations as single-modal representations when mining their similarities.

The difference between the visual and textual feature spaces makes it harder to accurately measure cross-modal similarity in image-text matching. The quantity above is an abstract estimate of this inter-modal space discrepancy; the smaller it is, the better the current metric.

As shown in Figure 4, we replace the image probe in sentence retrieval with its nearest sentence in the gallery and recompute the single-modal distance to obtain the gap that indicates the metric's performance.

Therefore, a metric with a smaller gap G_i should be given a larger weight in the fusion, so we use G_i as the denominator in Eq. (4) of the subsequent module combination to determine the final score of each individual metric.

In this paper: in the CRE module, a larger R_i means that, under the metric, more sentences near the top of the ranking lists in both directions are close to the image probe v_i. In the QRG module, a smaller G_i indicates that the metric is less affected by the discrepancy between the visual and textual feature spaces. These two re-weightings of the ranking metrics are then fused for re-ranking.
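
A hedged sketch of the fusion idea: each candidate metric gets a quality score R_i (CRE) and a gap G_i (QRG), and metrics with larger R_i and smaller G_i receive larger fusion weights. The weighting form below is an illustrative assumption, not the paper's exact Eq. (4):

```python
import torch

def fuse_metrics(rank_scores, R, G, eps=1e-8):
    """rank_scores: (M, N) tensor, one row of candidate scores per metric (higher = better match).
    R: (M,) per-metric quality scores from the CRE module (larger is better).
    G: (M,) per-metric gaps from the QRG module (smaller is better).
    Returns a fused (N,) score used for re-ranking."""
    weights = R / (G + eps)                          # illustrative: reward large R_i, penalize large G_i
    weights = weights / weights.sum()
    return (weights.unsqueeze(1) * rank_scores).sum(dim=0)
```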

2、KNN-Hardest loss


Cross-modality Person re-identification with Shared-Specific Feature Transfer

Existing studies mainly focus on learning common representations by embedding different modalities into the same feature space. However, learning only common features implies a huge loss of information and reduces the discriminability of the features.

In this paper, we tackle the above limitation by proposing a novel cross-modality shared- specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics to boost the reidentification performance.

So how do we find the commonality and the specificity across modalities? On one hand, information from different modalities is complementary; on the other hand, each modality's own specificity has a strong identification function. But how can the two be separated, i.e., how can these two kinds of representations be found? The authors propose a novel cross-modality shared-specific feature transfer algorithm (cm-SSFT).

We model the affinities of different modality samples according to the shared features and then transfer both shared and specific features among and across modalities.

We also propose a complementary feature learning strategy including modality adaption, project adversarial learning and reconstruction enhancement to learn discriminative and complementary shared and specific features of each modality, respectively.

Discriminability, complementarity, commonality, and specificity.

Previous methods can be summarized into two major categories to overcome the modality discrepancy: modality-shared feature learning and modality-specific feature compensation.

The shared feature learning aims to embed images of whatever modality into the same feature space. With shared cues only, the upper bound of the discrimination ability of the feature representation is limited.

As a result, modality-specific feature compensation methods try to make up the missing specific information from one modality to another.

cm-SSFT explores the potential of both the modality-shared information and the modality-specific characteristics to boost the re-identification performance.

Summarized advantages:

It models the affinities between intra-modality and inter-modality samples and utilizes them to propagate information.

Every sample accepts the information from its inter-modality and intra-modality near neighbors and meanwhile shares its own information with them.

This scheme can compensate for the lack of specific information and enhance the robustness of the shared feature, thus improving the overall representation ability.

Difference:

Our method can exploit the specific information that is unavailable in traditional shared feature learning. Since our method depends on the affinity modeling of neighbors, the compensation process can also overcome the choice difficulty of generative methods.

Concrete implementation: Cross-Modality Shared-Specific Feature Transfer (trained in an end-to-end manner)

1)Input images are first fed into the two-stream feature extractor to obtain the shared and specific features.

2)Then the shared-specific transfer network (SSTN) models the intra-modality and inter-modality affinities. It then propagates the shared and specific features across modalities to compensate for the lacked specific information and enhance the shared features.

3) To obtain discriminative and complementary shared and specific features, ① two project adversarial and ② reconstruction blocks and ③ one modality adaptation module are added to the feature extractor.

1、Two-stream feature extractor

Discriminative objectives: 1) a classification loss for each modality; 2) intra-modality and inter-modality triplet losses.

The classification loss ensures that features can distinguish the identities of the inputs.

Besides, we add a single-modality triplet loss on specific features and a cross-modality triplet loss on shared features for better discriminability:

2、Shared-Specific Transfer Network

For a unified feature representation, the shared and specific features of each sample are concatenated, with the specific features of the other modality filled by zero-padding vectors.

For cross-modality retrieval, we need to transfer the specific features from one modality to another to compensate for these zero-padding vectors.

The proposed shared-specific transfer network can make up the lacking specific features and enhance the robustness of the overall representation jointly.

1)As shown in Figure 2, SSTN first models the affinity of samples according to the two kinds of features.

2)Then it propagates both intra-modality and inter-modality information with the affinity model.

3)Finally, the feature learning stage guides the optimization of the whole process with classification and triplet losses.

Affinity modeling. We use the shared and specific features to model the pair-wise affinity.

The intra-similarity and inter-similarity represent the relation between each sample and the others from both the same and different modalities. We define the final affinity matrix as:

It keeps the top-k values for each row of a matrix and sets the others to zero.

Shared and specific information propagation

The affinity matrix represents the similarities across samples. SSTN utilizes this matrix to propagate features.
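
A sketch of this propagation step, assuming a batch-level similarity matrix, the top-k sparsification described above, and a unified feature matrix X (shared features concatenated with zero-padded specific features); it illustrates the mechanism rather than reproducing the paper's exact formulation:

```python
import torch

def topk_affinity(sims, k=10):
    """Keep the top-k values in each row of the similarity matrix and set the others to zero."""
    vals, idx = sims.topk(k, dim=1)
    A = torch.zeros_like(sims).scatter_(1, idx, vals)
    return A / A.sum(dim=1, keepdim=True).clamp(min=1e-8)   # row-normalize

def propagate(X, A):
    """X: (N, D) unified features (shared + zero-padded specific) of all samples in the batch.
    A: (N, N) row-normalized affinity matrix over intra- and inter-modality samples.
    Each sample receives features from its near neighbors in both modalities."""
    return A @ X          # compensated and enhanced representations
```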
