Inferring Context from Pixels for Multimodal Image Classification

NLPIR SEMINAR Y2020#45

INTRO

In the new semester, our lab, the Web Search Mining and Security Lab, plans to hold an academic seminar every Monday. Each session, a keynote speaker will share their understanding of papers related to their research.

Arrangement

Tomorrow’s seminar is organized as follows:

  1. The seminar will be held at 1:20 p.m. on Monday, January 13, 2020, at Zhongguancun Technology Park, Building 5, Room 1306.
  2. Ian is going to give a presentation on the paper, Inferring Context from Pixels for Multimodal Image Classification.
  3. The seminar will be hosted by Ziyu Liu.

Everyone interested in this topic is welcome to join us.

Inferring Context from Pixels for Multimodal Image Classification

Manan Shah, Krishnamurthy Viswanathan, et al.

Abstract

Image classification models take image pixels as input and predict labels in a predefined taxonomy. While contextual information (e.g. text surrounding an image) can provide valuable orthogonal signals to improve classification, the typical setting in the literature assumes the unavailability of text and thus focuses on models that rely purely on pixels. In this work, we also focus on the setting where only pixels are available in the input. However, we demonstrate that if we predict textual information from pixels, we can subsequently use the predicted text to train models that improve overall performance.

We propose a framework that consists of two main components: (1) a phrase generator that maps image pixels to a contextual phrase, and (2) a multimodal model that uses textual features from the phrase generator and visual features from the image pixels to produce labels in the output taxonomy. The phrase generator is trained using web-based query-image pairs to incorporate contextual information associated with each image and has a large output space.
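
To make the two-component framework concrete, below is a minimal sketch of how a phrase generator and a multimodal classifier could be wired together. The class names (PhraseGenerator, MultimodalClassifier), the hard top-1 phrase selection, and all dimensions are illustrative assumptions for exposition, not the authors' actual architecture, which uses a large web-query phrase vocabulary and pretrained visual backbones.

    # Minimal sketch of the two-component framework described above.
    # Names and dimensions are illustrative assumptions, not the paper's implementation.
    import torch
    import torch.nn as nn

    class PhraseGenerator(nn.Module):
        """Maps image pixels to a distribution over a large phrase vocabulary."""
        def __init__(self, phrase_vocab_size=100_000, img_dim=2048):
            super().__init__()
            # Stand-in for a pretrained visual backbone (e.g. a CNN feature extractor).
            self.visual_encoder = nn.Linear(3 * 224 * 224, img_dim)
            self.phrase_head = nn.Linear(img_dim, phrase_vocab_size)

        def forward(self, pixels):
            visual = self.visual_encoder(pixels.flatten(1))
            # Return visual features and phrase logits.
            return visual, self.phrase_head(visual)

    class MultimodalClassifier(nn.Module):
        """Fuses visual features with an embedding of the predicted phrase."""
        def __init__(self, phrase_vocab_size=100_000, img_dim=2048,
                     text_dim=256, num_labels=1000):
            super().__init__()
            self.phrase_embedding = nn.Embedding(phrase_vocab_size, text_dim)
            self.classifier = nn.Linear(img_dim + text_dim, num_labels)

        def forward(self, visual, phrase_logits):
            # Hard top-1 phrase selection for simplicity; a soft mixture
            # over phrases would also fit this interface.
            top_phrase = phrase_logits.argmax(dim=-1)
            text = self.phrase_embedding(top_phrase)
            return self.classifier(torch.cat([visual, text], dim=-1))

    # Example: classify a batch of 224x224 RGB images from pixels alone.
    pixels = torch.randn(4, 3, 224, 224)
    generator = PhraseGenerator()
    classifier = MultimodalClassifier()
    visual, phrase_logits = generator(pixels)
    label_logits = classifier(visual, phrase_logits)  # shape: (4, num_labels)

The key point the sketch illustrates is that the classifier never sees ground-truth text at inference time; the textual signal is itself predicted from pixels and then fused with the visual features.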

We evaluate our framework on diverse benchmark datasets (specifically, the WebVision dataset for evaluating multi-class classification and the OpenImages dataset for evaluating multi-label classification), demonstrating performance improvements over approaches based exclusively on pixels and showcasing benefits in prediction interpretability. We additionally present results to demonstrate that our framework provides improvements in few-shot learning of minimally labeled concepts. We further demonstrate the unique benefits of the multimodal nature of our framework by utilizing intermediate image/text co-embeddings to perform baseline zero-shot learning on the ImageNet dataset.
