Inferring Context from Pixels for Multimodal Image Classification

NLPIR SEMINAR Y2020#45

INTRO

In the new semester, our lab, the Web Search Mining and Security Lab, plans to hold an academic seminar every Monday. Each time, a keynote speaker will share their understanding of papers related to their research.

Arrangement

Tomorrow’s seminar is organized as follows:

The seminar will take place at 1:20 p.m. on Monday, January 13, 2020, at Zhongguancun Technology Park, Building 5, 1306.
Ian will give a presentation on the paper Inferring Context from Pixels for Multimodal Image Classification.
The seminar will be hosted by Ziyu Liu.

Everyone interested in this topic is welcome to join us.

Inferring Context from Pixels for Multimodal Image Classification

Manan Shah, Krishnamurthy Viswanathan, et al.

Abstract

Image classification models take image pixels as input and predict labels in a predefined taxonomy. While contextual information (e.g. text surrounding an image) can provide valuable orthogonal signals to improve classification, the typical setting in literature assumes the unavailability of text and thus focuses on models that rely purely on pixels. In this work, we also focus on the setting where only pixels are available in the input. However, we demonstrate that if we predict textual information from pixels, we can subsequently use the predicted text to train models that improve overall performance.

We propose a framework that consists of two main components: (1) a phrase generator that maps image pixels to a contextual phrase, and (2) a multimodal model that uses textual features from the phrase generator and visual features from the image pixels to produce labels in the output taxonomy. The phrase generator is trained using web-based query-image pairs to incorporate contextual information associated with each image and has a large output space.
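As a rough illustration of the two-component design described above, and not the authors' implementation, the sketch below wires a toy phrase generator into a toy multimodal classifier. Every class name, weight, phrase, and label here is a hypothetical placeholder; plain linear scoring stands in for the real models, and simple concatenation is assumed as the way visual and textual features are combined.

```python
def linear(weights, x):
    """Multiply a list-of-rows weight matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


class PhraseGenerator:
    """Component (1): maps visual features to a phrase (hypothetical toy model)."""

    def __init__(self, phrases, weights):
        self.phrases = phrases
        self.weights = weights  # one row of weights per candidate phrase

    def predict(self, visual):
        return self.phrases[argmax(linear(self.weights, visual))]


class MultimodalClassifier:
    """Component (2): scores labels from visual plus textual features."""

    def __init__(self, labels, phrase_embeddings, weights):
        self.labels = labels
        self.phrase_embeddings = phrase_embeddings  # phrase -> feature vector
        self.weights = weights  # one row per label, over the fused features

    def predict(self, visual, phrase):
        # Assumption: fuse the two modalities by concatenation.
        fused = list(visual) + list(self.phrase_embeddings[phrase])
        return self.labels[argmax(linear(self.weights, fused))]


# Hand-set toy parameters, chosen only to make the example deterministic.
generator = PhraseGenerator(
    phrases=["golden retriever puppy", "red sports car"],
    weights=[[1.0, 0.0], [0.0, 1.0]],
)
classifier = MultimodalClassifier(
    labels=["dog", "car"],
    phrase_embeddings={
        "golden retriever puppy": [1.0, 0.0],
        "red sports car": [0.0, 1.0],
    },
    weights=[[0.5, 0.0, 1.0, 0.0], [0.0, 0.5, 0.0, 1.0]],
)


def classify(visual):
    """Only pixels (here: visual features) go in; the phrase is predicted."""
    phrase = generator.predict(visual)
    label = classifier.predict(visual, phrase)
    return label, phrase
```

Returning the predicted phrase alongside the label is one simple way to expose the intermediate text, which relates to the interpretability benefit the abstract mentions.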

We evaluate our framework on diverse benchmark datasets (specifically, the WebVision dataset for evaluating multi-class classification and OpenImages dataset for evaluating multi-label classification), demonstrating performance improvements over approaches based exclusively on pixels and showcasing benefits in prediction interpretability. We additionally present results to demonstrate that our framework provides improvements in few-shot learning of minimally labeled concepts. We further demonstrate the unique benefits of the multimodal nature of our framework by utilizing intermediate image/text co-embeddings to perform baseline zero-shot learning on the ImageNet dataset.
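The abstract does not spell out how the co-embeddings are used for zero-shot learning. One common approach, assumed here purely for illustration, is to embed the image and the text of each candidate label in the same space and pick the label whose embedding is closest by cosine similarity. The function names and the toy vectors below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_label(image_embedding, label_embeddings):
    """Pick the label whose text embedding is nearest to the image embedding.

    label_embeddings maps a label name to its vector in the shared space;
    the labels need not have appeared in any labeled training image.
    """
    return max(
        label_embeddings,
        key=lambda name: cosine(image_embedding, label_embeddings[name]),
    )


# Toy vectors standing in for real co-embeddings (assumed values).
label_embeddings = {
    "zebra": [1.0, 0.0, 0.2],
    "tractor": [0.0, 1.0, 0.1],
}
predicted = zero_shot_label([0.9, 0.1, 0.3], label_embeddings)
```

With these toy vectors the image embedding lies much closer to the "zebra" vector than to the "tractor" vector, so that label is selected.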



About the Author: nlpvv
