Procedural Reasoning Networks

Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures. Can we model these intrinsic dynamics and better comprehend procedures? With this goal in mind, we propose Procedural Reasoning Networks (PRN) to address the problem of comprehending procedural commonsense knowledge.

Figure 1: A recipe for preparing a cheeseburger, adapted from the cooking instructions available at https://www.instructables.com/id/In-N-Out-Double-Double-Cheeseburger-Copycat. Each basic ingredient (entity) is highlighted in a different color in the text and with bounding boxes on the accompanying images. Over the course of the recipe instructions, ingredients interact with each other and change their states with each cooking action (underlined in the text), which in turn alters the visual and physical properties of the entities. For instance, the tomato changes its form by being sliced up and then stacked on a hamburger bun.
Paper

arXiv #, 2019.

Citation

Mustafa Sercan Amac, Semih Yagcioglu, Aykut Erdem, and Erkut Erdem. "Procedural Reasoning Networks for Understanding Multimodal Procedures", accepted to CoNLL 2019.
Bibtex

Code: PyTorch Implementation

Abstract

In this work, we address the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to provide a complementary semantic signal. Towards this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of the previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we do not use any supervision at the level of entity states.

Introduction

A great deal of commonsense knowledge about the world we live in is procedural in nature and involves steps that show ways to achieve specific goals. Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures. That is, one must be aware of the entities present in the text, infer relations among them, and even anticipate changes in the states of the entities after each action.

In recent years, tracking entities and their state changes has been explored in the literature from a variety of perspectives. In an early work, Henaff et al. (2017) proposed a dynamic memory based network which updates entity states using a gating mechanism while reading the text. Bansal et al. (2017) presented a more structured memory-augmented model which employs memory slots for representing both entities and their relations. Pavez et al. (2017) suggested a conceptually similar model in which the pairwise relations between attended memories are utilized to encode the world state.

Perez and Liu (2017) showed that similar ideas can be used to compile supporting memories for tracking dialogue state. Wang et al. (2017) demonstrated the importance of coreference signals for the reading comprehension task. More recently, Dhingra et al. (2018) introduced a specialized recurrent layer which uses coreference annotations to improve reading comprehension. Ji et al. (2017) proposed a language model which can explicitly incorporate entities while dynamically updating their representations, benefiting a variety of tasks such as language modeling, coreference resolution, and entity prediction.

Our work builds upon and contributes to the growing literature on tracking state changes in procedural text. Bosselut et al. (2018) presented a neural model that can learn to explicitly predict state changes of ingredients at different points in a cooking recipe. Dalvi et al. (2018) proposed another entity-aware model to track entity states in scientific processes. Tandon et al. (2018) demonstrated that prediction quality can be boosted by including hard and soft constraints to eliminate unlikely or favor probable state changes. In a follow-up work, Du et al. (2019) exploited the notion of label consistency in training to enforce similar predictions in similar procedural contexts. Das et al. (2019) proposed a model that dynamically constructs a knowledge graph while reading the procedural text to track the ever-changing states of entities.

To mitigate the aforementioned challenges, existing works rely mostly on heavy supervision and focus on predicting the individual state changes of entities at each step. Although these models can accurately learn to make local predictions, they may lack global consistency, not to mention that building such annotated corpora is very labor-intensive. As discussed earlier, these previous methods use a strong inductive bias and assume that state labels are present during training. In our study, we deliberately focus on unlabeled procedural data and ask the question: Can multimodality help to identify and provide insights into understanding state changes? Hence, we take a different direction and explore the problem from a multimodal standpoint.

Procedural Reasoning Networks (PRN)

Figure 2: An illustration of our Procedural Reasoning Networks (PRN). For a sample question from the visual coherence task in RecipeQA, while reading the cooking recipe, the model constantly updates the representations of the entities (ingredients) after each step and makes use of these representations, along with the whole recipe, when it scores a candidate answer. Please refer to the paper for more details.

We follow a simple framework and employ the PRN model for comprehending procedural text. In that regard, we use the three visual tasks from the RecipeQA dataset, namely the visual cloze, visual coherence, and visual ordering tasks. The goal in each of these tasks is to provide the correct answer for a given question. For instance, as illustrated in Figure 2, our model reads a recipe in a step-by-step manner and, while reading, internally updates the entity states based on the new information. Finally, the model decides which candidate answer is the correct one for the question asked.

Our model consists of five main modules: an input module, an attention module, a reasoning module, a modeling module, and an output module. Note that the question answering tasks we consider here are multimodal, in that while the context is a procedural text, the question and the multiple-choice answers are composed of images.

The five modules of the proposed PRN model are described in detail in the paper; a minimal sketch of how they fit together is given below.
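To make the overall data flow more concrete, here is a minimal PyTorch sketch of how these five modules could fit together for a single recipe and a set of candidate answer images. All module internals, dimensions, and names (e.g. PRNSketch, the gated memory update) are simplifying assumptions made for illustration; the released PyTorch implementation linked above is the authoritative version.

```python
# Minimal sketch of the PRN data flow (illustrative; not the released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRNSketch(nn.Module):
    def __init__(self, text_dim=128, img_dim=512, hid=128, num_entities=10):
        super().__init__()
        # Input module: encode recipe steps (text) and question/answer images.
        self.step_encoder = nn.LSTM(text_dim, hid, batch_first=True, bidirectional=True)
        self.img_proj = nn.Linear(img_dim, 2 * hid)
        # Reasoning module: one memory slot per entity, updated after every step.
        self.init_memory = nn.Parameter(torch.randn(num_entities, 2 * hid))
        self.mem_update = nn.GRUCell(2 * hid, 2 * hid)
        # Modeling module: fuse recipe, entity-memory, and question summaries.
        self.fuse = nn.Linear(6 * hid, 2 * hid)

    def forward(self, steps, question_imgs, answer_imgs):
        # steps: (num_steps, step_len, text_dim), question_imgs: (q_len, img_dim),
        # answer_imgs: (num_choices, img_dim)
        memory = self.init_memory
        step_summaries = []
        for step in steps:  # read the recipe step by step
            enc, _ = self.step_encoder(step.unsqueeze(0))         # (1, step_len, 2*hid)
            summary = enc.mean(dim=1).squeeze(0)                  # (2*hid,)
            # Attention-like gate of each entity slot over the current step, then a
            # gated memory update (the relational interactions are omitted here).
            gate = torch.sigmoid(memory @ summary)                # (num_entities,)
            candidate = self.mem_update(summary.expand_as(memory).contiguous(), memory)
            memory = gate.unsqueeze(1) * candidate + (1 - gate).unsqueeze(1) * memory
            step_summaries.append(summary)
        recipe_repr = torch.stack(step_summaries).mean(dim=0)     # (2*hid,)
        entity_repr = memory.mean(dim=0)                          # (2*hid,)
        question_repr = self.img_proj(question_imgs).mean(dim=0)  # (2*hid,)
        context = torch.tanh(self.fuse(torch.cat([recipe_repr, entity_repr, question_repr])))
        # Output module: score each candidate answer image against the fused context.
        answers = self.img_proj(answer_imgs)                      # (num_choices, 2*hid)
        return answers @ context                                  # (num_choices,) scores

# Toy usage: a 3-step recipe with 12-token steps, 2 question images, 4 answer candidates.
model = PRNSketch()
scores = model(torch.randn(3, 12, 128), torch.randn(2, 512), torch.randn(4, 512))
print(F.softmax(scores, dim=0))
```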

Results

Table 1: Quantitative comparison of the proposed PRN model against the baselines (accuracy, %).

                        |       Single-task Training           |       Multi-task Training
Model                   | Cloze   Coherence  Ordering  Average | Cloze   Coherence  Ordering  All
Human*                  | 77.60   81.60      64.00     74.40   |   -       -          -        -
Hasty Student           | 27.35   65.80      40.88     44.68   |   -       -          -        -
Impatient Reader        | 27.36   28.08      26.74     27.39   |   -       -          -        -
BIDAF                   | 53.95   48.82      62.42     55.06   | 44.62   36.00      63.93     48.67
BIDAF w/ static memory  | 51.82   45.88      60.90     52.87   | 47.81   40.23      62.94     50.59
PRN                     | 56.31   53.64      62.77     57.57   | 46.45   40.58      62.67     50.17
*Taken from the RecipeQA project website, based on 100 questions sampled randomly from the validation set.
Table 1 presents the quantitative results for the visual reasoning tasks in RecipeQA. In the single-task training setting, PRN gives state-of-the-art results on the visual cloze and visual coherence tasks, outperforming the other neural models, and achieves the best performance on average. These results demonstrate the importance of having a dynamic memory and keeping track of the entities extracted from the recipe. In the multi-task training setting, where a single model is trained to solve all the tasks at once, PRN and BIDAF w/ static memory perform comparably and give significantly better results than BIDAF. Note that the model performances in the multi-task training setting are worse than their single-task counterparts.
Figure 3: t-SNE visualizations of the learned embeddings from each memory snapshot, mapping each entity and its corresponding state at each step, for the visual cloze task.
Figure 4: Step-aware entity representations can be used to discover the changes that occur in the states of the ingredients between two different recipe steps. The difference vector between two snapshots of an entity can then be added to other entities to find their next states. For instance, in the first example, the difference vector encodes the chopping action applied to the onions; in the second example, it encodes the pouring action applied to the water. When these vectors are added to the representations of raw tomatoes and milk, the three most likely next states capture the semantics of the state changes in an accurate manner.
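As a concrete toy illustration of this embedding arithmetic, the snippet below transfers a state-change direction from one entity to another and retrieves the nearest step-aware embeddings as candidate next states. The entity names and the random embedding matrix are placeholders standing in for the learned memory snapshots, not actual PRN outputs.

```python
# Illustrative embedding arithmetic over step-aware entity vectors (placeholder data).
import torch
import torch.nn.functional as F

# Hypothetical dictionary of step-aware entity embeddings, e.g. "onion@step2".
names = ["onion@step1", "onion@step2", "tomato@step1", "tomato@step2",
         "tomato@step3", "milk@step1", "water@step1", "water@step4"]
emb = F.normalize(torch.randn(len(names), 64), dim=1)  # stand-in for learned vectors

def next_state(entity_before, entity_after, target_before, k=3):
    """Transfer the state change (after - before) onto another entity and retrieve
    the k most similar step-aware embeddings as its likely next states."""
    idx = {n: i for i, n in enumerate(names)}
    change = emb[idx[entity_after]] - emb[idx[entity_before]]   # e.g. the chopping action
    query = F.normalize(emb[idx[target_before]] + change, dim=0)
    sims = emb @ query                                          # cosine similarity (rows are unit norm)
    top = sims.topk(k).indices.tolist()
    return [names[i] for i in top]

# e.g. apply the onion "chopping" direction to raw tomatoes:
print(next_state("onion@step1", "onion@step2", "tomato@step1"))
```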

Interactive Demo

Visual Reasoning Tasks in RecipeQA

Entity Embedding Arithmetics

Nearest Neighbor Retrieval



Embedding Projector


Conclusion

We have presented a new neural architecture called Procedural Reasoning Networks (PRN) for multimodal understanding of step-by-step instructions. Our proposed model is based on the successful BiDAF framework but also equipped with an explicit memory unit that provides an implicit mechanism to keep track of the changes in the states of the entities over the course of the procedure. Our experimental analysis on visual reasoning tasks in the RecipeQA dataset shows that the model significantly improves the results of the previous models, indicating that it better understands the procedural text and the accompanying images. Additionally, we carefully analyze our results and find that our approach learns meaningful dynamic representations of entities without any entity-level supervision. Although we achieve state-of-the-art results on RecipeQA, clearly there is still room for improvement compared to human performance.

Acknowledgments

This work was supported by TUBA GEBIP fellowship awarded to E. Erdem; and by the MMVC project via an Institutional Links grant (Project No. 217E054) under the Newton-Katip Çelebi Fund partnership funded by the Scientific and Technological Research Council of Turkey (TUBITAK) and the British Council. We also thank NVIDIA Corporation for the donation of GPUs used in this research.