Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures. Can we model these intrinsic dynamics and better comprehend procedures? With this goal in mind, we propose Procedural Reasoning Networks (PRN) to address the problem of comprehending procedural commonsense knowledge.
Citation: Mustafa Sercan Amac, Semih Yagcioglu, Aykut Erdem, and Erkut Erdem. "Procedural Reasoning Networks for Understanding Multimodal Procedures", accepted to CoNLL 2019.
With this work, we address the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study we do not rely on strong inductive biases and instead explore how multimodality can be exploited to provide a complementary semantic signal. Towards this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of the previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we do not use any supervision at the level of entity states.
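To make the idea of relationally updated entity states concrete, here is a minimal PyTorch-style sketch of one such memory update, in which entity slots attend over each other and are then refreshed with a gated write given the current step encoding. The layer choices, names, and dimensions are our own illustration, not the released PRN implementation.

```python
# Minimal sketch (not the released PRN code): entity memory slots exchange
# information with each other and are refreshed given the current step encoding.
import torch
import torch.nn as nn


class RelationalEntityMemory(nn.Module):
    """One gated update of entity memory slots; sizes are illustrative."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, memory: torch.Tensor, step_encoding: torch.Tensor) -> torch.Tensor:
        # memory: (B, E, D) entity slots; step_encoding: (B, D) summary of the current step
        step = step_encoding.unsqueeze(1).expand_as(memory)
        related, _ = self.self_attn(memory, memory, memory)  # entities attend to each other
        fused = torch.cat([related, step], dim=-1)
        g = torch.sigmoid(self.gate(fused))                  # how much of each slot to rewrite
        cand = torch.tanh(self.candidate(fused))             # proposed new slot content
        return g * cand + (1.0 - g) * memory


# Reading a recipe step by step, refreshing 10 entity slots after each step.
cell = RelationalEntityMemory(dim=64)
memory = torch.zeros(2, 10, 64)                              # batch of 2 recipes
for step_encoding in torch.randn(5, 2, 64):                  # 5 recipe steps
    memory = cell(memory, step_encoding)
print(memory.shape)  # torch.Size([2, 10, 64])
```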
A great deal of commonsense knowledge about the world we live in is procedural in nature and involves steps that show ways to achieve specific goals. Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures. That is, one must be aware of the entities present in the text, infer relations among them, and even anticipate changes in the states of the entities after each action.
In recent years, tracking entities and their state changes has been explored in the literature from a variety of perspectives. In an early work, Henaff et al. (2017) proposed Recurrent Entity Networks, which maintain dynamic memory slots to keep track of the states of the entities mentioned in a text, and Perez and Liu (2017) followed a similar memory network based direction. Our work builds upon and contributes to the growing literature on tracking state changes in procedural text; for instance, Bosselut et al. (2018) introduced Neural Process Networks, which simulate action dynamics in cooking recipes by modeling how each action changes the states of the involved entities.
To mitigate the aforementioned challenges, the existing works rely mostly on heavy supervision and focus on predicting the individual state changes of entities at each step. Although these models can accurately learn to make local predictions, they may lack global consistency across the procedure as a whole.
We follow a simple framework and employ the PRN model for comprehending procedural text. In that regard, we use the three visual tasks from the RecipeQA dataset, namely the visual cloze, visual coherence, and visual ordering tasks, where the goal is to select the correct answer for each question. For instance, as illustrated in Figure 2, our model reads a recipe in a step-by-step manner and, while reading, internally updates the entity states based on the new information. Finally, the model decides which of the candidate answers is the correct one for the given question; a schematic sketch of this answering loop is given below.
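The sketch below illustrates the control flow just described. The helper names (encode_step, encode_image, update_memory, init_memory) are hypothetical placeholders and the dot-product scoring is a simplification, not the actual PRN answer-selection function.

```python
# Schematic answering loop for a visual multiple-choice question (illustrative only;
# encode_step, encode_image, update_memory and init_memory are hypothetical helpers).
import torch


def answer_visual_question(step_texts, candidate_images,
                           encode_step, encode_image, update_memory, init_memory):
    memory = init_memory                        # entity states before reading
    step_encodings = []
    for step in step_texts:                     # read the recipe step by step
        h = encode_step(step)                   # (D,) encoding of the current step
        memory = update_memory(memory, h)       # refresh entity states with new information
        step_encodings.append(h)
    recipe = torch.stack(step_encodings).mean(dim=0)        # pooled recipe representation
    scores = torch.stack([torch.dot(recipe, encode_image(img))
                          for img in candidate_images])
    return int(scores.argmax())                 # index of the selected answer image


# Toy run with random "encoders", just to show the control flow.
D = 64
pick = answer_visual_question(
    ["step 1", "step 2"], ["img_a", "img_b", "img_c"],
    encode_step=lambda s: torch.randn(D),
    encode_image=lambda i: torch.randn(D),
    update_memory=lambda m, h: m,
    init_memory=torch.zeros(1, D))
print(pick)
```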
Our model consists of five main modules: an input module, an attention module, a reasoning module, a modeling module, and an output module. Note that the question answering tasks we consider here are multimodal in that while the context is a procedural text, the question and the multiple-choice answers are composed of images. A rough sketch of how these five modules fit together is given below.
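The following skeletal forward pass is only an interpretation of the module list above: the layer types, fusion steps, and feature sizes are placeholders of our own choosing and should not be read as the architecture exactly as published.

```python
# Skeletal five-module pipeline (input, attention, reasoning, modeling, output).
# All layer choices and sizes are illustrative placeholders.
import torch
import torch.nn as nn


class PRNSkeleton(nn.Module):
    def __init__(self, dim: int = 128, num_entities: int = 10):
        super().__init__()
        # Input module: encode recipe steps (text) and question/answer images.
        self.text_enc = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.img_proj = nn.Linear(2048, dim)              # e.g. pooled CNN features
        # Attention module: relate the recipe context to the (visual) question.
        self.ctx_q_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        # Reasoning module: entity memory the context can attend to.
        self.entity_mem = nn.Parameter(torch.randn(num_entities, dim) * 0.02)
        self.mem_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        # Modeling module: re-encode the fused, entity-aware context.
        self.model_enc = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        # Output module: score each candidate answer image against the summary.
        self.ans_proj = nn.Linear(dim, dim)

    def forward(self, steps, question_imgs, answer_imgs):
        # steps: (B, T, dim) step embeddings; *_imgs: (B, N, 2048) image features
        ctx, _ = self.text_enc(steps)                               # (B, T, dim)
        q = self.img_proj(question_imgs).mean(dim=1, keepdim=True)  # (B, 1, dim)
        attended, _ = self.ctx_q_attn(ctx, q, q)                    # question-aware context
        mem = self.entity_mem.unsqueeze(0).expand(ctx.size(0), -1, -1)
        reasoned, _ = self.mem_attn(attended, mem, mem)             # inject entity states
        modeled, _ = self.model_enc(attended + reasoned)            # (B, T, dim)
        summary = modeled.mean(dim=1)                               # (B, dim)
        cand = self.ans_proj(self.img_proj(answer_imgs))            # (B, C, dim)
        return torch.einsum('bd,bcd->bc', summary, cand)            # (B, C) answer scores


model = PRNSkeleton()
out = model(torch.randn(2, 6, 128), torch.randn(2, 3, 2048), torch.randn(2, 4, 2048))
print(out.shape)  # torch.Size([2, 4])
```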
Accuracy (%) on the RecipeQA visual reasoning tasks; the left block reports single-task training and the right block multi-task training.

| Model | Cloze (single) | Coherence (single) | Ordering (single) | Average (single) | Cloze (multi) | Coherence (multi) | Ordering (multi) | All (multi) |
|---|---|---|---|---|---|---|---|---|
| Human* | 77.60 | 81.60 | 64.00 | 74.40 | – | – | – | – |
| Hasty Student | 27.35 | 65.80 | 40.88 | 44.68 | – | – | – | – |
| Impatient Reader | 27.36 | 28.08 | 26.74 | 27.39 | – | – | – | – |
| BiDAF | 53.95 | 48.82 | 62.42 | 55.06 | 44.62 | 36.00 | 63.93 | 48.67 |
| BiDAF w/ static memory | 51.82 | 45.88 | 60.90 | 52.87 | 47.81 | 40.23 | 62.94 | 50.59 |
| PRN | 56.31 | 53.64 | 62.77 | 57.57 | 46.45 | 40.58 | 62.67 | 50.17 |

*Taken from the RecipeQA project website, based on 100 questions sampled randomly from the validation set.
[Figure: Nearest neighbours of the learned entity embeddings and step description representations, with positive (+) and negative (−) examples.]
We have presented a new neural architecture called Procedural Reasoning Networks (PRN) for multimodal understanding of step-by-step instructions. Our proposed model is based on the successful BiDAF framework but is also equipped with an explicit memory unit that provides an implicit mechanism to keep track of the changes in the states of the entities over the course of the procedure. Our experimental analysis on the visual reasoning tasks in the RecipeQA dataset shows that the model significantly improves the results of the previous models, indicating that it better understands the procedural text and the accompanying images. Additionally, we carefully analyze our results and find that our approach learns meaningful dynamic representations of entities without any entity-level supervision. Although we achieve state-of-the-art results on RecipeQA, there is clearly still room for improvement compared to human performance.
This work was supported by TUBA GEBIP fellowship awarded to E. Erdem; and by the MMVC project via an Institutional Links grant (Project No. 217E054) under the Newton-Katip Çelebi Fund partnership funded by the Scientific and Technological Research Council of Turkey (TUBITAK) and the British Council. We also thank NVIDIA Corporation for the donation of GPUs used in this research.