Procedural Reasoning Networks

Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures. Can we model these intrinsic dynamics and better comprehend procedures? With this goal in mind, we propose Procedural Reasoning Networks (PRN) to address the problem of comprehending procedural commonsense knowledge.

Figure 1: A recipe for preparing a cheeseburger, adapted from the cooking instructions available at https://www.instructables.com/id/In-N-Out-Double-Double-Cheeseburger-Copycat. Each basic ingredient (entity) is highlighted in a different color in the text and with bounding boxes on the accompanying images. Over the course of the recipe instructions, ingredients interact with each other and change their states with each cooking action (underlined in the text), which in turn alters the visual and physical properties of the entities. For instance, the tomato changes its form by being sliced up and then stacked on a hamburger bun.
Paper

arXiv #, 2019.

Citation

Mustafa Sercan Amac, Semih Yagcioglu, Aykut Erdem, and Erkut Erdem. "Procedural Reasoning Networks for Understanding Multimodal Procedures", accepted to CoNLL 2019.
Bibtex

Code: PyTorch Implementation

Abstract

In this work, we address the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to provide a complementary semantic signal. Towards this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of the previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we do not use any supervision at the level of entity states.

Introduction

A great deal of commonsense knowledge about the world we live in is procedural in nature and involves steps that show ways to achieve specific goals. Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures. That is, one must be aware of the entities present in the text, infer relations among them, and even anticipate changes in the states of the entities after each action.

In recent years, tracking entities and their state changes has been explored in the literature from a variety of perspectives. In an early work, Henaff et al. (2017) proposed a dynamic memory based network which updates entity states using a gating mechanism while reading the text. Bansal et al. (2017) presented a more structured memory-augmented model which employs memory slots for representing both entities and their relations. Pavez et al. (2017) suggested a conceptually similar model in which the pairwise relations between attended memories are utilized to encode the world state.

Perez and Liu (2017) showed that similar ideas can be used to compile supporting memories for tracking dialogue state. Wang et al. (2017) demonstrated the importance of coreference signals for the reading comprehension task. More recently, Dhingra et al. (2018) introduced a specialized recurrent layer which uses coreference annotations to improve reading comprehension. Ji et al. (2017) proposed a language model which can explicitly incorporate entities while dynamically updating their representations, benefiting a variety of tasks such as language modeling, coreference resolution, and entity prediction.

Our work builds upon and contributes to the growing literature on tracking state changes in procedural text. Bosselut et al. (2018) presented a neural model that can learn to explicitly predict state changes of ingredients at different points in a cooking recipe. Dalvi et al. (2018) proposed another entity-aware model to track entity states in scientific processes. Tandon et al. (2018) demonstrated that prediction quality can be boosted by including hard and soft constraints to eliminate unlikely or favor probable state changes. In a follow-up work, Du et al. (2019) exploited the notion of label consistency in training to enforce similar predictions in similar procedural contexts. Das et al. (2019) proposed a model that dynamically constructs a knowledge graph while reading the procedural text to track the ever-changing states of entities.

To mitigate the aforementioned challenges, existing works rely mostly on heavy supervision and focus on predicting the individual state changes of entities at each step. Although these models can accurately learn to make local predictions, they may lack global consistency, not to mention that building such annotated corpora is very labor-intensive. As discussed earlier, these previous methods use a strong inductive bias and assume that state labels are present during training. In our study, we deliberately focus on unlabeled procedural data and ask the question: Can multimodality help to identify and provide insights into understanding state changes? Hence, we take a different direction and explore the problem from a multimodal standpoint.

Procedural Reasoning Networks (PRN)

Figure 2: An illustration of our Procedural Reasoning Networks (PRN). For a sample question from the visual coherence task in RecipeQA, while reading the cooking recipe, the model constantly updates the representations of the entities (ingredients) after each step and makes use of these representations, along with the whole recipe, when it scores a candidate answer. Please refer to the paper for more details.

We follow a simple framework and employ the PRN model for comprehending procedural text. In that regard, we use the three visual tasks from the RecipeQA dataset, namely the visual cloze, visual coherence, and visual ordering tasks. The goal in each of these tasks is to provide the correct answer for a given question. For instance, as illustrated in Figure 2, our model reads a recipe in a step-by-step manner and, while reading, internally updates the entity states based on the new information. Finally, the model decides which candidate answer is the correct one for the question asked.

Our model consists of five main modules: an input module, an attention module, a reasoning module, a modeling module, and an output module. Note that the question answering tasks we consider here are multimodal, in that while the context is a procedural text, the question and the multiple-choice answers are composed of images.

The five modules of the proposed PRN model are described in detail in the paper; a minimal sketch of how they fit together is given below.
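To make the overall data flow more concrete, here is a minimal PyTorch sketch of how these five modules could fit together for a single recipe and a set of candidate answer images. All module internals, dimensions, and names (e.g. PRNSketch, the gated memory update) are simplifying assumptions made for illustration; the released PyTorch implementation linked above is the authoritative version.

```python
# Minimal sketch of the PRN data flow (illustrative; not the released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRNSketch(nn.Module):
    def __init__(self, text_dim=128, img_dim=512, hid=128, num_entities=10):
        super().__init__()
        # Input module: encode recipe steps (text) and question/answer images.
        self.step_encoder = nn.LSTM(text_dim, hid, batch_first=True, bidirectional=True)
        self.img_proj = nn.Linear(img_dim, 2 * hid)
        # Reasoning module: one memory slot per entity, updated after every step.
        self.init_memory = nn.Parameter(torch.randn(num_entities, 2 * hid))
        self.mem_update = nn.GRUCell(2 * hid, 2 * hid)
        # Modeling module: fuse recipe, entity-memory, and question summaries.
        self.fuse = nn.Linear(6 * hid, 2 * hid)

    def forward(self, steps, question_imgs, answer_imgs):
        # steps: (num_steps, step_len, text_dim), question_imgs: (q_len, img_dim),
        # answer_imgs: (num_choices, img_dim)
        memory = self.init_memory
        step_summaries = []
        for step in steps:  # read the recipe step by step
            enc, _ = self.step_encoder(step.unsqueeze(0))         # (1, step_len, 2*hid)
            summary = enc.mean(dim=1).squeeze(0)                  # (2*hid,)
            # Attention-like gate of each entity slot over the current step, then a
            # gated memory update (the relational interactions are omitted here).
            gate = torch.sigmoid(memory @ summary)                # (num_entities,)
            candidate = self.mem_update(summary.expand_as(memory).contiguous(), memory)
            memory = gate.unsqueeze(1) * candidate + (1 - gate).unsqueeze(1) * memory
            step_summaries.append(summary)
        recipe_repr = torch.stack(step_summaries).mean(dim=0)     # (2*hid,)
        entity_repr = memory.mean(dim=0)                          # (2*hid,)
        question_repr = self.img_proj(question_imgs).mean(dim=0)  # (2*hid,)
        context = torch.tanh(self.fuse(torch.cat([recipe_repr, entity_repr, question_repr])))
        # Output module: score each candidate answer image against the fused context.
        answers = self.img_proj(answer_imgs)                      # (num_choices, 2*hid)
        return answers @ context                                  # (num_choices,) scores

# Toy usage: a 3-step recipe with 12-token steps, 2 question images, 4 answer candidates.
model = PRNSketch()
scores = model(torch.randn(3, 12, 128), torch.randn(2, 512), torch.randn(4, 512))
print(F.softmax(scores, dim=0))
```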

Results

Table 1: Quantitative comparison of the proposed PRN model against the baselines (accuracy, %).

                        |       Single-task Training           |       Multi-task Training
Model                   | Cloze   Coherence  Ordering  Average | Cloze   Coherence  Ordering  All
Human*                  | 77.60   81.60      64.00     74.40   |   -       -          -        -
Hasty Student           | 27.35   65.80      40.88     44.68   |   -       -          -        -
Impatient Reader        | 27.36   28.08      26.74     27.39   |   -       -          -        -
BIDAF                   | 53.95   48.82      62.42     55.06   | 44.62   36.00      63.93     48.67
BIDAF w/ static memory  | 51.82   45.88      60.90     52.87   | 47.81   40.23      62.94     50.59
PRN                     | 56.31   53.64      62.77     57.57   | 46.45   40.58      62.67     50.17
*Taken from the RecipeQA project website, based on 100 questions sampled randomly from the validation set.
Table 1 presents the quantitative results for the visual reasoning tasks in RecipeQA. In the single-task training setting, PRN gives state-of-the-art results on the visual cloze and visual coherence tasks, outperforming the other neural models, and achieves the best performance on average. These results demonstrate the importance of having a dynamic memory and keeping track of the entities extracted from the recipe. In the multi-task training setting, where a single model is trained to solve all the tasks at once, PRN and BIDAF w/ static memory perform comparably and give significantly better results than BIDAF. Note that the model performances in the multi-task training setting are worse than their single-task counterparts.
Figure 3: t-SNE visualizations of the learned embeddings from each memory snapshot, mapping each entity and its corresponding state at each step, for the visual cloze task.
Figure 4: Step-aware entity representations can be used to discover the changes that occur in the states of the ingredients between two different recipe steps. The difference vector between two snapshots of an entity can then be added to other entities to find their next states. For instance, in the first example, the difference vector encodes the chopping action applied to the onions; in the second example, it encodes the pouring action applied to the water. When these vectors are added to the representations of raw tomatoes and milk, the three most likely next states capture the semantics of the state changes in an accurate manner.
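As a concrete toy illustration of this embedding arithmetic, the snippet below transfers a state-change direction from one entity to another and retrieves the nearest step-aware embeddings as candidate next states. The entity names and the random embedding matrix are placeholders standing in for the learned memory snapshots, not actual PRN outputs.

```python
# Illustrative embedding arithmetic over step-aware entity vectors (placeholder data).
import torch
import torch.nn.functional as F

# Hypothetical dictionary of step-aware entity embeddings, e.g. "onion@step2".
names = ["onion@step1", "onion@step2", "tomato@step1", "tomato@step2",
         "tomato@step3", "milk@step1", "water@step1", "water@step4"]
emb = F.normalize(torch.randn(len(names), 64), dim=1)  # stand-in for learned vectors

def next_state(entity_before, entity_after, target_before, k=3):
    """Transfer the state change (after - before) onto another entity and retrieve
    the k most similar step-aware embeddings as its likely next states."""
    idx = {n: i for i, n in enumerate(names)}
    change = emb[idx[entity_after]] - emb[idx[entity_before]]   # e.g. the chopping action
    query = F.normalize(emb[idx[target_before]] + change, dim=0)
    sims = emb @ query                                          # cosine similarity (rows are unit norm)
    top = sims.topk(k).indices.tolist()
    return [names[i] for i in top]

# e.g. apply the onion "chopping" direction to raw tomatoes:
print(next_state("onion@step1", "onion@step2", "tomato@step1"))
```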

Interactive Demo

Visual Reasoning Tasks in RecipeQA

Entity Embedding Arithmetics

Nearest Neighbor Retrieval



Embedding Projector


Conclusion

We have presented a new neural architecture called Procedural Reasoning Networks (PRN) for multimodal understanding of step-by-step instructions. Our proposed model is based on the successful BiDAF framework but also equipped with an explicit memory unit that provides an implicit mechanism to keep track of the changes in the states of the entities over the course of the procedure. Our experimental analysis on visual reasoning tasks in the RecipeQA dataset shows that the model significantly improves the results of the previous models, indicating that it better understands the procedural text and the accompanying images. Additionally, we carefully analyze our results and find that our approach learns meaningful dynamic representations of entities without any entity-level supervision. Although we achieve state-of-the-art results on RecipeQA, clearly there is still room for improvement compared to human performance.

Acknowledgments

This work was supported by TUBA GEBIP fellowship awarded to E. Erdem; and by the MMVC project via an Institutional Links grant (Project No. 217E054) under the Newton-Katip Çelebi Fund partnership funded by the Scientific and Technological Research Council of Turkey (TUBITAK) and the British Council. We also thank NVIDIA Corporation for the donation of GPUs used in this research.