Understand the Dynamic World: An End-to-End Knowledge Informed Framework for Open Domain Entity State Tracking

Open domain entity state tracking aims to predict reasonable state changes of entities (i.e., [attribute] of [entity] was [before_state] and [after_state] afterwards) given the action descriptions. It is important to many reasoning tasks that support human everyday activities. However, it is challenging because the model needs to predict an arbitrary number of entity state changes caused by the action, while most of the entities are only implicitly relevant to the actions, and their attributes and states come from open vocabularies. To tackle these challenges, we propose a novel end-to-end Knowledge Informed framework for open domain Entity State Tracking, namely KIEST, which explicitly retrieves the relevant entities and attributes from an external knowledge graph (i.e., ConceptNet) and incorporates them to autoregressively generate all the entity state changes with a novel dynamic knowledge grained encoder-decoder framework. To enforce logical coherence among the predicted entities, attributes, and states, we design a new constrained decoding strategy and employ a coherence reward to improve the decoding process. Experimental results show that our proposed KIEST framework significantly outperforms strong baselines on the public benchmark dataset OpenPI.

Expected output: state of celery was whole before and in sticks afterwards; length of celery was longer before and shorter afterwards; composition of celery was whole before and cut up afterwards; cleanness of knife was clean before and dirty afterwards.
T5: moisture of celery was in fridge before and on cutting board afterwards.
GPT-2: shape of celery was whole before and cut into sticks afterwards.
KIEST: state of stick was whole before and cut afterwards; cleanness of knife was clean before and dirty afterwards; length of celery was longer before and shorter afterwards; shape of celery was uncut before and cut into sticks afterwards.

INTRODUCTION
Open domain entity state tracking [30] aims to predict all the state changes of entities that are related to a given action description, where each state change can be described with a template, e.g., "[attribute] of [entity] was [before_state] and [after_state] afterwards". It is an important task for many reasoning and information retrieval systems that aim to better understand human daily activities and make recommendations, e.g., suggesting the subsequent actions that humans need to perform. Figure 1 shows an example: given the action Cut the celery into sticks., we need to infer all the entity state changes related to the action, such as length of celery was longer before and shorter afterwards, cleanness of knife was clean before and dirty afterwards, and so on. There are two particular challenges in open domain entity state tracking. First, since the actions come from open domains, the entities, attributes, and states involved in state changes are usually drawn from open-set vocabularies. For instance, in the example of Figure 1, the entity knife is not explicitly mentioned in the action description, and there is no closed-set attribute vocabulary to indicate the state changes of entities, making it hard to formulate the problem as a reading comprehension task as in previous studies [6,16,27], which only focus on a few predefined entities and states (e.g., location, existence). Second, there can be an arbitrary number of entity state changes caused by the action, while the system is expected to predict most, if not all, of them. A previous study [30] tackles these challenges by exploring an autoregressive language model, such as GPT-2 [20], to directly generate all the entity state changes (green box in Figure 1). However, as shown in Figure 1, such approaches suffer from very low coverage of the entities and attributes in the predicted state changes, and without any constraint, the models easily generate state changes that are neither related to the context of the action nor consistent with human commonsense, e.g., the generated "fridge" (before_state) from T5 is not coherent with the attribute "moisture".
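Concretely, each state change instantiates the template above and can be parsed back into its four slots. The following is an illustrative sketch (the regex and the `parse_state_change` helper are ours, not part of OpenPI), assuming generated strings contain the literal words "before" and "afterwards" as in the examples:

```python
import re

# Hypothetical helper: parse one state change of the form
# "[attribute] of [entity] was [before_state] before and [after_state] afterwards".
TEMPLATE = re.compile(
    r"^(?P<attribute>.+?) of (?P<entity>.+?) was "
    r"(?P<before_state>.+?) before and (?P<after_state>.+?) afterwards$"
)

def parse_state_change(text):
    """Return the four template slots as a dict, or None if malformed."""
    m = TEMPLATE.match(text.strip())
    return m.groupdict() if m else None
```

Any output that fails to parse (e.g., a missing entity slot) is easy to detect this way, which is also how malformed annotations can be filtered.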
In this work, we argue that knowledge graphs (KGs), such as ConceptNet [12], can provide meaningful knowledge to inform the model of entities and attributes relevant to the given action description, and thus help the model better predict entity state changes. For example, as shown in Figure 1, given source concepts from the action description, such as cut, ConceptNet provides more implicitly relevant concepts, such as knife, cleanness, and so on. Motivated by this, we propose a novel end-to-end Knowledge Informed framework for open domain Entity State Tracking, namely KIEST, which consists of two major steps: (1) retrieving and selecting all the entities and attributes relevant to the given action description from ConceptNet; (2) incorporating the external entity and attribute knowledge into a novel Dynamic Knowledge Grained Encoder-Decoder framework, based on an autoregressive pre-trained language model such as T5 [21], to generate all the entity state changes. To encourage the model to generate coherent and reasonable state changes, we further propose a constrained decoding strategy and an Entity State Change Coherence Reward (ESCCR) to improve the generation process. The experimental results demonstrate the effectiveness of our proposed KIEST framework with a significant improvement over the strong baselines. The contributions of this work can be summarized as:
• To the best of our knowledge, we are the first to incorporate external entity and attribute knowledge to inform the model to generate entity state changes with higher coverage;
• We design a novel Dynamic Knowledge Grained Encoder-Decoder approach to dynamically incorporate the external knowledge and autoregressively generate all the entity state changes;
• We design a new constrained decoding strategy and an automatic reward function that estimates the coherence of entity state changes, so as to encourage the model to generate state changes that are more coherent with the context actions and human commonsense;
• We conduct a thorough analysis of our method, including an ablation study, demonstrating the robustness of our framework.

RELATED WORK

2.1 Entity State Tracking
Tracking the state changes of entities is important for understanding natural language text that describes actions and procedures. Most previous studies [3,6,15,16,18,23,25] only focus on a particular domain with a set of predefined entities and attributes. For example, Mishra et al. [16] tackle this problem as a question-answering task and only focus on location and existence attributes to track the entity states in process paragraphs. Faghihi and Kordjamshidi [6] propose a Time-Stamped Language Model to understand the location changes of entities. PiGLET [36] predicts the post-state of an entity given its pre-state, a specific attribute, and context, while EVENTS REALM [27] determines whether an entity has a state change with respect to a set of given attributes. Recently, [30] further extended entity state tracking to open domain actions and explored pre-trained language models, such as GPT-2 [20], to autoregressively generate all the entity state changes. Compared with all these studies, our work focuses on open-domain entity state tracking and aims to encourage the model to generate entity state changes that have high coverage and are more coherent with the context and human commonsense.

Knowledge Informed Language Understanding and Generation
Many studies [1,2,8,10,13,28,29,32,34,38] have proposed incorporating external knowledge to better understand text or generate the expected output. For example, on language understanding tasks, [13] injects expanded knowledge into the language model by adding entities and relations from the knowledge graph as additional words. Different from the masking strategy of BERT [5], [29] proposes an entity-level masking strategy to incorporate informative entities into the language model. [2] verbalizes extracted facts aligned with input questions as natural language and incorporates them as prompts to the language model to improve story comprehension. For open-domain question answering, [8] combines the informative entities extracted from the input question and passage with the output of the language model T5 to jointly optimize the knowledge representations with a relation-aware GNN. Inspired by these studies, we retrieve the entities and attributes relevant to the action description from an external knowledge graph and dynamically incorporate them to better predict the entity state changes.

METHODOLOGY
Given a procedural paragraph with a sequence of action descriptions, we aim to predict all the state changes of entities related to each action, following the task formulation of [30]. Intuitively, humans usually first focus on the action description, and then conceive the relevant entities and their state changes with respect to certain attributes. We thus follow this cognitive process and propose KIEST, a knowledge-informed framework to track entity states given open domain actions. Specifically, as illustrated in Figure 2, KIEST consists of two main steps: (1) it first retrieves and selects all the candidate entities and attributes related to the action description from an external knowledge graph; here, we use ConceptNet [12] given its high coverage of open-domain concepts; (2) it then dynamically incorporates the relevant entity and attribute knowledge into a Dynamic Knowledge Grained Encoder-Decoder to generate all the entity state changes, while a constrained decoding strategy and an Entity State Change Coherence Reward are employed to encourage the decoder to generate state changes that are more coherent with the action descriptions and human commonsense. Next, we provide details for each of the components.

Entity and Attribute Knowledge Retrieval and Selection
We observe that state changes usually happen to entities that are either directly included in the action description or conceptually relevant to the key entities and actions (called anchors) included in the action description. For example, in Figure 2, the state change happens to the entity body, which is closely related to the anchor person in the action description. Similarly, the entity-related attributes are also conceptually related to the anchors contained in the action description.
Motivated by this, we propose to acquire a rich set of entities and attributes relevant to state changes from ConceptNet [26], a general KG covering about 4,716,604 open-domain concepts and their relations. Specifically, given the input x = (x_q, x_c), we first find all the spans in x that are included as concepts in ConceptNet, and take each span as an anchor to retrieve the connected concepts within n hops in ConceptNet, denoted as N_x. Taking the action description in Figure 2 as an example, given the context and query as input, we first extract the anchors from them, such as person and you, and then take each anchor as a query to ConceptNet and obtain a set of relevant concepts, such as coach, rich, hand, placement, energy, body, position, muscle, and so on.
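The anchor-based retrieval step can be sketched as a bounded breadth-first walk over the KG; here `TOY_KG` and `neighbors` are stand-ins for a real ConceptNet lookup, and the hop bound plays the role of n in the text:

```python
from collections import deque

# Toy adjacency map standing in for ConceptNet (assumption for illustration).
TOY_KG = {
    "person": ["coach", "body", "muscle"],
    "body": ["position", "muscle"],
    "coach": ["rich"],
}

def neighbors(concept):
    """Stand-in for a ConceptNet neighbor lookup."""
    return TOY_KG.get(concept, [])

def retrieve_concepts(anchors, n_hops):
    """Collect all concepts reachable from the anchors within n_hops."""
    retrieved, seen = set(), set(anchors)
    frontier = deque((a, 0) for a in anchors)
    while frontier:
        concept, depth = frontier.popleft()
        if depth == n_hops:
            continue  # do not expand beyond the hop bound
        for nb in neighbors(concept):
            if nb not in seen:
                seen.add(nb)
                retrieved.add(nb)
                frontier.append((nb, depth + 1))
    return retrieved
```

With the toy graph, one hop from the anchor person yields coach, body, and muscle; a second hop additionally reaches rich and position, illustrating how the candidate set N_x grows quickly with n.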
For each input x, there are hundreds or thousands of neighboring concepts in N_x, while most of them are not related to entity state changes. For instance, in Figure 2, retrieved concepts such as rich and coach are not relevant to any state changes. Based on this observation, we further design two selection models, one for entities and one for attributes, to select from N_x the knowledge most relevant to the action description. Both selection models share the same architecture and training objective. Taking the entity selection model as an example, it takes the following information as input: (1) the action description x; (2) a positive entity set P = {e^+}, constructed from the entities included in the human-annotated state changes; (3) a negative entity set N = {e^-}, created by randomly sampling entities from state changes annotated for other actions while ensuring that each e^- is not included in P. To differentiate the positive and negative entities, the model utilizes a pre-trained BERT model f(.) [5,11] to extract semantic representations f(x), f(e^+), f(e^-) for x, e^+, e^-, respectively, and selects positive entities by measuring their distance to the action description. The model is optimized with the following triplet loss:

L_triplet = max(0, ||f(x) - f(e^+)|| - ||f(x) - f(e^-)|| + m)

where ||.|| denotes the Euclidean distance and m is a margin parameter, which we set to 1 by default. During training, the triplet loss reduces the distance between f(x) and f(e^+) while enlarging the distance between f(x) and f(e^-). At inference time, we calculate the similarity scores between x and each candidate concept from N_x and select the candidates as relevant entity knowledge E_x if their scores are higher than a threshold δ, which is treated as a hyper-parameter and discussed in Section 6.1.
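A minimal sketch of the selection objective and the inference-time filter, assuming pre-computed embedding vectors in place of the BERT encoder f(.); the margin default of 1 follows the text, while the cosine-based scoring in `select_candidates` is one plausible reading of the similarity step:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, ||a - p|| - ||a - n|| + margin), as in the selection models."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def select_candidates(action_vec, candidate_vecs, threshold):
    """Keep candidates whose cosine similarity to the action exceeds threshold."""
    selected = []
    for name, vec in candidate_vecs.items():
        sim = float(vec @ action_vec /
                    (np.linalg.norm(vec) * np.linalg.norm(action_vec)))
        if sim > threshold:
            selected.append(name)
    return selected
```

Training pushes relevant concepts toward the action representation, so at inference a single threshold on the similarity score separates useful concepts (e.g., knife) from noise (e.g., rich).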
We utilize the same architecture for the attribute selection model and the same procedure to obtain the positive and negative sets of attributes for each input x, optimizing it with the triplet loss. Note that, during inference, for each input x, we apply the attribute selection model to select the relevant attribute knowledge A_x from the same set of candidate concepts N_x. After selecting the relevant entities and attributes for input x, we construct a heterogeneous entity-attribute knowledge graph and describe the major components of our framework as follows.

Dynamic Knowledge Grained Encoder-Decoder
DKGED first computes the hidden representation X = [w_1, ..., w_n] of the tokens in the raw action description x:

X = Enc(x)

where Enc(.) is an N-layer Transformer encoder of T5. To capture the interaction among the relevant entities and attributes in the entity-attribute KG, we utilize a Relational Graph Convolutional Network (RGCN) [24] to learn the embedding matrices C_e and C_a for the entity set E_x and the attribute set A_x, respectively:

C_e = RGCN(E_x),  C_a = RGCN(A_x)

After encoding the input action description and the relevant entity and attribute knowledge, DKGED autoregressively generates a set of entity state changes, denoted as a concatenation of a sequence of words [s_1, ..., s_m], while dynamically incorporating the relevant entities or attributes. Specifically, as shown in Figure 3, DKGED adopts an N-layer Transformer-based decoder, which converts the previously generated tokens into vector representations together with positional embeddings and encodes them with masked multi-head attention (MHA) in each layer. At each generation step t, DKGED dynamically incorporates the relevant entity knowledge, attribute knowledge, and the input action description to predict the output token, based on the position of t in the state change template "[attribute] of [entity] was [before_state] and [after_state] afterwards". Intuitively, when the current decoding step aims to predict an attribute, the external attribute knowledge is more meaningful than the entity knowledge; similarly, when the current decoding step is to predict an entity, the external entity knowledge is more helpful. Other positions, such as the template tokens (e.g., of, was, afterwards) and the states, can be decoded based on the input action description. To achieve this, we design three Cross Attention mechanisms, each identical to the Multi-Head Attention defined in the Transformer [31]:

A_t^l = CrossAttn(h_t^l, C_a),  E_t^l = CrossAttn(h_t^l, C_e),  H_t^l = CrossAttn(h_t^l, X)

where A_t^l, E_t^l, and H_t^l are contextual representations obtained by attending over all the relevant attribute knowledge, all the relevant entity knowledge, and all the input representations, respectively. The start and end word positions of the entity and of the attribute in the template are known during decoding, based on the previously generated template tokens such as of and was. Based on these contextual representations and the decoding position t in the state change template, we further apply a feed-forward layer to obtain an overall feature representation h_t^l. The feature representation from the last layer, h_t^N, is then used to predict a probability distribution over the whole vocabulary V via a softmax:

P(s_t | s_<t, x) = Softmax(W h_t^N)

where W is a learnable parameter matrix.
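The position-dependent routing among the three cross attentions can be illustrated with a toy single-query attention (one head, no learned projections, unlike the real Multi-Head Attention); the span arguments stand in for the entity/attribute template positions described above:

```python
import numpy as np

def attention(query, keys, values):
    """Single-query scaled dot-product attention over a small memory."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def contextual_representation(step_pos, query, entity_mem, attr_mem, input_mem,
                              entity_span, attr_span):
    """Pick the knowledge source by the decoder's position in the template."""
    if attr_span[0] <= step_pos <= attr_span[1]:      # predicting an attribute
        return attention(query, attr_mem, attr_mem)
    if entity_span[0] <= step_pos <= entity_span[1]:  # predicting an entity
        return attention(query, entity_mem, entity_mem)
    return attention(query, input_mem, input_mem)     # template tokens / states
```

The real model attends over all three sources in every layer and fuses them with a feed-forward layer; the hard routing here is a simplification that makes the intuition (attribute positions look at attribute knowledge, entity positions at entity knowledge) explicit.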

Constrained Decoding
At each decoding step, existing studies [7,9,21] usually predict the output token from the whole vocabulary V, which can result in outputs unrelated to the input. To improve the decoding process, we design a constrained decoding strategy that selects from V a subset of candidate tokens more related to the input action description. Specifically, given an input x and a candidate token v from V, we calculate a cosine similarity score between v and each token in x based on their contextual representations from a pre-trained T5 encoder, and use the highest score as the relevance score r_v between v and x:

r_v = max_i cos(c_v, w_i)

where x_i is the i-th word in x, and c_v and w_i are the contextual representations of v and x_i from a pre-trained T5 encoder. For each input x, after computing the relevance scores for all the candidate tokens in V, we only keep the tokens with relevance scores higher than a threshold τ, which is treated as a hyper-parameter and discussed in Section 6.1.
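The vocabulary filter can be sketched as follows, with toy vectors standing in for the T5 contextual representations:

```python
import numpy as np

def relevance_score(cand_vec, input_vecs):
    """Highest cosine similarity between a candidate token and any input token."""
    sims = [float(cand_vec @ w /
                  (np.linalg.norm(cand_vec) * np.linalg.norm(w)))
            for w in input_vecs]
    return max(sims)

def constrain_vocab(vocab_vecs, input_vecs, tau):
    """Keep only tokens whose relevance score exceeds the threshold tau."""
    return {tok for tok, vec in vocab_vecs.items()
            if relevance_score(vec, input_vecs) > tau}
```

The filter runs once per input before decoding; all subsequent softmax steps are then restricted to the surviving subset of V.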

Entity State Change Coherence Reward
An important issue with the state changes generated by baseline models such as GPT-2 [20] is that the entity, attribute, before_state, and after_state are not coherent enough or aligned with human commonsense. For example, in the generated output "composition of flowers were in dead flowers before and fresh flowers afterwards", the generated states, e.g., dead/fresh flowers, are not relevant to the attribute composition. To address this problem, we design a classifier-based automatic metric to evaluate the coherence of each entity state change and use it as a reward to improve the generation process. To train the classifier, we use all the human-annotated entity state changes in the training set of OpenPI [30] as positive instances, and create the same number of negative instances by randomly replacing the entity, attribute, before_state, or after_state of each positive instance with a concept sampled from the same slot of other positive instances. Each entity state change s is then fed as input to a T5 model, which outputs "1" if the entity state change is coherent and "0" otherwise. We fine-tune the T5-based classifier on the same number of positive and negative training instances, optimizing it with the cross-entropy objective:

L_cls = -[y log p_θ(1 | s) + (1 - y) log p_θ(0 | s)]

where y is 0 or 1, p_θ(y | s) is the probability indicating the coherence or incoherence of the entity state change s, and θ denotes the set of parameters of the classifier.
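The negative-instance construction can be sketched as a one-slot corruption of a positive instance; the function and slot names here are ours:

```python
import random

# Slots of a state change, following the template order.
SLOTS = ["attribute", "entity", "before_state", "after_state"]

def make_negative(instance, pool, rng):
    """Replace one random slot with the same slot from another instance."""
    corrupted = dict(instance)
    slot = rng.choice(SLOTS)
    # Sample a donor whose value actually differs, so the result is corrupted.
    donors = [p for p in pool if p[slot] != instance[slot]]
    corrupted[slot] = rng.choice(donors)[slot]
    return corrupted
```

Because only one slot is swapped and the slot type is preserved, the negatives are fluent but incoherent (e.g., "length of knife was clean before and dirty afterwards"), which is exactly the failure mode the classifier must learn to detect.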
After training the classifier, we use it to estimate a coherence score for each generated entity state change and compute a reward from the coherence score to further optimize the knowledge-grained encoder-decoder framework. The reward is computed as:

R(y^s) = p_θ(l_1 | y^s) - p_θ(l_0 | y^s)

where y^s is the generated target sequence sampled from the model's distribution at each time step in decoding, p_θ is the output probability of the classifier, and l_0 and l_1 denote the labels 0 and 1, respectively. We then further optimize the DKGED framework with the reward using reinforcement learning. The policy gradient is calculated as:

∇_φ L_RL = -R(y^s) ∇_φ log p_φ(y^s | x)

where φ represents the model parameters and x is the action description.
The overall learning objective for our proposed KIEST is the combination of the RL and cross-entropy (CE) losses:

L = γ L_RL + (1 - γ) L_CE

where γ ∈ [0, 1] is a tunable parameter.
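The combined objective can be sketched with a REINFORCE-style surrogate whose gradient matches the policy gradient above; scalar log-probabilities stand in for the model's token distributions:

```python
import math

def rl_loss(reward, sampled_log_probs):
    """Policy-gradient surrogate: scale the sequence log-likelihood by -R."""
    return -reward * sum(sampled_log_probs)

def combined_loss(reward, sampled_log_probs, ce_loss, gamma):
    """L = gamma * L_RL + (1 - gamma) * L_CE."""
    return gamma * rl_loss(reward, sampled_log_probs) + (1 - gamma) * ce_loss
```

Differentiating `rl_loss` with respect to the model parameters recovers -R ∇ log p(y^s | x), so minimizing it increases the likelihood of sampled sequences the classifier scores as coherent (positive reward) and decreases it otherwise.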

EXPERIMENTAL SETUP

4.1 Dataset
We evaluate our approach on OpenPI [30], which, to the best of our knowledge, is the only public benchmark dataset for open domain entity state tracking. It comprises 23,880, 1,814, and 4,249 pairs of action description and entity state change in the training, development, and test sets, respectively. We also take the following steps to correct annotation errors in OpenPI:
• We first correct all the spelling errors contained in OpenPI, such as liqour ("liquor"), skiier ("skier"), necklacce ("necklace"), apperance ("appearance"), compostion ("composition"), and so on.
• Some annotated entity state changes do not follow the template "[attribute] of [entity] was [before_state] and [after_state] afterwards", such as "flexibility of was hard before and soft afterwards" and "location of vegetable". We remove all such entity state changes from OpenPI.
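The two cleaning steps can be sketched as follows; the spelling map lists only the examples mentioned above, and the template regex is our own approximation of the conformance filter:

```python
import re

# Spelling corrections named in the text (illustrative subset).
SPELL_FIXES = {"liqour": "liquor", "skiier": "skier", "necklacce": "necklace",
               "apperance": "appearance", "compostion": "composition"}

# Approximate check that a state change follows the template.
TEMPLATE = re.compile(r"^.+ of .+ was .+ before and .+ afterwards$")

def clean(changes):
    """Fix known typos, then drop annotations that break the template."""
    cleaned = []
    for c in changes:
        for typo, fix in SPELL_FIXES.items():
            c = c.replace(typo, fix)
        if TEMPLATE.match(c):  # drop malformed annotations
            cleaned.append(c)
    return cleaned
```

Applied to the examples above, the typo "compostion" is repaired, while "flexibility of was hard before and soft afterwards" (missing entity) and "location of vegetable" (missing states) are removed.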

Baselines and Evaluation Metrics
We compare the performance of KIEST with several strong baselines based on state-of-the-art pre-trained generative models, including T5-base/large [21] and GPT-2-base/large [20]. Note that the only previous work [30] on open domain entity state tracking utilized GPT-2-base [20] to generate the state changes. We also design two variants of KIEST by removing the classifier-based coherence reward (denoted as KIEST w/o ESC), and by removing both the classifier-based coherence reward and the constrained decoding strategy (denoted as KIEST w/o ESC+CD). Following [30], we evaluate all the models with generative evaluation metrics, including Exact Match [22], BLEU-2 [19], and ROUGE-L [4].
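Exact Match over sets of generated state changes can be read as precision/recall/F1 against the gold set; this sketch is one plausible formulation, not necessarily the exact protocol of [22]/[30]:

```python
def exact_match(predicted, gold):
    """Set-based exact-match precision, recall, and F1."""
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Reporting precision and recall separately matters here: as discussed in Section 5, the coherence reward raises recall by encouraging more state changes, at some cost in precision.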

Implementation Details
In DKGED, the embedding of each node in E_x or A_x is initialized by a pre-trained BERT model (1024 dimensions). During training, we use AdamW [14] as the optimizer and cross-entropy as the loss function, with a learning rate of 0.00005. We use label smoothing [17] to prevent the model from becoming over-confident. The batch size is 6. For the overall loss function in Section 3.4, we tune the γ value in {0, 0.1, 0.3, 0.5, 0.7, 0.9} and find that KIEST works well when the weight of the RL loss is 0.1. For the constrained decoding strategy, we tune the threshold τ in {0, 0.2, 0.4, 0.6, 0.8}; our approach achieves the best performance when τ = 0.4. In the entity selection model, we use the AdamW optimizer [14] with a learning rate of 0.00002. We also use gradient clipping [37] to constrain the max L2-norm of the gradients to 1. We tune the threshold δ in {0.3, 0.5} and the number of hops n in {1, 2}, and find that our model works well when δ = 0.5 and n = 1. We use the same configuration for the attribute selection model. With the selection models, the maximum numbers of selected entities and attributes over all action descriptions are 1000 and 140, so when training our model, the maximum numbers of selected entities and attributes are set to 1000 and 140 for each action description.
To train the classifier for estimating the coherence of each entity state change, we set the dropout rate of the last linear layer to 0.1 and the batch size to 32, and use the AdamW optimizer [14] with a learning rate of 0.00002. We also apply gradient clipping [37] to constrain the maximum L2-norm of the gradients to 1. The accuracy of predicting coherence or incoherence is 98% on both the development and test sets of OpenPI, demonstrating that the classifier is accurate enough to be used as a reward function.

RESULTS AND DISCUSSION
Table 1 presents the experimental results of the various approaches on Exact Match, BLEU-2, and ROUGE-L. We have the following observations: (1) our KIEST significantly outperforms all the strong baselines and its variants across all evaluation metrics; (2) the T5-large model shows clear superiority over the GPT-2-base/large models; (3) by adding the classifier-based coherence reward, KIEST significantly improves over KIEST w/o ESC, especially on the recall of all evaluation metrics, demonstrating the effectiveness of the coherence reward in encouraging the model to generate more valid entity state changes. The overall precision drops after adding the coherence reward; we conjecture this is because KIEST tends to generate more entity state changes than KIEST w/o ESC. (4) the constrained decoding helps remove most of the noisy tokens from the target vocabulary, and thus improves precision.
For the ablation study: (1) even the variant of our model without entity knowledge (w/o entity) still achieves a significant improvement over the best generation baseline, T5-large, demonstrating that incorporating relevant entity or attribute knowledge significantly improves the generation of entity state changes; this validates our motivation that relevant entity and attribute knowledge from the knowledge graph can improve the coverage of the generated entity state changes. (2) Notably, without entity and attribute selection (KIEST w/o selection), the precision of KIEST drops significantly on all evaluation metrics, which validates our assumption that, without selection, the massive set of concepts retrieved from ConceptNet introduces noise and hurts the model's performance.
(Footnote 3: we found that 60%, 52%, and 40% of the entities in E_x belong to the training entity set when δ = 0.3, 0.5, and 0.7, respectively. To obtain abundant knowledge, we discuss the model performance when δ = 0.3 and 0.5.)
To evaluate the effectiveness of our proposed selection method in reducing storage and running time, we compare the storage space and running time of KIEST with and without selection on the OpenPI dataset. As presented in Table 2, our model without selection required 4,845 MB of storage, whereas KIEST with selection used only 3,793 MB, a 23.6% reduction in storage cost. This reduction is primarily due to the selection approach, which removes noise from the original entity and attribute sets. Furthermore, the selection method also significantly reduces the running time of both training and inference.

ANALYSIS

6.1 The Impact of Hyper-Parameters
In our proposed model, two hyper-parameters control the size of the set of relevant entities and attributes retrieved from ConceptNet: (1) δ ∈ {0.3, 0.5}, the threshold for selecting relevant entities and attributes based on their distance to the input action description, and (2) n ∈ {1, 2}, the number of hops used to retrieve the concepts related to the anchors of the action description from ConceptNet. We analyze the impact of these two hyper-parameters in Figure 4. As we can see, when δ = 0.5, KIEST selects entities and attributes that are more relevant to the input action description and consistently performs better than with δ = 0.3. In addition, n = 1 is better than n = 2, indicating that with n = 2, too many concepts are retrieved from ConceptNet, which introduces too much noise and hurts the performance of both the selection models and KIEST.
We also analyze the impact of τ ∈ {0, 0.2, 0.4, 0.6, 0.8}, the threshold for selecting tokens for the target output vocabulary. As Figure 5 shows, KIEST achieves the best performance when τ is 0.2 or 0.4. When τ < 0.2, too many candidate tokens are considered at each decoding step, leading to irrelevant output; when τ > 0.4, too few candidate tokens are included in the target vocabulary, which also negatively impacts the generation of entity state changes.

Content Richness and Coherence of Outputs
We measure the content richness of KIEST based on the generated entity state changes and compare it with two strong baseline models, GPT-2-Large and T5-Large. As shown in Figure 6, KIEST tends to generate longer outputs, e.g., more entity state changes with sequence lengths in the range of 16-31, than the baseline models.
The results imply that our knowledge-informed framework encourages the model to generate more entity state changes with higher coverage.
To evaluate the coherence of the entity state changes generated by KIEST, its variants, and the baseline models, we use the classifier trained in Section 3.4 as an automatic metric. The output of the classifier is normalized into [0, 1] with the sigmoid function. As shown in Figure 7, KIEST achieves the highest coherence compared with GPT-2 and T5, suggesting that our knowledge-informed model with the constrained decoding and coherence reward is effective in improving the coherence of entity state changes. Notably, KIEST w/o ESC+CD has a lower score than GPT-2, suggesting that simply incorporating additional entity and attribute knowledge into the language model, without any control over the decoding process, may have a negative impact on the coherence of the output. Without the classifier-based coherence reward (i.e., KIEST w/o ESC), the average coherence score of the outputs is lower than that of KIEST, demonstrating the effectiveness of the coherence reward.

Is ChatGPT a Good Open Domain Entity State Tracker?
Recently, ChatGPT from OpenAI has shown significant advances on various downstream NLP tasks. Here, we systematically compare state-of-the-art GPT models, including GPT-3.5-turbo and GPT-4, with KIEST on the open domain entity state tracking task to answer the research question: "Is ChatGPT a good open domain entity state tracker?" We first design prompts to instruct the GPT models to predict the entity state changes for each input action description. As shown in Figure 8, we design two prompts: (1) Prompt 1 is based on a manually defined task description; (2) Prompt 2 contains the task definition from [30] and 8 demonstration examples randomly selected from the OpenPI training dataset. Table 3 presents the performance of GPT-3.5-turbo and GPT-4 on the whole test set of OpenPI. Our analysis reveals that different prompts yield significantly different results for both GPT-3.5-turbo and GPT-4. Although GPT-4 outperforms GPT-3.5-turbo in terms of Exact Match and BLEU-2 scores, both models fail to achieve satisfactory results on this task compared to all the supervised models, including KIEST. These findings suggest that, without any tuning, large language models still have limited capability on this complex reasoning task.

[Table 3: Exact Match, BLEU-2, and ROUGE-L scores of GPT-3.5-turbo and GPT-4 under the two prompts.]

[Figure 8: the two prompts used to instruct the GPT models. Prompt 1 gives a manually defined task description with worked examples, e.g., "Sentence: Dip the peony flowers in water. Now what happens?" with output "the moisture of flowers was dry before and wet afterwards", and "Output: the location of insecticide was in bottle before and on peonies afterwards, the health of bugs were healthy before and dying afterwards". Prompt 2 gives the task definition from [30]: the input is a procedural text x = (x_q, x_c), where x_q is the current step (the query) and x_c is the context of all past steps; the output is a set of zero or more state changes y = {y_i}, each of the form "attr of ent was val_pre before and val_post afterwards", where attr is the attribute (state change type), ent is the changed entity, val_pre is the precondition (the state value before), and val_post is the postcondition (the state value afterwards), each of which can be an adjectival or relational phrase; attr, ent, val_pre, and val_post are open-form text, not tied to any fixed vocabulary. The prompt then provides a running example (x = (context: The window of your car is foggy, query: Rub half potato on the window), with y including "transparency of window was fogged before and partially clear afterwards") and 8 demonstration examples.]

LIMITATION
In this work, we explored various state-of-the-art large language models for the open domain entity state tracking task. Though such models are capable of generating relevant and fluent outputs given the input action description, they still cannot generate all the accurate entity state changes with high coverage. In our proposed KIEST framework, we attempted to locate the relevant entities and attributes from ConceptNet; such knowledge is still noisy, and its coverage is not satisfactory. Going forward, we plan to develop more appropriate prompts to stimulate and instruct language models such as GPT-3.5 or GPT-4 to generate more coherent entity state changes. In addition, this work only considers textual descriptions of actions, while their corresponding image or video illustrations are also beneficial for predicting the relevant entities and state changes.
In the future, we will extend entity state tracking to multimodality by leveraging the state-of-the-art multimodal pre-trained language models [33] and instruction tuning [35].

CONCLUSION
This paper introduces KIEST, a knowledge-informed framework for open domain entity state tracking. It consists of two major steps: first, retrieving and selecting relevant entity and attribute knowledge from an external knowledge base, i.e., ConceptNet; and then dynamically incorporating this knowledge into an encoder-decoder framework with an effective constrained decoding strategy and a classifier-based entity state change coherence reward.
Experimental results and extensive analysis on the public benchmark dataset OpenPI demonstrate the effectiveness of the overall framework and of each component, with significant improvements over strong baselines, including T5, GPT-2, GPT-3.5, and GPT-4.

Action Description (Query): Cut the celery into sticks. Question: Now, what happens?

Figure 1: An example demonstrating the input, expected outputs, and system outputs (from T5 [21], GPT-2 [20], and our KIEST approach) for open domain entity state tracking. The purple words in the action description highlight the keywords that are used to retrieve the additional entity and attribute knowledge, e.g., the concepts shown in the bottom entity-attribute knowledge graph. The green and orange words in the expected outputs and system outputs highlight the entities and attributes for describing the entity state changes.

Figure 2: Overview of KIEST. In the entity-attribute KG, the green, orange, and blue circles refer to entities, attributes, and noise entities, respectively. The black and blue lines refer to the relations RelatedTo and CapableOf, respectively.

Figure 3: Overview of the DKGED decoder, which dynamically incorporates additional relevant entities or attributes to generate entity state changes.

Figure 4: F1 results with different  and  settings.

Figure 6: Number of entity state changes for each action description generated by different approaches.
Figure 7:

Figure 8: Example of Prompt 1 and Prompt 2 defined for the GPT-based models.

… and formulate the task of open domain entity state tracking as follows. The input x = (x_q, x_c) consists of a context x_c that includes one or a few history action descriptions, and a query x_q that is the concatenation of a description of the current action and the short question "what happens?"; the output is a set of entity state changes s = {s_i}, where each s_i follows the template: [attribute] of [entity] was [before_state] before and [after_state] afterwards. Intuitively, when asked to write state changes given an action description, humans will always call to mind the scenarios based …
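The slot template above is regular enough to be parsed mechanically, which is also how a prediction can be compared slot-by-slot against a reference. The regex and helper below are a minimal sketch of our own, not code from KIEST:

```python
import re

# Matches the state-change template used throughout the paper:
#   [attribute] of [entity] was [before_state] before and [after_state] afterwards
# Pattern and helper are an illustrative sketch, not KIEST's implementation.
TEMPLATE = re.compile(
    r"^(?P<attribute>.+?) of (?P<entity>.+?) was "
    r"(?P<before_state>.+?) before and (?P<after_state>.+?) afterwards$"
)

def parse_state_change(text):
    """Return the four template slots as a dict, or None if no match."""
    m = TEMPLATE.match(text.strip())
    return m.groupdict() if m else None

parsed = parse_state_change(
    "cleanness of knife was clean before and dirty afterwards"
)
# parsed == {"attribute": "cleanness", "entity": "knife",
#            "before_state": "clean", "after_state": "dirty"}
```

Non-greedy quantifiers make the first " of " and " was " act as slot boundaries; entities whose surface form itself contains " of " would need a more careful parse.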

DKGED takes the action description x, the relevant entity set, and the relevant attribute set as input and generates a set of entity state changes s = {s_i}; we illustrate DKGED in Figure 3. We remove a concept from the retrieved entity and attribute sets if it is not connected with any other concept, so the relevant entity and attribute sets are subsets of the retrieved ones. After selecting the entities and attributes from ConceptNet that are relevant to the entity state changes, we further incorporate them to improve the generation of the entity state changes. To do this, we propose the Dynamic Knowledge Grained Encoder-Decoder (DKGED) framework, which dynamically selects the relevant entity or attribute knowledge when predicting the tokens of each slot in the state change template, i.e., "[attribute] of [entity] was [before_state] before and [after_state] afterwards".
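The pruning step described above (discarding a retrieved concept that is not connected to any other retrieved concept) amounts to a simple connectivity filter over the retrieved subgraph. The edge-list layout and function name below are illustrative assumptions, not KIEST's actual data structures:

```python
# Sketch of the concept-pruning step: keep a retrieved concept only if it is
# connected to at least one other retrieved concept. The edge-list layout is
# an assumption for illustration; KIEST's implementation may differ.

def prune_isolated_concepts(concepts, edges):
    """Return the subset of concepts with at least one edge to another concept."""
    connected = set()
    concept_set = set(concepts)
    for head, _relation, tail in edges:
        if head in concept_set and tail in concept_set and head != tail:
            connected.add(head)
            connected.add(tail)
    return [c for c in concepts if c in connected]

retrieved = ["celery", "knife", "length", "banana"]
edges = [
    ("celery", "RelatedTo", "knife"),
    ("celery", "RelatedTo", "length"),
]
print(prune_isolated_concepts(retrieved, edges))  # ['celery', 'knife', 'length']
```

Here "banana" is dropped because no edge links it to another retrieved concept, mirroring how noise entities are filtered before the selection models.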

Table 1: Results of various approaches for open domain entity state tracking on OpenPI.

… the approach is significantly improved, which can be seen by comparing KIEST w/o ESC+CD with KIEST w/o ESC.
Table 2: Reduction of storage and run time on the OpenPI dataset. S-Space refers to Storage Space; R-Time refers to Running Time (hours: minutes: seconds); w/o selection refers to without the selection model; w/ selection refers to with the selection model.

Table 4: Examples of prediction errors of GPT-3.5-turbo and GPT-4. Both models generate plausible entity state changes, but in some cases the changes are not aligned with the information presented in the input action description; for example, in the generations of GPT-3.5-turbo (P1) and GPT-4 (P1), the entities "hunger" and "flowers" are not relevant to the context described in the input.

System Prompt 1: You are a smart and intelligent entity state tracking (EST) system. I will provide you the definition of the elements you need to extract (the entity, the attribute, the before state, and the after state of the entity for this attribute), the sentence from which you extract the elements, and the output format with examples. [Assistant: Sure, I'm ready to help you with your EST task. Please provide me with the necessary information to get started.] "1. ent: it is the entity, the initiator of the action" "2. attr: it is the attribute, the attribute of this entity" "3. valpre: it is the before state, the before state of this attribute" "4. valpost: it is the after state, the after state of this attribute" Output format: "attr of ent was valpre before and valpost afterwards". "If no entities are present in any category, keep it None. The output is a set of zero or more state changes." Examples: 1: "Sentence: Apply insecticide to peonies. Now what happens?" …