PhD Thesis Defense: Gregory Hyde

Thursday, October 30
10:00am - 12:30pm ET

Jackson Conference Room / Online

"Reward Representational Learning"

Abstract

The Markov decision process (MDP) often feels like a square peg forced into the round hole of a non-Markov world. While the MDP framework provides a clean foundation for evaluating agent performance, real-world phenomena rarely conform to such neatly defined structures. Humans, by contrast, are adept at abstracting sequential experience into representations that integrate information across time, revealing temporal regularities. We call this process world building, and at its center lies reward. In this setting, we posit that reward precedes the representations used to predict it; thus, predicting reward is initially a non-Markov task that requires constructing suitable representations for reward prediction on the fly. Yet reward functions under MDP conventions are ill-suited for this task due to their shallow and physically entangled dynamics.
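
In standard reinforcement-learning notation (textbook convention, not notation taken from the thesis), the contrast drawn above is the following: the conventional MDP reward depends only on the current state-action pair, whereas a non-Markov reward signal depends on the full interaction history, which is what forces an agent to construct a representation summarizing that history.

    % Markov reward, the standard MDP assumption: the scalar depends only on (s_t, a_t)
    \[ r_t = R(s_t, a_t) \]
    % Non-Markov reward: the scalar depends on the whole history h_t, so predicting it
    % requires building a compact summary (representation) of that history
    \[ r_t = R(h_t), \qquad h_t = (s_0, a_0, s_1, a_1, \ldots, s_t, a_t) \]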

To address these limitations, we introduce the abstract reward Markov decision process (ARMDP), which formalizes world building as the process by which agents abstract sequential experience into compact Markov models for reward prediction. An ARMDP comprises two components: (1) an observation model (OM) that defines the agent's perceptual interface with the environment, and (2) an abstract reward model (ARM) that encodes structural and temporal regularities of reward into intrinsic agent states, defined within a representational space disentangled from the physical world. Agents acquire ARMs through reward representational learning (RRL), an active-learning framework in which ARMs are continuously refined and interrogated entirely through agent experience. We evaluate RRL in both reinforcement learning (RL) and inverse RL (IRL) scenarios.
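
As a rough way to picture the OM/ARM split described above (a minimal sketch under my own assumptions: the class names, the finite agent-state space, and the tabular lookup are illustrative and not the formalization defended in the thesis), an abstract reward model can be imagined as a small state machine that is driven by observations and emits reward predictions, while the observation model supplies the observations themselves:

    # Hypothetical sketch of an ARMDP-style split between an observation model (OM)
    # and an abstract reward model (ARM). Names and structure are assumptions made
    # for illustration, not the construction from the thesis.
    from dataclasses import dataclass, field
    from typing import Dict, Hashable, Tuple

    Obs = Hashable      # whatever the observation model emits to the agent
    AgentState = int    # intrinsic "agent state" index (assumed finite here)

    @dataclass
    class AbstractRewardModel:
        """Maps (agent_state, observation) to (next agent_state, predicted reward)."""
        transitions: Dict[Tuple[AgentState, Obs], AgentState] = field(default_factory=dict)
        reward: Dict[Tuple[AgentState, Obs], float] = field(default_factory=dict)
        state: AgentState = 0

        def step(self, obs: Obs) -> float:
            """Advance the intrinsic agent state and return the predicted reward."""
            key = (self.state, obs)
            r_hat = self.reward.get(key, 0.0)                    # predicted reward for this step
            self.state = self.transitions.get(key, self.state)   # stay put if unseen
            return r_hat

In this picture, a reward-representational-learning loop would refine the transition and reward tables whenever the predicted reward disagrees with the reward actually received, which is one way to read "continuously refined and interrogated entirely through agent experience."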

While RRL expands the temporal and structural form of reward, our experiments reveal a further limitation of the MDP: the interpretation of the reward scalar remains ambiguous. To address this, we propose an information-theoretic reframing in which reward is treated as an information residual reflecting the difference between an agent's current certainty and its anticipated certainty horizon. This perspective grounds reward in the agent's epistemic dynamics, establishing a principled basis for scale and semantic consistency that can be predictably applied across tasks and domains. Together, these contributions form a cohesive treatment of reward that spans its temporal, structural, and scalar dimensions, offering a unified conceptual foundation for extending the MDP framework.
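
One illustrative reading of the phrasing above (a guess at the general form, not an equation taken from the thesis) writes the information residual as the gap between the entropy of the agent's current epistemic state and the entropy it anticipates at some horizon:

    % Illustrative only: b_t denotes the agent's current epistemic state (e.g., a belief),
    % H is Shannon entropy, and k is the anticipated horizon; all three symbols are
    % introduced here for exposition and do not come from the abstract.
    \[ r_t \;\propto\; H(b_t) \;-\; \mathbb{E}\big[ H(b_{t+k}) \big] \]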

Thesis Committee

  • Eugene Santos Jr. (Chair)
  • George Cybenko
  • Vikrant Vaze
  • Dr. Dasgupta (Naval Research Lab, Washington, DC)

Contact

For more information, contact Thayer Registrar at thayer.registrar@dartmouth.edu.