A Mutual Information-based Metric for Temporal Expressivity and Trainability Estimation in Quantum Policy Gradient Pipelines
Abstract
In recent years, various limitations of conventional supervised learning have been identified, motivating the development of reinforcement learning and, in turn, of quantum reinforcement learning, which leverages quantum resources such as entanglement and superposition. Among reinforcement learning methodologies, the policy gradient method offers notable benefits; for instance, it allows an agent to learn without explicit knowledge of crucial properties of the environment, such as the state transition probabilities and the initial state distribution. From the perspective of learning, two indicators are widely regarded as significant: expressivity and trainability (for gradient-based methods). While numerous attempts have been made to quantify the expressivity and trainability of neural network models and parameterized quantum circuits (PQCs), efforts suited to reinforcement learning settings have so far been lacking, despite the inherent differences between conventional supervised learning and reinforcement learning. In this study, we therefore propose revising the notion of expressivity into a temporal expressivity suited to reinforcement learning dynamics, and we show that the mutual information between the action distribution and the discretized reward signal provides an upper bound on the scaled gradient norm, while also yielding an information-theoretic decomposition and a residual-aware upper bound for the proposed mutual information-based temporal expressivity and trainability metric (MI-TET). Finally, under explicit concentration assumptions, we show that MI-TET induces an assumption-based, one-sided prescreening criterion for initialization-time gradient fragility across PQC architectures.
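To make the core quantity concrete, the following is a minimal sketch of a plug-in (empirical) estimator for the mutual information between a discrete action distribution and a discretized reward signal, the quantity the abstract describes. The paper's own definition and estimator for MI-TET are not given here; the function name `estimate_mi`, the parameter `n_reward_bins`, and the toy rollout data are all illustrative assumptions.

```python
# Plug-in estimator for I(A; R) = sum_{a,r} p(a,r) * log( p(a,r) / (p(a) p(r)) )
# from sampled (action, reward) pairs. Hypothetical illustration, not the
# paper's implementation.

import numpy as np


def estimate_mi(actions: np.ndarray, rewards: np.ndarray, n_reward_bins: int = 8) -> float:
    """Plug-in estimate of I(A; R) in nats from sampled (action, reward) pairs.

    actions : integer action indices drawn from the policy, shape (N,)
    rewards : scalar rewards, shape (N,); discretized into equal-width bins
    """
    # Discretize the reward signal into n_reward_bins equal-width bins.
    edges = np.histogram_bin_edges(rewards, bins=n_reward_bins)
    reward_bins = np.digitize(rewards, edges[1:-1])  # indices in 0..n_reward_bins-1

    # Empirical joint distribution over (action, reward-bin) pairs.
    n_actions = int(actions.max()) + 1
    joint = np.zeros((n_actions, n_reward_bins))
    for a, r in zip(actions, reward_bins):
        joint[a, r] += 1.0
    joint /= joint.sum()

    # Marginals and the mutual-information sum, skipping zero-probability cells.
    p_a = joint.sum(axis=1, keepdims=True)
    p_r = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p_a @ p_r)[mask])))


# Toy usage: random rollouts from a 4-action policy whose reward correlates
# with the chosen action, so the estimated I(A; R) should be clearly positive.
rng = np.random.default_rng(0)
actions = rng.integers(0, 4, size=10_000)
rewards = actions + 0.5 * rng.standard_normal(10_000)
print(f"estimated I(A; R) ~= {estimate_mi(actions, rewards):.3f} nats")
```

Because the plug-in estimate is the KL divergence between the empirical joint and the product of its own marginals, it is nonnegative by construction; near-zero values would correspond to the gradient-fragility regime the abstract's upper bound is meant to prescreen for.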