PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency
Anonymous ACL Submission
Abstract
Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants.
Method
Phase 1: Interrogation
PICon operates as a black-box framework — no prior knowledge of the target persona is assumed. The interrogation proceeds through three stages:
- **Get-to-Know**: Predefined demographic questions from the World Values Survey (WVS) — age, occupation, economic status, family composition.
- **Main Interrogation**: The Questioner generates logically chained follow-ups derived from each preceding response, progressively narrowing the space for fabrication. Meanwhile, the Entity & Claim Extractor identifies verifiable claims and web-searchable entities (institutions, locations, organizations) for fact-checking.
- **Retest**: The initial demographic questions are re-asked after 40 turns of intervening dialogue to capture answer drift.
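The three stages above can be sketched as a single driver loop. This is an illustrative sketch only: the role interfaces (`ask`, `follow_up`, `extract`) and the 40-turn budget are assumptions, not PICon's actual API.

```python
def interrogate(persona, questioner, extractor, wvs_questions, n_turns=40):
    """Run the three-stage PICon interrogation against one persona agent.

    Hypothetical sketch: `persona.ask`, `questioner.follow_up`, and
    `extractor.extract` stand in for the framework's real interfaces.
    """
    log, claims = [], []
    # Stage 1: Get-to-Know — fixed WVS demographic questions.
    for q in wvs_questions:
        log.append((q, persona.ask(q)))
    # Stage 2: Main Interrogation — chained follow-ups plus claim extraction.
    for _ in range(n_turns):
        q = questioner.follow_up(log)        # derived from all prior answers
        r = persona.ask(q)
        claims.extend(extractor.extract(r))  # web-searchable entities/claims
        log.append((q, r))
    # Stage 3: Retest — re-ask the demographics to measure answer drift.
    retest = [(q, persona.ask(q)) for q, _ in log[:len(wvs_questions)]]
    return log, claims, retest
```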
Phase 2: Evaluation
The Evaluator (an LLM-as-a-Judge¹) receives the full interrogation log and scores the persona along three dimensions:
Internal Consistency (IC)
For each response $r_t$, the Evaluator checks whether it contradicts any preceding response $r_{<t}$ — not just the immediately previous one. This is what enables multi-hop contradiction detection: a statement that is consistent with each earlier statement individually may still contradict them when considered jointly. The Evaluator also flags evasive or non-substantive responses (e.g., “I’d rather not say”). IC is the harmonic mean of cooperativeness (fraction of substantive responses) and non-contradiction rate, so a persona cannot inflate its score by refusing to engage.
Formal definition
Cooperativeness. A persona agent that consistently evades questions produces no verifiable statements, making consistency unmeasurable rather than high. Cooperativeness is the fraction of turns with a substantive response:
$$S_{\mathrm{coop}} = \frac{1}{T}\sum_{t=1}^{T}\mathbb{I}(r_t = \texttt{cooperative})$$
Non-contradiction rate. Defined as one minus the fraction of responses that contradict the preceding responses. Since no verifiable statements exist before the first cooperative turn $t^*$, counting begins from that turn onward. For each subsequent response $r_t$, the Evaluator checks whether it contradicts $r_{<t}$, so that contradictions requiring multiple statements to surface can also be captured:
$$S_{\mathrm{nc}} = 1-\frac{1}{T - t^*}\sum_{t=t^*+1}^{T}\mathbb{I}(r_t \bot r_{<t})$$
where $r_t \bot r_{<t}$ denotes that $r_t$ contradicts the preceding responses.
The final internal consistency score (IC) is the harmonic mean of the two components:
$$\mathrm{IC} = \frac{2 \cdot S_{\mathrm{coop}} \cdot S_{\mathrm{nc}}}{S_{\mathrm{coop}} + S_{\mathrm{nc}}}$$
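The IC computation can be sketched as follows, assuming the Evaluator emits one label per turn. The label vocabulary (`cooperative`, `evasive`, `contradiction`) is illustrative, and a contradictory turn is counted as substantive for cooperativeness, which matches the separation of the two components above.

```python
def internal_consistency(labels):
    """IC = harmonic mean of cooperativeness and non-contradiction rate.

    `labels` holds one Evaluator verdict per turn; the label names are
    an illustrative assumption, not the paper's exact output schema.
    """
    T = len(labels)
    # Cooperativeness: fraction of substantive (non-evasive) turns.
    s_coop = sum(1 for l in labels if l != "evasive") / T
    # First cooperative turn t*; with no substantive turn, IC is zero.
    try:
        t_star = next(i for i, l in enumerate(labels) if l != "evasive")
    except StopIteration:
        return 0.0
    later = labels[t_star + 1:]
    if not later:  # a single substantive turn has nothing to contradict
        s_nc = 1.0
    else:
        s_nc = 1 - sum(1 for l in later if l == "contradiction") / len(later)
    if s_coop + s_nc == 0:
        return 0.0
    # Harmonic mean: evasion drags IC down even if contradictions are rare.
    return 2 * s_coop * s_nc / (s_coop + s_nc)
```

Note how a persona that evades every other question keeps a high $S_{\mathrm{nc}}$ but still ends up with a low IC, which is the intended behavior of the harmonic mean.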
External Consistency (EC)
During interrogation, the Entity & Claim Extractor pulls verifiable entities and claims from each response. The Questioner web-searches each entity, then asks the persona to confirm whether the retrieved evidence refers to the same entity it mentioned. For confirmed claims, the Evaluator classifies each as supported, refuted, or not enough information (NEI), following the fact-verification paradigm (Thorne et al., 2018). EC is the harmonic mean of coverage (proportion of turns with verifiable claims) and non-refutation rate (proportion of confirmed claims not contradicted by evidence).
Formal definition
Coverage. At each turn $t$, the Extractor produces entity-claim pairs $\{(e_j, C_j)\}_j$ from the response $r_t$, and the Questioner performs a web search to obtain evidence $v_j$. Coverage captures the agent’s ability to ground its responses in specific, verifiable factual details. Let $T$ be the total number of turns and $T_c = \{t \mid \mathcal{E}_t \neq \emptyset\}$ be the set of turns with at least one extracted entity-claim pair:
$$c = \frac{|T_c|}{T}$$
Non-refutation rate. Following the fact-verification paradigm of Thorne et al. (2018), each confirmed claim ($\tilde{r}_j = 1$) is classified as supported, refuted, or NEI against the retrieved evidence $v_j$. Let $T_v \subseteq T_c$ be the set of turns containing at least one confirmed claim, and $n_t^{\mathrm{ref}}$ the number of refuted claims in turn $t \in T_v$. The turn-level non-refutation rate is:
$$p_t = 1 - \frac{n_t^{\mathrm{ref}}}{\sum_j\tilde{r}_j|C_j|}$$
Unconfirmed claims ($\tilde{r}_j = 0$) and NEI labels are excluded, as our definition of consistency requires non-refutation rather than positive verification. The macro-averaged non-refutation rate over all turns with confirmed claims is:
$$\bar{p} = \frac{1}{|T_v|} \sum_{t \in T_v} p_t$$
The external consistency score (EC) is the harmonic mean of non-refutation rate and coverage:
$$\mathrm{EC} = \frac{2 \cdot \bar{p} \cdot c}{\bar{p} + c}$$
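A minimal sketch of the EC computation, assuming each turn's extracted claims arrive as a flat list of verdicts. The verdict vocabulary, including an `unconfirmed` value for entities the persona did not acknowledge, is an assumption for illustration.

```python
def external_consistency(turns):
    """EC = harmonic mean of coverage and macro-averaged non-refutation.

    Each element of `turns` lists the verdicts for that turn's extracted
    claims: 'supported', 'refuted', 'nei', or 'unconfirmed'. An empty
    list means no entity-claim pair was extracted that turn.
    """
    T = len(turns)
    # Coverage c: turns with at least one extracted entity-claim pair (T_c).
    c = sum(1 for v in turns if v) / T if T else 0.0
    # Turn-level non-refutation p_t over turns with confirmed claims (T_v).
    p_vals = []
    for verdicts in turns:
        confirmed = [v for v in verdicts if v != "unconfirmed"]
        if confirmed:
            refuted = sum(1 for v in confirmed if v == "refuted")
            p_vals.append(1 - refuted / len(confirmed))
    if not p_vals or c == 0:
        return 0.0
    p_bar = sum(p_vals) / len(p_vals)  # macro-average over T_v
    return 2 * p_bar * c / (p_bar + c)
```

Consistent with the definition above, NEI verdicts count toward the confirmed denominator but never toward refutations, so they lower neither $p_t$ nor coverage.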
Retest Consistency (RC)
The Evaluator compares the original demographic responses from Get-to-Know with the re-posed responses from Retest, measuring the fraction of answers that remain consistent after 40+ turns of intervening dialogue.
Formal definition
The Evaluator compares the original response $r_{o}^i$ and the re-posed response $r_{re}^i$ for each of the $m$ demographic questions within a single session:
$$\mathrm{RC} = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}(r_{o}^i \approx r_{re}^i)$$
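RC reduces to an averaged equivalence check over the $m$ demographic question pairs. In the sketch below, a normalized string comparison stands in for the Evaluator's semantic judgment ($\approx$); the real framework judges equivalence with the LLM Evaluator, not string matching.

```python
def retest_consistency(original, retest, match=None):
    """RC: fraction of demographic answers judged equivalent on retest.

    `match` is a stand-in for the Evaluator's semantic-equivalence
    judgment; case- and whitespace-insensitive string match is only
    a placeholder default.
    """
    if match is None:
        match = lambda a, b: a.strip().lower() == b.strip().lower()
    assert len(original) == len(retest), "one retest answer per question"
    return sum(match(o, r) for o, r in zip(original, retest)) / len(original)
```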
Framework Configuration
The LLM backbone for each agent role (Questioner, Entity & Claim Extractor, Evaluator) was selected through human evaluation.
Evaluated Targets
PICon evaluates seven groups of persona agents spanning different architectures:
| Group | Type | Architecture |
|---|---|---|
| Character.ai | Commercial service | Proprietary (black-box) |
| OpenCharacter | Research | Fine-tuned |
| Consistent LLM | Research | Fine-tuned (RL) |
| Twin 2K 500 | Research | Prompting |
| DeepPersona | Research | Prompting |
| Li et al. (2025) | Research | Prompting |
| Human Simulacra | Research | RAG-based |
All groups are benchmarked against 63 real human participants recruited via snowball sampling across multiple countries (IRB-approved).
Results
Overall Performance
No persona agent group achieved a larger normalized triangle area, spanned by the three dimensions (IC, EC, RC), than the human baseline. The three top-scoring groups rely on inference-time conditioning (prompting or RAG), while the two lowest-scoring groups are both fine-tuned models.
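One way to compute such a triangle area is from a radar plot whose three axes sit 120° apart; the exact normalization the framework uses is not specified here, so dividing by the all-ones area is an assumed convention in this sketch.

```python
import math

def radar_triangle_area(ic, ec, rc, normalize=True):
    """Area of the triangle spanned by three scores on a radar chart.

    Axes are 120 degrees apart; the triangle area is
    0.5 * sin(120°) * (ic*ec + ec*rc + rc*ic). Normalizing by the
    area at IC = EC = RC = 1 is an assumption, not the paper's
    stated convention.
    """
    pairs = ic * ec + ec * rc + rc * ic
    area = 0.5 * math.sin(2 * math.pi / 3) * pairs
    if normalize:
        area /= 0.5 * math.sin(2 * math.pi / 3) * 3  # all-ones area
    return area
```

Under this convention the normalized area reduces to $(\mathrm{IC}\cdot\mathrm{EC} + \mathrm{EC}\cdot\mathrm{RC} + \mathrm{RC}\cdot\mathrm{IC})/3$, so a collapse on any single dimension sharply shrinks the area.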
Internal Consistency
| Group | Internal Consistency (IC) | Non-contradiction | Cooperativeness |
|---|---|---|---|
| Human | 0.90 ± 0.05 | 0.94 ± 0.05 | 0.86 ± 0.07 |
| Human Simulacra | 0.79 ± 0.13 | 0.88 ± 0.09 | 0.74 ± 0.19 |
| Li et al. (2025) | 0.73 ± 0.12 | 0.97 ± 0.03 | 0.60 ± 0.17 |
| DeepPersona | 0.72 ± 0.11 | 0.97 ± 0.03 | 0.57 ± 0.13 |
| Character.ai | 0.71 ± 0.04 | 0.79 ± 0.06 | 0.66 ± 0.07 |
| Twin 2K 500 | 0.53 ± 0.16 | 0.98 ± 0.02 | 0.38 ± 0.17 |
| Consistent LLM | 0.31 ± 0.15 | 0.96 ± 0.06 | 0.20 ± 0.11 |
| OpenCharacter | 0.16 ± 0.07 | 0.54 ± 0.25 | 0.11 ± 0.05 |
Key finding: OpenCharacter and Consistent LLM report high consistency in their original studies, yet record the lowest IC under PICon. The reason is that their cooperativeness collapses (0.11 and 0.20, respectively): both frequently generate responses irrelevant to the question, so even a reasonable non-contradiction rate cannot rescue the score. PICon's harmonic-mean formulation appropriately penalizes such evasion.
External Consistency
| Group | External Consistency (EC) | Non-refutation | Coverage | Discarded |
|---|---|---|---|---|
| Human | 0.66 ± 0.07 | 0.95 ± 0.06 | 0.51 ± 0.08 | 0.18 ± 0.08 |
| Character.ai | 0.71 ± 0.07 | 0.79 ± 0.13 | 0.66 ± 0.10 | 0.10 ± 0.05 |
| Human Simulacra | 0.63 ± 0.13 | 0.89 ± 0.12 | 0.52 ± 0.15 | 0.33 ± 0.22 |
| Li et al. (2025) | 0.59 ± 0.14 | 0.98 ± 0.03 | 0.44 ± 0.17 | 0.13 ± 0.05 |
| DeepPersona | 0.54 ± 0.18 | 0.96 ± 0.04 | 0.40 ± 0.17 | 0.07 ± 0.06 |
| Consistent LLM | 0.30 ± 0.09 | 1.00 ± 0.00 | 0.18 ± 0.06 | 0.69 ± 0.10 |
| Twin 2K 500 | 0.26 ± 0.17 | 1.00 ± 0.01 | 0.16 ± 0.13 | 0.09 ± 0.09 |
| OpenCharacter | 0.15 ± 0.14 | 0.70 ± 0.49 | 0.09 ± 0.07 | 0.77 ± 0.32 |
Coverage is universally low because interrogation targets personal memories and experiences — claims that are inherently harder to verify via web search. Character.ai achieves the highest EC, likely because its personas are based on real public figures with extensive web presence.
Retest Consistency
Human Simulacra achieved the highest retest consistency (0.87 ± 0.11), closest to the human baseline. Under greedy decoding with shuffled question order, most groups maintained or improved their RC, suggesting that stochastic decoding is a primary source of retest instability.
Summary
Fine-tuning for persona does not guarantee robust consistency under chained interrogation. The results suggest that inference-time conditioning (prompting, RAG) currently offers more reliable consistency than specialized fine-tuning, though no approach matches human-level performance across all three dimensions.
¹ The Evaluator (Gemini-2.5-Flash) achieves a Gwet’s AC1 of 0.829 against human annotators, compared to 0.885 inter-annotator agreement — indicating near-human-level reliability.