PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Anonymous ACL Submission


Abstract

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants.

Method

PICon Framework Overview
PICon operates in two phases — Interrogation and Evaluation.

Phase 1: Interrogation

PICon operates as a black-box framework — no prior knowledge of the target persona is assumed. The interrogation proceeds through three stages:

  1. Get-to-Know
    Predefined demographic questions from the World Values Survey (WVS) — age, occupation, economic status, family composition.
  2. Main Interrogation
    The Questioner generates logically chained follow-ups derived from each preceding response, progressively narrowing the space for fabrication. Meanwhile, the Entity & Claim Extractor identifies verifiable claims and web-searchable entities (institutions, locations, organizations) for fact-checking.
  3. Retest
    The initial demographic questions are re-asked after 40 turns of intervening dialogue to capture answer drift.
What makes PICon different: each follow-up is derived from the logical implications of the previous answer — not independent or merely topically related. This chained questioning is what exposes contradictions that simpler evaluations miss.
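The three-stage protocol above can be sketched as a single session loop. This is a hypothetical illustration, not the paper's implementation: `agent`, `questioner`, and `extractor` and all of their method names are stand-ins, since the paper does not specify component interfaces.

```python
# Hypothetical sketch of one PICon interrogation session. All component
# interfaces (`answer`, `first_question`, `follow_up`, `extract`) are
# assumed names, not from the paper.
WVS_QUESTIONS = ["How old are you?", "What is your occupation?"]  # etc.

def interrogate(agent, questioner, extractor, main_turns=40):
    log, claims = [], []
    # Stage 1: Get-to-Know -- fixed WVS demographic questions.
    for q in WVS_QUESTIONS:
        log.append((q, agent.answer(q)))
    # Stage 2: Main interrogation -- each follow-up is chained to the
    # logical implications of the preceding answer.
    question = questioner.first_question(log)
    for _ in range(main_turns):
        answer = agent.answer(question)
        log.append((question, answer))
        claims += extractor.extract(answer)   # entities/claims for fact-checking
        question = questioner.follow_up(log)  # narrow the space for fabrication
    # Stage 3: Retest -- re-ask the demographics to measure answer drift.
    retest = [(q, agent.answer(q)) for q in WVS_QUESTIONS]
    return log, claims, retest
```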

Phase 2: Evaluation

The Evaluator (an LLM-as-a-Judge¹) receives the full interrogation log and scores the persona along three dimensions:

Internal Consistency (IC)

For each response $r_t$, the Evaluator checks whether it contradicts any preceding response $r_{<t}$ — not just the immediately previous one. This is what enables multi-hop contradiction detection: a statement that is consistent with each earlier statement individually may still contradict them when considered jointly. The Evaluator also flags evasive or non-substantive responses (e.g., “I’d rather not say”). IC is the harmonic mean of cooperativeness (fraction of substantive responses) and non-contradiction rate, so a persona cannot inflate its score by refusing to engage.

Formal definition

Cooperativeness. A persona agent that consistently evades questions produces no verifiable statements, making consistency unmeasurable rather than high. Cooperativeness is the fraction of turns with a substantive response:

$$S_{\mathrm{coop}} = \frac{1}{T}\sum_{t=1}^{T}\mathbb{I}(r_t = \texttt{cooperative})$$

Non-contradiction rate. Defined as one minus the fraction of responses that contradict the preceding responses. Since no verifiable statements exist before the first cooperative turn $t^*$, counting begins from that turn onward. For each subsequent response $r_t$, the Evaluator checks whether it contradicts $r_{<t}$, so that contradictions requiring multiple statements to surface can also be captured:

$$S_{\mathrm{nc}} = 1-\frac{1}{T - t^*}\sum_{t=t^*+1}^{T}\mathbb{I}(r_t \bot r_{<t})$$

where $r_t \bot r_{<t}$ denotes that $r_t$ contradicts the preceding responses.

The final internal consistency score (IC) is the harmonic mean of the two components:

$$\mathrm{IC} = \frac{2 \cdot S_{\mathrm{coop}} \cdot S_{\mathrm{nc}}}{S_{\mathrm{coop}} + S_{\mathrm{nc}}}$$
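A minimal sketch of the IC computation, assuming the Evaluator's per-turn output has been reduced to a cooperative/evasive label plus a flag for contradiction against all preceding responses; the function and variable names are ours, not the paper's:

```python
def internal_consistency(labels, contradicts):
    """labels[t] in {"cooperative", "evasive"}; contradicts[t] is True
    if response r_t contradicts the preceding responses r_{<t}."""
    T = len(labels)
    s_coop = sum(l == "cooperative" for l in labels) / T
    if s_coop == 0:
        return 0.0  # fully evasive: no verifiable statements, IC is 0
    # First cooperative turn t*; counting contradictions starts after it.
    t_star = next(t for t, l in enumerate(labels) if l == "cooperative")
    later = contradicts[t_star + 1:]  # turns t*+1 .. T (0-indexed slice)
    s_nc = (1.0 - sum(later) / len(later)) if later else 1.0
    # Harmonic mean, so evasion cannot be offset by non-contradiction.
    return 2 * s_coop * s_nc / (s_coop + s_nc)
```

With one evasive turn and one contradiction over four turns (`s_coop = 0.75`, `s_nc = 0.5`), the harmonic mean gives IC = 0.6, below either component, which is the intended penalty.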

External Consistency (EC)

During interrogation, the Entity & Claim Extractor pulls verifiable entities and claims from each response. The Questioner web-searches each entity, then asks the persona to confirm whether the retrieved evidence refers to the same entity it mentioned. For confirmed claims, the Evaluator classifies each as supported, refuted, or not enough information (NEI), following the fact-verification paradigm (Thorne et al., 2018). EC is the harmonic mean of coverage (proportion of turns with verifiable claims) and non-refutation rate (proportion of confirmed claims not contradicted by evidence).

Formal definition

Coverage. At each turn $t$, the Extractor produces a set $\mathcal{E}_t = \{(e_j, C_j)\}_j$ of entity-claim pairs from the response $r_t$, and the Questioner performs a web search to obtain evidence $v_j$ for each entity $e_j$. Coverage captures the agent’s ability to ground its responses in specific, verifiable factual details. Let $T$ be the total number of turns and $T_c = \{t \mid \mathcal{E}_t \neq \emptyset\}$ the set of turns with at least one extracted entity-claim pair:

$$c = \frac{|T_c|}{T}$$

Non-refutation rate. Following the fact-verification paradigm of Thorne et al. (2018), each claim whose entity the persona confirmed ($\tilde{r}_j = 1$) is classified as supported, refuted, or NEI against the retrieved evidence $v_j$. Let $T_v \subseteq T_c$ be the set of turns containing at least one confirmed claim, and $n_t^{\mathrm{ref}}$ the number of refuted claims in turn $t \in T_v$. The turn-level non-refutation rate normalizes by the confirmed claims extracted in that turn:

$$p_t = 1 - \frac{n_t^{\mathrm{ref}}}{\sum_j\tilde{r}_j|C_j|}$$

Unconfirmed claims ($\tilde{r}_j = 0$) and NEI labels are excluded, as our definition of consistency requires non-refutation rather than positive verification. The macro-averaged non-refutation rate over all turns with confirmed claims:

$$\bar{p} = \frac{1}{|T_v|} \sum_{t \in T_v} p_t$$

The external consistency score (EC) is the harmonic mean of non-refutation rate and coverage:

$$\mathrm{EC} = \frac{2 \cdot \bar{p} \cdot c}{\bar{p} + c}$$
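A minimal sketch of the EC computation, assuming each turn's extraction and verification results are summarized as `(confirmed, refuted)` flags per claim; the representation and names are our assumptions:

```python
def external_consistency(turn_claims):
    """turn_claims[t]: list of (confirmed, refuted) booleans, one per
    extracted claim at turn t; an empty list means no entity-claim pair."""
    T = len(turn_claims)
    covered = [claims for claims in turn_claims if claims]
    c = len(covered) / T  # coverage: turns with at least one pair
    p_vals = []
    for claims in covered:
        confirmed = [refuted for conf, refuted in claims if conf]
        if confirmed:  # turn belongs to T_v
            # 1 - (refuted / confirmed); unconfirmed claims are excluded.
            p_vals.append(1.0 - sum(confirmed) / len(confirmed))
    p_bar = sum(p_vals) / len(p_vals) if p_vals else 0.0  # macro average
    # Harmonic mean of non-refutation and coverage.
    return 2 * p_bar * c / (p_bar + c) if (p_bar + c) > 0 else 0.0
```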

Retest Consistency (RC)

The Evaluator compares the original demographic responses from Get-to-Know with the re-posed responses from Retest, measuring the fraction of answers that remain consistent after 40+ turns of intervening dialogue.

Formal definition

The Evaluator compares the original response $r_{o}^i$ and the re-posed response $r_{re}^i$ for each of the $m$ demographic questions within a single session:

$$\mathrm{RC} = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}(r_{o}^i \approx r_{re}^i)$$
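A minimal sketch of the RC computation. In PICon the approximate-match judgment ($\approx$) is made by the LLM Evaluator; a normalized string comparison stands in for it here, purely as a placeholder:

```python
def retest_consistency(originals, retests, same=None):
    """Fraction of the m demographic answers judged equivalent. `same`
    is a placeholder for the Evaluator's approximate-match judgment."""
    same = same or (lambda a, b: a.strip().lower() == b.strip().lower())
    assert len(originals) == len(retests)
    matches = sum(same(o, r) for o, r in zip(originals, retests))
    return matches / len(originals)
```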

Framework Configuration

| Component | Configuration |
| --- | --- |
| Questioner | GPT-5 |
| Entity & Claim Extractor | GPT-5.1 |
| Evaluator | Gemini-2.5-Flash |
| Session length | 50 turns (10 get-to-know + 40 main) |

The model for each agent role was selected through human evaluation.

Evaluated Targets

PICon evaluates seven groups of persona agents spanning different architectures:

| Group | Type | Architecture |
| --- | --- | --- |
| Character.ai | Commercial service | Proprietary (black-box) |
| OpenCharacter | Research | Fine-tuned |
| Consistent LLM | Research | Fine-tuned (RL) |
| Twin 2K 500 | Research | Prompting |
| DeepPersona | Research | Prompting |
| Li et al. (2025) | Research | Prompting |
| Human Simulacra | Research | RAG-based |

All groups are benchmarked against 63 real human participants recruited via snowball sampling across multiple countries (IRB-approved).

Results

Overall Performance

No persona agent group achieved a larger normalized triangle area (IC × EC × RC) than the human baseline. The three top-scoring groups rely on inference-time conditioning (prompting or RAG), while the two lowest-scoring groups are both fine-tuned models.

Figure: (a) Radar charts showing mean IC, EC, and RC for each group (dashed lines = standard deviation). (b) Normalized triangle areas as aggregate scores.

Internal Consistency

| Group | Internal Consistency (IC) | Non-contradiction | Cooperativeness |
| --- | --- | --- | --- |
| Human | 0.90 ± 0.05 | 0.94 ± 0.05 | 0.86 ± 0.07 |
| Human Simulacra | 0.79 ± 0.13 | 0.88 ± 0.09 | 0.74 ± 0.19 |
| Li et al. (2025) | 0.73 ± 0.12 | 0.97 ± 0.03 | 0.60 ± 0.17 |
| DeepPersona | 0.72 ± 0.11 | 0.97 ± 0.03 | 0.57 ± 0.13 |
| Character.ai | 0.71 ± 0.04 | 0.79 ± 0.06 | 0.66 ± 0.07 |
| Twin 2K 500 | 0.53 ± 0.16 | 0.98 ± 0.02 | 0.38 ± 0.17 |
| Consistent LLM | 0.31 ± 0.15 | 0.96 ± 0.06 | 0.20 ± 0.11 |
| OpenCharacter | 0.16 ± 0.07 | 0.54 ± 0.25 | 0.11 ± 0.05 |

Key finding: OpenCharacter and Consistent LLM report high consistency in their original studies, yet record the lowest IC under PICon. The reason: their non-contradiction rates remain moderate to high, but their cooperativeness collapses — they frequently generate responses irrelevant to the question. PICon's harmonic-mean formulation appropriately penalizes such evasion.

External Consistency

| Group | External Consistency (EC) | Non-refutation | Coverage | Discarded |
| --- | --- | --- | --- | --- |
| Human | 0.66 ± 0.07 | 0.95 ± 0.06 | 0.51 ± 0.08 | 0.18 ± 0.08 |
| Character.ai | 0.71 ± 0.07 | 0.79 ± 0.13 | 0.66 ± 0.10 | 0.10 ± 0.05 |
| Human Simulacra | 0.63 ± 0.13 | 0.89 ± 0.12 | 0.52 ± 0.15 | 0.33 ± 0.22 |
| Li et al. (2025) | 0.59 ± 0.14 | 0.98 ± 0.03 | 0.44 ± 0.17 | 0.13 ± 0.05 |
| DeepPersona | 0.54 ± 0.18 | 0.96 ± 0.04 | 0.40 ± 0.17 | 0.07 ± 0.06 |
| Consistent LLM | 0.30 ± 0.09 | 1.00 ± 0.00 | 0.18 ± 0.06 | 0.69 ± 0.10 |
| Twin 2K 500 | 0.26 ± 0.17 | 1.00 ± 0.01 | 0.16 ± 0.13 | 0.09 ± 0.09 |
| OpenCharacter | 0.15 ± 0.14 | 0.70 ± 0.49 | 0.09 ± 0.07 | 0.77 ± 0.32 |

Coverage is universally low because interrogation targets personal memories and experiences — claims that are inherently harder to verify via web search. Character.ai achieves the highest EC, likely because its personas are based on real public figures with extensive web presence.

Retest Consistency

Human Simulacra achieved the highest retest consistency (0.87 ± 0.11), closest to the human baseline. Under greedy decoding with shuffled question order, most groups maintained or improved their RC, suggesting that stochastic decoding is a primary source of retest instability.

Summary

Fine-tuning for persona does not guarantee robust consistency under chained interrogation. The results suggest that inference-time conditioning (prompting, RAG) currently offers more reliable consistency than specialized fine-tuning, though no approach matches human-level performance across all three dimensions.


¹ The Evaluator (Gemini-2.5-Flash) achieves a Gwet’s AC1 of 0.829 against human annotators, compared to the 0.885 inter-annotator agreement among the humans themselves — indicating near-human-level reliability.