Coding Agents for Academic Research

Tiago Ventura

Georgetown University, McCourt School of Public Policy

Six ways I have used Claude Code (CC)

  1. Using CC to Develop Ideas (thinking partner) — Ex: Using a Multi-Agent Tournament to select a paper for a large collaborative replication exercise
  2. Build a Second Brain (RAG-Based) to support Claude Code and Me — Ex: An MD vault with structured literature notes, guides for writing and coding style, statuses for all my papers, daily diary updates, and suggestions on what I should be working on
  3. Data-to-Paper Pipeline — Full paper writing with Claude (let me avoid sinking my own time into this project)
  4. Coding Tasks — Currently supervising a 23-country RCT; CC helps me write code, verify merge issues, adapt code across countries, and write documentation and reusable workflows
  5. Pre-Submission Peer Review — Four specialized review skills run as an orchestrated system
  6. Building Skills to Manage Tasks I Dislike — Custom skills to write cover letters for paper submissions, draft social media threads, and build presentations, all from the .tex files of my papers.

Using CC to Develop Ideas: Multi-Agent Tournament

The Problem

Coppock & McGrath (2026) launched a replication competition for survey experiments published in political science journals, and I wanted to find a survey experiment paper so I could join the team.

  • I had some papers in mind, but I wasn't sold on any of them.

  • My goal was a broad search for an interesting paper to take on for this task. But… I was severely time-constrained

  • Built a workflow in CC to:

    • Search hundreds of papers across 8 political science journals
    • Score them on a detailed rubric: statistical power, effect sizes, replicability, team fit
    • Make a defensible, well-documented decision by having the papers compete against each other
    • List the winning papers for me to use in the replication

Claude Code Workflow

CC, with my guidance, wrote a ~900-line instructions.md laying out the process for this task:

Your task (in phases):
0. Build a deep understanding of the RFP call
1. Build a deep understanding of my research
   pipeline and my co-author on this project
2. Search for high-impact survey experiments published
   across eight leading political science journals (2010-2025)
3. Evaluate papers as candidates for a replication exercise
4. Run a multi-agent tournament between the top papers

Key Rules

## IMPORTANT: Stop-and-Check Points
At each of these points, you must:
1. Summarize what you have completed
2. Present key outputs for review
3. List any issues or concerns
4. Wait for human approval before proceeding

Do not proceed past a checkpoint without explicit approval.
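
In the workflow itself this rule lives in instructions.md and Claude enforces it conversationally. As a mental model only, here is a hypothetical Python sketch of the same gate (the function and its signature are my illustration, not part of the actual setup):

```python
# Hypothetical illustration only: in practice this checkpoint is a rule
# in instructions.md that Claude follows, not a function anyone calls.
def checkpoint(phase: str, summary: str, concerns: list[str]) -> None:
    print(f"=== Checkpoint: {phase} ===")
    print(summary)                    # 1-2. what was completed + key outputs
    for c in concerns:
        print(f"- concern: {c}")      # 3. issues or concerns
    reply = input("Approve and continue? [y/N] ").strip().lower()
    if reply != "y":                  # 4. wait for explicit human approval
        raise SystemExit(f"Halted at checkpoint '{phase}' pending review.")
```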

Phase 1 — Understand the RFP and My Research Pipeline

Before searching, CC read two things:

  • The RFP (rfp.pdf): Coppock & McGrath (2026) competition rules — what counts as a valid replication, length constraints, submission format

  • My research pipeline: CV, paper statuses, and co-author profile (Nejla Asimovic — intergroup relations, polarization) — to ground the researcher fit criterion before any paper was seen

  • Stop-and-check: Claude summarized its understanding of the RFP constraints and the team’s research areas before proceeding to search.

Phase 2 — Finding the Papers

A systematic journal-by-journal search across 8 leading journals (APSR, AJPS, JoP, JEPS, Political Behavior, Political Communication, BJPS, Political Psychology), 2010–2025.

  • Filter applied at the abstract level: a paper must explicitly mention “survey experiment” — conjoint, list, field, and natural experiments were excluded (a sketch of this filter follows the list)
  • Per paper: authors, year, title, DOI, design, N conditions, N subjects, country, replication data URL
  • Output: candidate_papers.csv — ~299 papers, then hard-filtered to 285 eligible
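
A minimal pandas sketch of that hard filter, assuming the candidate list is already in a dataframe. The "abstract" column and the exact exclusion terms are my illustration; the real candidate_papers.csv schema is the one listed above:

```python
import pandas as pd

papers = pd.read_csv("candidate_papers.csv")  # schema listed above

# Keep papers whose abstract explicitly mentions "survey experiment"...
mask = papers["abstract"].str.contains("survey experiment", case=False, na=False)

# ...and drop the excluded designs.
for term in ["conjoint", "list experiment", "field experiment", "natural experiment"]:
    mask &= ~papers["abstract"].str.contains(term, case=False, na=False)

eligible = papers[mask]
print(f"{len(eligible)} eligible of {len(papers)} candidates")  # ~285 of ~299
```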

Phase 3 — Cross-Model Deep Research

Before scoring, I used GPT Deep Research to run live web searches on every candidate paper.

  • gpt-5-search-api evaluated each of the 285 papers using real-time web retrieval with deep-research
  • Each paper got a structured 6-section markdown assessment:
    • Design summary (arms, randomization, outcomes)
    • Key estimand + original effect size
    • Replication feasibility (can stimuli transfer?)
    • Data availability (public data + code?)
    • Effect sizes (Cohen’s d, SE)
    • Power assessment (underpowered?)

Output: 285 individual .md files — these files, not training data, fed the scoring step. That is effectively 285 literature reviews, for “free”.
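
A sketch of what one per-paper call might look like with the OpenAI Python SDK's Responses API and its web-search tool. The model string, tool type, and function name here are assumptions for illustration; the actual workflow used gpt-5-search-api with deep research as described above:

```python
from openai import OpenAI

client = OpenAI()

SECTIONS = """\
1. Design summary (arms, randomization, outcomes)
2. Key estimand + original effect size
3. Replication feasibility (can stimuli transfer?)
4. Data availability (public data + code?)
5. Effect sizes (Cohen's d, SE)
6. Power assessment (underpowered?)"""

def assess(citation: str) -> str:
    # The web_search tool lets the model pull live sources instead of
    # relying on training data; exact tool/model names may differ from
    # the real setup described in the slides.
    response = client.responses.create(
        model="gpt-5",
        tools=[{"type": "web_search"}],
        input=f"Research this paper using live web sources: {citation}\n"
              f"Write a structured markdown assessment with these sections:\n{SECTIONS}",
    )
    return response.output_text  # saved to one .md file per paper
```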

The Scoring Rubric

Each paper scored on 8 criteria using the deep evaluation files as input:

| Criterion                  | Weight | What it captures                           |
|----------------------------|--------|--------------------------------------------|
| S1 Theoretical importance  | 20%    | Does a null replication reshape a debate?  |
| S2 Design simplicity       | 10%    | <5 min, 2-arm, few outcomes                |
| S3 Replication feasibility | 15%    | Can stimuli transfer to a US sample?       |
| S4 Data availability       | 10%    | Public data + code?                        |
| S5 Researcher fit          | 15%    | Social media, misinformation, polarization |
| S6 Impact & visibility     | 10%    | Top journal, citations, active debate      |
| S7 Low statistical power   | 10%    | Underpowered originals are worth verifying |
| S8 Large effect sizes      | 10%    | Effects >0.3 SD worth checking             |

Top 18 by weighted total → tournament shortlist.
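
Since the weights sum to 100%, the shortlist step reduces to a weighted sum. A minimal sketch (the score keys are my own labels):

```python
# Rubric weights from the table above (keys are illustrative labels).
WEIGHTS = {
    "S1_theory": 0.20, "S2_simplicity": 0.10, "S3_feasibility": 0.15,
    "S4_data": 0.10, "S5_fit": 0.15, "S6_impact": 0.10,
    "S7_low_power": 0.10, "S8_big_effects": 0.10,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%

def weighted_total(scores: dict[str, float]) -> float:
    """Collapse the eight criterion scores into one weighted total."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# shortlist = sorted(papers, key=lambda p: weighted_total(p["scores"]),
#                    reverse=True)[:18]
```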

The Tournament

Top 18 papers compete in a two-stage multi-agent debate:

  • Champion agents (one per paper, Claude Opus): write a 5–6 page advocacy brief arguing why their paper is the best replication candidate
  • Judge agent (Claude Opus or GPT-5.2): reads the two briefs in each head-to-head match and picks a winner with written reasoning
  • Stage 1: 18 papers → 3 groups of 6 → top 2 per group advance
  • Stage 2: Full round-robin among the 6 finalists (15 matches) → ranked by win record (bracket logic sketched below)
  • Three independent trials with different models and filters — P176 (Druckman et al. 2022) placed in the top 3 in all three
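
A minimal sketch of the bracket logic, assuming a round-robin within each Stage 1 group (the slide does not specify the within-group pairing rule) and a judge() stand-in for the judge agent:

```python
from itertools import combinations
import random

def judge(paper_a: str, paper_b: str) -> str:
    """Stand-in for the judge agent reading two advocacy briefs."""
    raise NotImplementedError("call the judge model on both briefs")

def run_tournament(shortlist: list[str]) -> list[tuple[str, int]]:
    assert len(shortlist) == 18
    random.shuffle(shortlist)
    groups = [shortlist[i::3] for i in range(3)]  # 3 groups of 6

    # Stage 1: top 2 per group advance (within-group round-robin assumed).
    finalists = []
    for group in groups:
        wins = {p: 0 for p in group}
        for a, b in combinations(group, 2):
            wins[judge(a, b)] += 1
        finalists += sorted(group, key=wins.get, reverse=True)[:2]

    # Stage 2: full round-robin among the 6 finalists -> 15 matches.
    record = {p: 0 for p in finalists}
    for a, b in combinations(finalists, 2):
        record[judge(a, b)] += 1
    return sorted(record.items(), key=lambda kv: kv[1], reverse=True)
```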

Prompt Refinement: Real Conversation Excerpts

The instruction file evolved through dialogue:

“Even though the RFP talks about replication + extension, I want your focus to be only on the replication” — scoping

“Use deep-research from OpenAI via their API to summarize all the papers” — reduces hallucination

“Add as criteria to the soft scoring: a) experiment ran with low statistical power; b) effect sizes are larger than 0.3 sd” — iterative rubric refinement

“Save for me two instructions.md files, one with the multi model competition, and one without it” — multi-trial design

Full Workflow

Case Study 1: Lessons

  • The prompt was ~900 lines:

    • My input to CC was much shorter. CC wrote most of the prompt
    • Plan-Mode + Refinement + Cross-checked with GPT.
  • Checkpoint architecture works: mandatory stops kept Claude on track across a very long task

  • Multi-model tournament: different models have different evaluation biases; running several with CC + plugins is relatively easy.

  • Literature reviews via web search sometimes hallucinated paper details: deep research helps, and working from the PDFs is better still

  • Takeaway: not a replacement for my own judgment, but it broadened the set of options I considered + opens up more diversity in the research pipeline

Thank you!