Projects
Industry-focused builds, RL work, and evaluation tooling.
Frontier Benchmark Suites for Agentic AI
Large-scale evaluation infrastructure for frontier LLMs and agents.
- Benchmark suites for agentic reasoning, tool use, and long-horizon tasks.
- Scalable infra measuring capabilities, robustness, and safety failure modes.
- Led research: benchmark design, data curation, evaluation methodology.
BankerToolBench: AI Agents in Investment Banking
End-to-end benchmark for AI agents in professional investment banking workflows.
- Covers real-world tasks: deal sourcing, financial analysis, report generation.
- Evaluates tool use, multi-step reasoning, and domain accuracy.
Browser Agent Red-teaming Toolkit
Safety benchmark for LLM web agents.
- 100+ adversarial behaviors across 40 sandboxed sites.
- Harness + reports for reproducible evaluations.
- Surfaced jailbreak classes; informed mitigations.
Reasoning Datasets & RL Training
Curricula + reward design for math/STEM reasoning.
- Built custom datasets and staged reward schemata.
- Improved pass@k on targeted benchmarks.
- Production training pipelines with partners.
Adaptive Guidance for RL of Reasoning Models
Guided training signals to accelerate reasoning RL.
- Stability and sample-efficiency improvements.
- Reduces reward hacking via staged curricula.
Rubrics as Rewards: RL Beyond Verifiable Domains
Rubric-driven rewards to train models where exact verification is hard.
- Task-specific rubrics (clarity, safety, usefulness) as reward signals.
- Reduces reliance on ground-truth labels; aligns with evaluator preferences.
Tool-RL Data Valuation (ToolRL-Val)
Data valuation for tool-using LLMs to guide RL training and curation.
QGFN — Controllable Greediness
Action-value modulation for diverse high-reward discovery.
- Mixture policies with action-value guidance.
- ~4× more distinct high-reward modes on benchmarks.
Replay Buffers for Mode Discovery
Ablations on buffer policies for generative exploration.
- Improved mode coverage vs. baselines.
- Open scripts for reproducibility.
DeepVent — Clinical RL
Conservative RL for ventilator personalization.
- Offline clinical data; safety-aware training.
- Equal contribution; peer-reviewed results.
Automated Multi-Agent Data Review
Pipeline that filters low-quality code examples in real-time.
- Multiple reviewers (heuristic + model-based) with quorum rules.
- Streaming moderation; audit logs & dashboards.
Feedback Usefulness Detection (GFN)
Decision model to detect useful user feedback.
- End-to-end pipeline integrated with team workflows.
- Telemetry + behavior features; statistical analyses.
Multimodal Small-Object Retrieval
Upgraded retrieval stack for small objects.
- Object-level similarity search; dev-friendly endpoints.
Policy Gradients Incorporating the Future
Improves PG by looking ahead to future state values.