Projects

Industry-focused builds, RL work, and evaluation tooling.

Frontier Evals

Frontier Benchmark Suites for Agentic AI

Evaluation · Handshake AI · 2025–Present

Large-scale evaluation infrastructure for frontier LLMs and agents.

  • Benchmark suites for agentic reasoning, tool use, and long-horizon tasks.
  • Scalable infra measuring capabilities, robustness, and safety failure modes.
  • Led research: benchmark design, data curation, evaluation methodology.
BankerToolBench

BankerToolBench: AI Agents in Investment Banking

Evaluation · 2026

End-to-end benchmark for AI agents in professional investment banking workflows.

  • Covers real-world tasks: deal sourcing, financial analysis, report generation.
  • Evaluates tool use, multi-step reasoning, and domain accuracy.
BrowserART

Browser Agent Red-teaming Toolkit

Agents & Safety · 2024–25

Safety benchmark for LLM web agents.

  • 100+ adversarial behaviors across 40 sandboxed sites.
  • Harness + reports for reproducible evaluations.
  • Surfaced jailbreak classes; informed mitigations.
Reasoning RL

Reasoning Datasets & RL Training

RL for LLMs · 2024–25

Curricula + reward design for math/STEM reasoning.

  • Built custom datasets and staged reward schemata.
  • Improved pass@k on targeted benchmarks.
  • Production training pipelines with partners.
Curriculum · RLAIF/RLHF · Eval
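
For readers unfamiliar with the pass@k metric mentioned above, here is a minimal sketch of the widely used unbiased estimator: given n sampled solutions of which c pass, it estimates the chance that at least one of k drawn samples passes. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not taken from the project's codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples drawn, c: samples that passed, k: budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Calling `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages the per-problem estimates, so a benchmark score is just the mean over its problems.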
Adaptive Guidance

Adaptive Guidance for RL of Reasoning Models

RL for LLMs · 2025 (under review)

Guided training signals to accelerate reasoning RL.

  • Stability and sample-efficiency improvements.
  • Reduces reward hacking via staged curricula.
Rubrics

Rubrics as Rewards: RL Beyond Verifiable Domains

RL for LLMs · ICLR 2026

Rubric-driven rewards to train models where exact verification is hard.

  • Task-specific rubrics (clarity, safety, usefulness) as reward signals.
  • Reduces reliance on ground-truth labels; aligns with evaluator preferences.
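
One simple way a rubric could be turned into a scalar reward is a weighted mean of per-criterion scores. The sketch below assumes a judge (human or model, not shown) has already scored each criterion in [0, 1]; `RubricItem`, `rubric_reward`, and the weights are hypothetical and not necessarily how the paper aggregates rubric scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    name: str      # criterion name, e.g. "clarity"
    weight: float  # relative importance of this criterion


def rubric_reward(items: list[RubricItem], scores: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each clipped to [0, 1]."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric needs positive total weight")
    weighted = 0.0
    for item in items:
        # A criterion the judge did not score counts as 0.
        score = min(1.0, max(0.0, scores.get(item.name, 0.0)))
        weighted += item.weight * score
    return weighted / total_weight


# Assumed example rubric using the criteria named above.
RUBRIC = [
    RubricItem("clarity", 1.0),
    RubricItem("safety", 2.0),
    RubricItem("usefulness", 1.0),
]
```

Normalizing by the total weight keeps the reward in [0, 1] regardless of how many criteria a task-specific rubric contains.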
ToolRL-Val

Tool-RL Data Valuation (ToolRL-Val)

RL for LLMs · 2025 (in progress)

Data valuation for tool-using LLMs to guide RL training and curation.

QGFN

QGFN: Controllable Greediness

Exploration & Generative RL · NeurIPS 2024

Action-value modulation for diverse high-reward discovery.

  • Mixture policies with action-value guidance.
  • ~4× more distinct high-reward modes on benchmarks.
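
As a rough illustration of a mixture policy with action-value guidance, the sketch below blends a sampling policy with a greedy choice over action values: with probability p the highest-value action is taken, otherwise an action is drawn from the policy. This is one simple form such a mixture could take, not necessarily the paper's exact formulation; `p_greedy_mixture` and its arguments are illustrative names.

```python
def p_greedy_mixture(
    policy_probs: list[float], q_values: list[float], p: float
) -> list[float]:
    """Action distribution (1 - p) * policy + p * one_hot(argmax Q)."""
    if not policy_probs or len(policy_probs) != len(q_values):
        raise ValueError("need one Q-value per action")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Index of the action with the highest action value.
    greedy = max(range(len(q_values)), key=q_values.__getitem__)
    mixed = [(1.0 - p) * prob for prob in policy_probs]
    mixed[greedy] += p  # shift mass p onto the greedy action
    return mixed
```

Setting p to 0 recovers the original policy and p to 1 gives a fully greedy one, so p acts as a single knob for greediness.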
Replay

Replay Buffers for Mode Discovery

Exploration & Generative RL · ICML 2023 Workshop

Ablations on buffer policies for generative exploration.

  • Improved mode coverage vs. baselines.
  • Open scripts for reproducibility.
DeepVent

DeepVent: Clinical RL

Applied RL · AAAI 2023 / RLDM 2022

Conservative RL for ventilator personalization.

  • Offline clinical data; safety-aware training.
  • Equal contribution; peer-reviewed results.
Data Review

Automated Multi-Agent Data Review

Data & Systems · 2024–25

Pipeline that filters low-quality code examples in real time.

  • Multiple reviewers (heuristic + model-based) with quorum rules.
  • Streaming moderation; audit logs & dashboards.
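
A quorum rule over several reviewers can be sketched as a simple vote count: an example is kept only if at least a set number of reviewers approve it. The reviewers below are toy stand-ins (the model-based one is a placeholder heuristic), and all names and thresholds are assumptions rather than the pipeline's actual checks.

```python
from typing import Callable

Reviewer = Callable[[str], bool]  # returns True to approve an example


def not_empty(code: str) -> bool:
    """Heuristic reviewer: reject blank examples."""
    return bool(code.strip())


def reasonable_length(code: str) -> bool:
    """Heuristic reviewer: reject very long examples (assumed 200-line cap)."""
    return len(code.splitlines()) <= 200


def model_stub(code: str) -> bool:
    """Placeholder for a model-based reviewer; a real one would call a model."""
    return "def " in code or "class " in code


def review(
    code: str, reviewers: list[Reviewer], quorum: int
) -> tuple[bool, list[bool]]:
    """Return (accepted, individual votes); the votes can feed an audit log."""
    if not 1 <= quorum <= len(reviewers):
        raise ValueError("quorum must be between 1 and the number of reviewers")
    votes = [reviewer(code) for reviewer in reviewers]
    return sum(votes) >= quorum, votes
```

Returning the individual votes alongside the decision makes it straightforward to record why an example was accepted or dropped.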
GFN Feedback

Feedback Usefulness Detection (GFN)

Data & Systems · NVIDIA 2022

Decision model to detect useful user feedback.

  • End-to-end pipeline integrated with team workflows.
  • Telemetry + behavior features; statistical analyses.
Vision Retrieval

Multimodal Small-Object Retrieval

Data & Systems · 2023

Upgraded retrieval stack for small objects.

  • Object-level similarity search; dev-friendly endpoints.
PGiF

Policy Gradients Incorporating the Future

Exploration & RL · ICLR 2022

Improves policy gradient methods by looking ahead to future state values.