Projects | Elaine Lau

Frontier Evals

Frontier Benchmark Suites for Agentic AI

Evaluation · Handshake AI · 2025–Present

Large-scale evaluation infrastructure for frontier LLMs and agents.

Benchmark suites for agentic reasoning, tool use, and long-horizon tasks.
Scalable infra measuring capabilities, robustness, and safety failure modes.
Led research: benchmark design, data curation, evaluation methodology.

BankerToolBench

BankerToolBench: AI Agents in Investment Banking

Evaluation · 2026

End-to-end benchmark for AI agents in professional investment banking workflows.

Covers real-world tasks: deal sourcing, financial analysis, report generation.
Evaluates tool use, multi-step reasoning, and domain accuracy.

arXiv Paper

BrowserART

Browser Agent Red-teaming Toolkit

Agents & Safety · 2024–25

Safety benchmark for LLM web agents.

100+ adversarial behaviors across 40 sandboxed sites.
Harness + reports for reproducible evaluations.
Surfaced jailbreak classes; informed mitigations.

Paper · Code

Reasoning RL

Reasoning Datasets & RL Training

RL for LLMs · 2024–25

Curricula + reward design for math/STEM reasoning.

Built custom datasets and staged reward schemata.
Improved pass@k on targeted benchmarks.
Production training pipelines with partners.
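
For reference, pass@k is commonly reported with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below assumes that estimator; the page does not state which variant was used in this project, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per problem, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem estimates are then averaged over the benchmark to get the reported score.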

Curriculum · RLAIF/RLHF · Eval

Adaptive Guidance

Adaptive Guidance for RL of Reasoning Models

RL for LLMs · 2025 (under review)

Guided training signals to accelerate reasoning RL.

Stability and sample-efficiency improvements.
Reduces reward hacking via staged curricula.

Rubrics

Rubrics as Rewards: RL Beyond Verifiable Domains

RL for LLMs · ICLR 2026

Rubric-driven rewards to train models where exact verification is hard.

Task-specific rubrics (clarity, safety, usefulness) as reward signals.
Reduces reliance on ground-truth labels; aligns with evaluator preferences.
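
One simple way to turn a rubric into a scalar reward is a weighted average of per-criterion scores. This is a minimal sketch under stated assumptions, not the paper's aggregation method: the `Criterion` type, the weights, and the idea that a judge returns a score in [0, 1] per criterion are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item, e.g. clarity, safety, or usefulness (hypothetical type)."""
    name: str
    weight: float


def rubric_reward(scores: dict[str, float], rubric: list[Criterion]) -> float:
    """Weighted average of per-criterion scores, each expected in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = 0.0
    for criterion in rubric:
        score = scores[criterion.name]  # KeyError if the judge skipped a criterion
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {criterion.name!r} must be in [0, 1]")
        weighted += criterion.weight * score
    return weighted / total_weight
```

Normalizing by the total weight keeps the reward in [0, 1] regardless of how many criteria a task-specific rubric contains.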

ToolRL-Val

Tool-RL Data Valuation (ToolRL-Val)

RL for LLMs · 2025 (in progress)

Data valuation for tool-using LLMs to guide RL training and curation.

QGFN

QGFN: Controllable Greediness

Exploration & Generative RL · NeurIPS 2024

Action-value modulation for diverse high-reward discovery.

Mixture policies with action-value guidance.
~4× more distinct high-reward modes on benchmarks.
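
A mixture policy of this kind can be sketched as follows: with probability p take the action with the highest action value, otherwise sample from the learned sampling policy. This is an illustrative assumption about the general idea rather than the paper's exact formulation; the function name and the parameter `p` are hypothetical.

```python
import random
from typing import Sequence


def p_greedy_action(
    policy_probs: Sequence[float],
    q_values: Sequence[float],
    p: float,
    rng: random.Random,
) -> int:
    """Return an action index from a mixture of a greedy and a sampling policy."""
    if len(policy_probs) != len(q_values) or not policy_probs:
        raise ValueError("policy_probs and q_values must be non-empty and equal length")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if rng.random() < p:
        # Greedy branch: pick the index of the largest action value.
        return max(range(len(q_values)), key=q_values.__getitem__)
    # Sampling branch: draw an index in proportion to the policy probabilities.
    return rng.choices(range(len(policy_probs)), weights=policy_probs, k=1)[0]
```

Setting p = 0 recovers pure sampling and p = 1 gives a fully greedy policy, so a single knob trades diversity against reward.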

Paper · Code

Replay

Replay Buffers for Mode Discovery

Exploration & Generative RL · ICML 2023 WS

Ablations on buffer policies for generative exploration.

Improved mode coverage vs. baselines.
Open scripts for reproducibility.

Paper

DeepVent

DeepVent: Clinical RL

Applied RL · AAAI 2023 / RLDM 2022

Conservative RL for ventilator personalization.

Offline clinical data; safety-aware training.
Equal contribution; peer-reviewed results.

AAAI · RLDM

Data Review

Automated Multi-Agent Data Review

Data & Systems · 2024–25

Pipeline that filters low-quality code examples in real time.

Multiple reviewers (heuristic + model-based) with quorum rules.
Streaming moderation; audit logs & dashboards.
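
A quorum rule over several reviewers can be sketched as below. This is a minimal illustration, not the production pipeline: the reviewer functions, their thresholds, and the `passes_quorum` name are hypothetical, and a model-based reviewer would be any callable with the same signature.

```python
from typing import Callable, Sequence

# A reviewer takes a code example and returns True to approve it.
Reviewer = Callable[[str], bool]


def non_empty(example: str) -> bool:
    """Heuristic reviewer (hypothetical): reject blank examples."""
    return bool(example.strip())


def reasonable_length(example: str) -> bool:
    """Heuristic reviewer (hypothetical): reject very long examples."""
    return len(example) <= 2000


def no_placeholder(example: str) -> bool:
    """Stand-in for a model-based reviewer (hypothetical): reject TODO stubs."""
    return "TODO" not in example


def passes_quorum(example: str, reviewers: Sequence[Reviewer], quorum: int) -> bool:
    """Keep the example if at least `quorum` reviewers approve it."""
    if not 1 <= quorum <= len(reviewers):
        raise ValueError("quorum must be between 1 and the number of reviewers")
    approvals = 0
    for review in reviewers:
        if review(example):
            approvals += 1
            if approvals >= quorum:
                return True  # stop early once the quorum is reached
    return False
```

For example, `passes_quorum(snippet, [non_empty, reasonable_length, no_placeholder], quorum=2)` keeps a snippet that any two of the three reviewers approve.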

GFN Feedback

Feedback Usefulness Detection (GFN)

Data & Systems · NVIDIA 2022

Decision model to detect useful user feedback.

End-to-end pipeline integrated with team workflows.
Telemetry + behavior features; statistical analyses.

Vision Retrieval

Multimodal Small-Object Retrieval

Data & Systems · 2023

Upgraded retrieval stack for small objects.

Object-level similarity search; dev-friendly endpoints.

PGiF

Policy Gradients Incorporating the Future

Exploration & RL · ICLR 2022

Improves PG by looking ahead to future state values.

Paper