Projects
Industry-focused builds, RL work, and evaluation tooling.
BrowserART
Browser Agent Red-teaming Toolkit
Safety benchmark for LLM web agents.
- 100+ adversarial behaviors across 40 sandboxed sites.
- Harness + reports for reproducible evaluations.
- Surfaced jailbreak classes; informed mitigations.
Reasoning RL
Reasoning Datasets & RL Training
Curricula + reward design for math/STEM reasoning.
- Built custom datasets and staged reward schemata.
- Improved pass@k on targeted benchmarks.
- Production training pipelines with partners.
Curriculum, RLAIF/RLHF, Eval
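For readers unfamiliar with the pass@k metric mentioned above, here is a minimal sketch of the commonly used unbiased estimator: given n sampled solutions of which c are correct, it estimates the chance that at least one of k draws is correct. The function name `pass_at_k` is illustrative and not taken from the project's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 correct: pass@1 is about 0.3, pass@5 is higher.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```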
Adaptive Guidance
Adaptive Guidance for RL of Reasoning Models
Guided training signals to accelerate reasoning RL.
- Stability and sample-efficiency improvements.
- Reduces reward hacking via staged curricula.
Rubrics
Rubrics as Rewards: RL Beyond Verifiable Domains
Rubric-driven rewards to train models where exact verification is hard.
- Task-specific rubrics (clarity, safety, usefulness) as reward signals.
- Reduces reliance on ground-truth labels; aligns with evaluator preferences.
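One way to picture a rubric used as a reward is a weighted average of per-criterion scores. The sketch below assumes each criterion is scored in [0, 1] by some judge (human or model); the `Criterion` class, the weights, and the aggregation rule are hypothetical and may differ from the method in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str      # e.g. "clarity", "safety", "usefulness"
    weight: float  # relative importance (hypothetical values)


def rubric_reward(scores: dict[str, float], rubric: list[Criterion]) -> float:
    """Collapse per-criterion scores in [0, 1] into one scalar reward in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = 0.0
    for c in rubric:
        score = scores[c.name]  # KeyError if the judge skipped a criterion
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {c.name!r} must be in [0, 1]")
        weighted += c.weight * score
    return weighted / total_weight


if __name__ == "__main__":
    rubric = [Criterion("clarity", 1.0), Criterion("safety", 2.0), Criterion("usefulness", 1.0)]
    judged = {"clarity": 0.5, "safety": 1.0, "usefulness": 0.5}
    print(rubric_reward(judged, rubric))  # 0.75
```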
ToolRL-Val
Tool-RL Data Valuation (ToolRL-Val)
Data valuation for tool-using LLMs to guide RL training and curation.
QGFN
QGFN — Controllable Greediness
Action-value modulation for diverse high-reward discovery.
- Mixture policies with action-value guidance.
- ~4× more distinct high-reward modes on benchmarks.
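A simple way to mix a sampling policy with action-value guidance is a "p-greedy" rule: with probability p take the action with the highest Q-value, otherwise sample from the policy. The sketch below illustrates that idea only; the function name and the parameter `p` are assumptions and need not match the QGFN implementation.

```python
import random
from typing import Sequence


def p_greedy_action(
    policy_probs: Sequence[float],
    q_values: Sequence[float],
    p: float,
    rng: random.Random,
) -> int:
    """Return an action index: greedy on Q with probability p, else sampled from the policy."""
    if not policy_probs or len(policy_probs) != len(q_values):
        raise ValueError("policy_probs and q_values must be non-empty and the same length")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if rng.random() < p:
        # Greedy branch: pick the action with the largest action value.
        return max(range(len(q_values)), key=q_values.__getitem__)
    # Exploratory branch: sample in proportion to the policy's probabilities.
    return rng.choices(range(len(policy_probs)), weights=policy_probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.2, 0.5, 0.3]
    q = [1.0, 0.1, 0.4]
    # p = 0 recovers the plain policy; p = 1 is fully greedy on Q.
    print([p_greedy_action(probs, q, 0.5, rng) for _ in range(10)])
```

Raising `p` trades diversity for reward, which is the sense in which greediness is controllable.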
Replay
Replay Buffers for Mode Discovery
Ablations on buffer policies for generative exploration.
- Improved mode coverage vs. baselines.
- Open scripts for reproducibility.
DeepVent
DeepVent — Clinical RL
Conservative RL for ventilator personalization.
- Offline clinical data; safety-aware training.
- Equal contribution; peer-reviewed results.
Data Review
Automated Multi-Agent Data Review
Pipeline that filters low-quality code examples in real time.
- Multiple reviewers (heuristic + model-based) with quorum rules.
- Streaming moderation; audit logs & dashboards.
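A quorum rule over several reviewers can be sketched as follows: each reviewer votes on an example, and it is kept only if enough of them approve. The reviewers shown (`has_body`, `no_todo`) are hypothetical heuristics; a model-based reviewer would be any other callable with the same signature.

```python
from typing import Callable, Sequence

# A reviewer takes a code example and returns True to approve it.
Reviewer = Callable[[str], bool]


def has_body(code: str) -> bool:
    """Hypothetical heuristic: reject near-empty snippets (threshold is arbitrary)."""
    return len(code.strip()) >= 20


def no_todo(code: str) -> bool:
    """Hypothetical heuristic: reject snippets that still contain TODO markers."""
    return "TODO" not in code


def passes_quorum(code: str, reviewers: Sequence[Reviewer], quorum: int) -> bool:
    """Keep the example only if at least `quorum` reviewers approve it."""
    if not 1 <= quorum <= len(reviewers):
        raise ValueError("quorum must be between 1 and the number of reviewers")
    approvals = sum(1 for review in reviewers if review(code))
    return approvals >= quorum


if __name__ == "__main__":
    reviewers = [has_body, no_todo]
    good = "def add(a, b):\n    return a + b\n"
    bad = "x = 1  # TODO"
    print(passes_quorum(good, reviewers, quorum=2))  # True
    print(passes_quorum(bad, reviewers, quorum=2))   # False
```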
GFN Feedback
Feedback Usefulness Detection (GFN)
Decision model to detect useful user feedback.
- End-to-end pipeline integrated with team workflows.
- Telemetry + behavior features; statistical analyses.
Vision Retrieval
Multimodal Small-Object Retrieval
Upgraded retrieval stack for small objects.
- Object-level similarity search; dev-friendly endpoints.
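As a rough illustration of object-level similarity search, the sketch below ranks stored object embeddings by cosine similarity to a query embedding. It assumes embeddings are plain lists of floats keyed by a hypothetical object ID; a production stack would typically use a vector index instead of a linear scan.

```python
import math
from typing import Mapping, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Mapping[str, Sequence[float]], k: int) -> list[str]:
    """Return the IDs of the k stored objects most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [object_id for object_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy 2-D embeddings for three detected objects (made-up values).
    index = {"obj_a": [1.0, 0.0], "obj_b": [0.0, 1.0], "obj_c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], index, k=2))  # ['obj_a', 'obj_c']
```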
PGiF
Policy Gradients Incorporating the Future
Improves policy-gradient methods by looking ahead to future state values.