Supervised fine-tuning
TRL's SFTTrainer or Unsloth. unwrap_dataset() pulls per-owner keys via dstack-guest-agent, so your training loop stays unchanged. Output: a sealed checkpoint plus a signed manifest.
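A minimal sketch of the flow around the trainer, runnable anywhere: base64 decoding stands in for the real per-owner decryption that unwrap_dataset() performs via dstack-guest-agent, and a placeholder blob stands in for the saved checkpoint. The shape is the point: unwrap, run an unchanged loop over plaintext records, then hash the result into a manifest.

```python
import base64
import hashlib
import json

def unwrap_dataset(sealed_records):
    # Hypothetical stand-in: in the real flow, per-owner keys fetched from
    # dstack-guest-agent decrypt each record inside the enclave. Here,
    # base64 stands in for decryption so the example is self-contained.
    return [json.loads(base64.b64decode(r)) for r in sealed_records]

sealed = [
    base64.b64encode(json.dumps({"prompt": "hi", "completion": "hello"}).encode()),
    base64.b64encode(json.dumps({"prompt": "2+2?", "completion": "4"}).encode()),
]

# The training loop itself is unchanged -- it only ever sees plaintext.
records = unwrap_dataset(sealed)
corpus = "".join(r["prompt"] + r["completion"] for r in records)

# Placeholder for the saved model weights; a real run hashes the
# checkpoint file written by the trainer.
checkpoint = corpus.encode()
manifest = {
    "checkpoint_sha256": hashlib.sha256(checkpoint).hexdigest(),
    "num_records": len(records),
}
```

The manifest is what gets signed with the enclave's attested key, binding the sealed checkpoint to the exact records it was trained on.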
Preference / RL alignment
DPOTrainer / IPOTrainer for preference-pair optimization, or full RLHF with a reward model + PPO. Sealed prompts, sealed preference data, attested reward model.
PEFT · LoRA / QLoRA
Hugging Face PEFT. Train low-rank adapters against a frozen base; the LoRA weights are sealed to the compose-hash and merged on an attested re-derive. 4-bit QLoRA for memory-bound runs.
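The merge step on re-derive is just the standard LoRA identity W' = W + (alpha/r) · B·A. A tiny pure-Python sketch with a rank-1 adapter on a 2×2 weight (toy numbers, not a real model):

```python
def matmul(a, b):
    # Naive dense matrix multiply for small illustrative matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def merge_lora(w, a, b, alpha):
    """Fold a rank-r adapter into a frozen base weight: W + (alpha/r) * B @ A."""
    r = len(a)  # A is (r x in), B is (out x r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2 identity here)
A = [[1.0, 2.0]]              # rank-1 down-projection
B = [[0.5], [1.0]]            # rank-1 up-projection
merged = merge_lora(W, A, B, alpha=1.0)
```

Because only A and B are trained, sealing the adapter is cheap, and the full merged weight only ever exists inside the attested environment that re-derives it.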
Continued pre-training
Domain-adapt a base model on sealed token corpora. A streaming dataloader unwraps shards in TDX memory; the run emits a single signed manifest covering token-hashes, hyperparameters, and the final checkpoint.
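A stdlib-only sketch of that manifest pipeline, with assumptions labeled: plain byte strings stand in for sealed shards (the real loader unwraps each one inside TDX memory), a placeholder blob stands in for the final checkpoint, and an HMAC with a demo key stands in for signing with the enclave's attested key.

```python
import hashlib
import hmac
import json

def stream_shards(shards):
    # Stand-in for the streaming dataloader: real shards arrive sealed
    # and are unwrapped one at a time in TDX memory, never all at once.
    for shard in shards:
        yield shard

shards = [b"domain text shard 0", b"domain text shard 1"]
hyperparams = {"lr": 2e-5, "seq_len": 4096}

# Hash each shard as it streams past, so the manifest commits to the
# exact token data the run consumed.
token_hashes = [hashlib.sha256(s).hexdigest() for s in stream_shards(shards)]

checkpoint = b"final-weights-placeholder"
manifest = json.dumps({
    "token_hashes": token_hashes,
    "hyperparameters": hyperparams,
    "checkpoint_sha256": hashlib.sha256(checkpoint).hexdigest(),
}, sort_keys=True).encode()

# Hypothetical signing step: HMAC with a demo key in place of the
# enclave's attested signing key.
signature = hmac.new(b"demo-key", manifest, hashlib.sha256).hexdigest()
```

One signature over one document ties data, config, and weights together, so a verifier can check the whole run without seeing any of the sealed inputs.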