🔒 Permission denied: the Viewer role has limited access, so some actions are disabled.
Distributed Jobs
Manage multi-node distributed training jobs.
| Total | Running | Completed | Failed |
|---|---|---|---|
| 416 | 4,945 | 6,388 | 6,603 |
| Name | Status | Framework | Nodes | GPUs/Node | Total GPUs | Duration | Owner |
|---|---|---|---|---|---|---|---|
| ray-train-46 | Failed | Horovod | 3 | 8 | 100 | | alice.liu |
| ray-train-28 | Failed | Ray Train | 5 | 4 | 81 | | henry.zhao |
| ray-train-16 | Completed | PyTorch FSDP | 13 | 4 | 110 | | henry.zhao |
| deepspeed-18 | Running | DeepSpeed | 10 | 8 | 73 | | bob.zhang |
| pretrain-19 | Running | Megatron-LM | 10 | 8 | 75 | | henry.zhao |
| ray-train-8 | Completed | Megatron-LM | 15 | 4 | 66 | | henry.zhao |
| ray-train-15 | Queued | DeepSpeed | 9 | 8 | 112 | | henry.zhao |
| megatron-30 | Running | PyTorch FSDP | 14 | 4 | 59 | | alice.liu |
| deepspeed-46 | Queued | DeepSpeed | 11 | 8 | 32 | | henry.zhao |
| ray-train-10 | Queued | Ray Train | 12 | 4 | 99 | | alice.liu |