Distributed Jobs
Manage multi-node distributed training jobs.
Total: 2,107
Running: 7,138
Completed: 4,916
Failed: 9,217
| Name | Status | Framework | Nodes | GPUs/Node | Total GPUs | Duration | Owner | |
|---|---|---|---|---|---|---|---|---|
| megatron-13 | Running | PyTorch FSDP | 3 | 4 | 49 | alice.liu | ||
| pretrain-22 | Queued | Horovod | 14 | 8 | 45 | bob.zhang | ||
| ray-train-12 | Completed | Megatron-LM | 7 | 8 | 96 | alice.liu | ||
| pretrain-18 | Queued | Horovod | 14 | 4 | 26 | bob.zhang | ||
| pretrain-14 | Running | Ray Train | 12 | 8 | 100 | alice.liu | ||
| deepspeed-14 | Failed | Ray Train | 6 | 8 | 11 | bob.zhang | ||
| deepspeed-12 | Failed | PyTorch FSDP | 11 | 4 | 39 | henry.zhao | ||
| megatron-22 | Failed | PyTorch FSDP | 15 | 8 | 9 | bob.zhang | ||
| ray-train-3 | Running | Ray Train | 3 | 4 | 83 | bob.zhang | ||
| deepspeed-38 | Failed | PyTorch FSDP | 14 | 4 | 34 | alice.liu | ||