Development Roadmap (v0.4.0) #3

@gujingit

Here is the development roadmap for v0.4.0. Contributions and feedback are welcome.

Upgrades

  • In-Place Upgrades: Support for updating components without pod recreation.
  • Orchestrated Upgrade Order: Ensure the upgrade sequence is coordinated with the required component startup order.
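One way to picture the orchestrated upgrade order is as a topological sort over startup dependencies. The sketch below is only an illustration under assumptions: the component names and the dependency map are hypothetical, and the roadmap does not say how the ordering will actually be implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def upgrade_order(startup_deps):
    """Return components so that each one comes after everything it depends on.

    startup_deps maps a component to the set of components that must be
    started (and here, upgraded) before it.
    """
    return list(TopologicalSorter(startup_deps).static_order())


# Hypothetical dependency map, for illustration only.
deps = {
    "prefill": set(),
    "decode": {"prefill"},
    "router": {"prefill", "decode"},
}
order = upgrade_order(deps)
```

With this map the dependencies form a chain, so only one ordering is valid. A cyclic map would make `static_order()` raise `graphlib.CycleError`, which is a useful signal that the declared startup order is inconsistent.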

Scheduling

  • Original Node Scheduling: Support for scheduling pods back to their original nodes after restarts or preemptions.
  • Multi-Level Gang Scheduling: Enable the co-scheduling of multiple, dependent groups of pods.
  • Volcano Integration: Support for gang scheduling via the Volcano scheduler.
  • Topology-Aware Scheduling: Co-locate Prefill and Decode pods on the same node whenever possible to maximize GPU utilization and VRAM efficiency.
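The gang scheduling items above share one core property: a set of pod groups is either placed as a whole or not placed at all. A minimal sketch of that all-or-nothing check follows, assuming a simple first-fit placement by free GPU count. The function name, data shapes, and node names are hypothetical and do not reflect Volcano's or this project's API.

```python
def try_place_gang(groups, free_gpus):
    """All-or-nothing placement across several pod groups.

    groups:    {group_name: [(pod_name, gpus_needed), ...]}
    free_gpus: {node_name: free GPU count}
    Returns {pod_name: node_name}, or None if any single pod does not fit.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement = {}
    for pods in groups.values():
        for pod, need in pods:
            # First fit: take the first node with enough free GPUs.
            node = next((n for n, free in remaining.items() if free >= need), None)
            if node is None:
                return None  # one pod cannot fit, so the whole gang is rejected
            remaining[node] -= need
            placement[pod] = node
    return placement


groups = {"prefill": [("p0", 2)], "decode": [("d0", 2)]}
placed = try_place_gang(groups, {"node-a": 4})
rejected = try_place_gang(groups, {"node-a": 3})
```

Because first fit fills the earliest node before moving on, this sketch also tends to put Prefill and Decode pods on the same node when capacity allows. First fit is not an optimal packing, so a real scheduler could find placements that this check rejects.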

Fault Tolerance

  • Configurable Failure Policies: Allow users to choose a FailurePolicy that determines how pod failures are handled.
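To make the idea of a configurable failure policy concrete, here is a small sketch of how a policy could map a failed pod to the set of pods to restart. The policy values and the function are hypothetical, chosen only for illustration. The roadmap does not list the policies that will be supported.

```python
from enum import Enum


class FailurePolicy(Enum):
    # Hypothetical values, not the project's actual API.
    RESTART_POD = "RestartPod"        # restart only the pod that failed
    RECREATE_GROUP = "RecreateGroup"  # restart every pod in the failed pod's group
    IGNORE = "Ignore"                 # take no action


def pods_to_restart(policy, failed_pod, group_pods):
    """Return the pods that should be restarted under the given policy."""
    if policy is FailurePolicy.RESTART_POD:
        return [failed_pod]
    if policy is FailurePolicy.RECREATE_GROUP:
        return list(group_pods)
    return []


group = ["decode-0", "decode-1"]
single = pods_to_restart(FailurePolicy.RESTART_POD, "decode-1", group)
whole = pods_to_restart(FailurePolicy.RECREATE_GROUP, "decode-1", group)
```

A group-wide restart is the kind of option that matters for gang-scheduled workloads, where the remaining pods may not be able to make progress once one member has failed.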

Runtime

  • Simplified, Runtime-less Service Discovery: Streamline the cluster ConfigMap to reduce overhead and enable service discovery without requiring a dedicated EngineRuntime component.
