Commit c0ee46f

[router] Update doc for dynamic scaling and fault tolerance (#2454)
1 parent 9208618 commit c0ee46f

3 files changed (+73, -119 lines changed)

docs/router/router.md

Lines changed: 57 additions & 7 deletions
````diff
@@ -7,14 +7,14 @@ The router is a independent Python package, and it can be used as a drop-in repl
 ## Installation
 
 ```bash
-pip install sglang-router
+$ pip install sglang-router
 ```
 
 Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang/launch_server.py). Also, you can directly run the following command to see the usage of the router.
 
 ```bash
-python -m sglang_router.launch_server --help
-python -m sglang_router.launch_router --help
+$ python -m sglang_router.launch_server --help
+$ python -m sglang_router.launch_router --help
 ```
 
 The router supports two working modes:
````
````diff
@@ -27,7 +27,7 @@ The router supports two working modes:
 This will be a drop-in replacement for the existing `--dp-size` arguement of SGLang Runtime. Under the hood, it uses multi-processes to launch multiple workers, wait for them to be ready, then connect the router to all workers.
 
 ```bash
-python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
+$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
 ```
 
 After the server is ready, you can directly send requests to the router as the same way as sending requests to each single worker.
````
````diff
@@ -47,12 +47,62 @@ print(response.json())
 This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.
 
 ```bash
-python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
+$ python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
 ```
 
-## Strategies
+## Dynamic Scaling APIs
 
-### Cache-Aware Load-Balancing Router
+We offer `/add_worker` and `/remove_worker` APIs to dynamically add or remove workers from the router.
+
+- `/add_worker`
+
+Usage:
+
+```bash
+$ curl -X POST http://localhost:30000/add_worker?url=http://worker_url_1
+```
+
+Example:
+
+```bash
+$ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001
+$ curl -X POST http://localhost:30000/add_worker?url=http://127.0.0.1:30001
+Successfully added worker: http://127.0.0.1:30001
+```
+
+- `/remove_worker`
+
+Usage:
+
+```bash
+$ curl -X POST http://localhost:30000/remove_worker?url=http://worker_url_1
+```
+
+Example:
+
+```bash
+$ curl -X POST http://localhost:30000/remove_worker?url=http://127.0.0.1:30001
+Successfully removed worker: http://127.0.0.1:30001
+```
+
+Note:
+
+- For cache-aware router, the worker will be removed from the tree and the queues.
+
````
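The same two dynamic scaling calls can also be scripted from Python. This is a minimal sketch, not part of `sglang_router`: the helper names `scaling_request` and `call` are hypothetical, the router address `http://localhost:30000` is taken from the curl examples, and percent-encoding the worker URL assumes the router decodes its query string in the usual way (the curl examples pass it unencoded).

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed router address, matching the curl examples above.
ROUTER = "http://localhost:30000"


def scaling_request(action: str, worker_url: str, router: str = ROUTER) -> Request:
    """Build a POST request for /add_worker or /remove_worker (hypothetical helper)."""
    if action not in ("add_worker", "remove_worker"):
        raise ValueError(f"unknown action: {action}")
    # Percent-encode the worker URL so its ':' and '/' stay inside one query value.
    query = urlencode({"url": worker_url})
    return Request(f"{router}/{action}?{query}", method="POST")


def call(action: str, worker_url: str) -> str:
    """Send the request to a running router and return its plain-text reply."""
    with urlopen(scaling_request(action, worker_url), timeout=10) as resp:
        return resp.read().decode()
```

Against a running router, `call("add_worker", "http://127.0.0.1:30001")` should return the same `Successfully added worker: ...` text as the curl example. When calling curl directly from zsh, quote the URL so that `?` is not treated as a glob pattern.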
````diff
+## Fault Tolerance
+
+We provide retries based for failure tolerance.
+
+1. If the request to a worker fails for `max_worker_retries` times, the router will remove the worker from the router and move on to the next worker.
+2. If the total number of retries exceeds `max_total_retries`, the router will return an error.
+
+Note:
+
+- `max_worker_retries` is 3 and `max_total_retries` is 6 by default.
+
````
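The two retry rules can be modeled in a few lines of Python for illustration; the router itself implements them in Rust. Assumptions in this sketch: `send` is a hypothetical callable that raises `ConnectionError` on failure, workers are tried in list order, every failed attempt counts toward the total, and the router gives up once that count reaches `max_total_retries`.

```python
from typing import Callable, List


class AllRetriesFailed(Exception):
    """Raised when the retry budget is exhausted or no workers remain."""


def route_with_retries(
    workers: List[str],
    send: Callable[[str], str],
    max_worker_retries: int = 3,
    max_total_retries: int = 6,
) -> str:
    """Try workers in order; a worker that fails max_worker_retries times is removed.

    `workers` is mutated in place to mirror the router dropping a failed worker.
    """
    total_failures = 0
    while workers:
        worker = workers[0]
        for _ in range(max_worker_retries):
            try:
                return send(worker)
            except ConnectionError:
                total_failures += 1
                # Assumption: stop as soon as the global failure count hits the limit.
                if total_failures >= max_total_retries:
                    raise AllRetriesFailed(f"gave up after {total_failures} failed attempts")
        # This worker used up its per-worker budget: remove it and move on.
        workers.pop(0)
    raise AllRetriesFailed("no workers left")
```

Under these assumptions and the default limits of 3 and 6, one request can exhaust at most two unhealthy workers before the error is returned.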
````diff
+## Routing Strategies
+
+#### Cache-Aware Load-Balancing Router
 
 The native router combines two strategies to optimize both cache utilization and request distribution:
 
````
rust/README.md

Lines changed: 15 additions & 111 deletions
````diff
@@ -2,115 +2,13 @@
 
 SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
 
-## Installation
+## User docs
 
-```bash
-pip install sglang-router
-```
-
-## Usage
-The router offers two modes:
-
-### 1. Co-launch workers and router
-This will be a drop-in replacement for the existing `--dp-size`. This part of code will be moved into sglang core.
-Under the hood, it uses multi-processes to launch multiple sglang workers, wait for them to be healthy, then launch the router.
-
-```bash
-$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8
-```
-
-### 2. Launch only router
-This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them.
-
-```bash
-$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000
-
-$ python -m sglang_router.launch_router --help
-usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
-                        [--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
-                        [--balance-abs-threshold BALANCE_ABS_THRESHOLD] [--balance-rel-threshold BALANCE_REL_THRESHOLD]
-                        [--eviction-interval EVICTION_INTERVAL] [--max-tree-size MAX_TREE_SIZE]
-
-options:
-  -h, --help            show this help message and exit
-  --host HOST           Host address to bind the router server (default: 127.0.0.1)
-  --port PORT           Port number to bind the router server (default: 30000)
-  --worker-urls WORKER_URLS [WORKER_URLS ...]
-                        List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
-  --policy {random,round_robin,cache_aware}
-                        Load balancing policy to use (default: cache_aware)
-  --cache-threshold CACHE_THRESHOLD
-                        Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
-  --balance-abs-threshold BALANCE_ABS_THRESHOLD
-                        Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 32)
-  --balance-rel-threshold BALANCE_REL_THRESHOLD
-                        Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 1.0001)
-  --eviction-interval EVICTION_INTERVAL
-                        Interval in seconds between cache eviction operations (default: 60)
-  --max-tree-size MAX_TREE_SIZE
-                        Maximum size of the approximation tree for cache-aware routing (default: 16777216)
-```
-
-## Strategy
-
-### Cache-Aware Load-Balancing Router
-
-This router combines two strategies to optimize both cache utilization and request distribution:
-
-1. Cache-Aware Routing (Approximate Tree)
-2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)
+Please check https://sgl-project.github.io/router/router.html
 
-The router dynamically switches between these strategies based on load conditions:
-- Uses load balancing when the system is imbalanced
-- Uses cache-aware routing when the system is balanced
+## Developer docs
 
-A system is considered imbalanced if both conditions are met:
-1. (max_load - min_load) > balance_abs_threshold
-2. max_load > balance_rel_threshold * min_load
````
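The imbalance rule in the removed README text can be written as a predicate. This Python sketch uses the defaults from the `--help` output (32 and 1.0001) and assumes per-worker pending-request counts are available as a plain dict; `is_imbalanced` and `shortest_queue` are hypothetical names, not router APIs.

```python
from typing import Dict


def is_imbalanced(
    loads: Dict[str, int],
    balance_abs_threshold: int = 32,
    balance_rel_threshold: float = 1.0001,
) -> bool:
    """True only when BOTH the absolute and the relative condition hold."""
    max_load = max(loads.values())
    min_load = min(loads.values())
    return (max_load - min_load) > balance_abs_threshold and max_load > balance_rel_threshold * min_load


def shortest_queue(loads: Dict[str, int]) -> str:
    """Load-balancing choice: the worker with the fewest pending requests."""
    return min(loads, key=lambda worker: loads[worker])
```

Because both conditions must hold, loads of 1,000,000 and 1,000,033 are not treated as imbalanced: the absolute gap of 33 exceeds 32, but the relative condition fails.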
````diff
-
-#### 1. Cache-Aware Routing (Approximate Tree)
-This strategy maintains an approximate radix tree for each worker based on request history,
-eliminating the need for direct cache state queries. The tree stores raw text characters
-instead of token IDs to avoid tokenization overhead.
-
-Process:
-- For each request, find the worker with the highest prefix match
-- If match rate > cache_threshold:
-  - Route to the worker with highest match (likely has relevant data cached)
-- If match rate ≤ cache_threshold:
-  - Route to the worker with smallest tree size (most available cache capacity)
-- Background maintenance:
-  - Periodically evict least recently used leaf nodes to prevent memory overflow
````
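The cache-aware process above can be illustrated with an uncompressed character trie per worker, standing in for the router's approximate radix tree. Assumptions in this Python sketch: the match rate is the number of matched leading characters divided by the request length, tree size is the node count, and the request text is inserted into the chosen worker's tree after routing. `CharTrie` and `cache_aware_route` are hypothetical names.

```python
from typing import Dict


class CharTrie:
    """Uncompressed character trie; a stand-in for the approximate radix tree."""

    def __init__(self) -> None:
        self.children: Dict[str, "CharTrie"] = {}

    def insert(self, text: str) -> None:
        node = self
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = CharTrie()
            node = child

    def prefix_match_len(self, text: str) -> int:
        """How many leading characters of text are already in the trie."""
        node, matched = self, 0
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def size(self) -> int:
        """Node count excluding the root (iterative, so long prompts do not recurse deeply)."""
        count, stack = 0, [self]
        while stack:
            node = stack.pop()
            count += len(node.children)
            stack.extend(node.children.values())
        return count


def cache_aware_route(trees: Dict[str, CharTrie], text: str, cache_threshold: float = 0.5) -> str:
    """Highest prefix match if it clears the threshold, else the smallest tree."""
    matches = {worker: tree.prefix_match_len(text) for worker, tree in trees.items()}
    best = max(matches, key=lambda worker: matches[worker])
    match_rate = matches[best] / max(len(text), 1)
    if match_rate > cache_threshold:
        chosen = best  # likely holds the relevant prefix in its cache
    else:
        chosen = min(trees, key=lambda worker: trees[worker].size())  # most spare capacity
    trees[chosen].insert(text)  # record the request in the chosen worker's history
    return chosen
```

For example, routing `"Hello, there"` when one worker has already seen `"Hello, world"` gives a match rate of 7/12, which clears the default threshold of 0.5.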
````diff
-
-#### 2. Load-Balancing (Shortest Queue)
-This strategy tracks pending request counts per worker and routes new requests
-to the least busy worker when the system is detected to be imbalanced. This helps
-maintain optimal load distribution across workers.
-
-### Configuration Parameters
-
-1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5)
-   - Minimum prefix match ratio to use highest-match routing
-   - Below this threshold, routes to worker with most available cache space
-
-2. `balance_abs_threshold`: (integer, default: 32)
-   - Absolute difference threshold for load imbalance detection
-   - System is potentially imbalanced if (max_load - min_load) > abs_threshold
-
-3. `balance_rel_threshold`: (float, default: 1.0001)
-   - Relative ratio threshold for load imbalance detection
-   - System is potentially imbalanced if max_load > min_load * rel_threshold
-   - Used in conjunction with abs_threshold to determine final imbalance state
-
-4. `eviction_interval`: (integer, default: 60)
-   - Interval in seconds between LRU eviction cycles for the approximate trees
-   - Background thread periodically evicts least recently used nodes to maintain tree size
-
-5. `max_tree_size`: (integer, default: 16777216)
-   - Maximum nodes per tree
-   - When exceeded, LRU leaf nodes are evicted during the next eviction cycle
````
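The behavior described by `eviction_interval` and `max_tree_size` amounts to repeatedly removing the least recently used leaf until the tree fits. The Python sketch below shows one eviction cycle on a character trie. It uses a logical counter instead of wall-clock time, and the background thread that would call `evict` every `eviction_interval` seconds is assumed rather than shown; `LruTrie` and `evict` are hypothetical names.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Tuple

_clock = itertools.count(1)  # logical clock standing in for access timestamps


class Node:
    def __init__(self, parent: Optional["Node"] = None, ch: str = "") -> None:
        self.parent = parent
        self.ch = ch
        self.children: Dict[str, "Node"] = {}
        self.last_access = 0


class LruTrie:
    def __init__(self) -> None:
        self.root = Node()
        self.size = 0  # number of nodes, excluding the root

    def insert(self, text: str) -> None:
        """Add text and refresh the access time of every node on its path."""
        now = next(_clock)
        node = self.root
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Node(node, ch)
                self.size += 1
            child.last_access = now
            node = child

    def evict(self, max_tree_size: int) -> int:
        """One eviction cycle: drop LRU leaves until size <= max_tree_size."""
        tie = itertools.count()  # tie-breaker so heap entries never compare Nodes
        heap: List[Tuple[int, int, Node]] = []
        stack = [self.root]
        while stack:
            node = stack.pop()
            for child in node.children.values():
                if child.children:
                    stack.append(child)
                else:
                    heapq.heappush(heap, (child.last_access, next(tie), child))
        evicted = 0
        while self.size > max_tree_size and heap:
            _, _, leaf = heapq.heappop(heap)
            parent = leaf.parent
            assert parent is not None  # only non-root nodes are ever pushed
            del parent.children[leaf.ch]
            self.size -= 1
            evicted += 1
            # A parent that just lost its last child becomes an eviction candidate.
            if parent is not self.root and not parent.children:
                heapq.heappush(heap, (parent.last_access, next(tie), parent))
        return evicted
```

Removing only leaves keeps every surviving path intact, and because each insert refreshes the whole path, prefixes shared by recent requests are the last to be evicted.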
````diff
-
-## Development
+### Prerequisites
 
 - Rust and Cargo installed
 
````
````diff
@@ -134,21 +32,27 @@ cargo --version
 #### 1. Build Rust Project
 
 ```bash
-cargo build
+$ cargo build
 ```
 
 #### 2. Build Python Binding
 
 ##### Option A: Build and Install Wheel
 1. Build the wheel package:
 ```bash
-pip install setuptools-rust wheel build
-python -m build
+$ pip install setuptools-rust wheel build
+$ python -m build
 ```
 
 2. Install the generated wheel:
 ```bash
-pip install <path-to-wheel>
+$ pip install <path-to-wheel>
+```
+
+If you want one handy command to do build + install for every change you make:
+
+```bash
+$ python -m build && pip install --force-reinstall dist/*.whl
 ```
 
 ##### Option B: Development Mode
````
````diff
@@ -158,7 +62,7 @@ For development purposes, you can install the package in editable mode:
 Warning: Using editable python binding can suffer from performance degradation!! Please build a fresh wheel for every update if you want to test performance.
 
 ```bash
-pip install -e .
+$ pip install -e .
 ```
 
 **Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
````

rust/src/server.rs

Lines changed: 1 addition & 1 deletion
````diff
@@ -118,7 +118,7 @@ async fn remove_worker(
         None => return HttpResponse::BadRequest().finish(),
     };
     data.router.remove_worker(&worker_url);
-    HttpResponse::Ok().finish()
+    HttpResponse::Ok().body(format!("Successfully removed worker: {}", worker_url))
 }
 
 pub struct ServerConfig {
````
