[router] Update doc for dynamic scaling and fault tolerance (#2454)

ByronHsu · web-flow · commit c0ee46fe10e7 · 2024-12-11T13:11:42.000-08:00
diff --git a/docs/router/router.md b/docs/router/router.md
@@ -7,14 +7,14 @@ The router is a independent Python package, and it can be used as a drop-in repl
 ## Installation
 
 ```bash
-pip install sglang-router
+$ pip install sglang-router
 ```
 
 Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang/launch_server.py). Also, you can directly run the following command to see the usage of the router.
 
 ```bash
-python -m sglang_router.launch_server --help
-python -m sglang_router.launch_router --help
+$ python -m sglang_router.launch_server --help
+$ python -m sglang_router.launch_router --help
 ```
 
 The router supports two working modes:
@@ -27,7 +27,7 @@ The router supports two working modes:
 This will be a drop-in replacement for the existing `--dp-size` arguement of SGLang Runtime. Under the hood, it uses multi-processes to launch multiple workers, wait for them to be ready, then connect the router to all workers.
 
 ```bash
-python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
+$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
 ```
 
 After the server is ready, you can directly send requests to the router as the same way as sending requests to each single worker.
@@ -47,12 +47,62 @@ print(response.json())
 This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.
 
 ```bash
-python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
+$ python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
 ```
 
-## Strategies
+## Dynamic Scaling APIs
 
-### Cache-Aware Load-Balancing Router
+We offer `/add_worker` and `/remove_worker` APIs to dynamically add or remove workers from the router.
+
+- `/add_worker`
+
+Usage:
+
+```bash
+$ curl -X POST http://localhost:30000/add_worker?url=http://worker_url_1
+```
+
+Example:
+
+```bash
+$ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001
+$ curl -X POST http://localhost:30000/add_worker?url=http://127.0.0.1:30001
+Successfully added worker: http://127.0.0.1:30001
+```
+
+- `/remove_worker`
+
+Usage:
+
+```bash
+$ curl -X POST http://localhost:30000/remove_worker?url=http://worker_url_1
+```
+
+Example:
+
+```bash
+$ curl -X POST http://localhost:30000/remove_worker?url=http://127.0.0.1:30001
+Successfully removed worker: http://127.0.0.1:30001
+```
+
+Note:
+
+- For the cache-aware router, the worker is also removed from the tree and the queues.
+
+## Fault Tolerance
+
+We provide retry-based fault tolerance.
+
+1. If a request to a worker fails `max_worker_retries` times, the router removes that worker and moves on to the next worker.
+2. If the total number of retries exceeds `max_total_retries`, the router will return an error.
+
+Note:
+
+- `max_worker_retries` is 3 and `max_total_retries` is 6 by default.
+
+## Routing Strategies
+
+### Cache-Aware Load-Balancing Router
 
 The native router combines two strategies to optimize both cache utilization and request distribution:
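
The two fault-tolerance rules added in the `docs/router/router.md` hunk above can be sketched in Python. This is only an illustration of the documented policy, not the router's Rust implementation: `route_with_retries` and the `send` callback are hypothetical names, and the sketch assumes the total budget is spent once `max_total_retries` attempts have been made (the doc says "exceeds", so the exact boundary may differ).

```python
from typing import Callable, List

MAX_WORKER_RETRIES = 3  # default stated in the doc
MAX_TOTAL_RETRIES = 6   # default stated in the doc


def route_with_retries(
    workers: List[str],
    send: Callable[[str], bool],
    max_worker_retries: int = MAX_WORKER_RETRIES,
    max_total_retries: int = MAX_TOTAL_RETRIES,
) -> str:
    """Return the URL of the worker that served the request.

    `send(worker_url)` is a hypothetical callback returning True on success.
    A worker that fails `max_worker_retries` times is removed from `workers`.
    """
    total_attempts = 0
    while workers and total_attempts < max_total_retries:
        worker = workers[0]
        for _ in range(max_worker_retries):
            if total_attempts >= max_total_retries:
                break  # total budget spent; keep the worker, give up below
            total_attempts += 1
            if send(worker):
                return worker
        else:
            # The per-worker budget was used up without success: drop the worker.
            workers.remove(worker)
    raise RuntimeError("request failed: retry budget exhausted or no workers left")
```

With the defaults, a request that fails everywhere tries the first worker three times, removes it, tries the second worker three times, removes it, and then returns an error without reaching a third worker.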
 
diff --git a/rust/README.md b/rust/README.md
@@ -2,115 +2,13 @@
 
 SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances.
 
-## Installation
+## User docs
 
-```bash
-pip install sglang-router
-```
-
-## Usage
-The router offers two modes:
-
-### 1. Co-launch workers and router
-This will be a drop-in replacement for the existing `--dp-size`. This part of code will be moved into sglang core.
-Under the hood, it uses multi-processes to launch multiple sglang workers, wait for them to be healthy, then launch the router.
-
-```bash
-$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8
-```
-
-### 2. Launch only router
-This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them.
-
-```bash
-$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000
-
-$ python -m sglang_router.launch_router --help
-usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]]
-                       [--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD]
-                       [--balance-abs-threshold BALANCE_ABS_THRESHOLD] [--balance-rel-threshold BALANCE_REL_THRESHOLD]
-                       [--eviction-interval EVICTION_INTERVAL] [--max-tree-size MAX_TREE_SIZE]
-
-options:
-  -h, --help            show this help message and exit
-  --host HOST          Host address to bind the router server (default: 127.0.0.1)
-  --port PORT          Port number to bind the router server (default: 30000)
-  --worker-urls WORKER_URLS [WORKER_URLS ...]
-                       List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None)
-  --policy {random,round_robin,cache_aware}
-                       Load balancing policy to use (default: cache_aware)
-  --cache-threshold CACHE_THRESHOLD
-                       Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5)
-  --balance-abs-threshold BALANCE_ABS_THRESHOLD
-                       Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 32)
-  --balance-rel-threshold BALANCE_REL_THRESHOLD
-                       Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 1.0001)
-  --eviction-interval EVICTION_INTERVAL
-                       Interval in seconds between cache eviction operations (default: 60)
-  --max-tree-size MAX_TREE_SIZE
-                       Maximum size of the approximation tree for cache-aware routing (default: 16777216)
-```
-
-## Strategy
-
-### Cache-Aware Load-Balancing Router
-
-This router combines two strategies to optimize both cache utilization and request distribution:
-
-1. Cache-Aware Routing (Approximate Tree)
-2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)
+Please check https://sgl-project.github.io/router/router.html
 
-The router dynamically switches between these strategies based on load conditions:
-- Uses load balancing when the system is imbalanced
-- Uses cache-aware routing when the system is balanced
+## Developer docs
 
-A system is considered imbalanced if both conditions are met:
-1. (max_load - min_load) > balance_abs_threshold
-2. max_load > balance_rel_threshold * min_load
-
-#### 1. Cache-Aware Routing (Approximate Tree)
-This strategy maintains an approximate radix tree for each worker based on request history,
-eliminating the need for direct cache state queries. The tree stores raw text characters
-instead of token IDs to avoid tokenization overhead.
-
-Process:
-- For each request, find the worker with the highest prefix match
-- If match rate > cache_threshold:
-  - Route to the worker with highest match (likely has relevant data cached)
-- If match rate ≤ cache_threshold:
-  - Route to the worker with smallest tree size (most available cache capacity)
-- Background maintenance:
-  - Periodically evict least recently used leaf nodes to prevent memory overflow
-
-#### 2. Load-Balancing (Shortest Queue)
-This strategy tracks pending request counts per worker and routes new requests
-to the least busy worker when the system is detected to be imbalanced. This helps
-maintain optimal load distribution across workers.
-
-### Configuration Parameters
-
-1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5)
-   - Minimum prefix match ratio to use highest-match routing
-   - Below this threshold, routes to worker with most available cache space
-
-2. `balance_abs_threshold`: (integer, default: 32)
-   - Absolute difference threshold for load imbalance detection
-   - System is potentially imbalanced if (max_load - min_load) > abs_threshold
-
-3. `balance_rel_threshold`: (float, default: 1.0001)
-   - Relative ratio threshold for load imbalance detection
-   - System is potentially imbalanced if max_load > min_load * rel_threshold
-   - Used in conjunction with abs_threshold to determine final imbalance state
-
-4. `eviction_interval`: (integer, default: 60)
-   - Interval in seconds between LRU eviction cycles for the approximate trees
-   - Background thread periodically evicts least recently used nodes to maintain tree size
-
-5. `max_tree_size`: (integer, default: 16777216)
-   - Maximum nodes per tree
-   - When exceeded, LRU leaf nodes are evicted during the next eviction cycle
-
-## Development
+### Prerequisites
 
 - Rust and Cargo installed
 
@@ -134,21 +32,27 @@ cargo --version
 #### 1. Build Rust Project
 
 ```bash
-cargo build
+$ cargo build
 ```
 
 #### 2. Build Python Binding
 
 ##### Option A: Build and Install Wheel
 1. Build the wheel package:
 ```bash
-pip install setuptools-rust wheel build
-python -m build
+$ pip install setuptools-rust wheel build
+$ python -m build
 ```
 
 2. Install the generated wheel:
 ```bash
-pip install <path-to-wheel>
+$ pip install <path-to-wheel>
+```
+
+To build and install with a single command after each change:
+
+```bash
+$ python -m build && pip install --force-reinstall dist/*.whl
 ```
 
 ##### Option B: Development Mode
@@ -158,7 +62,7 @@ For development purposes, you can install the package in editable mode:
 Warning: Using editable python binding can suffer from performance degradation!! Please build a fresh wheel for every update if you want to test performance.
 
 ```bash
-pip install -e .
+$ pip install -e .
 ```
 
 **Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect.
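
The strategy description removed from `rust/README.md` in this diff (imbalance check, then cache-aware or shortest-queue routing) can be summarized as a small Python sketch. It is a simplification under stated assumptions: `WorkerState`, `is_imbalanced`, and `select_worker` are hypothetical names, and the per-request prefix match rate is taken as a precomputed field rather than derived from an approximate radix tree as the router does.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WorkerState:
    pending_requests: int  # current queue length for this worker
    match_rate: float      # prefix match ratio for the incoming request (0.0 to 1.0)
    tree_size: int         # size of the worker's approximate tree


def is_imbalanced(
    loads: List[int], abs_threshold: int = 32, rel_threshold: float = 1.0001
) -> bool:
    """Both conditions from the README must hold for the system to be imbalanced."""
    max_load, min_load = max(loads), min(loads)
    return (max_load - min_load) > abs_threshold and max_load > rel_threshold * min_load


def select_worker(
    workers: Dict[str, WorkerState],
    cache_threshold: float = 0.5,
    abs_threshold: int = 32,
    rel_threshold: float = 1.0001,
) -> str:
    loads = [w.pending_requests for w in workers.values()]
    if is_imbalanced(loads, abs_threshold, rel_threshold):
        # Imbalanced: shortest-queue routing.
        return min(workers, key=lambda url: workers[url].pending_requests)
    # Balanced: cache-aware routing.
    best = max(workers, key=lambda url: workers[url].match_rate)
    if workers[best].match_rate > cache_threshold:
        return best  # likely has the relevant prefix cached
    # Low match rate: pick the worker with the smallest tree (most free cache).
    return min(workers, key=lambda url: workers[url].tree_size)
```

The default values mirror those listed in the removed help text (`cache_threshold` 0.5, `balance_abs_threshold` 32, `balance_rel_threshold` 1.0001).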
diff --git a/rust/src/server.rs b/rust/src/server.rs
@@ -118,7 +118,7 @@ async fn remove_worker(
         None => return HttpResponse::BadRequest().finish(),
     };
     data.router.remove_worker(&worker_url);
-    HttpResponse::Ok().finish()
+    HttpResponse::Ok().body(format!("Successfully removed worker: {}", worker_url))
 }
 
 pub struct ServerConfig {
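
For scripting the `/add_worker` and `/remove_worker` endpoints instead of calling `curl`, a minimal standard-library client might look like the sketch below. The helper names `worker_endpoint` and `change_worker` are hypothetical. The sketch percent-encodes the `url` query parameter, which assumes the server decodes query strings in the usual way; the `curl` examples in the doc pass the worker URL unencoded.

```python
import urllib.parse
import urllib.request


def worker_endpoint(router_url: str, action: str, worker_url: str) -> str:
    """Build the request URL for `add_worker` or `remove_worker`."""
    if action not in ("add_worker", "remove_worker"):
        raise ValueError(f"unsupported action: {action}")
    query = urllib.parse.urlencode({"url": worker_url})
    return f"{router_url.rstrip('/')}/{action}?{query}"


def change_worker(
    router_url: str, action: str, worker_url: str, timeout: float = 10.0
) -> str:
    """POST to the router and return the response body as text."""
    request = urllib.request.Request(
        worker_endpoint(router_url, action, worker_url), method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the `server.rs` change in this commit, `change_worker("http://localhost:30000", "remove_worker", "http://127.0.0.1:30001")` would be expected to return `Successfully removed worker: http://127.0.0.1:30001` rather than an empty body.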
