
Commit 1c19b09

v0.3.2 (#97)
1 parent 0b6807c commit 1c19b09

File tree

10 files changed: +51 -14 lines changed

Cargo.lock

Lines changed: 4 additions & 4 deletions
Some generated files are not rendered by default.

Makefile

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ run-bloom-560m-quantize:
 	text-generation-launcher --model-id bigscience/bloom-560m --num-shard 2 --quantize
 
 download-bloom:
-	text-generation-server download-weights bigscience/bloom
+	HF_HUB_ENABLE_HF_TRANSFER=1 text-generation-server download-weights bigscience/bloom
 
 run-bloom:
 	text-generation-launcher --model-id bigscience/bloom --num-shard 8
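
The new `HF_HUB_ENABLE_HF_TRANSFER=1` prefix opts the Hub client into the Rust-based `hf_transfer` download backend, which speeds up large weight downloads. A minimal sketch of the same idea from Python, assuming `huggingface_hub` and `hf_transfer` are installed (the model id is illustrative):

```python
import os

# The flag must be set before huggingface_hub is imported,
# because the library reads it once at import time.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Download all weight files for a repo; bigscience/bloom-560m is illustrative.
snapshot_download("bigscience/bloom-560m")
```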

README.md

Lines changed: 29 additions & 2 deletions
@@ -39,15 +39,17 @@ to power LLMs api-inference widgets.
 
 ## Features
 
+- Serve the most popular Large Language Models with a simple launcher
+- Tensor Parallelism for faster inference on multiple GPUs
 - Token streaming using Server-Sent Events (SSE)
 - [Dynamic batching of incoming requests](https://github.com/huggingface/text-generation-inference/blob/main/router/src/batcher.rs#L88) for increased total throughput
 - Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
 - [Safetensors](https://github.com/huggingface/safetensors) weight loading
-- 45ms per token generation for BLOOM with 8xA100 80GB
+- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
 - Logits warpers (temperature scaling, topk, repetition penalty ...)
 - Stop sequences
 - Log probabilities
-- Distributed tracing with Open Telemetry
+- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
 
 ## Officially supported architectures
 
@@ -58,6 +60,7 @@ to power LLMs api-inference widgets.
 - [SantaCoder](https://huggingface.co/bigcode/santacoder)
 - [GPT-Neox 20B](https://huggingface.co/EleutherAI/gpt-neox-20b)
 - [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl)
+- [FLAN-UL2](https://huggingface.co/google/flan-ul2)
 
 Other architectures are supported on a best effort basis using:
 
@@ -97,6 +100,30 @@ curl 127.0.0.1:8080/generate_stream \
 	-H 'Content-Type: application/json'
 ```
 
+or from Python:
+
+```python
+import requests
+
+result = requests.post("http://127.0.0.1:8080/generate", json={"inputs":"Testing API","parameters":{"max_new_tokens":9}})
+print(result.json())
+```
+
+```shell
+pip install sseclient-py
+```
+
+```python
+import sseclient
+import requests
+
+r = requests.post("http://127.0.0.1:8080/generate_stream", stream=True, json={"inputs":"Testing API","parameters":{"max_new_tokens":9}})
+sse_client = sseclient.SSEClient(r)
+
+for i, event in enumerate(sse_client.events()):
+    print(i, event.data)
+```
+
 **Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
 
 ### API documentation

docs/openapi.json

Lines changed: 11 additions & 1 deletion
@@ -11,7 +11,7 @@
       "name": "Apache 2.0",
       "url": "https://www.apache.org/licenses/LICENSE-2.0"
     },
-    "version": "0.3.1"
+    "version": "0.3.2"
   },
   "paths": {
     "/generate": {
@@ -290,6 +290,11 @@
           "nullable": true,
           "exclusiveMinimum": 0.0
         },
+        "return_full_text": {
+          "type": "boolean",
+          "default": "None",
+          "example": false
+        },
         "seed": {
           "type": "integer",
           "format": "int64"
@@ -328,6 +333,11 @@
           "nullable": true,
          "maximum": 1.0,
           "exclusiveMinimum": 0.0
+        },
+        "watermark": {
+          "type": "boolean",
+          "default": "false",
+          "example": true
         }
       }
     },
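
The two fields added to the schema above, `return_full_text` and `watermark`, are request parameters for `/generate`. A hedged sketch of a request exercising them, reusing the local endpoint from the README example (prompt and token budget are illustrative):

```python
import requests

# return_full_text and watermark are the fields added to the OpenAPI schema
# in this commit; names, types, and defaults are taken from the diff above.
response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Testing API",
        "parameters": {
            "max_new_tokens": 9,
            "return_full_text": False,  # boolean; example value from the schema
            "watermark": True,          # boolean; defaults to false per the schema
        },
    },
)
print(response.json())
```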

launcher/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "text-generation-launcher"
-version = "0.3.1"
+version = "0.3.2"
 edition = "2021"
 authors = ["Olivier Dehaene"]
 description = "Text Generation Launcher"

router/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "text-generation-router"
-version = "0.3.1"
+version = "0.3.2"
 edition = "2021"
 authors = ["Olivier Dehaene"]
 description = "Text Generation Webserver"

router/client/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "text-generation-client"
-version = "0.3.1"
+version = "0.3.2"
 edition = "2021"
 
 [dependencies]

router/grpc-metadata/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "grpc-metadata"
-version = "0.3.1"
+version = "0.3.2"
 edition = "2021"
 
 [dependencies]

server/Makefile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-transformers_commit := 712d62e83c28236c7f39af690e7792a54288dbd9
+transformers_commit := 2f87dca1ca3e5663d0637da9bb037a6956e57a5e
 
 gen-server:
 	# Compile protos

server/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "text-generation"
-version = "0.3.1"
+version = "0.3.2"
 description = "Text Generation Inference Python gRPC Server"
 authors = ["Olivier Dehaene <olivier@huggingface.co>"]
 
