README.md: 29 additions & 2 deletions
@@ -39,15 +39,17 @@ to power LLMs api-inference widgets.
## Features
- Serve the most popular Large Language Models with a simple launcher (see the launch sketch after this list)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- [Dynamic batching of incoming requests](https://github.com/huggingface/text-generation-inference/blob/main/router/src/batcher.rs#L88) for increased total throughput
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
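
A minimal launch sketch, assuming the Docker image name and the `--model-id`/`--num-shard` launcher flags from the project's documentation (neither appears in this diff):

```shell
# Hedged sketch: the image tag and flags below are assumptions,
# not taken from this diff.
model=bigscience/bloom-560m
num_shard=2

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model --num-shard $num_shard
```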
```python
import requests

# Query the /generate endpoint for a single completion
result = requests.post("http://127.0.0.1:8080/generate", json={"inputs":"Testing API","parameters":{"max_new_tokens":9}})
print(result.json())
```
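
The endpoint replies with a JSON body; a minimal sketch of reading the completion, assuming the response carries a `generated_text` field (the field name is not shown in this diff):

```python
# Hedged sketch: "generated_text" is an assumed key in the /generate response.
data = result.json()
print(data.get("generated_text", data))
```

To stream tokens instead of waiting for the full response, install an SSE client: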
```shell
pip install sseclient-py
```
```python
import sseclient
import requests

# Query the /generate_stream endpoint and read the Server-Sent Events
r = requests.post("http://127.0.0.1:8080/generate_stream", stream=True, json={"inputs":"Testing API","parameters":{"max_new_tokens":9}})
sse_client = sseclient.SSEClient(r)

for i, event in enumerate(sse_client.events()):
    print(i, event.data)
```
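
Each event's `data` field is a string; a minimal sketch of decoding it as JSON, replacing the print loop above (the `token`/`text` key names are assumptions about the payload shape, not confirmed by this diff):

```python
import json

for event in sse_client.events():
    payload = json.loads(event.data)  # each event's data is a JSON string
    # Hedged: the "token"/"text" keys are assumed payload fields.
    print(payload.get("token", {}).get("text", ""), end="", flush=True)
```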
**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).