
Commit 2ba396c

Merge branch 'main' into add_logs_gaudi_warmup
2 parents: 3245b89 + 3752143

144 files changed: +7327 -8667 lines

Large commits have some content hidden by default; not every changed file is rendered below.

.github/workflows/nix_build.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ jobs:
           nix_path: nixpkgs=channel:nixos-unstable
       - uses: cachix/cachix-action@v14
         with:
-          name: text-generation-inference
+          name: huggingface
           # If you chose signing key for write access
           authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
         env:

.github/workflows/nix_cache.yaml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ jobs:
           nix_path: nixpkgs=channel:nixos-unstable
       - uses: cachix/cachix-action@v14
         with:
-          name: text-generation-inference
+          name: huggingface
           # If you chose signing key for write access
           authToken: "${{ secrets.CACHIX_AUTH_TOKEN }}"
         env:

.github/workflows/nix_tests.yaml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ jobs:
           nix_path: nixpkgs=channel:nixos-unstable
       - uses: cachix/cachix-action@v14
         with:
-          name: text-generation-inference
+          name: huggingface
           # If you chose signing key for write access
           authToken: '${{ secrets.CACHIX_AUTH_TOKEN }}'
         env:

Cargo.lock

Lines changed: 8 additions & 8 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ default-members = [
 resolver = "2"

 [workspace.package]
-version = "3.3.0-dev0"
+version = "3.3.2-dev0"
 edition = "2021"
 authors = ["Olivier Dehaene"]
 homepage = "https://github.com/huggingface/text-generation-inference"

Dockerfile

Lines changed: 1 addition & 10 deletions
@@ -48,7 +48,7 @@ FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS pytorch-install
 WORKDIR /usr/src/

 # NOTE: When updating PyTorch version, beware to remove `pip install nvidia-nccl-cu12==2.22.3` below in the Dockerfile. Context: https://github.com/huggingface/text-generation-inference/pull/2099
-ARG PYTORCH_VERSION=2.6
+ARG PYTORCH_VERSION=2.7
 ARG PYTHON_VERSION=3.11

 # Keep in sync with `server/pyproject.toml

@@ -121,13 +121,6 @@ COPY server/Makefile-awq Makefile
 # Build specific version of transformers
 RUN . .venv/bin/activate && make build-awq

-# Build Lorax Punica kernels
-FROM kernel-builder AS lorax-punica-builder
-WORKDIR /usr/src
-COPY server/Makefile-lorax-punica Makefile
-# Build specific version of transformers
-RUN . .venv/bin/activate && TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-lorax-punica
-
 # Build Transformers CUDA kernels
 FROM kernel-builder AS custom-kernels-builder
 WORKDIR /usr/src

@@ -210,8 +203,6 @@ COPY --from=exllama-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-311
 COPY --from=exllamav2-kernels-builder /usr/src/exllamav2/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
 # Copy build artifacts from awq kernels builder
 COPY --from=awq-kernels-builder /usr/src/llm-awq/awq/kernels/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
-# Copy build artifacts from lorax punica kernels builder
-COPY --from=lorax-punica-builder /usr/src/lorax-punica/server/punica_kernels/build/lib.linux-x86_64-cpython-311 /usr/src/.venv/lib/python3.11/site-packages
 # Copy build artifacts from mamba builder
 COPY --from=mamba-builder /usr/src/mamba/build/lib.linux-x86_64-cpython-311/ /usr/src/.venv/lib/python3.11/site-packages
 COPY --from=mamba-builder /usr/src/causal-conv1d/build/lib.linux-x86_64-cpython-311/ /usr/src/.venv/lib/python3.11/site-packages

Dockerfile.neuron

Lines changed: 10 additions & 11 deletions
@@ -5,7 +5,7 @@ RUN mkdir -p /tgi
 # Fetch the optimum-neuron sources directly to avoid relying on pypi deployments
 FROM alpine AS optimum-neuron
 RUN mkdir -p /optimum-neuron
-ADD https://github.com/huggingface/optimum-neuron/archive/refs/tags/v0.1.0.tar.gz /optimum-neuron/sources.tar.gz
+ADD https://github.com/huggingface/optimum-neuron/archive/refs/tags/v0.2.0.tar.gz /optimum-neuron/sources.tar.gz
 RUN tar -C /optimum-neuron -xf /optimum-neuron/sources.tar.gz --strip-components=1

 # Build cargo components (adapted from TGI original Dockerfile)

@@ -108,10 +108,10 @@ RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEU
 # Install neuronx packages
 RUN apt-get update -y \
   && apt-get install -y --no-install-recommends \
-  aws-neuronx-dkms=2.19.64.0 \
-  aws-neuronx-collectives=2.23.135.0-3e70920f2 \
-  aws-neuronx-runtime-lib=2.23.112.0-9b5179492 \
-  aws-neuronx-tools=2.20.204.0 \
+  aws-neuronx-dkms=2.20.28.0 \
+  aws-neuronx-collectives=2.24.59.0-838c7fc8b \
+  aws-neuronx-runtime-lib=2.24.53.0-f239092cc \
+  aws-neuronx-tools=2.22.61.0 \
   libxml2 \
   && rm -rf /var/lib/apt/lists/* \
   && apt-get clean

@@ -125,11 +125,10 @@ RUN pip3 install \
   --index-url https://download.pytorch.org/whl/cpu

 RUN pip3 install \
-  neuronx-cc==2.16.372.0 \
-  torch-neuronx==2.5.1.2.4.0 \
-  transformers-neuronx==0.13.322 \
-  neuronx-distributed==0.10.1 \
-  libneuronxla==2.1.681.0 \
+  neuronx-cc==2.17.194.0 \
+  torch-neuronx==2.5.1.2.6.0 \
+  neuronx-distributed==0.11.0 \
+  libneuronxla==2.2.1630.0 \
   --extra-index-url=https://pip.repos.neuron.amazonaws.com

 # Install HuggingFace packages

@@ -160,7 +159,7 @@ RUN pip install dist/text_generation_server*.tar.gz
 # Final image
 FROM neuron

-COPY backends/neuron/tgi_env.py /tgi_env.py
+COPY backends/neuron/tgi_entry_point.py /tgi_entry_point.py
 COPY backends/neuron/tgi-entrypoint.sh /tgi-entrypoint.sh
 RUN chmod +x /tgi-entrypoint.sh

Dockerfile.nix

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 FROM nixos/nix:2.18.8 AS builder
 RUN echo "experimental-features = nix-command flakes" >> /etc/nix/nix.conf
 RUN nix profile install nixpkgs#cachix
-RUN cachix use text-generation-inference
+RUN cachix use huggingface
 WORKDIR /root
 ADD . .
 RUN nix build .
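
The same cache rename applies outside of Docker: anyone building TGI with Nix locally has to switch to the new Cachix cache name. A minimal sketch, reusing the exact commands from the Dockerfile.nix stage above; the only assumption is that Nix with flakes is already installed on the machine:

    # Install the Cachix CLI and enable the renamed binary cache
    nix profile install nixpkgs#cachix
    cachix use huggingface   # previously: cachix use text-generation-inference

    # Build TGI from the repository root, pulling prebuilt dependencies from the cache
    nix build .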

Dockerfile_gaudi

Lines changed: 8 additions & 3 deletions
@@ -1,5 +1,5 @@
 # Those arguments are required to build the image
-ARG HABANA_VERSION=1.20.0
+ARG HABANA_VERSION=1.21.0
 ARG PYTORCH_VERSION=2.6.0

 # Rust builder

@@ -57,9 +57,12 @@ ARG PYTORCH_VERSION

 FROM vault.habana.ai/gaudi-docker/${HABANA_VERSION}/ubuntu22.04/habanalabs/pytorch-installer-${PYTORCH_VERSION}:latest AS base

-ENV ATTENTION=default
+ENV ATTENTION=paged
 ENV PREFIX_CACHING=0
 ENV PREFILL_CHUNKING=0
+ENV PT_HPU_LAZY_MODE=1
+ENV PT_HPU_WEIGHT_SHARING=0
+ENV VLLM_EXPONENTIAL_BUCKETING=true

 # Text Generation Inference base env
 ENV HF_HOME=/data \

@@ -95,7 +98,9 @@ RUN cd server && \
   pip install "git+https://github.com/HabanaAI/DeepSpeed.git@${HABANA_VERSION}" && \
   BUILD_CUDA_EXT=0 pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@097dd04e --no-build-isolation && \
   pip install . --no-cache-dir
-RUN pip install git+https://github.com/sywangyi/vllm-hpu-extension.git
+RUN pip install git+https://github.com/sywangyi/vllm-hpu-extension.git@bmax_fix
+RUN pip install compressed-tensors==0.9.1
+
 # Install benchmarker
 COPY --from=builder /usr/src/target/release-opt/text-generation-benchmark /usr/local/bin/text-generation-benchmark
 # Install router
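
The new ENV lines above change the Gaudi image defaults (paged attention, HPU lazy mode, exponential bucketing), but since they are plain environment variables they can still be overridden when the container starts. A minimal sketch of such an override; the Gaudi image tag, the `--runtime=habana` flag, and `HABANA_VISIBLE_DEVICES` are assumptions based on the usual Habana container setup, not taken from this diff:

    model=HuggingFaceH4/zephyr-7b-beta
    volume=$PWD/data

    # Assumed invocation (habana runtime + Gaudi image tag); override one of the new defaults at run time
    docker run --runtime=habana -e HABANA_VISIBLE_DEVICES=all --ipc=host \
        -p 8080:80 -v $volume:/data \
        -e PT_HPU_LAZY_MODE=0 \
        ghcr.io/huggingface/text-generation-inference:3.3.2-gaudi --model-id $model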

README.md

Lines changed: 4 additions & 4 deletions
@@ -84,7 +84,7 @@ model=HuggingFaceH4/zephyr-7b-beta
 volume=$PWD/data

 docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
-    ghcr.io/huggingface/text-generation-inference:3.3.0 --model-id $model
+    ghcr.io/huggingface/text-generation-inference:3.3.2 --model-id $model
 ```

 And then you can make requests like

@@ -121,7 +121,7 @@ curl localhost:8080/v1/chat/completions \

 **Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`, please note CPU is not the intended platform for this project, so performance might be subpar.

-**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/installation_amd#using-tgi-with-amd-gpus). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.0-rocm --model-id $model` instead of the command above.
+**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/installation_amd#using-tgi-with-amd-gpus). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.3.2-rocm --model-id $model` instead of the command above.

 To see all options to serve your models (in the [code](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or in the cli):
 ```

@@ -152,7 +152,7 @@ volume=$PWD/data # share a volume with the Docker container to avoid downloading
 token=<your cli READ token>

 docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
-    ghcr.io/huggingface/text-generation-inference:3.3.0 --model-id $model
+    ghcr.io/huggingface/text-generation-inference:3.3.2 --model-id $model
 ```

 ### A note on Shared Memory (shm)

@@ -256,7 +256,7 @@ Another option is to install `text-generation-inference` locally using [Nix](htt
 we only support Nix on x86_64 Linux with CUDA GPUs. When using Nix, all dependencies can
 be pulled from a binary cache, removing the need to build them locally.

-First follow the instructions to [install Cachix and enable the TGI cache](https://app.cachix.org/cache/text-generation-inference).
+First follow the instructions to [install Cachix and enable the Hugging Face cache](https://app.cachix.org/cache/huggingface).
 Setting up the cache is important, otherwise Nix will build many of the dependencies
 locally, which can take hours.

0 commit comments
