Releases: huggingface/text-generation-inference

v0.9.0

01 Jul 17:26
e28a809

Highlights

  • server: add paged attention to flash models
  • server: Inference support for GPTQ (llama + falcon tested) + Quantization script
  • server: only compute prefill logprobs when asked
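
Paged attention and GPTQ are wired through the launcher (GPTQ checkpoints are served via the launcher's `--quantize gptq` flag, and the server package now ships a quantization script), while prefill logprobs are now opt-in per request. A minimal sketch of requesting them, assuming a server already running on localhost:8080 and that the request parameter is named `decoder_input_details` as in the client API:

```python
# Minimal sketch: explicitly request prefill (input-token) logprobs,
# which are no longer computed unless asked for.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Deep learning is",
        "parameters": {
            "max_new_tokens": 16,
            "details": True,
            # Opt in to prefill logprobs; omitting this skips the extra
            # prefill logprob computation entirely.
            "decoder_input_details": True,
        },
    },
    timeout=60,
)
resp.raise_for_status()
for tok in resp.json()["details"]["prefill"]:
    print(tok["id"], tok["text"], tok["logprob"])
```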

Features

  • launcher: parse oom signals
  • server: batch tokenization for flash causal lm
  • server: Rework loading
  • server: optimize dist ops
  • router: add ngrok integration
  • server: improve flash attention import errors
  • server: Refactor conversion logic
  • router: add header option to disable buffering for the generate_stream response by @rkimball (client-side example after this list)
  • router: add arg validation
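
The buffering option targets intermediaries: proxies in the nginx family honor an `X-Accel-Buffering: no` response header, and the router can now emit it for `generate_stream` (the exact header mechanics are an assumption here; check the router code). On the client side the stream is plain server-sent events; a minimal consumption sketch, assuming a server on localhost:8080:

```python
# Minimal sketch of consuming /generate_stream: each SSE line is
# "data: {json}" carrying one token, and the final event also carries
# the full generated_text.
import json
import requests

with requests.post(
    "http://localhost:8080/generate_stream",
    json={"inputs": "Once upon a time", "parameters": {"max_new_tokens": 32}},
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        print(event["token"]["text"], end="", flush=True)
        if event.get("generated_text"):
            print()
```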

Fix

  • docs: CUDA_VISIBLE_DEVICES comment by @antferdom
  • docs: Fix typo and use POSIX comparison in the makefile by @piratos
  • server: fix warpers on CPU
  • server: fix T5 loading in case weight names are mixed up
  • router: add timeout on flume sends
  • server: Do not init process group if already initialized
  • server: Add the option to force another dtype than f16
  • launcher: fix issue where launcher does not properly report shard failures

Full Changelog: v0.8.2...v0.9.0

v0.8.2

01 Jun 17:51

Features

  • server: remove trust_remote_code requirement for falcon models
  • server: load santacoder/starcoder models with safetensors
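
Safetensors checkpoints load without pickle execution and can be memory-mapped. A minimal sketch with the `safetensors` library, assuming the hub repo ships a `model.safetensors` file (gpt2 is used here purely as a small example):

```python
# Minimal sketch: fetch a safetensors checkpoint and inspect its tensors.
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

path = hf_hub_download("gpt2", "model.safetensors")
state_dict = load_file(path)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```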

Fix

  • server: fix has_position_ids

Full Changelog: v0.8.1...v0.8.2

v0.8.1

31 May 10:10

Features

  • server: add retry on download
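
The retry lives in the server's download path; a minimal sketch of the same idea against `huggingface_hub`, with a backoff policy that is an assumption rather than the server's exact behavior:

```python
# Sketch: retry a hub download with exponential backoff.
import time

from huggingface_hub import hf_hub_download


def download_with_retry(repo_id: str, filename: str, attempts: int = 5) -> str:
    for attempt in range(attempts):
        try:
            return hf_hub_download(repo_id=repo_id, filename=filename)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("unreachable")


print(download_with_retry("gpt2", "config.json"))
```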

Fix

  • server: fix bnb quantization for CausalLM models

Full Changelog: v0.8.0...v0.8.1

v0.8.0

30 May 16:45

Features

  • router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier; see the sketch after this list)
  • proto: decrease IPC proto size
  • benchmarker: add summary tables
  • server: support RefinedWeb models
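
"Vectorized warpers" means the sampling transforms (temperature, top-k, top-p, ...) operate on the whole `[batch, vocab]` logits tensor at once instead of looping over requests. A sketch of the idea with temperature and top-k only (not this release's implementation):

```python
# Sketch of batch-vectorized logits warping: per-request temperatures
# broadcast over the vocab dimension, and top-k masks each row in one op.
import torch


def warp_logits(logits: torch.Tensor, temperature: torch.Tensor, top_k: int) -> torch.Tensor:
    logits = logits / temperature.unsqueeze(1)
    # Mask everything below each row's k-th largest logit.
    kth = torch.topk(logits, top_k, dim=-1).values[:, -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))


logits = torch.randn(4, 32000)              # 4 requests in one batch
temps = torch.tensor([0.7, 1.0, 1.3, 0.9])  # one temperature per request
probs = torch.softmax(warp_logits(logits, temps, top_k=50), dim=-1)
print(torch.multinomial(probs, num_samples=1).squeeze(1))
```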

Fix

  • server: Fix issue when loading AutoModelForSeq2SeqLM models (contributed by @CL-Shang)

Full Changelog: v0.7.0...v0.8.0

v0.7.0

23 May 19:21
d31562f

Features

  • server: reduce vram requirements of continuous batching (contributed by @njhill)
  • server: Support BLOOMChat-176B (contributed by @njhill)
  • server: add watermarking tests (contributed by @ehsanmok)
  • router: Adding response schema for compat_generate (contributed by @gsaivinay)
  • router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill; see the sketch after this list)
  • server: improve download and decrease conversion to safetensors RAM requirements
  • server: optimize flash causal lm decode token
  • server: shard decode token
  • server: use cuda graph in logits warping
  • server: support trust_remote_code
  • tests: add snapshot testing
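
A sketch of token-count-based admission for the dynamic batcher (names and the budget heuristic are hypothetical, not the router's actual scheduler):

```python
# Sketch: admit waiting requests while the batch's total token count
# stays under a budget, instead of capping the number of requests.
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int
    max_new_tokens: int

    @property
    def token_cost(self) -> int:
        # Worst-case footprint of this request in the batch.
        return self.input_tokens + self.max_new_tokens


def admit(waiting: list[Request], batch_tokens: int, budget: int) -> list[Request]:
    admitted = []
    for req in waiting:
        if batch_tokens + req.token_cost > budget:
            break
        batch_tokens += req.token_cost
        admitted.append(req)
    return admitted


queue = [Request(512, 128), Request(2048, 512), Request(64, 64)]
print(len(admit(queue, batch_tokens=4096, budget=6000)))  # 1: only the first fits
```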

Fix

  • server: use float16
  • server: fix multinomial implementation in Sampling
  • server: do not use device_map auto on single GPU

Misc

  • docker: use nvidia base image

Full Changelog: v0.6.0...v0.7.0

v0.6.0

21 Apr 19:02
6ded76a

Features

  • server: flash attention past key values optimization (contributed by @njhill)
  • router: remove requests when client closes the connection (co-authored by @njhill)
  • server: support quantization for flash models
  • router: add info route
  • server: optimize token decode
  • server: support flash sharded santacoder
  • security: image signing with cosign
  • security: image analysis with trivy
  • docker: improve image size

Fix

  • server: check cuda capability before importing flash attention (see the sketch after this list)
  • server: fix hf_transfer issue with private repositories
  • router: add auth token for private tokenizers
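
A sketch of the capability gate behind the first fix: probe the device before attempting the import and fall back cleanly; the exact minimum capability (Turing, `sm_75`) is an assumption here:

```python
# Sketch: only import flash attention on GPUs new enough to run its
# kernels, otherwise fall back to the standard attention path.
import torch

HAS_FLASH_ATTN = False
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (7, 5):  # assumed cutoff: Turing or newer
        try:
            import flash_attn  # noqa: F401
            HAS_FLASH_ATTN = True
        except ImportError:
            pass
print("flash attention available:", HAS_FLASH_ATTN)
```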

Misc

  • rust: update to 1.69

v0.5.0

11 Apr 18:32
6f0f1d7

Features

  • server: add flash-attention based version of Llama
  • server: add flash-attention based version of Santacoder
  • server: support OPT models
  • router: make router input validation optional
  • docker: improve layer caching

Fix

  • server: improve token streaming decoding (see the sketch after this list)
  • server: fix escape characters in stop sequences
  • router: fix NCCL desync issues
  • router: use buckets for metrics histograms
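
The streaming-decoding fix concerns byte-level BPE: one token can end mid-character, so decoding tokens one at a time yields U+FFFD replacement characters. The usual remedy, sketched below with hypothetical offset names (whether this matches the release's exact implementation is an assumption), is to decode a sliding window and emit only the stable text diff:

```python
# Sketch: incremental detokenization that holds back output while the
# decoded tail is an incomplete byte sequence (ends with U+FFFD).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("héllo wörld")

prefix_offset, read_offset = 0, 0
for i in range(1, len(ids) + 1):
    prefix = tok.decode(ids[prefix_offset:read_offset])
    full = tok.decode(ids[prefix_offset:i])
    if not full.endswith("\ufffd"):
        print(full[len(prefix):], end="", flush=True)
        prefix_offset, read_offset = read_offset, i
print()
```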

v0.4.3

30 Mar 15:29
fef1a1c

Fix

  • router: fix OTLP distributed tracing initialization

v0.4.2

30 Mar 15:10
84722f3

Features

  • benchmark: tui based benchmarking tool
  • router: Clear cache on error
  • server: Add mypy-protobuf
  • server: reduce mlp and attn in one op for flash neox (see the sketch after this list)
  • image: aws sagemaker compatible image
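
On the flash-neox fusion: GPT-NeoX blocks with parallel residual compute `x + attn(ln1(x)) + mlp(ln2(x))`, so the two branches' output projections can be merged into a single matmul over a concatenated reduction dimension; whether that is the exact fusion shipped here is an assumption. A sketch of the algebra:

```python
# Sketch: attn_out @ W_o + mlp_hidden @ W_down equals one matmul over
# the concatenation of the two inputs and the two weight matrices.
import torch

hidden, attn_dim, ffn = 512, 512, 2048
W_o = torch.randn(attn_dim, hidden)         # attention output projection
W_down = torch.randn(ffn, hidden)           # MLP down projection

attn_out = torch.randn(4, attn_dim)         # per-token attention results
mlp_hidden = torch.randn(4, ffn)            # per-token MLP activations

ref = attn_out @ W_o + mlp_hidden @ W_down                  # two matmuls + add
W_fused = torch.cat([W_o, W_down], dim=0)                   # [attn_dim + ffn, hidden]
fused = torch.cat([attn_out, mlp_hidden], dim=1) @ W_fused  # one matmul

print(torch.allclose(ref, fused, atol=1e-3))
```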

Fix

  • server: avoid try/except to determine the kind of AutoModel
  • server: fix flash neox rotary embedding

v0.4.1

26 Mar 14:38
ab5fd8c

Features

  • server: New faster GPTNeoX implementation based on flash attention

Fix

  • server: fix input-length discrepancy between Rust and Python tokenizers