change HPU warmup logic: seq length should be with exponential growth #3217

kaixuanliu · 2025-05-09T09:22:03Z

@regisss please help review, thanks.

regisss

LGTM, we should probably do the same in vlm_causal_lm.py

kaixuanliu · 2025-05-09T10:01:02Z

It seems that vlm_causal_lm.py already has similar logic: L1503-L1506

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

regisss · 2025-05-10T08:13:27Z

@kaixuanliu I just tested this PR. I can confirm that warmup time is divided by a factor of ~2.
However, when sending a request to the server such as

curl 127.0.0.1:8080/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'

I get the following error:

2025-05-10T08:09:30.141092Z ERROR text_generation_launcher: Method Prefill encountered an error.                                             
Traceback (most recent call last):                                                                                                           
  File "/usr/local/bin/text-generation-server", line 8, in <module>                                                                          
    sys.exit(app())                                                                                                                          
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 322, in __call__                                                        
    return get_command(self)(*args, **kwargs)                                                                                                
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1161, in __call__                                                       
    return self.main(*args, **kwargs)                                                                                                        
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 740, in main                                                            
    return _main(                                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 195, in _main                                                           
    rv = self.invoke(ctx)                                                                                                                    
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1697, in invoke                                                         
    return _process_result(sub_ctx.command.invoke(sub_ctx))                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1443, in invoke                                                         
    return ctx.invoke(self.callback, **ctx.params)                                                                                           
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 788, in invoke                                                          
    return __callback(*args, **kwargs)                                                                                                       
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 697, in wrapper                                                         
    return callback(**use_params)                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 170, in serve                                           
    server.serve(                                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 320, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 176, in Prefill
    batch = self.model.batch_type.from_pb(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 605, in from_pb
    input_ids = torch.nn.functional.pad(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 5209, in pad
    return torch._C._nn.pad(input, pad, mode, value)
TypeError: pad(): argument 'pad' (position 2) must be tuple of ints, but found element of type float at pos 0
2025-05-10T08:09:30.141452Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: pad(): argument 'pad' (position 2) must be tuple of ints, but found element of type float at pos 0
2025-05-10T08:09:30.142004Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(20), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:546: Request failed during generation: Server error: pad(): argument 'pad' (position 2) must be tuple of ints, but found element of type float at pos 0

regisss · 2025-05-10T13:40:12Z

The cause was just a value that should have been an int but was a float. I pushed a commit to make sure the returned rounded sequence length is an int.
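A minimal sketch of the kind of fix described above, assuming the rounding helper returns a starting length multiplied by a power of some base. The names `round_up_exp` and `round_up_exp_int`, their parameters, and the values 128 and 2.0 are hypothetical and are not taken from the PR diff. It only shows how such arithmetic can produce a float, which `torch.nn.functional.pad` rejects as a pad length, and how a cast avoids that.

```python
import math


def round_up_exp(number: int, start: int, base: float) -> float:
    """Round `number` up to start * base**k (hypothetical helper, no cast)."""
    exponent = math.ceil(math.log(number / start, base))
    # With a float base (or a negative exponent) this product is a float.
    return start * base**exponent


def round_up_exp_int(number: int, start: int, base: float) -> int:
    """Same rounding, with the result cast so it can serve as a pad length."""
    return int(round_up_exp(number, start, base))


raw = round_up_exp(300, 128, 2.0)
fixed = round_up_exp_int(300, 128, 2.0)
print(type(raw).__name__, raw)      # float 512.0
print(type(fixed).__name__, fixed)  # int 512
```

Casting at the point where the rounded length is returned keeps every caller, including the padding code, working with plain ints.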

kaixuanliu · 2025-05-12T02:16:54Z

@regisss, sorry, I forgot one case: the value of the exponent here may be negative. It should be no less than 0 to align with the start value of the sequence length in the prefill phase. I made an adjustment in #3224, please help review.
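To illustrate the negative-exponent case, here is a hedged sketch. The helper names `round_up_exp` and `warmup_seq_lens`, the `clamp` flag, the start length 128, the maximum 1024, and the base 2 are all assumptions made for illustration, not code from #3224. For an input shorter than the start length the logarithm is negative, so the unclamped result falls below the smallest warmed-up length; clamping the exponent at 0 keeps it aligned with the start value.

```python
import math


def round_up_exp(number: int, start: int, base: int, clamp: bool = True) -> int:
    """Round `number` up to start * base**k, optionally clamping k at 0."""
    exponent = math.ceil(math.log(number / start, base))
    if clamp:
        # Never go below the start length used by the prefill warmup.
        exponent = max(0, exponent)
    return int(start * base**exponent)


def warmup_seq_lens(start: int, max_len: int, base: int) -> list[int]:
    """Exponentially growing lengths: start, start*base, ... up to max_len."""
    lens = []
    cur = start
    while cur <= max_len:
        lens.append(cur)
        cur *= base
    return lens


buckets = warmup_seq_lens(128, 1024, 2)
print(buckets)                                # [128, 256, 512, 1024]
print(round_up_exp(10, 128, 2, clamp=False))  # 16, not a warmed-up length
print(round_up_exp(10, 128, 2))               # 128
```

With the clamp, a short prompt is padded to the first warmed-up length instead of a smaller shape that was never compiled during warmup.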

regisss reviewed May 9, 2025


regisss previously approved these changes May 9, 2025


change HPU warmup logic: seq length should be with exponential growth

9c5ec4a

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>

Cast rounded sequence to int

966606d

regisss dismissed their stale review via 966606d May 10, 2025 13:38

regisss approved these changes May 10, 2025


regisss merged commit c94f415 into huggingface:main May 10, 2025

kaixuanliu deleted the seq-len-exp branch May 12, 2025 01:02

kaixuanliu mentioned this pull request May 12, 2025

adjust the round_up_seq logic to align with prefill warmup phase on… #3224

Merged
