kaixuanliu
Contributor

@regisss pls help review, thx

Collaborator

@regisss regisss left a comment

LGTM, we should probably do the same in vlm_causal_lm.py

@kaixuanliu
Contributor Author

It seems vlm_causal_lm.py already has similar logic: L1503-L1506

regisss
regisss previously approved these changes May 9, 2025
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@regisss
Collaborator

regisss commented May 10, 2025

@kaixuanliu I just tested this PR. I can confirm that warmup time is reduced by a factor of ~2.
However, when I send a request to the server such as

curl 127.0.0.1:8080/generate \
     -X POST \
     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
     -H 'Content-Type: application/json'

I get the following error:

2025-05-10T08:09:30.141092Z ERROR text_generation_launcher: Method Prefill encountered an error.                                             
Traceback (most recent call last):                                                                                                           
  File "/usr/local/bin/text-generation-server", line 8, in <module>                                                                          
    sys.exit(app())                                                                                                                          
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 322, in __call__                                                        
    return get_command(self)(*args, **kwargs)                                                                                                
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1161, in __call__                                                       
    return self.main(*args, **kwargs)                                                                                                        
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 740, in main                                                            
    return _main(                                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 195, in _main                                                           
    rv = self.invoke(ctx)                                                                                                                    
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1697, in invoke                                                         
    return _process_result(sub_ctx.command.invoke(sub_ctx))                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1443, in invoke                                                         
    return ctx.invoke(self.callback, **ctx.params)                                                                                           
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 788, in invoke                                                          
    return __callback(*args, **kwargs)                                                                                                       
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 697, in wrapper                                                         
    return callback(**use_params)                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 170, in serve                                           
    server.serve(                                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 320, in serve
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 25, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 176, in Prefill
    batch = self.model.batch_type.from_pb(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 605, in from_pb
    input_ids = torch.nn.functional.pad(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 5209, in pad
    return torch._C._nn.pad(input, pad, mode, value)
TypeError: pad(): argument 'pad' (position 2) must be tuple of ints, but found element of type float at pos 0
2025-05-10T08:09:30.141452Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: pad(): argument 'pad' (position 2) must be tuple of ints, but found element of type float at pos 0
2025-05-10T08:09:30.142004Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(20), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:546: Request failed during generation: Server error: pad(): argument 'pad' (position 2) must be tuple of ints, but found element of type float at pos 0

@regisss
Collaborator

regisss commented May 10, 2025

It was just an int that had become a float. I pushed a commit to make sure the returned rounded sequence length is an int.
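
For readers following the traceback, here is a minimal sketch of how this kind of error can arise and how an int cast avoids it. It is not the code from the PR: `rounded_len_float`, `rounded_len` and the power-of-two rounding are hypothetical stand-ins. The only facts taken from the log are that `torch.nn.functional.pad` rejected a float in its `pad` tuple.

```python
import math


def rounded_len_float(length: int) -> float:
    """Hypothetical helper: round `length` up to the next power of two.

    math.pow always returns a float, even for integer arguments,
    so the result is e.g. 8.0 rather than 8.
    """
    exponent = math.ceil(math.log2(length))
    return math.pow(2, exponent)


def rounded_len(length: int) -> int:
    """Same rounding, but cast to int so it can be used as a pad size."""
    return int(rounded_len_float(length))


seq_len = 5
# A pad tuple built from the float version contains a float (3.0),
# which is the kind of value the traceback above complains about.
bad_pad = (0, rounded_len_float(seq_len) - seq_len)
# Casting the rounded length to int keeps every element an int.
good_pad = (0, rounded_len(seq_len) - seq_len)
print(bad_pad, good_pad)
```

Casting once, where the rounded length is returned, keeps every downstream pad computation integer-typed without touching the call sites.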

@regisss regisss merged commit c94f415 into huggingface:main May 10, 2025
@kaixuanliu kaixuanliu deleted the seq-len-exp branch May 12, 2025 01:02
@kaixuanliu
Contributor Author

@regisss, sorry, I forgot one case: the value of the exponent here may be negative. It should be no less than 0 to align with the start value of the sequence length in the prefill phase. I made an adjustment in #3224, please help review.
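
A rough illustration of the negative-exponent case described above, under assumptions: the names `bucket_exponent` and `bucketed_len`, the start value of 128 and the base of 2 are all hypothetical and not taken from the PR or from #3224. The sketch only shows why clamping the exponent at 0 keeps the rounded length from dropping below the start value.

```python
import math


def bucket_exponent(length: int, start: int = 128) -> int:
    """Hypothetical: exponent k such that start * 2**k >= length.

    For length < start the raw exponent is negative, so it is clamped to 0.
    """
    raw = math.ceil(math.log2(length / start))
    return max(0, raw)


def bucketed_len(length: int, start: int = 128) -> int:
    """Round `length` up to start * 2**k with k >= 0; the result is an int."""
    return start * 2 ** bucket_exponent(length, start)


# Without the clamp, a length of 20 would give exponent -2 and a rounded
# length of 32, which is below the assumed start value of 128.
for n in (20, 128, 129, 256):
    print(n, bucket_exponent(n), bucketed_len(n))
```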
