Investing.com -- Alibaba Cloud has published a paper detailing Aegaeon, its GPU resource optimization solution for concurrent large language model (LLM) inference, the company announced Monday.
The cloud computing arm of Alibaba Group also said the new approach cut the number of GPUs required in deployment by 82%.
LLM inference traffic typically arrives in bursts, which makes it hard to keep GPUs efficiently utilized. Alibaba Cloud improved utilization by scheduling work at the granularity of individual tokens rather than whole requests.
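To make the distinction concrete, the minimal Python sketch below contrasts token-level scheduling with request-level scheduling. It is an illustration only: the `Request` class, the `generate_one_token` stand-in, and the round-robin loop are assumptions for demonstration, not Aegaeon's actual design.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    """One inference request; tokens_left counts decode steps remaining."""
    req_id: str
    tokens_left: int
    output: list = field(default_factory=list)

def generate_one_token(req: Request) -> str:
    """Stand-in for a single decode step on the model (hypothetical)."""
    return f"tok{len(req.output)}"

def token_level_schedule(requests):
    """Serve requests one token at a time instead of one request at a time.

    Request-level scheduling runs each request to completion before
    starting the next, so a burst of short requests stalls behind a long
    one. Interleaving at token granularity keeps the GPU busy across all
    active requests and lets new arrivals join between any two tokens.
    """
    queue = deque(requests)
    while queue:
        req = queue.popleft()
        req.output.append(generate_one_token(req))  # one decode step
        req.tokens_left -= 1
        if req.tokens_left > 0:
            queue.append(req)  # re-queue: yields the GPU between tokens
        else:
            print(f"{req.req_id} done: {len(req.output)} tokens")

token_level_schedule([Request("short", 3), Request("long", 8)])
```

Under this kind of scheme, a short burst request finishes after a few interleaved steps instead of waiting for the long request to drain.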
The solution further speeds up inference by splitting processing into two phases, prefill and decoding, and handling each in a separate GPU pool.
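The sketch below illustrates the general prefill/decode disaggregation pattern with two worker pools handing off a placeholder KV cache. The queue names, worker functions, and toy decode rule are illustrative assumptions, not Alibaba Cloud's implementation; a real system would transfer GPU memory between pools.

```python
import queue
import threading

prefill_jobs = queue.Queue()   # prompts waiting for prefill
decode_jobs = queue.Queue()    # (request_id, kv_cache) handoffs

def prefill_worker():
    """Prefill pool: compute-bound pass over the full prompt.

    Builds the KV cache for the prompt in one batch-friendly step,
    then hands the request off to the decode pool. (The KV cache
    here is a placeholder dict, not real GPU state.)"""
    while True:
        req_id, prompt = prefill_jobs.get()
        kv_cache = {"tokens": prompt.split()}  # stand-in for the real cache
        decode_jobs.put((req_id, kv_cache))
        prefill_jobs.task_done()

def decode_worker():
    """Decode pool: memory-bound, one token per step per request."""
    while True:
        req_id, kv_cache = decode_jobs.get()
        n_out = len(kv_cache["tokens"])  # toy rule: echo prompt length
        print(f"{req_id}: decoded {n_out} tokens")
        decode_jobs.task_done()

# Separate pools can be sized and batched independently, since the
# two phases have very different GPU utilization profiles.
for target in (prefill_worker, decode_worker):
    threading.Thread(target=target, daemon=True).start()

prefill_jobs.put(("r1", "why is the sky blue"))
prefill_jobs.put(("r2", "hello"))
prefill_jobs.join(); decode_jobs.join()
```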
If commercialized, the optimization would likely reduce AI inference server costs and could lift demand for non-GPGPU server semiconductors and specialized processing elements (SPEs).
This article was generated with the support of AI and reviewed by an editor. For more information see our T&C.