Introducing vLLM – an open source LLM inference and serving library that delivers up to 24x higher throughput than HuggingFace Transformers

https://vllm.ai/

Large language models, or LLMs for short, have emerged as a transformative advance in the field of artificial intelligence (AI). Models like GPT-3 have reshaped natural language understanding: by learning from vast amounts of existing data and generating human-like text, they hold immense potential to shape the future of AI and open up new possibilities for human-machine interaction and communication. Despite this success, a significant challenge remains: LLMs are computationally expensive, and they can be slow even on powerful hardware. Because these models contain millions to billions of parameters, training and serving them requires large amounts of compute, memory, and processing power that are not always accessible. Moreover, slow response times can make LLMs impractical for real-time or interactive applications. Addressing these challenges is essential to unlocking the full potential of LLMs and making their benefits more widely accessible.

To address this problem, researchers at the University of California, Berkeley developed vLLM, an open source library that offers a simpler, faster, and cheaper alternative for LLM inference and serving. The Large Model Systems Organization (LMSYS) already uses the library to power its Vicuna demo and Chatbot Arena. By switching from their initial HuggingFace Transformers-based backend to vLLM, the organization was able to handle traffic peaks efficiently (5x more than before) with limited computational resources, reducing high operating costs. vLLM currently supports several HuggingFace models, such as GPT-2, GPT BigCode, and LLaMA, to name a few, and it achieves up to 24x higher throughput than HuggingFace Transformers while keeping the same model architecture and requiring no modifications.

As part of their preliminary research, the Berkeley researchers determined that memory-related issues are the primary constraint on LLM serving performance. LLMs use input tokens to generate attention key and value tensors, which are then cached in GPU memory to generate subsequent tokens. These dynamic key and value tensors, known as the KV cache, take up a substantial portion of memory, and managing them becomes a cumbersome task. To address this challenge, the researchers introduced PagedAttention, a new attention algorithm that extends the conventional idea of paging in operating systems to LLM serving. PagedAttention offers a more flexible approach to handling key and value tensors by storing them in non-contiguous memory spaces, eliminating the need for long contiguous memory blocks. These blocks can be fetched independently through a block table during attention computation, leading to more efficient memory usage. This technique reduces memory waste to less than 4%, with near-optimal memory usage, and lets PagedAttention batch up to 5x more sequences together, improving GPU utilization and throughput.
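To make the block-table idea concrete, here is a small, purely illustrative Python sketch; it is not vLLM's internal code, and the block size and names are invented for the example. The cached key vectors for a sequence live in fixed-size blocks scattered across a preallocated pool, and a per-sequence block table maps logical block indices to physical block IDs when the cache is gathered for attention.

```python
# Illustrative sketch only -- not vLLM's implementation.
# KV vectors for one sequence live in fixed-size blocks scattered across a
# physical pool; a block table maps logical block indices to physical ones.
import numpy as np

BLOCK_SIZE = 4      # tokens per KV block (toy value)
HEAD_DIM = 8        # toy head dimension

# A "physical" pool of key blocks, allocated up front.
num_physical_blocks = 16
key_pool = np.zeros((num_physical_blocks, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)

# Block table for one sequence: logical block i -> physical block id.
# The physical blocks need not be contiguous (e.g. 7, 2, 11).
block_table = [7, 2, 11]

def gather_keys(num_tokens: int) -> np.ndarray:
    """Collect the cached keys for the first `num_tokens` tokens of the sequence."""
    keys = []
    for logical_start in range(0, num_tokens, BLOCK_SIZE):
        physical_id = block_table[logical_start // BLOCK_SIZE]
        tokens_in_block = min(BLOCK_SIZE, num_tokens - logical_start)
        keys.append(key_pool[physical_id, :tokens_in_block])
    return np.concatenate(keys, axis=0)   # shape: (num_tokens, HEAD_DIM)

print(gather_keys(10).shape)   # (10, 8) -- gathered from three scattered blocks
```

Because a sequence only ever holds whole blocks plus one partially filled block, the wasted memory is bounded by a single block per sequence, which is where the low-fragmentation claim comes from.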


PagedAttention offers the added benefit of efficient memory sharing. During parallel sampling, when multiple output sequences are generated at the same time from a single prompt, PagedAttention allows the computation and memory associated with that prompt to be shared. This is accomplished through the block table: different sequences can share blocks by mapping their logical blocks to the same physical block. Using this memory sharing mechanism, PagedAttention not only minimizes memory usage but also ensures safe sharing. Experimental evaluations by the researchers showed that parallel sampling could reduce memory usage by as much as 55%, resulting in a 2.2x increase in throughput.
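The sketch below illustrates, again in illustrative Python rather than vLLM internals, one way such sharing can work: sequences forked from the same prompt start out pointing at the same physical blocks, a reference count tracks how many sequences use each block, and a block is only duplicated when a sequence needs to write to a shared one (copy-on-write). The function names and data structures are assumptions made for the example.

```python
# Toy sketch of prompt-block sharing during parallel sampling; the structure is
# inspired by the PagedAttention description but is not vLLM's actual code.
from copy import copy

ref_count = {}                 # physical block id -> number of sequences using it

def fork_sequence(parent_table):
    """Two samples from the same prompt start out sharing every physical block."""
    child_table = copy(parent_table)
    for block_id in child_table:
        ref_count[block_id] = ref_count.get(block_id, 1) + 1
    return child_table

def write_to_block(table, logical_idx, free_blocks):
    """Copy-on-write: only duplicate a block when a sequence diverges."""
    block_id = table[logical_idx]
    if ref_count.get(block_id, 1) > 1:          # shared -> make a private copy
        ref_count[block_id] -= 1
        new_id = free_blocks.pop()
        table[logical_idx] = new_id
        ref_count[new_id] = 1
    return table[logical_idx]

prompt_table = [7, 2, 11]
ref_count.update({b: 1 for b in prompt_table})
sample_a = fork_sequence(prompt_table)
sample_b = fork_sequence(prompt_table)
print(write_to_block(sample_a, 2, free_blocks=[12, 13]))  # 13: sample_a gets its own block
print(sample_b[2])                                         # 11: sample_b still shares the original
```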

To sum up, vLLM effectively handles attention key and value memory management through its PagedAttention mechanism, which results in outstanding throughput. Furthermore, vLLM integrates seamlessly with well-known HuggingFace models and can be used with different decoding algorithms, such as parallel sampling. The library can be installed with a simple pip command and is currently available for both offline inference and online serving.
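For orientation, the snippet below follows the offline-inference quickstart from the vLLM documentation at the time of writing (install with `pip install vllm`); the model name and prompts are only examples, and any supported HuggingFace model can be substituted.

```python
# Minimal offline-inference example based on vLLM's documented quickstart.
# pip install vllm
from vllm import LLM, SamplingParams

prompts = ["The future of AI is", "vLLM makes LLM serving"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")          # loads the HuggingFace model unchanged
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```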


Check out the vLLM blog post and GitHub repository for more details.


Khushboo Gupta is a Consulting Intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development. She likes to learn more about the technical field by participating in different challenges.


