Ask HN: How to deploy LLMs to serve a million users?

What are the current practices for serving inherently large LLMs to a large number of users (with fast tokens/second)?

  • I don't work at OpenAI, but as a systems engineer it's pretty easy to conceptualize how they'd do it.

    - Instead of serving one 175B-parameter model, ChatGPT is probably a collection of smaller models: a cheap classifier labels the input and routes it to a model optimized for that task (classification, summarization, coding, and so on), which cuts the memory footprint per request. (See the router sketch after this list.)

    - Faking memory of the chat. Context size = compute time. If they can "compress" a conversation by summarizing previous context, that cuts the number of tokens the model has to process for each response. It certainly feels like they're doing something like this: as the chat gets longer, the model gets worse at recalling earlier context. (See the history-compression sketch after this list.)

    - Model quantization: this is heavily used in the OSS world, and I'd be shocked if OpenAI weren't trying something similar. You can quantize models without losing much accuracy and further reduce the memory footprint. (See the quantization sketch after this list.)

    - Heavy use of caching. If I were them I'd be caching like crazy. There are a lot of clever tricks with caching variants so users don't realize they're looking at a cached response: take the cached response from the big LLM, then use a simple LLM to reword it so users think they're still talking to the big, smart model. (See the caching sketch after this list.)

    - All the standard stuff in a BigCo stack (containers, some kind of IaC, CI/CD, testing environments, various databases and caching software, service discovery, observability tooling, etc.), and probably a shitload of glue code that makes the magic happen behind the scenes.
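
    A minimal sketch of the router idea, assuming hypothetical specialist models and a stand-in classify() function (none of this is OpenAI's actual stack):

      # Hypothetical router: a cheap classifier picks a task label, then the
      # request is dispatched to a smaller model tuned for that task. The
      # "models" here are stub lambdas standing in for real model calls.
      SPECIALISTS = {
          "summarization": lambda prompt: f"[sum-model] {prompt}",
          "coding":        lambda prompt: f"[code-model] {prompt}",
          "general":       lambda prompt: f"[big-model] {prompt}",
      }

      def classify(prompt: str) -> str:
          """Stand-in for a small classifier model; keyword rules for the demo."""
          lowered = prompt.lower()
          if "summarize" in lowered:
              return "summarization"
          if "def " in prompt or "code" in lowered:
              return "coding"
          return "general"

      def route(prompt: str) -> str:
          # Unrecognized tasks fall back to the large general model.
          return SPECIALISTS.get(classify(prompt), SPECIALISTS["general"])(prompt)

      print(route("Summarize this article for me."))  # -> [sum-model] ...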
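
    A sketch of the history-compression idea from the second bullet, with cheap_summarize() as a hypothetical stand-in for a real summarization-model call:

      # Keep the most recent turns verbatim and replace older ones with a
      # summary from a cheap summarizer, bounding the prompt size.
      KEEP_RECENT = 4  # recent messages kept verbatim (tunable)

      def cheap_summarize(messages):
          """Stand-in for a small summarization-model call."""
          text = " ".join(m["content"] for m in messages)
          return text[:200] + ("..." if len(text) > 200 else "")

      def compress_history(messages):
          if len(messages) <= KEEP_RECENT:
              return messages
          old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
          summary = {"role": "system",
                     "content": "Summary of earlier conversation: " + cheap_summarize(old)}
          return [summary] + recent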
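
    The quantization point can be illustrated with PyTorch's built-in dynamic quantization (int8 weights for Linear layers). Real LLM deployments use more aggressive schemes (4-bit GPTQ/AWQ and the like), but the memory-saving principle is the same:

      import torch
      import torch.nn as nn

      # Toy model standing in for a transformer's Linear-heavy weights.
      model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

      # Replace Linear layers with int8-weight versions; activations stay fp32.
      quantized = torch.quantization.quantize_dynamic(
          model, {nn.Linear}, dtype=torch.qint8
      )

      fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
      print(f"fp32 weights: {fp32_bytes / 1e6:.0f} MB")  # int8 cuts this ~4x
      # (Quantized weights live in packed buffers rather than .parameters(),
      # so compare saved checkpoint sizes for an exact number.)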
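
    And a sketch of the cache-plus-reword trick, with big_model() and cheap_reword() as hypothetical stand-ins:

      import hashlib

      cache: dict[str, str] = {}

      def big_model(prompt: str) -> str:
          return f"[expensive answer to: {prompt}]"

      def cheap_reword(text: str) -> str:
          return f"[reworded] {text}"

      def cache_key(prompt: str) -> str:
          # Normalize whitespace/case so near-identical prompts share a key.
          normalized = " ".join(prompt.lower().split())
          return hashlib.sha256(normalized.encode()).hexdigest()

      def answer(prompt: str) -> str:
          key = cache_key(prompt)
          if key in cache:
              return cheap_reword(cache[key])  # hit: rephrase the cached answer
          response = big_model(prompt)         # miss: pay for the big model once
          cache[key] = response
          return response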

  • It’s a lot simpler than you think: model serving is horizontally scalable. You deploy the model on top of GPUs, each replica fronted by a web server; a load balancer distributes requests, and probably a WebSocket streams the inference back to the client in chunks.
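
    A minimal sketch of one such stateless replica, assuming FastAPI and a stand-in run_model() (framework choice and names are mine, not anything the commenter specified); a load balancer like nginx would sit in front of many of these:

      import asyncio
      from fastapi import FastAPI
      from fastapi.responses import StreamingResponse

      app = FastAPI()

      async def run_model(prompt: str):
          # Stand-in for real GPU inference: yield fake tokens with a delay.
          for token in f"echoing: {prompt}".split():
              await asyncio.sleep(0.05)
              yield token + " "

      @app.post("/generate")
      async def generate(payload: dict):
          # Chunked HTTP streaming; a WebSocket works just as well for chat UIs.
          return StreamingResponse(run_model(payload.get("prompt", "")),
                                   media_type="text/plain")

      # Run with: uvicorn server:app --port 8000 (one process per GPU replica)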

  • I guess you'd need a good relationship with a cloud provider, because I'm not even sure a regular customer could rent that many GPUs through normal means, or else you'd build your own servers/datacenters.

    Then I guess some basic, decent load balancing/queuing on your inference endpoints.
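
    A rough, purely illustrative sketch of the queuing part: requests wait in a queue that a fixed pool of workers (say, one per GPU-backed endpoint) drains, so bursts queue up instead of overloading the model servers:

      import asyncio

      NUM_WORKERS = 4  # e.g. one worker per GPU-backed inference endpoint

      async def infer(prompt: str) -> str:
          await asyncio.sleep(0.1)  # stand-in for a real inference call
          return f"answer({prompt})"

      async def worker(queue: asyncio.Queue):
          while True:
              prompt, fut = await queue.get()
              fut.set_result(await infer(prompt))
              queue.task_done()

      async def main():
          queue: asyncio.Queue = asyncio.Queue()
          workers = [asyncio.create_task(worker(queue)) for _ in range(NUM_WORKERS)]
          loop = asyncio.get_running_loop()
          futures = []
          for i in range(10):
              fut = loop.create_future()
              await queue.put((f"request {i}", fut))
              futures.append(fut)
          print(await asyncio.gather(*futures))
          for w in workers:
              w.cancel()

      asyncio.run(main())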

  • There’s this, but it’s two years old now: https://openai.com/research/scaling-kubernetes-to-7500-nodes

  • Serious answer? Get given $10 billion of cloud computing credit.

  • Make it open source, let the community figure it out. Hobbyists will optimize. :-)

  • From the horse's mouth ;)

    Deploying a large language model like OpenAI's GPT-3 for a million users requires careful planning and consideration of various technical and logistical aspects. Here are some steps that can be taken to deploy a large language model for a large number of users:

    - Determine the hardware and infrastructure requirements: Deploying a large language model for a million users requires significant computing power and infrastructure. You will need to decide whether to use cloud-based solutions like AWS, Azure, or Google Cloud, or build your own hardware infrastructure.

    - Build a scalable architecture: To ensure that your deployment can handle a million users, you need to build a scalable architecture that can accommodate additional users as needed. This may involve setting up a load balancer and multiple servers to distribute the load.

    - Ensure data security: Since a large language model like GPT-3 deals with sensitive information, it is important to ensure data security. This may involve encrypting data, using secure APIs, and setting up firewalls to protect against cyber-attacks.

    - Develop APIs: To enable users to access the language model, you need to develop APIs that can be accessed through a web or mobile application. These APIs should be designed to handle multiple requests simultaneously.

    - Test and monitor the deployment: Before deploying the language model for a million users, it is important to thoroughly test the system to ensure it is functioning as expected. You should also set up monitoring tools to track performance and detect any issues.

    - Provide user support: When deploying a large language model for a million users, it is important to provide user support to address any issues or concerns that users may have. This may involve setting up a helpdesk or providing documentation to help users understand how to use the system.

    Overall, deploying a large language model like OpenAI's GPT-3 for a million users requires a significant investment of resources and expertise. By carefully planning and executing each step of the deployment process, you can ensure that the system is scalable, secure, and user-friendly.

  • Is it that much different from other APIs serving large numbers of users?

    (other than the large GPU costs and lower request throughput per instance)