YouTube Summary
DevOps

Ollama and Cloud Run with GPUs

12/3/24

This demonstration showcases how easily large language models (LLMs) such as Llama 3.1 and Gemma 2 can be deployed on Google Cloud Run with GPUs. Using the official Ollama container image, a Gemma 2 model is deployed to Cloud Run with GPU acceleration, enabling fast inference through both the Ollama CLI and the Open WebUI, each accessed via a Cloud Run service.

Deploying LLMs with Ollama and Cloud Run

00:00:07 Ollama simplifies running LLMs, including Llama 3.1, Gemma 2, and others, across operating systems and with GPU acceleration. It also provides a path for deploying AI models to the cloud, as demonstrated here with Google Cloud Run, giving scalable access to the models from any location.
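Once such a service is running, the local Ollama CLI can be pointed at the remote endpoint through the `OLLAMA_HOST` environment variable. A minimal sketch, assuming a deployed service (the URL below is a placeholder for the one Cloud Run generates):

```shell
# Point the local Ollama CLI at a remote Cloud Run service.
# The URL is a placeholder; substitute the URL Cloud Run assigns.
export OLLAMA_HOST=https://ollama-abc123-uc.a.run.app

# Pull the model on the remote instance and run a prompt against it.
ollama run gemma2 "Explain Cloud Run GPUs in one sentence."
```

Because the CLI talks to whatever host `OLLAMA_HOST` names, the same commands work unchanged against a local daemon or the Cloud Run deployment.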

Cloud Run GPU-Enabled Service Deployment

00:01:30 A GPU-enabled Cloud Run service is created from the official Ollama container image, with resources allocated for memory, CPUs, and an Nvidia L4 GPU. The service is configured to expose port 11434 (Ollama's default), connect to a VPC for traffic routing, and mount a Google Cloud Storage bucket as a data volume for model weights. Once deployed, the service is reachable at a generated URL.
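The configuration described above can be sketched as a single `gcloud` deployment command. This is an illustrative sketch, not the exact command from the video: the service name, region, bucket, and network names are placeholder assumptions, and depending on your gcloud version, GPU flags may require the beta track:

```shell
# Deploy Ollama to Cloud Run with an Nvidia L4 GPU, a VPC connection,
# and a Cloud Storage bucket mounted for model data.
# Placeholders: service name "ollama", bucket "my-ollama-models",
# network/subnet "default", region "us-central1".
gcloud beta run deploy ollama \
  --image=ollama/ollama \
  --region=us-central1 \
  --port=11434 \
  --memory=16Gi \
  --cpu=4 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --network=default \
  --subnet=default \
  --add-volume=name=models,type=cloud-storage,bucket=my-ollama-models \
  --add-volume-mount=volume=models,mount-path=/root/.ollama
```

Mounting the bucket at `/root/.ollama` (Ollama's default model directory) lets downloaded model weights persist across container instances instead of being re-pulled on every cold start.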