LLM Deployment on Google Cloud (Vertex AI)

We design and deploy artificial intelligence systems, from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business.
Complexity: Medium · Timeline: ~3-5 business days

Deploying LLM to Google Cloud Vertex AI

Vertex AI is Google Cloud's managed ML platform. For LLMs it offers Model Garden (ready-made models), Vertex AI Endpoints (custom deployment), Workbench (managed notebooks), and Pipelines (ML pipelines). Deep TPU integration is a unique advantage of GCP.

Vertex AI Model Garden

Ready-made models deployable in one click: Llama 3, Gemma, Mistral – no infrastructure setup required:

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")

# Gemini via Vertex AI
model = GenerativeModel("gemini-1.5-pro-002")
response = model.generate_content(
    "Explain machine learning",
    generation_config={"max_output_tokens": 512, "temperature": 0.7},
)
print(response.text)

Custom deployment via Vertex AI Endpoints

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Upload model artifacts from GCS
model = aiplatform.Model.upload(
    display_name="llama3-8b-vllm",
    artifact_uri="gs://my-bucket/models/llama3-8b/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.2-2:latest",
    serving_container_command=[
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model=/gcs/models/llama3-8b/",
        "--tensor-parallel-size=1",
        "--max-model-len=8192",
        "--host=0.0.0.0",
        "--port=8080"
    ],
    serving_container_ports=[{"containerPort": 8080}],
    serving_container_health_route="/health",
    serving_container_predict_route="/v1/completions",
    serving_container_environment_variables={
        "TRANSFORMERS_CACHE": "/gcs/hf_cache/",
    }
)

# Deploy the model to an endpoint
endpoint = aiplatform.Endpoint.create(display_name="llama3-8b-endpoint")
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="llama3-8b-v1",
    machine_type="g2-standard-12",     # 1x L4 GPU
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=10,             # autoscaling
    traffic_percentage=100,
)

Invocation via REST API

import google.auth
import google.auth.transport.requests
import requests

def invoke_vertex_endpoint(project: str, endpoint_id: str, payload: dict) -> dict:
    credentials, _ = google.auth.default()
    request = google.auth.transport.requests.Request()
    credentials.refresh(request)

    url = f"https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/us-central1/endpoints/{endpoint_id}:rawPredict"

    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {credentials.token}"},
        json=payload
    )
    return response.json()
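Since the endpoint above serves vLLM's OpenAI-compatible completions route, the `rawPredict` payload follows the OpenAI completions schema. A small helper, sketched here with a hypothetical name (`build_vllm_payload`), makes the call site cleaner:

```python
def build_vllm_payload(prompt: str, max_tokens: int = 256,
                       temperature: float = 0.7) -> dict:
    # Payload shape for vLLM's OpenAI-compatible /v1/completions route;
    # "model" must match the path passed via --model at deploy time.
    return {
        "model": "/gcs/models/llama3-8b/",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

# result = invoke_vertex_endpoint("my-project", "1234567890",
#                                 build_vllm_payload("Explain machine learning"))
```

The endpoint ID is the numeric ID visible in the Vertex AI console or in `endpoint.resource_name` after creation.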

Cloud TPU for LLM

TPU v4/v5 is a competitive advantage of GCP over AWS/Azure for transformer models. JAX + MaxText enable efficient LLM inference on TPU:

# Launch Llama on TPU v5e via JetStream
gcloud compute tpus tpu-vm create llm-tpu \
  --zone=us-central2-b \
  --accelerator-type=v5litepod-8 \
  --version=v2-tpuv5-lite

# On the TPU VM
pip install jetstream maxengine
python -m jetstream.core.implementations.maxtext.server \
  --model=llama-3-8b \
  --tokenizer_path=tokenizer.model \
  --load_parameters_path=gs://bucket/llama3-8b/ \
  --port=9000

TPU v5e vs A100: for large batches, roughly 2x the throughput at roughly 0.4x the cost per token.
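The cost comparison above can be sanity-checked with simple arithmetic. The helper below is a minimal sketch; the prices and throughput figures in the comment are illustrative placeholders, not current GCP list prices:

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    # $/1M tokens = hourly instance price / tokens generated per hour, scaled to 1M
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Illustrative placeholder numbers only (check current pricing and your own benchmarks):
# a100_cost = cost_per_million_tokens(3.67, 1500)
# v5e_cost = cost_per_million_tokens(1.20, 2500)
```

Plugging in your own measured throughput and the current on-demand price for each accelerator gives a like-for-like cost-per-token comparison.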

Vertex AI Pipelines for ML

from kfp import dsl, compiler
from google.cloud import aiplatform

@dsl.component(base_image="python:3.11")
def train_model(data_path: str, output_path: str):
    import subprocess
    subprocess.run(
        ["python", "train.py", "--data", data_path, "--output", output_path],
        check=True,
    )

@dsl.pipeline(name="llm-fine-tuning-pipeline")
def fine_tuning_pipeline(base_model: str, data_gcs_path: str):
    # prepare_data is a separate @dsl.component (definition omitted here)
    prepare_data_task = prepare_data(gcs_path=data_gcs_path)
    train_task = train_model(
        data_path=prepare_data_task.output,
        output_path="gs://bucket/fine-tuned/"
    ).set_accelerator_type("NVIDIA_A100_80GB").set_accelerator_limit(4)

compiler.Compiler().compile(fine_tuning_pipeline, "pipeline.yaml")

job = aiplatform.PipelineJob(
    display_name="llm-fine-tuning",
    template_path="pipeline.yaml",
    parameter_values={"base_model": "meta-llama/Llama-3-8b", "data_gcs_path": "gs://..."}
)
job.run()

Monitoring via Cloud Monitoring

Vertex AI automatically publishes metrics to Cloud Monitoring: prediction latency, request count, error rate. Custom business metrics can be attached via Cloud Logging plus log-based metrics.
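The built-in metrics can be queried programmatically. The helper below builds the filter and time interval for a Cloud Monitoring time-series query on online prediction latency; the metric type string is an assumption based on the `aiplatform.googleapis.com` namespace, so verify it against the metrics explorer for your project:

```python
import time

def latency_query(minutes: int) -> dict:
    # Build filter and interval (epoch seconds) for a Cloud Monitoring
    # time-series query on Vertex AI online prediction latency.
    metric = "aiplatform.googleapis.com/prediction/online/prediction_latencies"
    now = int(time.time())
    return {
        "filter": f'metric.type = "{metric}"',
        "interval": {
            "start_time": {"seconds": now - minutes * 60},
            "end_time": {"seconds": now},
        },
    }

# With google-cloud-monitoring installed, pass these fields into
# MetricServiceClient().list_time_series(request={"name": f"projects/{pid}", **latency_query(60)})
```

The same pattern works for request count and error-rate metrics by swapping the metric type in the filter.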