Deploying LLMs to Google Cloud Vertex AI
Vertex AI is Google Cloud's managed ML platform. For LLM workloads it offers Model Garden (ready-made models), Vertex AI Endpoints (custom deployments), Workbench (managed notebooks), and Pipelines (ML workflows). Deep TPU integration is a unique advantage of GCP.
Vertex AI Model Garden
Ready-made models with one click: Llama-3, Gemma, Mistral – deploy without infrastructure setup:
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")

# Gemini via Vertex AI
model = GenerativeModel("gemini-1.5-pro-002")
response = model.generate_content(
    "Explain machine learning",
    generation_config={"max_output_tokens": 512, "temperature": 0.7},
)
print(response.text)
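Gemini calls can fail transiently (quota or overload errors), so production callers usually retry with exponential backoff. A minimal sketch, assuming a model object like the one above; the helper names here are ours, not part of the Vertex SDK:

```python
import random
import time

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff schedule: base * 2^attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]

def generate_with_retry(model, prompt: str, max_retries: int = 4):
    """Call model.generate_content, retrying on transient errors with backoff + jitter."""
    for delay in backoff_delays(max_retries):
        try:
            return model.generate_content(prompt)
        except Exception:  # in real code, catch e.g. google.api_core.exceptions.ResourceExhausted
            time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids thundering herd
    raise RuntimeError("Gemini call failed after retries")
```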
Custom deployment via Vertex AI Endpoints
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Upload the model from GCS with a vLLM serving container
model = aiplatform.Model.upload(
    display_name="llama3-8b-vllm",
    artifact_uri="gs://my-bucket/models/llama3-8b/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/pytorch-gpu.2-2:latest",
    serving_container_command=[
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model=/gcs/models/llama3-8b/",
        "--tensor-parallel-size=1",
        "--max-model-len=8192",
        "--host=0.0.0.0",
        "--port=8080",
    ],
    serving_container_ports=[8080],  # the SDK expects a sequence of ints
    serving_container_health_route="/health",
    serving_container_predict_route="/v1/completions",
    serving_container_environment_variables={
        "TRANSFORMERS_CACHE": "/gcs/hf_cache/",
    },
)
# Create an endpoint and deploy the model to it
endpoint = aiplatform.Endpoint.create(display_name="llama3-8b-endpoint")
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="llama3-8b-v1",
    machine_type="g2-standard-12",  # 1x L4 GPU
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=10,  # autoscaling up to 10 replicas
    traffic_percentage=100,
)
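When rolling out a new model version, it is common to canary it by sending only a fraction of traffic to the new deployment; Vertex endpoints support this via a traffic split (deployed model ID to percentage). The validation helper below is a hypothetical sketch for sanity-checking a split before applying it, not part of the SDK:

```python
def validate_traffic_split(split: dict[str, int]) -> dict[str, int]:
    """Ensure a Vertex-style traffic split (deployed_model_id -> percent) sums to 100."""
    if any(p < 0 or p > 100 for p in split.values()):
        raise ValueError(f"percentages must be in [0, 100]: {split}")
    if sum(split.values()) != 100:
        raise ValueError(f"traffic split must sum to 100, got {sum(split.values())}: {split}")
    return split

# Canary: keep 90% on the old deployed model, 10% on the new one
# ("old-model-id" / "new-model-id" are placeholders for real deployed model IDs).
canary = validate_traffic_split({"old-model-id": 90, "new-model-id": 10})
```

The resulting dict would then be passed as the `traffic_split` argument when deploying the new model version to the same endpoint.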
Invocation via REST API
import google.auth
import google.auth.transport.requests
import requests

def invoke_vertex_endpoint(project: str, endpoint_id: str, payload: dict) -> dict:
    credentials, _ = google.auth.default()
    request = google.auth.transport.requests.Request()
    credentials.refresh(request)
    url = (
        f"https://us-central1-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/us-central1/endpoints/{endpoint_id}:rawPredict"
    )
    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {credentials.token}"},
        json=payload,
    )
    return response.json()
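Since the serving container exposes vLLM's OpenAI-compatible /v1/completions route, the payload follows the OpenAI completions schema. A sketch of building one (the helper name is ours; the model path assumes the --model value the server was started with):

```python
def build_completion_payload(prompt: str, model_path: str = "/gcs/models/llama3-8b/",
                             max_tokens: int = 256, temperature: float = 0.7) -> dict:
    """OpenAI-compatible completions request body, as served by vLLM."""
    return {
        "model": model_path,  # must match the --model path the server was launched with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_completion_payload("Explain machine learning in one paragraph")
# result = invoke_vertex_endpoint("my-project", "1234567890", payload)
# print(result["choices"][0]["text"])
```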
Cloud TPU for LLM
TPU v4/v5 accelerators are a competitive advantage of GCP over AWS/Azure for transformer workloads. The JAX + MaxText stack, served via JetStream, provides efficient LLM inference on TPUs:
# Run Llama on TPU v5e via JetStream
gcloud compute tpus tpu-vm create llm-tpu \
    --zone=us-central2-b \
    --accelerator-type=v5litepod-8 \
    --version=v2-tpuv5-lite

# On the TPU VM
pip install jetstream maxengine
python -m jetstream.core.implementations.maxtext.server \
    --model=llama-3-8b \
    --tokenizer_path=tokenizer.model \
    --load_parameters_path=gs://bucket/llama3-8b/ \
    --port=9000
TPU v5e vs A100: reported benchmarks show roughly 2x throughput at roughly 0.4x cost per token for large-batch inference.
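To see what those two ratios imply together, a quick cost-per-token calculation; the hourly price and throughput below are placeholder assumptions for illustration, not quotes:

```python
# Assumed illustrative numbers; substitute real prices/throughput for your region.
a100_price_per_hour = 3.70    # $/hr, hypothetical
a100_tokens_per_sec = 1000.0  # tokens/s at large batch, hypothetical

def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

a100_cost = cost_per_million_tokens(a100_price_per_hour, a100_tokens_per_sec)

# The section's claimed ratios: ~2x throughput at ~0.4x cost per token.
# Together they imply the TPU's effective hourly price is ~0.8x the A100's.
tpu_tokens_per_sec = 2.0 * a100_tokens_per_sec
tpu_price_per_hour = 0.4 * 2.0 * a100_price_per_hour
tpu_cost = cost_per_million_tokens(tpu_price_per_hour, tpu_tokens_per_sec)
```

So at these assumed numbers the A100 costs about $1.03 per million tokens and the TPU about $0.41, matching the claimed 0.4x ratio.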
Vertex AI Pipelines for ML
from kfp import dsl, compiler
from google.cloud import aiplatform

@dsl.component(base_image="python:3.11")
def prepare_data(gcs_path: str) -> str:
    # Placeholder: preprocess the raw data and return the prepared dataset path
    return gcs_path

@dsl.component(base_image="python:3.11")
def train_model(data_path: str, output_path: str) -> None:
    import subprocess
    subprocess.run(["python", "train.py", "--data", data_path, "--output", output_path], check=True)

@dsl.pipeline(name="llm-fine-tuning-pipeline")
def fine_tuning_pipeline(base_model: str, data_gcs_path: str):
    prepare_data_task = prepare_data(gcs_path=data_gcs_path)
    train_task = train_model(
        data_path=prepare_data_task.output,
        output_path="gs://bucket/fine-tuned/",
    ).set_accelerator_type("NVIDIA_A100_80GB").set_accelerator_limit(4)

compiler.Compiler().compile(fine_tuning_pipeline, "pipeline.yaml")

job = aiplatform.PipelineJob(
    display_name="llm-fine-tuning",
    template_path="pipeline.yaml",
    parameter_values={"base_model": "meta-llama/Llama-3-8b", "data_gcs_path": "gs://..."},
)
job.run()
Monitoring via Cloud Monitoring
Vertex AI automatically publishes metrics to Cloud Monitoring: prediction latency, request count, and error rate. Custom business metrics can be added via Cloud Logging with log-based metrics.
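Querying those built-in metrics programmatically goes through the Cloud Monitoring API, where building the metric filter string is the fiddly part. A sketch, assuming the standard Vertex online-prediction latency metric; the endpoint ID is a placeholder:

```python
def vertex_latency_filter(endpoint_id: str) -> str:
    """Cloud Monitoring filter for Vertex AI online-prediction latency on one endpoint."""
    metric = "aiplatform.googleapis.com/prediction/online/prediction_latencies"
    return (
        f'metric.type = "{metric}" AND '
        f'resource.labels.endpoint_id = "{endpoint_id}"'
    )

filter_str = vertex_latency_filter("1234567890")
# Used with the google-cloud-monitoring client:
# from google.cloud import monitoring_v3
# client = monitoring_v3.MetricServiceClient()
# client.list_time_series(name="projects/my-project", filter=filter_str, ...)
```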