Migrating an AI solution between cloud providers
Migrating an AI workload between AWS, GCP, and Azure is a complex task: it requires transferring not only code and models, but also data, training infrastructure, inference services, and integrated managed services. Minimizing downtime and avoiding model quality degradation is crucial.
Typical reasons for migration
- Cost optimization: GPU spot-instance prices can differ by 30-40% between providers
- Vendor lock-in reduction: reducing dependence on proprietary services
- Compliance: data residency requirements
- Performance: a specific cloud may offer better GPU access at a given moment (H100 availability varies by provider and region over time)
Pre-migration audit
Cloud-specific dependencies checklist:
□ Managed ML services (SageMaker / Vertex AI / Azure ML)
□ Data storage (S3 / GCS / Azure Blob)
□ Container registry (ECR / Artifact Registry / ACR)
□ Message queues (SQS+SNS / Pub/Sub / Event Hubs)
□ Database services (RDS / Cloud SQL / Azure Database)
□ Secrets management (Secrets Manager / Secret Manager / Key Vault)
□ Monitoring (CloudWatch / Cloud Monitoring / Azure Monitor)
□ Networking (VPC, security groups, DNS)
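Part of this audit can be automated: scanning the codebase for cloud-SDK imports gives a first inventory of provider-specific touchpoints. A minimal sketch (the SDK-to-provider mapping below is illustrative, not exhaustive):

```python
import re
from pathlib import Path

# Illustrative mapping of import prefixes to providers; extend as needed
CLOUD_SDKS = {
    "boto3": "AWS", "botocore": "AWS",
    "google.cloud": "GCP",
    "azure": "Azure",
}

IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([\w.]+)", re.MULTILINE)

def audit_cloud_imports(root: str) -> dict[str, set[str]]:
    """Return {provider: {module, ...}} for every cloud-SDK import under root."""
    found: dict[str, set[str]] = {}
    for path in Path(root).rglob("*.py"):
        for module in IMPORT_RE.findall(path.read_text(errors="ignore")):
            for prefix, provider in CLOUD_SDKS.items():
                if module == prefix or module.startswith(prefix + "."):
                    found.setdefault(provider, set()).add(module)
    return found
```

The output is a starting point for the checklist above, not a replacement for it: managed services wired up via IAM roles, Terraform, or CI/CD configs will not show up in Python imports.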
Migration strategy
Strangler Fig pattern - gradual replacement instead of a big-bang cutover:
- Stand up parallel infrastructure in the new cloud
- Move non-critical workloads first (e.g., batch training)
- Set up shadow deployment: run inference in both clouds and compare outputs
- Gradually shift traffic (canary migration)
- Decommission the old infrastructure
Abstraction over cloud-specific APIs
# Cloud-agnostic storage abstraction
from abc import ABC, abstractmethod

import boto3
from google.cloud import storage as gcs

class ObjectStorage(ABC):
    @abstractmethod
    def upload(self, local_path: str, remote_path: str): ...

    @abstractmethod
    def download(self, remote_path: str, local_path: str): ...

class S3Storage(ObjectStorage):
    def __init__(self, bucket: str, region: str = "us-east-1"):
        self.client = boto3.client("s3", region_name=region)
        self.bucket = bucket

    def upload(self, local_path: str, remote_path: str):
        self.client.upload_file(local_path, self.bucket, remote_path)

    def download(self, remote_path: str, local_path: str):
        self.client.download_file(self.bucket, remote_path, local_path)

class GCSStorage(ObjectStorage):
    def __init__(self, bucket: str):
        self.client = gcs.Client()
        self.bucket = self.client.bucket(bucket)

    def upload(self, local_path: str, remote_path: str):
        self.bucket.blob(remote_path).upload_from_filename(local_path)

    def download(self, remote_path: str, local_path: str):
        self.bucket.blob(remote_path).download_to_filename(local_path)

# Backend selection via config (the instance is not named `storage`,
# which would shadow the imported google.cloud module)
object_storage = S3Storage("my-bucket") if CLOUD == "aws" else GCSStorage("my-bucket")
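A side benefit of this abstraction: a filesystem-backed implementation makes pipelines testable without any cloud credentials at all. A sketch against the same interface (repeated here so the snippet is self-contained):

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path

class ObjectStorage(ABC):
    @abstractmethod
    def upload(self, local_path: str, remote_path: str): ...

    @abstractmethod
    def download(self, remote_path: str, local_path: str): ...

class LocalStorage(ObjectStorage):
    """Filesystem-backed ObjectStorage for tests and local development."""

    def __init__(self, root: str):
        self.root = Path(root)

    def upload(self, local_path: str, remote_path: str):
        dest = self.root / remote_path
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_path, dest)

    def download(self, remote_path: str, local_path: str):
        shutil.copyfile(self.root / remote_path, local_path)
```

During the migration itself, the same interface also allows a dual-write wrapper (write to both clouds, read from one), which keeps the two object stores in sync throughout the transition window.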
Data transfer
# AWS S3 → GCS via Storage Transfer Service (Google)
# Source and destination are positional arguments; by default nothing
# is deleted from the source (deletion must be opted into via --delete-from)
gcloud transfer jobs create \
  s3://my-aws-bucket gs://my-gcs-bucket \
  --source-creds-file=aws-creds.json
# For large datasets (multi-TB), gsutil can also sync directly:
gsutil -m rsync -r s3://source-bucket gs://dest-bucket
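Before pointing anything at the new bucket, verify transfer completeness: comparing object keys and sizes catches most partial transfers. A sketch that works on provider-agnostic {key: size} listings; how you produce them (e.g., paginated list calls from each cloud's SDK) is left to the caller:

```python
def compare_listings(source: dict[str, int], dest: dict[str, int]) -> dict:
    """Compare two {object_key: size_in_bytes} bucket listings."""
    missing = sorted(set(source) - set(dest))
    size_mismatch = sorted(
        k for k in set(source) & set(dest) if source[k] != dest[k]
    )
    return {
        "missing_in_dest": missing,
        "size_mismatch": size_mismatch,
        "ok": not missing and not size_mismatch,
    }
```

Size comparison is a cheap first line of defense; note that checksum comparison across clouds is not always direct (S3 ETags and GCS CRC32C hashes are computed differently), so byte-level verification may require re-hashing on download.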
Post-migration testing
Mandatory check: reproducibility of model predictions. The same inputs should produce identical outputs in the old and new environments (or statistically equivalent ones, for models with inherent stochasticity). A shift in the distribution of predictions between the two environments signals a reproducibility problem. Metrics on the holdout dataset should agree to within 0.1%.
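This parity check is easy to script: run the same holdout batch through both environments and compare predictions element-wise as well as at the aggregate level. A sketch with numpy (the tolerances are illustrative; choose them to match your model's numeric behavior):

```python
import numpy as np

def check_prediction_parity(old_preds, new_preds,
                            atol: float = 1e-6,
                            max_metric_gap: float = 0.001) -> dict:
    """Compare predictions from old and new environments on the same inputs."""
    old = np.asarray(old_preds, dtype=float)
    new = np.asarray(new_preds, dtype=float)
    # Element-wise check: per-sample predictions should match within tolerance
    elementwise_ok = np.allclose(old, new, atol=atol)
    # Aggregate check: relative gap between mean predictions (target < 0.1%)
    metric_gap = abs(old.mean() - new.mean()) / max(abs(old.mean()), 1e-12)
    return {
        "elementwise_ok": bool(elementwise_ok),
        "metric_gap": float(metric_gap),
        "metric_ok": metric_gap < max_metric_gap,
    }
```

In a real migration the aggregate check would use the actual evaluation metric (accuracy, AUC, etc.) rather than the mean prediction used here for brevity; small element-wise drift with a passing aggregate check usually points to benign numeric differences (different GPU architectures, BLAS builds, or framework versions), while aggregate drift indicates a genuine environment mismatch.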
Typical timeframe: 4-8 weeks for a full-scale AI platform with ML pipelines, Feature Store, and inference services.