Setting up Docker containers for AI/ML projects
Docker for AI/ML solves the problem of environment reproducibility: the same container runs identically on a developer's laptop, a CI server, and a production GPU cluster. The key building blocks are the NVIDIA Container Toolkit for GPU access, multi-stage builds to keep images small, and layer caching to speed up CI.
NVIDIA Container Toolkit
# Install for GPU access inside containers
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test: should print the GPU table from inside the container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
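Beyond --gpus all, Docker can expose only a subset of devices, which is useful on shared multi-GPU hosts. A minimal sketch (the device indices are illustrative):
# Expose only the first GPU to the container
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
# Expose two specific GPUs by index
docker run --rm --gpus '"device=0,1"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi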
Dockerfile for an ML project
# Multi-stage build: smaller final image
FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04 AS builder
# Build dependencies
RUN apt-get update && apt-get install -y python3.11 python3-pip git \
 && rm -rf /var/lib/apt/lists/*
# Install dependencies in their own layer (cached until requirements.txt changes);
# use python3.11 -m pip so packages land under the same interpreter version
COPY requirements.txt .
RUN python3.11 -m pip install --user --no-cache-dir -r requirements.txt
FROM nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3.11 \
 && rm -rf /var/lib/apt/lists/*
# Non-root user for security; created before COPY so --chown can resolve it
RUN useradd -m -u 1000 mluser
# Copy only the installed packages, owned by the runtime user
# (copying into /root would leave them unreadable to mluser)
COPY --from=builder --chown=mluser:mluser /root/.local /home/mluser/.local
ENV PATH=/home/mluser/.local/bin:$PATH
WORKDIR /app
COPY --chown=mluser:mluser src/ ./src/
USER mluser
CMD ["python3.11", "src/train.py"]
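To exercise the image, build it and run training with GPU access. A usage sketch; the ml-train tag and mount path are illustrative:
# Build the image and start training with all GPUs
docker build -t ml-train .
docker run --rm --gpus all -v "$(pwd)/data:/app/data:ro" ml-train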
Docker Compose for local development
# docker-compose.yml
version: '3.8'
services:
  training:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data:ro      # Data mounted read-only
      - ./src:/app/src           # Code mounted for hot reload
      - ./outputs:/app/outputs   # Training outputs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      - mlflow
  mlflow:
    image: python:3.11-slim
    # python:3.11-slim does not ship mlflow, so install it on startup
    command: sh -c "pip install --no-cache-dir mlflow && mlflow server --backend-store-uri /mlflow --host 0.0.0.0 --port 5000"
    ports:
      - "5000:5000"
    volumes:
      - mlflow-data:/mlflow
volumes:
  mlflow-data:
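With both services defined, a single command brings up training together with the tracking server, and the MLflow UI becomes reachable from the host:
# Start the stack; training waits for mlflow via depends_on
docker compose up --build
# MLflow UI is then available at http://localhost:5000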
Optimizing image size
A common problem: PyTorch images grow past 10 GB because they include CUDA dev libraries and packages the application never uses. Solutions:
- Use the runtime base image tag instead of devel (a 4-5 GB difference)
- Pass --no-cache-dir to pip install
- Delete the apt cache after installing packages (rm -rf /var/lib/apt/lists/*)
- Use .dockerignore to exclude data and virtual environments (see the sketch after this list)
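A minimal .dockerignore along these lines keeps datasets, outputs, and local environments out of the build context; the exact entries depend on the project layout:
# Create a minimal .dockerignore (entries are illustrative)
cat > .dockerignore <<'EOF'
data/
outputs/
.venv/
__pycache__/
*.pt
.git/
EOF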
Final inference image: exporting the PyTorch model to ONNX and serving it with ONNX Runtime brings the image down to 2-3 GB, versus 8-10 GB with the naive approach.