Integration of synthetic data platforms (Gretel, Mostly AI, Tonic)
Commercial synthetic data platforms are managed services that typically come with compliance certifications and enterprise SLAs, and often with higher generation quality than open-source solutions. The choice depends on the data type, privacy requirements, and infrastructure constraints.
Gretel.ai
Gretel specializes in differentially private data generation and supports tabular, text, and time-series data:

    import pandas as pd
    import gretel_client as gretel
    from gretel_client.helpers import poll

    gretel.configure_session(api_key="grtu_...")

    # Create a project
    project = gretel.create_project(name="customer-data-synthesis")

    # Train an ACTGAN model (Gretel's version of CTGAN)
    model = project.create_model_obj(
        model_config={
            "schema_version": "1.0",
            "name": "customer-actgan",
            "models": [{
                "actgan": {
                    "data_source": "customers.csv",
                    "params": {
                        "epochs": 400,
                        "batch_size": 500,
                        "generator_lr": 0.0002,
                    },
                    "privacy_filters": {
                        "similarity": "medium",  # high/medium/low
                        "outliers": "medium",
                    },
                }
            }],
        }
    )
    model.submit_cloud()
    poll(model)  # Wait for training to finish

    # Generate records
    record_handler = model.create_record_handler_obj(
        params={"num_records": 10000}
    )
    record_handler.submit_cloud()
    poll(record_handler)  # Wait for generation to finish

    # get_artifact_link returns a download URL, so load it into a DataFrame
    synthetic_df = pd.read_csv(
        record_handler.get_artifact_link("data"), compression="gzip"
    )
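
The idea behind a similarity privacy filter can be illustrated with a small standalone sketch: drop any synthetic record that sits too close to a real training record. This is not Gretel's implementation; the function name `filter_too_similar`, the Euclidean metric, and the `min_distance` threshold are assumptions chosen for illustration, and real data would need feature scaling first.

```python
import math


def filter_too_similar(synthetic_rows, training_rows, min_distance):
    """Keep synthetic rows whose nearest training row is at least min_distance away.

    Hypothetical helper: rows are equal-length tuples of numbers, and
    distance is plain Euclidean distance on the raw values.
    """
    kept = []
    for row in synthetic_rows:
        nearest = min(math.dist(row, train) for train in training_rows)
        if nearest >= min_distance:
            kept.append(row)
    return kept


# Toy (age, income) records
training = [(35.0, 52000.0), (41.0, 61000.0)]
synthetic = [
    (35.0, 52000.0),  # exact copy of a training record
    (38.0, 57000.0),  # far from both training records
    (41.0, 61010.0),  # near-copy of the second training record
]

filtered = filter_too_similar(synthetic, training, min_distance=100.0)
print(filtered)
```

Only the middle record survives here, because the other two are copies or near-copies of training rows.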
Mostly AI
Mostly AI is an enterprise platform for tabular and relational data, used mainly in finance and banking:

    # Import path used by recent SDK releases; older releases may differ
    from mostlyai.sdk import MostlyAI

    mostly = MostlyAI(
        api_key="...",
        base_url="https://app.mostly.ai",
    )

    # Create and train a generator from the source data.
    # transactions_df is assumed to be a pandas DataFrame loaded elsewhere.
    # Encoding-type names follow recent SDK releases; check them against
    # the installed version.
    generator = mostly.train(
        config={
            "name": "transaction-generator",
            "tables": [{
                "name": "transactions",
                "data": transactions_df,
                "columns": [
                    {"name": "amount", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
                    {"name": "merchant_category", "model_encoding_type": "TABULAR_CATEGORICAL"},
                    {"name": "is_fraud", "model_encoding_type": "TABULAR_CATEGORICAL"},
                ],
            }],
        }
    )  # By default, train() waits until training has finished

    # Generate
    synthetic = mostly.generate(generator, size=50000)
    synthetic_df = synthetic.data()
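
For fraud data, a quick sanity check after generation is whether rare categories such as `is_fraud` occur at roughly the same rate as in the source. The sketch below compares category frequencies with total variation distance. It uses only the standard library and toy lists rather than the platform's own quality report, and the function name is hypothetical.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between category frequencies.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_values), len(synthetic_values)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Toy example: 2% fraud in the source, 5% fraud in the synthetic sample
real_is_fraud = [0] * 98 + [1] * 2
synthetic_is_fraud = [0] * 95 + [1] * 5

tvd = total_variation_distance(real_is_fraud, synthetic_is_fraud)
print(round(tvd, 4))
```

A value near zero suggests the class balance was preserved; what counts as acceptable depends on the downstream model and is left to the reader.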
Tonic.ai
Tonic specializes in de-identification and subsetting for dev/test environments:

    # Schematic workflow: the names Workspace, create_transform, add_generator,
    # and add_consistency_rule are illustrative; check the Tonic SDK or REST
    # API documentation for the actual calls.
    import tonic

    workspace = tonic.Workspace(api_key="...")

    # Create the dataset transformation.
    # prod_db_connection and staging_db_connection are assumed to be defined elsewhere.
    transform = workspace.create_transform(
        name="production-to-staging",
        source_connection=prod_db_connection,
        destination_connection=staging_db_connection,
    )

    # Transformation rules: column name, generator type
    transform.add_generator("email", "RandomEmail")
    transform.add_generator("ssn", "RandomSsn")
    transform.add_generator("credit_card", "RandomCreditCard")
    transform.add_generator("first_name", "RandomFirstName")

    # Preserve numeric dependencies (correlation preservation)
    transform.add_consistency_rule(
        columns=["income", "loan_amount"],
        preserve_correlation=True,
    )

    transform.run()
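
De-identified dev/test data usually needs the same source value to map to the same replacement every time, so that joins across tables still work. A minimal sketch of that idea using keyed hashing from the standard library follows. It is not Tonic's implementation; `pseudonymize_email`, the demo key, and the `example.com` domain are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonymize_email(email: str, secret_key: bytes) -> str:
    """Map an email to a stable fake address using HMAC-SHA256.

    The same (normalized) input and key always give the same output,
    while the original address cannot be read back from the result.
    """
    normalized = email.strip().lower()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


# Hypothetical key; in practice it would come from a secrets manager
key = b"demo-key-rotate-in-real-use"

alice_in_users = pseudonymize_email("Alice@Corp.com", key)
alice_in_orders = pseudonymize_email("alice@corp.com ", key)
bob = pseudonymize_email("bob@corp.com", key)

print(alice_in_users == alice_in_orders)  # same person, same pseudonym
print(alice_in_users == bob)              # different people, different pseudonyms
```

Using a keyed HMAC rather than a plain hash means that someone without the key cannot rebuild the mapping by hashing a list of known addresses.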
Comparison of platforms
| Criterion | Gretel | Mostly AI | Tonic |
|---|---|---|---|
| Data type | Tabular, text, time series | Tabular, relational | Relational databases |
| DP support | Yes | No | No |
| Self-hosted | Yes | Yes (enterprise) | Yes |
| Use case | Privacy-first generation | Finance, banking | Dev/test data |
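
The comparison can also be expressed as data, which is convenient when the selection criteria live in a script or checklist. The sketch below copies the table rows into a dictionary and filters by requirements; the `shortlist` helper and the field names are hypothetical.

```python
# Rows copied from the comparison table above
PLATFORMS = {
    "Gretel": {
        "data_types": {"tabular", "text", "time series"},
        "dp_support": True,
        "self_hosted": True,
    },
    "Mostly AI": {
        "data_types": {"tabular", "relational"},
        "dp_support": False,
        "self_hosted": True,  # enterprise tier
    },
    "Tonic": {
        "data_types": {"relational"},
        "dp_support": False,
        "self_hosted": True,
    },
}


def shortlist(data_type, need_dp=False, need_self_hosted=False):
    """Return platform names that satisfy the stated requirements."""
    matches = []
    for name, features in PLATFORMS.items():
        if data_type not in features["data_types"]:
            continue
        if need_dp and not features["dp_support"]:
            continue
        if need_self_hosted and not features["self_hosted"]:
            continue
        matches.append(name)
    return matches


print(shortlist("relational"))
print(shortlist("tabular", need_dp=True))
```
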
Integration typically takes 1-2 weeks and covers connecting data sources, configuring transformation or generation rules, and scheduling synchronization to keep test environments up to date.
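
The scheduled synchronization step reduces to a staleness check that a cron job or orchestrator runs before triggering a refresh. A minimal sketch, assuming a weekly refresh window; the `needs_refresh` helper and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(last_sync: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the test environment is at least max_age old."""
    return now - last_sync >= max_age


now = datetime(2024, 1, 8, 2, 0, tzinfo=timezone.utc)
weekly = timedelta(days=7)

# Last synced exactly a week ago: due for a refresh
stale = needs_refresh(datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc), now, weekly)
# Last synced yesterday: still fresh
fresh = needs_refresh(datetime(2024, 1, 7, 2, 0, tzinfo=timezone.utc), now, weekly)

print(stale, fresh)
```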







