Development of a prompt versioning system
Prompt versioning is Git for LLM instructions. When a prompt changes, you need to know who changed it, what exactly changed, and how the change affected quality, and you need to be able to revert to any previous version.
Principles of versioning
Version immutability: once created, a version never changes. If a prompt needs to be corrected, a new version is created.
Semantic versioning (major.minor.patch):
- Major: a fundamental change in the instruction or task
- Minor: improved wording without changing the task
- Patch: typo fixes and minor tweaks
Link to results: each version is linked to its metrics on the evaluation set.
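The immutability and metrics-linking principles above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a prescribed design: the names `PromptVersion` and `PromptRegistry` are hypothetical, and a real system would persist versions in Git or a database.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class PromptVersion:
    """One prompt version together with the metrics measured for it."""
    name: str
    version: str  # "major.minor.patch"
    content: str
    metrics: Mapping[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Wrap metrics in a read-only view so they cannot be edited in place.
        object.__setattr__(self, "metrics", MappingProxyType(dict(self.metrics)))


class PromptRegistry:
    """Hypothetical append-only store: versions are added, never replaced."""

    def __init__(self) -> None:
        self._versions = {}

    def register(self, prompt: PromptVersion) -> None:
        key = (prompt.name, prompt.version)
        if key in self._versions:
            # Immutability: a fix must be published under a new version number.
            raise ValueError(f"{prompt.name} {prompt.version} already exists")
        self._versions[key] = prompt

    def get(self, name: str, version: str) -> PromptVersion:
        return self._versions[(name, version)]
```

Because the dataclass is frozen, assigning to `content` on an existing version raises an error, and registering the same name and version twice is rejected, so the only way to change a prompt is to publish a new version.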
Git-based versioning of prompts
For small teams, storing prompts in Git is often sufficient:
prompts/
├── customer-support/
│   ├── system-prompt.v1.txt
│   ├── system-prompt.v2.txt
│   └── system-prompt.current -> system-prompt.v2.txt
├── summarization/
│   ├── prompt.v1.yaml
│   └── prompt.v2.yaml
└── prompts.json  # Index with metadata
# prompts/summarization/prompt.v2.yaml
version: "2.0.0"
name: "document-summarizer"
created: "2024-11-15"
author: "ml-team"
changelog: "Added length constraint, improved tone instruction"
model:
  provider: "openai"
  name: "gpt-4o"
  temperature: 0.2
  max_tokens: 500
variables:
  - name: document
    required: true
  - name: max_sentences
    required: false
    default: "3"
content: |
  Summarize the following document in exactly {{max_sentences}} sentences.
  Be concise and focus on the main points.
  Do not add information not present in the document.
  Document:
  {{document}}
metrics:
  rouge_l: 0.47
  human_rating: 4.2
  eval_set: "summarization-benchmark-v3"
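To show how the `variables` and `content` fields of such a file might be used together, here is a minimal rendering sketch. It assumes the YAML has already been loaded into plain Python structures; the function name `render_prompt` and its error handling are illustrative choices, not part of any particular library.

```python
import re

# Matches placeholders such as {{document}} or {{ max_sentences }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(content: str, variables: list, values: dict) -> str:
    """Fill {{name}} placeholders from caller values and declared defaults."""
    resolved = {}
    for spec in variables:
        name = spec["name"]
        if name in values:
            resolved[name] = str(values[name])
        elif "default" in spec:
            resolved[name] = str(spec["default"])
        elif spec.get("required", False):
            raise ValueError(f"missing required variable: {name}")

    def substitute(match):
        name = match.group(1)
        if name not in resolved:
            # The template uses a variable that was not declared or not set.
            raise KeyError(f"undeclared or unset variable: {name}")
        return resolved[name]

    return PLACEHOLDER.sub(substitute, content)
```

Called with the `variables` list from the file above and only a `document` value, the sketch falls back to the declared default of `"3"` for `max_sentences`. Failing loudly on a missing required variable keeps a half-filled prompt from ever reaching the model.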
Automatic prompt diffs
import difflib


def diff_prompt_versions(v1_content: str, v2_content: str) -> str:
    """Show a unified diff between two prompt versions."""
    # Split without line endings and join with "\n" so header and hunk
    # lines do not run into the lines that follow them.
    diff = difflib.unified_diff(
        v1_content.splitlines(),
        v2_content.splitlines(),
        fromfile="version_1",
        tofile="version_2",
        lineterm="",
    )
    return "\n".join(diff)


def analyze_prompt_change(v1: str, v2: str) -> dict:
    """Summarize the nature of the change between two prompt versions."""
    v1_words = set(v1.lower().split())
    v2_words = set(v2.lower().split())
    # Compute the similarity ratio once and reuse it below.
    similarity = difflib.SequenceMatcher(None, v1, v2).ratio()
    return {
        "length_change": len(v2) - len(v1),
        # Sorted so the reported sample is stable across runs.
        "added_words": sorted(v2_words - v1_words)[:10],
        "removed_words": sorted(v1_words - v2_words)[:10],
        "similarity": similarity,
        # Heuristic: character-level similarity below 0.7 counts as major.
        "change_type": "major" if similarity < 0.7 else "minor",
    }
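The similarity heuristic can also be tied back to semantic versioning by suggesting the next version number. The sketch below is one possible mapping under stated assumptions: the 0.7 threshold comes from the code above, while the 0.95 patch threshold and the names `bump_version` and `suggest_next_version` are hypothetical. Text similarity is only a rough signal, since a one-word edit can change the task, so the suggestion is best treated as a default for a reviewer to confirm.

```python
import difflib


def bump_version(version: str, part: str) -> str:
    """Return the next 'major.minor.patch' string for the given change type."""
    major, minor, patch = (int(x) for x in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {part}")


def suggest_next_version(current: str, old: str, new: str,
                         major_below: float = 0.7,
                         patch_above: float = 0.95) -> str:
    """Suggest a version bump from character-level similarity (a heuristic)."""
    similarity = difflib.SequenceMatcher(None, old, new).ratio()
    if similarity < major_below:
        part = "major"   # the text was largely rewritten
    elif similarity > patch_above:
        part = "patch"   # near-identical text, e.g. a typo fix
    else:
        part = "minor"   # wording changed, most of the text kept
    return bump_version(current, part)
```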
Promotion workflow
[Draft] → [In Review] → [Approved] → [Staging] → [Production]
               ↑                         ↓
           Reviewer                A/B Test (5%)
                                         ↓
                              Full Rollout / Rollback
Key rule: no prompt is promoted to production without passing the evaluation set. An automated CI job runs the evaluation every time a prompt changes and blocks promotion if the metrics regress by more than 3%.
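The gate described above could be implemented along these lines. This is a sketch under explicit assumptions: the 3% limit is interpreted as a relative drop against the baseline version, every metric is assumed to be "higher is better", and the function name `find_regressions` is hypothetical.

```python
def find_regressions(baseline: dict, candidate: dict,
                     max_regression: float = 0.03) -> dict:
    """Return metrics whose relative drop versus the baseline exceeds the limit."""
    failures = {}
    for name, base_value in baseline.items():
        if name not in candidate:
            # A metric that was not measured counts as a failure.
            failures[name] = float("inf")
            continue
        if base_value == 0:
            # A relative drop is undefined for a zero baseline; skip it.
            continue
        drop = (base_value - candidate[name]) / abs(base_value)
        if drop > max_regression:
            failures[name] = drop
    return failures
```

A CI job could call this with the metrics of the current production version and those of the candidate, then exit with a non-zero status whenever the returned dictionary is not empty, which blocks the promotion step.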







