AI-Powered Product Description Generation for Mobile Apps
Seller on marketplace takes photo of item with phone and presses "Publish". No need to write description manually — system should offer ready text from photo and product category. Not fiction: Vision API + LLM combo closes task in 2–4 seconds.
Where Description Content Comes From
Visual Analysis of Photos
First step — extract characteristics from image. Google Cloud Vision Product Search or Azure Cognitive Services Computer Vision return: object tags, color, brand on packaging, text on label (OCR). For mobile app, better than separate calls — multimodal LLM (GPT-4o, Gemini Pro Vision) — one request with image analyzes and generates text immediately.
Structured Attributes from Form
User fills minimum: category, price, condition (new/used). This data included in prompt as structured context. Model infers rest from photo.
How Implemented on Client
Entire flow async: user selects photo, clicks "Create Description", sees skeleton loader, in 2–3 seconds gets editable text.
// Android: send image for generation
class DescriptionGeneratorViewModel : ViewModel() {
fun generateDescription(imageUri: Uri, category: String) {
_uiState.value = UiState.Loading
viewModelScope.launch {
try {
val base64Image = imageUri.toBase64(contentResolver)
val response = descriptionApi.generate(
GenerationRequest(
imageBase64 = base64Image,
category = category,
language = Locale.getDefault().language,
maxLength = 300
)
)
_uiState.value = UiState.Success(response.description)
} catch (e: Exception) {
_uiState.value = UiState.Error(e.message)
}
}
}
}
On iOS similar via async/await + URLSession:
func generateDescription(image: UIImage, category: String) async throws -> String {
let imageData = image.jpegData(compressionQuality: 0.8)!
let base64 = imageData.base64EncodedString()
let request = DescriptionRequest(imageBase64: base64, category: category, language: Locale.current.languageCode ?? "en")
let response = try await api.generateDescription(request)
return response.text
}
Compress image to JPEG quality 0.8 before sending — reduces payload from ~3 MB (RAW from camera) to ~300–500 KB without visible quality loss for Vision API.
Backend: Prompt Engineering for Quality
def build_prompt(category: str, image_tags: list, language: str) -> str:
return f"""
You are a professional copywriter for an online marketplace.
Write a product description based on the provided image.
Category: {category}
Detected attributes: {', '.join(image_tags)}
Language: {language}
Requirements:
- 2-3 sentences, 50-100 words
- Start with the main product feature, not "This is a..."
- Include detected color, condition, and brand if visible
- Use active voice
- No adjectives like "great", "amazing", "perfect"
"""
Ban on "great", "amazing", "perfect" — not formalism. Models by default shove them into every other sentence, descriptions become indistinguishable.
Streaming for Better UX
Instead of waiting for full answer — stream via Server-Sent Events. Text appears as generated, like ChatGPT. Android implements via okhttp3.EventSource, iOS — URLSessionDataTask with didReceive data delegate.
Description Structure by Type
| Product Type | Length | Focus |
|---|---|---|
| Electronics | 100–150 words | Technical specs + condition |
| Clothing | 60–80 words | Size, color, material, condition |
| Furniture | 80–120 words | Dimensions, material, style |
| Books | 40–60 words | Author, topic, condition |
Process
API design: request format, error handling for Vision API (blurry photo, no objects).
Prompt template setup by product categories.
Client UI development with skeleton loader and result editor.
Streaming implementation for better UX on long descriptions.
Timeline Guidance
Basic integration (photo → description via GPT-4o / Gemini) — 3–4 days. With prompt tuning by categories and streaming — up to 1 week.







