Development of an AI System for Automated Journalism
Automated journalism is the generation of news text from structured data such as financial reports, sports statistics, election results, and weather data. The technology works where the data is reliable and the story follows a clear narrative template. AP and Reuters publish tens of thousands of such stories annually.
Where Automation is Justified
Financial Reporting: quarterly company results. Data from EDGAR or stock exchanges is turned into text with key metrics, trends, and comparisons with forecasts. One template covers thousands of companies.
Sports Statistics: match results and game statistics, told as a standard narrative that varies with the key moments of the game.
Registry Summaries: property transaction records, traffic accident data, and bankruptcy registries, turned into automatic summaries that highlight anomalies.
Weather Summaries and Alerts: forecast data turned into readable text with emphasis on hazardous conditions.
Data-to-Text System Architecture
from datetime import datetime, timezone

# NarrativeTemplate, DataAnalyzer, TextGenerator, GeneratedArticle and
# FactChecker are project classes assumed to be defined elsewhere.

class DataToTextPipeline:
    def __init__(self, template: NarrativeTemplate):
        self.template = template
        self.data_analyzer = DataAnalyzer()
        self.text_generator = TextGenerator()

    def generate(self, data: dict) -> GeneratedArticle:
        # 1. Data analysis: key fact extraction
        key_facts = self.data_analyzer.extract_key_facts(data, self.template.fact_rules)
        # 2. Determine article "angle"
        angle = self.data_analyzer.determine_angle(key_facts, self.template.angle_rules)
        # 3. Generate text by narrative template
        text = self.text_generator.generate(
            facts=key_facts,
            angle=angle,
            template=self.template,
            style_guide=self.template.style_guide,
        )
        # 4. Post-processing: fact checking, number formatting
        text = self.postprocess(text, data)
        return GeneratedArticle(
            headline=self.generate_headline(key_facts, angle),
            body=text,
            data_sources=data.get("sources", []),
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            generated_at=datetime.now(timezone.utc),
            template_version=self.template.version,
        )

    def postprocess(self, text: str, data: dict) -> str:
        # Verification: every number in text must match source data
        return FactChecker(data).verify_and_fix(text)
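Step 4 of the pipeline mentions number formatting. A minimal sketch of what that step might look like, assuming US-dollar amounts and a house style of two decimals for scaled values; `format_money` and `format_percent` are hypothetical helpers and not part of the pipeline above:

```python
def format_money(amount: float, currency: str = "$") -> str:
    """Render a raw amount in words-and-digits form, e.g. '$1.23 billion'."""
    sign = "-" if amount < 0 else ""
    value = abs(amount)
    # Try the largest unit first; fall through to a plain comma-grouped number.
    for threshold, unit in ((1e12, "trillion"), (1e9, "billion"), (1e6, "million")):
        if value >= threshold:
            return f"{sign}{currency}{value / threshold:.2f} {unit}"
    return f"{sign}{currency}{value:,.2f}"


def format_percent(change: float) -> str:
    """Render a fractional change (0.053) as a signed percentage ('+5.3%')."""
    return f"{change * 100:+.1f}%"


print(format_money(1_234_000_000))  # $1.23 billion
print(format_percent(-0.12))        # -12.0%
```

Formatting from the raw source value, rather than letting the text generator write digits itself, keeps every printed number directly derivable from the data.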
Narrative Templates
A template defines narrative logic, not specific text. For financial reporting:
class EarningsReportTemplate(NarrativeTemplate):
    fact_rules = [
        FactRule("revenue", comparisons=["yoy", "qoq", "consensus"]),
        FactRule("net_income", comparisons=["yoy", "consensus"]),
        FactRule("eps", comparisons=["consensus", "guidance"]),
        FactRule("guidance_next_quarter", type="forward_looking"),
    ]
    angle_rules = [
        AngleRule(condition="revenue_beat > 5%", angle="strong_beat"),
        AngleRule(condition="revenue_miss > 5%", angle="disappointment"),
        AngleRule(condition="guidance_raised", angle="optimism"),
        AngleRule(condition="guidance_lowered", angle="caution"),
    ]
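How such angle rules might be evaluated can be sketched as follows. This is an illustration under stated assumptions, not the actual `AngleRule` implementation: instead of parsing condition strings, it swaps in plain Python predicates over a dictionary of derived metrics. The names `surprise`, `ANGLE_RULES`, `determine_angle`, the metric keys, and the fallback angle `"neutral"` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Sequence


@dataclass(frozen=True)
class AngleRule:
    condition: Callable[[Mapping[str, Any]], bool]
    angle: str


def surprise(actual: float, consensus: float) -> float:
    """Relative deviation from analyst consensus (consensus must be non-zero)."""
    return (actual - consensus) / abs(consensus)


# Same four rules as the template above, expressed as predicates.
ANGLE_RULES = [
    AngleRule(lambda m: m["revenue_surprise"] > 0.05, "strong_beat"),
    AngleRule(lambda m: m["revenue_surprise"] < -0.05, "disappointment"),
    AngleRule(lambda m: bool(m.get("guidance_raised")), "optimism"),
    AngleRule(lambda m: bool(m.get("guidance_lowered")), "caution"),
]


def determine_angle(metrics: Mapping[str, Any],
                    rules: Sequence[AngleRule] = ANGLE_RULES,
                    default: str = "neutral") -> str:
    """Return the angle of the first rule whose condition holds."""
    for rule in rules:
        if rule.condition(metrics):
            return rule.angle
    return default


metrics = {"revenue_surprise": surprise(actual=108.0, consensus=100.0)}
print(determine_angle(metrics))  # strong_beat
```

First-match evaluation makes rule order a priority order: here a large revenue beat wins over a guidance change. Whether that ordering is right is an editorial decision.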
Variability and Anti-Boilerplate
One problem with automated journalism is text uniformity: stories built from the same template read alike. Several techniques reduce it:
- Synonym variation: multiple versions of each key phrase, random selection
- Sentence structure variation: reordering of facts depending on "angle"
- Contextual enrichment: adding context (industry trends, company history) from knowledge base
- LLM rewriting: final pass through LLM for style variety while preserving facts
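The first technique, synonym variation, can be sketched in a few lines. The phrase table, `pick_phrase`, and `revenue_sentence` are hypothetical. One assumption worth stating: the random generator is seeded with a story identifier, so regenerating the same story yields the same wording, which keeps output reproducible for audits.

```python
import random

# Hypothetical phrase table: several interchangeable wordings per key phrase.
PHRASES = {
    "rose": ["rose", "increased", "climbed", "grew"],
    "fell": ["fell", "declined", "dropped", "decreased"],
}


def pick_phrase(key: str, rng: random.Random) -> str:
    """Choose one wording for the given key phrase."""
    return rng.choice(PHRASES[key])


def revenue_sentence(company: str, change_pct: float, story_id: str) -> str:
    # Seeding with the story id makes the "random" choice repeatable per story.
    rng = random.Random(story_id)
    verb = pick_phrase("rose" if change_pct >= 0 else "fell", rng)
    return f"{company} revenue {verb} {abs(change_pct):.1f}% year over year."


print(revenue_sentence("Acme Corp", 4.2, story_id="acme-q1"))
print(revenue_sentence("Globex", -3.0, story_id="globex-q1"))
```

Only the verb varies here; the number is formatted from the source value and never passes through the variation step.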
Fact Verification
Every numerical claim in the text must be traceable to the source data; this is critical. Automatic verification:
def verify_facts(article_text: str, source_data: dict) -> VerificationResult:
    # Extract all numerical claims from text
    claims = extract_numerical_claims(article_text)
    errors = []
    for claim in claims:
        # Find corresponding value in source data
        source_value = find_in_data(source_data, claim.entity, claim.metric)
        if source_value is None:
            errors.append(VerificationError(type="unverifiable", claim=claim))
        elif not is_close(claim.value, source_value, tolerance=0.01):
            errors.append(VerificationError(
                type="mismatch",
                claim=claim,
                expected=source_value,
            ))
    return VerificationResult(is_valid=len(errors) == 0, errors=errors)
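The helpers used by `verify_facts` are not shown in the text. A deliberately simplified sketch of two of them follows, assuming that `tolerance` means relative tolerance. It extracts bare numbers with a regular expression and does not link them to an entity or metric, which a production `extract_numerical_claims` would need to do. `extract_numbers` and `unverifiable_numbers` are hypothetical names.

```python
import math
import re

# Matches integers and decimals with optional thousands separators, e.g. 1,234.5
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Pull every numeric literal out of the text, dropping thousands separators."""
    return [float(m.replace(",", "")) for m in NUMBER.findall(text)]


def is_close(claimed: float, source: float, tolerance: float = 0.01) -> bool:
    """True when the claimed value is within a relative tolerance of the source."""
    return math.isclose(claimed, source, rel_tol=tolerance)


def unverifiable_numbers(text: str, source_values: list[float]) -> list[float]:
    """Numbers in the text that match no source value within tolerance."""
    return [n for n in extract_numbers(text)
            if not any(is_close(n, s) for s in source_values)]


text = "Revenue rose 5.2% to 1,234.5 million."
print(unverifiable_numbers(text, [5.2, 1234.5]))  # []
print(unverifiable_numbers(text, [5.2, 1300.0]))  # [1234.5]
```

The regex also picks up digits in tokens such as "Q3" or in dates, so a real extractor needs context rules to avoid false alarms.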
Metadata and Transparency
All automatically generated content is labeled: "Automatically generated based on data [source]". Readers can access the source data. This follows AP's standard for automated content and is a requirement of media ethics.
Performance
A single system instance on an A100 GPU generates about 500 stories per hour at an average length of 300 words. For a news agency, this means complete coverage of financial results for all exchange-listed companies on the day the results are published.
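As a back-of-the-envelope check, the throughput figure can be turned into an instance count for a given load. Only the 500 stories per hour comes from the text; the 3,000 reports and the 2-hour window below are illustrative assumptions, and `instances_needed` is a hypothetical helper.

```python
import math

STORIES_PER_HOUR = 500  # per-instance throughput quoted in the text


def instances_needed(stories: int, window_hours: float,
                     stories_per_hour: int = STORIES_PER_HOUR) -> int:
    """Smallest number of instances that finishes `stories` within the window."""
    return math.ceil(stories / (stories_per_hour * window_hours))


# Hypothetical load: 3,000 earnings reports to cover within 2 hours.
print(instances_needed(3000, 2))  # 3
print(STORIES_PER_HOUR * 24)      # 12000 stories per instance per day
```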







