Cookbook: Classification
Classify structured data into predefined or open-ended categories.
Industry classification
Classify companies by their industry sector:
from typing import Literal  # used by the later examples

from pydantic import BaseModel, Field
from smelt import Model, Job


class IndustryClassification(BaseModel):
    sector: str = Field(description="Primary industry sector (e.g. Technology, Healthcare, Finance)")
    sub_sector: str = Field(description="More specific sub-sector (e.g. Cloud Computing, Pharmaceuticals)")
    is_public: bool = Field(description="Whether the company is publicly traded on a major exchange")


model = Model(provider="openai", name="gpt-4.1-mini")

job = Job(
    prompt="Classify each company by its primary industry sector and sub-sector. "
    "Determine if the company is publicly traded on a major stock exchange (NYSE, NASDAQ, etc.).",
    output_model=IndustryClassification,
    batch_size=10,
)

companies = [
    {"name": "Apple Inc.", "description": "Consumer electronics, software, and services", "founded": "1976"},
    {"name": "JPMorgan Chase", "description": "Global financial services and investment banking", "founded": "1799"},
    {"name": "Pfizer", "description": "Pharmaceutical company developing medicines and vaccines", "founded": "1849"},
    {"name": "Tesla", "description": "Electric vehicles and clean energy products", "founded": "2003"},
    {"name": "Spotify", "description": "Digital music and podcast streaming platform", "founded": "2006"},
]

# Pair each input row with its classification.
result = job.run(model, data=companies)
for company, cls in zip(companies, result.data):
    print(f"{company['name']:20s} → {cls.sector} / {cls.sub_sector} (public: {cls.is_public})")
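Because sector is an open-ended str, the model may phrase the same sector in slightly different ways (for example "Technology" and "technology "). The following is a minimal sketch, not part of smelt, of tallying results under a normalized sector label. The Classified dataclass and the sample rows are hand-written stand-ins for result.data, not real model output, and count_by_sector is a hypothetical helper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Classified:
    """Stand-in for one IndustryClassification row (illustrative only)."""
    sector: str
    sub_sector: str
    is_public: bool


def count_by_sector(rows):
    """Count rows per sector, ignoring case and surrounding whitespace."""
    return Counter(row.sector.strip().casefold() for row in rows)


# Hand-written placeholder rows, not real model output.
rows = [
    Classified("Technology", "Consumer Electronics", True),
    Classified("technology ", "Streaming", True),
    Classified("Finance", "Investment Banking", True),
]

counts = count_by_sector(rows)
for sector, n in counts.most_common():
    print(f"{sector}: {n}")
```

If label drift like this becomes a problem, constraining the field to a fixed set of values is usually the more robust option.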
Category assignment with constrained values
Use Literal types to restrict outputs to a fixed set of categories:
# Reuses the imports and `model` from the first example.
class TicketCategory(BaseModel):
    category: Literal["billing", "technical", "shipping", "account", "general"] = Field(
        description="Support ticket category"
    )
    priority: Literal["low", "medium", "high", "urgent"] = Field(
        description="Priority level based on urgency and business impact"
    )
    requires_human: bool = Field(
        description="Whether this ticket needs human escalation vs automated response"
    )


job = Job(
    prompt="Triage each support ticket. Assign a category, priority level, "
    "and determine if human escalation is needed. "
    "Urgent priority is for issues affecting billing or complete service outage.",
    output_model=TicketCategory,
    batch_size=10,
)

tickets = [
    {"ticket_id": "TK-1001", "message": "I was charged twice for my subscription", "product": "Pro Plan"},
    {"ticket_id": "TK-1002", "message": "App crashes when I try to export PDF", "product": "Desktop App"},
    {"ticket_id": "TK-1003", "message": "When will my order arrive?", "product": "Wireless Mouse"},
]

result = job.run(model, data=tickets)
for ticket, triage in zip(tickets, result.data):
    print(f"{ticket['ticket_id']}: {triage.category} / {triage.priority} (human: {triage.requires_human})")
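To see what the Literal constraint buys you, here is a minimal sketch using Pydantic alone, with no smelt call: constructing the model with a value outside the allowed set raises a ValidationError. How smelt reacts to such a failure (retrying, raising, and so on) is not covered here. TicketCategoryDemo is a trimmed, hypothetical copy of the model above, and is_valid is a helper introduced only for this illustration.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class TicketCategoryDemo(BaseModel):
    # Trimmed copy of TicketCategory, for illustration only.
    category: Literal["billing", "technical", "shipping", "account", "general"]
    priority: Literal["low", "medium", "high", "urgent"]


def is_valid(category: str, priority: str) -> bool:
    """Return True if the pair passes the Literal constraints."""
    try:
        TicketCategoryDemo(category=category, priority=priority)
        return True
    except ValidationError:
        return False


print(is_valid("billing", "urgent"))  # both values are in the allowed sets
print(is_valid("refunds", "urgent"))  # "refunds" is not an allowed category
```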
Multi-label classification
Use list[str] for multi-label outputs:
# Reuses the imports and `model` from the first example.
class ArticleTags(BaseModel):
    primary_topic: str = Field(description="Main topic of the article")
    tags: list[str] = Field(description="1-5 relevant tags for the article")
    reading_level: Literal["beginner", "intermediate", "advanced"] = Field(
        description="Target reading level"
    )


job = Job(
    prompt="Analyze each article and assign a primary topic, relevant tags (1-5), "
    "and a reading level.",
    output_model=ArticleTags,
    batch_size=5,
)

articles = [
    {"title": "Introduction to Neural Networks", "abstract": "A beginner's guide to understanding how neural networks work..."},
    {"title": "Advanced Kubernetes Networking", "abstract": "Deep dive into CNI plugins, service meshes, and network policies..."},
]

result = job.run(model, data=articles)
for article, tags in zip(articles, result.data):
    print(f"{article['title']}")
    print(f" Topic: {tags.primary_topic}")
    print(f" Tags: {', '.join(tags.tags)}")
    print(f" Level: {tags.reading_level}")
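The "1-5 tags" limit appears only in the prompt and the field description, so the list[str] type itself does not enforce it. Below is a minimal post-processing sketch, assuming you want lowercase, de-duplicated tags capped at five. normalize_tags is a hypothetical helper, not a smelt function, and the input list is hand-written rather than real model output.

```python
def normalize_tags(tags, max_tags=5):
    """Lowercase, strip, and de-duplicate tags, keeping first-seen order."""
    seen = set()
    cleaned = []
    for tag in tags:
        key = tag.strip().lower()
        # Skip empty strings and tags already seen in another casing.
        if key and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned[:max_tags]


# Hand-written example input, not real model output.
raw_tags = ["Kubernetes", "networking", " kubernetes ", "CNI", "", "Service Mesh", "DevOps", "Cloud"]
print(normalize_tags(raw_tags))
```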
Binary classification
Simple yes/no classification:
# Reuses the imports from the first example.
class SpamCheck(BaseModel):
    is_spam: bool = Field(description="Whether the email is spam")
    confidence: float = Field(description="Confidence score from 0.0 (uncertain) to 1.0 (certain)")
    reason: str = Field(description="Brief explanation for the classification")


job = Job(
    prompt="Determine if each email is spam. Consider common spam indicators: "
    "urgency language, suspicious links, too-good-to-be-true offers, "
    "requests for personal information.",
    output_model=SpamCheck,
    batch_size=20,
)
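The example above only defines the job. Once it has run, the confidence field can help decide which results to act on automatically and which to send to a person. The following is a minimal sketch, assuming a review threshold of 0.8 (an arbitrary value chosen for illustration). SpamResult is a dataclass stand-in for SpamCheck, route is a hypothetical helper, and the sample results are hand-written placeholders.

```python
from dataclasses import dataclass


@dataclass
class SpamResult:
    """Stand-in for SpamCheck (illustrative only)."""
    is_spam: bool
    confidence: float
    reason: str


def route(result, threshold=0.8):
    """Send low-confidence results to review; otherwise trust the label."""
    if result.confidence < threshold:
        return "review"
    return "spam" if result.is_spam else "inbox"


# Hand-written placeholder results, not real model output.
results = [
    SpamResult(True, 0.97, "Too-good-to-be-true offer with a suspicious link"),
    SpamResult(False, 0.55, "Urgent tone, but the sender looks legitimate"),
    SpamResult(False, 0.92, "Ordinary meeting invitation"),
]

for r in results:
    print(route(r), "-", r.reason)
```

Keep in mind that a confidence score reported by the model is self-assessed and not necessarily calibrated, so the threshold is best tuned against a labeled sample.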
Tips for classification tasks
- Be specific in your prompt: "Classify by GICS sector" is better than "classify by industry"
- Use Literal types when you have a fixed set of categories; this prevents hallucinated categories
- Add Field(description=...) to every field; it tells the LLM exactly what you expect
- Add a confidence field so you can filter low-confidence results for human review
- Include examples in your prompt if the classification criteria are nuanced
- Use temperature=0 for more deterministic, consistent classification: