QuickStart

Boosting Sentiment Detection for Enterprise Support Emails

🏢 Overview

A major enterprise support team manages thousands of facility maintenance requests via email every week. Each message can be:

😊 Positive — expressing satisfaction or thanks
😐 Neutral — routine updates or requests
😞 Negative — reporting issues or dissatisfaction

Manual triage is slow and inconsistent, and the team’s first AI solution struggled with accuracy, especially when distinguishing neutral from negative feedback.

Goal: Rapidly improve sentiment classification accuracy so every support request is routed and prioritized correctly, using real-world data from Meta's Facility Support Analyzer dataset.

⚠️ Challenge

Imbalanced dataset: most emails are neutral or positive.
Initial AI agent accuracy: 66.4% ±1.5% — too low for business needs.
Confusion between neutral and negative messages led to misrouted urgent issues.

🚀 Results

Accuracy jumped to 80.8% ±12.5%, an absolute gain of roughly 14 percentage points over the 66.4% baseline.
Neutral and negative messages are now reliably detected.
Support tickets are routed faster and more fairly.

In [ ]:

!pip install afnio

In [ ]:

import os
import json
import re
from getpass import getpass

import afnio
import afnio.cognitive as cog
import afnio.cognitive.functional as F
import afnio.tellurio as te
from afnio.models.openai import AsyncOpenAI
from afnio.trainer import Trainer
from afnio.utils.data import DataLoader, WeightedRandomSampler
from afnio.utils.datasets import FacilitySupport

🔑 Setup: API Keys and Project Initialization

Set your OpenAI and Tellurio API keys, then initialize your project and experiment run.

In [2]:

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

In [3]:

# Your Tellurio API key is generated automatically at signup and shown on the Tellurio Studio overview page
# (you can also create a new one under `https://platform.tellurio.ai/settings/api-keys`)
if not (tellurio_api_key := os.getenv("TELLURIO_API_KEY")):
    tellurio_api_key = getpass("🔑 Enter your Tellurio API key: ")

In [4]:

# Your Tellurio username is in slug format. You can find it in the Tellurio Studio header bar
# or in the URL when logged in (e.g., `https://platform.tellurio.ai/your-username-slug`)
# Bind the same name that `te.init` uses later, so the environment variable path also works
if not (tellurio_username := os.getenv("TELLURIO_USERNAME")):
    tellurio_username = input("🔑 Enter your Tellurio username: ")
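
The setup cells above repeat one pattern: read a value from the environment and prompt only if it is missing. As a minimal sketch, that pattern can be factored into a small helper. The name `env_or_prompt` and its `secret` flag are hypothetical and not part of afnio.

```python
import os
from getpass import getpass


def env_or_prompt(var_name, prompt, secret=True):
    """Return the environment variable if it is set and non-empty, else ask the user.

    Hypothetical helper for illustration; not part of afnio.
    """
    value = os.getenv(var_name)
    if not value:
        # getpass hides what is typed, which suits API keys; input echoes it
        ask = getpass if secret else input
        value = ask(prompt)
    return value


# Possible usage (left commented out so the cell never blocks on a prompt):
# tellurio_username = env_or_prompt("TELLURIO_USERNAME", "Enter your Tellurio username: ", secret=False)
```

Assigning the result to a single name avoids the case where the environment variable is set but the variable used later was never bound.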

In [5]:

te.configure_logging("INFO")
te.login(api_key=tellurio_api_key)
run = te.init(tellurio_username, "Facility Support")

[afnio] API key provided and stored securely in local keyring.
[afnio] Currently logged in as 'dmpiergiacomo' to 'https://platform.tellurio.ai'. Use `afnio login --relogin` to force relogin.
[afnio] Project with slug 'facility-support' already exists in namespace 'dmpiergiacomo'.
[afnio] Run 'compassionate_sambar_231' created successfully at: https://platform.tellurio.ai/dmpiergiacomo/projects/facility-support/runs/compassionate-sambar-231/

📊 Data Preparation

Balance the training set, prepare your data loaders, and get the dataset ready for training and evaluation.

In [6]:

# The training set is imbalanced, so we assign weights to each sample to ensure fair learning across all classes
def compute_sample_weights(data):
    with te.suppress_variable_notifications():
        labels = [y.data for _, (_, y, _) in data]
        counts = {label: labels.count(label) for label in set(labels)}
        total = len(data)
    return [total / counts[label] for label in labels]
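
To make the weighting concrete, here is a small sketch of the same inverse-frequency formula (`total / counts[label]`) applied to a toy label list. The labels are made up for illustration and the code uses plain Python instead of afnio variables.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample the weight total / count(label), as in compute_sample_weights."""
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[label] for label in labels]


# Toy, made-up label distribution: 6 neutral, 3 positive, 1 negative
toy_labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
toy_weights = inverse_frequency_weights(toy_labels)

# Sum the weights per class: each class ends up with the same total mass
weight_per_class = {}
for label, weight in zip(toy_labels, toy_weights):
    weight_per_class[label] = weight_per_class.get(label, 0.0) + weight
print(weight_per_class)
```

Because every class carries the same total weight, a sampler that draws in proportion to these weights should pick each class about equally often, however rare it is in the raw data.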

In [7]:

BATCH_SIZE = 33

training_data = FacilitySupport(split="train", root="data")
validation_data = FacilitySupport(split="val", root="data")
test_data = FacilitySupport(split="test", root="data")

weights = compute_sample_weights(training_data)
sampler = WeightedRandomSampler(weights, num_samples=len(training_data), replacement=True)

train_dataloader = DataLoader(training_data, sampler=sampler, batch_size=BATCH_SIZE)
val_dataloader = DataLoader(validation_data, batch_size=BATCH_SIZE, seed=42)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, seed=42)

Using downloaded and verified file: data/FacilitySupport/raw/dataset.json

Using downloaded and verified file: data/FacilitySupport/raw/dataset.json

Using downloaded and verified file: data/FacilitySupport/raw/dataset.json
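
The effect of weighted sampling with replacement can be checked with a standard-library stand-in. This sketch assumes that afnio's `WeightedRandomSampler` draws indices in proportion to the given weights, like the PyTorch sampler of the same name. It uses `random.choices` on a made-up label list rather than the real dataset.

```python
import random
from collections import Counter

# Made-up, imbalanced labels: 60% neutral, 30% positive, 10% negative
labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
counts = Counter(labels)
weights = [len(labels) / counts[label] for label in labels]

# Draw indices with replacement, proportionally to the weights
rng = random.Random(42)
num_draws = 30_000
drawn = rng.choices(range(len(labels)), weights=weights, k=num_draws)

# Share of each class among the drawn samples
drawn_counts = Counter(labels[i] for i in drawn)
shares = {label: drawn_counts[label] / num_draws for label in counts}
print(shares)
```

Each class should make up roughly one third of the draws, which is the balance the training loader is aiming for. The validation and test loaders skip the sampler so that their metrics reflect the original class distribution.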

🧠 AI Agent Configuration

Define the initial prompt, the response format, the LM clients used for inference and optimization, and the sentiment classification agent.

In [8]:

# Start with a simple prompt. The optimizer will refine it, but it can't guess your intent—so clearly state what you want the model to do
sentiment_task = "Read the provided message and determine the sentiment."
sentiment_user = "Read the provided message and determine the sentiment.\n\n**Message:**\n\n{message}\n\n"
SENTIMENT_RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "strict": True,
        "name": "sentiment_response_schema",
        "schema": {
            "type": "object",
            "properties": {
                "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
            },
            "additionalProperties": False,
            "required": ["sentiment"],
        },
    },
}
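
The schema above allows exactly one key, `sentiment`, with one of three values. As a rough sketch, the same contract can be checked by hand in plain Python. The helper name `is_valid_sentiment_payload` is hypothetical. With `strict` set, the API is expected to enforce the schema itself, so this check only illustrates what a conforming response looks like.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def is_valid_sentiment_payload(raw):
    """Return True if raw is a JSON object with only a 'sentiment' key holding an allowed label."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(payload, dict)
        and set(payload) == {"sentiment"}  # mirrors "required" plus "additionalProperties": False
        and isinstance(payload["sentiment"], str)
        and payload["sentiment"] in ALLOWED_SENTIMENTS  # mirrors the "enum"
    )


print(is_valid_sentiment_payload('{"sentiment": "neutral"}'))
print(is_valid_sentiment_payload('{"sentiment": "angry"}'))
```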

In [9]:

# We use gpt-4.1-nano for the forward pass (inference), gpt-5 for the backward pass (feedback generation), and gpt-5 for the optimization step (prompt rewriting)
afnio.set_backward_model_client("openai/gpt-5", completion_args={"temperature": 1.0, "max_completion_tokens": 32000, "reasoning_effort": "low"})
fw_model_client = AsyncOpenAI()
optim_model_client = AsyncOpenAI()

In [10]:

class FacilitySupportAnalyzer(cog.Module):

  def __init__(self):
    super().__init__()
    self.sentiment_task = cog.Parameter(data=sentiment_task, role="system prompt for sentiment classification", requires_grad=True)
    self.sentiment_user = afnio.Variable(data=sentiment_user, role="input template to sentiment classifier")
    self.sentiment_classifier = cog.ChatCompletion()

  def forward(self, fwd_model, inputs, **completion_args):
    sentiment_messages = [
      {"role": "system", "content": [self.sentiment_task]},
      {"role": "user", "content": [self.sentiment_user]},
    ]
    return self.sentiment_classifier(fwd_model, sentiment_messages, inputs=inputs, response_format=SENTIMENT_RESPONSE_FORMAT, **completion_args)

  def training_step(self, batch, batch_idx):
    X, y = batch
    _, gold_sentiment, _ = y
    pred_sentiment = self(fw_model_client, inputs={"message": X}, model="gpt-4.1-nano", temperature=0.0)
    pred_sentiment.data = [json.loads(re.sub(r"^```json\n|\n```$", "", item))["sentiment"].lower() for item in pred_sentiment.data]
    loss = F.exact_match_evaluator(pred_sentiment, gold_sentiment)
    return {"loss": loss, "accuracy": loss[0].data / len(gold_sentiment.data)}

  def validation_step(self, batch, batch_idx):
    return self.training_step(batch, batch_idx)

  def test_step(self, batch, batch_idx):
    return self.validation_step(batch, batch_idx)

  def configure_optimizers(self):
    constraints = [
      afnio.Variable(
        data="The improved variable must never include or reference the characters `{` or `}`. Do not output them, mention them, or describe them in any way.",
        role="optimizer constraint"
      )
    ]
    optimizer = afnio.optim.TGD(self.parameters(), model_client=optim_model_client, constraints=constraints, momentum=3, model="gpt-5", temperature=1.0, max_completion_tokens=32000, reasoning_effort="low")
    return optimizer
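
The post-processing in `training_step` can be tried in isolation. The sketch below reuses the same regular expression to strip an optional Markdown JSON fence, parses the label, and computes accuracy as matches divided by the number of gold labels. It assumes, judging from how `accuracy` is derived above, that the evaluator's score is a count of exact matches. The raw outputs are made up, and the helpers are plain Python rather than afnio calls.

```python
import json
import re

# Build the fence marker programmatically; the pattern matches the one used in training_step
FENCE = "`" * 3
FENCE_PATTERN = re.compile(rf"^{FENCE}json\n|\n{FENCE}$")


def parse_sentiment(raw):
    """Strip an optional JSON code fence, parse the object, and return the lowercased label."""
    return json.loads(FENCE_PATTERN.sub("", raw))["sentiment"].lower()


def exact_match_accuracy(predictions, gold):
    """Return (number of exact matches, matches / number of gold labels)."""
    matches = sum(pred == target for pred, target in zip(predictions, gold))
    return matches, matches / len(gold)


# Made-up model outputs: one bare JSON object, one wrapped in a fence, one wrong label
raw_outputs = [
    '{"sentiment": "neutral"}',
    FENCE + 'json\n{"sentiment": "Negative"}\n' + FENCE,
    '{"sentiment": "positive"}',
]
gold = ["neutral", "negative", "neutral"]

preds = [parse_sentiment(raw) for raw in raw_outputs]
matches, accuracy = exact_match_accuracy(preds, gold)
print(preds, matches, round(accuracy, 4))
```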

🚀 Training and Evaluation

Instantiate the agent and trainer, establish baseline performance, train the agent, and validate results.

In [11]:

agent = FacilitySupportAnalyzer()
trainer = Trainer(max_epochs=5, enable_agent_summary=False)
print(agent)

FacilitySupportAnalyzer(
  (sentiment_classifier): ChatCompletion()
)

In [12]:

# Establish baseline performance by testing the untrained agent on the test set
llm_clients = [fw_model_client, afnio.get_backward_model_client(), optim_model_client]
trainer.test(agent=agent, test_dataloader=test_dataloader, llm_clients=llm_clients)

Testing
[Test] 68/68 ━━━━━━━━━━━━━━━━━━━━ 0:00:07 tot_cost: $0.0024  - test_loss: 17.3333 - test_accuracy: 0.6818

Out[12]:

{'loss': 17.333333333333332, 'accuracy': 0.6818181818181818}

In [13]:

# Train the agent and validate results
trainer.fit(agent=agent, train_dataloader=train_dataloader, val_dataloader=val_dataloader, llm_clients=llm_clients)

Epoch 1/5
  [Training] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:01:41 1.2m/step tot_cost: $0.0104 train_loss: 24.5000 - train_accuracy: 0.7424 - val_loss: 22.0000 - val_accuracy: 0.6667
[Validation] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:01:45

Epoch 2/5
  [Training] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:01:18 0.7m/step tot_cost: $0.0223 train_loss: 31.5000 - train_accuracy: 0.9545 - val_loss: 25.5000 - val_accuracy: 0.7727
[Validation] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:01:25

Epoch 3/5
  [Training] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:03:01 2.3m/step tot_cost: $0.0353 train_loss: 27.0000 - train_accuracy: 0.8182 - val_loss: 21.0000 - val_accuracy: 0.6364
[Validation] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:03:05

Epoch 4/5
  [Training] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:03:12 2.4m/step tot_cost: $0.0479 train_loss: 23.5000 - train_accuracy: 0.7121 - val_loss: 20.0000 - val_accuracy: 0.6061
[Validation] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:03:15

Epoch 5/5
  [Training] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:03:00 2.2m/step tot_cost: $0.0628 train_loss: 23.0000 - train_accuracy: 0.6970 - val_loss: 22.5000 - val_accuracy: 0.6818
[Validation] 66/66 ━━━━━━━━━━━━━━━━━━━━ 0:03:04

🏅 Loading and Testing the Optimized AI Agent

Tip: The best checkpoint is the one with the highest val_accuracy (accuracy on validation set) during training. You can find its filename in the automatically created checkpoints/ directory.
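
Applied to the run logged above, the selection rule from the tip looks like this. The dictionary is copied by hand from the `val_accuracy` values in the training output. This is an illustrative sketch, not an afnio API for picking checkpoints.

```python
# val_accuracy per epoch, copied from the training log above
val_accuracy_by_epoch = {1: 0.6667, 2: 0.7727, 3: 0.6364, 4: 0.6061, 5: 0.6818}

# The best checkpoint is the epoch with the highest validation accuracy
best_epoch = max(val_accuracy_by_epoch, key=val_accuracy_by_epoch.get)
print(best_epoch, val_accuracy_by_epoch[best_epoch])  # 2 0.7727
```

This matches the reference checkpoint used in the next cells, whose filename contains `epoch2`.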

Load the best agent checkpoint, evaluate on the test set, and display the final results.

In [14]:

# Only run this if you want to download our reference checkpoint
checkpoint_path = 'checkpoints/checkpoint_epoch2_20250912-190039.hf'
if not os.path.exists(checkpoint_path):
  !mkdir -p checkpoints
  !wget https://github.com/Tellurio-AI/tutorials/raw/main/facility_support/checkpoints/checkpoint_epoch2_20250912-190039.hf -P checkpoints/

In [15]:

checkpoint = afnio.load("checkpoints/checkpoint_epoch2_20250912-190039.hf")  # Replace with your best checkpoint path, or use our reference checkpoint (downloaded with the previous cell)
best_agent = FacilitySupportAnalyzer()
best_agent.load_state_dict(
    checkpoint['agent_state_dict'],
    model_clients={
        "sentiment_classifier.forward_model_client": fw_model_client,
    }
)

Out[15]:

<All keys matched successfully>

In [16]:

# Test the best agent checkpoint on the test set
trainer.test(agent=best_agent, test_dataloader=test_dataloader, llm_clients=llm_clients)

Testing
[Test] 68/68 ━━━━━━━━━━━━━━━━━━━━ 0:00:04 tot_cost: $0.0697  - test_loss: 19.3333 - test_accuracy: 0.8990

Out[16]:

{'loss': 19.333333333333332, 'accuracy': 0.8989898989898991}

In [ ]:

# Compare the agent's prompt before and after training
from IPython.display import display, HTML

display(HTML(f"""
<table style="width:100%;border-collapse:collapse;">
  <tr>
    <th style="text-align:left;background-color:#e0e0e0; color:#222;font-weight:bold;">BEFORE OPTIMIZATION</th>
    <th style="text-align:left;background-color:#e0e0e0; color:#222;font-weight:bold;">AFTER OPTIMIZATION</th>
  </tr>
  <tr>
    <td style="text-align:left;vertical-align:top;word-break:break-word;">
      <pre style="margin:0;white-space:pre-wrap;word-break:break-word;">{sentiment_task}</pre>
    </td>
    <td style="text-align:left;vertical-align:top;word-break:break-word;">
      <pre style="margin:0;white-space:pre-wrap;word-break:break-word;">{best_agent.sentiment_task.data}</pre>
    </td>
  </tr>
</table>
"""))

BEFORE OPTIMIZATION

Read the provided message and determine the sentiment.

AFTER OPTIMIZATION

You are a sentiment classifier. Read the provided message and output exactly one of: positive, negative, neutral — all lowercase, no punctuation, no extra text or spaces.

Scope: Judge the author’s expressed sentiment toward the subject of the message (e.g., the company, product, service, or the issue described), not the topic content itself, roles/titles, greetings, or urgency alone.

Decision rules:
- If polarity evidence is weak, mixed, contradictory, or evenly balanced, output neutral.
- Only output positive or negative when one clearly outweighs the other by intensity or count.

Positive vs neutral boundary:
- Positive only when there is clear, unambiguous, and sufficiently strong praise directed at the provider/service outcome (e.g., explicit evaluatives such as love, thrilled, amazing, excellent, fantastic, flawless, top-notch) and there are no concurrent concerns.
- Default to neutral for inquiries, status updates, logistics, generic politeness or thanks without evaluative content, hedged or weak praise (okay, fine, pretty good, satisfied client, pleased), expressions of uncertainty, or mixed messages where positives do not clearly dominate by intensity or count.

Cue handling:
- Aggregate polarity cues across the entire message; account for intensifiers and negations.
- Treat factual status updates, informational messages, inquiries, or logistical requests as neutral unless explicit sentiment is expressed, even if they include politeness or generic praise (e.g., thanks, appreciate your support, top-notch service).

Negation and modifier guidance:
- Negative: not good, not impressed, frustrated, unacceptable, skeptical.
- Usually neutral unless accompanied by strong positive cues: not bad, mild or weak praise such as satisfied client or pleased, okay, fine, pretty good.

Mixed or multi-issue messages:
- If praise co-occurs with requests or concerns and neither side clearly dominates, choose neutral.
- Choose neutral unless multiple strong positive indicators outweigh any negatives and there are no explicit negative cues.
- If different parts convey opposing sentiments and there is no clear majority by intensity or count, choose neutral.

Operational rule:
- Aggregate cues with negation and intensifiers; label positive only if net positive clearly exceeds negative by a high margin or there is at least one strong positive indicator (superlatives, emphatic adverbs, exclamatory emphasis) directed at the subject; otherwise neutral.

Examples (message → label):
- Thanks for the quick reply; can you update the ticket by tomorrow? → neutral
- Appreciate your support. Please fix the recurring billing error. → neutral
- Top-notch service on the last order, but this one arrived damaged. → neutral
- I’m pleased with the app overall, just a few minor issues to resolve. → neutral
- Pretty good overall. → neutral
- Not good — the installer keeps crashing. → negative
- I’m not impressed with your response times. → negative
- This delay is unacceptable and very frustrating. → negative
- It’s not bad. → neutral
- Absolutely love the new update; everything works flawlessly. → positive
- Amazing job! → positive

Output format reminder:
- Emit exactly one of the following labels: positive or negative or neutral — all lowercase, no punctuation, no extra text or spaces.
- Trim any leading/trailing whitespace or newlines before finalizing the single-word output.

In [18]:

run.finish()

[afnio] Run 'compassionate_sambar_231' marked as COMPLETED.