# Runs and Experiments

```{warning}
Before running any code, ensure you are logged in to the Afnio backend (`afnio login`).
See [Logging in to Afnio Backend](login) for details.
```

Afnio promotes an explicitly experimental approach to building AI agent
architectures. The same iterative workflow used for machine learning and
deep-learning development—design, experiment, evaluate, and refine—is central
to producing robust agents. Afnio is optimized for language-centric workflows
where the training set is often orders of magnitude smaller than typical deep-learning
datasets; powerful optimizers and textual gradients let you learn from a few
high-quality examples and rich semantic feedback rather than relying on massive
numeric datasets.

This page explains how Afnio tracks experiments on the [Tellurio Studio](https://platform.tellurio.ai/)
backend, how Runs are created and finished, and how logs, metrics, and artifacts are
associated with an active [`Run`](../../generated/afnio.tellurio.run.rst).

**Terminology note:** "Runs" and "Experiments" are used interchangeably and
refer to the same concept: a tracked execution grouping metadata, metrics,
costs, and artifacts for a single experiment.

---

## What is a Run?

A [`Run`](../../generated/afnio.tellurio.run.rst) represents a single tracked
execution of your agent or workflow. It groups inputs, outputs, evaluation metrics,
cost information (LM usage), and artifacts (checkpoints, serialized state) created
while developing or training an agent.

- **Local code, remote optimization:** Agent logic and forward passes execute
  locally, while the Afnio backend — hosted on
  [Tellurio Studio](https://platform.tellurio.ai/) — executes LM requests,
  constructs the backward/optimization graph, and runs optimizer/backpropagation.
  This separation enables secure, scalable textual gradient generation, centralized
  optimizer execution, and consolidated cost and metric tracking.
- **Active Run:** An active Run is required to optimize your agent via
  backpropagation and to associate logs, metrics, and artifacts with a specific experiment.
  Without an active Run, Afnio will not create the backward graph on the server
  and backpropagation-based operations will fail. See the
  [Active Run and Backpropagation](#active-run-and-backpropagation) subsection
  below for details.

---

## Creating a Run

[`afnio.tellurio`](../../generated/afnio.tellurio) is the client module you use
to interact with [Tellurio Studio](https://platform.tellurio.ai/): login, create
or retrieve [Projects](projects_tellurio_studio.md#creating-and-managing-projects),
create [Runs](#creating-a-run), log metrics, and upload artifacts.

Runs are grouped within a Project. See [Projects (Tellurio Studio)](projects_tellurio_studio)
for details on creating Projects, visibility levels, and membership.

Runs can be created programmatically from Python. The `te.init(...)` function
constructs or retrieves a [`Project`](../../generated/afnio.tellurio.project) and
creates a [`Run`](../../generated/afnio.tellurio.run.rst) that becomes the active
Run for the process. A typical usage pattern looks like this:

```python
import afnio.tellurio as te

run = te.init(
    namespace_slug="username_or_org",
    project_display_name="My Project",
    description="Prompt tuning for sentiment agent",
)

# run forward/backward logic here
print("Running experiment...")

# When you create a Run programmatically, call `run.finish()` to mark it
# COMPLETED on the server and to clear the active Run from the current process.
run.finish()
```

_Output:_

```output
INFO     : Project with slug 'my-project' does not exist in namespace 'username_or_org'. Creating it now with RESTRICTED visibility.
INFO     : Run 'hungry_brownie_557' created successfully at: https://platform.tellurio.ai/username_or_org/projects/my-project/runs/hungry-brownie-557/
Running experiment...
INFO     : Run 'hungry_brownie_557' marked as COMPLETED.
```

The `te.init(...)` call returns a [`Run`](../../generated/afnio.tellurio.run.rst)
object and (internally) sets it as the active Run for the current process. Many
higher-level Afnio utilities will automatically use that active Run to associate
logs, metrics, and artifacts.

You can also use the Run as a context manager so that it is automatically
finished when the block exits:

```python
import afnio.tellurio as te

with te.init("username_or_org", "my-project") as run:
    # run forward/backward logic here
    print("Running experiment...")

# when the with-block exits without an exception, the Run is marked COMPLETED

_Output:_

```output
INFO     : Project with slug 'my-project' already exists in namespace 'username_or_org'.
INFO     : Run 'focused_halloumi_666' created successfully at: https://platform.tellurio.ai/username_or_org/projects/my-project/runs/focused-halloumi-666/
Running experiment...
INFO     : Run 'focused_halloumi_666' marked as COMPLETED.
```

---

## Run Lifecycle and `run.finish()`

A Run progresses through a small set of lifecycle states (expressed in the API and UI). The table below combines each state description with how that state is set in practice:

| State     | Description                                                                               | How it's set (typical)                                                                             |
| --------- | ----------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| RUNNING   | The Run is active and accepting logs, metrics, and artifacts.                             | Created by `te.init(...)` or when entering a `with te.init(...)` block.                            |
| COMPLETED | The Run finished successfully.                                                            | Call `run.finish()` or exit a `with te.init(...)` block without an exception.                      |
| CRASHED   | The Run terminated due to an unhandled exception during execution.                        | Set automatically by the context manager or the atexit/exit handler when an exception occurs.      |
| FAILED    | A safeguard/intermediate failure state used when a Run was left unfinished or superseded. | Set by the safeguard when a previous active Run is replaced or in certain unrecoverable scenarios. |

Why `run.finish()` matters:

- Marks the Run as `COMPLETED` (or another status you pass) in the
  [Tellurio Studio](https://platform.tellurio.ai/) UI via a PATCH request to the server.
- Clears the active Run from the local process so subsequent logs/operations are not associated with an ended Run.
- Unregisters the automatic safeguard that attempts to finish the Run on process exit.

Prefer the context-manager form (`with te.init(...) as run:`) to ensure Runs are cleanly finished and automatically marked `CRASHED` if an exception occurs.
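
The state transitions in the table can be pictured with a small stand-in class. This is a toy model, not Afnio's implementation: `ToyRun` and its `status` attribute are hypothetical names, and nothing is sent to Tellurio Studio. It only sketches how a context manager can map a clean exit to `COMPLETED` and an unhandled exception to `CRASHED`.

```python
class ToyRun:
    """Hypothetical stand-in that mimics the Run states described above."""

    def __init__(self, name):
        self.name = name
        self.status = "RUNNING"  # a Run starts out active

    def finish(self, status="COMPLETED"):
        # Only the first call changes the state; later calls are no-ops.
        if self.status == "RUNNING":
            self.status = status

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # An unhandled exception maps to CRASHED, a clean exit to COMPLETED.
        self.finish("CRASHED" if exc_type is not None else "COMPLETED")
        return False  # never swallow the exception


with ToyRun("clean") as ok_run:
    pass  # work that succeeds

try:
    with ToyRun("broken") as bad_run:
        raise ValueError("simulated failure")
except ValueError:
    pass  # the exception still propagates to the caller

print(ok_run.status)   # COMPLETED
print(bad_run.status)  # CRASHED
```

Because `__exit__` returns `False`, the original exception still reaches the caller; the status change is a side effect and does not suppress the error.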

---

## Active Run and Backpropagation

The backward graph—the structure that describes how textual gradients should
flow to learnable [Parameters](build_agent_workflow.md#parameters-and-buffers)
(for example, prompt pieces)—is constructed on the Afnio backend (hosted on
[Tellurio Studio](https://platform.tellurio.ai/)) only when an active Run exists.
In practice:

- Create a Run with `te.init(...)` before you run any code that performs
  optimization or backpropagation.
- If you call [`optimizer.clear_grad()`](optimization_loop.md#optimizer),
  [`optimizer.step()`](optimization_loop.md#optimizer),
  [`trainer.fit()`](trainer.md#train-validate-trainer-fit), or any
  [`.backward()`](automatic_differentiation.md#computing-gradients) operations
  without an active Run, those operations will fail.

---

## Logging and Tracking

Logs and metrics are the primary way to monitor training and evaluation. The
[`Run`](../../generated/afnio.tellurio.run.rst) object exposes a simple
[`.log()`](../../generated/afnio.tellurio.run.rst#afnio.tellurio.run.Run.log)
method that records scalar values, structured metrics, and step indices. Logged
values are streamed to [Tellurio Studio](https://platform.tellurio.ai/) as they are
produced, and associated with the active Run, so you can visualize and compare them
on the platform.

[Tellurio Studio](https://platform.tellurio.ai/) provides real‑time scalar plots,
per‑Run overlays for direct comparisons, step‑wise charts, and cost‑breakdown
visualizations; use the platform's compare view to overlay runs and inspect
differences interactively.

**Example: Logging metrics inside a Run (context manager)**

```python
import afnio.tellurio as te

with te.init("username_or_org", "my-project") as run:

    # Log some metrics
    run.log("train_loss", 0.23, step=3)
    run.log("val_accuracy", 0.87, step=3)
```

_Output:_

```output
INFO     : Project with slug 'my-project' already exists in namespace 'username_or_org'.
INFO     : Run 'vigilant_pho_308' created successfully at: https://platform.tellurio.ai/username_or_org/projects/my-project/runs/vigilant-pho-308/
INFO     : Logged metric 'train_loss'=0.23 for run 'vigilant_pho_308'.
INFO     : Logged metric 'val_accuracy'=0.87 for run 'vigilant_pho_308'.
INFO     : Run 'vigilant_pho_308' marked as COMPLETED.
```
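
In a training loop, the same call is typically repeated once per step. The sketch below assumes only the `run.log(name, value, step=...)` call form shown in the example above. `log_metrics` and `RecordingRun` are hypothetical helpers, the metric values are made-up placeholders, and `RecordingRun` is a local stand-in so the snippet runs without a backend. With a real Run you would pass the object returned by `te.init(...)` instead.

```python
def log_metrics(run, metrics, step):
    """Log every entry of a metrics dict under the same step index."""
    for name, value in metrics.items():
        run.log(name, value, step=step)


class RecordingRun:
    """Hypothetical stand-in that stores calls instead of sending them."""

    def __init__(self):
        self.records = []

    def log(self, name, value, step=None):
        self.records.append((name, value, step))


run = RecordingRun()
history = [
    {"train_loss": 0.41, "val_accuracy": 0.80},
    {"train_loss": 0.23, "val_accuracy": 0.87},
]
# Steps start at 1 so each dict maps to one optimization step.
for step, metrics in enumerate(history, start=1):
    log_metrics(run, metrics, step)

print(len(run.records))  # 4
```

Logging all metrics for a step under the same `step` value keeps the curves aligned when they are compared across Runs.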

What gets tracked:

- **Scalars and metrics:** loss, accuracy, custom evaluation scores.
- **Costs:** Afnio can record LM usage and cost for each call so you can monitor
  budget across runs.
  <!-- TODO: Add the following bullet when supported -->
  <!-- - **Artifacts:** Checkpoints, serialized `state_dict`s, and other files can be
    saved and associated with the Run (see `afnio.save` and checkpoint examples). -->

---

## Best Practices

- **Create a Run early:** Call `te.init(...)` before starting training or optimization
  so that the backward graph is built and metrics are associated with the Run.
- **Use descriptive names:** Give runs meaningful `name` and `description`
  values to simplify later analysis.
- **Use the context manager:** Prefer `with te.init(...) as run:` to ensure
  runs are cleanly finished even if your script raises an exception.
  <!-- TODO: Add the following bullet when supported -->
  <!-- - **Save checkpoints as artifacts:** Persist `state_dict()` snapshots and upload
    them via your normal `afnio.save` workflow to keep reproducible snapshots tied
    to the Run. -->

---

## Troubleshooting

- If backpropagation does not produce gradients on the server, confirm you have
  an active Run set (use `te.init(...)`) and that `requires_grad=True` is
  set for the [Parameters](build_agent_workflow.md#parameters-and-buffers)
  you expect to optimize.
- If logs do not appear in [Tellurio Studio](https://platform.tellurio.ai/),
  check network connectivity and make sure your session credentials and consent
  (API key sharing) are intact.

---

## Further reading

- [Projects (Tellurio Studio)](projects_tellurio_studio)
- [Trainer](trainer)