afnio.autodiff.evaluator

afnio.autodiff.evaluator.DeterministicEvaluator

Bases: Function

Evaluates predictions deterministically using a user-defined evaluation function within the afnio framework, supporting automatic differentiation.

This class inherits from Function and requires both the forward and backward methods to be defined.

The DeterministicEvaluator function computes a score and an explanation based on the prediction and target inputs using a user-defined evaluation function (eval_fn). The evaluation function's purpose is described by eval_fn_purpose. Outputs include a numerical or textual score and a textual explanation, both wrapped as Variable objects.

The prediction is a Variable. The target can be a string, a list of strings, or a Variable. Each Variable passed as an input argument can have either a scalar or a list data field, supporting both individual samples and batch processing. For batch processing, the lengths of prediction and target must match.
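The batch-handling contract above can be sketched with a plain Python evaluation function. This is an illustrative helper, not part of the afnio API; it operates on raw data values rather than Variable objects:

```python
def exact_match_batch(prediction_data, target_data):
    """Score each prediction against its target with exact matching."""
    # Normalize scalar inputs to one-element lists so a single code path
    # handles both individual samples and batches.
    preds = prediction_data if isinstance(prediction_data, list) else [prediction_data]
    targets = target_data if isinstance(target_data, list) else [target_data]
    # Mirror the framework's requirement that batch lengths must match.
    if len(preds) != len(targets):
        raise ValueError("prediction and target batches must have equal length")
    return [1 if p == t else 0 for p, t in zip(preds, targets)]
```

Passing a scalar yields a one-element score list, while passing equal-length lists yields one score per sample.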

The success_fn parameter is a user-defined function that returns True when all predictions evaluated by eval_fn are considered successful, and False otherwise. If success_fn returns True, the backward pass skips gradient calculations and returns an empty gradient, saving computation time.
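A success_fn is just a callable over the list of per-sample scores. As a minimal sketch, assuming 0/1 exact-match scores (the function name is illustrative):

```python
def all_correct(scores):
    """Return True only when every per-sample score indicates success (1)."""
    return all(s == 1 for s in scores)
```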

The reduction_fn parameter specifies the aggregation function to use for scores across a batch of predictions and targets. When specified, the reduction function's purpose is described using reduction_fn_purpose. If aggregation is not desired, set reduction_fn and reduction_fn_purpose to None.
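Any callable that maps a list of scores to a single value can serve as a reduction_fn. Besides the built-in sum, a mean reduction might look like this (an illustrative sketch, with "mean" as its reduction_fn_purpose):

```python
def mean_reduction(scores):
    """Aggregate per-sample scores into a single batch-level average."""
    return sum(scores) / len(scores)
```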

Examples:

Example with scalar inputs:

>>> prediction = Variable(
...     data="green",
...     role="color prediction",
...     requires_grad=True
... )
>>> target = "red"
>>> def exact_match_fn(p: str, t: str) -> int:
...     return 1 if p == t else 0
>>> score, explanation = DeterministicEvaluator.apply(
...     prediction,
...     target,
...     exact_match_fn,
...     "exact match",
... )
>>> score.data
0
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."

Example with batched inputs:

>>> prediction = Variable(
...     data=["green", "blue"],
...     role="color prediction",
...     requires_grad=True
... )
>>> target = ["red", "blue"]
>>> def exact_match_fn(p: str, t: str) -> int:
...     return 1 if p == t else 0
>>> score, explanation = DeterministicEvaluator.apply(
...     prediction,
...     target,
...     exact_match_fn,
...     "exact match",
...     reduction_fn=sum,
...     reduction_fn_purpose="summation"
... )
>>> score.data
1
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."
Source code in afnio/autodiff/evaluator.py
@evaluator
class DeterministicEvaluator(Function):
    """
    Evaluates predictions deterministically using a user-defined evaluation function
    within the `afnio` framework, supporting automatic differentiation.

    This class inherits from [`Function`][afnio.autodiff.function.Function] and
    requires both the [`forward`][afnio.autodiff.function.Function.forward] and
    [`backward`][afnio.autodiff.function.Function.backward] methods to be defined.

    The `DeterministicEvaluator` function computes a `score` and an `explanation` based
    on the `prediction` and `target` inputs using a user-defined evaluation function
    (`eval_fn`). The evaluation function's purpose is described by `eval_fn_purpose`.
    Outputs include a numerical or textual score and a textual explanation, both wrapped
    as [`Variable`][afnio.Variable] objects.

    The `prediction` is a [`Variable`][afnio.Variable]. The `target` can be a string,
    a list of strings, or a [`Variable`][afnio.Variable].
    Each [`Variable`][afnio.Variable] passed as an input argument can have either
    a scalar or a list [`data`][afnio.Variable.data] field, supporting both individual
    samples and batch processing. For batch processing, the lengths of `prediction`
    and `target` must match.

    The `success_fn` parameter is a user-defined function that returns `True` when
    all predictions evaluated by `eval_fn` are considered successful, and `False`
    otherwise. If `success_fn` returns `True`, the `backward` pass skips gradient
    calculations and returns an empty gradient, saving computation time.

    The `reduction_fn` parameter specifies the aggregation function to use for scores
    across a batch of predictions and targets. When specified, the reduction function's
    purpose is described using `reduction_fn_purpose`. If aggregation is not desired,
    set `reduction_fn` and `reduction_fn_purpose` to `None`.

    Examples:
        Example with scalar inputs:
        >>> prediction = Variable(
        ...     data="green",
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = "red"
        >>> def exact_match_fn(p: str, t: str) -> int:
        ...     return 1 if p == t else 0
        >>> score, explanation = DeterministicEvaluator.apply(
        ...     prediction,
        ...     target,
        ...     exact_match_fn,
        ...     "exact match",
        ... )
        >>> score.data
        0
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."

        Example with batched inputs:
        >>> prediction = Variable(
        ...     data=["green", "blue"],
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = ["red", "blue"]
        >>> def exact_match_fn(p: str, t: str) -> int:
        ...     return 1 if p == t else 0
        >>> score, explanation = DeterministicEvaluator.apply(
        ...     prediction,
        ...     target,
        ...     exact_match_fn,
        ...     "exact match",
        ...     reduction_fn=sum,
        ...     reduction_fn_purpose="summation"
        ... )
        >>> score.data
        1
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."
    """  # noqa: E501

    @staticmethod
    def forward(
        ctx,
        prediction: Variable,
        target: Union[str, List[str], Variable],
        eval_fn: Callable[[Variable, Union[str, Variable]], List[Any]],
        eval_fn_purpose: Union[str, Variable],
        success_fn: Optional[Callable[[List[Any]], bool]],
        reduction_fn: Optional[Callable[[List[Any]], Any]],
        reduction_fn_purpose: Optional[Union[str, Variable]],
    ) -> Tuple[Variable, Variable]:
        """
        Forward pass for the deterministic evaluator function.

        Warning:
            This method is invoked by
            [`apply()`][afnio.autodiff.function.Function.apply]
            and should not be called directly.

        Args:
            ctx: Context object used to save information for [`backward`][..backward]
                computation.
            prediction: The predicted variable to evaluate, which can have scalar or
                list [`data`][afnio.Variable.data] (supporting both individual and
                batch processing).
            target: The target (ground truth) to compare against, which can be a string,
                a list of strings, or a `Variable`.
            eval_fn: A user-defined function that takes a prediction and a target
                and returns a list of scores for each sample. If `target` is a
                [`Variable`][afnio.Variable], the function should compare the
                [`data`][afnio.Variable.data] fields of `prediction` and `target`.
            eval_fn_purpose: A brief description of the purpose of `eval_fn`,
                used by the autodiff engine to generate the explanations.
            success_fn: A user-defined function that takes the list of scores returned
                by `eval_fn` and returns `True` if all predictions are considered
                successful, or `False` otherwise.
            reduction_fn: An optional function to aggregate scores across a batch of
                predictions and targets. If `None`, no aggregation is applied.
            reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
                used by the autodiff engine to generate explanations. Required if
                `reduction_fn` is provided.

        Returns:
            score: A variable containing the evaluation score(s),
                or their aggregation if `reduction_fn` is provided.
            explanation: A variable containing the explanation(s) of the evaluation,
                or their aggregation if `reduction_fn` is provided.

        Raises:
            TypeError: If the types of `prediction`, `target`, `eval_fn`,
                `eval_fn_purpose`, `success_fn`, `reduction_fn`,
                or `reduction_fn_purpose` are not as expected.
            ValueError: If the lengths of `prediction.data` and `target` (or
                `target.data`, when `target` is a `Variable`) do not match when
                both are lists, or if `eval_fn_purpose` (or `eval_fn_purpose.data`)
                is an empty string, or if `reduction_fn_purpose` (or
                `reduction_fn_purpose.data`) is an empty string,
                or if the number of scores returned by `eval_fn`
                does not match the number of samples in the batch.
        """
        raise NotImplementedError(
            "DeterministicEvaluator.forward is implemented on the server. "
            "Client-side execution is not supported."
        )

    @staticmethod
    def backward(
        ctx, score_grad_output: Variable, explanation_grad_output: Variable
    ) -> Tuple[Variable, None, None, None, None, None, None]:
        """
        Backward pass for the deterministic evaluator function.

        Warning:
            This method is invoked by the autodiff engine
            and should not be called directly.

        Args:
            ctx: Context object containing saved information from the
                [`forward`][..forward] pass.
            score_grad_output: The gradient of the `score` `Variable` w.r.t.
                the output of the `forward()` method.
            explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
                the output of the `forward()` method.

        Returns:
            grad_prediction: The gradient for the `prediction` input `Variable`
                of `forward()`.
            None (None): Placeholder for the `target` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `eval_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `eval_fn_purpose` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `success_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn_purpose` argument of
                `forward()`, which does not require a gradient.

        Raises:
            RuntimeError: If the LM response to generate the gradients cannot be parsed
                as valid JSON after the maximum number of retries.
            ValueError: If the number of gradients returned by the LM does not match the
                expected number (1 in this case).
        """
        raise NotImplementedError(
            "DeterministicEvaluator.backward is implemented on the server. "
            "Client-side execution is not supported."
        )

forward(ctx, prediction, target, eval_fn, eval_fn_purpose, success_fn, reduction_fn, reduction_fn_purpose) staticmethod

Forward pass for the deterministic evaluator function.

Warning

This method is invoked by apply() and should not be called directly.

Parameters:

ctx (required): Context object used to save information for backward computation.
prediction (Variable, required): The predicted variable to evaluate, which can have scalar or list data (supporting both individual and batch processing).
target (str | list[str] | Variable, required): The target (ground truth) to compare against, which can be a string, a list of strings, or a Variable.
eval_fn (Callable[[Variable, Union[str, Variable]], list[Any]], required): A user-defined function that takes a prediction and a target and returns a list of scores for each sample. If target is a Variable, the function should compare the data fields of prediction and target.
eval_fn_purpose (str | Variable, required): A brief description of the purpose of eval_fn, used by the autodiff engine to generate the explanations.
success_fn (Callable[[List[Any]], bool] | None, required): A user-defined function that takes the list of scores returned by eval_fn and returns True if all predictions are considered successful, or False otherwise.
reduction_fn (Callable[[List[Any]], Any] | None, required): An optional function to aggregate scores across a batch of predictions and targets. If None, no aggregation is applied.
reduction_fn_purpose (str | Variable | None, required): A brief description of the purpose of reduction_fn, used by the autodiff engine to generate explanations. Required if reduction_fn is provided.

Returns:

score (Variable): A variable containing the evaluation score(s), or their aggregation if reduction_fn is provided.
explanation (Variable): A variable containing the explanation(s) of the evaluation, or their aggregation if reduction_fn is provided.

Raises:

TypeError: If the types of prediction, target, eval_fn, eval_fn_purpose, success_fn, reduction_fn, or reduction_fn_purpose are not as expected.
ValueError: If the lengths of prediction.data and target (or target.data, when target is a Variable) do not match when both are lists, if eval_fn_purpose (or eval_fn_purpose.data) is an empty string, if reduction_fn_purpose (or reduction_fn_purpose.data) is an empty string, or if the number of scores returned by eval_fn does not match the number of samples in the batch.

Source code in afnio/autodiff/evaluator.py
@staticmethod
def forward(
    ctx,
    prediction: Variable,
    target: Union[str, List[str], Variable],
    eval_fn: Callable[[Variable, Union[str, Variable]], List[Any]],
    eval_fn_purpose: Union[str, Variable],
    success_fn: Optional[Callable[[List[Any]], bool]],
    reduction_fn: Optional[Callable[[List[Any]], Any]],
    reduction_fn_purpose: Optional[Union[str, Variable]],
) -> Tuple[Variable, Variable]:
    """
    Forward pass for the deterministic evaluator function.

    Warning:
        This method is invoked by
        [`apply()`][afnio.autodiff.function.Function.apply]
        and should not be called directly.

    Args:
        ctx: Context object used to save information for [`backward`][..backward]
            computation.
        prediction: The predicted variable to evaluate, which can have scalar or
            list [`data`][afnio.Variable.data] (supporting both individual and
            batch processing).
        target: The target (ground truth) to compare against, which can be a string,
            a list of strings, or a `Variable`.
        eval_fn: A user-defined function that takes a prediction and a target
            and returns a list of scores for each sample. If `target` is a
            [`Variable`][afnio.Variable], the function should compare the
            [`data`][afnio.Variable.data] fields of `prediction` and `target`.
        eval_fn_purpose: A brief description of the purpose of `eval_fn`,
            used by the autodiff engine to generate the explanations.
        success_fn: A user-defined function that takes the list of scores returned
            by `eval_fn` and returns `True` if all predictions are considered
            successful, or `False` otherwise.
        reduction_fn: An optional function to aggregate scores across a batch of
            predictions and targets. If `None`, no aggregation is applied.
        reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
            used by the autodiff engine to generate explanations. Required if
            `reduction_fn` is provided.

    Returns:
        score: A variable containing the evaluation score(s),
            or their aggregation if `reduction_fn` is provided.
        explanation: A variable containing the explanation(s) of the evaluation,
            or their aggregation if `reduction_fn` is provided.

    Raises:
        TypeError: If the types of `prediction`, `target`, `eval_fn`,
            `eval_fn_purpose`, `success_fn`, `reduction_fn`,
            or `reduction_fn_purpose` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or
            `target.data`, when `target` is a `Variable`) do not match when
            both are lists, or if `eval_fn_purpose` (or `eval_fn_purpose.data`)
            is an empty string, or if `reduction_fn_purpose` (or
            `reduction_fn_purpose.data`) is an empty string,
            or if the number of scores returned by `eval_fn`
            does not match the number of samples in the batch.
    """
    raise NotImplementedError(
        "DeterministicEvaluator.forward is implemented on the server. "
        "Client-side execution is not supported."
    )

backward(ctx, score_grad_output, explanation_grad_output) staticmethod

Backward pass for the deterministic evaluator function.

Warning

This method is invoked by the autodiff engine and should not be called directly.

Parameters:

ctx (required): Context object containing saved information from the forward pass.
score_grad_output (Variable, required): The gradient of the score Variable w.r.t. the output of the forward() method.
explanation_grad_output (Variable, required): The gradient of the explanation Variable w.r.t. the output of the forward() method.

Returns:

grad_prediction (Variable): The gradient for the prediction input Variable of forward().
None: Placeholder for the target argument of forward(), which does not require a gradient.
None: Placeholder for the eval_fn argument of forward(), which does not require a gradient.
None: Placeholder for the eval_fn_purpose argument of forward(), which does not require a gradient.
None: Placeholder for the success_fn argument of forward(), which does not require a gradient.
None: Placeholder for the reduction_fn argument of forward(), which does not require a gradient.
None: Placeholder for the reduction_fn_purpose argument of forward(), which does not require a gradient.

Raises:

RuntimeError: If the LM response to generate the gradients cannot be parsed as valid JSON after the maximum number of retries.
ValueError: If the number of gradients returned by the LM does not match the expected number (1 in this case).

Source code in afnio/autodiff/evaluator.py
@staticmethod
def backward(
    ctx, score_grad_output: Variable, explanation_grad_output: Variable
) -> Tuple[Variable, None, None, None, None, None, None]:
    """
    Backward pass for the deterministic evaluator function.

    Warning:
        This method is invoked by the autodiff engine
        and should not be called directly.

    Args:
        ctx: Context object containing saved information from the
            [`forward`][..forward] pass.
        score_grad_output: The gradient of the `score` `Variable` w.r.t.
            the output of the `forward()` method.
        explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
            the output of the `forward()` method.

    Returns:
        grad_prediction: The gradient for the `prediction` input `Variable`
            of `forward()`.
        None (None): Placeholder for the `target` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `eval_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `eval_fn_purpose` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `success_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn_purpose` argument of
            `forward()`, which does not require a gradient.

    Raises:
        RuntimeError: If the LM response to generate the gradients cannot be parsed
            as valid JSON after the maximum number of retries.
        ValueError: If the number of gradients returned by the LM does not match the
            expected number (1 in this case).
    """
    raise NotImplementedError(
        "DeterministicEvaluator.backward is implemented on the server. "
        "Client-side execution is not supported."
    )

afnio.autodiff.evaluator.ExactMatchEvaluator

Bases: Function

Evaluates predictions using exact matching within the afnio framework, supporting automatic differentiation.

This class inherits from Function and requires both the forward and backward methods to be defined.

The ExactMatchEvaluator function computes a score and an explanation by comparing the data fields of a prediction and a target for an exact match. For each sample:

  • A score of 1 is assigned for an exact match.
  • A score of 0 is assigned otherwise.

The prediction is a Variable. The target can be a string, a list of strings, or a Variable. Each Variable passed as an input argument can have either a scalar or a list data field, supporting both individual samples and batch processing. For batch processing, the lengths of prediction and target must match.

If batched inputs are provided, the scores can be aggregated using an optional reduction_fn, such as sum. The purpose of the reduction is described using reduction_fn_purpose. If aggregation is not desired, set reduction_fn and reduction_fn_purpose to None.
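The aggregation step described above can be mimicked in plain Python. This is an illustrative stand-in for the framework's internal behavior, not afnio API; reduce_scores is a hypothetical name:

```python
def reduce_scores(scores, reduction_fn=sum):
    """Apply reduction_fn to a batch of scores, or pass them through unaggregated."""
    # With reduction_fn=None, per-sample scores are returned as-is,
    # matching the "no aggregation" behavior described above.
    return scores if reduction_fn is None else reduction_fn(scores)
```

With the default sum, a batch of 0/1 scores collapses to a count of exact matches; with None, the per-sample list is preserved.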

Examples:

Example with scalar inputs:

>>> prediction = Variable(
...     data="green",
...     role="color prediction",
...     requires_grad=True
... )
>>> target = "red"
>>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
>>> score.data
0
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."

Example with batched inputs:

>>> prediction = Variable(
...     data=["green", "blue"],
...     role="color prediction",
...     requires_grad=True
... )
>>> target = ["red", "blue"]
>>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
>>> score.data
1
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."
Source code in afnio/autodiff/evaluator.py
@evaluator
class ExactMatchEvaluator(Function):
    """
    Evaluates predictions using exact matching within the `afnio` framework,
    supporting automatic differentiation.

    This class inherits from [`Function`][afnio.autodiff.function.Function] and
    requires both the [`forward`][afnio.autodiff.function.Function.forward] and
    [`backward`][afnio.autodiff.function.Function.backward] methods to be defined.

    The `ExactMatchEvaluator` function computes a `score` and an `explanation` by
    comparing the [`data`][afnio.Variable.data] fields of a `prediction`
    and a `target` for an exact match. For each sample:

    - A score of `1` is assigned for an exact match.
    - A score of `0` is assigned otherwise.

    The `prediction` is a [`Variable`][afnio.Variable]. The `target` can be a string,
    a list of strings, or a [`Variable`][afnio.Variable].
    Each [`Variable`][afnio.Variable] passed as an input argument can have either
    a scalar or a list [`data`][afnio.Variable.data] field, supporting both individual
    samples and batch processing. For batch processing, the lengths of `prediction`
    and `target` must match.

    If batched inputs are provided, the scores can be aggregated using an optional
    `reduction_fn`, such as `sum`. The purpose of the reduction is described using
    `reduction_fn_purpose`. If aggregation is not desired, set `reduction_fn` and
    `reduction_fn_purpose` to `None`.

    Examples:
        Example with scalar inputs:
        >>> prediction = Variable(
        ...     data="green",
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = "red"
        >>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
        >>> score.data
        0
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."

        Example with batched inputs:
        >>> prediction = Variable(
        ...     data=["green", "blue"],
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = ["red", "blue"]
        >>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
        >>> score.data
        1
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."
    """  # noqa: E501

    @staticmethod
    def forward(
        ctx,
        prediction: Variable,
        target: Union[str, List[str], Variable],
        reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
        reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
    ) -> Tuple[Variable, Variable]:
        """
        Forward pass for the exact match evaluator function.

        Warning:
            This method is invoked by
            [`apply()`][afnio.autodiff.function.Function.apply]
            and should not be called directly.

        Args:
            ctx: Context object used to save information for [`backward`][..backward]
                computation.
            prediction: The predicted variable to evaluate, which can have scalar or
                list [`data`][afnio.Variable.data] (supporting both individual and
                batch processing).
            target: The target (ground truth) to compare against, which can be a string,
                a list of strings, or a `Variable`.
            reduction_fn: An optional function to aggregate scores across a batch of
                predictions and targets. If `None`, no aggregation is applied.
            reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
                used by the autodiff engine to generate explanations. Required if
                `reduction_fn` is provided.

        Returns:
            score: A variable containing the evaluation score(s),
                or their aggregation if `reduction_fn` is provided.
            explanation: A variable containing the explanation(s) of the evaluation,
                or their aggregation if `reduction_fn` is provided.

        Raises:
            TypeError: If the types of `prediction`, `target`, `reduction_fn`,
                or `reduction_fn_purpose` are not as expected.
            ValueError: If the lengths of `prediction.data` and `target` (or
                `target.data`, when `target` is a `Variable`) do not match when
                both are lists, or if `reduction_fn_purpose` (or
                `reduction_fn_purpose.data`) is an empty string.
        """
        raise NotImplementedError(
            "ExactMatchEvaluator.forward is implemented on the server. "
            "Client-side execution is not supported."
        )

    @staticmethod
    def backward(
        ctx, score_grad_output: Variable, explanation_grad_output: Variable
    ) -> Tuple[Variable, None]:
        """
        Backward pass for the exact match evaluator function.

        Warning:
            This method is invoked by the autodiff engine
            and should not be called directly.

        Args:
            ctx: Context object containing saved information from the
                [`forward`][..forward] pass.
            score_grad_output: The gradient of the `score` `Variable` w.r.t.
                the output of the `forward()` method.
            explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
                the output of the `forward()` method.

        Returns:
            grad_prediction: The gradient for the `prediction` input `Variable`
                of `forward()`.
            None (None): Placeholder for the `target` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn_purpose` argument of
                `forward()`, which does not require a gradient.
        """
        raise NotImplementedError(
            "ExactMatchEvaluator.backward is implemented on the server. "
            "Client-side execution is not supported."
        )

forward(ctx, prediction, target, reduction_fn=sum, reduction_fn_purpose='summation') staticmethod

Forward pass for the exact match evaluator function.

Warning

This method is invoked by apply() and should not be called directly.

Parameters:

- `ctx`: Context object used to save information for backward computation. *Required.*
- `prediction` (`Variable`): The predicted variable to evaluate, which can have scalar or list data (supporting both individual and batch processing). *Required.*
- `target` (`str | list[str] | Variable`): The target (ground truth) to compare against, which can be a string, a list of strings, or a `Variable`. *Required.*
- `reduction_fn` (`Callable[[List[Any]], Any] | None`): An optional function to aggregate scores across a batch of predictions and targets. If `None`, no aggregation is applied. *Default: `sum`.*
- `reduction_fn_purpose` (`str | Variable | None`): A brief description of the purpose of `reduction_fn`, used by the autodiff engine to generate explanations. Required if `reduction_fn` is provided. *Default: `'summation'`.*

Returns:

- `score` (`Variable`): A variable containing the evaluation score(s), or their aggregation if `reduction_fn` is provided.
- `explanation` (`Variable`): A variable containing the explanation(s) of the evaluation, or their aggregation if `reduction_fn` is provided.

Raises:

- `TypeError`: If the types of `prediction`, `target`, `reduction_fn`, or `reduction_fn_purpose` are not as expected.
- `ValueError`: If the lengths of `prediction.data` and `target` (or `target.data`, when `target` is a `Variable`) do not match when both are lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is an empty string.
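The default aggregation is `sum` with purpose `"summation"`, but any plain callable that maps a list of per-sample scores to a single value can stand in for it. As a minimal sketch (the `mean_reduction` helper below is illustrative, not part of afnio), a batch-accuracy style mean could look like:

```python
from typing import Any, List

def mean_reduction(scores: List[Any]) -> float:
    """Aggregate per-sample exact-match scores (0 or 1) into a batch average."""
    return sum(scores) / len(scores) if scores else 0.0

# Per-sample exact-match scores for a batch of three predictions.
batch_scores = [1, 0, 1]
print(mean_reduction(batch_scores))  # 2/3
```

This would then be passed as `reduction_fn=mean_reduction` together with a matching description such as `reduction_fn_purpose="mean"`, in place of the default `sum`/`"summation"` pair.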

Source code in afnio/autodiff/evaluator.py
@staticmethod
def forward(
    ctx,
    prediction: Variable,
    target: Union[str, List[str], Variable],
    reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
    reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
) -> Tuple[Variable, Variable]:
    """
    Forward pass for the exact match evaluator function.

    Warning:
        This method is invoked by
        [`apply()`][afnio.autodiff.function.Function.apply]
        and should not be called directly.

    Args:
        ctx: Context object used to save information for [`backward`][..backward]
            computation.
        prediction: The predicted variable to evaluate, which can have scalar or
            list [`data`][afnio.Variable.data] (supporting both individual and
            batch processing).
        target: The target (ground truth) to compare against, which can be a string,
            a list of strings, or a `Variable`.
        reduction_fn: An optional function to aggregate scores across a batch of
            predictions and targets. If `None`, no aggregation is applied.
        reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
            used by the autodiff engine to generate explanations. Required if
            `reduction_fn` is provided.

    Returns:
        score: A variable containing the evaluation score(s),
            or their aggregation if `reduction_fn` is provided.
        explanation: A variable containing the explanation(s) of the evaluation,
            or their aggregation if `reduction_fn` is provided.

    Raises:
        TypeError: If the types of `prediction`, `target`, `reduction_fn`,
            or `reduction_fn_purpose` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or
            `target.data`, when `target` is a `Variable`) do not match when
            both are lists, or if `reduction_fn_purpose` (or
            `reduction_fn_purpose.data`) is an empty string.
    """
    raise NotImplementedError(
        "ExactMatchEvaluator.forward is implemented on the server. "
        "Client-side execution is not supported."
    )

backward(ctx, score_grad_output, explanation_grad_output) staticmethod

Backward pass for the exact match evaluator function.

Warning

This method is invoked by the autodiff engine and should not be called directly.

Parameters:

- `ctx`: Context object containing saved information from the forward pass. *Required.*
- `score_grad_output` (`Variable`): The gradient of the `score` `Variable` w.r.t. the output of the `forward()` method. *Required.*
- `explanation_grad_output` (`Variable`): The gradient of the `explanation` `Variable` w.r.t. the output of the `forward()` method. *Required.*

Returns:

- `grad_prediction` (`Variable`): The gradient for the `prediction` input `Variable` of `forward()`.
- `None` (`None`): Placeholder for the `target` argument of `forward()`, which does not require a gradient.
- `None` (`None`): Placeholder for the `reduction_fn` argument of `forward()`, which does not require a gradient.
- `None` (`None`): Placeholder for the `reduction_fn_purpose` argument of `forward()`, which does not require a gradient.

Source code in afnio/autodiff/evaluator.py
@staticmethod
def backward(
    ctx, score_grad_output: Variable, explanation_grad_output: Variable
) -> Tuple[Variable, None]:
    """
    Backward pass for the exact match evaluator function.

    Warning:
        This method is invoked by the autodiff engine
        and should not be called directly.

    Args:
        ctx: Context object containing saved information from the
            [`forward`][..forward] pass.
        score_grad_output: The gradient of the `score` `Variable` w.r.t.
            the output of the `forward()` method.
        explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
            the output of the `forward()` method.

    Returns:
        grad_prediction: The gradient for the `prediction` input `Variable`
            of `forward()`.
        None (None): Placeholder for the `target` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn_purpose` argument of
            `forward()`, which does not require a gradient.
    """
    raise NotImplementedError(
        "ExactMatchEvaluator.backward is implemented on the server. "
        "Client-side execution is not supported."
    )

afnio.autodiff.evaluator.LMJudgeEvaluator

Bases: Function

Implements an evaluation of a model prediction using a language model (LM) as the judge within the afnio framework, supporting automatic differentiation.

This class inherits from Function and requires both the forward and backward methods to be defined.

This function returns a score and an explanation, both as Variable objects, by comparing a prediction against a target (when present) using a composite prompt. The prompt is constructed from a list of messages and optional inputs, which can dynamically populate placeholders in the message templates. The evaluation process leverages the specified forward_model_client to perform the LM-based assessment.

The prediction is a Variable. The target can be a string, a list of strings, or a Variable. Similarly, the inputs dictionary can include strings, lists of strings, or Variables. Each Variable passed as an input argument can have either a scalar or a list data field, supporting both individual samples and batch processing. For batch processing, the lengths of prediction, target, and any batched inputs must match.

The success_fn parameter is a user-defined function that returns True when all predictions evaluated by the LM as Judge are considered successful, and False otherwise. If success_fn returns True, the backward pass will skip gradient calculations and directly return an empty gradient, optimizing computational time.
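As a minimal sketch of such a predicate (the `all_correct` helper below is hypothetical, not part of afnio), a `success_fn` that treats a batch as successful only when every judge score is truthy could look like:

```python
from typing import Any, List

def all_correct(scores: List[Any]) -> bool:
    """Return True only when every score from the LM Judge is truthy."""
    return all(bool(s) for s in scores)

print(all_correct([True, True]))   # True
print(all_correct([True, False]))  # False
```

Passing `success_fn=all_correct` would let the backward pass short-circuit and return an empty gradient whenever the whole batch is judged correct.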

If you are processing a batch of predictions and targets, you can use the reduction_fn to aggregate individual scores (e.g., using sum to compute a total score). The reduction_fn_purpose parameter is a brief description of the aggregation's purpose (e.g., "summation"). If you don't want any aggregation, set both reduction_fn and reduction_fn_purpose to None.

The function operates in two modes controlled by eval_mode:

- **eval_mode=True (default)** – Computes gradients for `prediction` only. Use it for direct feedback on predictions.
- **eval_mode=False** – Computes gradients for `messages` and `inputs`. Use it to optimize the evaluator or align with human evaluation datasets.

Additional model parameters, such as temperature, max tokens, or seed values, can be passed through completion_args to customize the LLM's behavior.

Examples:

Example with scalar inputs:

>>> task = Variable(
...     "Evaluate if the translation is accurate.",
...     role="evaluation task",
...     requires_grad=True
... )
>>> format = Variable(
...     "Provide 'score' (true/false) and 'explanation' in JSON.",
...     role="output format"
... )
>>> user = Variable(
...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
...     role="user query"
... )
>>> prediction = Variable(
...     "Hola Mundo",
...     role="translated text",
...     requires_grad=True
... )
>>> target = Variable("Ciao Mondo", role="expected output")
>>> messages = [
...     {"role": "system", "content": [task, format]},
...     {"role": "user", "content": [user]}
... ]
>>> score, explanation = LMJudgeEvaluator.apply(
...     model,
...     messages,
...     prediction,
...     target,
...     temperature=0.5,
... )
>>> score.data
False
>>> explanation.data
'The translated text is in Spanish, but the expected is in Italian.'
>>> explanation.backward()
>>> prediction.grad[0].data
'The translated text should be in Italian.'

Example with batched inputs:

>>> task = Variable(
...     "Evaluate if the translation is accurate.",
...     role="evaluation task",
...     requires_grad=True
... )
>>> format = Variable(
...     "Provide 'score' (true/false) and 'explanation' in JSON.",
...     role="output format"
... )
>>> user = Variable(
...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
...     role="user query"
... )
>>> prediction = Variable(
...     data=["Hola Mundo", "Salve a tutti"],
...     role="translated text",
...     requires_grad=True,
... )
>>> target = ["Ciao Mondo", "Salve a tutti"]
>>> score, explanation = LMJudgeEvaluator.apply(
...     model,
...     messages,
...     prediction,
...     target,
...     reduction_fn=sum,
...     reduction_fn_purpose="summation",
... )
>>> score.data
1
>>> explanation.data
"The evaluation function, designed using an LM as the judge, compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
Source code in afnio/autodiff/evaluator.py
@evaluator
class LMJudgeEvaluator(Function):
    """
    Implements an evaluation of a model prediction using a language model (LM) as the
    judge within the `afnio` framework, supporting automatic differentiation.

    This class inherits from [`Function`][afnio.autodiff.function.Function] and
    requires both the [`forward`][afnio.autodiff.function.Function.forward] and
    [`backward`][afnio.autodiff.function.Function.backward] methods to be defined.

    This function returns a `score` and an `explanation`, both as
    [`Variable`][afnio.Variable] objects, by comparing a `prediction` against a `target`
    (when present) using a composite prompt. The prompt is constructed from a list of
    `messages` and optional `inputs`, which can dynamically populate placeholders in the
    message templates. The evaluation process leverages the specified
    `forward_model_client` to perform the LM-based assessment.

    The `prediction` is a [`Variable`][afnio.Variable]. The `target` can be a string,
    a list of strings, or a [`Variable`][afnio.Variable]. Similarly, the `inputs`
    dictionary can include strings, lists of strings, or [`Variable`][afnio.Variable]s.
    Each [`Variable`][afnio.Variable] passed as an input argument can have either
    a scalar or a list [`data`][afnio.Variable.data] field, supporting both individual
    samples and batch processing. For batch processing, the lengths of `prediction`,
    `target`, and any batched `inputs` must match.

    The `success_fn` parameter is a user-defined function that returns `True` when
    all predictions evaluated by the LM as Judge are considered successful, and `False`
    otherwise. If `success_fn` returns `True`, the `backward` pass will skip gradient
    calculations and directly return an empty gradient, optimizing computational time.

    If you are processing a batch of predictions and targets, you can use the
    `reduction_fn` to aggregate individual scores (e.g., using `sum` to compute a total
    score). The `reduction_fn_purpose` parameter is a brief description of the
    aggregation's purpose (e.g., `"summation"`). If you don't want any aggregation, set
    both `reduction_fn` and `reduction_fn_purpose` to `None`.

    The function operates in two modes controlled by `eval_mode`:

    - **eval_mode=True (default)** – Computes gradients for `prediction` only. Use it
      for direct feedback on predictions.
    - **eval_mode=False** – Computes gradients for `messages` and `inputs`. Use it to
      optimize the evaluator or align with human evaluation datasets.

    Additional model parameters, such as temperature, max tokens, or seed values, can
    be passed through `completion_args` to customize the LLM's behavior.

    Examples:
        Example with scalar inputs:
        >>> task = Variable(
        ...     "Evaluate if the translation is accurate.",
        ...     role="evaluation task",
        ...     requires_grad=True
        ... )
        >>> format = Variable(
        ...     "Provide 'score' (true/false) and 'explanation' in JSON.",
        ...     role="output format"
        ... )
        >>> user = Variable(
        ...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
        ...     role="user query"
        ... )
        >>> prediction = Variable(
        ...     "Hola Mundo",
        ...     role="translated text",
        ...     requires_grad=True
        ... )
        >>> target = Variable("Ciao Mondo", role="expected output")
        >>> messages = [
        ...     {"role": "system", "content": [task, format]},
        ...     {"role": "user", "content": [user]}
        ... ]
        >>> score, explanation = LMJudgeEvaluator.apply(
        ...     model,
        ...     messages,
        ...     prediction,
        ...     target,
        ...     temperature=0.5,
        ... )
        >>> score.data
        False
        >>> explanation.data
        'The translated text is in Spanish, but the expected is in Italian.'
        >>> explanation.backward()
        >>> prediction.grad[0].data
        'The translated text should be in Italian.'

        Example with batched inputs:
        >>> task = Variable(
        ...     "Evaluate if the translation is accurate.",
        ...     role="evaluation task",
        ...     requires_grad=True
        ... )
        >>> format = Variable(
        ...     "Provide 'score' (true/false) and 'explanation' in JSON.",
        ...     role="output format"
        ... )
        >>> user = Variable(
        ...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
        ...     role="user query"
        ... )
        >>> prediction = Variable(
        ...     data=["Hola Mundo", "Salve a tutti"],
        ...     role="translated text",
        ...     requires_grad=True,
        ... )
        >>> target = ["Ciao Mondo", "Salve a tutti"]
        >>> score, explanation = LMJudgeEvaluator.apply(
        ...     model,
        ...     messages,
        ...     prediction,
        ...     target,
        ...     reduction_fn=sum,
        ...     reduction_fn_purpose="summation",
        ... )
        >>> score.data
        1
        >>> explanation.data
        "The evaluation function, designed using an LM as the judge, compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
    """  # noqa: E501

    @staticmethod
    def forward(
        ctx,
        forward_model_client: Optional[ChatCompletionModel],
        messages: MultiTurnMessages,
        prediction: Variable,
        target: Optional[Union[str, List[str], Variable]] = None,
        inputs: Optional[Dict[str, Union[str, Variable]]] = None,
        success_fn: Optional[Callable[[List[Any]], bool]] = None,
        reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
        reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
        eval_mode: Union[bool, Variable] = True,
        **completion_args,
    ) -> Tuple[Variable, Variable]:
        """
        Forward pass for the LM Judge evaluator function.

        Warning:
            This method is invoked by
            [`apply()`][afnio.autodiff.function.Function.apply]
            and should not be called directly.

        Args:
            ctx: Context object used to save information for [`backward`][..backward]
                computation.
            forward_model_client: The LM model client used for the forward
                pass evaluation.
            messages: A list of messages that compose the prompt/context for the LM.
                Each message is a dictionary with a `"role"` (e.g., `"system"`,
                `"user"`, `"assistant"`) and a `"content"` field, which is a list of
                `Variable` objects. The `Variable` objects in the `"content"` can
                contain placeholders (e.g., `{prediction}`, `{target}`) that will be
                populated with the corresponding values from the `inputs` dictionary.
            prediction: The predicted variable to evaluate, which can have scalar or
                list [`data`][afnio.Variable.data] (supporting both individual and
                batch processing).
            target: The target (ground truth) to compare against, which can be a string,
                a list of strings, or a `Variable`. Optional if the evaluation does not
                require a target and only relies on the correctness of the LM Judge's
                assessment of the `prediction`.
            inputs: A dictionary mapping placeholder names to their corresponding
                values, which can be strings or `Variable` instances. These values
                will be used to populate the placeholders in the `messages` content
                before sending the prompt to the LM. For example, if a message
                `"content"` field contains the placeholder `{color}`, the `inputs`
                dictionary should have a key `"color"` with the value to substitute
                in the prompt. Optional if there are no placeholders in the messages or
                if all placeholders are directly related to `prediction` and `target`.
            success_fn: A user-defined function that takes the list of scores returned
                by the LM Judge and returns `True` if all predictions are considered
                successful, or `False` otherwise.
            reduction_fn: An optional function to aggregate scores across a batch of
                predictions and targets. If `None`, no aggregation is applied.
            reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
                used by the autodiff engine to generate explanations. Required if
                `reduction_fn` is provided.
            eval_mode: Indicates the evaluation mode. If `True`, the `backward` pass
                will compute gradients for the `prediction` variable only. If `False`,
                the `backward` pass will compute gradients for the `messages` and
                `inputs`, allowing optimization of the evaluator itself or alignment
                with human evaluation datasets.
            **completion_args: Additional keyword arguments to pass to the LM model
                client's `chat` method, such as temperature, max tokens, or seed values,
                to customize the LLM's behavior during the evaluation.

        Returns:
            score: A variable containing the evaluation score(s),
                or their aggregation if `reduction_fn` is provided.
            explanation: A variable containing the explanation(s) of the evaluation,
                or their aggregation if `reduction_fn` is provided.

        Raises:
            RuntimeError: If the LM response to generate the evaluation `score` and
                `explanation` cannot be parsed as valid JSON.
            TypeError: If the types of `forward_model_client`, `messages`, `prediction`,
                `target`, `inputs`, `success_fn`, `reduction_fn`,
                `reduction_fn_purpose`, or `eval_mode` are not as expected.
            ValueError: If the lengths of `prediction.data` and `target` (or
                `target.data`, when `target` is a `Variable`) do not match when both are
                lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is
                an empty string, or if `inputs` contains keys that conflict with
                `prediction` or `target`.
        """
        raise NotImplementedError(
            "LMJudgeEvaluator.forward is implemented on the server. "
            "Client-side execution is not supported."
        )

    @staticmethod
    def backward(
        ctx, score_grad_output: Variable, explanation_grad_output: Variable
    ) -> Tuple[Optional[Variable], ...]:
        """
        Backward pass for the LM Judge evaluator function.

        Warning:
            This method is invoked by the autodiff engine
            and should not be called directly.

        Args:
            ctx: Context object containing saved information from the
                [`forward`][..forward] pass.
            score_grad_output: The gradient of the `score` `Variable` w.r.t.
                the output of the `forward()` method.
            explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
                the output of the `forward()` method.

        Returns:
            None (None): Placeholder for the `forward_model_client` argument of
                `forward()`, which does not require a gradient.
            grad_messages (Tuple[Optional[Variable], ...]): A tuple of gradients for the
                `messages` argument of `forward()`, where each gradient corresponds to
                the respective message variable. If `eval_mode` is `True`, this will be
                a tuple of `None` values, indicating that no gradients are computed for
                the messages in this mode.
            grad_prediction (Optional[Variable]): The gradient for the `prediction`
                argument `Variable` of `forward()`, or `None` if `eval_mode` is `False`.
            None (None): Placeholder for the `target` argument of `forward()`,
                which does not require a gradient.
            grad_inputs (Tuple[Optional[Variable], ...]): A tuple of gradients for the
                `inputs` argument of `forward()`, where each gradient corresponds to the
                respective input variable. If `eval_mode` is `True`, this will be a
                tuple of `None` values, indicating that no gradients are computed for
                the inputs in this mode.
            None (None): Placeholder for the `success_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn_purpose` argument of
                `forward()`, which does not require a gradient.
            None (None): Placeholder for the `eval_mode` argument of `forward()`,
                which does not require a gradient.
            None (Tuple[None, ...]): Placeholder for any additional completion arguments
                passed to `forward()`, which do not require gradients.

        Raises:
            RuntimeError: If the LM response to generate the gradients cannot be parsed
                as valid JSON after the maximum number of retries.
            ValueError: If the number of gradients returned by the LM does not match the
                expected number based on the evaluation mode.
        """
        raise NotImplementedError(
            "LMJudgeEvaluator.backward is implemented on the server. "
            "Client-side execution is not supported."
        )

forward(ctx, forward_model_client, messages, prediction, target=None, inputs=None, success_fn=None, reduction_fn=sum, reduction_fn_purpose='summation', eval_mode=True, **completion_args) staticmethod

Forward pass for the LM Judge evaluator function.

Warning

This method is invoked by apply() and should not be called directly.

Parameters:

- `ctx`: Context object used to save information for backward computation. *Required.*
- `forward_model_client` (`ChatCompletionModel | None`): The LM model client used for the forward pass evaluation. *Required.*
- `messages` (`MultiTurnMessages`): A list of messages that compose the prompt/context for the LM. Each message is a dictionary with a `"role"` (e.g., `"system"`, `"user"`, `"assistant"`) and a `"content"` field, which is a list of `Variable` objects. The `Variable` objects in the `"content"` can contain placeholders (e.g., `{prediction}`, `{target}`) that will be populated with the corresponding values from the `inputs` dictionary. *Required.*
- `prediction` (`Variable`): The predicted variable to evaluate, which can have scalar or list data (supporting both individual and batch processing). *Required.*
- `target` (`str | list[str] | Variable | None`): The target (ground truth) to compare against, which can be a string, a list of strings, or a `Variable`. Optional if the evaluation does not require a target and only relies on the correctness of the LM Judge's assessment of the `prediction`. *Default: `None`.*
- `inputs` (`dict[str, str | Variable] | None`): A dictionary mapping placeholder names to their corresponding values, which can be strings or `Variable` instances. These values will be used to populate the placeholders in the `messages` content before sending the prompt to the LM. For example, if a message `"content"` field contains the placeholder `{color}`, the `inputs` dictionary should have a key `"color"` with the value to substitute in the prompt. Optional if there are no placeholders in the messages or if all placeholders are directly related to `prediction` and `target`. *Default: `None`.*
- `success_fn` (`Callable[[List[Any]], bool] | None`): A user-defined function that takes the list of scores returned by the LM Judge and returns `True` if all predictions are considered successful, or `False` otherwise. *Default: `None`.*
- `reduction_fn` (`Callable[[List[Any]], Any] | None`): An optional function to aggregate scores across a batch of predictions and targets. If `None`, no aggregation is applied. *Default: `sum`.*
- `reduction_fn_purpose` (`str | Variable | None`): A brief description of the purpose of `reduction_fn`, used by the autodiff engine to generate explanations. Required if `reduction_fn` is provided. *Default: `'summation'`.*
- `eval_mode` (`bool | Variable`): Indicates the evaluation mode. If `True`, the backward pass will compute gradients for the `prediction` variable only. If `False`, the backward pass will compute gradients for the `messages` and `inputs`, allowing optimization of the evaluator itself or alignment with human evaluation datasets. *Default: `True`.*
- `**completion_args`: Additional keyword arguments to pass to the LM model client's `chat` method, such as temperature, max tokens, or seed values, to customize the LLM's behavior during the evaluation. *Default: `{}`.*

Returns:

- `score` (`Variable`): A variable containing the evaluation score(s), or their aggregation if `reduction_fn` is provided.
- `explanation` (`Variable`): A variable containing the explanation(s) of the evaluation, or their aggregation if `reduction_fn` is provided.

Raises:

- `RuntimeError`: If the LM response to generate the evaluation `score` and `explanation` cannot be parsed as valid JSON.
- `TypeError`: If the types of `forward_model_client`, `messages`, `prediction`, `target`, `inputs`, `success_fn`, `reduction_fn`, `reduction_fn_purpose`, or `eval_mode` are not as expected.
- `ValueError`: If the lengths of `prediction.data` and `target` (or `target.data`, when `target` is a `Variable`) do not match when both are lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is an empty string, or if `inputs` contains keys that conflict with `prediction` or `target`.
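The placeholder substitution described for `messages` and `inputs` happens server-side, but conceptually it behaves like Python's `str.format` (an assumption made here for illustration only, using the template from the examples above):

```python
# Conceptual sketch of placeholder population. The real substitution is
# performed on the server; str.format is used here only as an illustration.
template = "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>"
filled = template.format(prediction="Hola Mundo", target="Ciao Mondo")
print(filled)  # <PREDICTION>Hola Mundo</PREDICTION><TARGET>Ciao Mondo</TARGET>
```

Keys in `inputs` play the role of the keyword arguments here, which is why names that collide with `prediction` or `target` raise a `ValueError`.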

Source code in afnio/autodiff/evaluator.py
@staticmethod
def forward(
    ctx,
    forward_model_client: Optional[ChatCompletionModel],
    messages: MultiTurnMessages,
    prediction: Variable,
    target: Optional[Union[str, List[str], Variable]] = None,
    inputs: Optional[Dict[str, Union[str, Variable]]] = None,
    success_fn: Optional[Callable[[List[Any]], bool]] = None,
    reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
    reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
    eval_mode: Union[bool, Variable] = True,
    **completion_args,
) -> Tuple[Variable, Variable]:
    """
    Forward pass for the LM Judge evaluator function.

    Warning:
        This method is invoked by
        [`apply()`][afnio.autodiff.function.Function.apply]
        and should not be called directly.

    Args:
        ctx: Context object used to save information for [`backward`][..backward]
            computation.
        forward_model_client: The LM model client used for the forward
            pass evaluation.
        messages: A list of messages that compose the prompt/context for the LM.
            Each message is a dictionary with a `"role"` (e.g., `"system"`,
            `"user"`, `"assistant"`) and a `"content"` field, which is a list of
            `Variable` objects. The `Variable` objects in the `"content"` can
            contain placeholders (e.g., `{prediction}`, `{target}`) that will be
            populated with the corresponding values from the `inputs` dictionary.
        prediction: The predicted variable to evaluate, which can have scalar or
            list [`data`][afnio.Variable.data] (supporting both individual and
            batch processing).
        target: The target (ground truth) to compare against, which can be a string,
            a list of strings, or a `Variable`. Optional if the evaluation does not
            require a target and only relies on the correctness of the LM Judge's
            assessment of the `prediction`.
        inputs: A dictionary mapping placeholder names to their corresponding
            values, which can be strings or `Variable` instances. These values
            will be used to populate the placeholders in the `messages` content
            before sending the prompt to the LM. For example, if a message
            `"content"` field contains the placeholder `{color}`, the `inputs`
            dictionary should have a key `"color"` with the value to substitute
            in the prompt. Optional if there are no placeholders in the messages or
            if all placeholders are directly related to `prediction` and `target`.
        success_fn: A user-defined function that takes the list of scores returned
            by the LM Judge and returns `True` if all predictions are considered
            successful, or `False` otherwise.
        reduction_fn: An optional function to aggregate scores across a batch of
            predictions and targets. If `None`, no aggregation is applied.
        reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
            used by the autodiff engine to generate explanations. Required if
            `reduction_fn` is provided.
        eval_mode: Indicates the evaluation mode. If `True`, the `backward` pass
            will compute gradients for the `prediction` variable only. If `False`,
            the `backward` pass will compute gradients for the `messages` and
            `inputs`, allowing optimization of the evaluator itself or alignment
            with human evaluation datasets.
        **completion_args: Additional keyword arguments to pass to the LM model
            client's `chat` method, such as temperature, max tokens, or seed values,
            to customize the LLM's behavior during the evaluation.

    Returns:
        score: A variable containing the evaluation score(s),
            or their aggregation if `reduction_fn` is provided.
        explanation: A variable containing the explanation(s) of the evaluation,
            or their aggregation if `reduction_fn` is provided.

    Raises:
        RuntimeError: If the LM response to generate the evaluation `score` and
            `explanation` cannot be parsed as valid JSON.
        TypeError: If the types of `forward_model_client`, `messages`, `prediction`,
            `target`, `inputs`, `success_fn`, `reduction_fn`,
            `reduction_fn_purpose`, or `eval_mode` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or
            `target.data`, when `target` is a `Variable`) do not match when both are
            lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is
            an empty string, or if `inputs` contains keys that conflict with
            `prediction` or `target`.
    """
    raise NotImplementedError(
        "LMJudgeEvaluator.forward is implemented on the server. "
        "Client-side execution is not supported."
    )
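The placeholder substitution that forward() performs on messages can be pictured as simple template formatting. A minimal sketch, assuming plain string contents for brevity (afnio's actual messages hold Variable objects, and the real substitution happens server-side):

```python
from typing import Dict, List


def render_messages(
    messages: List[Dict[str, str]], inputs: Dict[str, str]
) -> List[Dict[str, str]]:
    # Fill each {placeholder} in a message's content with the
    # matching value from the `inputs` dictionary.
    return [
        {"role": m["role"], "content": m["content"].format(**inputs)}
        for m in messages
    ]


messages = [
    {"role": "system", "content": "You judge color predictions."},
    {"role": "user", "content": "Does {prediction} match the target {target}?"},
]
rendered = render_messages(messages, {"prediction": "green", "target": "red"})
print(rendered[1]["content"])  # Does green match the target red?
```

This is why inputs may not reuse the keys "prediction" or "target": those placeholders are populated from the prediction and target arguments themselves, and conflicting keys raise ValueError.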

backward(ctx, score_grad_output, explanation_grad_output) staticmethod

Backward pass for the LM Judge evaluator function.

Warning

This method is invoked by the autodiff engine and should not be called directly.

Parameters:

ctx (required):
Context object containing saved information from the forward pass.

score_grad_output (Variable, required):
The gradient of the score Variable w.r.t. the output of the forward() method.

explanation_grad_output (Variable, required):
The gradient of the explanation Variable w.r.t. the output of the forward() method.

Returns:

None (None):
Placeholder for the forward_model_client argument of forward(), which does not require a gradient.

grad_messages (tuple[Variable | None, ...]):
A tuple of gradients for the messages argument of forward(), where each gradient corresponds to the respective message variable. If eval_mode is True, this is a tuple of None values, indicating that no gradients are computed for the messages in this mode.

grad_prediction (Variable | None):
The gradient for the prediction argument Variable of forward(), or None if eval_mode is False.

None (None):
Placeholder for the target argument of forward(), which does not require a gradient.

grad_inputs (tuple[Variable | None, ...]):
A tuple of gradients for the inputs argument of forward(), where each gradient corresponds to the respective input variable. If eval_mode is True, this is a tuple of None values, indicating that no gradients are computed for the inputs in this mode.

None (None):
Placeholder for the success_fn argument of forward(), which does not require a gradient.

None (None):
Placeholder for the reduction_fn argument of forward(), which does not require a gradient.

None (None):
Placeholder for the reduction_fn_purpose argument of forward(), which does not require a gradient.

None (None):
Placeholder for the eval_mode argument of forward(), which does not require a gradient.

None (tuple[None, ...]):
Placeholder for any additional completion arguments passed to forward(), which do not require gradients.
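Because backward() returns one slot per forward() argument, the tuple's shape depends only on the number of messages and inputs and on eval_mode. A schematic sketch of that positional convention (illustrative only; any object stands in for a gradient Variable, the real gradients are computed server-side, and the trailing completion_args slots are omitted for brevity):

```python
from typing import Any, Optional, Tuple


def gradient_slots(
    n_messages: int, n_inputs: int, eval_mode: bool, grad: Any
) -> Tuple[Optional[Any], ...]:
    # Slot order mirrors forward()'s arguments:
    # forward_model_client, *messages, prediction, target, *inputs,
    # success_fn, reduction_fn, reduction_fn_purpose, eval_mode.
    if eval_mode:
        # Gradients flow only to the prediction.
        msg_grads: Tuple[Optional[Any], ...] = (None,) * n_messages
        pred_grad: Optional[Any] = grad
        input_grads: Tuple[Optional[Any], ...] = (None,) * n_inputs
    else:
        # Gradients flow to the messages and inputs instead.
        msg_grads = (grad,) * n_messages
        pred_grad = None
        input_grads = (grad,) * n_inputs
    return (None, *msg_grads, pred_grad, None, *input_grads,
            None, None, None, None)


slots = gradient_slots(n_messages=2, n_inputs=1, eval_mode=True, grad="dL/dpred")
print(slots[3])  # dL/dpred  (the prediction slot)
```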

Raises:

RuntimeError:
If the LM response to generate the gradients cannot be parsed as valid JSON after the maximum number of retries.

ValueError:
If the number of gradients returned by the LM does not match the expected number based on the evaluation mode.

Source code in afnio/autodiff/evaluator.py
@staticmethod
def backward(
    ctx, score_grad_output: Variable, explanation_grad_output: Variable
) -> Tuple[Optional[Variable], ...]:
    """
    Backward pass for the LM Judge evaluator function.

    Warning:
        This method is invoked by the autodiff engine
        and should not be called directly.

    Args:
        ctx: Context object containing saved information from the
            [`forward`][..forward] pass.
        score_grad_output: The gradient of the `score` `Variable` w.r.t.
            the output of the `forward()` method.
        explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
            the output of the `forward()` method.

    Returns:
        None (None): Placeholder for the `forward_model_client` argument of
            `forward()`, which does not require a gradient.
        grad_messages (Tuple[Optional[Variable], ...]): A tuple of gradients for the
            `messages` argument of `forward()`, where each gradient corresponds to
            the respective message variable. If `eval_mode` is `True`, this will be
            a tuple of `None` values, indicating that no gradients are computed for
            the messages in this mode.
        grad_prediction (Optional[Variable]): The gradient for the `prediction`
            argument `Variable` of `forward()`, or `None` if `eval_mode` is `False`.
        None (None): Placeholder for the `target` argument of `forward()`,
            which does not require a gradient.
        grad_inputs (Tuple[Optional[Variable], ...]): A tuple of gradients for the
            `inputs` argument of `forward()`, where each gradient corresponds to the
            respective input variable. If `eval_mode` is `True`, this will be a
            tuple of `None` values, indicating that no gradients are computed for
            the inputs in this mode.
        None (None): Placeholder for the `success_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn_purpose` argument of
            `forward()`, which does not require a gradient.
        None (None): Placeholder for the `eval_mode` argument of `forward()`,
            which does not require a gradient.
        None (Tuple[None, ...]): Placeholder for any additional completion arguments
            passed to `forward()`, which do not require gradients.

    Raises:
        RuntimeError: If the LM response to generate the gradients cannot be parsed
            as valid JSON after the maximum number of retries.
        ValueError: If the number of gradients returned by the LM does not match the
            expected number based on the evaluation mode.
    """
    raise NotImplementedError(
        "LMJudgeEvaluator.backward is implemented on the server. "
        "Client-side execution is not supported."
    )