afnio.autodiff.evaluator

afnio.autodiff.evaluator.DeterministicEvaluator

Bases: Function

Evaluates predictions deterministically using a user-defined evaluation function within the afnio framework, supporting automatic differentiation.

This class inherits from Function and requires both the forward and backward methods to be defined.

The DeterministicEvaluator function computes a score and an explanation based on the prediction and target inputs using a user-defined evaluation function (eval_fn). The evaluation function's purpose is described by eval_fn_purpose. Outputs include a numerical or textual score and a textual explanation, both wrapped as Variable objects.

The prediction is a Variable. The target can be a string, a list of strings, or a Variable. Each Variable passed as an input argument can have either a scalar or a list data field, supporting both individual samples and batch processing. For batch processing, the lengths of prediction and target must match.
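The batch-handling contract above can be sketched with a plain Python evaluation function. This is an illustrative helper, not part of the afnio API; it operates on raw data values rather than Variable objects:

```python
def exact_match_batch(prediction_data, target_data):
    """Score each prediction against its target with exact matching."""
    # Normalize scalar inputs to one-element lists so a single code path
    # handles both individual samples and batches.
    preds = prediction_data if isinstance(prediction_data, list) else [prediction_data]
    targets = target_data if isinstance(target_data, list) else [target_data]
    # Mirror the framework's requirement that batch lengths must match.
    if len(preds) != len(targets):
        raise ValueError("prediction and target batches must have equal length")
    return [1 if p == t else 0 for p, t in zip(preds, targets)]
```

Passing a scalar yields a one-element score list, while passing equal-length lists yields one score per sample.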

The success_fn parameter is a user-defined function that returns True when all predictions evaluated by eval_fn are considered successful, and False otherwise. If success_fn returns True, the backward pass skips gradient calculations and returns an empty gradient, saving computation time.
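A success_fn is just a callable over the list of per-sample scores. As a minimal sketch, assuming 0/1 exact-match scores (the function name is illustrative):

```python
def all_correct(scores):
    """Return True only when every per-sample score indicates success (1)."""
    return all(s == 1 for s in scores)
```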

The reduction_fn parameter specifies the aggregation function to use for scores across a batch of predictions and targets. When specified, the reduction function's purpose is described using reduction_fn_purpose. If aggregation is not desired, set reduction_fn and reduction_fn_purpose to None.
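Any callable that maps a list of scores to a single value can serve as a reduction_fn. Besides the built-in sum, a mean reduction might look like this (an illustrative sketch, with "mean" as its reduction_fn_purpose):

```python
def mean_reduction(scores):
    """Aggregate per-sample scores into a single batch-level average."""
    return sum(scores) / len(scores)
```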

Examples:

Example with scalar inputs:

>>> prediction = Variable(
...     data="green",
...     role="color prediction",
...     requires_grad=True
... )
>>> target = "red"
>>> def exact_match_fn(p: str, t: str) -> int:
...     return 1 if p == t else 0
>>> score, explanation = DeterministicEvaluator.apply(
...     prediction,
...     target,
...     exact_match_fn,
...     "exact match",
... )
>>> score.data
0
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."

Example with batched inputs:

>>> prediction = Variable(
...     data=["green", "blue"],
...     role="color prediction",
...     requires_grad=True
... )
>>> target = ["red", "blue"]
>>> def exact_match_fn(p: str, t: str) -> int:
...     return 1 if p == t else 0
>>> score, explanation = DeterministicEvaluator.apply(
...     prediction,
...     target,
...     exact_match_fn,
...     "exact match",
...     reduction_fn=sum,
...     reduction_fn_purpose="summation"
... )
>>> score.data
1
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."
Source code in afnio/autodiff/evaluator.py
@evaluator
class DeterministicEvaluator(Function):
    """
    Evaluates predictions deterministically using a user-defined evaluation function
    within the `afnio` framework, supporting automatic differentiation.

    This class inherits from [`Function`][afnio.autodiff.function.Function] and
    requires both the [`forward`][afnio.autodiff.function.Function.forward] and
    [`backward`][afnio.autodiff.function.Function.backward] methods to be defined.

    The `DeterministicEvaluator` function computes a `score` and an `explanation` based
    on the `prediction` and `target` inputs using a user-defined evaluation function
    (`eval_fn`). The evaluation function's purpose is described by `eval_fn_purpose`.
    Outputs include a numerical or textual score and a textual explanation, both wrapped
    as [`Variable`][afnio.Variable] objects.

    The `prediction` is a [`Variable`][afnio.Variable]. The `target` can be a string,
    a list of strings, or a [`Variable`][afnio.Variable].
    Each [`Variable`][afnio.Variable] passed as an input argument can have either
    a scalar or a list [`data`][afnio.Variable.data] field, supporting both individual
    samples and batch processing. For batch processing, the lengths of `prediction`
    and `target` must match.

    The `success_fn` parameter is a user-defined function that returns `True` when
    all predictions evaluated by `eval_fn` are considered successful, and `False`
    otherwise. If `success_fn` returns `True`, the `backward` pass skips gradient
    calculations and returns an empty gradient, saving computation time.

    The `reduction_fn` parameter specifies the aggregation function to use for scores
    across a batch of predictions and targets. When specified, the reduction function's
    purpose is described using `reduction_fn_purpose`. If aggregation is not desired,
    set `reduction_fn` and `reduction_fn_purpose` to `None`.

    Examples:
        Example with scalar inputs:
        >>> prediction = Variable(
        ...     data="green",
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = "red"
        >>> def exact_match_fn(p: str, t: str) -> int:
        ...     return 1 if p == t else 0
        >>> score, explanation = DeterministicEvaluator.apply(
        ...     prediction,
        ...     target,
        ...     exact_match_fn,
        ...     "exact match",
        ... )
        >>> score.data
        0
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."

        Example with batched inputs:
        >>> prediction = Variable(
        ...     data=["green", "blue"],
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = ["red", "blue"]
        >>> def exact_match_fn(p: str, t: str) -> int:
        ...     return 1 if p == t else 0
        >>> score, explanation = DeterministicEvaluator.apply(
        ...     prediction,
        ...     target,
        ...     exact_match_fn,
        ...     "exact match",
        ...     reduction_fn=sum,
        ...     reduction_fn_purpose="summation"
        ... )
        >>> score.data
        1
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."
    """  # noqa: E501

    @staticmethod
    def forward(
        ctx,
        prediction: Variable,
        target: Union[str, List[str], Variable],
        eval_fn: Callable[[Variable, Union[str, Variable]], List[Any]],
        eval_fn_purpose: Union[str, Variable],
        success_fn: Optional[Callable[[List[Any]], bool]],
        reduction_fn: Optional[Callable[[List[Any]], Any]],
        reduction_fn_purpose: Optional[Union[str, Variable]],
    ) -> Tuple[Variable, Variable]:
        """
        Forward pass for the deterministic evaluator function.

        Warning:
            This method is invoked by
            [`apply()`][afnio.autodiff.function.Function.apply]
            and should not be called directly.

        Args:
            ctx: Context object used to save information for [`backward`][..backward]
                computation.
            prediction: The predicted variable to evaluate, which can have scalar or
                list [`data`][afnio.Variable.data] (supporting both individual and
                batch processing).
            target: The target (ground truth) to compare against, which can be a string,
                a list of strings, or a `Variable`.
            eval_fn: A user-defined function that takes a prediction and a target
                and returns a list of scores for each sample. If `target` is a
                [`Variable`][afnio.Variable], the function should compare the
                [`data`][afnio.Variable.data] fields of `prediction` and `target`.
            eval_fn_purpose: A brief description of the purpose of `eval_fn`,
                used by the autodiff engine to generate the explanations.
            success_fn: A user-defined function that takes the list of scores returned
                by `eval_fn` and returns `True` if all predictions are considered
                successful, or `False` otherwise.
            reduction_fn: An optional function to aggregate scores across a batch of
                predictions and targets. If `None`, no aggregation is applied.
            reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
                used by the autodiff engine to generate explanations. Required if
                `reduction_fn` is provided.

        Returns:
            score: A variable containing the evaluation score(s),
                or their aggregation if `reduction_fn` is provided.
            explanation: A variable containing the explanation(s) of the evaluation,
                or their aggregation if `reduction_fn` is provided.

        Raises:
            TypeError: If the types of `prediction`, `target`, `eval_fn`,
                `eval_fn_purpose`, `success_fn`, `reduction_fn`,
                or `reduction_fn_purpose` are not as expected.
            ValueError: If the lengths of `prediction.data` and `target` (or
                `target.data`, when `target` is a `Variable`) do not match when
                both are lists, or if `eval_fn_purpose` (or `eval_fn_purpose.data`)
                is an empty string, or if `reduction_fn_purpose` (or
                `reduction_fn_purpose.data`) is an empty string,
                or if the number of scores returned by `eval_fn`
                does not match the number of samples in the batch.
        """
        raise NotImplementedError(
            "DeterministicEvaluator.forward is implemented on the server. "
            "Client-side execution is not supported."
        )

    @staticmethod
    def backward(
        ctx, score_grad_output: Variable, explanation_grad_output: Variable
    ) -> Tuple[Variable, None, None, None, None, None, None]:
        """
        Backward pass for the deterministic evaluator function.

        Warning:
            This method is invoked by the autodiff engine
            and should not be called directly.

        Args:
            ctx: Context object containing saved information from the
                [`forward`][..forward] pass.
            score_grad_output: The gradient of the `score` `Variable` w.r.t.
                the output of the `forward()` method.
            explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
                the output of the `forward()` method.

        Returns:
            grad_prediction: The gradient for the `prediction` input `Variable`
                of `forward()`.
            None (None): Placeholder for the `target` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `eval_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `eval_fn_purpose` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `success_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn_purpose` argument of
                `forward()`, which does not require a gradient.

        Raises:
            RuntimeError: If the LM response to generate the gradients cannot be parsed
                as valid JSON after the maximum number of retries.
            ValueError: If the number of gradients returned by the LM does not match the
                expected number (1 in this case).
        """
        raise NotImplementedError(
            "DeterministicEvaluator.backward is implemented on the server. "
            "Client-side execution is not supported."
        )

forward(ctx, prediction, target, eval_fn, eval_fn_purpose, success_fn, reduction_fn, reduction_fn_purpose) staticmethod

Forward pass for the deterministic evaluator function.

Warning

This method is invoked by apply() and should not be called directly.

Parameters:

ctx (required): Context object used to save information for backward computation.
prediction (Variable, required): The predicted variable to evaluate, which can have scalar or list data (supporting both individual and batch processing).
target (str | list[str] | Variable, required): The target (ground truth) to compare against, which can be a string, a list of strings, or a Variable.
eval_fn (Callable[[Variable, Union[str, Variable]], list[Any]], required): A user-defined function that takes a prediction and a target and returns a list of scores for each sample. If target is a Variable, the function should compare the data fields of prediction and target.
eval_fn_purpose (str | Variable, required): A brief description of the purpose of eval_fn, used by the autodiff engine to generate the explanations.
success_fn (Callable[[List[Any]], bool] | None, required): A user-defined function that takes the list of scores returned by eval_fn and returns True if all predictions are considered successful, or False otherwise.
reduction_fn (Callable[[List[Any]], Any] | None, required): An optional function to aggregate scores across a batch of predictions and targets. If None, no aggregation is applied.
reduction_fn_purpose (str | Variable | None, required): A brief description of the purpose of reduction_fn, used by the autodiff engine to generate explanations. Required if reduction_fn is provided.

Returns:

score (Variable): A variable containing the evaluation score(s), or their aggregation if reduction_fn is provided.
explanation (Variable): A variable containing the explanation(s) of the evaluation, or their aggregation if reduction_fn is provided.

Raises:

TypeError: If the types of prediction, target, eval_fn, eval_fn_purpose, success_fn, reduction_fn, or reduction_fn_purpose are not as expected.
ValueError: If the lengths of prediction.data and target (or target.data, when target is a Variable) do not match when both are lists, if eval_fn_purpose (or eval_fn_purpose.data) is an empty string, if reduction_fn_purpose (or reduction_fn_purpose.data) is an empty string, or if the number of scores returned by eval_fn does not match the number of samples in the batch.

Source code in afnio/autodiff/evaluator.py
@staticmethod
def forward(
    ctx,
    prediction: Variable,
    target: Union[str, List[str], Variable],
    eval_fn: Callable[[Variable, Union[str, Variable]], List[Any]],
    eval_fn_purpose: Union[str, Variable],
    success_fn: Optional[Callable[[List[Any]], bool]],
    reduction_fn: Optional[Callable[[List[Any]], Any]],
    reduction_fn_purpose: Optional[Union[str, Variable]],
) -> Tuple[Variable, Variable]:
    """
    Forward pass for the deterministic evaluator function.

    Warning:
        This method is invoked by
        [`apply()`][afnio.autodiff.function.Function.apply]
        and should not be called directly.

    Args:
        ctx: Context object used to save information for [`backward`][..backward]
            computation.
        prediction: The predicted variable to evaluate, which can have scalar or
            list [`data`][afnio.Variable.data] (supporting both individual and
            batch processing).
        target: The target (ground truth) to compare against, which can be a string,
            a list of strings, or a `Variable`.
        eval_fn: A user-defined function that takes a prediction and a target
            and returns a list of scores for each sample. If `target` is a
            [`Variable`][afnio.Variable], the function should compare the
            [`data`][afnio.Variable.data] fields of `prediction` and `target`.
        eval_fn_purpose: A brief description of the purpose of `eval_fn`,
            used by the autodiff engine to generate the explanations.
        success_fn: A user-defined function that takes the list of scores returned
            by `eval_fn` and returns `True` if all predictions are considered
            successful, or `False` otherwise.
        reduction_fn: An optional function to aggregate scores across a batch of
            predictions and targets. If `None`, no aggregation is applied.
        reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
            used by the autodiff engine to generate explanations. Required if
            `reduction_fn` is provided.

    Returns:
        score: A variable containing the evaluation score(s),
            or their aggregation if `reduction_fn` is provided.
        explanation: A variable containing the explanation(s) of the evaluation,
            or their aggregation if `reduction_fn` is provided.

    Raises:
        TypeError: If the types of `prediction`, `target`, `eval_fn`,
            `eval_fn_purpose`, `success_fn`, `reduction_fn`,
            or `reduction_fn_purpose` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or
            `target.data`, when `target` is a `Variable`) do not match when
            both are lists, or if `eval_fn_purpose` (or `eval_fn_purpose.data`)
            is an empty string, or if `reduction_fn_purpose` (or
            `reduction_fn_purpose.data`) is an empty string,
            or if the number of scores returned by `eval_fn`
            does not match the number of samples in the batch.
    """
    raise NotImplementedError(
        "DeterministicEvaluator.forward is implemented on the server. "
        "Client-side execution is not supported."
    )

backward(ctx, score_grad_output, explanation_grad_output) staticmethod

Backward pass for the deterministic evaluator function.

Warning

This method is invoked by the autodiff engine and should not be called directly.

Parameters:

ctx (required): Context object containing saved information from the forward pass.
score_grad_output (Variable, required): The gradient of the score Variable w.r.t. the output of the forward() method.
explanation_grad_output (Variable, required): The gradient of the explanation Variable w.r.t. the output of the forward() method.

Returns:

grad_prediction (Variable): The gradient for the prediction input Variable of forward().
None: Placeholder for the target argument of forward(), which does not require a gradient.
None: Placeholder for the eval_fn argument of forward(), which does not require a gradient.
None: Placeholder for the eval_fn_purpose argument of forward(), which does not require a gradient.
None: Placeholder for the success_fn argument of forward(), which does not require a gradient.
None: Placeholder for the reduction_fn argument of forward(), which does not require a gradient.
None: Placeholder for the reduction_fn_purpose argument of forward(), which does not require a gradient.

Raises:

RuntimeError: If the LM response to generate the gradients cannot be parsed as valid JSON after the maximum number of retries.
ValueError: If the number of gradients returned by the LM does not match the expected number (1 in this case).

Source code in afnio/autodiff/evaluator.py
@staticmethod
def backward(
    ctx, score_grad_output: Variable, explanation_grad_output: Variable
) -> Tuple[Variable, None, None, None, None, None, None]:
    """
    Backward pass for the deterministic evaluator function.

    Warning:
        This method is invoked by the autodiff engine
        and should not be called directly.

    Args:
        ctx: Context object containing saved information from the
            [`forward`][..forward] pass.
        score_grad_output: The gradient of the `score` `Variable` w.r.t.
            the output of the `forward()` method.
        explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
            the output of the `forward()` method.

    Returns:
        grad_prediction: The gradient for the `prediction` input `Variable`
            of `forward()`.
        None (None): Placeholder for the `target` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `eval_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `eval_fn_purpose` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `success_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn_purpose` argument of
            `forward()`, which does not require a gradient.

    Raises:
        RuntimeError: If the LM response to generate the gradients cannot be parsed
            as valid JSON after the maximum number of retries.
        ValueError: If the number of gradients returned by the LM does not match the
            expected number (1 in this case).
    """
    raise NotImplementedError(
        "DeterministicEvaluator.backward is implemented on the server. "
        "Client-side execution is not supported."
    )

afnio.autodiff.evaluator.ExactMatchEvaluator

Bases: Function

Evaluates predictions using exact matching within the afnio framework, supporting automatic differentiation.

This class inherits from Function and requires both the forward and backward methods to be defined.

The ExactMatchEvaluator function computes a score and an explanation by comparing the data fields of a prediction and a target for an exact match. For each sample:

  • A score of 1 is assigned for an exact match.
  • A score of 0 is assigned otherwise.

The prediction is a Variable. The target can be a string, a list of strings, or a Variable. Each Variable passed as an input argument can have either a scalar or a list data field, supporting both individual samples and batch processing. For batch processing, the lengths of prediction and target must match.

If batched inputs are provided, the scores can be aggregated using an optional reduction_fn, such as sum. The purpose of the reduction is described using reduction_fn_purpose. If aggregation is not desired, set reduction_fn and reduction_fn_purpose to None.
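The aggregation step described above can be mimicked in plain Python. This is an illustrative stand-in for the framework's internal behavior, not afnio API; reduce_scores is a hypothetical name:

```python
def reduce_scores(scores, reduction_fn=sum):
    """Apply reduction_fn to a batch of scores, or pass them through unaggregated."""
    # With reduction_fn=None, per-sample scores are returned as-is,
    # matching the "no aggregation" behavior described above.
    return scores if reduction_fn is None else reduction_fn(scores)
```

With the default sum, a batch of 0/1 scores collapses to a count of exact matches; with None, the per-sample list is preserved.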

Examples:

Example with scalar inputs:

>>> prediction = Variable(
...     data="green",
...     role="color prediction",
...     requires_grad=True
... )
>>> target = "red"
>>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
>>> score.data
0
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."

Example with batched inputs:

>>> prediction = Variable(
...     data=["green", "blue"],
...     role="color prediction",
...     requires_grad=True
... )
>>> target = ["red", "blue"]
>>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
>>> score.data
1
>>> explanation.data
"The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
>>> explanation.backward()
>>> prediction.grad[0].data
"Reassess the criteria that led to the initial prediction of 'green'."
Source code in afnio/autodiff/evaluator.py
@evaluator
class ExactMatchEvaluator(Function):
    """
    Evaluates predictions using exact matching within the `afnio` framework,
    supporting automatic differentiation.

    This class inherits from [`Function`][afnio.autodiff.function.Function] and
    requires both the [`forward`][afnio.autodiff.function.Function.forward] and
    [`backward`][afnio.autodiff.function.Function.backward] methods to be defined.

    The `ExactMatchEvaluator` function computes a `score` and an `explanation` by
    comparing the [`data`][afnio.Variable.data] fields of a `prediction`
    and a `target` for an exact match. For each sample:

    - A score of `1` is assigned for an exact match.
    - A score of `0` is assigned otherwise.

    The `prediction` is a [`Variable`][afnio.Variable]. The `target` can be a string,
    a list of strings, or a [`Variable`][afnio.Variable].
    Each [`Variable`][afnio.Variable] passed as an input argument can have either
    a scalar or a list [`data`][afnio.Variable.data] field, supporting both individual
    samples and batch processing. For batch processing, the lengths of `prediction`
    and `target` must match.

    If batched inputs are provided, the scores can be aggregated using an optional
    `reduction_fn`, such as `sum`. The purpose of the reduction is described using
    `reduction_fn_purpose`. If aggregation is not desired, set `reduction_fn` and
    `reduction_fn_purpose` to `None`.

    Examples:
        Example with scalar inputs:
        >>> prediction = Variable(
        ...     data="green",
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = "red"
        >>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
        >>> score.data
        0
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> field of the predicted variable ('green') with the <DATA> field of the target variable ('red'), resulting in a score: 0."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."

        Example with batched inputs:
        >>> prediction = Variable(
        ...     data=["green", "blue"],
        ...     role="color prediction",
        ...     requires_grad=True
        ... )
        >>> target = ["red", "blue"]
        >>> score, explanation = ExactMatchEvaluator.apply(prediction, target)
        >>> score.data
        1
        >>> explanation.data
        "The evaluation function, designed for 'exact match', compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch, generating individual scores for each pair. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
        >>> explanation.backward()
        >>> prediction.grad[0].data
        "Reassess the criteria that led to the initial prediction of 'green'."
    """  # noqa: E501

    @staticmethod
    def forward(
        ctx,
        prediction: Variable,
        target: Union[str, List[str], Variable],
        reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
        reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
    ) -> Tuple[Variable, Variable]:
        """
        Forward pass for the exact match evaluator function.

        Warning:
            This method is invoked by
            [`apply()`][afnio.autodiff.function.Function.apply]
            and should not be called directly.

        Args:
            ctx: Context object used to save information for [`backward`][..backward]
                computation.
            prediction: The predicted variable to evaluate, which can have scalar or
                list [`data`][afnio.Variable.data] (supporting both individual and
                batch processing).
            target: The target (ground truth) to compare against, which can be a string,
                a list of strings, or a `Variable`.
            reduction_fn: An optional function to aggregate scores across a batch of
                predictions and targets. If `None`, no aggregation is applied.
            reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
                used by the autodiff engine to generate explanations. Required if
                `reduction_fn` is provided.

        Returns:
            score: A variable containing the evaluation score(s),
                or their aggregation if `reduction_fn` is provided.
            explanation: A variable containing the explanation(s) of the evaluation,
                or their aggregation if `reduction_fn` is provided.

        Raises:
            TypeError: If the types of `prediction`, `target`, `reduction_fn`,
                or `reduction_fn_purpose` are not as expected.
            ValueError: If the lengths of `prediction.data` and `target` (or
                `target.data`, when `target` is a `Variable`) do not match when
                both are lists, or if `reduction_fn_purpose` (or
                `reduction_fn_purpose.data`) is an empty string.
        """
        raise NotImplementedError(
            "ExactMatchEvaluator.forward is implemented on the server. "
            "Client-side execution is not supported."
        )

    @staticmethod
    def backward(
        ctx, score_grad_output: Variable, explanation_grad_output: Variable
    ) -> Tuple[Variable, None]:
        """
        Backward pass for the exact match evaluator function.

        Warning:
            This method is invoked by the autodiff engine
            and should not be called directly.

        Args:
            ctx: Context object containing saved information from the
                [`forward`][..forward] pass.
            score_grad_output: The gradient of the `score` `Variable` w.r.t.
                the output of the `forward()` method.
            explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
                the output of the `forward()` method.

        Returns:
            grad_prediction: The gradient for the `prediction` input `Variable`
                of `forward()`.
            None (None): Placeholder for the `target` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn_purpose` argument of
                `forward()`, which does not require a gradient.
        """
        raise NotImplementedError(
            "ExactMatchEvaluator.backward is implemented on the server. "
            "Client-side execution is not supported."
        )

forward(ctx, prediction, target, reduction_fn=sum, reduction_fn_purpose='summation') staticmethod

Forward pass for the exact match evaluator function.

Warning

This method is invoked by apply() and should not be called directly.

Parameters:

- `ctx`: Context object used to save information for backward computation. *Required.*
- `prediction` (`Variable`): The predicted variable to evaluate, which can have scalar or list data (supporting both individual and batch processing). *Required.*
- `target` (`str | list[str] | Variable`): The target (ground truth) to compare against, which can be a string, a list of strings, or a `Variable`. *Required.*
- `reduction_fn` (`Callable[[List[Any]], Any] | None`): An optional function to aggregate scores across a batch of predictions and targets. If `None`, no aggregation is applied. *Default: `sum`.*
- `reduction_fn_purpose` (`str | Variable | None`): A brief description of the purpose of `reduction_fn`, used by the autodiff engine to generate explanations. Required if `reduction_fn` is provided. *Default: `'summation'`.*

Returns:

- `score` (`Variable`): A variable containing the evaluation score(s), or their aggregation if `reduction_fn` is provided.
- `explanation` (`Variable`): A variable containing the explanation(s) of the evaluation, or their aggregation if `reduction_fn` is provided.

Raises:

- `TypeError`: If the types of `prediction`, `target`, `reduction_fn`, or `reduction_fn_purpose` are not as expected.
- `ValueError`: If the lengths of `prediction.data` and `target` (or `target.data`, when `target` is a `Variable`) do not match when both are lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is an empty string.
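The default aggregation is `sum` with purpose `"summation"`, but any plain callable that maps a list of per-sample scores to a single value can stand in for it. As a minimal sketch (the `mean_reduction` helper below is illustrative, not part of afnio), a batch-accuracy style mean could look like:

```python
from typing import Any, List

def mean_reduction(scores: List[Any]) -> float:
    """Aggregate per-sample exact-match scores (0 or 1) into a batch average."""
    return sum(scores) / len(scores) if scores else 0.0

# Per-sample exact-match scores for a batch of three predictions.
batch_scores = [1, 0, 1]
print(mean_reduction(batch_scores))  # 2/3
```

This would then be passed as `reduction_fn=mean_reduction` together with a matching description such as `reduction_fn_purpose="mean"`, in place of the default `sum`/`"summation"` pair.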

Source code in afnio/autodiff/evaluator.py
@staticmethod
def forward(
    ctx,
    prediction: Variable,
    target: Union[str, List[str], Variable],
    reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
    reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
) -> Tuple[Variable, Variable]:
    """
    Forward pass for the exact match evaluator function.

    Warning:
        This method is invoked by
        [`apply()`][afnio.autodiff.function.Function.apply]
        and should not be called directly.

    Args:
        ctx: Context object used to save information for [`backward`][..backward]
            computation.
        prediction: The predicted variable to evaluate, which can have scalar or
            list [`data`][afnio.Variable.data] (supporting both individual and
            batch processing).
        target: The target (ground truth) to compare against, which can be a string,
            a list of strings, or a `Variable`.
        reduction_fn: An optional function to aggregate scores across a batch of
            predictions and targets. If `None`, no aggregation is applied.
        reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
            used by the autodiff engine to generate explanations. Required if
            `reduction_fn` is provided.

    Returns:
        score: A variable containing the evaluation score(s),
            or their aggregation if `reduction_fn` is provided.
        explanation: A variable containing the explanation(s) of the evaluation,
            or their aggregation if `reduction_fn` is provided.

    Raises:
        TypeError: If the types of `prediction`, `target`, `reduction_fn`,
            or `reduction_fn_purpose` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or
            `target.data`, when `target` is a `Variable`) do not match when
            both are lists, or if `reduction_fn_purpose` (or
            `reduction_fn_purpose.data`) is an empty string.
    """
    raise NotImplementedError(
        "ExactMatchEvaluator.forward is implemented on the server. "
        "Client-side execution is not supported."
    )

backward(ctx, score_grad_output, explanation_grad_output) staticmethod

Backward pass for the exact match evaluator function.

Warning

This method is invoked by the autodiff engine and should not be called directly.

Parameters:

- `ctx`: Context object containing saved information from the forward pass. *Required.*
- `score_grad_output` (`Variable`): The gradient of the `score` `Variable` w.r.t. the output of the `forward()` method. *Required.*
- `explanation_grad_output` (`Variable`): The gradient of the `explanation` `Variable` w.r.t. the output of the `forward()` method. *Required.*

Returns:

- `grad_prediction` (`Variable`): The gradient for the `prediction` input `Variable` of `forward()`.
- `None` (`None`): Placeholder for the `target` argument of `forward()`, which does not require a gradient.
- `None` (`None`): Placeholder for the `reduction_fn` argument of `forward()`, which does not require a gradient.
- `None` (`None`): Placeholder for the `reduction_fn_purpose` argument of `forward()`, which does not require a gradient.

Source code in afnio/autodiff/evaluator.py
@staticmethod
def backward(
    ctx, score_grad_output: Variable, explanation_grad_output: Variable
) -> Tuple[Variable, None]:
    """
    Backward pass for the exact match evaluator function.

    Warning:
        This method is invoked by the autodiff engine
        and should not be called directly.

    Args:
        ctx: Context object containing saved information from the
            [`forward`][..forward] pass.
        score_grad_output: The gradient of the `score` `Variable` w.r.t.
            the output of the `forward()` method.
        explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
            the output of the `forward()` method.

    Returns:
        grad_prediction: The gradient for the `prediction` input `Variable`
            of `forward()`.
        None (None): Placeholder for the `target` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn_purpose` argument of
            `forward()`, which does not require a gradient.
    """
    raise NotImplementedError(
        "ExactMatchEvaluator.backward is implemented on the server. "
        "Client-side execution is not supported."
    )

afnio.autodiff.evaluator.LMJudgeEvaluator

Bases: Function

Implements an evaluation of a model prediction using a language model (LM) as the judge within the afnio framework, supporting automatic differentiation.

This class inherits from Function and requires both the forward and backward methods to be defined.

This function returns a score and an explanation, both as Variable objects, by comparing a prediction against a target (when present) using a composite prompt. The prompt is constructed from a list of messages and optional inputs, which can dynamically populate placeholders in the message templates. The evaluation process leverages the specified forward_model_client to perform the LM-based assessment.

The prediction is a Variable. The target can be a string, a list of strings, or a Variable. Similarly, the inputs dictionary can include strings, lists of strings, or Variables. Each Variable passed as an input argument can have either a scalar or a list data field, supporting both individual samples and batch processing. For batch processing, the lengths of prediction, target, and any batched inputs must match.

The success_fn parameter is a user-defined function that returns True when all predictions evaluated by the LM as Judge are considered successful, and False otherwise. If success_fn returns True, the backward pass will skip gradient calculations and directly return an empty gradient, optimizing computational time.
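As a minimal sketch of such a predicate (the `all_correct` helper below is hypothetical, not part of afnio), a `success_fn` that treats a batch as successful only when every judge score is truthy could look like:

```python
from typing import Any, List

def all_correct(scores: List[Any]) -> bool:
    """Return True only when every score from the LM Judge is truthy."""
    return all(bool(s) for s in scores)

print(all_correct([True, True]))   # True
print(all_correct([True, False]))  # False
```

Passing `success_fn=all_correct` would let the backward pass short-circuit and return an empty gradient whenever the whole batch is judged correct.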

If you are processing a batch of predictions and targets, you can use the reduction_fn to aggregate individual scores (e.g., using sum to compute a total score). The reduction_fn_purpose parameter is a brief description of the aggregation's purpose (e.g., "summation"). If you don't want any aggregation, set both reduction_fn and reduction_fn_purpose to None.

The function operates in two modes controlled by eval_mode:

- **eval_mode=True (default)** – Computes gradients for `prediction` only. Use it for direct feedback on predictions.
- **eval_mode=False** – Computes gradients for `messages` and `inputs`. Use it to optimize the evaluator or align with human evaluation datasets.

Additional model parameters, such as temperature, max tokens, or seed values, can be passed through completion_args to customize the LLM's behavior.

Examples:

Example with scalar inputs:

>>> task = Variable(
...     "Evaluate if the translation is accurate.",
...     role="evaluation task",
...     requires_grad=True
... )
>>> format = Variable(
...     "Provide 'score' (true/false) and 'explanation' in JSON.",
...     role="output format"
... )
>>> user = Variable(
...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
...     role="user query"
... )
>>> prediction = Variable(
...     "Hola Mundo",
...     role="translated text",
...     requires_grad=True
... )
>>> target = Variable("Ciao Mondo", role="expected output")
>>> messages = [
...     {"role": "system", "content": [task, format]},
...     {"role": "user", "content": [user]}
... ]
>>> score, explanation = LMJudgeEvaluator.apply(
...     model,
...     messages,
...     prediction,
...     target,
...     temperature=0.5,
... )
>>> score.data
False
>>> explanation.data
'The translated text is in Spanish, but the expected is in Italian.'
>>> explanation.backward()
>>> prediction.grad[0].data
'The translated text should be in Italian.'

Example with batched inputs:

>>> task = Variable(
...     "Evaluate if the translation is accurate.",
...     role="evaluation task",
...     requires_grad=True
... )
>>> format = Variable(
...     "Provide 'score' (true/false) and 'explanation' in JSON.",
...     role="output format"
... )
>>> user = Variable(
...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
...     role="user query"
... )
>>> prediction = Variable(
...     data=["Hola Mundo", "Salve a tutti"],
...     role="translated text",
...     requires_grad=True,
... )
>>> target = ["Ciao Mondo", "Salve a tutti"]
>>> score, explanation = LMJudgeEvaluator.apply(
...     model,
...     messages,
...     prediction,
...     target,
...     reduction_fn=sum,
...     reduction_fn_purpose="summation",
... )
>>> score.data
1
>>> explanation.data
"The evaluation function, designed using an LM as the judge, compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
Source code in afnio/autodiff/evaluator.py
@evaluator
class LMJudgeEvaluator(Function):
    """
    Implements an evaluation of a model prediction using a language model (LM) as the
    judge within the `afnio` framework, supporting automatic differentiation.

    This class inherits from [`Function`][afnio.autodiff.function.Function] and
    requires both the [`forward`][afnio.autodiff.function.Function.forward] and
    [`backward`][afnio.autodiff.function.Function.backward] methods to be defined.

    This function returns a `score` and an `explanation`, both as
    [`Variable`][afnio.Variable] objects, by comparing a `prediction` against a `target`
    (when present) using a composite prompt. The prompt is constructed from a list of
    `messages` and optional `inputs`, which can dynamically populate placeholders in the
    message templates. The evaluation process leverages the specified
    `forward_model_client` to perform the LM-based assessment.

    The `prediction` is a [`Variable`][afnio.Variable]. The `target` can be a string,
    a list of strings, or a [`Variable`][afnio.Variable]. Similarly, the `inputs`
    dictionary can include strings, lists of strings, or [`Variable`][afnio.Variable]s.
    Each [`Variable`][afnio.Variable] passed as an input argument can have either
    a scalar or a list [`data`][afnio.Variable.data] field, supporting both individual
    samples and batch processing. For batch processing, the lengths of `prediction`,
    `target`, and any batched `inputs` must match.

    The `success_fn` parameter is a user-defined function that returns `True` when
    all predictions evaluated by the LM as Judge are considered successful, and `False`
    otherwise. If `success_fn` returns `True`, the `backward` pass will skip gradient
    calculations and directly return an empty gradient, optimizing computational time.

    If you are processing a batch of predictions and targets, you can use the
    `reduction_fn` to aggregate individual scores (e.g., using `sum` to compute a total
    score). The `reduction_fn_purpose` parameter is a brief description of the
    aggregation's purpose (e.g., `"summation"`). If you don't want any aggregation, set
    both `reduction_fn` and `reduction_fn_purpose` to `None`.

    The function operates in two modes controlled by `eval_mode`:

    - **eval_mode=True (default)** – Computes gradients for `prediction` only. Use it
      for direct feedback on predictions.
    - **eval_mode=False** – Computes gradients for `messages` and `inputs`. Use it to
      optimize the evaluator or align with human evaluation datasets.

    Additional model parameters, such as temperature, max tokens, or seed values, can
    be passed through `completion_args` to customize the LLM's behavior.

    Examples:
        Example with scalar inputs:
        >>> task = Variable(
        ...     "Evaluate if the translation is accurate.",
        ...     role="evaluation task",
        ...     requires_grad=True
        ... )
        >>> format = Variable(
        ...     "Provide 'score' (true/false) and 'explanation' in JSON.",
        ...     role="output format"
        ... )
        >>> user = Variable(
        ...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
        ...     role="user query"
        ... )
        >>> prediction = Variable(
        ...     "Hola Mundo",
        ...     role="translated text",
        ...     requires_grad=True
        ... )
        >>> target = Variable("Ciao Mondo", role="expected output")
        >>> messages = [
        ...     {"role": "system", "content": [task, format]},
        ...     {"role": "user", "content": [user]}
        ... ]
        >>> score, explanation = LMJudgeEvaluator.apply(
        ...     model,
        ...     messages,
        ...     prediction,
        ...     target,
        ...     temperature=0.5,
        ... )
        >>> score.data
        False
        >>> explanation.data
        'The translated text is in Spanish, but the expected is in Italian.'
        >>> explanation.backward()
        >>> prediction.grad[0].data
        'The translated text should be in Italian.'

        Example with batched inputs:
        >>> task = Variable(
        ...     "Evaluate if the translation is accurate.",
        ...     role="evaluation task",
        ...     requires_grad=True
        ... )
        >>> format = Variable(
        ...     "Provide 'score' (true/false) and 'explanation' in JSON.",
        ...     role="output format"
        ... )
        >>> user = Variable(
        ...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
        ...     role="user query"
        ... )
        >>> prediction = Variable(
        ...     data=["Hola Mundo", "Salve a tutti"],
        ...     role="translated text",
        ...     requires_grad=True,
        ... )
        >>> target = ["Ciao Mondo", "Salve a tutti"]
        >>> score, explanation = LMJudgeEvaluator.apply(
        ...     model,
        ...     messages,
        ...     prediction,
        ...     target,
        ...     reduction_fn=sum,
        ...     reduction_fn_purpose="summation",
        ... )
        >>> score.data
        1
        >>> explanation.data
        "The evaluation function, designed using an LM as the judge, compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1."
    """  # noqa: E501

    @staticmethod
    def forward(
        ctx,
        forward_model_client: Optional[ChatCompletionModel],
        messages: MultiTurnMessages,
        prediction: Variable,
        target: Optional[Union[str, List[str], Variable]] = None,
        inputs: Optional[Dict[str, Union[str, Variable]]] = None,
        success_fn: Optional[Callable[[List[Any]], bool]] = None,
        reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
        reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
        eval_mode: Union[bool, Variable] = True,
        **completion_args,
    ) -> Tuple[Variable, Variable]:
        """
        Forward pass for the LM Judge evaluator function.

        Warning:
            This method is invoked by
            [`apply()`][afnio.autodiff.function.Function.apply]
            and should not be called directly.

        Args:
            ctx: Context object used to save information for [`backward`][..backward]
                computation.
            forward_model_client: The LM model client used for the forward
                pass evaluation.
            messages: A list of messages that compose the prompt/context for the LM.
                Each message is a dictionary with a `"role"` (e.g., `"system"`,
                `"user"`, `"assistant"`) and a `"content"` field, which is a list of
                `Variable` objects. The `Variable` objects in the `"content"` can
                contain placeholders (e.g., `{prediction}`, `{target}`) that will be
                populated with the corresponding values from the `inputs` dictionary.
            prediction: The predicted variable to evaluate, which can have scalar or
                list [`data`][afnio.Variable.data] (supporting both individual and
                batch processing).
            target: The target (ground truth) to compare against, which can be a string,
                a list of strings, or a `Variable`. Optional if the evaluation does not
                require a target and only relies on the correctness of the LM Judge's
                assessment of the `prediction`.
            inputs: A dictionary mapping placeholder names to their corresponding
                values, which can be strings or `Variable` instances. These values
                will be used to populate the placeholders in the `messages` content
                before sending the prompt to the LM. For example, if a message
                `"content"` field contains the placeholder `{color}`, the `inputs`
                dictionary should have a key `"color"` with the value to substitute
                in the prompt. Optional if there are no placeholders in the messages or
                if all placeholders are directly related to `prediction` and `target`.
            success_fn: A user-defined function that takes the list of scores returned
                by the LM Judge and returns `True` if all predictions are considered
                successful, or `False` otherwise.
            reduction_fn: An optional function to aggregate scores across a batch of
                predictions and targets. If `None`, no aggregation is applied.
            reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
                used by the autodiff engine to generate explanations. Required if
                `reduction_fn` is provided.
            eval_mode: Indicates the evaluation mode. If `True`, the `backward` pass
                will compute gradients for the `prediction` variable only. If `False`,
                the `backward` pass will compute gradients for the `messages` and
                `inputs`, allowing optimization of the evaluator itself or alignment
                with human evaluation datasets.
            **completion_args: Additional keyword arguments to pass to the LM model
                client's `chat` method, such as temperature, max tokens, or seed values,
                to customize the LLM's behavior during the evaluation.

        Returns:
            score: A variable containing the evaluation score(s),
                or their aggregation if `reduction_fn` is provided.
            explanation: A variable containing the explanation(s) of the evaluation,
                or their aggregation if `reduction_fn` is provided.

        Raises:
            RuntimeError: If the LM response to generate the evaluation `score` and
                `explanation` cannot be parsed as valid JSON.
            TypeError: If the types of `forward_model_client`, `messages`, `prediction`,
                `target`, `inputs`, `success_fn`, `reduction_fn`,
                `reduction_fn_purpose`, or `eval_mode` are not as expected.
            ValueError: If the lengths of `prediction.data` and `target` (or
                `target.data`, when `target` is a `Variable`) do not match when both are
                lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is
                an empty string, or if `inputs` contains keys that conflict with
                `prediction` or `target`.
        """
        raise NotImplementedError(
            "LMJudgeEvaluator.forward is implemented on the server. "
            "Client-side execution is not supported."
        )

    @staticmethod
    def backward(
        ctx, score_grad_output: Variable, explanation_grad_output: Variable
    ) -> Tuple[Optional[Variable], ...]:
        """
        Backward pass for the LM Judge evaluator function.

        Warning:
            This method is invoked by the autodiff engine
            and should not be called directly.

        Args:
            ctx: Context object containing saved information from the
                [`forward`][..forward] pass.
            score_grad_output: The gradient of the `score` `Variable` w.r.t.
                the output of the `forward()` method.
            explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
                the output of the `forward()` method.

        Returns:
            None (None): Placeholder for the `forward_model_client` argument of
                `forward()`, which does not require a gradient.
            grad_messages (Tuple[Optional[Variable], ...]): A tuple of gradients for the
                `messages` argument of `forward()`, where each gradient corresponds to
                the respective message variable. If `eval_mode` is `True`, this will be
                a tuple of `None` values, indicating that no gradients are computed for
                the messages in this mode.
            grad_prediction (Optional[Variable]): The gradient for the `prediction`
                argument `Variable` of `forward()`, or `None` if `eval_mode` is `False`.
            None (None): Placeholder for the `target` argument of `forward()`,
                which does not require a gradient.
            grad_inputs (Tuple[Optional[Variable], ...]): A tuple of gradients for the
                `inputs` argument of `forward()`, where each gradient corresponds to the
                respective input variable. If `eval_mode` is `True`, this will be a
                tuple of `None` values, indicating that no gradients are computed for
                the inputs in this mode.
            None (None): Placeholder for the `success_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn` argument of `forward()`,
                which does not require a gradient.
            None (None): Placeholder for the `reduction_fn_purpose` argument of
                `forward()`, which does not require a gradient.
            None (None): Placeholder for the `eval_mode` argument of `forward()`,
                which does not require a gradient.
            None (Tuple[None, ...]): Placeholder for any additional completion arguments
                passed to `forward()`, which do not require gradients.

        Raises:
            RuntimeError: If the LM response to generate the gradients cannot be parsed
                as valid JSON after the maximum number of retries.
            ValueError: If the number of gradients returned by the LM does not match the
                expected number based on the evaluation mode.
        """
        raise NotImplementedError(
            "LMJudgeEvaluator.backward is implemented on the server. "
            "Client-side execution is not supported."
        )

forward(ctx, forward_model_client, messages, prediction, target=None, inputs=None, success_fn=None, reduction_fn=sum, reduction_fn_purpose='summation', eval_mode=True, **completion_args) staticmethod

Forward pass for the LM Judge evaluator function.

Warning

This method is invoked by apply() and should not be called directly.

Parameters:

- `ctx`: Context object used to save information for backward computation. *Required.*
- `forward_model_client` (`ChatCompletionModel | None`): The LM model client used for the forward pass evaluation. *Required.*
- `messages` (`MultiTurnMessages`): A list of messages that compose the prompt/context for the LM. Each message is a dictionary with a `"role"` (e.g., `"system"`, `"user"`, `"assistant"`) and a `"content"` field, which is a list of `Variable` objects. The `Variable` objects in the `"content"` can contain placeholders (e.g., `{prediction}`, `{target}`) that will be populated with the corresponding values from the `inputs` dictionary. *Required.*
- `prediction` (`Variable`): The predicted variable to evaluate, which can have scalar or list data (supporting both individual and batch processing). *Required.*
- `target` (`str | list[str] | Variable | None`): The target (ground truth) to compare against, which can be a string, a list of strings, or a `Variable`. Optional if the evaluation does not require a target and only relies on the correctness of the LM Judge's assessment of the `prediction`. *Default: `None`.*
- `inputs` (`dict[str, str | Variable] | None`): A dictionary mapping placeholder names to their corresponding values, which can be strings or `Variable` instances. These values will be used to populate the placeholders in the `messages` content before sending the prompt to the LM. For example, if a message `"content"` field contains the placeholder `{color}`, the `inputs` dictionary should have a key `"color"` with the value to substitute in the prompt. Optional if there are no placeholders in the messages or if all placeholders are directly related to `prediction` and `target`. *Default: `None`.*
- `success_fn` (`Callable[[List[Any]], bool] | None`): A user-defined function that takes the list of scores returned by the LM Judge and returns `True` if all predictions are considered successful, or `False` otherwise. *Default: `None`.*
- `reduction_fn` (`Callable[[List[Any]], Any] | None`): An optional function to aggregate scores across a batch of predictions and targets. If `None`, no aggregation is applied. *Default: `sum`.*
- `reduction_fn_purpose` (`str | Variable | None`): A brief description of the purpose of `reduction_fn`, used by the autodiff engine to generate explanations. Required if `reduction_fn` is provided. *Default: `'summation'`.*
- `eval_mode` (`bool | Variable`): Indicates the evaluation mode. If `True`, the backward pass will compute gradients for the `prediction` variable only. If `False`, the backward pass will compute gradients for the `messages` and `inputs`, allowing optimization of the evaluator itself or alignment with human evaluation datasets. *Default: `True`.*
- `**completion_args`: Additional keyword arguments to pass to the LM model client's `chat` method, such as temperature, max tokens, or seed values, to customize the LLM's behavior during the evaluation. *Default: `{}`.*

Returns:

- `score` (`Variable`): A variable containing the evaluation score(s), or their aggregation if `reduction_fn` is provided.
- `explanation` (`Variable`): A variable containing the explanation(s) of the evaluation, or their aggregation if `reduction_fn` is provided.

Raises:

- `RuntimeError`: If the LM response to generate the evaluation `score` and `explanation` cannot be parsed as valid JSON.
- `TypeError`: If the types of `forward_model_client`, `messages`, `prediction`, `target`, `inputs`, `success_fn`, `reduction_fn`, `reduction_fn_purpose`, or `eval_mode` are not as expected.
- `ValueError`: If the lengths of `prediction.data` and `target` (or `target.data`, when `target` is a `Variable`) do not match when both are lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is an empty string, or if `inputs` contains keys that conflict with `prediction` or `target`.
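The placeholder substitution described for `messages` and `inputs` happens server-side, but conceptually it behaves like Python's `str.format` (an assumption made here for illustration only, using the template from the examples above):

```python
# Conceptual sketch of placeholder population. The real substitution is
# performed on the server; str.format is used here only as an illustration.
template = "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>"
filled = template.format(prediction="Hola Mundo", target="Ciao Mondo")
print(filled)  # <PREDICTION>Hola Mundo</PREDICTION><TARGET>Ciao Mondo</TARGET>
```

Keys in `inputs` play the role of the keyword arguments here, which is why names that collide with `prediction` or `target` raise a `ValueError`.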

Source code in afnio/autodiff/evaluator.py
@staticmethod
def forward(
    ctx,
    forward_model_client: Optional[ChatCompletionModel],
    messages: MultiTurnMessages,
    prediction: Variable,
    target: Optional[Union[str, List[str], Variable]] = None,
    inputs: Optional[Dict[str, Union[str, Variable]]] = None,
    success_fn: Optional[Callable[[List[Any]], bool]] = None,
    reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
    reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
    eval_mode: Union[bool, Variable] = True,
    **completion_args,
) -> Tuple[Variable, Variable]:
    """
    Forward pass for the LM Judge evaluator function.

    Warning:
        This method is invoked by
        [`apply()`][afnio.autodiff.function.Function.apply]
        and should not be called directly.

    Args:
        ctx: Context object used to save information for [`backward`][..backward]
            computation.
        forward_model_client: The LM model client used for the forward
            pass evaluation.
        messages: A list of messages that compose the prompt/context for the LM.
            Each message is a dictionary with a `"role"` (e.g., `"system"`,
            `"user"`, `"assistant"`) and a `"content"` field, which is a list of
            `Variable` objects. The `Variable` objects in the `"content"` can
            contain placeholders (e.g., `{prediction}`, `{target}`) that will be
            populated with the corresponding values from the `inputs` dictionary.
        prediction: The predicted variable to evaluate, which can have scalar or
            list [`data`][afnio.Variable.data] (supporting both individual and
            batch processing).
        target: The target (ground truth) to compare against, which can be a string,
            a list of strings, or a `Variable`. Optional if the evaluation does not
            require a target and only relies on the correctness of the LM Judge's
            assessment of the `prediction`.
        inputs: A dictionary mapping placeholder names to their corresponding
            values, which can be strings or `Variable` instances. These values
            will be used to populate the placeholders in the `messages` content
            before sending the prompt to the LM. For example, if a message
            `"content"` field contains the placeholder `{color}`, the `inputs`
            dictionary should have a key `"color"` with the value to substitute
            in the prompt. Optional if there are no placeholders in the messages or
            if all placeholders are directly related to `prediction` and `target`.
        success_fn: A user-defined function that takes the list of scores returned
            by the LM Judge and returns `True` if all predictions are considered
            successful, or `False` otherwise.
        reduction_fn: An optional function to aggregate scores across a batch of
            predictions and targets. If `None`, no aggregation is applied.
        reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
            used by the autodiff engine to generate explanations. Required if
            `reduction_fn` is provided.
        eval_mode: Indicates the evaluation mode. If `True`, the `backward` pass
            will compute gradients for the `prediction` variable only. If `False`,
            the `backward` pass will compute gradients for the `messages` and
            `inputs`, allowing optimization of the evaluator itself or alignment
            with human evaluation datasets.
        **completion_args: Additional keyword arguments to pass to the LM model
            client's `chat` method, such as temperature, max tokens, or seed values,
            to customize the LLM's behavior during the evaluation.

    Returns:
        score: A variable containing the evaluation score(s),
            or their aggregation if `reduction_fn` is provided.
        explanation: A variable containing the explanation(s) of the evaluation,
            or their aggregation if `reduction_fn` is provided.

    Raises:
        RuntimeError: If the LM response to generate the evaluation `score` and
            `explanation` cannot be parsed as valid JSON.
        TypeError: If the types of `forward_model_client`, `messages`, `prediction`,
            `target`, `inputs`, `success_fn`, `reduction_fn`,
            `reduction_fn_purpose`, or `eval_mode` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or
            `target.data`, when `target` is a `Variable`) do not match when both are
            lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is
            an empty string, or if `inputs` contains keys that conflict with
            `prediction` or `target`.
    """
    raise NotImplementedError(
        "LMJudgeEvaluator.forward is implemented on the server. "
        "Client-side execution is not supported."
    )
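The placeholder substitution that forward() performs on messages can be pictured as simple template formatting. A minimal sketch, assuming plain string contents for brevity (afnio's actual messages hold Variable objects, and the real substitution happens server-side):

```python
from typing import Dict, List


def render_messages(
    messages: List[Dict[str, str]], inputs: Dict[str, str]
) -> List[Dict[str, str]]:
    # Fill each {placeholder} in a message's content with the
    # matching value from the `inputs` dictionary.
    return [
        {"role": m["role"], "content": m["content"].format(**inputs)}
        for m in messages
    ]


messages = [
    {"role": "system", "content": "You judge color predictions."},
    {"role": "user", "content": "Does {prediction} match the target {target}?"},
]
rendered = render_messages(messages, {"prediction": "green", "target": "red"})
print(rendered[1]["content"])  # Does green match the target red?
```

This is why inputs may not reuse the keys "prediction" or "target": those placeholders are populated from the prediction and target arguments themselves, and conflicting keys raise ValueError.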

backward(ctx, score_grad_output, explanation_grad_output) staticmethod

Backward pass for the LM Judge evaluator function.

Warning

This method is invoked by the autodiff engine and should not be called directly.

Parameters:

ctx (required):
Context object containing saved information from the forward pass.

score_grad_output (Variable, required):
The gradient of the score Variable w.r.t. the output of the forward() method.

explanation_grad_output (Variable, required):
The gradient of the explanation Variable w.r.t. the output of the forward() method.

Returns:

None (None):
Placeholder for the forward_model_client argument of forward(), which does not require a gradient.

grad_messages (tuple[Variable | None, ...]):
A tuple of gradients for the messages argument of forward(), where each gradient corresponds to the respective message variable. If eval_mode is True, this is a tuple of None values, indicating that no gradients are computed for the messages in this mode.

grad_prediction (Variable | None):
The gradient for the prediction argument Variable of forward(), or None if eval_mode is False.

None (None):
Placeholder for the target argument of forward(), which does not require a gradient.

grad_inputs (tuple[Variable | None, ...]):
A tuple of gradients for the inputs argument of forward(), where each gradient corresponds to the respective input variable. If eval_mode is True, this is a tuple of None values, indicating that no gradients are computed for the inputs in this mode.

None (None):
Placeholder for the success_fn argument of forward(), which does not require a gradient.

None (None):
Placeholder for the reduction_fn argument of forward(), which does not require a gradient.

None (None):
Placeholder for the reduction_fn_purpose argument of forward(), which does not require a gradient.

None (None):
Placeholder for the eval_mode argument of forward(), which does not require a gradient.

None (tuple[None, ...]):
Placeholder for any additional completion arguments passed to forward(), which do not require gradients.
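Because backward() returns one slot per forward() argument, the tuple's shape depends only on the number of messages and inputs and on eval_mode. A schematic sketch of that positional convention (illustrative only; any object stands in for a gradient Variable, the real gradients are computed server-side, and the trailing completion_args slots are omitted for brevity):

```python
from typing import Any, Optional, Tuple


def gradient_slots(
    n_messages: int, n_inputs: int, eval_mode: bool, grad: Any
) -> Tuple[Optional[Any], ...]:
    # Slot order mirrors forward()'s arguments:
    # forward_model_client, *messages, prediction, target, *inputs,
    # success_fn, reduction_fn, reduction_fn_purpose, eval_mode.
    if eval_mode:
        # Gradients flow only to the prediction.
        msg_grads: Tuple[Optional[Any], ...] = (None,) * n_messages
        pred_grad: Optional[Any] = grad
        input_grads: Tuple[Optional[Any], ...] = (None,) * n_inputs
    else:
        # Gradients flow to the messages and inputs instead.
        msg_grads = (grad,) * n_messages
        pred_grad = None
        input_grads = (grad,) * n_inputs
    return (None, *msg_grads, pred_grad, None, *input_grads,
            None, None, None, None)


slots = gradient_slots(n_messages=2, n_inputs=1, eval_mode=True, grad="dL/dpred")
print(slots[3])  # dL/dpred  (the prediction slot)
```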

Raises:

RuntimeError:
If the LM response to generate the gradients cannot be parsed as valid JSON after the maximum number of retries.

ValueError:
If the number of gradients returned by the LM does not match the expected number based on the evaluation mode.

Source code in afnio/autodiff/evaluator.py
@staticmethod
def backward(
    ctx, score_grad_output: Variable, explanation_grad_output: Variable
) -> Tuple[Optional[Variable], ...]:
    """
    Backward pass for the LM Judge evaluator function.

    Warning:
        This method is invoked by the autodiff engine
        and should not be called directly.

    Args:
        ctx: Context object containing saved information from the
            [`forward`][..forward] pass.
        score_grad_output: The gradient of the `score` `Variable` w.r.t.
            the output of the `forward()` method.
        explanation_grad_output: The gradient of the `explanation` `Variable` w.r.t.
            the output of the `forward()` method.

    Returns:
        None (None): Placeholder for the `forward_model_client` argument of
            `forward()`, which does not require a gradient.
        grad_messages (Tuple[Optional[Variable], ...]): A tuple of gradients for the
            `messages` argument of `forward()`, where each gradient corresponds to
            the respective message variable. If `eval_mode` is `True`, this will be
            a tuple of `None` values, indicating that no gradients are computed for
            the messages in this mode.
        grad_prediction (Optional[Variable]): The gradient for the `prediction`
            argument `Variable` of `forward()`, or `None` if `eval_mode` is `False`.
        None (None): Placeholder for the `target` argument of `forward()`,
            which does not require a gradient.
        grad_inputs (Tuple[Optional[Variable], ...]): A tuple of gradients for the
            `inputs` argument of `forward()`, where each gradient corresponds to the
            respective input variable. If `eval_mode` is `True`, this will be a
            tuple of `None` values, indicating that no gradients are computed for
            the inputs in this mode.
        None (None): Placeholder for the `success_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn` argument of `forward()`,
            which does not require a gradient.
        None (None): Placeholder for the `reduction_fn_purpose` argument of
            `forward()`, which does not require a gradient.
        None (None): Placeholder for the `eval_mode` argument of `forward()`,
            which does not require a gradient.
        None (Tuple[None, ...]): Placeholder for any additional completion arguments
            passed to `forward()`, which do not require gradients.

    Raises:
        RuntimeError: If the LM response to generate the gradients cannot be parsed
            as valid JSON after the maximum number of retries.
        ValueError: If the number of gradients returned by the LM does not match the
            expected number based on the evaluation mode.
    """
    raise NotImplementedError(
        "LMJudgeEvaluator.backward is implemented on the server. "
        "Client-side execution is not supported."
    )