
afnio.cognitive.modules.lm_judge_evaluator

afnio.cognitive.modules.lm_judge_evaluator.LMJudgeEvaluator

Bases: Module

Evaluates predictions using a language model (LM) as the judge.

This module leverages the LMJudgeEvaluator operation from afnio.autodiff.evaluator to perform model-based evaluations. The forward method accepts a list of messages that construct the evaluation prompt, with optional inputs to dynamically fill placeholders within message templates. A prediction is compared against a target (optional) to generate a score and an explanation.

When processing a batch of predictions and targets, the reduction_fn function aggregates individual scores (e.g., using sum to compute a total score). The reduction_fn_purpose parameter is a brief description of the aggregation's purpose (e.g., "summation"). If aggregation is not desired, set both reduction_fn and reduction_fn_purpose to None. The success_fn callback checks whether all evaluations are successful, allowing the backward pass to skip unnecessary gradient computations.
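Conceptually, both hooks are plain callables over the list of per-sample scores returned by the judge. The following standalone sketch (pure Python, no afnio required; `all_successful` is an illustrative name, not part of the library) shows the shape of each:

```python
# Hypothetical per-sample scores for a batch of three predictions.
scores = [1, 0, 1]

# reduction_fn aggregates the batch into a single value; sum is the default.
reduction_fn = sum
total = reduction_fn(scores)

# success_fn decides whether the backward pass can skip gradient work.
def all_successful(batch_scores):
    """Return True only when every prediction in the batch passed."""
    return all(batch_scores)

print(total)                   # 2
print(all_successful(scores))  # False
```

Any callable with these signatures should work in their place; returning `True` from `success_fn` is what lets the backward pass short-circuit.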

This module supports both evaluation (eval_mode=True) and optimization (eval_mode=False) modes.

The forward_model_client specifies the LM responsible for evaluation, while completion_args allows customization of generation parameters like temperature, max tokens, and seed.
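The completion arguments are simply forwarded as keyword arguments to the model client. A minimal sketch of that flow, using a stub in place of the real client (the dict keys shown are typical OpenAI-style parameters and are an assumption about your provider, not an afnio requirement):

```python
# Generation parameters you might pass through **completion_args.
fwd_model_args = {"model": "gpt-4o", "temperature": 0.5, "seed": 42}

def chat_stub(**completion_args):
    # Stand-in for the model client's chat call: it receives whatever
    # generation parameters were splatted in, unchanged.
    return completion_args

received = chat_stub(**fwd_model_args)
print(received["temperature"])  # 0.5
```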

Examples:

>>> import afnio
>>> from afnio import cognitive as cog
>>> from afnio.models.openai import OpenAI
>>> from afnio import set_backward_model_client
>>> fwd_model_client = OpenAI()
>>> fwd_model_args = {"model": "gpt-4o", "temperature": 0.5}
>>> set_backward_model_client("openai/gpt-4o")
>>> class Evaluator(cog.Module):
...     def __init__(self):
...         super().__init__()
...         self.judge = cog.LMJudgeEvaluator()
...     def forward(self, fwd_model, messages, prediction, target, inputs, **completion_args):
...         return self.judge(fwd_model, messages, prediction, target, inputs, **completion_args)
>>> task = afnio.Variable(
...     "Evaluate if the translation is {metric}.",
...     role="evaluation task",
...     requires_grad=True
... )
>>> format = afnio.Variable(
...     "Provide 'score' (true/false) and 'explanation' in JSON.",
...     role="output format"
... )
>>> metric = afnio.Variable(["accurate", "accurate"], role="metric")
>>> user = afnio.Variable(
...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
...     role="user query"
... )
>>> prediction = afnio.Variable(
...     ["Hola Mundo", "Salve a tutti"],
...     role="translated text",
...     requires_grad=True
... )
>>> target = ["Ciao Mondo", "Salve a tutti"]
>>> messages = [
...     {"role": "system", "content": [task, format]},
...     {"role": "user", "content": [user]},
... ]
>>> evaluator = Evaluator()
>>> score, explanation = evaluator(
...     fwd_model_client,
...     messages,
...     prediction,
...     target,
...     inputs={"metric": metric},
...     reduction_fn=sum,
...     reduction_fn_purpose="summation",
...     **fwd_model_args
... )
>>> print(score.data)
1
>>> print(explanation.data)
'The evaluation function, designed using an LM as the judge, compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1.'
>>> explanation.backward()
>>> task.grad[0].data
'The translated text should be in Italian.'

Raises:

RuntimeError: If the LM response used to generate the evaluation score and explanation cannot be parsed as valid JSON.

TypeError: If forward_model_client, messages, prediction, target, inputs, success_fn, reduction_fn, reduction_fn_purpose, or eval_mode is not of the expected type.

ValueError: If prediction.data and target (or target.data, when target is a Variable) are both lists of different lengths, if reduction_fn_purpose (or reduction_fn_purpose.data) is an empty string, or if inputs contains keys that conflict with prediction or target.
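The last two ValueError conditions can be illustrated with plain Python. This is not afnio's implementation, only a sketch of the documented contract:

```python
# Deliberately malformed inputs to show the documented error conditions.
prediction_data = ["Hola Mundo", "Salve a tutti"]
target = ["Ciao Mondo"]              # batch length mismatch
inputs = {"prediction": "oops"}      # conflicts with a reserved placeholder

errors = []

# Condition 1: list-valued prediction and target must be the same length.
if isinstance(prediction_data, list) and isinstance(target, list):
    if len(prediction_data) != len(target):
        errors.append("prediction/target batch sizes differ")

# Condition 2: inputs must not redefine the prediction/target placeholders.
reserved = {"prediction", "target"}
conflicts = reserved & inputs.keys()
if conflicts:
    errors.append(f"inputs keys conflict: {sorted(conflicts)}")

print(errors)
```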

See Also

afnio.autodiff.evaluator.LMJudgeEvaluator for the underlying operation.

Source code in afnio/cognitive/modules/lm_judge_evaluator.py
class LMJudgeEvaluator(Module):
    """
    Evaluates predictions using a language model (LM) as the judge.

    This module leverages the [`LMJudgeEvaluator`][afnio.autodiff.evaluator.LMJudgeEvaluator]
    operation from `afnio.autodiff.evaluator` to perform model-based evaluations.
    The `forward` method accepts a list of `messages` that construct the evaluation
    prompt, with optional `inputs` to dynamically fill placeholders within message
    templates. A `prediction` is compared against a `target` (optional) to generate
    a `score` and an `explanation`.

    When processing a batch of predictions and targets, the `reduction_fn` function
    aggregates individual scores (e.g., using [`sum`][sum] to compute a total score).
    The `reduction_fn_purpose` parameter is a brief description of the aggregation's
    purpose (e.g., `"summation"`). If aggregation is not desired, set both
    `reduction_fn` and `reduction_fn_purpose` to `None`. The `success_fn` callback
    checks whether all evaluations are successful, allowing the `backward` pass to
    skip unnecessary gradient computations.

    This module supports both evaluation (`eval_mode=True`) and optimization
    (`eval_mode=False`) modes.

    The `forward_model_client` specifies the LM responsible for evaluation, while
    `completion_args` allows customization of generation parameters like temperature,
    max tokens, and seed.

    Examples:
        >>> import afnio
        >>> from afnio import cognitive as cog
        >>> from afnio.models.openai import OpenAI
        >>> from afnio import set_backward_model_client
        >>> fwd_model_client = OpenAI()
        >>> fwd_model_args = {"model": "gpt-4o", "temperature": 0.5}
        >>> set_backward_model_client("openai/gpt-4o")
        >>> class Evaluator(cog.Module):
        ...     def __init__(self):
        ...         super().__init__()
        ...         self.judge = cog.LMJudgeEvaluator()
        ...     def forward(self, fwd_model, messages, prediction, target, inputs, **completion_args):
        ...         return self.judge(fwd_model, messages, prediction, target, inputs, **completion_args)
        >>> task = afnio.Variable(
        ...     "Evaluate if the translation is {metric}.",
        ...     role="evaluation task",
        ...     requires_grad=True
        ... )
        >>> format = afnio.Variable(
        ...     "Provide 'score' (true/false) and 'explanation' in JSON.",
        ...     role="output format"
        ... )
        >>> metric = afnio.Variable(["accurate", "accurate"], role="metric")
        >>> user = afnio.Variable(
        ...     "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>",
        ...     role="user query"
        ... )
        >>> prediction = afnio.Variable(
        ...     ["Hola Mundo", "Salve a tutti"],
        ...     role="translated text",
        ...     requires_grad=True
        ... )
        >>> target = ["Ciao Mondo", "Salve a tutti"]
        >>> messages = [
        ...     {"role": "system", "content": [task, format]},
        ...     {"role": "user", "content": [user]},
        ... ]
        >>> evaluator = Evaluator()
        >>> score, explanation = evaluator(
        ...     fwd_model_client,
        ...     messages,
        ...     prediction,
        ...     target,
        ...     inputs={"metric": metric},
        ...     reduction_fn=sum,
        ...     reduction_fn_purpose="summation",
        ...     **fwd_model_args
        ... )
        >>> print(score.data)
        1
        >>> print(explanation.data)
        'The evaluation function, designed using an LM as the judge, compared the <DATA> fields of the predicted variable and the target variable across all samples in the batch. These scores were then aggregated using the reduction function 'summation', resulting in a final aggregated score: 1.'
        >>> explanation.backward()
        >>> task.grad[0].data
        'The translated text should be in Italian.'

    Raises:
        RuntimeError: If the LM response to generate the evaluation `score` and
            `explanation` cannot be parsed as valid JSON.
        TypeError: If the types of `forward_model_client`, `messages`, `prediction`,
            `target`, `inputs`, `success_fn`, `reduction_fn`, `reduction_fn_purpose`,
            or `eval_mode` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or `target.data`,
            when `target` is a `Variable`) do not match when both are lists, or if
            `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is an empty string,
            or if `inputs` contains keys that conflict with `prediction` or `target`.

    See Also:
        [`afnio.autodiff.evaluator.LMJudgeEvaluator`][afnio.autodiff.evaluator.LMJudgeEvaluator]
        for the underlying operation.
    """  # noqa: E501

    forward_model_client: Optional[ChatCompletionModel]
    messages: MultiTurnMessages
    success_fn: Optional[Callable[[List[Any]], bool]]
    reduction_fn: Optional[Callable[[List[Any]], Any]]
    reduction_fn_purpose: Optional[Union[str, Variable]]
    eval_mode: Union[bool, Variable]
    completion_args: Dict[str, Any]

    def __init__(self):
        super().__init__()

        self.register_model("forward_model_client", None)
        self.register_chat("messages", None)
        self.register_function("success_fn", None)
        self.register_function("reduction_fn", None)
        self.register_buffer("reduction_fn_purpose", None)
        self.register_buffer("eval_mode", None)
        self.register_completion_config("completion_args", None)

    def forward(
        self,
        forward_model_client: Optional[ChatCompletionModel],
        messages: MultiTurnMessages,
        prediction: Variable,
        target: Optional[Union[str, List[str], Variable]] = None,
        inputs: Optional[Dict[str, Union[str, Variable]]] = None,
        success_fn: Optional[Callable[[List[Any]], bool]] = None,
        reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
        reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
        eval_mode: Union[bool, Variable] = True,
        **completion_args,
    ) -> Tuple[Variable, Variable]:
        """
        Forward pass for the LM Judge evaluator function.

        Warning:
            Users should not call this method directly. Instead, they should call the
            module instance itself, which will internally invoke this `forward` method.

        Args:
            forward_model_client: The LM model client used for the forward
                pass evaluation.
            messages: A list of messages that compose the prompt/context for the LM.
                Each message is a dictionary with a `"role"` (e.g., `"system"`,
                `"user"`, `"assistant"`) and a `"content"` field, which is a list of
                `Variable` objects. The `Variable` objects in the `"content"` can
                contain placeholders (e.g., `{prediction}`, `{target}`) that will be
                populated with the corresponding values from the `inputs` dictionary.
            prediction: The predicted variable to evaluate, which can have scalar or
                list [`data`][afnio.Variable.data] (supporting both individual and
                batch processing).
            target: The target (ground truth) to compare against, which can be a string,
                a list of strings, or a `Variable`. Optional if the evaluation does not
                require a target and only relies on the correctness of the LM Judge's
                assessment of the `prediction`.
            inputs: A dictionary mapping placeholder names to their corresponding
                values, which can be strings or `Variable` instances. These values
                will be used to populate the placeholders in the `messages` content
                before sending the prompt to the LM. For example, if a message
                `"content"` field contains the placeholder `{color}`, the `inputs`
                dictionary should have a key `"color"` with the value to substitute
                in the prompt. Optional if there are no placeholders in the messages or
                if all placeholders are directly related to `prediction` and `target`.
            success_fn: A user-defined function that takes the list of scores returned
                by the LM Judge and returns `True` if all predictions are considered
                successful, or `False` otherwise.
            reduction_fn: An optional function to aggregate scores across a batch of
                predictions and targets. If `None`, no aggregation is applied.
            reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
                used by the autodiff engine to generate explanations. Required if
                `reduction_fn` is provided.
            eval_mode: Indicates the evaluation mode. If `True`, the `backward` pass
                will compute gradients for the `prediction` variable only. If `False`,
                the `backward` pass will compute gradients for the `messages` and
                `inputs`, allowing optimization of the evaluator itself or alignment
                with human evaluation datasets.
            **completion_args: Additional keyword arguments to pass to the LM model
                client's `chat` method, such as temperature, max tokens, or seed values,
                to customize the LLM's behavior during the evaluation.

        Returns:
            score: A variable containing the evaluation score(s),
                or their aggregation if `reduction_fn` is provided.
            explanation: A variable containing the explanation(s) of the evaluation,
                or their aggregation if `reduction_fn` is provided.

        Raises:
            RuntimeError: If the LM response to generate the evaluation `score` and
                `explanation` cannot be parsed as valid JSON.
            TypeError: If the types of `forward_model_client`, `messages`, `prediction`,
                `target`, `inputs`, `success_fn`, `reduction_fn`,
                `reduction_fn_purpose`, or `eval_mode` are not as expected.
            ValueError: If the lengths of `prediction.data` and `target` (or
                `target.data`, when `target` is a `Variable`) do not match when both are
                lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is
                an empty string, or if `inputs` contains keys that conflict with
                `prediction` or `target`.
        """
        self.forward_model_client = forward_model_client
        self.messages = messages
        self.success_fn = success_fn
        self.reduction_fn = reduction_fn
        self.reduction_fn_purpose = (
            None
            if reduction_fn_purpose is None
            else (
                reduction_fn_purpose
                if isinstance(reduction_fn_purpose, Variable)
                else Variable(reduction_fn_purpose)
            )
        )
        self.eval_mode = (
            eval_mode if isinstance(eval_mode, Variable) else Variable(eval_mode)
        )
        self.completion_args = completion_args
        return LMJudgeEvaluatorOp.apply(
            self.forward_model_client,
            self.messages,
            prediction,
            target,
            inputs,
            self.success_fn,
            self.reduction_fn,
            self.reduction_fn_purpose,
            self.eval_mode,
            **self.completion_args,
        )

forward(forward_model_client, messages, prediction, target=None, inputs=None, success_fn=None, reduction_fn=sum, reduction_fn_purpose='summation', eval_mode=True, **completion_args)

Forward pass for the LM Judge evaluator function.

Warning

Users should not call this method directly. Instead, they should call the module instance itself, which will internally invoke this forward method.

Parameters:

forward_model_client (ChatCompletionModel | None, required):
    The LM model client used for the forward pass evaluation.

messages (MultiTurnMessages, required):
    A list of messages that compose the prompt/context for the LM. Each message is a dictionary with a "role" (e.g., "system", "user", "assistant") and a "content" field, which is a list of Variable objects. The Variable objects in the "content" can contain placeholders (e.g., {prediction}, {target}) that will be populated with the corresponding values from the inputs dictionary.

prediction (Variable, required):
    The predicted variable to evaluate, which can have scalar or list data (supporting both individual and batch processing).

target (str | list[str] | Variable | None, default: None):
    The target (ground truth) to compare against, which can be a string, a list of strings, or a Variable. Optional if the evaluation does not require a target and only relies on the correctness of the LM Judge's assessment of the prediction.

inputs (dict[str, str | Variable] | None, default: None):
    A dictionary mapping placeholder names to their corresponding values, which can be strings or Variable instances. These values are used to populate the placeholders in the messages content before sending the prompt to the LM. For example, if a message "content" field contains the placeholder {color}, the inputs dictionary should have a key "color" with the value to substitute in the prompt. Optional if there are no placeholders in the messages or if all placeholders are directly related to prediction and target.

success_fn (Callable[[List[Any]], bool] | None, default: None):
    A user-defined function that takes the list of scores returned by the LM Judge and returns True if all predictions are considered successful, or False otherwise.

reduction_fn (Callable[[List[Any]], Any] | None, default: sum):
    An optional function to aggregate scores across a batch of predictions and targets. If None, no aggregation is applied.

reduction_fn_purpose (str | Variable | None, default: 'summation'):
    A brief description of the purpose of reduction_fn, used by the autodiff engine to generate explanations. Required if reduction_fn is provided.

eval_mode (bool | Variable, default: True):
    Indicates the evaluation mode. If True, the backward pass computes gradients for the prediction variable only. If False, the backward pass computes gradients for the messages and inputs, allowing optimization of the evaluator itself or alignment with human evaluation datasets.

**completion_args (default: {}):
    Additional keyword arguments to pass to the LM model client's chat method, such as temperature, max tokens, or seed values, to customize the LLM's behavior during the evaluation.
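The placeholder mechanics of messages and inputs can be sketched in plain Python. Assuming the substitution behaves like str.format templates (an assumption for illustration; afnio's actual templating may differ in detail), inputs supplies the non-reserved placeholders while prediction and target fill their own slots:

```python
# Templates mirroring the Examples section above.
system_template = "Evaluate if the translation is {metric}."
user_template = "<PREDICTION>{prediction}</PREDICTION><TARGET>{target}</TARGET>"

# inputs fills free placeholders; prediction/target fill the reserved ones.
inputs = {"metric": "accurate"}
prompt = system_template.format(**inputs)
query = user_template.format(prediction="Hola Mundo", target="Ciao Mondo")

print(prompt)  # Evaluate if the translation is accurate.
print(query)   # <PREDICTION>Hola Mundo</PREDICTION><TARGET>Ciao Mondo</TARGET>
```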

Returns:

score (Variable): A variable containing the evaluation score(s), or their aggregation if reduction_fn is provided.

explanation (Variable): A variable containing the explanation(s) of the evaluation, or their aggregation if reduction_fn is provided.

Raises:

RuntimeError: If the LM response used to generate the evaluation score and explanation cannot be parsed as valid JSON.

TypeError: If forward_model_client, messages, prediction, target, inputs, success_fn, reduction_fn, reduction_fn_purpose, or eval_mode is not of the expected type.

ValueError: If prediction.data and target (or target.data, when target is a Variable) are both lists of different lengths, if reduction_fn_purpose (or reduction_fn_purpose.data) is an empty string, or if inputs contains keys that conflict with prediction or target.

Source code in afnio/cognitive/modules/lm_judge_evaluator.py
def forward(
    self,
    forward_model_client: Optional[ChatCompletionModel],
    messages: MultiTurnMessages,
    prediction: Variable,
    target: Optional[Union[str, List[str], Variable]] = None,
    inputs: Optional[Dict[str, Union[str, Variable]]] = None,
    success_fn: Optional[Callable[[List[Any]], bool]] = None,
    reduction_fn: Optional[Callable[[List[Any]], Any]] = sum,
    reduction_fn_purpose: Optional[Union[str, Variable]] = "summation",
    eval_mode: Union[bool, Variable] = True,
    **completion_args,
) -> Tuple[Variable, Variable]:
    """
    Forward pass for the LM Judge evaluator function.

    Warning:
        Users should not call this method directly. Instead, they should call the
        module instance itself, which will internally invoke this `forward` method.

    Args:
        forward_model_client: The LM model client used for the forward
            pass evaluation.
        messages: A list of messages that compose the prompt/context for the LM.
            Each message is a dictionary with a `"role"` (e.g., `"system"`,
            `"user"`, `"assistant"`) and a `"content"` field, which is a list of
            `Variable` objects. The `Variable` objects in the `"content"` can
            contain placeholders (e.g., `{prediction}`, `{target}`) that will be
            populated with the corresponding values from the `inputs` dictionary.
        prediction: The predicted variable to evaluate, which can have scalar or
            list [`data`][afnio.Variable.data] (supporting both individual and
            batch processing).
        target: The target (ground truth) to compare against, which can be a string,
            a list of strings, or a `Variable`. Optional if the evaluation does not
            require a target and only relies on the correctness of the LM Judge's
            assessment of the `prediction`.
        inputs: A dictionary mapping placeholder names to their corresponding
            values, which can be strings or `Variable` instances. These values
            will be used to populate the placeholders in the `messages` content
            before sending the prompt to the LM. For example, if a message
            `"content"` field contains the placeholder `{color}`, the `inputs`
            dictionary should have a key `"color"` with the value to substitute
            in the prompt. Optional if there are no placeholders in the messages or
            if all placeholders are directly related to `prediction` and `target`.
        success_fn: A user-defined function that takes the list of scores returned
            by the LM Judge and returns `True` if all predictions are considered
            successful, or `False` otherwise.
        reduction_fn: An optional function to aggregate scores across a batch of
            predictions and targets. If `None`, no aggregation is applied.
        reduction_fn_purpose: A brief description of the purpose of `reduction_fn`,
            used by the autodiff engine to generate explanations. Required if
            `reduction_fn` is provided.
        eval_mode: Indicates the evaluation mode. If `True`, the `backward` pass
            will compute gradients for the `prediction` variable only. If `False`,
            the `backward` pass will compute gradients for the `messages` and
            `inputs`, allowing optimization of the evaluator itself or alignment
            with human evaluation datasets.
        **completion_args: Additional keyword arguments to pass to the LM model
            client's `chat` method, such as temperature, max tokens, or seed values,
            to customize the LLM's behavior during the evaluation.

    Returns:
        score: A variable containing the evaluation score(s),
            or their aggregation if `reduction_fn` is provided.
        explanation: A variable containing the explanation(s) of the evaluation,
            or their aggregation if `reduction_fn` is provided.

    Raises:
        RuntimeError: If the LM response to generate the evaluation `score` and
            `explanation` cannot be parsed as valid JSON.
        TypeError: If the types of `forward_model_client`, `messages`, `prediction`,
            `target`, `inputs`, `success_fn`, `reduction_fn`,
            `reduction_fn_purpose`, or `eval_mode` are not as expected.
        ValueError: If the lengths of `prediction.data` and `target` (or
            `target.data`, when `target` is a `Variable`) do not match when both are
            lists, or if `reduction_fn_purpose` (or `reduction_fn_purpose.data`) is
            an empty string, or if `inputs` contains keys that conflict with
            `prediction` or `target`.
    """
    self.forward_model_client = forward_model_client
    self.messages = messages
    self.success_fn = success_fn
    self.reduction_fn = reduction_fn
    self.reduction_fn_purpose = (
        None
        if reduction_fn_purpose is None
        else (
            reduction_fn_purpose
            if isinstance(reduction_fn_purpose, Variable)
            else Variable(reduction_fn_purpose)
        )
    )
    self.eval_mode = (
        eval_mode if isinstance(eval_mode, Variable) else Variable(eval_mode)
    )
    self.completion_args = completion_args
    return LMJudgeEvaluatorOp.apply(
        self.forward_model_client,
        self.messages,
        prediction,
        target,
        inputs,
        self.success_fn,
        self.reduction_fn,
        self.reduction_fn_purpose,
        self.eval_mode,
        **self.completion_args,
    )