
pydantic_evals.evaluators

Contains dataclass

Bases: Evaluator[object, object, object]

Check if the output contains the expected output.

For strings, checks whether expected_output is a substring of output. For lists/tuples, checks whether expected_output is in output. For dicts, checks whether all key-value pairs in expected_output are present in output.

Note: case_sensitive only applies when both the value and the output are strings.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class Contains(Evaluator[object, object, object]):
    """Check if the output contains the expected output.

    For strings, checks if expected_output is a substring of output.
    For lists/tuples, checks if expected_output is in output.
    For dicts, checks if all key-value pairs in expected_output are in output.

    Note: case_sensitive only applies when both the value and output are strings.
    """

    value: Any
    case_sensitive: bool = True
    as_strings: bool = False
    evaluation_name: str | None = field(default=None)

    def evaluate(
        self,
        ctx: EvaluatorContext[object, object, object],
    ) -> EvaluationReason:
        # Convert objects to strings if requested
        failure_reason: str | None = None
        as_strings = self.as_strings or (isinstance(self.value, str) and isinstance(ctx.output, str))
        if as_strings:
            output_str = str(ctx.output)
            expected_str = str(self.value)

            if not self.case_sensitive:
                output_str = output_str.lower()
                expected_str = expected_str.lower()

            failure_reason: str | None = None
            if expected_str not in output_str:
                output_trunc = _truncated_repr(output_str, max_length=100)
                expected_trunc = _truncated_repr(expected_str, max_length=100)
                failure_reason = f'Output string {output_trunc} does not contain expected string {expected_trunc}'
            return EvaluationReason(value=failure_reason is None, reason=failure_reason)

        try:
            # Handle different collection types
            if isinstance(ctx.output, dict):
                if isinstance(self.value, dict):
                    # Cast to Any to avoid type checking issues
                    output_dict = cast(dict[Any, Any], ctx.output)  # pyright: ignore[reportUnknownMemberType]
                    expected_dict = cast(dict[Any, Any], self.value)  # pyright: ignore[reportUnknownMemberType]
                    for k in expected_dict:
                        if k not in output_dict:
                            k_trunc = _truncated_repr(k, max_length=30)
                            failure_reason = f'Output dictionary does not contain expected key {k_trunc}'
                            break
                        elif output_dict[k] != expected_dict[k]:
                            k_trunc = _truncated_repr(k, max_length=30)
                            output_v_trunc = _truncated_repr(output_dict[k], max_length=100)
                            expected_v_trunc = _truncated_repr(expected_dict[k], max_length=100)
                            failure_reason = f'Output dictionary has different value for key {k_trunc}: {output_v_trunc} != {expected_v_trunc}'
                            break
                else:
                    if self.value not in ctx.output:  # pyright: ignore[reportUnknownMemberType]
                        output_trunc = _truncated_repr(ctx.output, max_length=200)  # pyright: ignore[reportUnknownMemberType]
                        failure_reason = f'Output {output_trunc} does not contain provided value as a key'
            elif self.value not in ctx.output:  # pyright: ignore[reportOperatorIssue]  # will be handled by except block
                output_trunc = _truncated_repr(ctx.output, max_length=200)
                failure_reason = f'Output {output_trunc} does not contain provided value'
        except (TypeError, ValueError) as e:
            failure_reason = f'Containment check failed: {e}'

        return EvaluationReason(value=failure_reason is None, reason=failure_reason)
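A minimal usage sketch, assuming the top-level Case/Dataset API from the pydantic_evals package; the task function, case data, and expected substring are illustrative only:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Contains


async def answer(question: str) -> str:
    # Illustrative stand-in for the task under evaluation.
    return 'Paris is the capital of France.'


dataset = Dataset(
    cases=[Case(name='capital', inputs='What is the capital of France?')],
    evaluators=[Contains('Paris', case_sensitive=False)],
)
report = dataset.evaluate_sync(answer)
report.print()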

Equals dataclass

Bases: Evaluator[object, object, object]

Check if the output exactly equals the provided value.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class Equals(Evaluator[object, object, object]):
    """Check if the output exactly equals the provided value."""

    value: Any
    evaluation_name: str | None = field(default=None)

    def evaluate(self, ctx: EvaluatorContext[object, object, object]) -> bool:
        return ctx.output == self.value

EqualsExpected dataclass

Bases: Evaluator[object, object, object]

Check if the output exactly equals the expected output.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class EqualsExpected(Evaluator[object, object, object]):
    """Check if the output exactly equals the expected output."""

    evaluation_name: str | None = field(default=None)

    def evaluate(self, ctx: EvaluatorContext[object, object, object]) -> bool | dict[str, bool]:
        if ctx.expected_output is None:
            return {}  # Only compare if expected output is provided
        return ctx.output == ctx.expected_output
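A short sketch of how the two are typically combined in a Dataset's evaluators list; the concrete values are illustrative:

from pydantic_evals.evaluators import Equals, EqualsExpected

evaluators = [
    Equals(value=42),   # passes only when the task output is exactly 42
    EqualsExpected(),   # compares against each case's expected_output, skipping cases without one
]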

HasMatchingSpan dataclass

Bases: Evaluator[object, object, object]

Check if the span tree contains a span that matches the specified query.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class HasMatchingSpan(Evaluator[object, object, object]):
    """Check if the span tree contains a span that matches the specified query."""

    query: SpanQuery
    evaluation_name: str | None = field(default=None)

    def evaluate(
        self,
        ctx: EvaluatorContext[object, object, object],
    ) -> bool:
        return ctx.span_tree.any(self.query)
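A sketch of constructing this evaluator; the SpanQuery condition shown here (name_contains) is an assumed example of the query keys defined in pydantic_evals.otel, and the span name is illustrative:

from pydantic_evals.evaluators import HasMatchingSpan

evaluator = HasMatchingSpan(
    query={'name_contains': 'retrieval'},  # assumed example query key
    evaluation_name='used_retrieval',
)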

IsInstance dataclass

Bases: Evaluator[object, object, object]

Check if the output is an instance of a type with the given name.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class IsInstance(Evaluator[object, object, object]):
    """Check if the output is an instance of a type with the given name."""

    type_name: str
    evaluation_name: str | None = field(default=None)

    def evaluate(self, ctx: EvaluatorContext[object, object, object]) -> EvaluationReason:
        output = ctx.output
        for cls in type(output).__mro__:
            if cls.__name__ == self.type_name or cls.__qualname__ == self.type_name:
                return EvaluationReason(value=True)

        reason = f'output is of type {type(output).__name__}'
        if type(output).__qualname__ != type(output).__name__:
            reason += f' (qualname: {type(output).__qualname__})'
        return EvaluationReason(value=False, reason=reason)
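A small sketch: because the check walks the output type's MRO, a subclass of the named type also passes.

from pydantic_evals.evaluators import IsInstance

# Passes for any str output, including str subclasses.
evaluator = IsInstance(type_name='str')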

LLMJudge dataclass

Bases: Evaluator[object, object, object]

Judge whether the output of a language model meets the criteria of a provided rubric.

If you do not specify a model, the default model is used for judging. This starts as 'openai:gpt-4o', but can be overridden by calling set_default_judge_model.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class LLMJudge(Evaluator[object, object, object]):
    """Judge whether the output of a language model meets the criteria of a provided rubric.

    If you do not specify a model, it uses the default model for judging. This starts as 'openai:gpt-4o', but can be
    overridden by calling [`set_default_judge_model`][pydantic_evals.evaluators.llm_as_a_judge.set_default_judge_model].
    """

    rubric: str
    model: models.Model | models.KnownModelName | None = None
    include_input: bool = False
    include_expected_output: bool = False
    model_settings: ModelSettings | None = None
    score: OutputConfig | Literal[False] = False
    assertion: OutputConfig | Literal[False] = field(default_factory=lambda: OutputConfig(include_reason=True))

    async def evaluate(
        self,
        ctx: EvaluatorContext[object, object, object],
    ) -> EvaluatorOutput:
        if self.include_input:
            if self.include_expected_output:
                from .llm_as_a_judge import judge_input_output_expected

                grading_output = await judge_input_output_expected(
                    ctx.inputs, ctx.output, ctx.expected_output, self.rubric, self.model, self.model_settings
                )
            else:
                from .llm_as_a_judge import judge_input_output

                grading_output = await judge_input_output(
                    ctx.inputs, ctx.output, self.rubric, self.model, self.model_settings
                )
        else:
            if self.include_expected_output:
                from .llm_as_a_judge import judge_output_expected

                grading_output = await judge_output_expected(
                    ctx.output, ctx.expected_output, self.rubric, self.model, self.model_settings
                )
            else:
                from .llm_as_a_judge import judge_output

                grading_output = await judge_output(ctx.output, self.rubric, self.model, self.model_settings)

        output: dict[str, EvaluationScalar | EvaluationReason] = {}
        include_both = self.score is not False and self.assertion is not False
        evaluation_name = self.get_default_evaluation_name()

        if self.score is not False:
            default_name = f'{evaluation_name}_score' if include_both else evaluation_name
            _update_combined_output(output, grading_output.score, grading_output.reason, self.score, default_name)

        if self.assertion is not False:
            default_name = f'{evaluation_name}_pass' if include_both else evaluation_name
            _update_combined_output(output, grading_output.pass_, grading_output.reason, self.assertion, default_name)

        return output

    def build_serialization_arguments(self):
        result = super().build_serialization_arguments()
        # always serialize the model as a string when present; use its name if it's a KnownModelName
        if (model := result.get('model')) and isinstance(model, models.Model):  # pragma: no branch
            result['model'] = f'{model.system}:{model.model_name}'

        # Note: this may lead to confusion if you try to serialize-then-deserialize with a custom model.
        # I expect that is rare enough to be worth not solving yet, but common enough that we probably will want to
        # solve it eventually. I'm imagining some kind of model registry, but don't want to work out the details yet.
        return result
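A usage sketch, assuming an OpenAI API key is configured for the judge model; the rubric text is illustrative:

from pydantic_evals.evaluators import LLMJudge

judge = LLMJudge(
    rubric='The response answers the question accurately and politely.',
    include_input=True,      # also show the task inputs to the judge model
    model='openai:gpt-4o',
)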

MaxDuration dataclass

Bases: Evaluator[object, object, object]

Check if the execution time is under the specified maximum.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class MaxDuration(Evaluator[object, object, object]):
    """Check if the execution time is under the specified maximum."""

    seconds: float | timedelta

    def evaluate(self, ctx: EvaluatorContext[object, object, object]) -> bool:
        duration = timedelta(seconds=ctx.duration)
        seconds = self.seconds
        if not isinstance(seconds, timedelta):
            seconds = timedelta(seconds=seconds)
        return duration <= seconds
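A short sketch showing both accepted forms of the limit:

from datetime import timedelta

from pydantic_evals.evaluators import MaxDuration

evaluators = [
    MaxDuration(seconds=0.5),                   # fail cases that take longer than 500 ms
    MaxDuration(seconds=timedelta(seconds=2)),  # a timedelta is accepted as well
]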

OutputConfig

Bases: TypedDict

Configuration for the score and assertion outputs of the LLMJudge evaluator.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
class OutputConfig(TypedDict, total=False):
    """Configuration for the score and assertion outputs of the LLMJudge evaluator."""

    evaluation_name: str
    include_reason: bool
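A sketch of configuring LLMJudge to emit both a named score and a named assertion; since OutputConfig is a TypedDict, plain dict literals work, and the rubric and names are illustrative:

from pydantic_evals.evaluators import LLMJudge

judge = LLMJudge(
    rubric='The summary is faithful to the source text.',
    score={'evaluation_name': 'faithfulness_score', 'include_reason': True},
    assertion={'evaluation_name': 'faithful'},
)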

Python dataclass

Bases: Evaluator[object, object, object]

The output of this evaluator is the result of evaluating the provided Python expression.

WARNING: this evaluator runs arbitrary Python code, so you should NEVER use it with untrusted inputs.

Source code in pydantic_evals/pydantic_evals/evaluators/common.py
@dataclass(repr=False)
class Python(Evaluator[object, object, object]):
    """The output of this evaluator is the result of evaluating the provided Python expression.

    ***WARNING***: this evaluator runs arbitrary Python code, so you should ***NEVER*** use it with untrusted inputs.
    """

    expression: str
    evaluation_name: str | None = field(default=None)

    def evaluate(self, ctx: EvaluatorContext[object, object, object]) -> EvaluatorOutput:
        # Evaluate the condition, exposing access to the evaluator context as `ctx`.
        return eval(self.expression, {'ctx': ctx})
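A sketch: the expression is evaluated with the evaluator context bound to ctx, so any EvaluatorContext attribute can be referenced; the expression and name are illustrative. Only use expressions you wrote yourself.

from pydantic_evals.evaluators import Python

evaluator = Python(expression='len(str(ctx.output)) < 280', evaluation_name='is_concise')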

EvaluatorContext dataclass

Bases: Generic[InputsT, OutputT, MetadataT]

Context for evaluating a task execution.

An instance of this class is the sole input to all evaluators. It contains all the information needed to evaluate the task execution, including the inputs, output, metadata, and telemetry data.

Evaluators use this context to access the task inputs, actual output, expected output, and other information when evaluating the result of the task execution.

Example

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Use the context to access task inputs, outputs, and expected outputs
        return ctx.output == ctx.expected_output

Source code in pydantic_evals/pydantic_evals/evaluators/context.py
@dataclass(kw_only=True)
class EvaluatorContext(Generic[InputsT, OutputT, MetadataT]):
    """Context for evaluating a task execution.

    An instance of this class is the sole input to all Evaluators. It contains all the information
    needed to evaluate the task execution, including inputs, outputs, metadata, and telemetry data.

    Evaluators use this context to access the task inputs, actual output, expected output, and other
    information when evaluating the result of the task execution.

    Example:
    ```python
    from dataclasses import dataclass

    from pydantic_evals.evaluators import Evaluator, EvaluatorContext


    @dataclass
    class ExactMatch(Evaluator):
        def evaluate(self, ctx: EvaluatorContext) -> bool:
            # Use the context to access task inputs, outputs, and expected outputs
            return ctx.output == ctx.expected_output
    ```
    """

    name: str | None
    """The name of the case."""
    inputs: InputsT
    """The inputs provided to the task for this case."""
    metadata: MetadataT | None
    """Metadata associated with the case, if provided. May be None if no metadata was specified."""
    expected_output: OutputT | None
    """The expected output for the case, if provided. May be None if no expected output was specified."""

    output: OutputT
    """The actual output produced by the task for this case."""
    duration: float
    """The duration of the task run for this case."""
    _span_tree: SpanTree | SpanTreeRecordingError = field(repr=False)
    """The span tree for the task run for this case.

    This will be `None` if `logfire.configure` has not been called.
    """

    attributes: dict[str, Any]
    """Attributes associated with the task run for this case.

    These can be set by calling `pydantic_evals.dataset.set_eval_attribute` in any code executed
    during the evaluation task."""
    metrics: dict[str, int | float]
    """Metrics associated with the task run for this case.

    These can be set by calling `pydantic_evals.dataset.increment_eval_metric` in any code executed
    during the evaluation task."""

    @property
    def span_tree(self) -> SpanTree:
        """Get the `SpanTree` for this task execution.

        The span tree is a graph where each node corresponds to an OpenTelemetry span recorded during the task
        execution, including timing information and any custom spans created during execution.

        Returns:
            The span tree for the task execution.

        Raises:
            SpanTreeRecordingError: If spans were not captured during execution of the task, e.g. due to not having
                the necessary dependencies installed.
        """
        if isinstance(self._span_tree, SpanTreeRecordingError):
            # In this case, there was a reason we couldn't record the SpanTree. We raise that now
            raise self._span_tree
        return self._span_tree

name instance-attribute

name: str | None

The name of the case.

inputs instance-attribute

inputs: InputsT

The inputs provided to the task for this case.

metadata instance-attribute

metadata: MetadataT | None

Metadata associated with the case, if provided. May be None if no metadata was specified.

expected_output instance-attribute

expected_output: OutputT | None

The expected output for the case, if provided. May be None if no expected output was specified.

output instance-attribute

output: OutputT

The actual output produced by the task for this case.

duration instance-attribute

duration: float

The duration of the task run for this case.

attributes instance-attribute

attributes: dict[str, Any]

Attributes associated with the task run for this case.

These can be set by calling pydantic_evals.dataset.set_eval_attribute in any code executed during the evaluation task.

metrics instance-attribute

metrics: dict[str, int | float]

Metrics associated with the task run for this case.

These can be set by calling pydantic_evals.dataset.increment_eval_metric in any code executed during the evaluation task.
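A sketch of a custom evaluator that reads a metric recorded during the task run (for example, one incremented via pydantic_evals.dataset.increment_eval_metric); the metric name and threshold are illustrative:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class FewLLMCalls(Evaluator):
    max_calls: int = 3

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # The metric defaults to 0 if the task never recorded it.
        return ctx.metrics.get('llm_calls', 0) <= self.max_calls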

span_tree property

span_tree: SpanTree

Get the SpanTree for this task execution.

The span tree is a graph where each node corresponds to an OpenTelemetry span recorded during the task execution, including timing information and any custom spans created during execution.

Returns:

SpanTree: The span tree for the task execution.

Raises:

SpanTreeRecordingError: If spans were not captured during execution of the task, e.g. due to not having the necessary dependencies installed.

EvaluationReason dataclass

The result of running an evaluator, with an optional explanation.

Contains a scalar value and an optional "reason" explaining the value.

Parameters:

value (EvaluationScalar): The scalar result of the evaluation (boolean, integer, float, or string). Required.

reason (str | None): An optional explanation of the evaluation result. Defaults to None.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@dataclass
class EvaluationReason:
    """The result of running an evaluator with an optional explanation.

    Contains a scalar value and an optional "reason" explaining the value.

    Args:
        value: The scalar result of the evaluation (boolean, integer, float, or string).
        reason: An optional explanation of the evaluation result.
    """

    value: EvaluationScalar
    reason: str | None = None
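A sketch of a custom evaluator returning an EvaluationReason so a report can show why a case failed (assuming EvaluationReason is importable from pydantic_evals.evaluators alongside Evaluator):

from dataclasses import dataclass

from pydantic_evals.evaluators import EvaluationReason, Evaluator, EvaluatorContext


@dataclass
class NonEmptyOutput(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        if ctx.output:
            return EvaluationReason(value=True)
        return EvaluationReason(value=False, reason='the task produced an empty output')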

EvaluationResult dataclass

Bases: Generic[EvaluationScalarT]

The details of an individual evaluation result.

Contains the name, value, reason, and source evaluator for a single evaluation.

Parameters:

name (str): The name of the evaluation. Required.

value (EvaluationScalarT): The scalar result of the evaluation. Required.

reason (str | None): An optional explanation of the evaluation result. Required.

source (EvaluatorSpec): The spec of the evaluator that produced this result. Required.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@dataclass
class EvaluationResult(Generic[EvaluationScalarT]):
    """The details of an individual evaluation result.

    Contains the name, value, reason, and source evaluator for a single evaluation.

    Args:
        name: The name of the evaluation.
        value: The scalar result of the evaluation.
        reason: An optional explanation of the evaluation result.
        source: The spec of the evaluator that produced this result.
    """

    name: str
    value: EvaluationScalarT
    reason: str | None
    source: EvaluatorSpec

    def downcast(self, *value_types: type[T]) -> EvaluationResult[T] | None:
        """Attempt to downcast this result to a more specific type.

        Args:
            *value_types: The types to check the value against.

        Returns:
            A downcast version of this result if the value is an instance of one of the given types,
            otherwise None.
        """
        # Check if value matches any of the target types, handling bool as a special case
        for value_type in value_types:
            if isinstance(self.value, value_type):
                # Only match bool with explicit bool type
                if isinstance(self.value, bool) and value_type is not bool:
                    continue
                return cast(EvaluationResult[T], self)
        return None

downcast

downcast(
    *value_types: type[T],
) -> EvaluationResult[T] | None

Attempt to downcast this result to a more specific type.

Parameters:

*value_types (type[T]): The types to check the value against. Defaults to ().

Returns:

EvaluationResult[T] | None: A downcast version of this result if the value is an instance of one of the given types, otherwise None.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
def downcast(self, *value_types: type[T]) -> EvaluationResult[T] | None:
    """Attempt to downcast this result to a more specific type.

    Args:
        *value_types: The types to check the value against.

    Returns:
        A downcast version of this result if the value is an instance of one of the given types,
        otherwise None.
    """
    # Check if value matches any of the target types, handling bool as a special case
    for value_type in value_types:
        if isinstance(self.value, value_type):
            # Only match bool with explicit bool type
            if isinstance(self.value, bool) and value_type is not bool:
                continue
            return cast(EvaluationResult[T], self)
    return None
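A small sketch: given a hypothetical list of EvaluationResult objects named results, downcast can filter it down to just the boolean assertions:

# `results` is a hypothetical list[EvaluationResult] gathered elsewhere.
assertions = [a for r in results if (a := r.downcast(bool)) is not None]
all_passed = all(a.value for a in assertions)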

Evaluator dataclass

Bases: Generic[InputsT, OutputT, MetadataT]

Base class for all evaluators.

Evaluators can assess the performance of a task in a variety of ways, as a function of the EvaluatorContext.

Subclasses must implement the evaluate method. Note that it can be defined with either def or async def.

Example

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.output == ctx.expected_output

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@dataclass(repr=False)
class Evaluator(Generic[InputsT, OutputT, MetadataT], metaclass=_StrictABCMeta):
    """Base class for all evaluators.

    Evaluators can assess the performance of a task in a variety of ways, as a function of the EvaluatorContext.

    Subclasses must implement the `evaluate` method. Note it can be defined with either `def` or `async def`.

    Example:
    ```python
    from dataclasses import dataclass

    from pydantic_evals.evaluators import Evaluator, EvaluatorContext


    @dataclass
    class ExactMatch(Evaluator):
        def evaluate(self, ctx: EvaluatorContext) -> bool:
            return ctx.output == ctx.expected_output
    ```
    """

    __pydantic_config__ = ConfigDict(arbitrary_types_allowed=True)

    @classmethod
    def get_serialization_name(cls) -> str:
        """Return the 'name' of this Evaluator to use during serialization.

        Returns:
            The name of the Evaluator, which is typically the class name.
        """
        return cls.__name__

    @classmethod
    @deprecated('`name` has been renamed, use `get_serialization_name` instead.')
    def name(cls) -> str:
        """`name` has been renamed, use `get_serialization_name` instead."""
        return cls.get_serialization_name()

    def get_default_evaluation_name(self) -> str:
        """Return the default name to use in reports for the output of this evaluator.

        By default, if the evaluator has an attribute called `evaluation_name` of type string, that will be used.
        Otherwise, the serialization name of the evaluator (which is usually the class name) will be used.

        This can be overridden to get a more descriptive name in evaluation reports, e.g. using instance information.

        Note that evaluators that return a mapping of results will always use the keys of that mapping as the names
        of the associated evaluation results.
        """
        evaluation_name = getattr(self, 'evaluation_name', None)
        if isinstance(evaluation_name, str):
            # If the evaluator has an attribute `name` of type string, use that
            return evaluation_name

        return self.get_serialization_name()

    @abstractmethod
    def evaluate(
        self, ctx: EvaluatorContext[InputsT, OutputT, MetadataT]
    ) -> EvaluatorOutput | Awaitable[EvaluatorOutput]:  # pragma: no cover
        """Evaluate the task output in the given context.

        This is the main evaluation method that subclasses must implement. It can be either synchronous
        or asynchronous, returning either an EvaluatorOutput directly or an Awaitable[EvaluatorOutput].

        Args:
            ctx: The context containing the inputs, outputs, and metadata for evaluation.

        Returns:
            The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping
            of evaluation names to either of those. Can be returned either synchronously or as an
            awaitable for asynchronous evaluation.
        """
        raise NotImplementedError('You must implement `evaluate`.')

    def evaluate_sync(self, ctx: EvaluatorContext[InputsT, OutputT, MetadataT]) -> EvaluatorOutput:
        """Run the evaluator synchronously, handling both sync and async implementations.

        This method ensures synchronous execution by running any async evaluate implementation
        to completion using run_until_complete.

        Args:
            ctx: The context containing the inputs, outputs, and metadata for evaluation.

        Returns:
            The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping
            of evaluation names to either of those.
        """
        output = self.evaluate(ctx)
        if inspect.iscoroutine(output):  # pragma: no cover
            return get_event_loop().run_until_complete(output)
        else:
            return cast(EvaluatorOutput, output)

    async def evaluate_async(self, ctx: EvaluatorContext[InputsT, OutputT, MetadataT]) -> EvaluatorOutput:
        """Run the evaluator asynchronously, handling both sync and async implementations.

        This method ensures asynchronous execution by properly awaiting any async evaluate
        implementation. For synchronous implementations, it returns the result directly.

        Args:
            ctx: The context containing the inputs, outputs, and metadata for evaluation.

        Returns:
            The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping
            of evaluation names to either of those.
        """
        # Note: If self.evaluate is synchronous, but you need to prevent this from blocking, override this method with:
        # return await anyio.to_thread.run_sync(self.evaluate, ctx)
        output = self.evaluate(ctx)
        if inspect.iscoroutine(output):
            return await output
        else:
            return cast(EvaluatorOutput, output)

    @model_serializer(mode='plain')
    def serialize(self, info: SerializationInfo) -> Any:
        """Serialize this Evaluator to a JSON-serializable form.

        Returns:
            A JSON-serializable representation of this evaluator as an EvaluatorSpec.
        """
        return to_jsonable_python(
            self.as_spec(),
            context=info.context,
            serialize_unknown=True,
        )

    def as_spec(self) -> EvaluatorSpec:
        raw_arguments = self.build_serialization_arguments()

        arguments: None | tuple[Any,] | dict[str, Any]
        if len(raw_arguments) == 0:
            arguments = None
        elif len(raw_arguments) == 1:
            arguments = (next(iter(raw_arguments.values())),)
        else:
            arguments = raw_arguments

        return EvaluatorSpec(name=self.get_serialization_name(), arguments=arguments)

    def build_serialization_arguments(self) -> dict[str, Any]:
        """Build the arguments for serialization.

        Evaluators are serialized for inclusion as the "source" in an `EvaluationResult`.
        If you want to modify how the evaluator is serialized for that or other purposes, you can override this method.

        Returns:
            A dictionary of arguments to be used during serialization.
        """
        raw_arguments: dict[str, Any] = {}
        for field in fields(self):
            value = getattr(self, field.name)
            # always exclude defaults:
            if field.default is not MISSING:
                if value == field.default:
                    continue
            if field.default_factory is not MISSING:
                if value == field.default_factory():  # pragma: no branch
                    continue
            raw_arguments[field.name] = value
        return raw_arguments

    __repr__ = _utils.dataclasses_no_defaults_repr

get_serialization_name classmethod

get_serialization_name() -> str

Return the 'name' of this Evaluator to use during serialization.

Returns:

str: The name of the Evaluator, which is typically the class name.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@classmethod
def get_serialization_name(cls) -> str:
    """Return the 'name' of this Evaluator to use during serialization.

    Returns:
        The name of the Evaluator, which is typically the class name.
    """
    return cls.__name__

name classmethod deprecated

name() -> str
Deprecated

name has been renamed, use get_serialization_name instead.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@classmethod
@deprecated('`name` has been renamed, use `get_serialization_name` instead.')
def name(cls) -> str:
    """`name` has been renamed, use `get_serialization_name` instead."""
    return cls.get_serialization_name()

get_default_evaluation_name

get_default_evaluation_name() -> str

Return the default name to use in reports for the output of this evaluator.

By default, if the evaluator has an attribute called evaluation_name of type string, that will be used. Otherwise, the serialization name of the evaluator (which is usually the class name) will be used.

This can be overridden to get a more descriptive name in evaluation reports, e.g. using instance information.

Note that evaluators that return a mapping of results will always use the keys of that mapping as the names of the associated evaluation results. A sketch of the evaluation_name attribute in use follows below.
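A sketch of giving one evaluator class different report names per instance by adding an evaluation_name field; the class, field values, and threshold are illustrative:

from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class MinLength(Evaluator):
    min_length: int
    evaluation_name: str | None = None

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return len(str(ctx.output)) >= self.min_length


# Reported as 'has_enough_detail' rather than the default 'MinLength'.
evaluator = MinLength(min_length=100, evaluation_name='has_enough_detail')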

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
def get_default_evaluation_name(self) -> str:
    """Return the default name to use in reports for the output of this evaluator.

    By default, if the evaluator has an attribute called `evaluation_name` of type string, that will be used.
    Otherwise, the serialization name of the evaluator (which is usually the class name) will be used.

    This can be overridden to get a more descriptive name in evaluation reports, e.g. using instance information.

    Note that evaluators that return a mapping of results will always use the keys of that mapping as the names
    of the associated evaluation results.
    """
    evaluation_name = getattr(self, 'evaluation_name', None)
    if isinstance(evaluation_name, str):
        # If the evaluator has an attribute `name` of type string, use that
        return evaluation_name

    return self.get_serialization_name()

evaluate abstractmethod

evaluate(
    ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput | Awaitable[EvaluatorOutput]

Evaluate the task output in the given context.

This is the main evaluation method that subclasses must implement. It can be either synchronous or asynchronous, returning either an EvaluatorOutput directly or an Awaitable[EvaluatorOutput].

Parameters:

ctx (EvaluatorContext[InputsT, OutputT, MetadataT]): The context containing the inputs, outputs, and metadata for evaluation. Required.

Returns:

EvaluatorOutput | Awaitable[EvaluatorOutput]: The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those. Can be returned either synchronously or as an awaitable for asynchronous evaluation.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@abstractmethod
def evaluate(
    self, ctx: EvaluatorContext[InputsT, OutputT, MetadataT]
) -> EvaluatorOutput | Awaitable[EvaluatorOutput]:  # pragma: no cover
    """Evaluate the task output in the given context.

    This is the main evaluation method that subclasses must implement. It can be either synchronous
    or asynchronous, returning either an EvaluatorOutput directly or an Awaitable[EvaluatorOutput].

    Args:
        ctx: The context containing the inputs, outputs, and metadata for evaluation.

    Returns:
        The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping
        of evaluation names to either of those. Can be returned either synchronously or as an
        awaitable for asynchronous evaluation.
    """
    raise NotImplementedError('You must implement `evaluate`.')

evaluate_sync

evaluate_sync(
    ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput

Run the evaluator synchronously, handling both sync and async implementations.

This method ensures synchronous execution by running any async evaluate implementation to completion using run_until_complete.

Parameters:

ctx (EvaluatorContext[InputsT, OutputT, MetadataT]): The context containing the inputs, outputs, and metadata for evaluation. Required.

Returns:

EvaluatorOutput: The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
def evaluate_sync(self, ctx: EvaluatorContext[InputsT, OutputT, MetadataT]) -> EvaluatorOutput:
    """Run the evaluator synchronously, handling both sync and async implementations.

    This method ensures synchronous execution by running any async evaluate implementation
    to completion using run_until_complete.

    Args:
        ctx: The context containing the inputs, outputs, and metadata for evaluation.

    Returns:
        The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping
        of evaluation names to either of those.
    """
    output = self.evaluate(ctx)
    if inspect.iscoroutine(output):  # pragma: no cover
        return get_event_loop().run_until_complete(output)
    else:
        return cast(EvaluatorOutput, output)

evaluate_async async

evaluate_async(
    ctx: EvaluatorContext[InputsT, OutputT, MetadataT],
) -> EvaluatorOutput

Run the evaluator asynchronously, handling both sync and async implementations.

This method ensures asynchronous execution by properly awaiting any async evaluate implementation. For synchronous implementations, it returns the result directly.

Parameters:

ctx (EvaluatorContext[InputsT, OutputT, MetadataT]): The context containing the inputs, outputs, and metadata for evaluation. Required.

Returns:

EvaluatorOutput: The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping of evaluation names to either of those.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
async def evaluate_async(self, ctx: EvaluatorContext[InputsT, OutputT, MetadataT]) -> EvaluatorOutput:
    """Run the evaluator asynchronously, handling both sync and async implementations.

    This method ensures asynchronous execution by properly awaiting any async evaluate
    implementation. For synchronous implementations, it returns the result directly.

    Args:
        ctx: The context containing the inputs, outputs, and metadata for evaluation.

    Returns:
        The evaluation result, which can be a scalar value, an EvaluationReason, or a mapping
        of evaluation names to either of those.
    """
    # Note: If self.evaluate is synchronous, but you need to prevent this from blocking, override this method with:
    # return await anyio.to_thread.run_sync(self.evaluate, ctx)
    output = self.evaluate(ctx)
    if inspect.iscoroutine(output):
        return await output
    else:
        return cast(EvaluatorOutput, output)

serialize

serialize(info: SerializationInfo) -> Any

Serialize this Evaluator to a JSON-serializable form.

Returns:

Any: A JSON-serializable representation of this evaluator as an EvaluatorSpec.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@model_serializer(mode='plain')
def serialize(self, info: SerializationInfo) -> Any:
    """Serialize this Evaluator to a JSON-serializable form.

    Returns:
        A JSON-serializable representation of this evaluator as an EvaluatorSpec.
    """
    return to_jsonable_python(
        self.as_spec(),
        context=info.context,
        serialize_unknown=True,
    )

build_serialization_arguments

build_serialization_arguments() -> dict[str, Any]

Build the arguments for serialization.

Evaluators are serialized for inclusion as the "source" in an EvaluationResult. If you want to modify how the evaluator is serialized for that or other purposes, you can override this method.

Returns:

dict[str, Any]: A dictionary of arguments to be used during serialization.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
def build_serialization_arguments(self) -> dict[str, Any]:
    """Build the arguments for serialization.

    Evaluators are serialized for inclusion as the "source" in an `EvaluationResult`.
    If you want to modify how the evaluator is serialized for that or other purposes, you can override this method.

    Returns:
        A dictionary of arguments to be used during serialization.
    """
    raw_arguments: dict[str, Any] = {}
    for field in fields(self):
        value = getattr(self, field.name)
        # always exclude defaults:
        if field.default is not MISSING:
            if value == field.default:
                continue
        if field.default_factory is not MISSING:
            if value == field.default_factory():  # pragma: no branch
                continue
        raw_arguments[field.name] = value
    return raw_arguments

EvaluatorFailure dataclass

Represents a failure raised during the execution of an evaluator.

Source code in pydantic_evals/pydantic_evals/evaluators/evaluator.py
@dataclass
class EvaluatorFailure:
    """Represents a failure raised during the execution of an evaluator."""

    name: str
    error_message: str
    error_stacktrace: str
    source: EvaluatorSpec

EvaluatorOutput module-attribute

EvaluatorOutput = (
    EvaluationScalar
    | EvaluationReason
    | Mapping[str, EvaluationScalar | EvaluationReason]
)

The type of an evaluator's output: a scalar, an EvaluationReason, or a mapping of names to either of those.

EvaluatorSpec

Bases: BaseModel

The specification of an evaluator to be run.

This class is used to represent evaluators in a serializable format, supporting various short forms for convenience when defining evaluators in YAML or JSON dataset files.

In particular, each of the following forms is supported for specifying an evaluator with name MyEvaluator (see the sketch after this list):

* 'MyEvaluator' - just the (string) name of the Evaluator subclass, if its __init__ takes no arguments
* {'MyEvaluator': first_arg} - a single argument is passed as the first positional argument to MyEvaluator.__init__
* {'MyEvaluator': {k1: v1, k2: v2}} - multiple keyword arguments are passed to MyEvaluator.__init__
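A sketch of the three short forms being validated into EvaluatorSpec instances, importing from the pydantic_evals.evaluators.spec module where the class is defined; the evaluator names and arguments are illustrative:

from pydantic_evals.evaluators.spec import EvaluatorSpec

specs = [
    EvaluatorSpec.model_validate('EqualsExpected'),       # name only, no arguments
    EvaluatorSpec.model_validate({'Contains': 'Paris'}),  # one positional argument
    EvaluatorSpec.model_validate(                         # keyword arguments
        {'LLMJudge': {'rubric': 'Is polite.', 'include_input': True}}
    ),
]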

Source code in pydantic_evals/pydantic_evals/evaluators/spec.py
class EvaluatorSpec(BaseModel):
    """The specification of an evaluator to be run.

    This class is used to represent evaluators in a serializable format, supporting various
    short forms for convenience when defining evaluators in YAML or JSON dataset files.

    In particular, each of the following forms is supported for specifying an evaluator with name `MyEvaluator`:
    * `'MyEvaluator'` - Just the (string) name of the Evaluator subclass is used if its `__init__` takes no arguments
    * `{'MyEvaluator': first_arg}` - A single argument is passed as the first positional argument to `MyEvaluator.__init__`
    * `{'MyEvaluator': {k1: v1, k2: v2}}` - Multiple kwargs are passed to `MyEvaluator.__init__`
    """

    name: str
    """The name of the evaluator class; should be the value returned by `EvaluatorClass.get_serialization_name()`"""

    arguments: None | tuple[Any] | dict[str, Any]
    """The arguments to pass to the evaluator's constructor.

    Can be None (no arguments), a tuple (a single positional argument), or a dict (keyword arguments).
    """

    @property
    def args(self) -> tuple[Any, ...]:
        """Get the positional arguments for the evaluator.

        Returns:
            A tuple of positional arguments if arguments is a tuple, otherwise an empty tuple.
        """
        if isinstance(self.arguments, tuple):
            return self.arguments
        return ()

    @property
    def kwargs(self) -> dict[str, Any]:
        """Get the keyword arguments for the evaluator.

        Returns:
            A dictionary of keyword arguments if arguments is a dict, otherwise an empty dict.
        """
        if isinstance(self.arguments, dict):
            return self.arguments
        return {}

    @model_validator(mode='wrap')
    @classmethod
    def deserialize(cls, value: Any, handler: ModelWrapValidatorHandler[EvaluatorSpec]) -> EvaluatorSpec:
        """Deserialize an EvaluatorSpec from various formats.

        This validator handles the various short forms of evaluator specifications,
        converting them to a consistent EvaluatorSpec instance.

        Args:
            value: The value to deserialize.
            handler: The validator handler.

        Returns:
            The deserialized EvaluatorSpec.

        Raises:
            ValidationError: If the value cannot be deserialized.
        """
        try:
            result = handler(value)
            return result
        except ValidationError as exc:
            try:
                deserialized = _SerializedEvaluatorSpec.model_validate(value)
            except ValidationError:
                raise exc  # raise the original error
            return deserialized.to_evaluator_spec()

    @model_serializer(mode='wrap')
    def serialize(self, handler: SerializerFunctionWrapHandler, info: SerializationInfo) -> Any:
        """Serialize using the appropriate short-form if possible.

        Returns:
            The serialized evaluator specification, using the shortest form possible:
            - Just the name if there are no arguments
            - {name: first_arg} if there's a single positional argument
            - {name: {kwargs}} if there are multiple (keyword) arguments
        """
        if isinstance(info.context, dict) and info.context.get('use_short_form'):  # pyright: ignore[reportUnknownMemberType]
            if self.arguments is None:
                return self.name
            elif isinstance(self.arguments, tuple):
                return {self.name: self.arguments[0]}
            else:
                return {self.name: self.arguments}
        else:
            return handler(self)

name instance-attribute

name: str

The name of the evaluator class; should be the value returned by EvaluatorClass.get_serialization_name()

arguments instance-attribute

arguments: None | tuple[Any] | dict[str, Any]

The arguments to pass to the evaluator's constructor.

Can be None (no arguments), a tuple (a single positional argument), or a dict (keyword arguments).

args property

args: tuple[Any, ...]

Get the positional arguments for the evaluator.

Returns:

tuple[Any, ...]: A tuple of positional arguments if arguments is a tuple, otherwise an empty tuple.

kwargs property

kwargs: dict[str, Any]

Get the keyword arguments for the evaluator.

Returns:

dict[str, Any]: A dictionary of keyword arguments if arguments is a dict, otherwise an empty dict.

deserialize classmethod

deserialize(
    value: Any,
    handler: ModelWrapValidatorHandler[EvaluatorSpec],
) -> EvaluatorSpec

Deserialize an EvaluatorSpec from various formats.

This validator handles the various short forms of evaluator specifications, converting them to a consistent EvaluatorSpec instance.

Parameters:

value (Any): The value to deserialize. Required.

handler (ModelWrapValidatorHandler[EvaluatorSpec]): The validator handler. Required.

Returns:

EvaluatorSpec: The deserialized EvaluatorSpec.

Raises:

ValidationError: If the value cannot be deserialized.

Source code in pydantic_evals/pydantic_evals/evaluators/spec.py
@model_validator(mode='wrap')
@classmethod
def deserialize(cls, value: Any, handler: ModelWrapValidatorHandler[EvaluatorSpec]) -> EvaluatorSpec:
    """Deserialize an EvaluatorSpec from various formats.

    This validator handles the various short forms of evaluator specifications,
    converting them to a consistent EvaluatorSpec instance.

    Args:
        value: The value to deserialize.
        handler: The validator handler.

    Returns:
        The deserialized EvaluatorSpec.

    Raises:
        ValidationError: If the value cannot be deserialized.
    """
    try:
        result = handler(value)
        return result
    except ValidationError as exc:
        try:
            deserialized = _SerializedEvaluatorSpec.model_validate(value)
        except ValidationError:
            raise exc  # raise the original error
        return deserialized.to_evaluator_spec()

serialize

serialize(
    handler: SerializerFunctionWrapHandler,
    info: SerializationInfo,
) -> Any

Serialize using the appropriate short form if possible.

Returns:

Any: The serialized evaluator specification, using the shortest form possible:

* just the name if there are no arguments
* {name: first_arg} if there is a single positional argument
* {name: {kwargs}} if there are multiple (keyword) arguments

Source code in pydantic_evals/pydantic_evals/evaluators/spec.py
@model_serializer(mode='wrap')
def serialize(self, handler: SerializerFunctionWrapHandler, info: SerializationInfo) -> Any:
    """Serialize using the appropriate short-form if possible.

    Returns:
        The serialized evaluator specification, using the shortest form possible:
        - Just the name if there are no arguments
        - {name: first_arg} if there's a single positional argument
        - {name: {kwargs}} if there are multiple (keyword) arguments
    """
    if isinstance(info.context, dict) and info.context.get('use_short_form'):  # pyright: ignore[reportUnknownMemberType]
        if self.arguments is None:
            return self.name
        elif isinstance(self.arguments, tuple):
            return {self.name: self.arguments[0]}
        else:
            return {self.name: self.arguments}
    else:
        return handler(self)

GradingOutput

Bases: BaseModel

The output of a grading operation.

Source code in pydantic_evals/pydantic_evals/evaluators/llm_as_a_judge.py
class GradingOutput(BaseModel, populate_by_name=True):
    """The output of a grading operation."""

    reason: str
    pass_: bool = Field(validation_alias='pass', serialization_alias='pass')
    score: float

judge_output async

judge_output(
    output: Any,
    rubric: str,
    model: Model | KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput

Judge the output of a model based on a rubric.

If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o', but this can be changed using the set_default_judge_model function.

Source code in pydantic_evals/pydantic_evals/evaluators/llm_as_a_judge.py
async def judge_output(
    output: Any,
    rubric: str,
    model: models.Model | models.KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput:
    """Judge the output of a model based on a rubric.

    If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o',
    but this can be changed using the `set_default_judge_model` function.
    """
    user_prompt = _build_prompt(output=output, rubric=rubric)
    return (
        await _judge_output_agent.run(user_prompt, model=model or _default_model, model_settings=model_settings)
    ).output
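A usage sketch of calling judge_output directly, assuming an OpenAI API key is configured for the default 'openai:gpt-4o' judge; the output and rubric strings are illustrative:

import asyncio

from pydantic_evals.evaluators.llm_as_a_judge import judge_output


async def main():
    grading = await judge_output(
        output='Paris is the capital of France.',
        rubric='The answer names the correct capital city.',
    )
    print(grading.pass_, grading.score, grading.reason)


asyncio.run(main())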

judge_input_output async

judge_input_output(
    inputs: Any,
    output: Any,
    rubric: str,
    model: Model | KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput

Judge the output of a model based on the inputs and a rubric.

If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o', but this can be changed using the set_default_judge_model function.

Source code in pydantic_evals/pydantic_evals/evaluators/llm_as_a_judge.py
async def judge_input_output(
    inputs: Any,
    output: Any,
    rubric: str,
    model: models.Model | models.KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput:
    """Judge the output of a model based on the inputs and a rubric.

    If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o',
    but this can be changed using the `set_default_judge_model` function.
    """
    user_prompt = _build_prompt(inputs=inputs, output=output, rubric=rubric)

    return (
        await _judge_input_output_agent.run(user_prompt, model=model or _default_model, model_settings=model_settings)
    ).output

judge_input_output_expected async

judge_input_output_expected(
    inputs: Any,
    output: Any,
    expected_output: Any,
    rubric: str,
    model: Model | KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput

Judge the output of a model based on the inputs, the expected output, and a rubric.

If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o', but this can be changed using the set_default_judge_model function.

Source code in pydantic_evals/pydantic_evals/evaluators/llm_as_a_judge.py
async def judge_input_output_expected(
    inputs: Any,
    output: Any,
    expected_output: Any,
    rubric: str,
    model: models.Model | models.KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput:
    """Judge the output of a model based on the inputs and a rubric.

    If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o',
    but this can be changed using the `set_default_judge_model` function.
    """
    user_prompt = _build_prompt(inputs=inputs, output=output, rubric=rubric, expected_output=expected_output)

    return (
        await _judge_input_output_expected_agent.run(
            user_prompt, model=model or _default_model, model_settings=model_settings
        )
    ).output

judge_output_expected async

judge_output_expected(
    output: Any,
    expected_output: Any,
    rubric: str,
    model: Model | KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput

Judge the output of a model based on the expected output, the actual output, and a rubric.

If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o', but this can be changed using the set_default_judge_model function.

Source code in pydantic_evals/pydantic_evals/evaluators/llm_as_a_judge.py
async def judge_output_expected(
    output: Any,
    expected_output: Any,
    rubric: str,
    model: models.Model | models.KnownModelName | None = None,
    model_settings: ModelSettings | None = None,
) -> GradingOutput:
    """Judge the output of a model based on the expected output, output, and a rubric.

    If the model is not specified, a default model is used. The default model starts as 'openai:gpt-4o',
    but this can be changed using the `set_default_judge_model` function.
    """
    user_prompt = _build_prompt(output=output, rubric=rubric, expected_output=expected_output)
    return (
        await _judge_output_expected_agent.run(
            user_prompt, model=model or _default_model, model_settings=model_settings
        )
    ).output

set_default_judge_model

set_default_judge_model(
    model: Model | KnownModelName,
) -> None

Set the default model used for judging.

This model is used if None is passed to the model argument of judge_output and judge_input_output.

Source code in pydantic_evals/pydantic_evals/evaluators/llm_as_a_judge.py
def set_default_judge_model(model: models.Model | models.KnownModelName) -> None:  # pragma: no cover
    """Set the default model used for judging.

    This model is used if `None` is passed to the `model` argument of `judge_output` and `judge_input_output`.
    """
    global _default_model
    _default_model = model
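A sketch of switching the default judge once at startup; the model name shown is only an example of a provider:model string and may need to match your installed providers:

from pydantic_evals.evaluators.llm_as_a_judge import set_default_judge_model

# Example model name; substitute any model supported by your environment.
set_default_judge_model('anthropic:claude-3-5-sonnet-latest')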