When running evaluations, you may want to process results programmatically in your script rather than viewing them in the LangSmith UI. This is useful for scenarios like:
  • CI/CD pipelines: Implement quality gates that fail builds if evaluation scores drop below a threshold.
  • Local debugging: Inspect and analyze results without API calls.
  • Custom aggregations: Calculate metrics and statistics using your own logic.
  • Integration testing: Use evaluation results to gate merges or deployments.
This guide shows how to read and process experiment results directly from the Client.evaluate() response.
This page focuses on processing results programmatically while still uploading them to LangSmith. If you want to run evaluations locally without recording anything to LangSmith (for quick testing or validation), refer to Run an evaluation locally, which uses upload_results=False.

Iterate over evaluation results

The evaluate() function returns an iterator when called with blocking=False. This allows you to process results as they’re produced:
from langsmith import Client
import random

client = Client()

def target(inputs):
    """Your application or LLM chain"""
    return {"output": "MY OUTPUT"}

def evaluator(run, example):
    """Your evaluator function"""
    return {"key": "randomness", "score": random.randint(0, 1)}

# Run evaluation with blocking=False to get an iterator
streamed_results = client.evaluate(
    target,
    data="MY_DATASET_NAME",
    evaluators=[evaluator],
    blocking=False
)

# Collect results as they stream in
aggregated_results = []
for result in streamed_results:
    aggregated_results.append(result)

# Print in a separate loop so your output doesn't interleave with logs from evaluate()
for result in aggregated_results:
    print("Input:", result["run"].inputs)
    print("Output:", result["run"].outputs)
    print("Evaluation Results:", result["evaluation_results"]["results"])
    print("--------------------------------")
This produces output like:
Input: {'input': 'MY INPUT'}
Output: {'output': 'MY OUTPUT'}
Evaluation Results: [EvaluationResult(key='randomness', score=1, value=None, comment=None, correction=None, evaluator_info={}, feedback_config=None, source_run_id=UUID('7ebb4900-91c0-40b0-bb10-f2f6a451fd3c'), target_run_id=None, extra=None)]
--------------------------------

Understand the result structure

Each result in the iterator contains the following fields (a short access sketch follows this list):
  • result["run"]: The execution of your target function.
    • result["run"].inputs: The inputs from your dataset example.
    • result["run"].outputs: The outputs produced by your target function.
    • result["run"].id: The unique ID for this run.
  • result["evaluation_results"]["results"]: A list of EvaluationResult objects, one per evaluator.
    • key: The metric name (from your evaluator’s return value).
    • score: The numeric score (typically 0-1 or boolean).
    • comment: Optional explanatory text.
    • source_run_id: The ID of the evaluator run.
  • result["example"]: The dataset example that was evaluated.
    • result["example"].inputs: The input values.
    • result["example"].outputs: The reference outputs (if any).

Example: Implement a quality gate

This example shows how to use evaluation results to automatically pass or fail a CI/CD build based on quality thresholds. The script iterates through results, calculates an average accuracy score, and exits with a non-zero status code if the accuracy falls below 85%. This ensures that only code changes meeting your quality standards get deployed.
from langsmith import Client
import sys

client = Client()

def my_application(inputs):
    # Your application logic
    return {"response": "..."}

def accuracy_evaluator(run, example):
    # Your evaluation logic
    is_correct = run.outputs["response"] == example.outputs["expected"]
    return {"key": "accuracy", "score": 1 if is_correct else 0}

# Run evaluation
results = client.evaluate(
    my_application,
    data="my_test_dataset",
    evaluators=[accuracy_evaluator],
    blocking=False
)

# Calculate aggregate metrics
total_score = 0
count = 0

for result in results:
    eval_result = result["evaluation_results"]["results"][0]
    total_score += eval_result.score
    count += 1

average_accuracy = total_score / count

print(f"Average accuracy: {average_accuracy:.2%}")

# Fail the build if accuracy is too low
if average_accuracy < 0.85:
    print("❌ Evaluation failed! Accuracy below 85% threshold.")
    sys.exit(1)

print("✅ Evaluation passed!")

Example: Collect results for analysis

Sometimes you may want to collect all results before processing them. This is useful when you need to perform operations that require the full dataset (like calculating percentiles, sorting by score, or generating summary reports). Collecting results separately also prevents your output from being mixed with the logging from evaluate().
# Collect all results first
all_results = []
for result in client.evaluate(target, data=dataset, evaluators=[evaluator], blocking=False):
    all_results.append(result)

# Then process them separately
# (This avoids mixing your print statements with evaluation logs)
for result in all_results:
    print("Input:", result["run"].inputs)
    print("Output:", result["run"].outputs)

    # Access individual evaluation results
    for eval_result in result["evaluation_results"]["results"]:
        print(f"  {eval_result.key}: {eval_result.score}")
For more information on running evaluations without uploading results, refer to Run an evaluation locally.