
Measuring what matters: How offline evaluation of GitHub MCP Server works

MCP (Model Context Protocol) is a simple, common way for AI models (LLMs) to talk to APIs and data. Think of it like a universal plug: if both sides support MCP, they can connect and work together. An MCP server is any service or app that “speaks MCP” and offers tools the model can use, publishing a list of tools, what each tool does, and what inputs (parameters) each tool needs. 
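
For illustration, here’s a minimal sketch (in Python, with a hypothetical tool entry) of the kind of item an MCP server publishes in its tool list: a name, a human-readable description, and a JSON Schema describing the parameters.

```python
# A simplified tool entry of the kind an MCP server publishes in its tool list.
# The shape (name, description, inputSchema) follows MCP's tool listing; the
# specific tool and parameters are illustrative, not an exact copy of the
# GitHub MCP Server's definitions.
get_issue_tool = {
    "name": "get_issue",
    "description": "Get details of a specific issue in a GitHub repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "Repository owner"},
            "repo": {"type": "string", "description": "Repository name"},
            "issue_number": {"type": "integer", "description": "Issue number"},
        },
        "required": ["owner", "repo", "issue_number"],
    },
}
```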

The GitHub MCP Server is the foundation for many GitHub Copilot workflows, both inside and outside of GitHub. As an engineering team working on GitHub MCP, we’re always looking to deliver new features and functionality, while avoiding regressions and improving quality with every iteration. And how we name a tool, explain what it does, and spell out its parameters directly affects whether the model picks the right tool, in the right order, with the right arguments. 

When it comes to our work, small edits matter: tightening a description, adding or removing a tool, or combining a few similar tools can shift results significantly. When descriptions are off, agents choose the wrong tool, skip a step, send arguments in the wrong format, or drop them entirely, and the user gets a worse result. We need a safe way to change MCP and know whether things actually got better, not worse. That’s where offline evaluation comes in.

Offline evaluation catches regressions before users see them and keeps the feedback loop short, so we can ship changes that genuinely improve performance.

This article walks through our evaluation pipeline and explains the metrics and algorithms that help us achieve these goals.

How automated offline evaluation works

Our offline evaluation pipeline checks how well our tool prompts work across different models. The tool instructions are kept simple and precise so the model can choose the right tool and fill in the correct parameters. Because LLMs vary in how they use tools, we systematically test each model–MCP pairing to measure compatibility, quality, and gaps.

We have curated datasets that we use as benchmarks. Every benchmark contains the following fields: 

  1. Input: This is a user request formulated in natural language. 
  2. Expected tools: Tools we expect to be called.
  3. Expected arguments: Arguments we expect to be passed to each tool.

Here are a few examples:

Asking how many issues were created in a given time period

Input:  How many issues were created in the github/github-mcp-server repository during April 2025? 
Expected tools: list_issues with arguments:

owner: github 
repo: github-mcp-server 
since: 2025-04-01T00:00:00Z

Merging pull requests

Input: Merge PR 123 in github/docs using squash merge with title “Update installation guide”
Expected tools: merge_pull_request with arguments:

owner: github
repo: docs 
pullNumber: 123 
merge_method: squash 
commit_title: Update installation guide

Requesting code reviews

Input: Request reviews from alice456 and bob123 for PR 67 in team/project-alpha
Expected tools: update_pull_request with arguments: 

owner: team 
repo: project-alpha 
pullNumber: 67
reviewers: ["alice456", "bob123"]

Summarizing discussion comments

Input: Summarize the comments in discussion 33801, in the facebook/react repository 
Expected tools: get_discussion_comments with arguments:

owner: facebook
repo: react
discussionNumber: 33801
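
In code, a benchmark of this shape can be represented as a small record. The sketch below uses hypothetical field names (not our actual schema) and mirrors the discussion-comments example above.

```python
from dataclasses import dataclass

@dataclass
class Benchmark:
    """One offline-evaluation case: a natural-language input plus the expected call.

    Illustrative shape only; the real benchmark schema may differ.
    """
    input: str                # natural-language user request
    expected_tool: str        # tool we expect the model to call
    expected_arguments: dict  # arguments we expect to be passed to that tool

summarize_discussion = Benchmark(
    input="Summarize the comments in discussion 33801, in the facebook/react repository",
    expected_tool="get_discussion_comments",
    expected_arguments={
        "owner": "facebook",
        "repo": "react",
        "discussionNumber": 33801,
    },
)
```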

The evaluation pipeline has three stages: fulfillment, evaluation, and summarization.

  • Fulfillment: We run each benchmark across multiple models, providing the list of available MCP tools with every request. For each run, we record which tools the model invoked and the arguments it supplied.
  • Evaluation: We process the raw outputs and compute metrics and scores.
  • Summarization: We aggregate dataset-level statistics and produce the final evaluation report.
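
Conceptually, the three stages can be wired together as in the sketch below. It is not the production pipeline; run_model, score_run, and aggregate are hypothetical callables standing in for the real fulfillment, scoring, and reporting logic.

```python
def evaluate(benchmarks, models, tool_list, run_model, score_run, aggregate):
    """Three-stage offline evaluation sketch (illustrative, not production code)."""
    # Fulfillment: run every benchmark against every model, passing the MCP
    # tool list with each request, and record which tools were invoked and
    # with which arguments.
    raw_runs = [
        {"model": model, "benchmark": bench, "calls": run_model(model, bench.input, tool_list)}
        for model in models
        for bench in benchmarks
    ]

    # Evaluation: process the raw outputs into per-run metrics and scores.
    scored = [score_run(run) for run in raw_runs]

    # Summarization: aggregate dataset-level statistics into the final report.
    return aggregate(scored)
```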

Evaluation metrics and algorithms

Our evaluation targets two aspects: whether the model selects the correct tools and whether it supplies correct arguments.

Tool selection

When benchmarks involve a single tool call, tool selection reduces to a multi-class classification problem. Each benchmark is labeled with the tool it expects, and each tool is a “class.”

Models tasked with this classification are evaluated using accuracy, precision, recall, and F1-score.

  1. Accuracy is the simplest measure: the percentage of correct classifications. In our case, it is the percentage of inputs that resulted in the expected tool call, calculated over the whole dataset.
  2. Precision is the proportion of calls to a given tool that were correct, out of all cases where that tool was called. Low precision means the model picks the tool even in cases where it shouldn’t be called.
  3. Recall is the proportion of correct calls to a given tool, out of all cases where that tool was expected to be called. Low recall may indicate that the model doesn’t recognize when the tool is needed and either fails to call it or calls another tool instead.
  4. F1-score is the harmonic mean of precision and recall, summarizing how well the model does on both. (A sketch of how we compute these per-tool metrics follows this list.)
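
Here is a minimal sketch of those per-tool calculations, assuming each benchmark yields one (expected_tool, called_tool) pair; it is not our scoring code, just the arithmetic described above.

```python
def per_tool_metrics(results):
    """results: list of (expected_tool, called_tool) pairs, one per benchmark."""
    tools = {expected for expected, _ in results} | {called for _, called in results}
    accuracy = sum(expected == called for expected, called in results) / len(results)

    metrics = {}
    for tool in tools:
        tp = sum(e == tool and c == tool for e, c in results)  # tool expected and called
        fp = sum(e != tool and c == tool for e, c in results)  # called when another tool was expected
        fn = sum(e == tool and c != tool for e, c in results)  # expected, but a different tool was called
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[tool] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, metrics
```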

If the model confuses two tools, it can result in low precision or recall for these tools.

Two similar tools that used to be confused often are list_issues and search_issues. Say we have 10 benchmarks for list_issues and 10 benchmarks for search_issues. Imagine list_issues is called correctly in all 10 of its cases, and is also called in 30% of the cases where search_issues should be called.

This means we’re going to have lower recall for search_issues and lower precision for list_issues:

Precision (list_issues) = 10 (cases where list_issues is called correctly) / (10 + 3 (cases where it is called instead of search_issues)) ≈ 0.77

Recall (search_issues) = 7 (cases where search_issues was called correctly) / 10 (cases where it was expected to be called) = 0.7

To gain visibility into which tools are confused with each other, we build a confusion matrix. For the search_issues and list_issues tools from the example above, it would look like this:

| Expected tool \ Called tool | search_issues | list_issues |
|-----------------------------|---------------|-------------|
| search_issues               | 7             | 3           |
| list_issues                 | 0             | 10          |

The confusion matrix allows us to see the reason behind low precision and recall for certain tools and tweak their descriptions to minimize confusion.
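
The numbers in the table above can be reproduced with a few lines of code; the sketch below builds the same confusion matrix from synthetic (expected, called) pairs.

```python
from collections import Counter

# Synthetic pairs matching the example: list_issues is always called correctly,
# but it is also called in 3 of the 10 cases where search_issues was expected.
pairs = (
    [("search_issues", "search_issues")] * 7
    + [("search_issues", "list_issues")] * 3
    + [("list_issues", "list_issues")] * 10
)

confusion = Counter(pairs)  # (expected_tool, called_tool) -> count
tools = ["search_issues", "list_issues"]
for expected in tools:
    row = {called: confusion[(expected, called)] for called in tools}
    print(expected, row)
# search_issues {'search_issues': 7, 'list_issues': 3}
# list_issues {'search_issues': 0, 'list_issues': 10}
```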

Argument correctness

Selecting the right tool isn’t enough. The model must also supply correct arguments. We’ve defined a set of argument-correctness metrics that pinpoint specific issues, making regressions easy to diagnose and fix.

We track four argument-quality metrics:

  • Argument hallucination: How often the model supplies argument names that aren’t defined for the tool.
  • All expected arguments provided: Whether every expected argument is present.
  • All required arguments provided: Whether all required arguments are included.
  • Exact value match: Whether provided argument values match the expected values exactly.

These metrics are computed for tools that were correctly selected. The final report summarizes each tool’s performance across all four metrics.
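
As a sketch, the four checks for a single (correctly selected) tool call could look like the following, assuming we know which argument names the tool defines and which are required. This is illustrative, not the production scoring code, and the “perPage” parameter in the example is hypothetical.

```python
def argument_metrics(provided: dict, expected: dict, defined: set, required: set) -> dict:
    """Score the arguments of one correctly selected tool call (illustrative only)."""
    return {
        # Argument hallucination: names the model supplied that the tool doesn't define.
        "hallucinated_args": set(provided) - defined,
        # All expected arguments provided?
        "all_expected_provided": set(expected) <= set(provided),
        # All required arguments provided?
        "all_required_provided": required <= set(provided),
        # Exact value match for every expected argument?
        "exact_value_match": all(provided.get(name) == value for name, value in expected.items()),
    }

# Example using the discussion-comments benchmark from earlier; "perPage" is a
# hypothetical optional parameter included only for illustration.
print(argument_metrics(
    provided={"owner": "facebook", "repo": "react", "discussionNumber": 33801},
    expected={"owner": "facebook", "repo": "react", "discussionNumber": 33801},
    defined={"owner", "repo", "discussionNumber", "perPage"},
    required={"owner", "repo", "discussionNumber"},
))
```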

Looking forward and filling the gaps

The current evaluation framework gives us a solid read on tool performance against curated datasets, but there’s still room to improve.

More is better

Benchmark volume is the weak point of offline evaluation. With so many classes (tools), we need more robust per-tool coverage: evaluations based on just a couple of examples per tool aren’t dependable on their own. Adding more benchmarks steadily increases the reliability of the classification evaluation and the other metrics.

Evaluation of multi-tool flows

Our current pipeline handles only single tool calls. In practice, tools are often invoked sequentially, with later calls consuming the outputs of earlier ones. To evaluate these flows, we must go beyond fetching the MCP tool list and actually execute tool calls (or mock their responses) during evaluation.

We’ll also update summarization. Today we treat tool selection as multi-class classification, which assumes one tool per input. For flows where a single input can trigger multiple tools, multi-label classification is the better fit.

Take this with you

Offline evaluation gives us a fast, safe way to iterate on MCP, so models pick the right GitHub tools with the right arguments. By combining curated benchmarks with clear metrics—classification scores for tool selection and targeted checks for argument quality—we turn vague “it seems better” into measurable progress and actionable fixes.

We’re not stopping here. We’re expanding benchmark coverage, refining tool descriptions to reduce confusion, and extending the pipeline to handle real multi-tool flows with execution or faithful mocks. These investments mean fewer regressions, clearer insights, and more reliable agents that help developers move faster.

Most importantly, this work raises the bar for product quality without slowing delivery. As we grow the suite and deepen the evaluation, you can expect steadier improvements to GitHub MCP Server—and a better, more predictable experience for anyone building with it.
