Transfer Pricing Benchmarking Analysis: From Comparable Search to Arm's Length Range

Why most benchmarking studies fail on audit, how the comparable search process works, and what separates a defensible analysis from a mechanical one.

The benchmarking analysis is the part of a transfer pricing study that most people think of as “the study”: finding comparable companies, computing their profitability, and checking whether the tested party’s intercompany pricing falls in the same range.

It is also where the most money is wasted. Not because the math is hard, but because the industry has a structural problem: comparable searches are expensive to run, easy to do poorly, and difficult to evaluate from the outside. A search that produces 15 comparables and a clean interquartile range looks professional on paper. Whether those 15 companies actually resemble the tested party is a question that only gets answered on audit. By then, it is expensive to fix.

This article is built around a simple claim: the quality of your comparables matters more than anything else in a benchmarking study. More than the profit level indicator, more than the statistical method, more than the number of companies in the set. Most of what follows is about how to get that right and what happens when you do not.

What the Output Looks Like

The output of a benchmarking analysis is a set of comparable companies and a range of their profitability. The tested party’s results are compared to that range.

In most TNMM/CPM studies, the profitability indicator is one of three: net cost plus mark-up (operating profit divided by total costs), operating margin (operating profit divided by revenue), or the Berry ratio (gross profit divided by operating expenses). Which one to use depends on the tested party’s functional profile. Net cost plus mark-up is the natural choice for service providers, contract manufacturers, and most routine intercompany functions where the entity is compensated for its costs plus a return. Operating margin works well for distributors. The Berry ratio is less common. It is mainly relevant for intermediary-type activities where the value of the functions performed tracks operating expenses, and it is sensitive to how costs are classified between cost of goods sold and operating expenses.
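As a rough illustration, the three indicators can be computed from a simplified income statement. The sketch below is only an example: the `Financials` class, its field names, and the figures are hypothetical and not drawn from any actual study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Financials:
    """Simplified income statement (hypothetical field names)."""
    revenue: float
    cost_of_goods_sold: float
    operating_expenses: float

    @property
    def gross_profit(self) -> float:
        return self.revenue - self.cost_of_goods_sold

    @property
    def total_costs(self) -> float:
        return self.cost_of_goods_sold + self.operating_expenses

    @property
    def operating_profit(self) -> float:
        return self.revenue - self.total_costs


def net_cost_plus(f: Financials) -> float:
    """Operating profit divided by total costs."""
    return f.operating_profit / f.total_costs


def operating_margin(f: Financials) -> float:
    """Operating profit divided by revenue."""
    return f.operating_profit / f.revenue


def berry_ratio(f: Financials) -> float:
    """Gross profit divided by operating expenses."""
    return f.gross_profit / f.operating_expenses


# Made-up figures: revenue 1,000, COGS 600, operating expenses 300.
example = Financials(revenue=1_000.0, cost_of_goods_sold=600.0, operating_expenses=300.0)
print(f"Net cost plus:    {net_cost_plus(example):.1%}")     # 11.1%
print(f"Operating margin: {operating_margin(example):.1%}")  # 10.0%
print(f"Berry ratio:      {berry_ratio(example):.2f}")       # 1.33
```

The same company produces three different numbers, which is why the indicator has to be fixed before comparables are measured against the tested party.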

The range is the interquartile range: the 25th to 75th percentile, typically computed using a weighted multi-year average. Practices differ by jurisdiction. The United States follows a weighted average approach under the Section 482 regulations (Treas. Reg. §1.482-1 et seq.). Germany sometimes accepts simple averages or single-year observations.
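The difference between a weighted and a simple multi-year average, and the quartile computation itself, can be sketched in a few lines of Python. The function names and figures are illustrative assumptions. The quartiles use `statistics.quantiles` with the `inclusive` method (linear interpolation between data points), which is one convention among several and may not match the percentile rule a given tax authority prescribes.

```python
from statistics import mean, quantiles


def weighted_average_pli(years):
    """Pooled ratio: total operating profit over total costs across all years."""
    total_profit = sum(profit for profit, _ in years)
    total_costs = sum(costs for _, costs in years)
    return total_profit / total_costs


def simple_average_pli(years):
    """Unweighted mean of the yearly ratios."""
    return mean(profit / costs for profit, costs in years)


def interquartile_range(plis):
    """Return (25th percentile, median, 75th percentile) of the comparables' PLIs."""
    q1, median, q3 = quantiles(plis, n=4, method="inclusive")
    return q1, median, q3


# One hypothetical comparable: (operating profit, total costs) for three years.
one_comparable = [(10.0, 100.0), (30.0, 200.0), (5.0, 100.0)]
weighted = weighted_average_pli(one_comparable)  # 45 / 400
simple = simple_average_pli(one_comparable)      # mean of 10%, 15%, 5%

# Hypothetical multi-year average PLIs for a set of seven comparables.
comparable_plis = [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14]
q1, median, q3 = interquartile_range(comparable_plis)
print(f"weighted {weighted:.4f}, simple {simple:.4f}")
print(f"range {q1:.2%} to {q3:.2%}, median {median:.2%}")
```

In this made-up case the weighted average is higher than the simple one because the largest year is also the most profitable, so the choice of averaging convention changes the input to the range before any quartile is computed.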

To put this in concrete terms: a benchmarking search for a US entity providing software engineering services to its foreign parent might produce an interquartile range of roughly 3% to 11% net cost plus mark-up. If the tested party charges a 7% mark-up over its total costs, it falls within the range, and the pricing is generally considered arm’s length. If it charges 1%, an adjustment to the median would be expected. The math at this stage is mechanical. The judgment was spent earlier, in deciding which companies belong in the set.
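The comparison step can be written out as a short Python sketch. The 3% to 11% range and the 7% and 1% mark-ups come from the example above. The 6.5% median and the 1,000,000 cost base are assumptions added purely for illustration, as is the function name.

```python
def assess_tested_party(tested_pli, q1, median, q3, cost_base):
    """Compare a tested party's net cost plus mark-up to the interquartile range.

    Returns whether the result is inside the range and, if not, the change in
    operating profit needed to bring the mark-up to the median.
    """
    if q1 <= tested_pli <= q3:
        return {"within_range": True, "adjustment": 0.0}
    target_profit = median * cost_base
    actual_profit = tested_pli * cost_base
    return {"within_range": False, "adjustment": target_profit - actual_profit}


Q1, MEDIAN, Q3 = 0.03, 0.065, 0.11  # median is a hypothetical value
COST_BASE = 1_000_000.0             # hypothetical total costs of the tested party

inside = assess_tested_party(0.07, Q1, MEDIAN, Q3, COST_BASE)
outside = assess_tested_party(0.01, Q1, MEDIAN, Q3, COST_BASE)
print(inside)
print(f"adjustment needed: {outside['adjustment']:,.0f}")
```

Under these assumed figures, the 1% case implies an upward profit adjustment of about 55,000, the gap between a 6.5% and a 1% mark-up on the same cost base.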

The Comparable Search: Where the Judgment Lives

A benchmarking search follows a standard workflow, but executing it well requires judgment at every step. For a detailed walkthrough of the screening process itself, see our guide on how to select comparable companies.

Define the search strategy. Before opening any database, the search criteria need to follow from the functional analysis. Which industry classification system: SIC, NAICS, or NACE? What geographic scope? What size thresholds? What independence criteria filter out related-party transactions? These decisions determine what the database returns. Get them wrong, and no amount of manual review will fix it.

Run the quantitative screens. Apply the criteria in a commercial database (Orbis, Capital IQ, Bloomberg) or, for US-focused searches, SEC EDGAR filings. This typically reduces a universe of hundreds of thousands of companies to a few hundred candidates. CompPress builds its pre-built benchmarking studies from SEC EDGAR filings and financial data sourced through Financial Modeling Prep, covering ten intercompany service profiles (including contract manufacturing, software engineering, contract R&D, distribution, sales and marketing, and administrative support) with FY 2022-2024 financials and net cost plus mark-up as the primary PLI.
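A minimal sketch of how quantitative screens might be applied while recording a reason for every rejection is shown below. It does not represent any database's interface: the `Candidate` fields, the thresholds, and the company names are all hypothetical, and real searches typically use more criteria than these four.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One company returned by a database query (hypothetical fields)."""
    name: str
    naics_code: str
    revenue_musd: float
    is_independent: bool
    years_of_data: int


def run_screens(candidates, naics_prefixes, min_revenue, max_revenue, min_years):
    """Apply screens in a fixed order and log the first failed screen per company."""
    accepted, rejected = [], {}
    for c in candidates:
        if not c.naics_code.startswith(tuple(naics_prefixes)):
            rejected[c.name] = "industry code outside search strategy"
        elif not min_revenue <= c.revenue_musd <= max_revenue:
            rejected[c.name] = "revenue outside size thresholds"
        elif not c.is_independent:
            rejected[c.name] = "fails independence criterion"
        elif c.years_of_data < min_years:
            rejected[c.name] = "insufficient years of financial data"
        else:
            accepted.append(c)
    return accepted, rejected


universe = [
    Candidate("Alpha Co", "541511", 250.0, True, 3),
    Candidate("Beta Co", "334413", 400.0, True, 3),
    Candidate("Gamma Co", "541512", 90.0, False, 3),
    Candidate("Delta Co", "541511", 120.0, True, 1),
]
accepted, rejected = run_screens(
    universe, naics_prefixes=["5415"], min_revenue=10.0, max_revenue=1_000.0, min_years=3
)
print([c.name for c in accepted])
for name, reason in rejected.items():
    print(f"{name}: {reason}")
```

Keeping the rejection log alongside the accepted list means the screening step can be reproduced later, which matters once the manual review starts removing companies for reasons a database cannot see.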

Manual review. Each remaining candidate is reviewed individually: business description, financial profile, product mix, customer base, risk profile. This is the step that separates a credible study from a mechanical one. A significant share of the companies that pass quantitative screens turn out, on closer inspection, to operate in different segments, carry different risk profiles, or derive revenue from activities that are not comparable to the tested party. Rejecting them, with specific documented reasons, is what makes the final set defensible.

Financial analysis. For the accepted comparables, compute the profit level indicator for each year in the testing period (typically three to five years), determine the interquartile range, and compare. One decision that matters here: how to handle loss-making years. Some practitioners exclude persistent loss makers on the basis that they are not comparable to a going-concern tested party. Others include them to avoid biasing the range upward. The approach should be documented and applied consistently.
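One way to make the loss-maker decision explicit and consistent is to encode it as a single rule applied to every comparable. In the sketch below, the definition of "persistent" (more than one loss year in the testing period) is an assumption chosen for illustration, and the company names and figures are hypothetical.

```python
def split_persistent_loss_makers(pli_by_company, max_loss_years=1):
    """Separate comparables with more than `max_loss_years` loss-making years.

    `pli_by_company` maps a company name to its yearly PLIs. The threshold is
    a policy choice that should be documented in the study.
    """
    kept, excluded = {}, {}
    for name, yearly_plis in pli_by_company.items():
        loss_years = sum(1 for pli in yearly_plis if pli < 0)
        if loss_years > max_loss_years:
            excluded[name] = yearly_plis
        else:
            kept[name] = yearly_plis
    return kept, excluded


yearly = {
    "Comparable A": [0.06, 0.08, 0.05],     # profitable every year
    "Comparable B": [-0.02, 0.04, 0.07],    # one loss year, retained
    "Comparable C": [-0.05, -0.03, -0.01],  # losses in every year
}
kept, excluded = split_persistent_loss_makers(yearly, max_loss_years=1)
print(sorted(kept))
print(sorted(excluded))
```

Setting `max_loss_years` to the length of the testing period retains every company, so the same function can express either of the two practices described above.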

The entire process takes two to six weeks and requires database subscriptions that cost thousands per year. For routine profiles that have been benchmarked many times before, pre-built studies can replace the custom search and cut the timeline to hours.

Why US Comparables Dominate (and When to Be Careful)

Most benchmarking studies rely on US public company data, and the reason is practical: SEC EDGAR provides the deepest pool of publicly available, standardized financial data anywhere. The number of listed companies, the granularity of segment reporting, and the consistency of disclosures are unmatched.

The OECD Guidelines support cross-border use of comparables. They require comparability of functions, assets, and risks, not geography. A US public company that provides contract R&D services or distributes electronics with a limited-risk profile can be a valid comparable for a tested party in Canada, Mexico, or many European jurisdictions, provided the functional match is documented.

Some jurisdictions push back. Germany has historically preferred European comparables. India often expects Indian data. But even in these cases, US data frequently serves as a supplementary source when local comparables are insufficient.

There is a legitimate concern that practitioners should address rather than ignore: US public companies tend to report higher operating margins than private companies performing similar functions. This is partly a scale effect and partly survivorship bias in public company data. A benchmarking study built exclusively on US public comparables may produce a range that sits higher than what a private tested party would achieve in normal conditions. This does not disqualify the data, but it should be acknowledged in the study. A tax authority that spots the issue before the taxpayer does will treat it as an oversight, not a nuance.

Comparability Adjustments: The Step That Actually Moves the Range

Even a well-selected comparable set includes companies with different working capital profiles. One comparable collects receivables in 30 days, another in 90. One carries heavy inventory, another almost none. These differences directly affect operating margins.

Working capital adjustments correct for this. The mechanics are standard: compute the working capital intensity (accounts receivable plus inventory minus accounts payable, as a percentage of revenue or costs) for each comparable and the tested party, then adjust operating margins using a reference interest rate to neutralize the difference.
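The mechanics can be illustrated with a short Python sketch of one common formulation: the difference in working capital intensity between the tested party and a comparable, multiplied by an interest rate, is added to the comparable's margin. All balances, the 5% rate, and the function names are assumptions for illustration. In practice the denominator should match the one used in the profit level indicator.

```python
def working_capital_intensity(receivables, inventory, payables, denominator):
    """(Accounts receivable + inventory - accounts payable) over revenue or costs."""
    return (receivables + inventory - payables) / denominator


def adjust_comparable_pli(comparable_pli, comparable_intensity, tested_intensity, interest_rate):
    """Restate a comparable's PLI as if it carried the tested party's working capital."""
    adjustment = (tested_intensity - comparable_intensity) * interest_rate
    return comparable_pli + adjustment


# Hypothetical balances, each measured against revenue of 1,000.
tested_intensity = working_capital_intensity(150.0, 50.0, 100.0, 1_000.0)       # 10%
comparable_intensity = working_capital_intensity(300.0, 100.0, 150.0, 1_000.0)  # 25%

adjusted = adjust_comparable_pli(
    comparable_pli=0.08,
    comparable_intensity=comparable_intensity,
    tested_intensity=tested_intensity,
    interest_rate=0.05,  # assumed reference rate
)
print(f"unadjusted 8.00%, adjusted {adjusted:.2%}")
```

Here the comparable ties up more working capital than the tested party, so its margin is adjusted downward by 0.75 percentage points. A comparable with less working capital than the tested party would be adjusted upward.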

The impact is real. Across the studies in the CompPress library, working capital adjustments routinely shift individual comparables’ margins by one to three percentage points, and can move the interquartile range boundaries by a full point. For a tested party sitting near the edge of the range, that shift determines the outcome.

Most jurisdictions consider working capital adjustments standard practice. The US regulations explicitly provide for them. The OECD Guidelines encourage comparability adjustments where they improve reliability. Omitting them when the data supports them gives a tax authority a straightforward basis to challenge the study.

Where Studies Break Down

Most benchmarking studies that fail on audit fail for the same reason: the comparables are not comparable. The calculation is fine. The set is wrong.

The functional profile does not match. The tested party provides contract R&D services, but the accepted comparables include companies that own the resulting IP and bear commercialization risk. Or the tested party is a limited-risk distributor, but the comparables own significant inventory and set their own pricing. Tax authorities catch this quickly. They have seen the same companies show up in comparable sets across dozens of audits, sometimes correctly characterized, sometimes not.

The screening is too loose. Broad activity codes and minimal filters produce a large set that looks thorough. But a set of 25 loosely comparable companies tends to produce a wider, less defensible range than a tight set of 12.

Equally common but less discussed: the manual review is skipped or rushed. Quantitative screens are not enough. They filter by industry code and size, not by what a company actually does. Auditors will read the business descriptions of your accepted comparables. If the person who prepared the study did not, the gap will be visible.

Stale data is a problem that compounds over time. A 2026 tested year benchmarked against 2022 comparables invites challenge, especially in industries where margins have shifted meaningfully.

And then there is cherry-picking: rejecting comparables without documented reasons, or applying ad hoc filters that happen to narrow the range favorably. Tax authorities recognize this pattern. It undermines everything else in the study.

What a Defensible Study Looks Like

The search strategy follows from the functional analysis. Screening criteria are documented and applied consistently. Rejection reasons are specific (“company generates over 60% of revenue from proprietary software products rather than engineering services, inconsistent with tested party’s contract service function”). The comparable set is tight, the financial data is current, working capital adjustments are applied, and the tested party’s segmented financials are used for the comparison.

For a step-by-step walkthrough, our free benchmarking guide covers the methodology in detail.

A Note on Methods Beyond TNMM/CPM

This article has focused on TNMM/CPM benchmarking because it covers the majority of studies. Two other methods involve benchmarking of a different kind.

The CUP method benchmarks prices directly: what did independent parties charge for the same or similar product? When reliable CUP data exists, it is the most direct test. But reliable CUPs are rare outside commodity markets and standardized financial transactions.

The profit split method divides combined profits based on each entity’s relative contributions. Benchmarking here means identifying how independent parties in similar arrangements split value, a more complex exercise. For a comparison of these methods, see our article on TNMM vs CPM.

A Closing Note

Benchmarking is the most expensive part of a transfer pricing study. It is also the part where spending more money does not automatically produce a better result. A larger comparable set is not a better one. A more expensive database does not guarantee better comparables. What matters is whether the companies in the final set actually resemble the tested party. Everything else is mechanics.

Frequently Asked Questions

What databases are used for transfer pricing benchmarking?

The most widely used databases are Bureau van Dijk's Orbis and TP Catalyst, S&P Capital IQ, and Bloomberg. For US-focused searches, SEC EDGAR filings provide publicly available financial data for listed companies. Regional databases exist for specific markets. Database subscriptions are expensive, which is one reason benchmarking is often the most costly component of a transfer pricing study.

How many comparables are needed for a benchmarking study?

There is no fixed minimum, but most practitioners aim for at least 10 to 15 accepted comparables to produce a statistically meaningful interquartile range. Smaller sets are acceptable if the comparables are genuinely close matches, but they invite more scrutiny from tax authorities. A carefully selected set of 10 companies is stronger than a loosely screened set of 30.

What is the interquartile range in transfer pricing?

The interquartile range spans the 25th to 75th percentile of the profit level indicators of the comparable companies. If the tested party's results fall within this range, the intercompany pricing is generally considered arm's length. Results outside the range typically require an adjustment to the median.

What are working capital adjustments in benchmarking?

Working capital adjustments account for differences in accounts receivable, accounts payable, and inventory levels between the tested party and the comparables. These differences affect profitability: the adjustments can shift individual comparables' margins by one to three percentage points and move the boundaries of the arm's length range by as much as a full point. Most jurisdictions consider them standard practice in TNMM/CPM studies.

Can US comparables be used for tested parties in other countries?

In many cases, yes. The OECD Guidelines do not restrict comparables by geography, and US public companies offer the deepest pool of standardized financial data available. Cross-border use of US comparables is common, particularly when local comparables are insufficient. However, some jurisdictions prefer domestic comparables. Germany, for example, has historically favored European searches. The key is to document the rationale for the geographic scope chosen.