The Infrastructure of Evidence

§01 · Lede

Rooftop Digital ran email at scale — roughly 20 mailer accounts across three ESPs, sending across 24 verticals. At that volume, creative teams were producing constantly. What the operation lacked was shared infrastructure for understanding how creatives were performing.

Each mailer account worked largely in isolation. There was no systematic way to know which layouts, elements, or placements were helping click-to-open rate (CTOR) or suppressing it. Click tracking, where it existed, was mostly limited to one or two elements: the main CTA button, occasionally the logo. Analytics could see performance data at the mailer level, but there was no consistent system for viewing it globally, and so no way to identify patterns across campaigns, ESPs, or verticals. A/B testing happened occasionally but without structure: send times varied, audience cohorts differed, and there were no shared standards for statistical significance thresholds.

The remit: extend existing visibility into shared infrastructure. Three parallel workstreams, all aimed at the same outcome: making email performance measurable, testable, and actionable at scale.

§02 · Hypothesis

Three parallel investments.

Three hypotheses shaped the build — each targeting a different layer of the measurement gap.

First, that a library of standardized templates, tracked consistently across mailer accounts, ESPs, and verticals, could build an evidence base that ad-hoc production had no mechanism to produce. Not which single creative won a given send, but which structural choices — layouts, element types, placements — were reliably helping or hurting CTOR across the program.

Second, that click tracking needed to expand beyond the CTA button. Additional clickable elements were being introduced — secondary buttons, clickable micro-copy — but without visibility into how users engaged with them, there was no way to understand which approaches to messaging and placement were actually working, or where in the creative the audience’s attention was landing.

Third, that A/B testing at this scale required a formal process to be worth running. Without consistent methodology — defined thresholds per mailer account and audience, scheduled cadences, a shared roadmap — tests couldn’t be compared, results couldn’t compound, and the creative team had no reliable way to translate findings into the next round of work.

Together, the three were designed to do something none of them could do alone: give the program a feedback loop with enough resolution to make optimization deliberate rather than incidental.

§03 · The Work

Templates, tracking, and a testing process — built in parallel.

Templates

The starting point was the creative library itself. Rather than being designed from scratch, the initial set was built from creatives that had already demonstrated performance, then standardized into a consistent structure. The first round produced 12 layouts, each in a light and a dark variant, for 24 wireframes total. Before rollout, five of those templates were A/B tested against the source creatives they were built on: a pre-launch sanity check to confirm the standardized versions held up before mailer accounts were asked to adopt them.

Getting the system into production required more than design work. Taxonomy was developed in close collaboration with analytics — naming conventions that could be tracked consistently across ESPs and surfaced cleanly in Tableau. That naming structure was then integrated into the existing creative request and development process so the templates entered the workflow without creating a parallel system. Rollout was coordinated across analytics, creative compliance, writing, development, and the mailing teams: the purpose, the approach, and the tracking logic all communicated before anything went live.

The naming system also addressed an existing gap in the broader creative taxonomy. Previously, creatives not tied to a specific project used a generic default code — a catch-all that represented the majority of sends but conveyed nothing about the creative itself. The template naming convention occupied that same space, giving any creative an identifying mark at a glance. Once the initial templates proved out, a testing-phase convention was introduced using the same logic: new concepts — from the template program or any other initiative — were flagged under a distinct provisional marker rather than defaulting to the generic code. Templates that didn’t perform were retired; those that did graduated into the standard set. By the time the program matured, it carried 24 standard templates and 15 in the testing phase.

Click Tracking

The first step was a technical feasibility conversation — understanding what tracking was possible across different ESPs and mailer accounts, and how that data could be routed into Tableau in a usable form. Deliverability impact was tested early: the tracking code was run through initial checks before any broader rollout. Once the approach was confirmed, analytics joined to shape the taxonomy: data group naming that would support filtering by ESP, vertical, campaign, and creative.

The initial build identified approximately 40 data groups — every meaningful clickable element and placement type in active use. A visual reference guide was produced alongside the taxonomy: annotated wireframes with elements labeled by data group, organized into categories, and distributed across teams as a working reference.
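One way to picture how a data group might travel with a click is as a label attached to each link at build time. The sketch below is illustrative only: the group names, the `dg` query parameter, and the `tag_link` helper are assumptions, not the tracking code the program actually used, and ESP-level link tracking may work quite differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical data groups; the real taxonomy held roughly 40 of these.
DATA_GROUPS = {"cta_primary", "cta_secondary", "logo", "microcopy_link"}


def tag_link(url: str, data_group: str, param: str = "dg") -> str:
    """Return `url` with a data-group label added as a query parameter.

    The parameter name `dg` is an assumption made for this example.
    """
    if data_group not in DATA_GROUPS:
        # Rejecting unknown labels keeps the taxonomy and the creatives in sync.
        raise ValueError(f"unknown data group: {data_group}")
    parts = urlsplit(url)
    # Drop any existing label so a link never carries two data groups.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, data_group))
    return urlunsplit(parts._replace(query=urlencode(query)))


tagged = tag_link("https://example.com/offer?src=email", "cta_primary")
```

Validating against a fixed set of names is the design point here: a label that is not in the taxonomy fails at build time instead of surfacing later as an unfilterable value in reporting.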

Implementing the tracking required some additional work in the creative development and QA process, so a separate lightweight system was set up to manage that overhead and keep the lift on individual contributors minimal. On the primary ESP, the tracking system could be configured globally — one setup, applied across all mailer accounts. On a second ESP, it had to be implemented per mailer account; the rollout was prioritized by impact, reaching the highest-volume accounts first. The system reached approximately 80% of mailer accounts overall.

The click tracking and template systems were designed to stay in sync: when a new template introduced a new element or placement, a corresponding data group was added to the tracking taxonomy. Cross-team coordination became an ongoing practice, with documentation and element tracking kept current as the systems evolved.

A/B Testing

The A/B testing process was driven by a clear gap: testing was infrequent and inconsistent — when it happened, there was no shared methodology, no defined thresholds, and no way to build on what had run before. The goal was a shared system — one where test parameters were defined in advance, statistical significance thresholds were set per mailer account and audience size, and results could be compared across campaigns.
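A common way to express a significance threshold for a CTOR comparison is a two-proportion z-test. The sketch below is one such check under stated assumptions: the source does not say which test was used, and the account names and alpha values in `ALPHA_BY_ACCOUNT` are hypothetical.

```python
import math

# Hypothetical per-account thresholds; real values would be agreed with analytics.
ALPHA_BY_ACCOUNT = {"account_high_volume": 0.01, "account_low_volume": 0.10}
DEFAULT_ALPHA = 0.05


def two_proportion_z_test(clicks_a, opens_a, clicks_b, opens_b):
    """Return (z, two-sided p-value) for the CTOR difference between B and A."""
    if opens_a <= 0 or opens_b <= 0:
        raise ValueError("both variants need at least one open")
    ctor_a = clicks_a / opens_a
    ctor_b = clicks_b / opens_b
    # Pooled rate under the null hypothesis that both variants share one CTOR.
    pooled = (clicks_a + clicks_b) / (opens_a + opens_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / opens_a + 1 / opens_b))
    if std_err == 0:
        return 0.0, 1.0
    z = (ctor_b - ctor_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def is_significant(account, clicks_a, opens_a, clicks_b, opens_b):
    """Compare the p-value with the (hypothetical) threshold for this account."""
    alpha = ALPHA_BY_ACCOUNT.get(account, DEFAULT_ALPHA)
    _, p_value = two_proportion_z_test(clicks_a, opens_a, clicks_b, opens_b)
    return p_value < alpha
```

Fixing the threshold per account before a test runs is what makes results comparable afterwards: the decision rule cannot drift to fit the outcome.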

The build was a close collaboration with a counterpart on the analytics team. The process was formalized around a testing roadmap: a prioritized queue of concepts — new elements, messaging approaches, layout variations — with scheduled testing deadlines built around a quarterly cadence. Analytics owned the logistics of getting tests run — coordinating with mailing teams on timing, audience cohorts, and send volumes. The creative side owned the test concepts: defining the variables, identifying top performers to test against, and filling the queue. Beyond the creative team’s testing, analytics used the system to run their own tests when specific performance questions arose, and marketing operations submitted occasional requests when something was worth putting to a test.

One test structure that proved particularly useful was pairing the click tracking system with A/B variants: measuring not just which version performed better overall, but where in each creative users were engaging differently.
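As a rough illustration of that pairing, click events labeled with both a test variant and a data group can be reduced to each element's share of clicks within each variant. The event shape and the group names below are hypothetical; this is a sketch, not the reporting logic that ran in Tableau.

```python
from collections import Counter


def click_share(events):
    """Return {variant: {data_group: share of that variant's clicks}}.

    `events` is an iterable of (variant, data_group) pairs, one per click.
    """
    counts = {}
    for variant, data_group in events:
        counts.setdefault(variant, Counter())[data_group] += 1
    return {
        variant: {
            group: clicks / sum(group_counts.values())
            for group, clicks in group_counts.items()
        }
        for variant, group_counts in counts.items()
    }


# Hypothetical click log for two variants of one creative.
events = (
    [("A", "cta_primary")] * 3
    + [("A", "microcopy_link")]
    + [("B", "cta_primary")] * 2
    + [("B", "microcopy_link")] * 2
)
shares = click_share(events)
```

Comparing shares rather than raw counts keeps the two variants comparable even when their send volumes differ.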

Fig. 01 · Three-system performance infrastructure · Rooftop Digital, 2020–2025

L1 · Foundation: Standardized creative library
24 standard templates; 15 testing-phase templates; vertical performance matrix; creative request integration

L2 · Signal: Element-level engagement data
~50 click tracking data groups; ESP-wide configuration; visual reference guide; Tableau-ready taxonomy

L3 · Validation: Structured test process
Quarterly testing roadmap; stat sig thresholds per mailer; cross-team test requests; click + A/B integration

L4 · Output: Performance-informed creative program
Evidence-based creative selection; underperformer retirement; validated findings; compounding evidence base

§04 · The Lesson

What the infrastructure produced.

The template system returned results relatively quickly. Dark variants underperformed across the program, consistently enough that they were removed from the standard set. Over time, four additional templates were retired as the evidence accumulated. The finding on dark variants came with a caveat worth stating: an isolated dark creative could still outperform its alternatives in specific conditions, so the elimination wasn’t absolute. The template system wasn’t built to find outliers; it was built to find consistency at volume, and that’s what it found.

Patterns emerged across verticals, and once the evidence was strong enough, a performance matrix was built mapping which templates performed above average, which were neutral, and which underperformed, broken down by vertical. That matrix was integrated into the creative request process: a defined percentage of requested creatives per vertical had to come from above-average templates, with neutral templates available as an option and underperforming templates excluded. The matrix was reviewed on a regular cadence and updated as the data warranted. Across the program, templates outperformed non-template creatives by an average of 18.75% on CTOR.
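The matrix logic described above can be sketched as a per-vertical classification. This is a minimal illustration, assuming a simple rule: a template's mean CTOR is compared with the mean across templates in the same vertical, with a 5% tolerance band. The actual thresholds are not given in the source, and the vertical names and template IDs are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_matrix(rows, band=0.05):
    """Label each template per vertical as 'above', 'neutral', or 'under'.

    `rows` is an iterable of (vertical, template, ctor) observations.
    `band` is an assumed tolerance around the vertical's baseline.
    """
    by_vertical = defaultdict(lambda: defaultdict(list))
    for vertical, template, ctor in rows:
        by_vertical[vertical][template].append(ctor)

    matrix = {}
    for vertical, templates in by_vertical.items():
        template_means = {t: mean(values) for t, values in templates.items()}
        # Baseline: the average of the template means within this vertical.
        baseline = mean(template_means.values())
        labels = {}
        for template, value in template_means.items():
            if value > baseline * (1 + band):
                labels[template] = "above"
            elif value < baseline * (1 - band):
                labels[template] = "under"
            else:
                labels[template] = "neutral"
        matrix[vertical] = labels
    return matrix


# Hypothetical observations for a single vertical.
rows = [
    ("finance", "T01", 0.20),
    ("finance", "T02", 0.10),
    ("finance", "T03", 0.15),
]
matrix = build_matrix(rows)
```

A request rule like the one described could then draw its required share from the templates labeled "above", allow "neutral", and exclude "under", with the labels recomputed on each review cycle.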

The click tracking system surfaced findings that wouldn’t have been visible otherwise. Some confirmed expectations: the main CTA button drew the majority of clicks in most cases. Others didn’t. Button placement, button count, and specific approaches to micro-copy links produced results that challenged default assumptions, including cases where a micro-copy link rivaled the main CTA’s click volume and, in some cases, exceeded it. The system also provided a layer of insight beyond raw CTOR: where users were engaging within the creative began to indicate something about their motivations, not just their behavior. Each new concept added to the tracking taxonomy extended that visibility.

The A/B testing process gave the program a reliable mechanism for validating findings before rolling them out at scale. Wins from the click tracking system, such as element placements and micro-copy link approaches, moved into A/B tests and were confirmed or qualified before becoming standard practice. The micro-copy link finding was tested across multiple verticals: it held in the majority but not all, and in a small number it had a negative impact, which the process caught before the approach was adopted broadly.

Together, the three systems formed a feedback loop where none had existed. Performance data informed template selection, click tracking identified what to test, and the A/B process validated what was worth keeping. The program reduced the amount of send volume going to creatives without evidence behind them — and increased the share going to approaches with a demonstrated track record. No single metric captures that cleanly across a program of this scale and complexity, but the direction was consistent: more signal, less noise, and a compounding body of evidence that each successive test could build on.