How-To Guide · 35 min read

How to A/B Test App Store Screenshots for Conversion

A pragmatic guide to A/B testing App Store and Google Play screenshots: variant generation via API, sequential vs concurrent testing, sample sizing, and reading results from Apple Product Page Optimization and Google Play Store Listing Experiments.

Eric Isensee, Founder · Last updated May 5, 2026

TL;DR

Pick one variable. Generate variants by duplicating templates and changing one element. Render via API, upload to App Store Connect (PPO) or Play Console (Store Listing Experiments). Run at least one full week. Promote the winner. Repeat.

Why A/B test screenshots specifically?

Screenshots are the highest-leverage asset on a store listing. Apple's own data on Product Page Optimization shows that the first three screenshots drive most of the conversion lift on a listing — more than the icon, the subtitle, or even the app preview video. Google's Store Listing Experiments documentation reports the same: graphics test wins typically beat copy test wins by a wide margin.

The reason teams skip screenshot tests is friction. Producing four screenshot sets (the control plus three challengers) × three iPhone sizes × two iPad sizes × the locales you ship means dozens of files per test. With a render API, the friction collapses: one template duplicate, one parameter change, one API call.

What should you actually test?

Test one variable per round. The high-leverage variables, in roughly descending order of impact, are:

  • The first screenshot's headline — visible without scrolling in search results
  • Hero feature — which capability you lead with
  • Background style — flat color vs gradient vs photographic
  • Device frame — framed vs frameless, mockup angle
  • Caption position — above, below, or overlay
  • Color of the CTA element — primary brand color vs contrasting accent

Resist the urge to test five things at once. Multivariate tests need traffic that most apps do not have, and you end up with a winner you cannot explain.

How do you generate variants programmatically?

Duplicate the control template in the editor, change one element, save. Then render every variant via the same API call shape — only the templateId differs:

generate-variants.sh
bash
# Render the control + 3 challengers
for VARIANT in control red_cta photo_bg short_headline; do
  curl -X POST https://api.screenshots.live/v1/renders \
    -H "Authorization: Bearer $SCREENSHOTSLIVE_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{
      \"templateId\": \"tpl_home_v3_${VARIANT}\",
      \"locales\": [\"en-US\"],
      \"devices\": [\"iphone-6.7\", \"iphone-6.1\"],
      \"outputDir\": \"./variants/${VARIANT}\"
    }"
done

For most tests, only render the locales the test actually targets. App Store Connect Product Page Optimization runs per-locale, so a US-only test only needs en-US assets.

How do you set up the test on App Store Connect (PPO)?

Apple's Product Page Optimization lets you test up to three challenger pages against your default page in a single locale. Apple randomizes traffic, reports impressions and conversion rate per variant, and surfaces confidence intervals when the test reaches significance.

  1. In App Store Connect, open your app → Product Page Optimization → Create New Test.
  2. Add up to three product page variants. Each variant gets its own screenshot set, app preview, and icon.
  3. Allocate traffic. The default 25% per variant (control + three) works for most apps.
  4. Pick a localization. PPO runs per locale — the same test cannot be applied to multiple locales simultaneously.
  5. Submit for review. Each variant goes through App Review before the test can start.
  6. Start the test once all variants are approved.

See the official Apple Product Page Optimization page for current limits and review timing.
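
A mis-sized file means a rejected variant and another trip through review, so it is worth sanity-checking the renders locally before upload. Here is a minimal sketch that assumes macOS sips is available, that the 6.7-inch renders carry "6.7" somewhere in their output path, and that 1290×2796 portrait is still what Apple expects for that slot (confirm against the current screenshot specifications before relying on it):

check-dimensions.sh
bash
# Sanity-check 6.7-inch renders before uploading to App Store Connect.
# Expected size and path pattern are assumptions; adjust to your output layout
# and to Apple's current screenshot specifications.
EXPECTED="1290x2796"

find ./variants -type f -name '*.png' -path '*6.7*' | while read -r FILE; do
  W=$(sips -g pixelWidth  "$FILE" | awk '/pixelWidth/  {print $2}')
  H=$(sips -g pixelHeight "$FILE" | awk '/pixelHeight/ {print $2}')
  if [ "${W}x${H}" != "$EXPECTED" ]; then
    echo "WRONG SIZE: $FILE is ${W}x${H}, expected $EXPECTED"
  fi
done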

How do you set up Google Play Store Listing Experiments?

Google's flow is similar but more flexible. From Play Console, navigate to Store Presence → Store Listing Experiments → Create Experiment. Pick:

  • Type: Default Listing (global) or Localized Listing (per-locale)
  • Asset: Phone screenshots, 7-inch tablet, 10-inch tablet, feature graphic, icon
  • Variants: up to three challengers vs control
  • Audience split: 25%/25%/25%/25% by default, or custom

Play Console handles confidence intervals automatically and flags the winner when uplift is statistically significant. Read the latest Google docs at Run experiments on your store listing.
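
Rendering the Play asset sets is the same loop with a different device list. A sketch reusing the call shape above; the Android device identifiers here are placeholders, so substitute whatever names your templates actually define:

generate-play-variants.sh
bash
# Render each variant's phone and tablet sets for a Play Store Listing Experiment.
# Device identifiers below are illustrative; use the names your templates define.
for VARIANT in control red_cta photo_bg short_headline; do
  curl -X POST https://api.screenshots.live/v1/renders \
    -H "Authorization: Bearer $SCREENSHOTSLIVE_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{
      \"templateId\": \"tpl_home_v3_${VARIANT}\",
      \"locales\": [\"en-US\"],
      \"devices\": [\"android-phone\", \"android-tablet-7\", \"android-tablet-10\"],
      \"outputDir\": \"./variants/${VARIANT}-play\"
    }"
done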

How big a sample do you need?

Use a standard two-proportion z-test calculator with these inputs:

  • Baseline conversion rate: your current product page conversion rate (typically 25–35% on the App Store, 5–10% on the Play Store)
  • Minimum detectable lift: 5% relative lift is a reasonable bar; below 3% is rarely worth the time
  • Significance level: 95% (α = 0.05)
  • Power: 80% (β = 0.20)

For a 30% App Store baseline and a 5% relative lift, a standard calculator lands around 15,000 impressions per variant; at a 5% Play Store baseline the same lift needs roughly 100k–150k impressions per variant. Most apps with under 50k weekly impressions should run sequential tests (one variant for two weeks, then the next, comparing against a stable historical baseline) instead of concurrent ones.
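
If you would rather check the arithmetic than trust a web calculator, the two-proportion sample-size formula fits in a few lines of awk. A sketch using the settings above (95% significance, 80% power); change p1 and lift to your own baseline and minimum detectable lift:

sample-size.sh
bash
# Two-proportion z-test sample size per variant.
# p1 = baseline conversion rate, lift = relative minimum detectable lift.
# z values assume a two-sided alpha of 0.05 and 80% power.
awk -v p1=0.30 -v lift=0.05 'BEGIN {
  p2   = p1 * (1 + lift)
  pbar = (p1 + p2) / 2
  za = 1.960; zb = 0.842
  n = (za * sqrt(2 * pbar * (1 - pbar)) + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p2 - p1)^2
  printf "Impressions per variant: %d\n", n + 0.5
}'
# 30% baseline: roughly 15k per variant. Rerun with p1=0.05 for a Play-style baseline.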

How long should you run a test?

At least one full week. Conversion patterns swing significantly by day-of-week and by traffic source (Search Ads vs organic). Seven days captures one full weekly cycle. Two weeks is better when traffic is low. Stop early only if (a) the difference is overwhelming (>30% lift) and (b) the test has run at least seven full days. Stopping early on noise is the most common way to ship a worse listing than you started with.

How do you read results without fooling yourself?

Use the platform's own confidence interval. Both Apple PPO and Google Play Experiments report 90% / 95% confidence intervals on the variant's lift over control. If the interval crosses zero, the test is inconclusive — do not ship the challenger just because the point estimate is positive.
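
The decision rule is mechanical enough to write down where nobody can argue with it. A throwaway sketch with placeholder bounds; paste in the real interval from PPO or Play Console:

read-the-interval.sh
bash
# Decide from the dashboard's confidence interval on lift (bounds in %).
# The bounds below are placeholders, not real results.
LOWER=-0.4
UPPER=3.1
awk -v lo="$LOWER" -v hi="$UPPER" 'BEGIN {
  if (lo > 0)      print "Ship the challenger: the whole interval sits above zero."
  else if (hi < 0) print "Keep the control: the challenger is credibly worse."
  else             print "Inconclusive: the interval crosses zero."
}'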

Do not try to derive results from third-party MMP data. The variant a user sees is decided server-side by Apple or Google — your MMP cannot see which variant the user came from. The store's own dashboard is the only ground truth.

What do you do with the winner?

Promote the winning variant to default. The losing templates should not be deleted — keep them as reference points and write a one-line note on what changed and what you learned. The next test compounds on the last: if a short headline beat a long one, your next test might explore icon styles within that short-headline framework.

Pair this guide with the CI/CD automation guide so variants are generated automatically on every release, and the localization guide so your winner gets translated into 30+ languages without re-testing each one.

Ship variants in minutes

Stop resizing screenshots manually. Design one template and render every size, device, and locale with a single API call.

Start Free — Try Screenshots.live