Systematic quantitative research on SPY zero-day-to-expiration (0DTE) options strategies. Rigorous backtesting, layered filter validation, and explicit rules for promoting findings from data point to candidate tier. More than 450 configurations tested across roughly 30,000 simulated trades as of April 2026.
This research program develops and validates systematic 0DTE options strategies on SPY, with findings cross-validated against QQQ, IWM, and SPX. All research uses Option Alpha's backtest engine with $0.05/leg slippage (representing disciplined limit-order execution with 5–10 second patience), two-year windows, and mandatory multi-regime validation before any strategy promotes to candidate tier.
The framework exists to prevent premature conclusions, enforce rigorous testing, and build genuine mechanistic understanding of market edge — not pattern-matching or wishful thinking.
The zero live trades metric is intentional. The research is complete enough to deploy, but every validated strategy is still a backtest. Live validation is the next phase — each real trade yields more information than dozens of additional backtests. Honest disclosure of this gap matters more than pretending the research is "done."
Seven principles govern all research decisions. They were established iteratively as research revealed specific failure modes; each exists because an actual research error triggered its creation.
A single backtest result — positive or negative — does not close a research question. It narrows the search space for that specific configuration under those specific conditions.
Wrong framing: "This produced 2.09 PF, it's the best we can get." "This structure is dead." "Let's lock this in as the rule."
Correct framing: "This configuration fails at these conditions; filters or alternative parameters may reveal different behavior." "This is our current best baseline; refinement could push it higher."
Strategies don't reveal their full edge in a single test. They reveal it across a sequence: baseline → filter → regime gate → parameter optimization.
Example: the afternoon Iron Condor candidate (B30-C) wasn't discovered as PF 2.82. It emerged through:
A baseline PF of 1.2 can become PF 2.0+ with the right filters applied. Starting points matter less than the refinement process.
Before every batch, expected ranking and PF range are committed explicitly. This serves as a calibration check that prevents hindsight bias and triggers investigation when results surprise.
Tracking prediction accuracy across batches reveals where intuition fails. For example, the B32 Change% sweep inverted the extremes of the predicted ranking: predicted A>B>E>C>D, actual D>B>E>C>A, so the predicted best finished last and the predicted worst finished first. The bullish-continuation mechanism was refuted. Calibration failures like this prompted the principle that predictions should be made sparingly.
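One way to put a number on a calibration check like this is a rank correlation between the committed prediction and the realized ranking. The sketch below is illustrative and not the program's actual tooling: `spearman_rho` is a hypothetical helper, and it assumes the rankings contain no ties.

```python
def spearman_rho(predicted, actual):
    """Spearman rank correlation between two orderings of the same labels.

    Both arguments list the labels from best to worst; assumes no ties.
    """
    if sorted(predicted) != sorted(actual):
        raise ValueError("both rankings must contain the same labels")
    n = len(predicted)
    actual_rank = {label: i for i, label in enumerate(actual)}
    # Sum of squared rank differences, label by label.
    d2 = sum((i - actual_rank[label]) ** 2 for i, label in enumerate(predicted))
    return 1 - 6 * d2 / (n * (n**2 - 1))


# B32 Change% sweep rankings quoted above.
rho = spearman_rho(list("ABECD"), list("DBECA"))
print(round(rho, 2))  # -0.6
```

A value near +1 means the prediction held and a value near −1 means a full inversion. In the B32 case the middle three ranks matched and only the two extremes swapped, which is enough to pull the coefficient to −0.6.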
All research uses $0.05/leg slippage minimum. This represents disciplined limit-order execution with 5–10 second patience. Higher slippage variants ($0.07–$0.10/leg) are used for stress testing.
Mid-price backtests lie. A strategy that works only at theoretical mid-fills is not a strategy — it's a mirage that won't replicate in live trading. Several early research findings were invalidated when realistic slippage was applied.
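To see why per-leg slippage matters at 0DTE scale, it helps to convert it into dollars per trade. The sketch below assumes the standard 100-share contract multiplier and treats slippage as charged once per leg per fill. Whether the backtest engine charges it at both entry and exit is an assumption here, exposed as the hypothetical `fills` parameter.

```python
def slippage_cost(legs, slippage_per_leg, contracts=1, fills=1):
    """Dollar slippage for one trade.

    fills=1 models entry only (position left to expire); fills=2 models
    entry plus an early close. How the backtester charges this is assumed.
    """
    multiplier = 100  # standard US equity option contract multiplier
    return legs * slippage_per_leg * multiplier * contracts * fills


# Four-leg iron condor, opened and closed early, at the base and stress levels.
for per_leg in (0.05, 0.07, 0.10):
    cost = slippage_cost(legs=4, slippage_per_leg=per_leg, fills=2)
    print(f"${per_leg:.2f}/leg -> ${cost:.0f} per round trip")
```

Under these assumptions a four-leg structure closed early gives up about $40 per contract at $0.05/leg, which is large relative to average per-trade profits in the tens of dollars.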
Two strategies with PF 1.5 that lose money on the same days = one strategy levered up. Two strategies with PF 1.3 that lose on different days = real diversification.
Correlation analysis is mandatory before declaring two strategies stackable. Combined drawdown matters more than combined expected value. This is an open research question for B49-A (morning strong-up) and B28-D (afternoon strong-up) — both fire on the same Change% > +0.5% filter and could be highly correlated rather than additive.
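A stackability check along these lines can be sketched with two numbers: the correlation of daily P/L on shared trading days, and the fraction of one strategy's losing days on which the other also lost. The function names and the sample P/L values below are made up for illustration; they are not results from the database.

```python
from math import sqrt


def pnl_correlation(pnl_a, pnl_b):
    """Pearson correlation of daily P/L on days when both strategies traded."""
    shared = sorted(set(pnl_a) & set(pnl_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared trading days")
    xs = [pnl_a[d] for d in shared]
    ys = [pnl_b[d] for d in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant P/L")
    return cov / sqrt(var_x * var_y)


def shared_loss_fraction(pnl_a, pnl_b):
    """Of the shared days where A lost, the fraction where B also lost."""
    a_losses = [d for d in set(pnl_a) & set(pnl_b) if pnl_a[d] < 0]
    if not a_losses:
        return 0.0
    return sum(pnl_b[d] < 0 for d in a_losses) / len(a_losses)


# Made-up daily P/L keyed by date label, for illustration only.
morning = {"d1": 20, "d2": -60, "d3": 25, "d4": 15, "d5": -40}
afternoon = {"d1": 18, "d2": -55, "d3": 30, "d4": 10, "d5": 12}
print(round(pnl_correlation(morning, afternoon), 2))
print(shared_loss_fraction(morning, afternoon))
```

In this made-up sample the correlation comes out well above the 0.5 stacking threshold, so the pair would be treated as one levered position rather than two diversifying ones.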
Monthly equity-curve validation is required before candidate promotion. Specifically:
These tests catch strategies whose aggregate PF hides bimodal regime dependence. B54-D (slight-down LP, earlier batch) showed aggregate PF 1.03 — but monthly breakdown revealed a May 2024-Aug 2025 losing period offset by Sep 2025-Feb 2026 recovery. Two sub-regimes were masking each other. Applying the same check to B53-E (slight-up LP) confirmed its edge was NOT bimodal — real monthly stability.
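The specific tests are not reproduced in this text, so the sketch below does not implement them. It only computes the monthly statistics that the candidate write-ups report (positive months, worst month, longest losing streak, total with the top three months removed). `monthly_summary` and the sample figures are hypothetical.

```python
def monthly_summary(monthly_pnl):
    """Summarize a chronological list of (month_label, pnl) pairs.

    Months with no trades should be left out, matching the
    'active months' convention.
    """
    pnls = [p for _, p in monthly_pnl]
    if not pnls:
        raise ValueError("no active months supplied")
    streak = longest = 0
    for p in pnls:
        streak = streak + 1 if p < 0 else 0
        longest = max(longest, streak)
    top3 = sorted(pnls, reverse=True)[:3]
    return {
        "active_months": len(pnls),
        "positive_months": sum(p > 0 for p in pnls),
        "worst_month": min(pnls),
        "longest_losing_streak": longest,
        "total_ex_top3": sum(pnls) - sum(top3),
    }


# Made-up monthly P/L, for illustration only.
sample = [("2024-05", 120), ("2024-06", -40), ("2024-07", -15),
          ("2024-08", 300), ("2024-09", 60), ("2024-10", -5)]
print(monthly_summary(sample))
```

A long losing streak next to a healthy aggregate total, or a total that turns negative once the top three months are removed, is the kind of pattern that would prompt the closer regime inspection described above.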
The methodology explicitly treats self-gating (strategy skips days via filter) as protective, not a limitation. B49-A fires approximately 38 times per year (77 trades over 2 years); the remaining days are skipped by design. B28-D fires ~65 times per year (131 trades over 2 years) with the same filter.
When a sweep varies risk per trade (e.g., wing width on SPS), raw PF systematically favors wider wings — but max risk scales linearly while P/L scales sub-linearly. Return-on-risk is the correct tiebreaker.
Empirical example from the B49 wing width comparison:
| Config | Wing | PF | Avg P/L | Max DD | Retail yield |
|---|---|---|---|---|---|
| B48-A | $5 | 2.46 | $12 | −$181 | $24 (2-contract) |
| B49-A | $10 | 2.92 | $19 | −$202 | $19 (1-contract) |
Both: IC P25Δ/C15Δ at 9:40 AM, hold to 11:00 AM, filter Change% > +0.5%. Same underlying data, only wing width differs. N=77 trades each.
The $5-wing variant (B48-A, PF 2.46) yields $24/trade at 2-contract size vs $19/trade at 1-contract $10-wing (B49-A, PF 2.92) on retail account sizes. B49-A wins the profit-factor contest; B48-A wins the dollar-output contest at constrained margin. Both are valid candidates; choice depends on account size.
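The retail-yield column can be reproduced with a short calculation. This is a sketch under a simplifying assumption: gross risk per contract is taken as wing width × 100 and the credit received is ignored. The $1,000 budget and the function name are hypothetical; the average P/L figures come from the table above.

```python
def yield_at_budget(avg_pnl, wing_width, risk_budget):
    """Contracts that fit a gross-risk budget, and average dollars per trade.

    Gross risk per contract is wing_width * 100, ignoring the credit
    received; that simplification is an assumption of this sketch.
    """
    contracts = int(risk_budget // (wing_width * 100))
    return contracts, contracts * avg_pnl


budget = 1000  # hypothetical gross-risk budget in dollars
for name, wing, avg in (("B48-A", 5, 12), ("B49-A", 10, 19)):
    n, dollars = yield_at_budget(avg, wing, budget)
    print(f"{name}: {n} contract(s), ${dollars}/trade")
```

At the same gross-risk budget the $5-wing variant fits two contracts and the $10-wing variant fits one, which is how the lower-PF configuration ends up with the higher dollar output.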
Research progresses through five phases. Each has explicit closure criteria to prevent premature promotion or endless refinement.
Goal: Identify all configurations showing positive expectancy or marginal positive signal.
Method: Test broad combinations of structure × entry time × exit time × delta × wings. Each batch isolates ONE variable and holds the others constant, with batch size bounded by Option Alpha's 5-test comparison limit. Realistic slippage applied.
Closure: Phase 1 closes when major dimensions have been mapped sufficiently to identify Phase 2 candidates — NOT when "the answer" is found.
Goal: Concentrate edge via individual filters.
Filter dimensions tested:
Output: Filter heatmap per candidate. Filters producing >20% PF improvement are flagged for Phase 3.
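The flagging rule can be sketched as a comparison of each filtered PF against the unfiltered baseline. The filter names and PF values below are hypothetical; only the 20% threshold comes from the text.

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss; inf when there are no losers."""
    gross_win = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    return float("inf") if gross_loss == 0 else gross_win / gross_loss


def flag_filters(baseline_pf, filtered_pfs, min_improvement=0.20):
    """Return the filters whose PF beats baseline by more than min_improvement."""
    cutoff = baseline_pf * (1 + min_improvement)
    return [name for name, pf in filtered_pfs.items() if pf > cutoff]


# Hypothetical heatmap row for one candidate with baseline PF 1.20.
row = {"vix_band": 1.30, "change_pct": 1.95, "day_of_week": 1.38}
print(flag_filters(1.20, row))  # ['change_pct']
```

A filter that raises PF by discarding most trades still has to clear the sample-size rules, so a flag here is an input to Phase 3 rather than a conclusion.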
Goal: Combine validated filters to maximize edge without overfitting.
Sample size rules:
B49-A (Change% > +0.5% alone) is a 1-filter specification with N=77 trades — clears all sample thresholds. Multi-filter combinations (e.g., Change% + VIX bands, Change% + day-of-week) have been tested but current candidates all use single-filter specifications.
Methods:
Method: Smallest viable size (1 contract). Track every trade against backtest expectation. Compare actual slippage to modeled. Document system-rule deviations.
Current status: no candidate has entered Phase 5. The research-to-execution gap is the highest-priority item on the open research list.
Configurations are classified into four tiers by how many qualification criteria they satisfy:
| Tier | Criteria met | Treatment |
|---|---|---|
| Candidate | 7 of 7 | Live-deployable, size carefully |
| Research-tier | 4–6 of 7 | Research-worthy, additional validation needed |
| Data point | Below 4 | Data point only, not actionable |
| Closed | — | Falsified or validated-and-subsumed |
| Metric | Threshold | Rationale |
|---|---|---|
| Profit factor | > 1.25 | Below this, slippage variance flips negative |
| Win rate | > 70% | Psychological sustainability; small drawdown clusters |
| Max drawdown | Account-appropriate | Account survival under worst case |
| Sample size | > 50 trades | Statistical reliability after filtering |
| Avg P/L per trade | > $8 net slippage | Worth cognitive load and capital tie-up |
| Monthly validation | Passes all three regime tests | Catches bimodal / regime-masked findings |
| Correlation (if stacking) | < 0.5 | Real diversification, not amplification |
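The tier assignment can be sketched as a count over the seven criteria in the table above. Two criteria are judgment calls and are passed in as booleans. Treating a missing correlation (no stacking planned) as met is an assumption of this sketch, and the example configuration is hypothetical. The Closed tier is a research decision, not a function of the count, so it is not modeled.

```python
def count_criteria(pf, win_rate, n_trades, avg_pnl,
                   max_dd_ok, monthly_ok, correlation=None):
    """Count how many of the seven qualification criteria are met.

    max_dd_ok and monthly_ok are judgment calls passed in as booleans.
    correlation=None (not stacking) is counted as met; that is an assumption.
    """
    checks = [
        pf > 1.25,
        win_rate > 70.0,
        max_dd_ok,
        n_trades > 50,
        avg_pnl > 8.0,
        monthly_ok,
        correlation is None or correlation < 0.5,
    ]
    return sum(bool(c) for c in checks)


def tier(criteria_met):
    """Map a criteria count to the tier names used in the tier table."""
    if criteria_met == 7:
        return "Candidate"
    if criteria_met >= 4:
        return "Research-tier"
    return "Data point"


# Hypothetical configuration that has not yet passed monthly validation.
met = count_criteria(pf=1.9, win_rate=74.0, n_trades=80, avg_pnl=14.0,
                     max_dd_ok=True, monthly_ok=False)
print(met, tier(met))  # 6 Research-tier
```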
| Priority | Metric | Why |
|---|---|---|
| Primary | Profit Factor | Risk-adjusted return |
| Secondary | Total P&L ($) | Absolute dollars |
| Tertiary | Win Rate | Consistency / psychology |
| Check | Max Drawdown | Tail risk / account survival |
| Check | Avg Win / Avg Loss | Outcome asymmetry |
A configuration with PF 2.8, $37 avg P&L, 78% WR beats one with PF 1.6, $45 avg P&L, 85% WR. The second has better headline numbers but worse risk-adjusted return.
Strategies that have cleared the qualification criteria. Each includes batch-level progression showing how it emerged from broader research.
Candidate Monthly validated
Iron Condor with inverse-asymmetric deltas, 9:40 AM entry, 11:00 AM early close, filtered to strong-up overnight gap setups. Currently the most robust morning-session finding in the database.
| Structure | Iron Condor, asymmetric deltas |
| Put side | Short 25 delta / Long 25δ − $10 (wing width $10) |
| Call side | Short 15 delta / Long 15δ + $10 (wing width $10) |
| Entry | 9:40 AM ET |
| Exit | 11:00 AM ET (1h 20m hold — early close, NOT held to expiration) |
| Filter | Change% > +0.5% (SPY opens at least 0.5% above prior close) |
| Slippage | $0.05/leg |
Strong-up opens continue trending for approximately 90 minutes rather than mean-reverting. Exit at 11:00 AM captures the continuation theta without holding through midday reversal risk. The inverse asymmetry (puts tighter at 25Δ, calls further OTM at 15Δ) reflects a specific market reality: on strong-up days, overnight fear-premium on the put side is maximally inflated at open and deflates sharply — capturing that deflation is where the edge is. Call-side premium is minimal (market is already at intraday highs) and carrying wider call wings sacrifices little.
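The filter check itself is a one-line calculation. The sketch below follows the definition given in the specification table (SPY open relative to prior close); the function names and prices are hypothetical.

```python
def change_pct(open_price, prior_close):
    """Overnight gap as a percentage of the prior close."""
    return (open_price - prior_close) / prior_close * 100


def strong_up_open(open_price, prior_close, threshold=0.5):
    """True when the open clears the Change% > +0.5% filter (strict)."""
    return change_pct(open_price, prior_close) > threshold


# Hypothetical prices.
print(strong_up_open(703.00, 700.00))  # False: gap is about +0.43%
print(strong_up_open(704.20, 700.00))  # True: gap is +0.60%
```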
MOST ROBUST MORNING FINDING IN DATABASE. 18 of 23 active months positive (78%), worst month −$101, longest losing streak only 2 months. No regime-masking pattern. Top 3 months removed still leaves +$671 over 20 months. Filter self-gates: months with zero strong-up days produced zero trades and zero risk (Feb 2025, Sep 2025). Strategy cannot fire when its edge conditions are absent.
At retail account scale: 1×$10-wide (B49-A as specified) uses ~$900 margin vs ~$450 for a 2×$5-wide variant (B48-A). On a $1,798 account, B48-A gives $24/trade on strong-up days vs B49-A's $19/trade. B49-A has the better profit factor; B48-A has the better dollar output at retail sizing. Both are documented candidates.
Candidate Monthly validated
Long Put (single leg, buy premium, not a spread), 9:40 AM entry, 11:00 AM early close, filtered to tepid overnight gap-up setups. Fills the slight-up morning regime gap.
| Structure | Long Put — single put option purchased outright |
| Delta | 40 delta (near-the-money, higher delta for more sensitivity to move) |
| Entry | 9:40 AM ET |
| Exit | 11:00 AM ET (1h 20m hold) |
| Filter | Change% in (0%, +0.5%] (SPY opens slightly above prior close) |
| Slippage | $0.05/leg |
Slight-up overnight gaps tend to mean-revert through midday more reliably than they continue. A long put at 40Δ captures this reversion. The structurally low win rate (40%) is expected for long-premium strategies — the edge comes from R:R asymmetry (2.0x — Avg Win $95, Avg Loss $36), not win frequency. The delta sweep produced a coherent monotonic pattern (PF 1.27 to 1.34 across 15-40δ variants), validating real edge vs. noise.
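The relationship between win rate, reward-to-risk ratio, and profit factor can be made explicit. The sketch below takes the 40% win rate and the 2.0x ratio as inputs and ignores everything else; the function names are hypothetical.

```python
def profit_factor_from_rr(win_rate, reward_to_risk):
    """PF implied by a win rate and an average-win / average-loss ratio."""
    return win_rate * reward_to_risk / (1 - win_rate)


def breakeven_win_rate(reward_to_risk):
    """Win rate at which PF equals 1 for a given reward-to-risk ratio."""
    return 1 / (1 + reward_to_risk)


print(round(profit_factor_from_rr(0.40, 2.0), 2))  # 1.33
print(round(breakeven_win_rate(2.0), 3))           # 0.333
```

At a 2.0x ratio the breakeven win rate is about one in three, so a 40% win rate leaves positive expectancy despite losing on most trades.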
16 of 25 months positive (64%), longest losing streak 3 months, no regime-masking pattern. Edge is real and regime-stable. Caveat: concentration risk — top 3 months (Feb 2026, Jan 2026, Jul 2024) contribute 73% of total P/L. Top-3-removed = +$767 over 22 months (still positive but thin). Deploy with sizing discipline.
Tier: one notch below the afternoon champions (B28-D at PF 4.16, B30-C at PF 2.24). Candidate-tier under the long-premium-adjusted criteria (Win$/Loss$ ratio ≥ 1.5 replaces the standard WR > 70% threshold, since long-premium is structurally low-WR).
Candidate Monthly validated
HIGHEST STABLE PROFIT FACTOR IN THE ENTIRE DATABASE. Jade Lizard with symmetric deltas, 2:30 PM entry, hold to expiration, filtered to strong-up days.
| Structure | Jade Lizard — short put + short call + long call (no long put, simulated-naked put side) |
| Put side | Short 15 delta, Long put $100 below (simulated-naked via extremely wide wing) |
| Call side | Short 15 delta, Long call $5 above (defined-risk upside) |
| Entry | 2:30 PM ET |
| Exit | Hold to expiration (4:00 PM close) |
| Filter | Change% > +0.5% (same filter as B49-A morning) |
| Slippage | $0.05/leg |
Strong-up days benefit disproportionately from the Jade Lizard's asymmetric structure. The wide put wing (simulated-naked) absorbs the rare reversal fully. The tight $5 call wing captures full theta on continued upside. 120W/11L over 131 trades. Return on Drawdown of 1,014% is exceptional.
21 of 25 months profitable. The 4 losing months were all small: Aug 2024 −$180, Dec 2024 −$96, Jul 2025 −$64, Nov 2025 −$10. Top 3 months removed still leaves +$1,079 positive — edge is not concentrated in a few outlier months.
Critical finding: the filter's edge does not come from sitting out bad months. December 2024 had 4 trades within the filter (vs 20 unfiltered) and performed near-normal (−$96 vs −$805 unfiltered). February 2025 had only 3 trades and was positive (+$59). The filter identifies bad days via prior-day behavior, not bad months. This points to real structural edge, not regime luck.
3-year validation on 2023 data still pending per methodology. Natural pairing with B49-A morning: both fire on Change% > +0.5%, so a strong-up day produces a two-trade sequence with capital recycling between 11:00 AM and 2:30 PM.
Candidate
Iron Condor with asymmetric deltas ($10 wings), 1:30 PM entry, hold to expiration, filtered to down-day afternoon setups. Fills the down-regime gap in the three-regime afternoon portfolio.
| Structure | Iron Condor, asymmetric deltas |
| Put side | Short 15 delta / Long 15δ − $10 (wing width $10) |
| Call side | Short 25 delta / Long 25δ + $10 (wing width $10) |
| Entry | 1:30 PM ET |
| Exit | Hold to expiration (4:00 PM close) |
| Filter | Change% < −0.1% (down day) |
| Slippage | $0.05/leg |
Down-day afternoon entries catch theta decay through close after the initial down move has played out. The asymmetric wings reflect the regime: on down-days, the put side has already been tested (stress absorbed), while the call side is the safer premium collection opportunity. Return on Drawdown 627%.
Beats symmetric IC $5-wide down-day (PF 2.08) AND beats Jade Lizard down-day (PF 1.68). Specific combination of $10 wings and inverse asymmetry (vs B29-D's P15/C25 $5w) is the winning configuration.
Monthly validation pending — not yet run.
Candidate
Iron Butterfly ATM, $5 wings, 1:30 PM entry, hold to expiration, filtered to flat days. Fills the flat-regime gap in the three-regime afternoon portfolio.
| Structure | Iron Butterfly — both shorts at-the-money |
| Strikes | Short put ATM / Short call ATM / Long put ATM − $5 / Long call ATM + $5 |
| Entry | 1:30 PM ET |
| Exit | Hold to expiration (4:00 PM close) |
| Filter | Change% in [−0.5%, +0.5%] (flat day) |
| Slippage | $0.05/leg |
Iron butterfly at ATM is structurally purpose-built for narrow-range days. The 1:30 PM entry catches the afternoon theta burn without morning gap risk. Previously overlooked because IB had been tested at 10:00 AM (PF 0.99, losing). 1:30 PM is the correct entry time — entry timing dominates structure at 0DTE.
B31-A filter (Change% −0.5% to +0.5%) overlaps with B30-C filter (Change% < −0.1%) on days where Change% falls between −0.5% and −0.1%. On those days, both strategies fire. This requires an overlap diagnostic before a clean three-regime portfolio claim can be made. Currently: clears 5/7 candidate criteria; pending out-of-sample and correlation testing.
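The overlap is easy to see by mapping a Change% value to the afternoon candidates whose filters admit it. The sketch below takes Change% as a given input, since how it is measured for afternoon entries is not specified here; `afternoon_fires` is a hypothetical helper, and the filter bounds are the ones in the specification tables.

```python
def afternoon_fires(change_pct):
    """Afternoon candidates whose Change% filter admits this value."""
    fired = []
    if change_pct > 0.5:
        fired.append("B28-D")  # strong-up Jade Lizard
    if change_pct < -0.1:
        fired.append("B30-C")  # down-day Iron Condor
    if -0.5 <= change_pct <= 0.5:
        fired.append("B31-A")  # flat-day Iron Butterfly
    return fired


for c in (0.8, 0.2, -0.3, -0.9):
    print(c, afternoon_fires(c))
# 0.8 ['B28-D']
# 0.2 ['B31-A']
# -0.3 ['B30-C', 'B31-A']
# -0.9 ['B30-C']
```

Counting how many historical days fall in the two-strategy band, and how the two P/L streams behave on those days, would be one form the overlap diagnostic could take.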
Research-tier
Short Put Spread 10Δ, $10 wings, 9:40 AM entry, 11:00 AM early close, filter Change% < −0.5%. PF 1.35, WR 66.1%, N=62, Avg P/L $3, MaxDD −$265.
First positive morning heavy-down finding in the database. Mechanism: in fear regimes (heavy-down), a lower put delta (further OTM at 10Δ) wins because cushion matters more than premium capture. In bullish regimes, a closer-to-money put delta wins because premium capture matters more.
Fails 4 of 7 candidate criteria (WR and Avg P/L thresholds). Research-tier only — not deployment-ready. Currently the best morning heavy-down result; no candidate-tier morning heavy-down strategy exists.
Questions that have been tested with sufficient rigor to close pending new evidence. These are not to be retested without a specific mechanistic reason — retesting without new data is the "confirmation bias trap."
Closed
Every profit-target and stop-loss management variation tested (across dozens of configurations) reduced or eliminated edge relative to hold-to-expiration. Hold-to-expiry is the validated approach. This is counterintuitive — managing losers feels prudent — but data is unambiguous.
Closed
Batches 2, 4, 6b established that morning IC entries with early exits (before afternoon) consistently produce negative edge once realistic slippage is applied. Early research findings that suggested otherwise were invalidated once $0.05/leg slippage was enforced.
Closed
Batch 3a vs 3b established that the earlier entry wins on every metric for the unfiltered case. However, this finding is subsumed by the broader result that 1:30 PM entry beats every tested morning entry time (Batch 5). Afternoon IC is the correct lane for symmetric structures.
Closed
Across IC, IB, BWIC, BW IB, and SPS structures (Batches 4, 6b), exiting at 1:00 PM consistently destroys edge. Hold to expiration, or at minimum to late afternoon.
Findings from external sources that have been integrated as reference data or additional structural candidates. External research is treated distinctly from proprietary backtests — it's either reference data (descriptive, not tested) or external strategy (tested framework from another practitioner, integrated but not validated in our engine).
External strategy · EXT-02
Jack Sokin's Peggy Bank uses percentage-OTM strike selection with a reward/risk cascade (0.5% / 0.4% / 0.3% OTM) rather than delta or fixed-dollar offsets. This is a structurally different parameterization from our research family.
Integration status: %-OTM structures added to the Decision Engine's structure library alongside dollar-offset variants. Not yet independently backtested as a standalone candidate. Initial engine output suggests Peggy-style structures converge on similar answers to dollar-offset variants when analog days are well-matched (~0.75% OTM SPS ≈ $5 OTM SPS at SPY $710).
Why %-OTM matters: more robust to SPY price drift over time. A $3 OTM strike at SPY $400 has very different meaning from $3 OTM at SPY $700. Percentage-based parameterization normalizes across price regimes.
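The normalization argument can be checked with two conversions between dollar offsets and percentages of spot. The helper names are hypothetical, and rounding to listed strike increments is not modeled.

```python
def dollar_offset_as_pct(spot, offset):
    """Express a fixed-dollar OTM offset as a percentage of spot."""
    return offset / spot * 100


def pct_otm_as_dollars(spot, pct_otm):
    """Express a percentage-OTM distance as a dollar offset from spot."""
    return spot * pct_otm / 100


print(round(dollar_offset_as_pct(400, 3), 2))   # 0.75
print(round(dollar_offset_as_pct(700, 3), 2))   # 0.43
print(round(pct_otm_as_dollars(710, 0.75), 1))  # 5.3
```

The same $3 offset is 0.75% of spot at SPY $400 but only about 0.43% at SPY $700, while 0.75% OTM at SPY $710 works out to roughly $5.3.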
Reference data
PEG study provides containment probabilities for SPX over different lookback windows (30D / 90D / 180D) and entry times. Key confirmed patterns:
Status: reference data, not integrated as a filter. Worth testing as a future research direction.
Reference data
External composite liquidity ranking for 220 tickers validates the SPX → SPY → QQQ → IWM → XSP hierarchy. Quantitatively backs the transfer-gap hypothesis: findings on the top 3–5 tickers transfer cleanly across the family; findings on lower-liquidity tickers are less reliable.
Reference data
Empirical overnight range probabilities for SPX 1DTE by entry time. Reference data, not used as a filter. Potentially useful for bridging between 0DTE research and 1DTE applications.
Beyond the validated candidate strategies, research has produced a daily decision-support framework that extends the candidate-filter approach with historical analog matching. This complements (does not replace) the validated filters.
Traditional backtest analysis asks "what is the expected P/L of trade X today?" This requires modeling options prices, IV surfaces, and slippage accurately — a tall order with free-tier data.
The containment framework reframes the question: "On the historical days most similar to today, what fraction of them closed inside a given structure's profit zone?" This is a binary outcome question (contained / max loss / partial) that can be answered purely from underlying price data without options pricing models.
Don't predict P/L. Predict containment. The trade edge expresses itself across many decisions conditional on containment, not via accurate per-trade P/L forecasts.
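A minimal sketch of the containment calculation, assuming analog days have already been selected and each is summarized by the underlying's percent move from entry time to the close. It only separates contained from not contained and does not distinguish partial from max loss. The function name, the zone bounds, and the sample moves are all hypothetical.

```python
def containment_rate(analog_moves_pct, lower_pct, upper_pct):
    """Fraction of analog days whose entry-to-close move stayed in the zone.

    analog_moves_pct: percent move of the underlying from entry to close on
    each analog day. lower_pct / upper_pct: the structure's profit zone in
    the same units (short strikes as percent distance from spot).
    """
    if not analog_moves_pct:
        raise ValueError("no analog days supplied")
    inside = sum(lower_pct <= m <= upper_pct for m in analog_moves_pct)
    return inside / len(analog_moves_pct)


# Made-up analog-day moves, for illustration only.
moves = [0.10, -0.25, 0.42, -0.61, 0.05, 0.33, -0.12, 0.80, -0.04, 0.18]
print(containment_rate(moves, lower_pct=-0.5, upper_pct=0.5))  # 0.8
```

Expressing both the moves and the zone in percent keeps analog days comparable across different SPY price levels, and the calculation needs only underlying prices.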
| Tier | Trigger | Action |
|---|---|---|
| GREEN | Validated filter fires AND top structure ≥ 80% containment on analogs | Standard size |
| YELLOW | One signal strong, the other mixed | Reduced size (½ contract) |
| ORANGE | No validated fire, top structure 60–80% | Small experimental or skip |
| RED | No validated fire AND top structure < 60% | Stand down |
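The table can be sketched as a small decision function. GREEN, ORANGE, and RED follow the triggers as written. The YELLOW trigger is not stated numerically, so the reading used below (exactly one of the two signals is strong) is an assumption, as is the function name.

```python
def decision_tier(validated_fire, containment):
    """Map the two signals to a tier. containment is a fraction in [0, 1]."""
    if validated_fire and containment >= 0.80:
        return "GREEN"
    if not validated_fire and containment < 0.60:
        return "RED"
    if not validated_fire and containment < 0.80:
        return "ORANGE"
    # Exactly one signal is strong: the filter fired with containment below
    # 80%, or no filter fired but containment is at least 80%. This reading
    # of YELLOW is an assumption of the sketch.
    return "YELLOW"


print(decision_tier(True, 0.85))   # GREEN
print(decision_tier(True, 0.70))   # YELLOW
print(decision_tier(False, 0.70))  # ORANGE
print(decision_tier(False, 0.40))  # RED
```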
Untested or insufficiently tested questions, priority ranked.
Research errors this methodology is explicitly designed to prevent. Each exists because we encountered it.
Treating any single backtest as conclusive. Negative results don't kill ideas; they narrow the conditions where ideas might work.
Believing the highest PF in current data is the ceiling. The afternoon IC went from PF 1.4 to PF 2.82 through layered refinement. Today's "best" is tomorrow's baseline.
Comparing strategies tested under different conditions (slippage, time periods, regime filters). Always normalize comparison conditions before drawing conclusions.
Wanting larger sample sizes that don't exist. 0DTE SPY has only ~3.5 years of real data. Cross-ticker and cross-period validation substitute for raw N where appropriate.
Running backtests without slippage and treating results as achievable. Realistic execution cost destroys naive edges. Always use $0.05/leg minimum.
Adding multiple positive-PF strategies without checking if they lose on the same days. Combined drawdown can be worse than expected if correlations are high.
Running additional tests to confirm a desired result rather than challenge a finding. Best practice: actively look for tests that could falsify your current hypothesis.
Adding strategies because they're profitable, even if execution complexity exceeds capacity. A single PF 2.0 strategy executed flawlessly beats three PF 1.4 strategies executed sloppily.
Added this session. Building execution infrastructure before live-validating the strategy it's meant to support. The risk: polishing the delivery mechanism for edge that hasn't been proven to survive real markets.
Option Alpha: real options pricing, 5 simultaneous backtests for comparison, 2-year window typical, $0.01–$0.25/leg slippage configurable.
Wealthsimple (CAD margin): 1.5% FX fee per USD conversion, wider spreads than US-direct brokers. Suitable for initial live testing; transferable to other brokers for scale.
| Source | Data | Notes |
|---|---|---|
| Polygon (free tier) | SPY, QQQ, IWM, TLT, UUP daily & intraday | 2-year history cap on free tier |
| FRED (St. Louis Fed) | VIX daily close | Free, 3-year+ history available |
| Option Alpha | Historical options chains (indirect via backtester) | Actual options pricing, not modeled |
| Economic calendar | FOMC / CPI / NFP / GDP dates | Manual population from ForexFactory |
Methodology is versioned explicitly. Each revision documents what changed and why.
| Version | Date | Changes |
|---|---|---|
| v1 | Apr 16, 2026 | Initial methodology document. Established after morning bridge research session demonstrated need for explicit framework to prevent premature conclusions. Seven principles, five phases, seven qualification criteria, nine anti-patterns. |
| v2 | ~Apr 18, 2026 | Added monthly validation formalization (Principle 6). Added Reference Data vs External Research distinction (Principle 8 in some revisions). Added handoff-document discipline for session continuity. |
| v3 | ~Apr 19, 2026 | Refined qualification criteria to include out-of-sample validation and correlation. Expanded anti-patterns. Integrated morning research findings (B49, B53, B55 families) into candidate tier framework. |
| v3.1 | Apr 21, 2026 | Current. Added Principle 7 (capital-efficiency over raw PF) from B49 wing-width comparison. Integrated inverse asymmetry finding (B48-A, P25/C15 beats P15/C25 for strong-up mornings). Added Decision Engine / containment framework as research-tier deliverable. Added "premature web app" anti-pattern. |
A "closed" question can be reopened if:
Example: morning Iron Butterfly was tentatively closed in early batches (10:00 AM entry produced PF 0.99) but reopened when B31-D tested IB at 1:30 PM and found PF 1.67 unfiltered. B31-A (flat-filter variant at PF 1.97) is the candidate-tier result.