"""
MAML Explainer - Understanding Meta-Learning
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Paper: https://arxiv.org/abs/1703.03400
"""

print("""
╔══════════════════════════════════════════════════════════════════╗
║                    MAML: HOW IT WORKS                            ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  Traditional Learning:                                           ║
║    Initialize θ randomly                                         ║
║    Train on Task A for 1000 steps → θ_A                          ║
║    Train on Task B for 1000 steps → θ_B                          ║
║    New Task C? Start from scratch again...                       ║
║                                                                  ║
║  Meta-Learning (MAML):                                           ║
║    Find an initialization θ* that is "close" to all tasks        ║
║    New Task C? Start from θ*, adapt in just a few steps          ║
║                                                                  ║
║  How?                                                            ║
║    1. Sample batch of tasks {T1, T2, ..., TN}                   ║
║    2. For each task Ti:                                          ║
║       - Inner loop: Adapt θ to Ti → θ'i (5 steps)               ║
║       - Outer loop: Update θ based on performance of θ'i        ║
║    3. Repeat until θ* is good for fast adaptation               ║
║                                                                  ║
╠══════════════════════════════════════════════════════════════════╣
║  KEY INSIGHT:                                                    ║
║  MAML optimizes for "learnability", not "performance"            ║
║  It finds weights where gradient descent works really well      ║
╚══════════════════════════════════════════════════════════════════╝

MAML ALGORITHM:

Require: Distribution over tasks p(T)
Require: α (inner learning rate), β (outer learning rate)

Initialize: θ randomly

while not done:
    Sample batch of tasks Ti ~ p(T)
    
    for each task Ti:
        # INNER LOOP (task-specific adaptation)
        Sample K examples from Ti for training
        Compute adapted parameters (one gradient step shown;
        in practice the inner loop repeats this, e.g. 5 times):
            θ'i = θ - α * ∇_θ L_Ti(θ)
        
        # Evaluate on test examples from Ti
        Compute loss: L_Ti(θ'i)
    
    # OUTER LOOP (meta-update)
    Update θ using gradients from all tasks:
        θ = θ - β * ∇_θ Σ_i L_Ti(θ'i)
        
Result: θ* that enables fast adaptation

╔══════════════════════════════════════════════════════════════════╗
║  THE HARD PART:                                                  ║
║  Computing ∇_θ L_Ti(θ'i) requires second-order derivatives      ║
║  (differentiating through the inner-loop gradient steps).       ║
║  This is expensive; the paper's first-order approximation       ║
║  drops these terms and often performs comparably.               ║
╚══════════════════════════════════════════════════════════════════╝
""")
