In-context learning (ICL) is the ability of foundation models to adapt their behavior at inference time by conditioning on examples, instructions, and other task-specific information placed directly in the prompt. This post builds a practical, cross-modal view of ICL: how it shows up in large language models, how it changes when images enter the context, and what it means for robotics, where “outputs” are actions rather than tokens. We will develop an intuition for what ICL is (and is not), connect it to transformer mechanisms such as induction heads and long-context effects, and then walk through concrete prompting patterns, failure modes, and evaluation strategies. The goal is to leave you with a mental model and a playbook you can use to design, test, and deploy ICL-based systems across language, computer vision, and robotics.

Who it’s for

Key takeaways

1) Motivation: why ICL matters across modalities

The problem: task adaptation is expensive

Traditional adaptation usually means:

That pipeline is slow, compute-heavy, and operationally risky. Even “lightweight” updates (LoRA/adapters) still require data management, evaluation cycles, and model versioning.

The promise: adapt at inference time using demonstrations

ICL offers a different workflow: you keep the model fixed and change only the prompt. As popularized by GPT-3-style few-shot prompting, large language models can often solve new tasks from a handful of labeled examples placed in the prompt, without any gradient updates.

Why multimodal + robotics makes it more interesting

Once you leave pure text, the challenges get sharper:

ICL becomes a unifying interface: the same idea (provide demonstrations in the prompt) can drive behavior across text outputs, visual reasoning outputs, and action outputs.

2) What “In-Context Learning” actually is (and isn’t)

Definition: conditioning on context vs updating parameters

A clean way to say it:

Fine-tuning changes the model's parameters θ via gradient updates on a task-specific dataset.

ICL changes only the input to the model; the parameters θ stay fixed.

So the “learning” is in how the model uses the prompt context to condition its outputs.
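
To make the contrast concrete, here is one schematic way to write it down (notation is illustrative: $\theta$ are the model weights, $D$ a task-specific dataset, and $(x_i, y_i)$ demonstration pairs):

$$\text{Fine-tuning:}\quad \theta' = \theta - \eta \,\nabla_\theta \mathcal{L}(\theta; D)$$

$$\text{ICL:}\quad \hat{y} = f_\theta\big((x_1, y_1), \dots, (x_k, y_k),\; x_{\text{query}}\big), \qquad \theta \text{ fixed}$$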

Few-shot, zero-shot, instruction following: how they relate

“Context as a temporary dataset” mental model

A useful mental model:

This framing helps you design prompts: choose examples that span edge cases, match deployment distribution, and reduce ambiguity.

A tiny running example (text + image + action)

We will reuse a single task across modalities: “Sort-and-route”.

Goal: Determine the correct bin for an item: `RECYCLE`, `COMPOST`, or `TRASH`.

We’ll keep the output schema fixed:

{"bin": "RECYCLE|COMPOST|TRASH", "confidence": 0.0}

3) Mechanistic intuition (keep it accessible)

You do not need mechanistic interpretability to use ICL well, but a little intuition explains why formatting and example placement can make or break performance.

Transformer attention as retrieval over demonstrations

Transformers can attend to earlier tokens (and, in multimodal models, earlier visual tokens) that look relevant to the current query. This enables a kind of “retrieve-and-apply” behavior:

Mechanistic work has identified induction heads as one plausible circuit that supports pattern copying and continuation, which aligns with few-shot behavior in sequence models.

Pattern completion vs implicit task inference (“latent task” viewpoint)

Two helpful lenses:

Many theoretical and empirical analyses suggest that ICL is often closer to identifying the task than “learning new knowledge.”

Why ordering, formatting, and representativeness matter

ICL is fragile to:

A consistent template reduces the search space of “what task is being asked.”

Context length constraints + “lost-in-the-middle”

Even long-context models may not use all context uniformly. In long prompts, performance can drop when critical information is buried in the middle rather than near the beginning or end. This matters for demonstration-heavy prompts:

4) ICL for Language (LLMs): what it can do well

Classic capabilities

In language tasks, ICL can support:

A landmark example is the observation that sufficiently large language models can perform many tasks in a few-shot setting without gradient updates.

Tool-use / function calling as structured ICL

Tool-use can be seen as ICL with a strict output contract:

This often outperforms “free-form” prompting because the model’s job is narrowed: produce valid structured output.
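
A minimal sketch of what that contract can look like in code (the tool name, fields, and checks below are illustrative, not any particular vendor API): show the model the tool signature in the prompt, then validate whatever it returns before acting on it. If validation fails, retry or fall back rather than act on malformed output.

```python
import json

# Illustrative tool contract included in the prompt.
TOOL_SPEC = {
    "name": "route_item",
    "args": {"bin": ["RECYCLE", "COMPOST", "TRASH"], "confidence": "float in [0, 1]"},
}

def validate_tool_call(raw_output: str) -> dict:
    """Parse the model's output and enforce the contract before executing anything."""
    call = json.loads(raw_output)                        # must be valid JSON
    assert call.get("name") == TOOL_SPEC["name"]         # must name a known tool
    args = call.get("args", {})
    assert args.get("bin") in TOOL_SPEC["args"]["bin"]   # must use an allowed value
    assert 0.0 <= float(args.get("confidence", -1.0)) <= 1.0
    return call
```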

Prompt “programs”: chain-of-thought vs structured reasoning (safe + practical)

You can ask models to “reason,” but for production systems prefer:

A safe pattern is to request a brief explanation and keep the main decision machine-checkable.

Mini demo: few-shot schema extraction

Task: Extract a normalized record from a short message.

Prompt template

You are a data extraction system.
Return ONLY valid JSON matching this schema:
{"name": string, "email": string|null, "company": string|null, "bin": "RECYCLE"|"COMPOST"|"TRASH", "confidence": number}

Examples:
Input: "Hi, I'm Ana from GreenLoop. This is a glass bottle. ana@greenloop.io"
Output: {"name":"Ana","email":"ana@greenloop.io","company":"GreenLoop","bin":"RECYCLE","confidence":0.92}

Input: "Compostable paper cup, I'm Ben at CafeKraft"
Output: {"name":"Ben","email":null,"company":"CafeKraft","bin":"COMPOST","confidence":0.70}

Input: "Plastic wrapper. Contact: li@noodlebar.de"
Output: {"name":null,"email":"li@noodlebar.de","company":"Noodlebar","bin":"TRASH","confidence":0.66}

Now extract:
Input: "I’m Sara at UniSiegen. Banana peel. sara@uni-siegen.de"
Output:

Evaluation ideas
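
One simple harness (a sketch; the scoring function assumes you already have the model's raw text output and a gold record): check that the output parses as JSON, that the required fields are present with allowed values, and compute exact-match accuracy on the fields you care about.

```python
import json

ALLOWED_BINS = {"RECYCLE", "COMPOST", "TRASH"}
REQUIRED_KEYS = {"name", "email", "company", "bin", "confidence"}

def score_extraction(raw_output: str, gold: dict) -> dict:
    """Per-example checks: does the output parse, is it schema-valid, is the bin correct?"""
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"parses": False, "schema_valid": False, "bin_correct": False}
    schema_valid = (
        isinstance(pred, dict)
        and REQUIRED_KEYS <= set(pred)
        and pred["bin"] in ALLOWED_BINS
    )
    return {
        "parses": True,
        "schema_valid": schema_valid,
        "bin_correct": bool(schema_valid and pred["bin"] == gold["bin"]),
    }
```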

5) ICL for Computer Vision: from captions to visual tasks

Multimodal ICL often means: **the context contains images and text** interleaved as demonstrations, and the query includes a new image.

5.1 Vision-language ICL

Common VLM tasks:

Some VLM architectures are designed to ingest interleaved image-text sequences, which is what makes few-shot prompting with image examples possible.

Example (conceptual)

[Image A]
Q: Which bin? A: {"bin":"RECYCLE","confidence":0.90}

[Image B]
Q: Which bin? A: {"bin":"TRASH","confidence":0.80}

[Image C]
Q: Which bin? A: ?

Tip: Keep the exact output format identical across examples.

5.2 In-context visual classification

This is the “support set” idea: provide exemplar images with labels, then classify a new image.

How this differs from training a linear probe:

In practice, this can be more flexible for fast iteration, but less stable than learned adapters.
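
For contrast, the linear-probe baseline does involve a small training step on frozen features. A minimal sketch with scikit-learn, assuming you have an `embed(image)` function from a frozen vision encoder (the function and the `support_*`/`query_image` variables are placeholders):

```python
from sklearn.linear_model import LogisticRegression

# The encoder stays frozen; only this small linear head is trained.
X_support = [embed(img) for img in support_images]   # a few dozen labeled exemplars
y_support = support_labels                           # e.g. ["RECYCLE", "COMPOST", ...]

probe = LogisticRegression(max_iter=1000).fit(X_support, y_support)
prediction = probe.predict([embed(query_image)])[0]
```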

5.3 In-context dense/structured prediction (harder)

Dense tasks like detection/segmentation require structured outputs:

Some VLMs can approximate these via text-based formats (e.g., “box=(x1,y1,x2,y2)”), but results are often limited:
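
One concrete consequence is that you end up parsing and scoring free-text geometry yourself. A minimal sketch (the regex and helpers are illustrative) that extracts a `box=(x1,y1,x2,y2)` string from model output and scores it with IoU against a reference box:

```python
import re

def parse_box(text: str):
    """Extract the first box=(x1,y1,x2,y2) pattern from model output, or None."""
    m = re.search(r"box=\(\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", text)
    return tuple(float(g) for g in m.groups()) if m else None

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```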

When you need reliability, tools help:

Mini demo idea A: few-shot product categorization (4 labeled image examples)

Output schema

{"category": "BOTTLE|CAN|PAPER|FOOD_WASTE|PLASTIC_WRAP|OTHER", "bin":"RECYCLE|COMPOST|TRASH", "confidence": 0.0}

Prompt skeleton

You are a visual classification system.
Given an image of an item, output ONLY JSON in this schema:
{...}

Examples:
[Image 1: aluminum can]
Output: {"category":"CAN","bin":"RECYCLE","confidence":0.93}

[Image 2: banana peel]
Output: {"category":"FOOD_WASTE","bin":"COMPOST","confidence":0.91}

[Image 3: plastic wrapper]
Output: {"category":"PLASTIC_WRAP","bin":"TRASH","confidence":0.84}

[Image 4: glass bottle]
Output: {"category":"BOTTLE","bin":"RECYCLE","confidence":0.90}

Now classify:
[Image 5: query]
Output:

Evaluation

Mini demo idea B: visual analogies / ARC-style grid reasoning

ARC-style tasks can be framed as ICL with small input-output grids as demonstrations.
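
A tiny illustration of the framing (the grids and the transformation rule are invented for this example): each demonstration pairs an input grid with its transformed output grid, and the model must infer the rule and apply it to the query grid.

```python
# Hypothetical ARC-style few-shot episode: grids are lists of rows of color codes.
demos = [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},  # rule: swap the two colors
    {"input": [[0, 2], [2, 0]], "output": [[2, 0], [0, 2]]},
]
query = {"input": [[0, 3], [3, 0]]}  # expected output under the same rule: [[3, 0], [0, 3]]
```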

This is a good “pure reasoning” test for multimodal ICL because:

6) ICL for Robotics: from language to action

Robotics adds two twists:

  1. outputs are actions (continuous or discrete),
  2. you must care about safety, latency, and real-world failures.

6.1 Language-conditioned policies

Here, the prompt can include:

What counts as “context” in robotics

Trajectories

Example (text demonstration)

Task: place item into correct bin

Demo 1:
Item: "banana peel"
Plan: ["approach item","grasp","move to COMPOST bin","release"]

Demo 2:
Item: "glass bottle"
Plan: ["approach item","grasp","move to RECYCLE bin","release"]

Query:
Item: "plastic wrapper"
Plan:

6.2 Vision-language-action (VLA) models

VLA models unify:

The appeal:

6.3 Robot agents with tools

A practical systems view:

ICL shows up as:

Mini demo ideas

A) Few-shot pick-and-place variants

B) Failure recovery via one corrective demonstration

Safety note: why constraints/checks matter more in robotics than text

In robotics, you should assume the model may be wrong. Mitigations:

Treat ICL as a powerful interface, not a guarantee.
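
One cheap but effective mitigation is to validate every generated plan against an allowlist of tools before anything is executed. A minimal sketch (tool names mirror the template in Appendix A.3; which checks you actually need depends on your robot and environment):

```python
ALLOWED_TOOLS = {"perceive_item", "plan_motion", "execute", "open_gripper", "close_gripper"}

def verify_plan(plan: dict) -> bool:
    """Reject plans that use unknown tools, skip safety checks, or look low-confidence."""
    steps = plan.get("steps", [])
    if not steps or any(step.get("tool") not in ALLOWED_TOOLS for step in steps):
        return False
    if not plan.get("safety_checks"):           # require at least one explicit safety check
        return False
    return plan.get("confidence", 0.0) >= 0.5   # refuse to act on low-confidence plans
```

If `verify_plan` returns False, the safe behavior is to stop and escalate (re-prompt, ask a human, or do nothing) rather than execute anyway.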

7) When ICL fails (and how to make it work)

Common failure modes

Retrieval strategies (RAG for demonstrations)

Instead of writing prompts by hand, build a demonstration library and retrieve:

Retrieve 3–8 high-quality demonstrations rather than 30 mediocre ones.
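
A minimal sketch of embedding-based demonstration retrieval (the `embed` function and the library record layout are placeholders): embed the query, rank stored demonstrations by cosine similarity, and keep only the top few.

```python
import numpy as np

def retrieve_demos(query_text: str, library: list[dict], k: int = 5) -> list[dict]:
    """Return the k demonstrations most similar to the query.
    Each library entry is assumed to hold an "embedding" array plus its input/output."""
    q = embed(query_text)  # placeholder: any text/image embedding function
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(library, key=lambda d: cosine(q, d["embedding"]), reverse=True)
    return ranked[:k]
```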

Robust prompting checklist

8) Measuring ICL: how to evaluate properly

Language metrics

Vision metrics

Robotics metrics

Practical evaluation harness

Run controlled ablations:

Pseudo-protocol

scores = []
for seed in seeds:
    demos = retrieve(k, seed)                     # sample k demonstrations per seed
    for order in permutations(demos):             # ablate demonstration order
        prompt = build_prompt(order, query)
        y = model(prompt)
        scores.append(score(y))
report(mean(scores), std(scores), min(scores))    # mean, std, and worst-case

9) Practical playbook: building an ICL system end-to-end

Step 1: define the task + output contract

Step 2: curate a demonstration library (gold examples)

1. Store demonstrations as structured records (see the sketch after this list).

2. Maintain versioning and an audit trail.
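
A sketch of what one stored record might contain (field names are illustrative; the point is that each demonstration carries its own metadata for retrieval, filtering, and auditing):

```python
demo_record = {
    "id": "demo-0042",
    "modality": "text",                      # or "image", "trajectory"
    "input": "glass bottle",
    "output": {"bin": "RECYCLE", "confidence": 0.92},
    "tags": ["household", "glass"],          # metadata filters used at retrieval time
    "source": "curated-gold-v1",             # provenance for the audit trail
    "version": 3,                            # bump when the gold label is revised
    "embedding": None,                       # filled in by the indexing job
}
```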

Step 3: retrieve demos (embedding + metadata filters)

Step 4: inference orchestration (tooling, verifiers, fallback)
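
A sketch of the basic control flow (the `call_model` and `validate_output` functions are placeholders): call the model, verify the structured output, retry a bounded number of times, and fall back to a safe default rather than shipping an unverified answer.

```python
def run_with_fallback(prompt: str, max_retries: int = 2) -> dict:
    """Call the model, validate its structured output, and retry or fall back on failure."""
    for attempt in range(max_retries + 1):
        raw = call_model(prompt)           # placeholder for your model API
        try:
            return validate_output(raw)    # e.g. JSON parsing + schema/verifier checks
        except Exception:
            prompt += "\nReminder: return ONLY valid JSON matching the schema."
    return {"bin": None, "confidence": 0.0, "needs_review": True}  # safe default, flagged for humans
```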

Step 5: monitoring + continuous improvement

“ICL vs fine-tuning vs adapters” decision table

ICL vs. Fine-Tuning vs. Adapters

10) Future directions (what’s next)

11) Conclusion

ICL is best viewed as context-driven task adaptation: the prompt becomes a temporary dataset and interface that steers a frozen model. In text, this enables rapid few-shot generalization; in vision-language, it enables interleaved demonstration prompting but can be less robust; and in robotics, it becomes a pathway from language and perception to actions, where safety and verification are non-negotiable.

Cheat sheet

Appendix

A) Prompt templates (language, vision-language, robotics)

A.1 Language: few-shot classification with JSON schema

System: You are a classification system. Output ONLY valid JSON.

Schema:
{"bin":"RECYCLE|COMPOST|TRASH","confidence":0.0}

Examples:
Input: "glass bottle"
Output: {"bin":"RECYCLE","confidence":0.92}

Input: "banana peel"
Output: {"bin":"COMPOST","confidence":0.90}

Input: "plastic wrapper"
Output: {"bin":"TRASH","confidence":0.85}

Query:
Input: "{USER_TEXT}"
Output:

A.2 Vision-language: interleaved image demonstrations

System: You are a visual classifier. Output ONLY valid JSON.

Schema:
{"bin":"RECYCLE|COMPOST|TRASH","confidence":0.0}

Examples:
[Image 1]
Output: {"bin":"RECYCLE","confidence":0.92}

[Image 2]
Output: {"bin":"COMPOST","confidence":0.90}

[Image 3]
Output: {"bin":"TRASH","confidence":0.85}

Query:
[Image Q]
Output:

A.3 Robotics: plan + tool calls (safer than raw actions)

System: You are a robot planner. Use tools; do not output raw motor commands.

Tools (JSON):
- perceive_item() -> {"item":"...", "bbox":[...], "confidence":...}
- plan_motion(target) -> {"trajectory_id":"..."}
- execute(trajectory_id) -> {"status":"ok|fail"}
- open_gripper(), close_gripper()

Return a JSON plan:
{"steps":[{"tool":"...", "args":{...}}], "safety_checks":[...], "confidence":0.0}

Demos:
Demo 1: ...
Demo 2: ...

Query: sort the item on the table into the correct bin.
Output:

B) Example formatting conventions (few-shot tables)

Use a single canonical format. For example:

Few Shot Table

Repeat exactly for each demonstration, then the query row.

C) Glossary

D) References (grouped)

D.1 ICL foundations and analysis

D.2 Multimodal in-context learning (vision-language)

D.3 Robotics and VLA models