In-context learning (ICL) is the ability of foundation models to adapt their behavior at inference time by conditioning on examples, instructions, and other task-specific information placed directly in the prompt. This post builds a practical, cross-modal view of ICL: how it shows up in large language models, how it changes when images enter the context, and what it means for robotics, where “outputs” are actions rather than tokens. We will develop an intuition for what ICL is (and is not), connect it to transformer mechanisms such as induction heads and long-context effects, and then walk through concrete prompting patterns, failure modes, and evaluation strategies. The goal is to leave you with a mental model and a playbook you can use to design, test, and deploy ICL-based systems across language, computer vision, and robotics.
Who it’s for
Key takeaways
The problem: task adaptation is expensive
Traditional adaptation usually means:
That pipeline is slow, compute-heavy, and operationally risky. Even “lightweight” updates (LoRA/adapters) still require data management, evaluation cycles, and model versioning.
The promise: adapt at inference time using demonstrations
ICL offers a different workflow: you keep the model fixed and change only the prompt. As popularized by large-scale language models, models can often solve new tasks from a handful of labeled examples in the prompt without any gradient updates (e.g., GPT-3 style few-shot prompting).
Why multimodal + robotics makes it more interesting
Once you leave pure text, the challenges get sharper:
ICL becomes a unifying interface: the same idea (provide demonstrations in the prompt) can drive behavior across text outputs, visual reasoning outputs, and action outputs.
Definition: conditioning on context vs updating parameters
A clean way to say it:
Fine-tuning changes parameters:

θ′ = θ − η ∇θ L(θ; D_task), and predictions then come from y = f_θ′(x)

ICL changes the input to the model, not θ:

y = f_θ(x₁, y₁, …, x_k, y_k, x_query)

So the "learning" is in how the model uses the prompt context to condition its outputs.
Few-shot, zero-shot, instruction following: how they relate
“Context as a temporary dataset” mental model
A useful mental model:
This framing helps you design prompts: choose examples that span edge cases, match deployment distribution, and reduce ambiguity.
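To make the "temporary dataset" framing concrete, here is a minimal sketch that turns a handful of labeled examples into a few-shot prompt for the running task. The names `Demo` and `build_prompt` are illustrative, not from any particular library.

```python
import json
from dataclasses import dataclass

@dataclass
class Demo:
    text: str    # input shown to the model
    label: dict  # expected JSON output

def build_prompt(demos: list[Demo], query: str) -> str:
    """Render demonstrations and the query with one consistent template."""
    lines = ['Return ONLY valid JSON: {"bin": "RECYCLE|COMPOST|TRASH", "confidence": 0.0}',
             "", "Examples:"]
    for d in demos:
        lines.append(f'Input: "{d.text}"')
        lines.append(f"Output: {json.dumps(d.label)}")
    lines += ["", "Now classify:", f'Input: "{query}"', "Output:"]
    return "\n".join(lines)

demos = [
    Demo("glass bottle", {"bin": "RECYCLE", "confidence": 0.92}),
    Demo("banana peel", {"bin": "COMPOST", "confidence": 0.90}),
]
print(build_prompt(demos, "plastic wrapper"))
```

Swapping the `demos` list is the ICL analogue of swapping the training set: the model never changes, only its temporary dataset does.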
A tiny running example (text + image + action)
We will reuse a single task across modalities: “Sort-and-route”.
Goal: Determine the correct bin for an item: `RECYCLE`, `COMPOST`, or `TRASH`.
We’ll keep the output schema fixed:
{"bin": "RECYCLE|COMPOST|TRASH", "confidence": 0.0}
You do not need mechanistic interpretability to use ICL well, but a little intuition explains why formatting and example placement can make or break performance.
Transformer attention as retrieval over demonstrations
Transformers can attend to earlier tokens (and, in multimodal models, earlier visual tokens) that look relevant to the current query. This enables a kind of “retrieve-and-apply” behavior:
Mechanistic work has identified induction heads as one plausible circuit that supports pattern copying and continuation, which aligns with few-shot behavior in sequence models.
Pattern completion vs implicit task inference (“latent task” viewpoint)
Two helpful lenses:
Many theoretical and empirical analyses suggest that ICL is often closer to identifying the task than “learning new knowledge.”
Why ordering, formatting, and representativeness matter
ICL is fragile to:
A consistent template reduces the search space of “what task is being asked.”
Context length constraints + “lost-in-the-middle”
Even long-context models may not use all context uniformly. In long prompts, performance can drop when critical information is buried in the middle rather than near the beginning or end. This matters for demonstration-heavy prompts:
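If you must pack many demonstrations into a long prompt, one pragmatic mitigation is to keep the most relevant ones near the edges rather than buried in the middle. The sketch below assumes you can score each demo's relevance (for example, with embedding similarity); it is a heuristic for the lost-in-the-middle effect, not a guarantee.

```python
def order_for_long_context(demos_with_scores):
    """Place high-relevance demos at the start and end of the prompt.

    demos_with_scores: list of (demo, relevance) pairs; higher = more relevant.
    """
    ranked = sorted(demos_with_scores, key=lambda x: x[1], reverse=True)
    front, back = [], []
    for i, (demo, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(demo)  # alternate between the two edges
    return front + back[::-1]  # least relevant demos end up in the middle
```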
Classic capabilities
In language tasks, ICL can support:
A landmark example is the observation that sufficiently large language models can perform many tasks in a few-shot setting without gradient updates.
Tool-use / function calling as structured ICL
Tool-use can be seen as ICL with a strict output contract:
This often outperforms “free-form” prompting because the model’s job is narrowed: produce valid structured output.
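A minimal sketch of such a contract: the model is asked to emit a tool call as JSON, and the caller validates it against a small registry before doing anything. The tool names here are illustrative, not a real API.

```python
import json

TOOLS = {
    "lookup_material": {"required_args": {"item"}},
    "route_to_bin": {"required_args": {"bin"}},
}

def validate_tool_call(raw: str) -> dict:
    """Parse a model-emitted tool call and enforce the contract."""
    call = json.loads(raw)  # raises on malformed JSON
    spec = TOOLS.get(call.get("tool"))
    if spec is None:
        raise ValueError(f"unknown tool: {call.get('tool')!r}")
    missing = spec["required_args"] - set(call.get("args", {}))
    if missing:
        raise ValueError(f"missing args: {missing}")
    return call

call = validate_tool_call('{"tool": "route_to_bin", "args": {"bin": "RECYCLE"}}')
```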
Prompt “programs”: chain-of-thought vs structured reasoning (safe + practical)
You can ask models to “reason,” but for production systems prefer:
A safe pattern is to request a brief explanation and keep the main decision machine-checkable.
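For example, you can allow a short free-text `rationale` field while gating downstream logic only on fields you can check mechanically. A sketch, using the post's running schema plus an assumed `rationale` field:

```python
import json

ALLOWED_BINS = {"RECYCLE", "COMPOST", "TRASH"}

def parse_decision(raw: str) -> dict:
    out = json.loads(raw)
    if out.get("bin") not in ALLOWED_BINS:
        raise ValueError("bin outside the allowed set")
    if not 0.0 <= float(out.get("confidence", -1)) <= 1.0:
        raise ValueError("confidence out of range")
    out.setdefault("rationale", "")  # explanation is logged, never branched on
    return out
```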
Mini demo: few-shot schema extraction
Task: Extract a normalized record from a short message.
Prompt template
You are a data extraction system.
Return ONLY valid JSON matching this schema:
{"name": string, "email": string|null, "company": string|null, "bin": "RECYCLE"|"COMPOST"|"TRASH", "confidence": number}
Examples:
Input: "Hi, I'm Ana from GreenLoop. This is a glass bottle. ana@greenloop.io"
Output: {"name":"Ana","email":"ana@greenloop.io","company":"GreenLoop","bin":"RECYCLE","confidence":0.92}
Input: "Compostable paper cup, I'm Ben at CafeKraft"
Output: {"name":"Ben","email":null,"company":"CafeKraft","bin":"COMPOST","confidence":0.70}
Input: "Plastic wrapper. Contact: li@noodlebar.de"
Output: {"name":null,"email":"li@noodlebar.de","company":"Noodlebar","bin":"TRASH","confidence":0.66}
Now extract:
Input: "I’m Sara at UniSiegen. Banana peel. sara@uni-siegen.de"
Output:
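A sketch of how this prompt could be sent to a chat-style completion API and post-processed. `call_model` is a placeholder for whatever client you use, and the retry-on-invalid-JSON loop is a common pattern rather than a requirement.

```python
import json

def call_model(prompt: str) -> str:
    """Placeholder for your model client (e.g., an HTTP call). Returns raw text."""
    raise NotImplementedError

def extract_record(prompt: str, max_retries: int = 2) -> dict:
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            record = json.loads(raw)
            if record.get("bin") in {"RECYCLE", "COMPOST", "TRASH"}:
                return record
        except json.JSONDecodeError:
            pass  # fall through and retry with the same prompt
    raise RuntimeError("model never produced a valid record")
```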
Evaluation ideas
Multimodal ICL often means: **the context contains images and text** interleaved as demonstrations, and the query includes a new image.
5.1 Vision-language ICL
Common VLM tasks:
Some VLM architectures are designed to ingest interleaved image-text sequences and perform few-shot prompting with examples.
Example (conceptual)
[Image A]
Q: Which bin? A: {"bin":"RECYCLE","confidence":0.90}
[Image B]
Q: Which bin? A: {"bin":"TRASH","confidence":0.80}
[Image C]
Q: Which bin? A: ?
Tip: Keep the exact output format identical across examples.
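Many multimodal chat APIs accept a list of interleaved content parts. The sketch below builds such a message list in a generic shape; the exact part schema differs by provider, so treat the dictionary keys as illustrative.

```python
def build_interleaved_messages(demo_images, demo_answers, query_image):
    """demo_images: image URLs or bytes; demo_answers: JSON strings like the examples above."""
    parts = [{"type": "text",
              "text": 'Answer ONLY with JSON: {"bin": "RECYCLE|COMPOST|TRASH", "confidence": 0.0}'}]
    for img, ans in zip(demo_images, demo_answers):
        parts.append({"type": "image", "image": img})
        parts.append({"type": "text", "text": f"Q: Which bin? A: {ans}"})
    parts.append({"type": "image", "image": query_image})
    parts.append({"type": "text", "text": "Q: Which bin? A:"})
    return [{"role": "user", "content": parts}]
```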
5.2 In-context visual classification
This is the “support set” idea: provide exemplar images with labels, then classify a new image.
How this differs from training a linear probe:
In practice, this can be more flexible for fast iteration, but less stable than learned adapters.
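For contrast, a linear probe on frozen embeddings looks like the sketch below (scikit-learn, with random arrays standing in for real image embeddings). It needs a small training step, but its behavior is stable and inspectable, whereas the ICL route just swaps the support set in the prompt.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_support = rng.normal(size=(12, 512))           # stand-in for frozen image embeddings
y_support = ["RECYCLE", "COMPOST", "TRASH"] * 4  # labels of the exemplar images
probe = LogisticRegression(max_iter=1000).fit(X_support, y_support)

x_query = rng.normal(size=(1, 512))              # embedding of the query image
print(probe.predict(x_query)[0])
```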
5.3 In-context dense/structured prediction (harder)
Dense tasks like detection/segmentation require structured outputs:
Some VLMs can approximate these via text-based formats (e.g., “box=(x1,y1,x2,y2)”), but results are often limited:
When you need reliability, tools help:
Mini demo idea A: few-shot product categorization (4 labeled image examples)
Output schema
{"category": "BOTTLE|CAN|PAPER|FOOD_WASTE|PLASTIC_WRAP|OTHER", "bin":"RECYCLE|COMPOST|TRASH", "confidence": 0.0}
Prompt skeleton
You are a visual classification system.
Given an image of an item, output ONLY JSON in this schema:
{...}
Examples:
[Image 1: aluminum can]
Output: {"category":"CAN","bin":"RECYCLE","confidence":0.93}
[Image 2: banana peel]
Output: {"category":"FOOD_WASTE","bin":"COMPOST","confidence":0.91}
[Image 3: plastic wrapper]
Output: {"category":"PLASTIC_WRAP","bin":"TRASH","confidence":0.84}
[Image 4: glass bottle]
Output: {"category":"BOTTLE","bin":"RECYCLE","confidence":0.90}
Now classify:
[Image 5: query]
Output:
Evaluation
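One minimal way to score this demo, assuming you have collected (predicted record, gold bin) pairs for a held-out set of query images:

```python
from collections import Counter

def score(pairs):
    """pairs: list of (predicted_record, gold_bin) tuples."""
    n = max(len(pairs), 1)
    correct = sum(pred.get("bin") == gold for pred, gold in pairs)
    support_per_bin = Counter(gold for _, gold in pairs)
    mean_conf = sum(float(pred.get("confidence", 0)) for pred, _ in pairs) / n
    return {"accuracy": correct / n,
            "support_per_bin": dict(support_per_bin),
            "mean_confidence": mean_conf}
```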
Mini demo idea B: visual analogies / ARC-style grid reasoning
ARC-style tasks can be framed as ICL with small input-output grids as demonstrations.
This is a good “pure reasoning” test for multimodal ICL because:
Robotics adds two twists:
6.1 Language-conditioned policies
Here, the prompt can include:
What counts as “context” in robotics

Example (text demonstration)
Task: place item into correct bin
Demo 1:
Item: "banana peel"
Plan: ["approach item","grasp","move to COMPOST bin","release"]
Demo 2:
Item: "glass bottle"
Plan: ["approach item","grasp","move to RECYCLE bin","release"]
Query:
Item: "plastic wrapper"
Plan:
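Before any plan like this reaches a controller, it is worth checking that every step comes from a closed vocabulary and that the target bin is one the system knows about. A sketch, with the step strings mirroring the demos above:

```python
ALLOWED_STEPS = {"approach item", "grasp", "release"}
ALLOWED_BINS = {"RECYCLE", "COMPOST", "TRASH"}

def validate_plan(plan: list[str]) -> bool:
    for step in plan:
        if step in ALLOWED_STEPS:
            continue
        if step.startswith("move to ") and step.endswith(" bin"):
            bin_name = step[len("move to "):-len(" bin")]
            if bin_name in ALLOWED_BINS:
                continue
        return False  # unknown step: reject the whole plan
    return True

assert validate_plan(["approach item", "grasp", "move to TRASH bin", "release"])
```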
6.2 Vision-language-action (VLA) models
VLA models unify:
The appeal:
6.3 Robot agents with tools
A practical systems view:
ICL shows up as:
Mini demo ideas
A) Few-shot pick-and-place variants
B) Failure recovery via one corrective demonstration
Safety note: why constraints/checks matter more in robotics than text
In robotics, you should assume the model may be wrong. Mitigations:
Treat ICL as a powerful interface, not a guarantee.
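One concrete shape this can take is a guard around tool execution: a confidence threshold, a tool whitelist, and a hard cap on plan length, all enforced outside the model. The thresholds below are placeholders, and the tool names follow the appendix template.

```python
MAX_STEPS = 8
MIN_CONFIDENCE = 0.6
WHITELISTED_TOOLS = {"perceive_item", "plan_motion", "execute",
                     "open_gripper", "close_gripper"}

def guard_and_execute(plan: dict, execute_step) -> None:
    """Run a model-produced plan only if it passes basic safety checks."""
    if plan.get("confidence", 0.0) < MIN_CONFIDENCE:
        raise RuntimeError("low confidence: escalate to a human instead of executing")
    steps = plan.get("steps", [])
    if len(steps) > MAX_STEPS:
        raise RuntimeError("plan too long: likely degenerate output")
    for step in steps:
        if step.get("tool") not in WHITELISTED_TOOLS:
            raise RuntimeError(f"tool not whitelisted: {step.get('tool')!r}")
    for step in steps:
        execute_step(step)  # real systems would also check workspace limits here
```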
7) When ICL fails (and how to make it work)
Common failure modes
Retrieval strategies (RAG for demonstrations)
Instead of writing prompts by hand, build a demonstration library and retrieve:
Retrieve 3–8 high-quality demonstrations rather than 30 mediocre ones.
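The retrieval step over the demonstration library can be as simple as cosine similarity on precomputed embeddings. The sketch below uses numpy and assumes `library_embeddings` is an (n, d) array aligned with `library_demos`; both names are illustrative.

```python
import numpy as np

def retrieve_demos(query_embedding, library_embeddings, library_demos, k=5):
    """Return the k demos whose embeddings are most similar to the query."""
    q = query_embedding / np.linalg.norm(query_embedding)
    lib = library_embeddings / np.linalg.norm(library_embeddings, axis=1, keepdims=True)
    sims = lib @ q                      # cosine similarity to every library demo
    top = np.argsort(-sims)[:k]         # indices of the k best matches
    return [library_demos[i] for i in top]
```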
Robust prompting checklist
8) Measuring ICL: how to evaluate properly
Language metrics
Vision metrics
Robotics metrics
Practical evaluation harness
Run controlled ablations:
Pseudo-protocol
for seed in seeds:
    demos = retrieve(k, seed)
    for order in permutations(demos):
        prompt = build_prompt(order, query)
        y = model(prompt)
        score(y)
report mean, std, and worst-case
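A runnable version of that protocol might look like the sketch below, where `model`, `build_prompt`, `retrieve`, and `scorer` are supplied by the caller. Keep k small (roughly 4 or 5), since the number of orderings grows factorially.

```python
import itertools
import statistics

def ablate(model, build_prompt, retrieve, scorer, query, seeds, k=4):
    """Sweep over demo samples and orderings; report aggregate statistics."""
    scores = []
    for seed in seeds:
        demos = retrieve(k, seed)
        for order in itertools.permutations(demos):
            y = model(build_prompt(list(order), query))
            scores.append(scorer(y))
    return {"mean": statistics.mean(scores),
            "std": statistics.pstdev(scores),
            "worst": min(scores)}
```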
9) Practical playbook: building an ICL system end-to-end
Step 1: define the task + output contract
Step 2: curate a demonstration library (gold examples)
2. Maintain versioning and audit trail.
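One lightweight way to keep that audit trail is to store each gold demonstration as a small versioned record. The fields here are suggestions, not a required schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class DemoRecord:
    demo_id: str
    modality: str             # "text", "image", or "trajectory"
    input_ref: str            # raw text, or a pointer to an image/episode
    expected_output: dict     # the gold JSON output
    tags: list = field(default_factory=list)
    version: int = 1
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    reviewed_by: Optional[str] = None
```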
Step 3: retrieve demos (embedding + metadata filters)
Step 4: inference orchestration (tooling, verifiers, fallback)
Step 5: monitoring + continuous improvement
“ICL vs fine-tuning vs adapters” decision table

10) Future directions (what’s next)
11) Conclusion
ICL is best viewed as context-driven task adaptation: the prompt becomes a temporary dataset and interface that steers a frozen model. In text, this enables rapid few-shot generalization; in vision-language, it enables interleaved demonstration prompting but can be less robust; and in robotics, it becomes a pathway from language and perception to actions, where safety and verification are non-negotiable.
Cheat sheet
Appendix
A) Prompt templates (language, vision-language, robotics)
A.1 Language: few-shot classification with JSON schema
System: You are a classification system. Output ONLY valid JSON.
Schema:
{"bin":"RECYCLE|COMPOST|TRASH","confidence":0.0}
Examples:
Input: "glass bottle"
Output: {"bin":"RECYCLE","confidence":0.92}
Input: "banana peel"
Output: {"bin":"COMPOST","confidence":0.90}
Input: "plastic wrapper"
Output: {"bin":"TRASH","confidence":0.85}
Query:
Input: "{USER_TEXT}"
Output:
A.2 Vision-language: interleaved image demonstrations
System: You are a visual classifier. Output ONLY valid JSON.
Schema:
{"bin":"RECYCLE|COMPOST|TRASH","confidence":0.0}
Examples:
[Image 1]
Output: {"bin":"RECYCLE","confidence":0.92}
[Image 2]
Output: {"bin":"COMPOST","confidence":0.90}
[Image 3]
Output: {"bin":"TRASH","confidence":0.85}
Query:
[Image Q]
Output:
A.3 Robotics: plan + tool calls (safer than raw actions)
System: You are a robot planner. Use tools; do not output raw motor commands.
Tools (JSON):
- perceive_item() -> {"item":"...", "bbox":[...], "confidence":...}
- plan_motion(target) -> {"trajectory_id":"..."}
- execute(trajectory_id) -> {"status":"ok|fail"}
- open_gripper(), close_gripper()
Return a JSON plan:
{"steps":[{"tool":"...", "args":{...}}], "safety_checks":[...], "confidence":0.0}
Demos:
Demo 1: ...
Demo 2: ...
Query: sort the item on the table into the correct bin.
Output:
B) Example formatting conventions (few-shot tables)
Use a single canonical format. For example:

| input | output |
| --- | --- |
| "glass bottle" | {"bin":"RECYCLE","confidence":0.92} |
| "banana peel" | {"bin":"COMPOST","confidence":0.90} |

Repeat exactly for each demonstration, then add the query row.
C) Glossary
D) References (grouped)
D.1 ICL foundations and analysis
D.2 Multimodal in-context learning (vision-language)
D.3 Robotics and VLA models