Glossary

What is prompt engineering?

Prompt engineering is the practice of designing, refining, and testing the instructions given to a generative AI system so its responses are more accurate, useful, and on-task. The "prompt" is everything the model sees before it answers — the user's question, plus any system instructions, examples, retrieved context, and structural guidance the developer adds around it.

The term covers a spectrum, from a marketer tweaking the wording of a ChatGPT request, to an AI team writing the multi-thousand-token system prompt that governs a production AI agent. OpenAI, Anthropic, Google Cloud, IBM, and AWS all publish prompt engineering guidance for developers building on their models.

This page covers what prompt engineering is, how it works, the standard techniques, what good prompts can and cannot do on their own, how prompt engineering compares to fine-tuning and retrieval-augmented generation, how it is evolving into context engineering, and what prompt engineering looks like inside a production customer service AI.

Prompt engineering in one sentence

Prompt engineering is the discipline of writing the instructions that tell a generative AI model what to do, what not to do, and how to do it well.

Where the term comes from

The phrase entered general use in 2022, as developers building on GPT-3 discovered that small changes in wording produced large changes in output quality. The early framing was almost mystical — "prompt whisperer," "prompt magic" — and most of the writing on the topic treated it as an art.

That framing has aged. Modern prompt engineering is a measurable, iterative engineering discipline with established techniques, documented patterns, and built-in evaluation. The art-vs-science framing is less useful today than it once was. A well-written prompt is a hypothesis the team tests, measures, and revises.

The word "prompt" itself is older than generative AI — it has been used for decades to mean the cue that elicits a response, whether from a person, a command-line interface, or an earlier generation of language model. Generative AI inherited the term and gave it weight.

How prompt engineering works

Every modern prompt-engineering workflow, regardless of model or use case, follows the same four-step pattern.

1. Write the prompt

The prompt is constructed from one or more of these layers, depending on the system:

  • System prompt. The persistent instructions that frame the model's role, scope, voice, and rules. In a production system, this is often hundreds to thousands of tokens long and stays the same across conversations.

  • User prompt. The specific request or message in the moment. In a chatbot, this is what the user types.

  • Examples. Sample input-output pairs shown to the model so it can learn the pattern from context. Sometimes called few-shot examples.

  • Retrieved context. Facts pulled from a connected source — a knowledge base, a customer record, a product catalog — and inserted into the prompt at runtime. This is the retrieval-augmented generation (RAG) layer.

  • Structural guidance. Format requirements (return JSON, use markdown, keep responses under 200 words), constraints (never recommend a competitor, never quote a price), and reasoning scaffolds (think step by step, list trade-offs first).

2. Choose a technique

Different tasks suit different prompting patterns. The standard library:

Technique

What it is

When to use

Zero-shot prompting

Give the model the instruction with no examples.

Simple, well-known tasks where the model has clearly seen similar work in training.

Few-shot prompting

Include 2–5 input-output examples in the prompt before the real request.

Tasks where the desired format or style is hard to describe in words but easy to show.

Chain-of-thought prompting

Ask the model to reason step by step before producing the final answer.

Multi-step reasoning, math, complex analysis, debugging.

Role prompting

Tell the model who it is — "you are a senior CX agent at a luxury retailer" — to anchor voice and judgment.

Tone-sensitive responses, persona-driven assistants.

Structured output prompting

Demand a specific output shape — JSON schema, table, bullet list.

Anything downstream code will parse.

Self-consistency prompting

Run the same prompt several times and pick the most common answer.

High-stakes outputs where a single sample is too noisy.

Tool-use prompting

Tell the model when to call an external function instead of answering directly.

Workflows where the model needs to look up data, run code, or take action.

A production prompt usually combines several of these — a role assignment, a chain-of-thought reasoning instruction, structured output requirements, and a few examples — stacked into a single system prompt.

3. Test against real cases

A prompt that works on one example often breaks on another. Modern prompt engineering treats every prompt as a hypothesis tested against a labeled set of inputs — sometimes a handful, sometimes thousands. The team runs the prompt, scores the output (manually, with another model as a judge, or against ground-truth labels), and tracks which prompt versions perform best on which slices of input.

This is the step that separates production prompt engineering from casual prompt writing. Without evaluation, a "better" prompt is just a guess.

4. Iterate

The team revises the prompt — clearer instruction, better examples, tighter constraints — and reruns the eval. Most teams version their prompts the way they version code, with a record of which version produced which results.

This is why "prompt engineering" is the right name. The shape of the work — write, test, measure, revise — is engineering, not magic.

Examples of prompt techniques in practice

The same task — write a friendly reply to a customer asking about a delayed order — written four ways:

Technique

Prompt example

Zero-shot

"Write a friendly customer service reply to a customer asking why their order is late."

Few-shot

"Here are three example replies our brand has sent in the past: [example 1] [example 2] [example 3]. Now write a reply for this new message: [customer message]."

Chain-of-thought

"Before drafting the reply, list (a) what the customer is feeling, (b) what they need to know, and (c) what they need us to do. Then write the reply."

Role-prompted

"You are a senior customer service representative at a high-end retailer. Your voice is warm, direct, and never apologizes more than once. Write a reply to this customer message: [message]."

Each variation will produce different output from the same model. Prompt engineering is the work of figuring out which variation produces the best results for your specific task — and proving it with data, not intuition.

What good prompt engineering can do

Used well, prompt engineering changes what a generative AI model is capable of producing in production:

  • Increase accuracy. A well-structured prompt with examples and constraints reduces wrong answers on the same model, no fine-tuning required.

  • Anchor voice and tone. A role-and-rules system prompt is how brands keep AI on-voice without retraining a model.

  • Improve reliability. Structured output requirements let downstream systems parse model output without guessing.

  • Reduce off-task responses. Explicit scope rules — "only answer questions about orders" — cut hallucination on adjacent topics.

  • Speed up reasoning. Chain-of-thought prompts measurably improve performance on multi-step problems.

  • Lower model cost. A good prompt can let a smaller, cheaper model perform tasks that would otherwise require a larger one.

Prompt engineering is often one of the fastest and most cost-effective ways to improve AI performance. It does not require new training data, new model weights, or new infrastructure. It changes outputs by changing inputs.

What prompt engineering cannot do (without help)

The structural limits, less often covered in vendor explainers:

  • It cannot give a model new facts. A prompt cannot teach a model knowledge it was never trained on, except by including those facts in the prompt itself at runtime. That is where retrieval-augmented generation comes in.

  • It cannot make an inaccurate model accurate. If the base model is wrong about a domain, a better prompt will produce the wrong answer more confidently, not the right one.

  • It cannot guarantee behavior. A prompt is an instruction; the model is a probability machine. The same prompt can produce different outputs, and edge cases will leak through any rule set.

  • It cannot replace evaluation. A prompt that "looks good" is a draft, not a system. Without measurement, you don't know if it works.

  • It cannot remove the need for guardrails. Even a perfect prompt can be subverted by prompt injection — adversarial input designed to override the system instructions. Production systems need a defense layer, not just better wording.

The takeaway: prompt engineering is necessary but not sufficient. A production AI system is prompts plus grounding plus tools plus guardrails plus evaluation, not the prompt alone.

Prompt engineering vs fine-tuning vs RAG

These three are often confused in customer service marketing. The clean separation:

Approach

What it changes

When to use

Prompt engineering

Changes the input to the model. Model weights stay the same.

When you need to shape behavior with the model and data you already have. The first lever to pull.

Retrieval-augmented generation (RAG)

Adds facts to the prompt from an external source at runtime. Model weights stay the same.

When the model needs to know things — current policies, customer data, product specs — that aren't in its training data.

Fine-tuning

Updates the model's weights using a curated training dataset. Produces a new model.

When prompt engineering and RAG can't get you there — usually for highly specialized domains or tasks the base model is bad at.

Most production AI stacks combine all three. Prompt engineering shapes the behavior. RAG supplies the facts. Fine-tuning specializes the model for tasks where the first two aren't enough. The choice is rarely either-or.

Prompt engineering vs context engineering

A newer term, increasingly used in 2025 and 2026, is context engineering: the broader discipline of deciding everything the model sees at runtime — not just the wording of the system prompt, but which knowledge documents are retrieved, how prior conversation turns are summarized, which tools are made available, and how all of that is assembled into the model's context window.

The simple way to read the relationship: prompt engineering is one part of context engineering. Prompt engineering is the wording. Context engineering is the wording plus everything else the model receives. As production systems have gotten more sophisticated — longer context windows, retrieval pipelines, multi-step agents — the work has expanded beyond the prompt itself.

The phrase "prompt engineering is dead" has been a recurring claim since 2024. The accurate version is that pure prompt-writing in isolation matters less than it used to, because models follow instructions better and because context engineering is where the leverage now lives. The craft hasn't gone away; it's gotten bigger.

Where prompt engineering is used

The applications most commonly built on prompt engineering today:

  • Customer service. System prompts that govern AI agents handling refunds, account changes, order lookups, and conversational support.

  • Software development. Prompts inside coding assistants and code review tools.

  • Marketing and content. Brand-voice prompts for blog drafts, ad copy variants, and email campaigns.

  • Sales. Prompts powering outbound personalization, call summary generation, and meeting prep tools.

  • Knowledge work. Prompts in research assistants, internal search tools, and document analysis workflows.

  • Operations. Prompts in data-querying tools that let non-technical staff ask questions of business data in plain English.

  • Safety and security. Prompts that govern content moderation, prompt injection defense, and adversarial input filtering.

The pattern across these: prompt engineering is the layer that turns a general-purpose model into something useful for a specific job.

How prompts are evaluated

A prompt isn't done when it sounds right; it's done when it performs against a measurable bar. The common evaluation axes:

  • Accuracy. Does the model produce the correct output for the input? Measured against labeled data or compared to a reference answer.

  • Format compliance. Does the output match the requested shape — valid JSON, correct field names, length within bounds?

  • On-scope rate. How often does the model stay within the topic the prompt restricted it to?

  • Latency. How long does the response take? Longer prompts and more reasoning steps cost time.

  • Cost. Tokens consumed per request. Verbose prompts and large few-shot example sets cost real money at scale.

  • Robustness. Does the prompt hold up against unusual or adversarial inputs?

  • Safety. Does the prompt resist prompt injection and avoid producing harmful or off-brand outputs?

Most teams use a mix of human review, model-as-judge scoring, and golden-set comparison. The right evaluation method depends on the task — a code-generation prompt is evaluated differently from a customer-service-tone prompt.

Prompt engineering in customer service

This is the section where Gladly's position is on the record.

In a consumer-facing chatbot, the customer types the prompt. In a production customer service AI, the customer does not write the prompt that matters. The prompt that matters is the system prompt — written by the AI team, sometimes thousands of tokens long, that defines how the AI handles every conversation. It governs voice, scope, escalation rules, what to do when the order lookup fails, how to handle a frustrated customer, and what never to say.

That system prompt is where most of the work happens. The customer's message is the input the prompt is built to handle. The brand's voice, policies, and judgment all flow through it.

Prompt engineering alone, though, will not produce a customer service AI that resolves real conversations. A well-written prompt on an ungrounded model will be fluent and frequently wrong — confidently citing policies that don't exist, quoting prices that have changed, promising refunds the company doesn't honor. The same prompt connected to the customer's full conversation history, order data, brand voice guidelines, and current policies is a different system. It resolves the conversation.

Across brands using Gladly AI, the difference between prompt-engineered-but-ungrounded AI and prompt-engineered-and-grounded AI shows up in the outcomes:

  • KÜHL runs a 59% AI resolution rate with a 120% lift in revenue per conversation.

  • Breeze Airways has AI enhancing 71% of conversations while maintaining high CSAT.

  • Smith Optics hits a 67% AI resolution rate on product-help conversations — the kind that fail when a prompt is doing all the heavy lifting without grounding underneath it.

The prompt sets the rules. The grounding gives the model the facts to follow them.

Frequently asked questions

Going deeper?

See how Gladly customers put this into practice in their day-to-day customer service work.