What is a large language model (LLM)?

LLM stands for large language model. A large language model is a type of artificial intelligence trained on enormous amounts of text that learns the statistical patterns of language well enough to read, write, summarize, translate, and answer questions in natural language. The best-known LLM families include GPT, Claude, Gemini, Llama, Mistral, and DeepSeek, which power products such as ChatGPT, Claude, Gemini, and Copilot. The model itself is the engine, not the application.

The word "large" refers to two things: the size of the model (often hundreds of billions of parameters) and the size of the training dataset (typically trillions of words pulled from books, articles, websites, code repositories, and other text). Stanford's Human-Centered AI Institute and Wikipedia both define LLMs in roughly these terms.

This page covers what an LLM is, how it works, what the major LLMs are today, what LLMs can and cannot do, how they differ from related terms like AI agents, agentic AI, and small language models, and how grounding turns a general-purpose LLM into a system that resolves real customer conversations.

LLM in one sentence

A large language model is software that learns the patterns of human language from huge volumes of text and uses those patterns to read, write, and respond in natural language.

What "LLM" stands for and where the name comes from

LLM is the acronym for large language model. The name is descriptive rather than technical:

  • Large — the model has many parameters (the numbers it tunes during training) and was trained on a very large dataset.

  • Language — the model works with human language, primarily text. Multimodal LLMs also handle images, audio, and video, but the language layer is the core.

  • Model — a model in machine learning is a mathematical function trained to map inputs to outputs.

A "small language model" (SLM) is the same kind of system at a smaller scale — fewer parameters, narrower training data, usually purpose-built for a single task. More on that comparison below.

How LLMs work

Every modern LLM, regardless of vendor, follows roughly the same five-step pattern.

1. The transformer architecture

Almost all current LLMs are built on the transformer architecture, introduced by Google researchers in 2017. The transformer processes text in chunks called tokens (roughly, pieces of words) and uses a mechanism called attention to weigh how each token relates to every other token in the input. Attention is what lets an LLM keep track of context across long passages.
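
The attention step can be sketched in a few lines of plain Python. This is a simplified illustration, not how any particular vendor implements it: the three token vectors are made up, and a real transformer first passes each token through learned projections to get separate query, key, and value vectors, which this sketch skips.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using those weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Three made-up token vectors (hypothetical 2-dimensional embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(tokens, tokens, tokens)
```

Each row of attended is a weighted blend of all three token vectors, with the most weight going to the tokens whose keys best match that row's query.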

2. Training on a huge corpus

The model is trained on trillions of tokens of text. During training, the model is repeatedly given a sequence of tokens with the next one hidden and asked to predict the missing token. The gap between its prediction and the actual token is used to adjust the model's parameters. Repeated across billions of examples, these updates leave the model with an extraordinarily rich statistical map of how language works: grammar, facts, reasoning patterns, style, and the relationships between concepts.
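
A toy version of one training example may make the loop concrete. This sketch assumes a hypothetical three-token vocabulary and updates the scores (logits) directly. A real model would instead update millions or billions of parameters that produce those scores, but the loss and the direction of the update follow the same idea.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss is low when the model puts high probability on the true next token."""
    return -math.log(softmax(logits)[target])

def training_step(logits, target, learning_rate=0.5):
    """One gradient-descent update on the logits for a single example."""
    probs = softmax(logits)
    # Gradient of cross-entropy for each logit: probability minus 1 for the
    # target token, probability minus 0 for every other token.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [x - learning_rate * g for x, g in zip(logits, grads)]

vocab = ["mat", "dog", "moon"]   # hypothetical three-token vocabulary
logits = [0.0, 0.0, 0.0]         # untrained: every token equally likely
target = vocab.index("mat")      # hidden next token for "the cat sat on the"

loss_before = cross_entropy(logits, target)
for _ in range(20):
    logits = training_step(logits, target)
loss_after = cross_entropy(logits, target)
```

After the updates, loss_after is lower than loss_before because the score for the correct token has been pushed up and the others pushed down.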

3. Token prediction at inference

When a person enters a prompt, the LLM does not retrieve a stored answer. It generates the response one token at a time, each time computing a probability for every possible next token given everything in the prompt and everything it has generated so far, then picking one, usually with some controlled randomness. This is why two responses to the same prompt can differ, and why the model sometimes produces fluent text that is factually wrong.
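
The difference between always taking the top token and sampling from the probabilities can be shown with a small sketch. The vocabulary, the scores, and the function names here are invented for illustration, and no real model is involved.

```python
import math
import random

def next_token_distribution(logits, temperature=1.0):
    """Convert model scores to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(vocab, logits):
    """Always take the single most likely token (deterministic)."""
    return vocab[max(range(len(logits)), key=lambda i: logits[i])]

def sample_pick(vocab, logits, temperature, rng):
    """Draw a token at random in proportion to its probability."""
    probs = next_token_distribution(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["Paris", "Lyon", "London"]   # hypothetical candidate tokens
logits = [2.0, 1.0, 0.5]              # hypothetical model scores
rng = random.Random(0)                # seeded so the run is repeatable
samples = [sample_pick(vocab, logits, 1.0, rng) for _ in range(200)]
```

Greedy picking returns the same token every time, while the sampled list contains a mix of tokens. That mix is where run-to-run variation in responses comes from.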

4. Alignment and fine-tuning

A base LLM is broadly capable but not always safe, helpful, or on-brand. To make it usable, model builders apply techniques like reinforcement learning from human feedback (RLHF) and supervised fine-tuning on curated examples. Alignment shapes the model's behavior — when to refuse, how to stay polite, how to follow instructions. Fine-tuning specializes the model for a particular domain, voice, or task.

5. Grounding and retrieval

A model only knows what was in its training data, which is frozen at a point in time. To answer current or company-specific questions accurately, an LLM needs to be grounded — connected to a source of truth like a knowledge base, a customer record, or a product catalog. Retrieval-augmented generation (RAG) is the most common grounding pattern: retrieve relevant facts from a connected source, include them in the prompt, then generate the answer. Grounding is what separates a research toy from a production system.
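
The retrieve-then-generate pattern can be sketched as follows. This is a minimal illustration under stated assumptions: the knowledge base entries are made up, retrieval uses simple word overlap where production systems typically use embedding search, and the final call to a model is left out.

```python
def tokenize(text):
    """Lowercase and split into a set of words, ignoring basic punctuation."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return set(cleaned.split())

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Put the retrieved facts in front of the question before generation."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"

# Hypothetical knowledge base entries, invented for this example.
knowledge_base = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards never expire.",
]
prompt = build_prompt("How many days do I have for returns?", knowledge_base)
```

The resulting prompt carries the returns policy alongside the question, so the model answers from the supplied fact instead of from whatever was in its training data.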

Examples of LLMs

The LLMs most people encounter today belong to a handful of model families:

| Model family | Maker | First public release |
| --- | --- | --- |
| GPT (powers ChatGPT) | OpenAI | 2018; ChatGPT launched November 2022 |
| Claude | Anthropic | 2023 |
| Gemini (formerly Bard) | Google | 2023 |
| Llama | Meta | 2023, open-weights |
| Mistral | Mistral AI | 2023, open-weights |
| Command | Cohere | 2022 |
| Grok | xAI | 2023 |
| DeepSeek | DeepSeek | 2024, open-weights |

ChatGPT is the household name, but ChatGPT is an application built on a GPT-family LLM. The same distinction applies across the table: the LLM is the engine; the chatbot, assistant, or copilot is the product built on top of it.

What LLMs can do

The capabilities most often used in production:

  • Read and summarize. Compress a long document, transcript, or email thread into the key points.

  • Write. Produce drafts of emails, marketing copy, reports, code, and conversational replies in a specified voice.

  • Translate. Convert text between languages, often with quality close to dedicated translation models.

  • Answer questions. Respond to natural-language questions, with accuracy improving sharply when the model is grounded in a source.

  • Classify and tag. Identify the topic, sentiment, intent, or category of a piece of text.

  • Reason in steps. Work through multi-step problems, especially when prompted to show the steps.

  • Generate structured output. Produce JSON, tables, or other formatted data that downstream systems can read.

One model can do all of the above. That generality is the reason LLMs became the default substrate for new AI products in 2023 and after.
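
Structured output is only useful if the downstream system checks it before trusting it. A minimal sketch of that check, assuming a hypothetical three-field schema and using hard-coded strings in place of real model responses:

```python
import json

# Hypothetical schema: field name mapped to the type we expect.
REQUIRED_FIELDS = {"intent": str, "sentiment": str, "order_id": str}

def parse_model_output(raw_text):
    """Parse model text as JSON and check it has the fields we expect."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # the model produced something that is not valid JSON
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None
    return data

# Stand-in strings for what a model might return; no real model is called here.
good = parse_model_output('{"intent": "refund", "sentiment": "negative", "order_id": "A123"}')
bad = parse_model_output("Sure! The intent is refund.")
```

The first string parses into a dictionary, while the conversational reply is rejected and returns None, which gives the calling system a clear signal to retry or fall back.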

What LLMs cannot do (without help)

Equally important — and less often covered in vendor explainers — are the structural limits of an LLM on its own:

  • No real-time knowledge. A base LLM only knows what was in its training data. It doesn't know today's news, today's prices, or today's order status without a retrieval layer.

  • No persistent memory across sessions. A model doesn't remember a person between conversations unless a memory layer is added on top.

  • No native action-taking. An LLM generates text. Refunding an order, updating an account, or sending a confirmation email requires tool use and integrations layered on top — the move from LLM to AI agent.

  • No guaranteed accuracy. Token prediction can produce fluent, confident-sounding output that is factually wrong. This is the hallucination problem, and grounding is the most reliable mitigation.

  • No native math or logic engine. Modern models are better at arithmetic and logic than earlier generations, but reliable math still benefits from tool use (a calculator, a code interpreter) rather than pure generation.

The takeaway: LLMs are powerful but partial. A production system that resolves real customer issues is an LLM plus grounding plus tools plus guardrails — not the model alone.
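
Tool use for math can be sketched as a small calculator that the surrounding application runs on the model's behalf. The request format shown here is a hypothetical one invented for this example. Real tool-calling interfaces differ by vendor.

```python
import ast
import operator

OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculator(expression):
    """Evaluate +, -, *, / on numbers deterministically, without using eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

def handle_model_request(request):
    """Run the tool a model asked for; 'request' is a hypothetical tool-call dict."""
    if request.get("tool") == "calculator":
        return calculator(request["input"])
    raise ValueError("unknown tool")

result = handle_model_request({"tool": "calculator", "input": "1299 * 3 - 450"})
```

The model's job shrinks to writing the expression. The arithmetic itself is done by ordinary code, and anything other than basic arithmetic is refused.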

LLM vs SLM vs foundation model

These three terms overlap in coverage. The clean distinction:

| Concept | What it is | Typical scale | Where it's used |
| --- | --- | --- | --- |
| Foundation model | A model trained on broad data that can be adapted to many tasks. The umbrella category. | Varies | Any modality: text, image, audio, code, multimodal. |
| Large language model (LLM) | A foundation model specialized in text, with billions to hundreds of billions of parameters. | 7B–1T+ parameters | General-purpose assistants, copilots, AI agents. |
| Small language model (SLM) | A purpose-built or distilled language model, usually under 10 billion parameters. | Often <10B parameters | On-device assistants, specific tasks, cost-sensitive deployments. |

SLMs are getting more attention because they run faster and cheaper than LLMs, and for narrow tasks they can match or exceed a general-purpose LLM. Most production AI stacks now mix LLMs and SLMs — the LLM handles open-ended reasoning, an SLM handles classification or routing, and a router picks which to use per request.
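
The routing idea can be sketched with a simple rule. The task labels, the word threshold, and the model names below are placeholders chosen for illustration. A production router might use a classifier or cost and latency budgets instead.

```python
NARROW_TASKS = {"classify", "route", "tag"}  # hypothetical task labels

def pick_model(task, prompt, max_small_prompt_words=200):
    """Send narrow, short requests to the small model and the rest to the large one."""
    is_narrow = task in NARROW_TASKS
    is_short = len(prompt.split()) <= max_small_prompt_words
    # "small-model" and "large-model" are placeholder names, not real products.
    return "small-model" if is_narrow and is_short else "large-model"

tagging_choice = pick_model("classify", "Where is my refund?")
drafting_choice = pick_model("draft_reply", "Where is my refund?")
```

The same customer message goes to the small model when the job is tagging and to the large model when the job is writing a reply.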

LLM vs AI agent vs agentic AI

This is the comparison most often muddled in customer service marketing. The clean version:

| Term | What it is | Example |
| --- | --- | --- |
| LLM | The language engine. Predicts text. | The model that drafts a reply. |
| AI agent | An application that uses one or more LLMs plus tools, grounding, and memory to complete tasks. | A customer service AI that reads the inquiry, looks up the order, and drafts a response. |
| Agentic AI | A broader architecture in which AI systems pursue multi-step goals, take action across tools, and coordinate without step-by-step human direction. | An AI that not only drafts the reply but processes the refund, updates the order, and emails the confirmation. |

The LLM is the engine. The AI agent is the car. Agentic AI is the road network the car can drive on. A glossary entry on AI agents and agentic AI covers each in depth.
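
The engine-versus-car distinction can be made concrete in code. In this sketch the model is a stand-in function that only turns text into text, and the agent is the wrapper that looks up an order before asking for a draft. The order data and every function name are hypothetical.

```python
ORDERS = {"A123": {"status": "shipped", "eta": "Friday"}}  # hypothetical order data

def fake_llm(prompt):
    """Stand-in for a language model: text in, text out. No real model is called."""
    return f"Draft reply based on: {prompt}"

def lookup_order(order_id):
    """Tool: fetch order facts the model cannot know on its own."""
    return ORDERS.get(order_id)

def agent(inquiry, order_id):
    """Agent: combine a tool result with the model to complete the task."""
    order = lookup_order(order_id)
    if order is None:
        return "ESCALATE: order not found"
    facts = f"order {order_id} is {order['status']}, arriving {order['eta']}"
    return fake_llm(f"{inquiry} | facts: {facts}")

reply = agent("Where is my order?", "A123")
escalation = agent("Where is my order?", "Z999")
```

The model function never touches order data. Everything that makes the reply specific to this customer comes from the agent code around it.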

Where LLMs are used in the real world

The applications most commonly built on LLMs today:

  • Customer service. Drafting agent replies, summarizing long conversations, translating in real time, generating help-center content, and powering self-service answers.

  • Marketing and content. First drafts of blog posts, ad copy, product descriptions, email campaigns, and social posts — often with brand-voice fine-tuning.

  • Software development. Code completion, code review suggestions, test generation, and documentation.

  • Sales. Personalized outbound, call summaries, proposal drafts, prospect research.

  • Knowledge management. Search across internal docs, contracts, and policies in natural language.

  • Operations. Querying data in plain English, generating executive summaries, translating analyst output for non-technical audiences.

  • Design and creative. Concept generation, copy variations, multimodal brainstorming with image and video models.

The common pattern: an LLM is the substrate. The product is the LLM plus the grounding, the tools, the guardrails, and the workflow.

How LLMs are evaluated

LLMs are not measured the way traditional software is. The common axes:

  • Capability benchmarks. Standardized tests across reasoning (MMLU), coding (HumanEval), math (GSM8K), and instruction-following. Used by model builders to publish leaderboards.

  • Cost per million tokens. The pricing unit. Models are priced from cents to dollars per million tokens, with sharp variation by speed and capability tier.

  • Latency. How fast the first token returns and how fast subsequent tokens stream. Critical for live conversation.

  • Safety and alignment. Refusal behavior, bias measurements, hallucination rates on grounded and ungrounded tasks.

  • Production accuracy. The only metric that matters for a deployed system: whether the model produced the right answer to the user's actual question. Benchmark performance and production accuracy are correlated but not the same.

The right model for a customer service deployment is rarely the highest-ranked model on a public leaderboard. It's the model whose price, speed, and grounded accuracy fit the use case.
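
Because pricing is quoted per million tokens, a rough monthly estimate is simple arithmetic. Every figure in this sketch is a made-up illustration, not a real vendor price or a real traffic number.

```python
def monthly_cost(conversations, input_tokens, output_tokens, input_price, output_price):
    """Estimate spend; prices are in dollars per million tokens."""
    per_conversation = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return conversations * per_conversation

# All figures below are made-up illustrations, not real vendor prices.
estimate = monthly_cost(
    conversations=100_000,   # conversations per month
    input_tokens=2_000,      # average prompt size, including retrieved context
    output_tokens=500,       # average reply size
    input_price=0.50,        # dollars per million input tokens
    output_price=1.50,       # dollars per million output tokens
)
```

With these assumed numbers the estimate comes to about 175 dollars a month. Swapping in the prices of a different capability tier shows how quickly the total moves.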

LLMs in customer service: the grounded vs. ungrounded distinction

This is the section where Gladly's position is on the record.

A general-purpose LLM is impressive in a demo. In a production customer service deployment, an ungrounded LLM is a liability: it doesn't know the customer's order history, it doesn't know the brand's voice, and it doesn't know whether the policy quoted in its training data still applies today. The answer it gives may be fluent and wrong.

An LLM grounded in a customer's full conversation history, order data, brand voice, and current policies is a different system. It resolves the conversation. It speaks in the brand's voice. It refunds the right order, applies the right loyalty points, and escalates only when it should.

Across brands using Gladly AI, the difference between general-purpose LLM use and grounded LLM use is the difference between novelty and outcomes:

  • KÜHL runs a 59% AI resolution rate with a 120% lift in revenue per conversation.

  • Breeze Airways has AI enhancing 71% of conversations while maintaining high CSAT.

  • Smith Optics hits a 67% AI resolution rate on product-help conversations — the kind most generic LLM deployments struggle with because they require both knowledge and judgment.

The model is the engine. The grounding is why the engine takes the customer somewhere worth going.
