During inference the model applies its already-learned knowledge without changing its internal parameters. It is the step that runs every time you interact with ChatGPT or Claude. Inference speed (latency) and the tokens consumed determine cost and user experience. Optimizations such as quantization and caching make inference faster and cheaper.
Inference
Inference is the process by which a trained AI model produces an output from an input, for example generating an answer to a prompt. It differs from training, where the model first learns.
