Glossary

A

Activation Function: A non-linear function applied to neuron outputs (e.g., ReLU, sigmoid, tanh).
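
The three activation functions named above can be sketched in a few lines of Python (function names are illustrative):

```python
import math

def relu(x):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real number into (-1, 1)
    return math.tanh(x)
```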

Attention Mechanism: A technique allowing models to focus on relevant parts of input when producing output.

Autoregressive Model: A model that generates sequences by predicting one element at a time, conditioned on previous elements.

B

Backpropagation: Algorithm for computing gradients in neural networks by propagating errors backward through layers.

BERT: Bidirectional Encoder Representations from Transformers; a pre-trained language model using masked language modeling.

Bias (statistical): Systematic error from overly simplistic assumptions in a model.

Boolean Algebra: Mathematical system for logical operations using AND, OR, NOT.

C

Church-Turing Thesis: The claim that any effectively calculable function can be computed by a Turing machine.

Convolutional Neural Network (CNN): Neural network using convolution operations, especially effective for image processing.

D

Deep Learning: Machine learning using neural networks with many layers.

Difference Engine: Charles Babbage’s mechanical calculator for evaluating polynomials using the method of finite differences.

E

Embedding: Dense vector representation of discrete objects (words, tokens, etc.).

Entropy (information): Measure of uncertainty or information content.
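
Shannon entropy over a discrete distribution can be computed directly from its definition; a minimal sketch:

```python
import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)),
    # skipping zero-probability outcomes (their contribution is 0)
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A fair coin has maximal uncertainty for two outcomes (1 bit); a certain outcome has none (0 bits).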

F

Fine-tuning: Adapting a pre-trained model to a specific task with additional training.

Forward Pass: Computing output from input through a neural network.

G

GPT: Generative Pre-trained Transformer; family of autoregressive language models.

Gradient Descent: Optimization algorithm that iteratively adjusts parameters in the direction that reduces loss.
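
The gradient descent update rule can be sketched in a few lines of Python (the quadratic objective is an illustrative example, not a fixed choice):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step against the gradient: x <- x - lr * grad(x)
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3);
# the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```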

H

Hallucination: When an AI model generates plausible-sounding but false or unsupported information.

Hidden Layer: A layer in a neural network between the input and output layers.

I

ImageNet: Large-scale image dataset that catalyzed deep learning progress.

K

Kernel (SVM): Function that computes similarity in a transformed feature space.

L

LSTM: Long Short-Term Memory; recurrent architecture with gates for learning long-term dependencies.

Loss Function: Measure of how well a model’s predictions match targets.
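
A common loss function for regression is mean squared error; a minimal sketch:

```python
def mse_loss(preds, targets):
    # Mean squared error: average of squared differences
    # between predictions and targets
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
```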

M

Masked Language Modeling: Pre-training objective where model predicts masked tokens.

Multi-head Attention: Parallel attention mechanisms with different learned projections.

N

Neural Network: Computational model inspired by biological neurons, composed of connected layers.

O

Overfitting: When a model learns training data too well, failing to generalize.

P

Perceptron: Single-layer neural network for binary classification; learns a linear decision boundary.
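
The classic perceptron learning rule fits in a few lines of pure Python (function and variable names are illustrative):

```python
def perceptron_train(data, epochs=10, lr=1.0):
    # data: list of (features, label) pairs with label in {-1, +1}.
    # On each misclassified example, nudge the weights toward it.
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong side of (or on) the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b
```

On linearly separable data (such as the AND function) the rule is guaranteed to converge; on non-separable data (such as XOR) it never does, a limitation that motivated multi-layer networks.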

Positional Encoding: Method for injecting sequence position information into transformers.
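
One common scheme is the sinusoidal encoding from the original Transformer paper; a minimal sketch for a single position:

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal positional encoding: even dimensions use sine,
    # odd dimensions use cosine, with geometrically spaced wavelengths
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe
```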

Pre-training: Initial training on large unlabeled data before task-specific fine-tuning.

R

Recurrent Neural Network (RNN): Neural network with connections forming cycles, processing sequences.

Reinforcement Learning: Learning through trial and error with reward signals.

RLHF: Reinforcement Learning from Human Feedback; fine-tuning technique that aligns model outputs with human preferences, typically via a learned reward model.

S

Self-Attention: Attention mechanism where queries, keys, and values all come from the same sequence.
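
A toy version of scaled dot-product self-attention can be written in pure Python; this sketch uses identity projections (queries, keys, and values are all the input rows themselves), a deliberate simplification of the learned projections used in practice:

```python
import math

def self_attention(X):
    # X: list of vectors (one per sequence position).
    # Each output position is a weighted average of all input
    # vectors, with weights given by softmax of scaled dot products.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over positions
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out
```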

Softmax: Function converting a vector of logits into a probability distribution.
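
A minimal, numerically stable implementation:

```python
import math

def softmax(logits):
    # Subtracting the max does not change the result but
    # prevents overflow in exp for large logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```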

Supervised Learning: Learning from labeled examples.

T

Transformer: Architecture based on self-attention, without recurrence.

Turing Machine: Abstract model of computation with tape, head, and state transitions.

Turing Test: Test proposed by Alan Turing in which a machine is judged by whether its conversational behavior is indistinguishable from a human’s.

U

Universal Approximation: Property that feedforward neural networks with at least one hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy.

V

Vanishing Gradient: Problem where gradients become too small during backpropagation through many layers.
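
The effect is easy to demonstrate with the sigmoid activation, whose derivative never exceeds 0.25:

```python
import math

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), peaking at 0.25 when x = 0
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Chaining the derivative across 20 layers multiplies gradients
# by at most 0.25 per layer, shrinking them toward zero
grad_through_20_layers = sigmoid_grad(0.0) ** 20
```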

Variance (statistical): Model sensitivity to fluctuations in training data.

W

Word Embedding: Vector representation of words capturing semantic relationships (e.g., Word2Vec).