Glossary
A
Activation Function: A non-linear function applied to neuron outputs (e.g., ReLU, sigmoid, tanh).
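As an illustration, two of the activation functions named above can be sketched in a few lines of Python:

```python
import math

def relu(x):
    # Rectified Linear Unit: passes positive inputs through, zeroes out negatives.
    return max(0.0, x)

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
```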
Attention Mechanism: A technique allowing models to focus on relevant parts of input when producing output.
Autoregressive Model: A model that generates sequences by predicting one element at a time, conditioned on previous elements.
B
Backpropagation: Algorithm for computing gradients in neural networks by propagating errors backward through layers.
BERT: Bidirectional Encoder Representations from Transformers; a pre-trained language model using masked language modeling.
Bias (statistical): Systematic error from overly simplistic assumptions in a model.
Boolean Algebra: Mathematical system for logical operations using AND, OR, NOT.
C
Church-Turing Thesis: The claim that any effectively calculable function can be computed by a Turing machine.
Convolutional Neural Network (CNN): Neural network using convolution operations, especially effective for image processing.
D
Deep Learning: Machine learning using neural networks with many layers.
Difference Engine: Babbage’s mechanical calculator for polynomial evaluation.
E
Embedding: Dense vector representation of discrete objects (words, tokens, etc.).
Entropy (information): Measure of uncertainty or information content.
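For a discrete distribution, Shannon entropy can be computed directly from its definition; a minimal sketch:

```python
import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)), with 0 * log 0 taken as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A fair coin has 1 bit of entropy; a certain outcome has 0.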
F
Fine-tuning: Adapting a pre-trained model to a specific task with additional training.
Forward Pass: Computing output from input through a neural network.
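A minimal sketch of a forward pass through fully connected layers, assuming tanh activations (the layer format here is illustrative, not any particular library's API):

```python
import math

def forward_pass(x, layers):
    # layers: list of (weights, biases); weights is a list of per-neuron weight rows.
    for weights, biases in layers:
        # Each neuron computes a weighted sum plus bias, then a tanh non-linearity.
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x
```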
G
Gradient Descent: Optimization algorithm that iteratively adjusts parameters in the direction that reduces loss.
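As a toy illustration, gradient descent on the one-dimensional loss (x − 3)² steps repeatedly against the derivative:

```python
def gradient_descent(lr=0.1, steps=100):
    # Minimize the loss (x - 3)^2 starting from x = 0.
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)  # derivative of (x - 3)^2
        x -= lr * grad      # step in the direction that reduces loss
    return x
```

The iterate converges toward the minimum at x = 3.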
GPT: Generative Pre-trained Transformer; family of autoregressive language models.
H
Hallucination: When an AI model generates plausible-sounding but false or unsupported information.
Hidden Layer: Any layer in a neural network between the input and output layers.
I
ImageNet: Large-scale image dataset that catalyzed deep learning progress.
K
Kernel (SVM): Function that computes similarity in a transformed feature space.
L
LSTM: Long Short-Term Memory; recurrent architecture with gates for learning long-term dependencies.
Loss Function: Measure of how well a model’s predictions match targets.
M
Masked Language Modeling: Pre-training objective where model predicts masked tokens.
Multi-head Attention: Parallel attention mechanisms with different learned projections.
N
Neural Network: Computational model inspired by biological neurons, composed of connected layers.
O
Overfitting: When a model fits the training data too closely, capturing noise rather than underlying patterns, and therefore fails to generalize to new data.
P
Perceptron: Single-layer neural network for binary classification.
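The classic perceptron learning rule can be sketched in plain Python (using ±1 labels; the function name and argument layout are illustrative):

```python
def perceptron_train(samples, labels, lr=1.0, epochs=10):
    # samples: list of feature vectors; labels: +1 or -1.
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified: nudge weights toward correct side
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b
```

On linearly separable data (e.g. the OR function), the rule converges to a separating hyperplane.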
Positional Encoding: Method for injecting sequence position information into transformers.
Pre-training: Initial training on large unlabeled data before task-specific fine-tuning.
R
Recurrent Neural Network (RNN): Neural network with cyclic connections that carry a hidden state from one time step to the next, enabling sequence processing.
Reinforcement Learning: Learning through trial and error with reward signals.
RLHF: Reinforcement Learning from Human Feedback; a fine-tuning approach in which a model is optimized with reinforcement learning against a reward model trained on human preference judgments.
S
Self-Attention: Attention mechanism where queries, keys, and values all come from the same sequence.
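A toy single-head self-attention sketch, assuming identity projections (queries, keys, and values are all the raw input rows, omitting the learned weight matrices of a real transformer):

```python
import math

def self_attention(X):
    # X: list of token vectors. Scores are scaled dot products between rows.
    d = len(X[0])
    scores = [[sum(q * k for q, k in zip(X[i], X[j])) / math.sqrt(d)
               for j in range(len(X))] for i in range(len(X))]
    out = []
    for row in scores:
        # Softmax over each row of scores gives attention weights.
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output is a weighted average of the value vectors (here X itself).
        out.append([sum(w * X[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out
```

Each output row is a convex combination of the inputs, weighted by similarity.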
Softmax: Function converting logits to probability distribution.
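The softmax function exponentiates each logit and normalizes; a numerically stable sketch:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then normalize exponentials.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are non-negative and sum to 1, so they can be read as probabilities.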
Supervised Learning: Learning from labeled examples.
T
Transformer: Architecture based on self-attention, without recurrence.
Turing Machine: Abstract model of computation with tape, head, and state transitions.
Turing Test: Proposed test in which a machine exhibits intelligent behavior indistinguishable from that of a human.
U
Universal Approximation: Property that feedforward neural networks with at least one hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy.
V
Vanishing Gradient: Problem where gradients become too small during backpropagation through many layers.
Variance (statistical): Model sensitivity to fluctuations in training data.
W
Word Embedding: Vector representation of words capturing semantic relationships (e.g., Word2Vec).