Feb 9

Machine Learning: Deep Learning

Mindli AI

Deep learning is a branch of machine learning built around neural networks with many layers. Its practical impact comes from representation learning: instead of hand-crafting features, deep models learn hierarchies of features directly from data. Over the last decade, progress in neural network architectures, optimization methods, regularization techniques, and attention mechanisms has made deep learning the dominant approach for computer vision, speech, and modern natural language processing, including large-scale models.

This article surveys the core ideas that matter in practice: the main neural architectures (CNNs, RNNs, and Transformers), how they are trained, and how regularization keeps them from failing in predictable ways.

What makes deep learning “deep”

A neural network is a composition of functions. Each layer transforms its input into a new representation, and stacking layers allows the network to model complex patterns. In a simplified view, a feed-forward network applies repeated affine transformations and nonlinearities:

  • Linear part: z = Wx + b, where W is a weight matrix and b a bias vector
  • Nonlinearity: a = φ(z), applied elementwise (ReLU, φ(z) = max(0, z), and variants are common)
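
As a concrete sketch, the whole composition can be written in a few lines of NumPy; the layer sizes below are hypothetical:

  import numpy as np

  def relu(z):
      # Elementwise nonlinearity: max(0, z)
      return np.maximum(0.0, z)

  def forward(x, layers):
      # layers: a list of (W, b) pairs; each hidden layer is an affine map followed by ReLU
      h = x
      for W, b in layers[:-1]:
          h = relu(W @ h + b)
      W, b = layers[-1]
      return W @ h + b  # final layer left linear, e.g. to produce logits

  # Hypothetical shapes: 4 inputs -> 8 hidden units -> 3 outputs
  rng = np.random.default_rng(0)
  layers = [(rng.normal(size=(8, 4)), np.zeros(8)), (rng.normal(size=(3, 8)), np.zeros(3))]
  y = forward(rng.normal(size=4), layers)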

Depth matters because it enables the model to build progressively abstract features. In vision, early layers detect edges; deeper layers detect textures, parts, and objects. In language, deeper layers can encode syntactic cues, semantic relationships, and discourse-level dependencies.

Deep learning’s success is not only about depth. It is also about scale (data and parameters), architectural bias (how a model is structured), and the ability to optimize large networks reliably.

Core architectures: CNNs, RNNs, Transformers

Convolutional Neural Networks (CNNs)

CNNs are designed for grid-like data, most famously images. They exploit two ideas:

  1. Local connectivity: features depend on nearby pixels.
  2. Weight sharing: the same filter is applied across the image.

A convolution layer applies learnable kernels across spatial positions, producing feature maps that respond to patterns such as edges and corners. Pooling or strided convolutions reduce spatial resolution, expanding the receptive field and improving computational efficiency.

Why CNNs work well:

  • Translation equivariance: shifting an input shifts the feature map in predictable ways.
  • Parameter efficiency: far fewer parameters than a fully connected network of similar capacity.
  • Strong inductive bias for images: the architecture matches the structure of the data.

Common practical patterns include stacking convolution blocks, using normalization (often batch normalization), and adding skip connections in deeper CNNs to stabilize training and preserve information.
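
As an illustration of that pattern, a residual convolution block in PyTorch might look like the sketch below; the channel count and input shape are hypothetical:

  import torch
  import torch.nn as nn

  class ResidualConvBlock(nn.Module):
      # Conv -> BatchNorm -> ReLU twice, with a skip connection (input and output channels assumed equal)
      def __init__(self, channels: int):
          super().__init__()
          self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
          self.bn1 = nn.BatchNorm2d(channels)
          self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
          self.bn2 = nn.BatchNorm2d(channels)

      def forward(self, x):
          out = torch.relu(self.bn1(self.conv1(x)))
          out = self.bn2(self.conv2(out))
          return torch.relu(out + x)  # skip connection preserves the input signal

  # Hypothetical input: a batch of 8 feature maps with 32 channels and 16x16 spatial size
  block = ResidualConvBlock(32)
  y = block(torch.randn(8, 32, 16, 16))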

Recurrent Neural Networks (RNNs)

RNNs are built for sequential data such as text, time series, or audio. They process inputs step-by-step, maintaining a hidden state intended to summarize the past. In principle, this allows modeling dependencies across time.

In practice, standard RNNs can struggle with vanishing or exploding gradients when sequences are long. Gated architectures address this:

  • LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) introduce gates that control what to store, forget, and output.
  • These gates make it easier to learn longer-range dependencies compared with plain RNNs.
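
As a small usage sketch (not a full model), a gated recurrent layer in PyTorch processes a batch of sequences like this; the input size, hidden size, and sequence length are hypothetical:

  import torch
  import torch.nn as nn

  # A single-layer LSTM over a batch of sequences
  lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
  x = torch.randn(4, 50, 64)        # (batch, time steps, features)
  outputs, (h_n, c_n) = lstm(x)     # outputs holds the hidden state at every step
  print(outputs.shape)              # torch.Size([4, 50, 128])
  print(h_n.shape)                  # final hidden state: (num layers, batch, hidden size)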

RNNs have been widely used for tasks like speech recognition and sequence labeling. However, in modern language modeling and many sequence-to-sequence problems, they have largely been superseded by Transformers due to better parallelism and stronger performance at scale.

Transformers and attention mechanisms

Transformers introduced a shift in how sequences are modeled: instead of processing tokens strictly left-to-right with a recurrent state, they rely on attention. Attention lets each token “look at” other tokens and compute a context-aware representation.

At a high level, self-attention builds a weighted combination of token representations. Weights are derived from similarity between learned projections (queries and keys), applied to values. The result is a powerful mechanism for capturing relationships, including long-range dependencies, without the sequential bottleneck of recurrence.
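
A minimal single-head self-attention sketch in PyTorch, assuming plain matrices for the query, key, and value projections (shapes are hypothetical):

  import torch
  import torch.nn.functional as F

  def self_attention(x, w_q, w_k, w_v):
      # x: (batch, tokens, dim); w_q / w_k / w_v: learned projection matrices
      q, k, v = x @ w_q, x @ w_k, x @ w_v
      scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # similarity between queries and keys
      weights = F.softmax(scores, dim=-1)                   # attention weights sum to 1 per token
      return weights @ v                                    # weighted combination of the values

  x = torch.randn(2, 10, 32)                                # 2 sequences, 10 tokens, 32-dim representations
  w_q, w_k, w_v = (torch.randn(32, 32) for _ in range(3))
  out = self_attention(x, w_q, w_k, w_v)                    # same shape as x, but context-aware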

Key advantages:

  • Parallel training: sequence positions can be processed simultaneously, dramatically improving throughput.
  • Global context: any token can attend to any other token, enabling rich dependency modeling.
  • Scalability: Transformers tend to improve predictably with more data, parameters, and compute, which underpins modern large models.

Transformers are now common beyond text, including vision (Vision Transformers), audio, and multimodal systems. In many of these settings, attention complements or replaces convolution and recurrence by providing flexible, content-dependent interactions.

Optimization: how deep networks are trained

Deep learning is typically framed as minimizing a loss function over model parameters. Training relies on gradient-based optimization, with gradients computed via backpropagation. The central loop is conceptually simple:

  1. Forward pass: compute predictions
  2. Loss: measure error
  3. Backward pass: compute gradients
  4. Update parameters
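
Written out in PyTorch, the loop follows those four steps directly; the model, loss, and data below are stand-ins:

  import torch
  import torch.nn as nn

  model = nn.Linear(20, 3)                                       # stand-in for a deep network
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
  loss_fn = nn.CrossEntropyLoss()
  batches = [(torch.randn(16, 20), torch.randint(0, 3, (16,)))]  # stand-in for a DataLoader

  for x, y in batches:
      optimizer.zero_grad()
      logits = model(x)           # 1. forward pass
      loss = loss_fn(logits, y)   # 2. loss
      loss.backward()             # 3. backward pass (gradients via backpropagation)
      optimizer.step()            # 4. parameter update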

Stochastic gradient descent and its variants

Stochastic Gradient Descent (SGD) and minibatch SGD are widely used because they scale to large datasets. Momentum accelerates learning in consistent directions and reduces oscillations. Adaptive optimizers such as Adam adjust learning rates per parameter based on gradient statistics and are often effective “out of the box,” especially for Transformers.

In real deployments, the optimizer is only part of the story. Training stability often depends on:

  • Learning rate schedules: warmup followed by decay is common in Transformer training.
  • Batch size: interacts with generalization and optimization dynamics.
  • Gradient clipping: prevents exploding gradients, especially in RNNs and large-scale training.
  • Mixed precision training: improves speed and memory efficiency but requires care with numerical stability.
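
A sketch combining a few of these ingredients, assuming AdamW with a linear warmup-then-decay schedule and gradient clipping; the step counts and learning rate are illustrative:

  import torch

  model = torch.nn.Linear(128, 128)                 # stand-in for a larger network
  optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

  warmup_steps, total_steps = 1_000, 100_000
  def lr_lambda(step):
      if step < warmup_steps:
          return step / warmup_steps                # linear warmup
      return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # linear decay

  scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

  # One illustrative training step: compute a loss, backpropagate, clip, then update
  loss = model(torch.randn(8, 128)).pow(2).mean()
  loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
  optimizer.step()
  scheduler.step()
  optimizer.zero_grad()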

Loss functions and task alignment

Choosing a loss is not a formality. For classification, cross-entropy is typical. For regression, mean squared error or variants are common. For language modeling, next-token prediction with cross-entropy has become a standard pretraining objective because it scales well and yields general representations.
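
For example, next-token prediction with cross-entropy amounts to shifting the targets by one position; the sketch below uses random tensors with hypothetical sizes in place of a real model:

  import torch
  import torch.nn.functional as F

  vocab_size, seq_len, batch = 100, 12, 4
  logits = torch.randn(batch, seq_len, vocab_size)       # model outputs: one distribution per position
  tokens = torch.randint(0, vocab_size, (batch, seq_len))

  loss = F.cross_entropy(
      logits[:, :-1].reshape(-1, vocab_size),            # predictions at positions 0..T-2
      tokens[:, 1:].reshape(-1),                         # targets are the next tokens, 1..T-1
  )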

Loss design should reflect what matters in evaluation. A model can optimize a loss yet fail at the real-world objective if the loss is misaligned with the task or data distribution.

Regularization: preventing overfitting and improving robustness

Deep networks can overfit, especially when data is limited or labels are noisy. Regularization addresses this by constraining effective capacity or injecting beneficial noise during training.

Weight decay and norm-based regularization

Weight decay (often implemented as L2 regularization) discourages overly large weights and can improve generalization. It is commonly used with SGD and in large-scale Transformer training, with careful handling to avoid decaying certain parameters (such as biases and some normalization parameters) depending on the setup.
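
One common way to arrange this in PyTorch is to split parameters into groups and apply weight decay only to the weight matrices; the rule below (exclude one-dimensional parameters such as biases and normalization scales) is a convention, not the only option:

  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 10))  # stand-in model

  decay, no_decay = [], []
  for name, p in model.named_parameters():
      if p.ndim == 1:                # biases and normalization scales/offsets
          no_decay.append(p)
      else:                          # weight matrices
          decay.append(p)

  optimizer = torch.optim.AdamW(
      [{"params": decay, "weight_decay": 0.01},
       {"params": no_decay, "weight_decay": 0.0}],
      lr=3e-4,
  )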

Dropout and stochastic regularizers

Dropout randomly zeroes activations during training, preventing co-adaptation and encouraging redundancy. It is widely used in fully connected layers and Transformers. Variants include dropping attention weights or entire paths in very deep architectures.
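
A small sketch of dropout between fully connected layers; the dropout rate of 0.1 is illustrative:

  import torch.nn as nn

  mlp = nn.Sequential(
      nn.Linear(256, 256),
      nn.ReLU(),
      nn.Dropout(p=0.1),   # randomly zeroes 10% of activations during training
      nn.Linear(256, 10),
  )
  mlp.train()              # dropout active while training
  mlp.eval()               # dropout disabled for evaluation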

Data augmentation

In computer vision, augmentation is a primary regularization tool: random crops, flips, color jitter, and more advanced policies can dramatically reduce overfitting. In effect, augmentation encodes invariances that the model should learn.
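
A typical torchvision augmentation pipeline might look like the following; the specific transforms and parameters are illustrative rather than prescriptive:

  from torchvision import transforms

  train_transform = transforms.Compose([
      transforms.RandomResizedCrop(224),      # random crops encode scale and translation invariance
      transforms.RandomHorizontalFlip(),      # flips encode left-right invariance
      transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
      transforms.ToTensor(),
  ])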

For sequences, augmentation is more delicate, but techniques like noise injection in audio, masking strategies, and careful perturbations can help.

Early stopping and validation discipline

Even with modern regularizers, monitoring validation performance and stopping at the right time remains a practical safeguard. Many failures attributed to “model choice” are actually due to insufficient validation discipline, data leakage, or unstable training settings.
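
A minimal early-stopping loop, assuming a hypothetical evaluate() function that returns a validation loss once per epoch:

  import random

  def evaluate() -> float:
      # Stand-in for running the validation set; returns a validation loss
      return random.random()

  best_val, patience, bad_epochs = float("inf"), 5, 0
  for epoch in range(100):
      val_loss = evaluate()
      if val_loss < best_val:
          best_val, bad_epochs = val_loss, 0   # improvement: reset the counter (and save a checkpoint here)
      else:
          bad_epochs += 1
          if bad_epochs >= patience:           # no improvement for `patience` epochs: stop
              break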

Large models: why scale matters, and what it changes

Large deep learning models, especially Transformer-based systems, have shown that performance often improves smoothly with scale. More parameters and more data can yield better generalization, provided optimization is stable and regularization is appropriate.

Scaling changes practical considerations:

  • Training becomes an engineering problem: distributed training, memory management, and efficient data pipelines matter.
  • Small implementation details can dominate outcomes: learning rate schedules, initialization, and normalization choices have outsized impact.
  • Evaluation must broaden: beyond accuracy, teams track bias, robustness, and behavior under distribution shift.

Large models are not automatically better for every task. When data is scarce, domain constraints are tight, or latency is critical, smaller architectures or hybrids can be more appropriate.

Choosing the right architecture in practice

A useful rule of thumb is to match architectural bias to data structure:

  • Images and spatial signals: CNNs remain strong and efficient; attention can complement them when global context is important.
  • Sequential data with moderate lengths: RNNs and gated variants can be competitive, especially in streaming scenarios.
  • Long-range dependencies and large-scale language tasks: Transformers are typically the first choice due to attention and parallelism.

In real systems, the “best” model is the one that meets constraints: accuracy, compute budget, latency, interpretability requirements, and the ability to maintain performance as data changes.

Conclusion

Modern deep learning rests on a set of interconnected advances: architectures that encode useful structure (CNNs, RNNs, Transformers), attention mechanisms that model global relationships, optimization techniques that make training stable at scale, and regularization strategies that preserve generalization. Understanding how these pieces interact is what separates a model that trains from a model that works reliably in production.
