Maria Aguilera

Page 1 · Deep models need representations

Page 1 of 4. Nine cards (1–9):

(1) Deep learning needs lots of annotated data. The performance vs labeled examples plot shows deep models overtaking classical models only once labels are abundant; without enough labels, deep models overfit.
(2) RNN/CNN classifiers as larger models. Early deep approaches used RNNs or CNNs over word/char embeddings, with an encoder followed by a softmax over classes. Better than bag-of-words but limited compared to Transformers.
(3) Language modelling as the new representation. Instead of hand-crafted features, use a model trained to understand language itself. Traditional sparse TF-IDF/BoW is replaced by learned dense, low-dimensional semantic embeddings. These representations 'speak the language' and transfer to many tasks.
(4) 'Predict the next word' as the learning task. Pretrain by predicting the next token in a large corpus; the model learns grammar, facts, semantics, and style from one simple objective.
(5) Language understanding through context. Word meaning depends on surrounding words; contextual representations distinguish 'I saw her book on the table' (NOUN) from 'Please book a flight' (VERB).
(6) Long-term relationships in text. Important information can be far apart. Example: 'The president of France visited Berlin … 12 words later … He gave a speech.', where the model should connect 'He' → 'president'. RNNs help, but Transformers handle long-range dependencies more effectively.
(7) Markov / n-gram limitations. n-grams use only the last (n-1) words and cannot capture long-range discourse dependencies. Example: 'He didn't say it because he was tired', where a 2-gram sees only local history. Local context ≠ true understanding.
(8) RNNs as sequence models. Process text token by token with a hidden state that summarizes past information: x_1 → h_1 → h_2 → … → h_T chain, with outputs y_1..y_T at each step. Vanishing gradients make long-term memory difficult.
(9) Transformers as language models. Self-attention connects all words with all other words and captures both short and long relations. Pretrained on massive text, the model becomes a general-purpose language model; contextual representations feed into an LM head for next-token prediction. These models became the new foundation for text classification and many other tasks.

Page 1: the representation problem. Deep models need lots of data; RNN/CNN classifiers were a step up but hit their own ceiling. The breakthrough was training a language model as the representation instead of the classifier.
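
To make cards 7 and 9 concrete, here is a minimal sketch contrasting a 2-gram predictor with self-attention. Everything in it is an illustrative assumption: the toy corpus, the hand-picked 2-d token vectors, and the function names. The attention part also skips the learned query/key/value projections a real Transformer has and uses the token vectors directly.

```python
import math
from collections import Counter, defaultdict

# --- Card 7: a 2-gram model only ever looks at the last word ---
def train_bigram(tokens):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, history):
    """Predict the next word from history[-1] alone; the rest is ignored."""
    followers = counts.get(history[-1])
    return followers.most_common(1)[0][0] if followers else None

# Toy corpus, invented for illustration.
corpus = ("the president gave a speech . the dog gave a bark . "
          "the president gave a speech .").split()
bigram = train_bigram(corpus)
after_president = predict_next(bigram, "the president gave a".split())
after_dog = predict_next(bigram, "the dog gave a".split())
print(after_president, after_dog)  # same answer: the subject is outside the window

# --- Card 9: self-attention scores every position against every other ---
def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = vectors (no projections)."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights.append(softmax(scores))
    outputs = [[sum(w * vec[j] for w, vec in zip(row, vectors)) for j in range(dim)]
               for row in weights]
    return outputs, weights

# Hand-picked vectors (an assumption): 'He' and 'president' point the same way.
tokens = ["president", "visited", "Berlin", "He"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(vectors)
he_row = weights[3]  # how much 'He' attends to each token
print(dict(zip(tokens, he_row)))
```

The 2-gram gives the same prediction whatever the subject was, because the subject is not in its window. The attention row for 'He' puts weight directly on 'president', with no chain of intermediate hidden states in between.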

Page 2 · Language modelling + pretraining

Page 2 of 4. Seven cards (10–16):

(10) Self-supervised learning. Raw text provides the supervision signal: the model sees text, creates a task it can solve, and learns. Examples: mask tokens ([MASK]) or predict the next word/sentence. Raw text → self-generated supervision, with no labels needed.
(11) Pretrained models. Models are trained once on huge corpora, learning general language knowledge; they capture grammar, facts, commonsense, and reasoning patterns. A reusable starting point for classification, QA, NLI, summarization. One expensive pretraining run → many downstream uses; knowledge is reused, not relearned.
(12) Repositories of pretrained models by language / domain. Model hubs (Hugging Face Hub, TensorFlow Hub, PyTorch Hub) provide many ready-to-use checkpoints: English (BERT, RoBERTa, DeBERTa, DistilBERT, Llama, Mistral), domain-specific (BioMedLM, SciBERT, ClinicalBERT), multilingual (mBERT, XLM-R, LaBSE, XLM-T). Pick a model that matches your language and domain.
(13) Transfer learning. Take a pretrained model and adapt it to your target task/domain. Pipeline: pretrained model (general LM) → task adaptation (fine-tune or feature extract) → task classifier. Start from learned knowledge and update only a small part (or all) with task data: leverage broad knowledge → learn the specifics. Dramatically reduces labeled-data needs.
(14) Fine-tuning. Continue training the model on your target task, updating the model weights using task-specific labeled data. Before fine-tuning: general pretrained model → fine-tune on task data (labels) → after fine-tuning: task-specific model. WARNING: update weights, not just the head. Small data, big gains.
(15) Tiny labelled dataset after pretraining. Fine-tuning works with much less data than training from scratch. The typical accuracy vs labeled data plot shows pretrained + fine-tuned reaching 90%+ with 10^2 examples, while from-scratch requires 10^5+: a huge reduction in labeled data.
(16) Fine-tuned representation + simple classifier. Use the fine-tuned model to produce rich representations, then a small classifier head for the task. Input text → pretrained/fine-tuned encoder (Transformer) → representation (sentence embedding) → simple classifier head (Linear) → label (Positive). Why this works: the encoder learns general language, the task head learns the decision; few parameters to train; fast/efficient/effective.

Takeaway: self-supervised learning turns raw text into supervision at scale; pretrained models are reusable checkpoints of knowledge; transfer learning + fine-tuning solve most of the annotation bottleneck.

Page 2: the pretraining moment. Self-supervised learning on raw text creates knowledge cheaply; a small fine-tune on labelled data specialises it. This is the trick that broke the annotated-data ceiling.
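
Card 10's "raw text → self-generated supervision" can be sketched in a few lines. This is only an illustration of how (input, label) pairs fall out of unlabelled text; the function names are made up here, and real pretraining works on subword tokens and random mask positions rather than whole words and a fixed index.

```python
def next_token_examples(tokens):
    """Next-word objective: each prefix is an input, the following token is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, mask_index, mask_token="[MASK]"):
    """Masking objective: hide one token; the hidden token becomes the label."""
    masked = list(tokens)          # copy so the original sentence is untouched
    label = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, label

# Any raw sentence works; no human annotation is involved.
tokens = "please book a flight".split()

pairs = next_token_examples(tokens)
for context, target in pairs:
    print(context, "->", target)

masked, label = masked_example(tokens, mask_index=1)
print(masked, "->", label)
```

Every label above was read off the text itself, which is why the supply of training examples scales with the amount of raw text rather than with annotation budget.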

Page 3 · Fine-tuning + representation reuse

Page 3 of 4. Seven cards (17–23):

(17) Why tagging / lemmatization / parsing become less central. Pretrained models learn rich representations directly from data, reducing the need for manual linguistic pipelines. Why it matters: captures semantics/context/long-range dependencies automatically; adapts to new domains faster; less engineering, more generalization. Manual pipeline (old): tokenize → tag → lemma → parse → feature engineering (TF-IDF, patterns, rules) → classifier (SVM/LR/NB), which is heavy manual work. Pretrained representation pipeline (new): raw text → pretrained LM (embeddings) → small classifier (fine-tuned) → label. End-to-end learned representations are robust and transferable.
(18) Domain adaptation through fine-tuning. Fine-tune a general model on domain-specific data to specialize without training from scratch. Preserves general language knowledge; learns domain vocabulary/style/patterns; requires far less data and compute. One general pretrained model → legal model, biomed model, finance model: one general model becomes many domain-specialized models.
(19) Learning rate caution during fine-tuning. Learning rate controls how much the model updates its weights during training. WARNING: too high can overwrite pretrained knowledge and destroy performance. Too high → forgetting/divergence; just right → effective adaptation; too low → slow/underfitting. The effect-of-learning-rate curve (log scale) shows the performance peak in the 10^-4 to 10^-3 range. Start small, tune, and monitor validation performance.
(20) Freezing / changing only top layers. Freeze lower layers that encode general language knowledge and train only task-specific layers. Diagram: top layer L trainable (updates) at top, bottom layers 1..L-2 frozen (no updates). Saves compute and prevents overfitting.
(21) Lower layers as basic language representation. Lower layers learn general patterns that transfer across tasks and domains. Why it matters: lower layers capture syntax/morphology/word order and local semantics; they are more reusable and less task-specific; they can be reused across many tasks. Example token 'bank': top layers see task/domain concepts (finance vs river bank), mid layers see phrase meaning and context-aware relations, lower layers see words/morphology/syntax/subword patterns, the bottom sees characters/shapes/subwords.
(22) Classical sparse features vs learned representations. Modern models replace sparse manual features with dense contextual embeddings. Classical sparse features (TF-IDF/BoW): high-dimensional and sparse; no context ('bank' same always); manual feature engineering; weak at generalization; matrix of doc × term counts. Learned representations (contextual embeddings): dense vectors (e.g., 768 dims) per doc; dense and compact; context-aware ('bank' changes); learned end-to-end; stronger generalization.
(23) Fine-tuning workflow recap. A compact end-to-end workflow for adapting pretrained models to your task: (1) raw text/data (your labeled training data) → (2) pretrained model (provides rich representations) → (3) fine-tune (adapt to task, update some parameters) → (4) task classifier (head, small classifier on top) → (5) prediction (label/score/action).

Key points to remember: pretrained models reduce manual NLP feature engineering dramatically; lower layers encode general language knowledge reusable across tasks; fine-tuning adapts the model to your domain with minimal data and compute; careful fine-tuning (LR, freezing, regularization) prevents forgetting and improves generalization; use representations, not rules, and let the model learn and transfer what matters.

Page 3: how fine-tuning actually works. What each layer knows, why you freeze the bottom and change the top, learning-rate caution, and the compact five-stage workflow.
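
The three learning-rate regimes from card 19 (too high, just right, too low) show up even on a one-parameter toy problem. The sketch below is not a fine-tuning run: it is plain gradient descent on an assumed quadratic loss, and the three LR values are chosen for this toy only, so they are not comparable to the 10^-4 to 10^-3 range on the slide.

```python
def fine_tune_toy(lr, steps=20, w_start=0.0, w_target=3.0):
    """Gradient descent on loss(w) = (w - w_target)**2; returns the final loss."""
    w = w_start
    for _ in range(steps):
        grad = 2.0 * (w - w_target)  # derivative of the quadratic loss
        w -= lr * grad               # the update whose size the LR controls
    return (w - w_target) ** 2

initial_loss = (0.0 - 3.0) ** 2      # loss before any update

loss_too_high = fine_tune_toy(lr=1.1)      # each step overshoots further: divergence
loss_just_right = fine_tune_toy(lr=0.3)    # converges quickly
loss_too_low = fine_tune_toy(lr=0.001)     # barely moves in 20 steps

print("initial:   ", initial_loss)
print("too high:  ", loss_too_high)
print("just right:", loss_just_right)
print("too low:   ", loss_too_low)
```

With the high LR the loss ends up worse than where it started, which is the toy analogue of overwriting pretrained knowledge; with the low LR it has hardly improved, the analogue of underfitting.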

Page 4 · Zero-shot, few-shot + prompt era

Page 4 of 4. Thirteen cards (24–36):

(24) Removing the annotated dataset. LLMs + prompting can reduce or remove the need for task-specific labeled datasets: manual labeling (expensive) vs prompting (less/no labels). Why it matters: lower annotation cost, faster iteration, broader domain coverage. Example: classify intent → prompt the model with instructions instead of training on thousands of labeled utterances.
(25) Zero-shot learning. Use only an instruction/prompt with no examples provided; the model relies on prior knowledge from pretraining. Example prompt: 'Classify the sentiment of the following review as Positive or Negative. Input: The movie was fantastic and truly inspiring!' → Output: Positive. Why it matters: fastest to set up, works surprisingly well for clear, common tasks.
(26) One-shot learning. Provide one input-output example plus a new input; the model infers the pattern from a single demonstration. Example: 'Great acting and story.' → Positive, then new input: 'The plot was boring and predictable.' → Output: Negative. Why it matters: helpful when tasks are nuanced or under-specified.
(27) Few-shot learning. Provide a few input-output examples plus a new input (k examples); more examples = better accuracy and robustness. Example: (1) 'Loved it! Amazing.' Positive (2) 'Terrible experience.' Negative (3) 'Pretty good overall.' Positive, new input: 'It was okay, nothing special.' → Output: Neutral. Why it matters: stronger performance for harder or domain-specific tasks.
(28) GPT-3 reframing NLP as text generation. Cast classification as text generation: prompt (task + context) in → generated label/text out. Prompt example: 'Classify the sentiment (Positive/Negative/Neutral). Review: The service was slow but the staff were very friendly.' → LLM → 'Neutral' or 'Because the positives and negatives balance'. Works for labels/spans/lists/reasoning/extraction. Why it matters: one general interface for many tasks unlocks new capabilities.
(29) Model size / scaling. Larger models + more data + more compute → higher capabilities, but cost and complexity rise quickly. The capability vs model-size/data/compute plot shows the capability curve rising and the cost curve rising more sharply. Why it matters: scaling brings gains but with steep trade-offs.
(30) Prompt engineering. Design prompts that guide the model clearly and consistently; good prompts improve accuracy/reliability/controllability. Prompt template: Task (what to do) + Context (background) + Constraints (what not to do) + Examples (optional) + Output format (JSON, list, etc.). Clear prompt → clearer model response.
(31) Limits: model complexity. Large models are harder to understand/control/debug; behavior can be opaque and inconsistent. Why it matters: risk of hallucinations/bias/misuse; explanations are not guaranteed. Symptoms: harder to debug, black-box behavior, bias and safety risks, unreliable in edge cases.
(32) Limits: money to train. Training and running large models costs significant money and energy. Why it matters: budgets/carbon footprint/accessibility are real limits. Four cost categories: high financial cost, huge compute requirements, energy and carbon impact, access inequality (gatekeeping).
(33) Zero-shot vs one-shot vs few-shot recap table.
    zero-shot: examples 0 (instruction only); data needed none; accuracy potential medium; cost/latency lowest; best when: simple, common tasks.
    one-shot: examples 1; data needed very little; accuracy potential medium-high; cost/latency low; best when: nuanced tasks need a hint.
    few-shot: examples k (few); data needed small; accuracy potential high; cost/latency medium; best when: hard or domain-specific tasks.
(34) When to use pretrained models. Use zero/few-shot when labeled data is scarce or fast iteration is needed; fine-tune when you need higher accuracy/consistency/domain adaptation. A hybrid approach works well: prompt now, collect data slowly, then fine-tune. 'Start here' decision tree: try zero-shot → try few-shot → gather small labels → fine-tune if needed → deploy and evaluate.
(35) Classical ML vs Deep/Prompt-era summary table.
    features: classical manual, hand-crafted vs deep learned automatically.
    representation: classical sparse, shallow vs deep dense, contextual, rich.
    data need: classical less (with good features) vs deep more (pretraining helps).
    adaptation: classical engineer new features vs deep fine-tune or prompt.
    workflow: classical task-specific pipeline vs deep reusable LM + prompts.
    cost: classical lower compute vs deep higher compute and dollars.
    interpretability: classical higher (inspectable) vs deep lower (black-box).
(36) Key practical takeaway / summary rules. Learned representations replace manual features; fine-tuning adapts general models to your domain; prompts can replace labels in some cases; bigger models are powerful but expensive.

Quick rules of thumb: start with prompting (zero/few-shot); escalate to fine-tuning when accuracy gaps remain; invest in prompt engineering, since small tweaks matter; balance accuracy/cost/latency/safety; measure/monitor/mitigate risks.

Trade-offs to remember: accuracy ↑ with model size and data but with diminishing returns; latency ↑ with model size and long prompts; control ↓ as models grow (harder to constrain); cost ↑ exponentially with scale.

Page 4: the prompt era. Zero-shot, one-shot, few-shot; GPT-3's reframing of NLP as text generation; scaling and its costs; the decision tree for when to prompt vs when to fine-tune.
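
Zero-shot, one-shot, and few-shot prompts differ only in how many worked examples sit between the instruction and the new input. The sketch below builds all three from the example texts on cards 25 to 27. The `build_prompt` helper and its Input/Output layout are assumptions for illustration, and no model is called: the string would still need to be sent to whichever LLM you use.

```python
def build_prompt(instruction, examples, new_input):
    """Assemble a prompt: instruction, then k worked examples, then the new input.

    examples is a list of (input_text, output_label) pairs:
    empty for zero-shot, one pair for one-shot, k pairs for few-shot.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {new_input}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

instruction = "Classify the sentiment of the following review as Positive or Negative."

zero_shot = build_prompt(
    instruction, [], "The movie was fantastic and truly inspiring!")

one_shot = build_prompt(
    instruction,
    [("Great acting and story.", "Positive")],
    "The plot was boring and predictable.")

few_shot = build_prompt(
    instruction,
    [("Loved it! Amazing.", "Positive"),
     ("Terrible experience.", "Negative"),
     ("Pretty good overall.", "Positive")],
    "It was okay, nothing special.")

print(few_shot)
```

Keeping the examples as data rather than hard-coding them into the string makes it easy to walk the decision tree from card 34: start with an empty list, add examples only if zero-shot falls short.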

Final exam traps

Deep learning needs lots of labels — that is the catch, not a footnote. Everything about the pretrain-then-fine-tune workflow exists to work around this constraint.
Do not use CNNs for text classification. Same rule as Part 6. They capture local structure but not sequential information.
The prediction layer of a pretrained language model is what you throw away. The hidden states are the representation you actually wanted.
Self-attention is the sentence attending to itself. RNN passes information sequentially; transformer relates any two positions directly. That single change is the transformer's whole edge.
Self-supervised is not "unsupervised" — labels come from the text itself. Next-word (GPT-style) or fill-missing-word (BERT-style). The signal is automatic, not absent.
Catastrophic forgetting means the model loses its general language knowledge if you fine-tune too hard. The fix is gentle fine-tuning — small LR, few epochs, watch for overfitting, freeze the bottom layers.
The learning-rate finder is not optional. LR is the single most important hyperparameter when fine-tuning. Sweep, plot loss vs LR, and pick the LR where the loss is falling most steeply, not the LR where the loss is lowest.
Discriminative learning rates: smaller LR for the lower layers (those closest to the input), because you trust them more. They were learned on much more data.
fit_one_cycle (fastai's one-cycle training schedule): one epoch is often enough. Super-convergence is a real practical win, not marketing.
Zero-shot only works at frontier scale. GPT-3 175B ≈ 70% (near fine-tuned baseline). 13B ≈ noise. 1.3B ≈ useless. The single most important feature is model size.
Prompt engineering is the new NLP expertise for frozen models. Whether you find that thrilling or depressing depends on the day.
Frontier models cannot be self-hosted. GPT-3, GPT-4, Claude, Gemini — access via API and per-call cost. Infra cost gone, inference cost forever.
Zero-shot ≠ no knowledge of the task. The model has seen the task pattern in pretraining. "Zero-shot" means zero labelled task-specific examples at inference.
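
Going back to the learning-rate finder and discriminative learning rate traps above, the two steps can be sketched together: pick an LR from a sweep, then spread smaller LRs down the layer groups. The sweep numbers below are invented, the function names are hypothetical, and both the "lower end of the steepest interval" rule and the geometric spacing with a factor of 100 are assumptions made for this sketch, not a claim about what any library does.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the LR at the start of the interval where loss drops fastest
    per unit of log10(LR). Returns None if the loss never decreases."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:  # more negative = steeper descent
            best_lr, best_slope = lrs[i - 1], slope
    return best_lr

def discriminative_lrs(lr_bottom, lr_top, n_groups):
    """Geometrically spaced LRs: smallest for the bottom group, largest for the top."""
    if n_groups == 1:
        return [lr_top]
    ratio = (lr_top / lr_bottom) ** (1.0 / (n_groups - 1))
    return [lr_bottom * ratio ** i for i in range(n_groups)]

# Invented sweep: the loss is lowest at 1e-3, but it falls fastest just before that.
sweep_lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
sweep_losses = [2.30, 2.28, 2.10, 1.40, 1.90, 6.00]

picked = steepest_descent_lr(sweep_lrs, sweep_losses)
lowest_loss_lr = sweep_lrs[sweep_losses.index(min(sweep_losses))]
print("picked:", picked, "| LR at lowest loss:", lowest_loss_lr)

# Three layer groups (bottom, middle, top); the bottom gets 100x less than the top.
layer_lrs = discriminative_lrs(picked / 100, picked, n_groups=3)
print("per-group LRs, bottom to top:", layer_lrs)
```

In this sweep the picked LR sits one step below the LR with the lowest loss, which is the point of the trap: by the time the loss bottoms out, the LR is already on the edge of diverging.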