From c4686787b6168f77f790c5803e61bd2ff10ff450 Mon Sep 17 00:00:00 2001
From: Ed_ <edwardgz@gmail.com>
Date: Sun, 21 Jun 2026 16:12:18 -0400
Subject: [PATCH] conductor(cs229): Phase 3 OCR - 115 frames OCR'd in 5.1s via
 winsdk (28KB markdown)

---
 .../artifacts/ocr.md                          | 1499 +++++++++++++++++
 1 file changed, 1499 insertions(+)
 create mode 100644 conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md

diff --git a/conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md b/conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md
new file mode 100644
index 00000000..afcd3664
--- /dev/null
+++ b/conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md
@@ -0,0 +1,1499 @@
+# OCR Results
+
+## frame_00001.jpg
+
+```
+Introduction to
+Building LLMs
+CS229: Machine Learning
+Yann Dubois Aug. 13th 2024
+Slides partially based on CS336, CS224N, CS324
+Stanford
+```
+
+## frame_00002.jpg
+
+```
+Stanford
+```
+
+## frame_00003.jpg
+
+```
+3
+What matters when training LLMs
+Stanford
+```
+
+## frame_00004.jpg
+
+```
+Stanford
+```
+
+## frame_00005.jpg
+
+```
+What matters when training LLMs
+• Architecture
+Most of
+academia
+• Training algorithm/loss
+• Data
+• Evaluation
+Systems
+Model
+Stanford
+```
+
+## frame_00006.jpg
+
+```
+Stanford
+```
+
+## frame_00008.jpg
+
+```
+Stanford
+```
+
+## frame_00009.jpg
+
+```
+Language Modeling
+LM: probability distribution over sequences of tokens/words p(X1, ..., XL)
+Stanford
+```
+
+## frame_00011.jpg
+
+```
+Stanford
+```
+
+## frame_00012.jpg
+
+```
+Stanford
+```
+
+## frame_00013.jpg
+
+```
+Stanford
+```
+
+## frame_00014.jpg
+
+```
+Language Modeling
+• LM: probability distribution over sequences of tokens/words p(X1, ..., XL)
+P(the, mouse, ate, the, cheese) = 0.02
+P(the, the, mouse, ate, cheese) = 0.0001
+P(the, cheese, ate, the, mouse) = 0.001
+• LMs are generative models:
+p(X1, ..., XL)
+Syntactic knowledge
+Semantic knowledge
+Stanford
+```
+
+## frame_00015.jpg
+
+```
+Language Modeling
+• LM: probability distribution over sequences of tokens/words p(X1, ..., XL)
+P(the, mouse, ate, the, cheese) = 0.02
+P(the, the, mouse, ate, cheese) = 0.0001
+P(the, cheese, ate, the, mouse) = 0.001
+• LMs are generative models:
+p(X1, ..., XL)
+• Autoregressive (AR) language models:
+Syntactic knowledge
+Semantic knowledge
+Stanford
+```
+
+## frame_00018.jpg
+
+```
+Stanford
+```
+
+## frame_00019.jpg
+
+```
+Stanford
+```
+
+## frame_00020.jpg
+
+```
+Stanford
+```
+
+## frame_00021.jpg
+
+```
+AR Neural Language Models
+Stanford
+https://lena-voita.github.io/nlp_course/language_modeling.html#intro
+```
+
+## frame_00022.jpg
+
+```
+Stanford
+```
+
+## frame_00023.jpg
+
+```
+7
+AR Neural Language Models
+|V| tokens
+—o
+—o
+d-sized
+vector
+Linear—
+layer
+o
+softmax
+P(* | I saw a cat on a)
+Transform h linearly
+from size d to |V| - the
+vocabulary size
+Neural network
+O
+o
+o
+o
+o
+O
+I
+O
+O
+o
+O
+saw
+o
+o
+o
+o
+a
+o
+o
+o
+o
+cat
+o
+O
+o
+o
+on
+O
+o
+o
+h: vector representation of
+context I saw a cat on a
+Input word embeddings
+https://lena-voita.github.io/nlp_course/language_modeling.html#intro
+get probability
+distribution for
+the next token
+process context
+(previous history)
+Stanford
+```
+
+## frame_00024.jpg
+
+```
+Tokenizer
+Stanford
+```
+
+## frame_00026.jpg
+
+```
+Stanford
+```
+
+## frame_00027.jpg
+
+```
+Tokenizer
+• why?
+• More general than words (eg typos)
+• Shorter sequences than with characters
+Stanford
+```
+
+## frame_00028.jpg
+
+```
+Tokenizer
+• why?
+• More general than words (eg typos)
+• Shorter sequences than with characters
+• Idea: tokens as common subsequences (~3 letters)
+• Eg: Byte Pair Encoding (BPE). Train steps:
+Stanford
+```
+
+## frame_00029.jpg
+
+```
+Stanford
+```
+
+## frame_00030.jpg
+
+```
+Tokenizer
+• why?
+• More general than words (eg typos)
+• Shorter sequences than with characters
+• Idea: tokens as common subsequences
+• Eg: Byte Pair Encoding (BPE). Train steps:
+1. Take large corpus of text
+2. Start with one token per character
+11
+tokenizer:
+text to token
+Index
+3. Merge common pairs of tokens into a token
+Stanford
+```
+
+## frame_00031.jpg
+
+```
+Stanford
+```
+
+## frame_00032.jpg
+
+```
+Stanford
+```
+
+## frame_00033.jpg
+
+```
+17
+LLM evaluation: Perplexity
+• Idea: validation loss
+PPL(x_{1:L})
+= \prod_{i=1}^{L} p(x_i \mid x_{1:i-1})^{-1/L}
+• To be more interpretable: use perplexity
+• avg per token (~independent of length)
+• Exponentiate => units independent of log base
+Stanford
+```
+
+## frame_00034.jpg
+
+```
+Stanford
+```
+
+## frame_00035.jpg
+
+```
+17
+LLM evaluation: Perplexity
+• Idea: validation loss
+PPL(x_{1:L})
+= 2^{\frac{1}{L} \mathcal{L}(x_{1:L})}
+= \prod_{i=1}^{L} p(x_i \mid x_{1:i-1})^{-1/L}
+• To be more interpretable: use perplexity
+• avg per token (~independent of length)
+• Exponentiate => units independent of log base
+• Perplexity: between 1 and |Vocab|
+• Intuition: number of tokens that you are hesitating between
+Stanford
+```
+
+## frame_00036.jpg
+
+```
+Stanford
+```
+
+## frame_00037.jpg
+
+```
+19
+LLM Evaluation: agg. std NLP benchmarks
+Holistic evaluation of language models (HELM)
+Huggingface open LLM leaderboard
+Scenarioo
+Models
+GPT-4
+GPT-4 Turbo preview)
+palmyra X V3 (72B)
+palmyra X V2 (33B)
+PaLM-2 (unicorn)
+Yi (34B)
+0.962
+0.021
+Leaderboard
+HELM
+2
+collect many automatically evaluatable
+benchmarks, evaluate across them
+Stanford
+```
+
+## frame_00038.jpg
+
+```
+20
+LLM Evaluation: agg. std NLP benchmarks
+• Mix of things that can be "easily"
+evaluated
+• Typically there is "gold" answer
+you likelihood of LLM to
+predict that vs other options
+HELM-lite
+[Liang+ 2022]
+Scenario
+NarrativeQA
+narrative_qa
+NaturalQuestions (closed-book)
+NaturalQuestions (open-book)
+OpenbookQA
+openbookqa
+MMLU (Massive Multitask Language
+Understanding)
+GSM8K (Grade School Math)
+MATH
+LegalBench
+MedQA
+med_qa
+WMT 2014
+short-answer question
+question
+answering
+question answering
+question answering
+question answering
+numeric answer
+question answering
+multiple-choice
+question
+machine translation
+are books and movie scripts,
+are unknown
+Wikipedia,
+queries
+from Wikipedia, questions from
+queries
+elementary science
+math, science, history, etc.
+grade school math word problems
+etc.)
+public legal and administrative documents,
+medical
+annotators from
+web
+web users
+Turk workers
+various online sources
+and Surge AI
+lawyers
+Stanford
+Europarl, news,
+Crawl, etc.
+```
+
+## frame_00039.jpg
+
+```
+LLM Evaluation:
+• Example: MMLU
+• ~Most trusted pretraining benchmark
+Astronomy
+What is true for a type-Ia supernova?
+A. This type occurs in binary systems.
+B. This type occurs in young galaxies.
+C. This type produces gamma-ray bursts.
+D. This type produces high amounts of X-rays.
+Answer: A
+High School Biology
+In a population of giraffes, an environmental change occurs that favors individuals that are
+tallest. As a result, more of the taller individuals are able to obtain nutrients and survive to
+pass along their genetic information. This is an example of
+A. directional selection.
+B. stabilizing selection.
+C. sexual selection.
+D. disruptive selection.
+Answer: A
+eg MMLU
+Abstract Algebra
+Anatomy
+Astronomy
+Business Ethics
+Clinical Knowledge
+College Biology
+College Chemistry
+College Comp Sci
+College Mathematics
+College Medicine
+College Physics
+Computer Security
+Conceptual Physics
+Econometrics
+Electrical Engineering
+Elementary Mathematics
+Formal Logic
+Global Facts
+High School Biology
+High School Chemistry
+High School Comp Sci
+High School European History
+Stanford
+MMLU
+[Hendrycks+ 2020]
+21
+UnifiedQA
+Random
+```
+
+## frame_00041.jpg
+
+```
+Stanford
+```
+
+## frame_00042.jpg
+
+```
+(no text extracted)
+```
+
+## frame_00043.jpg
+
+```
+22
+Evaluation: challenges
+Sensitivity to prompting/inconsistencies
+llama-65b
+tiiuae/falcon-40b
+llama-30b
+EleutherAI/gpt-neox-20b
+llama-13b
+llama-7b
+tiiuae/falcon-7b
+MMLU
+(HELM)
+0.637
+0.571
+0.583
+0.256
+0.471
+0.339
+0.278
+MMLU
+(Harness)
+0.488
+0.527
+0.457
+0.333
+0.377
+0.342
+0.35
+MMLU
+(Original)
+0.636
+0.558
+0.584
+0.262
+0.47
+0.351
+0.254
+Stanford
+```
+
+## frame_00044.jpg
+
+```
+Stanford
+```
+
+## frame_00046.jpg
+
+```
+23
+Evaluation: challenges
+Sensitivity to prompting/inconsistencies
+Train & test contamination (~not important for development)
+I suspect GPT-4's performance is influenced by data contamination, at
+least on Codeforces.
+Of the easiest problems on Codeforces, it solved 10/10 pre-2021
+problems and 0/10 recent problems.
+This strongly points to contamination.
+greedy, implementation
+Ulanqlul
+brute force, geometry, math
+greedy, implementation, math
+Action'
+Intaryleu-p.rnblnm
+I think Phi-1.5 trained on the benchmarks. Particularly, GSM8K.
+Sep 12
+Let's take github.com/openai/grade-s...
+If you truncate and this question into Phi-1.5, it autocompletes to
+calculating the # of downloads in the 3rd month, and does so correctly.
+Change the number a bit, and it answers correctly as well.
+Stanford
+```
+
+## frame_00047.jpg
+
+```
+Stanford
+```
+
+## frame_00048.jpg
+
+```
+Overview
+Pretraining -> GPT3
+Task & loss
+Evaluation
+Data
+Post-training -> ChatGPT
+Stanford
+```
+
+## frame_00049.jpg
+
+```
+25
+Data
+• Idea: use all of the clean internet
+• Note: internet is dirty & not representative of what we want. Practice:
+Stanford
+```
+
+## frame_00052.jpg
+
+```
+25
+Data
+• Idea: use al
+PUBLIC
+"-//W3C//OTD XHTML Tron3itionot//EN" "http
+Note: inte
+1.
+Dow
+gmtne."http://www.w3.org/1999/)(htmt"> •:meto content-"text/htmt; cr,areet'Utf-6" -
+Downloads Free 000 - Download 000 Software</titLe» •meta eontent='i000 084 at Smart Code for free download. 000
+084 troowaro and snarowaro froe downtoadfi." €moto typo'"koyword8" 084, downvoaas, frocwaro, softt*nro, froe, 000-084 Tost
+prep 000-004, ropcerts Questions end Answers, Free practice exam Questions, ADOBE
+9AO-éé40 icon"
+(if Ito tvpoa"toxt/cgs" •:gcript
+Ohtto://www.emartcode.com":Oe#1pt» act'1Pt typer"text/javogct'lnti'
+type-iitext/css" <scrlpt
+€script
+<d1V
+att=0Premier website for Windows Shareware and Freeware <div escript
+<form oction'"http://oww.emort00d0.com/downt0008" motnoo•ogot"
+einout
+(false): return toLGc;ii
+href•"http://www.8martcodo.com/":•Homo€/a:••:/U.* hrof•"http://www.smartcodo.com/db/a11rootandsvbcat8.php">Categor10s«/a></ti>
+hrefz0http://www.smartcode.com/subrnit/">Submits/a*</Lå> hrnI':"http://www.grnarL•codo.corn/main/rtn/0>kSS</a></1i>
+nrtf""http;//www.smortcodo.com/moin/contact.htmt"»contoct</ö></li>
+padding-bottom: hl float: left: j' diV,paoer font-gaze: Ilpx: float: rioht: DOddLno-top: ediV
+S G/div> <tablc str>
+qcrjpt src."/de8tgn/gggqyog.38"
+e/tds ediv
+Gg'.</$tranq? 1 2
+e/dLV> is source far the System x Performance Servera...
+your source for tho System x High Porfarmancg Servcrfi..
+hre+•"http: Of the best and most rewarding features
+Of {'he 000-08/' training are that, . , One Of the and moat rewarding featur•eg Of the 000-004 materialB are
+ford
+that. , hrof="http;//posgzavro-ibm-aee-gez..$martcodo.com/inio.htmt">notoilö</o:.€/p»:br>
+freo 000-084 questions and 000-084 oxom quoettone ero , oowrnood froe 000-084 quogtlone end anowore. 000-034
+que±tioriS are ultimate.. claes=i'detaiu-llnkii href=iihttp://topdeFts-000-og4-questiang-Ond-angwerg.gmar•tcade.eom/inf0
+. gp is tho in IT cortifications thot Offers a 1ß0X monoy Back
+pass-Guarantaod
+18 the leaden In IT certifications that offers a 100% nonev neck.. cmage-"deteltg-lithk"
+```
+
+## frame_00056.jpg
+
+```
+25
+Data
+• Idea: use all of the clean internet
+• Note: internet is dirty & not representative of what we want. Practice:
+1.
+2.
+3.
+4.
+5.
+6.
+Download all of internet. Common crawl: 250 billion pages, > 1PB (>1e6 GB)
+Text extraction from HTML (challenges: math, boiler plate)
+Filter undesirable content (e.g. NSFW, harmful content, PII)
+Deduplicates (url/document/line). E.g. all the headers/footers/menu in forums are always same
+Heuristic filtering. Rm low quality documents (e.g. # words, word length, outlier toks, dirty toks)
+Model based filtering. Predict if page could be referenced by Wikipedia.
+Stanford
+```
+
+## frame_00057.jpg
+
+```
+25
+Data
+• Idea: use all of the clean internet
+• Note: internet is dirty & not representative of what we want. Practice:
+1.
+2.
+3.
+4.
+5.
+6.
+7.
+Download all of internet. Common crawl: 250 billion pages, > 1PB (>1e6 GB)
+Text extraction from HTML (challenges: math, boiler plate)
+Filter undesirable content (e.g. NSFW, harmful content, PII)
+Deduplicates (url/document/line). E.g. all the headers/footers/menu in forums are always same
+Heuristic filtering. Rm low quality documents (e.g. # words, word length, outlier toks, dirty toks)
+Model based filtering. Predict if page could be referenced by Wikipedia.
+Data mix. Classify data categories (code/books/entertainment). Reweight domains using scaling
+laws to get high downstream performance.
+Stanford
+```
+
+## frame_00059.jpg
+
+```
+(no text extracted)
+```
+
+## frame_00060.jpg
+
+```
+Stanford
+```
+
+## frame_00061.jpg
+
+```
+26
+Data
+• Collecting well data is a huge part of practical LLM (~the key)
+Stanford
+```
+
+## frame_00062.jpg
+
+```
+Stanford
+```
+
+## frame_00063.jpg
+
+```
+(no text extracted)
+```
+
+## frame_00064.jpg
+
+```
+26
+Data
+• Collecting well data is a huge part of practical LLM (~the key)
+• Lot of research to be done!
+How do you process well and efficiently?
+How do you balance domains?
+Synthetic data?
+Multi-modal data?
+Stanford
+```
+
+## frame_00065.jpg
+
+```
+Stanford
+```
+
+## frame_00066.jpg
+
+```
+Stanford
+```
+
+## frame_00067.jpg
+
+```
+26
+Data
+• Collecting well data is a huge part of practical LLM (~the key)
+• Lot of research to be done!
+How do you process well and efficiently?
+How do you balance domains?
+• A lot of secrecy:
+Competitive dynamics
+• Common academic datasets:
+• C4 (150B tokens | 800GB)
+• The Pile (280B tokens)
+Synthetic data?
+Multi-modal data?
+Copyright liability
+• Dolma (3T tokens)
+• FineWeb (15T tokens)
+Stanford
+```
+
+## frame_00068.jpg
+
+```
+Stanford
+```
+
+## frame_00069.jpg
+
+```
+26
+Data
+• Collecting well data is a huge part of practical LLM (~the key)
+• Lot of research to be done!
+How do you process well and efficiently?
+How do you balance domains?
+• A lot of secrecy:
+Competitive dynamics
+• Common academic datasets:
+• C4 (150B tokens | 800GB)
+• The Pile (280B tokens)
+Synthetic data?
+Multi-modal data?
+Copyright liability
+• Dolma (3T tokens)
+• FineWeb (15T tokens)
+Stanford
+• Closed: LLaMA 2 (2T tokens), LLaMA 3 (15T tokens), GPT-4 (~13T tokens?)
+```
+
+## frame_00070.jpg
+
+```
+Stanford
+```
+
+## frame_00071.jpg
+
+```
+28
+Scaling laws
+• Empirically: more data and larger models => better performance
+o Large models overfitting
+Stanford
+```
+
+## frame_00072.jpg
+
+```
+(no text extracted)
+```
+
+## frame_00073.jpg
+
+```
+Stanford
+```
+
+## frame_00074.jpg
+
+```
+29
+Scaling laws: tuning
+• You have 10K GPUs for a month, what model do you train?
+Stanford
+```
+
+## frame_00075.jpg
+
+```
+Stanford
+```
+
+## frame_00076.jpg
+
+```
+29
+Scaling laws: tuning
+• You have 10K GPUs for a month, what model do you train?
+• Old pipeline:
+• Tune hyperparameters on big models (e.g. 30 models)
+• Pick the best => final model is trained for as much as each filtered out ones (e.g. 1 day)
+• New pipeline:
+• Find scaling recipes (eg lr decrease with size)
+Stanford
+```
+
+## frame_00077.jpg
+
+```
+Stanford
+```
+
+## frame_00078.jpg
+
+```
+Stanford
+```
+
+## frame_00079.jpg
+
+```
+Stanford
+```
+
+## frame_00080.jpg
+
+```
+Scaling laws:
+Q: Should we use transformers or LSTM?
+Test Loss 5.4
+4.2
+3.6
+3.0
+2.4
+30
+eg LSTM
+LSTMs
+1 Layer
+2 Layers
+4 Layers
+Transformers
+10^5
+10^6
+10^7
+Parameters (non-embedding)
+10^8
+10^9
+Stanford
+Scaling laws
+[Kaplan+ 2020]
+```
+
+## frame_00082.jpg
+
+```
+Stanford
+```
+
+## frame_00083.jpg
+
+```
+31
+Scaling laws: eg Chinchilla
+Q: How do we optimally allocate training* resources (size vs data)?
+```
+
+## frame_00084.jpg
+
+```
+Stanford
+```
+
+## frame_00085.jpg
+
+```
+31
+Scaling laws: eg Chinchilla
+• Q: How do we optimally allocate training* resources (size vs data)?
+3.2
+3.0
+91 2.8
+2.4
+2.2
+2.0
+soflop:
+va tokens 8'
+a eters
+-o-
+-•-
+6e18
+1e19
+3e19
+6e19
+1e20
+3e20
+6e20
+1e21
+3e21
+100M
+a)
+300M 1B
+3B 6B
+Parameters
+30B
+```
+
+## frame_00086.jpg
+
+```
+Stanford
+```
+
+## frame_00087.jpg
+
+```
+Stanford
+```
+
+## frame_00088.jpg
+
+```
+Stanford
+```
+
+## frame_00089.jpg
+
+```
+Stanford
+```
+
+## frame_00090.jpg
+
+```
+Stanford
+```
+
+## frame_00091.jpg
+
+```
+34
+Training a SOTA model
+Example of current SOTA: LLaMA 3 400B
+Data: 15.6T tokens
+Parameters: 405B
+Stanford
+```
+
+## frame_00092.jpg
+
+```
+34
+Training a SOTA model
+• Example of current SOTA: LLaMA 3 400B
+Data: 15.6T tokens
+FLOPs: 6NP = 6 * 15.6e12 * 405e9 = 3.8e25 FLOPs
+~40 tok/param train
+compute optimal
+Parameters: 405B
+~2x less than executive order
+Stanford
+```
+
+## frame_00096.jpg
+
+```
+34
+Training a SOTA model
+• Example of current SOTA: LLaMA 3 400B
+Data: 15.6T tokens
+FLOPs: 6NP = 6 * 15.6e12 * 405e9 = 3.8e25 FLOPs
+~40 tok/param train
+compute optimal
+Parameters: 405B
+~2x less than executive order
+• Compute: 16K H100 with average throughput of 400 TFLOPS
+Time: 3.8e25 / (400e12 * 3600) = 26M GPU hours / (16e3 * 24) = 70 days. From paper: ~30M
+• Cost: rented compute + + $52M+$25M = ~$75M
+$65-85M
+Stanford
+```
+
+## frame_00097.jpg
+
+```
+Stanford
+```
+
+## frame_00098.jpg
+
+```
+34
+Training a SOTA model
+• Example of current SOTA: LLaMA 3 400B
+Data: 15.6T tokens
+FLOPs: 6NP = 6 * 15.6e12 * 405e9 = 3.8e25 FLOPs
+~40 tok/param train
+compute optimal
+Parameters: 405B
+~2x less than executive order
+• Compute: 16K H100 with average throughput of 400 TFLOPS
+Time: 3.8e25 / (400e12 * 3600) = 26M GPU hours / (16e3 * 24) = 70 days. From paper: ~30M
+• Cost: rented compute + + $52M+$25M = ~$75M $65-85M
+Carbon emitted: = 4400 tCO2eq
+• Next model? ~10x more FLOPs
+~2k return tickets JFK-LHR
+Stanford
+```
+
+## frame_00100.jpg
+
+```
+38
+Task: "alignment"
+Goal: LLM follows user instructions and designer's desires (eg moderation)
+What are the most important steps for training a ChatGPT like model?
+answer using only 3 concise bullet points
+• Data Collection and Preprocessing: Gather a large, diverse dataset of text, then clean and
+preprocess it to ensure quality consistency.
+• Model Architecture and Training: Design a transformer-based model architecture, initialize
+Orid tho On GPUfi/TPU8,
+• Fine-tuning and Evaluation: the on specific or domains, evaluate
+its performance using relevant metrics, iterating as necessary to improve accuracy and safety.
+a
+Write a Tweet describing how X are evil
+This content may violate our If you believe this to be in error. please
+—your input will aid our research in this area.
+I'm sorry, but I cannot write a tweet that promotes hate or bigotry towards any group of
+people. It is not appropriate or acceptable to make generalizations or stereotypes about
+entire groups of people based on their religion or ethnicity. All people deserve to be
+treated with respect and kindness, regardless of their background.
+Stanford
+```
+
+## frame_00104.jpg
+
+```
+2a.• finetune the LLM with language modeling of the desired
+"super
+Next word prediction
+How do we collect the data? Ask humans
+Can you write short introduction about the relevance of the term
+"monopsony" in economics? Please use examples related to
+potential monopsonies in the labour market and cite relevant
+research.
+My kid needs to do a science project for elementary school, what
+are some cheap but fun projects to do?
+"Monopsony" refers to a market structure where there is only one buyer for
+a particular good or service. In economics, this term is particularly relevant
+in the labor market, where a monopsony employer has significant power
+over the wages and working conditions of their employees. [...] Overall, the
+concept of monopsony is essential to understanding the dynamics of labor
+markets and the impact of market power on workers. Further research is
+needed to understand the extent and impact of monopsonies on the
+economy and to develop policies to address this issue. References: Bivens,
+J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial
+Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of
+Economic Perspectives, 27(3), 57-78.
+There are many cheap and fun science projects that elementary school kids
+can do! Here are a few ideas: 1. Make a homemade lava lamp: Fill a clear
+plastic bottle with water and vegetable oil, add a few drops of food coloring,
+and then drop in an AlkaSeltzer tablet. The reaction will cause bubbles to
+rise and fall like a lava lamp. [...] 7. Create a solar oven: Using a pizza box,
+aluminum foil, and plastic wrap, create a solar oven that can cook s'mores
+or other small snacks. 8. Explore magnets: Using magnets and a variety of
+objects, have your child experiment with magnetism to see what objects are
+attracted to magnets and which ones are not. These are just a few ideas to
+get you started! There are many more cheap and fun science projects that
+your child can do with household ite
+This was the
+-key to GPT3 ChatGPT model!
+```
+
+## frame_00105.jpg
+
+```
+Stanford
+```
+
+## frame_00106.jpg
+
+```
+41
+Scalable data for SFT: eg Alpaca
+Stanford
+```
+
+## frame_00109.jpg
+
+```
+Stanford
+```
+
+## frame_00110.jpg
+
+```
+Scalable data for SFT:
+• You need very little data for SFT! ~few thousand
+4.0
+= 3.8
+8 3.6
+3.4
+3.2
+2K
+43
+quantity?
+LIMA
+[Zhou+ 2023]
+32K
+4K
+8K
+16K
+# Training Examples
+• Just learns the format of desired answers (length, bullet points, ... )
+The knowledge is already in the pretrained LLM!
+Specializes to one "type of user"
+Stanford
+```
+
+## frame_00111.jpg
+
+```
+Stanford
+```
+
+## frame_00113.jpg
+
+```
+45
+RL from Human Feedback (RLHF)
+Stanford
+```
+
+## frame_00114.jpg
+
+```
+45
+RL from Human Feedback (RLHF)
+• Problem: SFT is behavior cloning of humans
+1. Bound by human abilities: humans may prefer things that they are not able to generate
+2. Hallucination: cloning correct answer teaches LLM to hallucinate if it didn't know about it!
+Stanford
+```
+
+## frame_00115.jpg
+
+```
+Sta
+rd
+```
+
+## frame_00116.jpg
+
+```
+45
+RL from Human Feedback (RLHF)
+Problem: SFT is behavior cloning of humans
+1. Bound by human abilities: humans may prefer things that they are not able to generate
+2. Hallucination: cloning correct answer teaches LLM to hallucinate if it didn't know about it!
+Can you write a short introduction about the relevance of the term
+"monopsony" in economics? Please use examples related to
+potential monopsonies in the labour market and cite relevant
+research.
+"Monopsony" refers to a market structure where there is only one buyer for
+a particular good or service. In economics, this term is particularly relevant
+in the labor market, where a monopsony employer has significant power
+over the wages and working conditions of their employees. [...] Overall, the
+concept of monopsony is essential to understanding the dynamics of labor
+markets and the impact of market power on workers. Further research is
+needed to understand the extent and impact of monopsonies on the
+economy and to develop policies to address this issue. References: Bivens,
+J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial
+Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of
+Economic Perspectives, 27(3), 57-78.
+If LLM doesn't know [Bivens 2013] => teaches the model to make up plausibly sounding references
+Stanford
+```
+
+## frame_00118.jpg
+
+```
+46
+RLHF
+Idea: maximize human preference rather than clone their behavior
+Pipeline:
+1. For each instruction: generate 2 answers from a pretty good model (SFT)
+Instruction
+Stanford
+```
+
+## frame_00120.jpg
+
+```
+46
+RLHF
+Idea: maximize human preference rather than clone their behavior
+Pipeline:
+1.
+2.
+3.
+For each instruction: generate 2 answers from a pretty good model (SFT)
+Ask labelers to select their preferred answers
+Finetune the model to generate more preferred answers
+Instruction
+Stanford
+```
+
+## frame_00122.jpg
+
+```
+PPO
+Idea: use reinforcement learning
+• What is the reward?
+Option 1: whether the model's output is preferred to some baseline
+Issue: binary reward doesn't have much information
+Option 2: train a reward model R using a logistic regression loss to classify preferences.
+p(i > j)
+= \frac{\exp(R(x, \hat{y}_i))}
+{\exp(R(x, \hat{y}_i)) + \exp(R(x, \hat{y}_j))}
+[Bradley-Terry 1952]
+Use logits R(...) as reward
+=> continuous information => information heavy!
+Stanford
+Ill
+```
+
+## frame_00123.jpg
+
+```
+47
+RLHF: PPO
+• Idea: use reinforcement learning
+• What is the reward?
+Option 1: whether the model's output is preferred to some baseline
+Issue: binary reward doesn't have much information
+Option 2: train a reward model R using a logistic regression loss to classify preferences.
+p(i > j)
+= \frac{\exp(R(x, \hat{y}_i))}
+{\exp(R(x, \hat{y}_i)) + \exp(R(x, \hat{y}_j))}
+[Bradley-Terry 1952]
+Use logits R(...) as reward => continuous information => information heavy!
+Stanford
+```
+
+## frame_00124.jpg
+
+```
+Stanford
+```
+
+## frame_00126.jpg
+
+```
+49
+RLHF: PPO challenges
+• Problem: RL in theory simple, in practice messy (clipping, rollouts, outer loops,...)
+Stanford
+```
+
+## frame_00127.jpg
+
+```
+Stanford
+```
+
+## frame_00129.jpg
+
+```
+50
+RLHF: DPO
+• Idea: maximize probability minimize the other
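+\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big(
+\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \Big) \Big]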
+Reinforcement Learning from Human Feedback (RLHF)
+DPO
+[Rafailov+ 2023]
+me about
+tho ot
+label rewards
+LM policy
+preference data maximum
+sample completions
+likelihood
+reinforcement learning
+Direct Preference Optimization (DPO)
+•write me a about
+hlstary ot
+preference data
+maximum
+likelihood
+• This is ~equivalent (same global minima) to RLHF/PPO
+Stanford
+```
+
+## frame_00130.jpg
+
+```
+Stanford
+```
+
+## frame_00131.jpg
+
+```
+(no text extracted)
+```
+
+## frame_00132.jpg
+
+```
+59
+LLM evaluation: spurious correlation
+• e.g. LLM prefers longer outputs
+concise
+gpt4_1106_preview
+Mixtral-8x7B-Instruct-v0.1
+gpt4_0613
+claude-2.1
+gpt-3.5-turbo-1106
+alpaca-7b
+22.
+13.7
+9.4
+9.
+7.4
+2.0
+AlpacaEval
+standard
+50.0
+18.3
+15.8
+15.7
+9.2
+2.6
+verbose
+64.
+24.
+23.
+12.
+2.9
+LC
+Stanford
+```
+
+## frame_00133.jpg
+
+```
+Stanford
+```
+
+## frame_00136.jpg
+
+```
+Stanford
+```
+
+## frame_00137.jpg
+
+```
+61
+Systems
+• Problem: everyone is bottlenecked by compute!
+Why not buy more GPUs?
+GPUs are expensive and scarce!
+Physical limitations (eg communication between GPUs)
+importance of resource allocation (scaling laws) and optimized pipelines
+Stanford
+```
+
+## frame_00138.jpg
+
+```
+68
+• Massively parallel
+• Fast matrix multiplication
+• Compute > memory & communication
+• Memory hierarchy
+• Metric: Model Flop Utilization (MFU)
+Ratio: observed throughput / theoretical best for that GPU
+is great!
+Stanford
+```
+
+## frame_00139.jpg
+
+```
+Stanford
+```
+
+## frame_00140.jpg
+
+```
+68
+Systems: low precision
+• Fewer bits => faster communication & lower memory consumption
+• For deep learning: decimal precision ~doesn't matter except exp & updates
+• Matrix multiplications can use bf16 instead of fp32
+Stanford
+```
+
+## frame_00141.jpg
+
+```
+68
+Systems: low precision
+• Fewer bits => faster communication & lower memory consumption
+• For deep learning: decimal precision ~doesn't matter except exp & updates
+• Matrix multiplications can use bf16 instead of fp32
+• For training: Automatic Mixed Precision (AMP)
+Weights stored in fp32, but before computation convert to bf16
+Stanford
+```
+
+## frame_00142.jpg
+
+```
+(no text extracted)
+```
+
+## frame_00143.jpg
+
+```
+68
+Systems: low precision
+• Fewer bits => faster communication & lower memory consumption
+• For deep learning: decimal precision ~doesn't matter except exp & updates
+• Matrix multiplications can use bf16 instead of fp32
+• For training: Automatic Mixed Precision (AMP)
+Weights stored in fp32, but before computation convert to bf16
+• Activation in bf16 => main memory gains
+• (Only) matrix multiplication in bf16 => speed gains
+• Gradients in bf16 => memory gains
+• Master weights updated fp32 => full precision
+Stanford
+```
+
+## frame_00144.jpg
+
+```
+Stanford
+```
+
+## frame_00145.jpg
+
+```
+69
+Systems: operator fusion
+• Problem:
+• communication is slow
+x1 = x.cos()  # Read from x in global memory, write to x1
+x2 = x1.cos()  # Read from x1 in global memory, write to x2
+• every new PyTorch line moves variables to global memory
+Memory
+DRAM
+Compo
+SRAM
+Compute
+Naive (non-fused)
+Stanford
+```
+
+## frame_00146.jpg
+
+```
+Stanford
+```