From c4686787b6168f77f790c5803e61bd2ff10ff450 Mon Sep 17 00:00:00 2001 From: Ed_ Date: Sun, 21 Jun 2026 16:12:18 -0400 Subject: [PATCH] conductor(cs229): Phase 3 OCR - 115 frames OCR'd in 5.1s via winsdk (28KB markdown) --- .../artifacts/ocr.md | 1499 +++++++++++++++++ 1 file changed, 1499 insertions(+) create mode 100644 conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md diff --git a/conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md b/conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md new file mode 100644 index 00000000..afcd3664 --- /dev/null +++ b/conductor/tracks/video_analysis_cs229_building_llms_20260621/artifacts/ocr.md @@ -0,0 +1,1499 @@ +# OCR Results + +## frame_00001.jpg + +``` +Introduction to +Building LLMs +CS229: Machine Learning +Yann Dubois Aug. 13th 2024 +Slides partially based on CS336, CS224N, CS324 +tanford +``` + +## frame_00002.jpg + +``` +Stanfo d +``` + +## frame_00003.jpg + +``` +3 +What matters when training LLMs +Stanford +``` + +## frame_00004.jpg + +``` +Stanford +``` + +## frame_00005.jpg + +``` +What matters when training LLMs +' Architecture +Most of +academia +e Training algorithm/loss +' Data +e Evaluation +Systems +Model +Stanford +``` + +## frame_00006.jpg + +``` +Stanford +``` + +## frame_00008.jpg + +``` +Stanford +``` + +## frame_00009.jpg + +``` +Language Modeling +LM: probability distribution over sequences of tokens/words p(X1, , XL) +Stanford +``` + +## frame_00011.jpg + +``` +Stanford +``` + +## frame_00012.jpg + +``` +Stanford +``` + +## frame_00013.jpg + +``` +Stanford +``` + +## frame_00014.jpg + +``` +Language Modeling +• LM: probability distribution over sequences of tokens/words p(X1, , XL) +P(the, mouse, ate, the, cheese) = 0.02 +P(the, the, mouse, ate, cheese) = 0.0001 +P(the, cheese, ate, the, mouse) 0.001 +• LMs are generative models: +p(X1, ... 
, XL) +Syntactic knowledge +Semantic knowledge +Stanford +``` + +## frame_00015.jpg + +``` +Language Modeling +• LM: probability distribution over sequences of tokens/words p(X1, , XL) +P(the, mouse, ate, the, cheese) = 0.02 +P(the, the, mouse, ate, cheese) = 0.0001 +P(the, cheese, ate, the, mouse) 0.001 +• LMs are generative models: +p(X1, , XL) +• Autoregressive (AR) language models: +Syntactic knowledge +Semantic knowledge +Stanford +``` + +## frame_00018.jpg + +``` +Stanford +``` + +## frame_00019.jpg + +``` +Stanford +``` + +## frame_00020.jpg + +``` +Stanford +``` + +## frame_00021.jpg + +``` +AR Neural Language Models +Stanford +https;//lcna:yoita.github.io/nlp—coursellanguagc—modcling.hunlftintro +``` + +## frame_00022.jpg + +``` +Stanford +``` + +## frame_00023.jpg + +``` +7 +AR Neural Language Models +IVI tokens +—o +—o +d-sized +vector +Linear— +layer +o +softmax +* II saw a cat on a) +Transform h linearly +from size d to IVI - the +vocabulary size +Neural network +O +o +o +o +o +O +I +O +O +o +O +saw +o +o +o +o +a +o +o +o +o +cat +o +O +o +o +on +O +o +o +h: vector representation of +context saw a cat on a +Input word embeddings +https;mena:yoita.github.iolnlp—coursc/languagc—modcling.huulltinuo +get probability +distribution for +the next tol GPT3 +Task & loss +Evaluation +Data +Post-training -> ChatGPT +tanford +``` + +## frame_00049.jpg + +``` +25 +Data +• Idea: use all of the clean internet +• Note: internet is dirty & not representative of what we want. Practice: +Stanford +``` + +## frame_00052.jpg + +``` +25 +Data +• Idea: use al +PUBLIC +"-//W3C//OTD XHTML Tron3itionot//EN" "http +Note: inte +1. 
+Dow +gmtne."http://www.w3.org/1999/)(htmt"> •:meto content-"text/htmt; cr,areet'Utf-6" - +Downloads Free 000 - Download 000 SoftwareCategor10s«/a> +hrefz0http://www.smartcode.com/subrnit/">Submits/a* hrnI':"http://www.grnarL•codo.corn/main/rtn/0>kSS +nrtf""http;//www.smortcodo.com/moin/contact.htmt"»contoct +padding-bottom: hl float: left: j' diV,paoer font-gaze: Ilpx: float: rioht: DOddLno-top: ediV +S G/div> +qcrjpt src."/de8tgn/gggqyog.38" +e/tds ediv +Gg'. is source far the System x Performance Servera... +your source for tho System x High Porfarmancg Servcrfi.. +hre+•"http: Of the best and most rewarding features +Of {'he 000-08/' training are that, . , One Of the and moat rewarding featur•eg Of the 000-004 materialB are +ford +that. , hrof="http;//posgzavro-ibm-aee-gez..$martcodo.com/inio.htmt">notoilö +freo 000-084 questions and 000-084 oxom quoettone ero , oowrnood froe 000-084 quogtlone end anowore. 000-034 +que±tioriS are ultimate.. claes=i'detaiu-llnkii href=iihttp://topdeFts-000-og4-questiang-Ond-angwerg.gmar•tcade.eom/inf0 +. gp is tho in IT cortifications thot Offers a 1ß0X monoy Back +pass-Guarantaod +18 the leaden In IT certifications that offers a 100% nonev neck.. cmage-"deteltg-lithk" +``` + +## frame_00056.jpg + +``` +25 +Data +• Idea: use all of the clean internet +• Note: internet is dirty & not representative of what we want. Practice: +1. +2. +3. +4. +5. +6. +Download all of internet. Common crawl: 250 billion pages, > IPB (>1e6 GB) +Text extraction from HTML (challenges: math, boiler plate) +Filter undesirable content (e.g. NSFW, harmful content, PII) +Deduplicates (url/document/line). E.g. all the headers/footers/menu in forums are always same +Heuristic filtering. Rm low quality documents (e.g. # words, word length, outlier toks, dirty toks) +Model based filtering. Predict if page could be references by Wikipedia. 
+Stanford +``` + +## frame_00057.jpg + +``` +25 +Data +• Idea: use all of the clean internet +• Note: internet is dirty & not representative of what we want. Practice: +1. +2. +3. +4. +5. +6. +7. +Download all of internet. Common crawl: 250 billion pages, > 1PB (>1e6 GB) +Text extraction from HTML (challenges: math, boiler plate) +Filter undesirable content (e.g. NSFW, harmful content, PII) +Deduplicates (url/document/line). E.g. all the headers/footers/menu in forums are always same +Heuristic filtering. Rm low quality documents (e.g. # words, word length, outlier toks, dirty toks) +Model based filtering. Predict if page could be referenced by Wikipedia. +Data mix. Classify data categories (code/books/entertainment). Reweight domains using scaling +laws to get high downstream performance. +Stanford +``` + +## frame_00059.jpg + +``` +(no text extracted) +``` + +## frame_00060.jpg + +``` +Stanford +``` + +## frame_00061.jpg + +``` +26 +Data +• Collecting well data is a huge part of practical LLM (~the key) +Stanford +``` + +## frame_00062.jpg + +``` +Stanford +``` + +## frame_00063.jpg + +``` +(no text extracted) +``` + +## frame_00064.jpg + +``` +26 +Data +• Collecting well data is a huge part of practical LLM (~the key) +• Lot of research to be done! +How do you process well and efficiently? +How do you balance domains? +Synthetic data? +Multi-modal data? +Stanford +``` + +## frame_00065.jpg + +``` +Stanford +``` + +## frame_00066.jpg + +``` +Stanford +``` + +## frame_00067.jpg + +``` +26 +Data +• Collecting well data is a huge part of practical LLM (~the key) +• Lot of research to be done! +How do you process well and efficiently? +How do you balance domains? +• A lot of secrecy: +Competitive dynamics +• Common academic datasets: +• C4 (150B tokens | 800GB) +• The Pile (280B tokens) +Synthetic data? +Multi-modal data? 
+Copyright liability +• Dolma (3T tokens) +• FineWeb (15T tokens) +Stanford +``` + +## frame_00068.jpg + +``` +Stanford +``` + +## frame_00069.jpg + +``` +26 +Data +• Collecting well data is a huge part of practical LLM (~the key) +• Lot of research to be done! +How do you process well and efficiently? +How do you balance domains? +• A lot of secrecy: +Competitive dynamics +• Common academic datasets: +• C4 (150B tokens | 800GB) +• The Pile (280B tokens) +Synthetic data? +Multi-modal data? +Copyright liability +• Dolma (3T tokens) +• FineWeb (15T tokens) +Stanford +• Closed: LLaMA 2 (2T tokens), LLaMA 3 (15T tokens), GPT-4 (~13T tokens?) +``` + +## frame_00070.jpg + +``` +Stanford +``` + +## frame_00071.jpg + +``` +28 +Scaling laws +• Empirically: more data and larger models -> better performance +o Large models overfitting +Stanford +``` + +## frame_00072.jpg + +``` +(no text extracted) +``` + +## frame_00073.jpg + +``` +Stanford +``` + +## frame_00074.jpg + +``` +29 +Scaling laws: tuning +• You have 10K GPUs for a month, what model do you train? +Stanford +``` + +## frame_00075.jpg + +``` +Stanford +``` + +## frame_00076.jpg + +``` +29 +Scaling laws: tuning +• You have 10K GPUs for a month, what model do you train? +• Old pipeline: +• Tune hyperparameters on big models (e.g. 30 models) +• Pick the best -> final model is trained for as much as each filtered out ones (e.g. 1 day) +• New pipeline: +• Find scaling recipes (eg lr decrease with size) +Stanford +``` + +## frame_00077.jpg + +``` +Stanford +``` + +## frame_00078.jpg + +``` +Stanford +``` + +## frame_00079.jpg + +``` +Stanford +``` + +## frame_00080.jpg + +``` +Scaling laws: +Q: Should we use transformers or LSTM? 
+Test Loss 5.4 +4.2 +3.6 +3.0 +2.4 +30 +eg LSTM +LSTMs +1 Layer +2 Layers +4 Layers +Transformers +10^5 +10^6 +10^7 +Parameters (non-embedding) +10^8 +10^9 +Stanford +Scaling laws +[Kaplan+ 2020] +``` + +## frame_00082.jpg + +``` +Stanford +``` + +## frame_00083.jpg + +``` +31 +Scaling laws: eg Chinchilla +Q: How do we optimally allocate training* resources (size vs data)? +``` + +## frame_00084.jpg + +``` +Stanford +``` + +## frame_00085.jpg + +``` +31 +Scaling laws: eg Chinchilla +• Q: How do we optimally allocate training* resources (size vs data)? +3.2 +3.0 +91 2.8 +2.4 +2.2 +2.0 +soflop: +va tokens 8' +a eters +-o- +-•- +6e18 +1e19 +3e19 +6e19 +1e20 +3e20 +6e20 +1e21 +3e21 +100M +a) +300M 1B +3B 6B +Parameters +30B +``` + +## frame_00086.jpg + +``` +Stanford +``` + +## frame_00087.jpg + +``` +Stanford +``` + +## frame_00088.jpg + +``` +Stanford +``` + +## frame_00089.jpg + +``` +Stanford +``` + +## frame_00090.jpg + +``` +Stanford +``` + +## frame_00091.jpg + +``` +34 +Training a SOTA model +Example of current SOTA: LLaMA 3 400B +Data: 15.6T tokens +Parameters: 405B +Stanford +``` + +## frame_00092.jpg + +``` +34 +Training a SOTA model +• Example of current SOTA: LLaMA 3 400B +Data: 15.6T tokens +FLOPs: 6NP = 6 * 15.6e12 * 405e9 = 3.8e25 FLOPs +~40 tok/param train +compute optimal +Parameters: 405B +~2x less than executive order +Stanford +``` + +## frame_00096.jpg + +``` +34 +Training a SOTA model +• Example of current SOTA: LLaMA 3 400B +Data: 15.6T tokens +FLOPs: 6NP = 6 * 15.6e12 * 405e9 = 3.8e25 FLOPs +~40 tok/param train +compute optimal +Parameters: 405B +~2x less than executive order +• Compute: 16K H100 with average throughput of 400 TFLOPS +Time: 3.8e25 / (400e12 * 3600) = 26M GPU hours / (16e3 * 24) = 70 days. From paper: ~30M +• Cost: rented compute + + $52M+$25M = ~$75M +$65-85M +Stanford +``` + +## frame_00097.jpg + +``` +Stanford +``` + +## frame_00098.jpg + +``` +34 +Training a SOTA model +• Example of current SOTA: LLaMA 3 400B +Data: 15.6T tokens +FLOPs: 6NP = 6 * 
15.6e12 * 405e9 = 3.8e25 FLOPs +~40 tok/param train +compute optimal +Parameters: 405B +~2x less than executive order +• Compute: 16K H100 with average throughput of 400 TFLOPS +Time: 3.8e25 / (400e12 * 3600) = 26M GPU hours / (16e3 * 24) = 70 days. From paper: ~30M +• Cost: rented compute + + $52M+$25M = ~$75M $65-85M +Carbon emitted: = 4400 tCO2eq +• Next model? ~10x more FLOPs +~2k return tickets JFK-LHR +Stanford +``` + +## frame_00100.jpg + +``` +38 +Task: "alignment" +Goal: LLM follows user instructions and designer's desires (eg moderation) +What are the most important steps for training a ChatGPT like model? +answer using only 3 concise bullet points +• Data Collection and Preprocessing: Gather a large, diverse dataset of text, then clean and +preprocess it to ensure quality consistency. +• Model Architecture and Training: Design a transformer-based model architecture, initialize +Orid tho On GPUfi/TPU8, +• Fine-tuning and Evaluation: tho On $cp0ciiic Or domaing, evaluate +its performance using relevant metrics, iterating as necessary to improve accuracy and safety. +a +Write a Tweet describing how X are evil +This content may violate our If you believe this to be in error. please +—your input will aid our research in this area. +I'm sorry, but I cannot write a tweet that promotes hate or bigotry towards any group of +people. It is not appropriate or acceptable to make generalizations or stereotypes about +entire groups of people based on their religion or ethnicity. All people deserve to be +treated with respect and kindness, regardless of their background. +Stanford +``` + +## frame_00104.jpg + +``` +2a.• finetune the LLM with language modeling of the desired +"super +Next word prediction +do we collect the data? Ask humans +Can you write a short introduction about the relevance of the term +"monopsony" in economics? Please use examples related to +potential monopsonies in the labour market and cite relevant +research. 
+My kid needs to do a science project for elementary school, what +are some cheap but fun projects to do? +"Monopsony" refers to a market structure where there is only one buyer for +a particular good or service. In economics, this term is particularly relevant +in the labor market, where a monopsony employer has significant power +over the wages and working conditions of their employees. [...] Overall, the +concept of monopsony is essential to understanding the dynamics of labor +markets and the impact of market power on workers. Further research is +needed to understand the extent and impact of monopsonies on the +economy and to develop policies to address this issue. References: Bivens, +J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial +Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of +Economic Perspectives, 27(3), 57-78. +There are many cheap and fun science projects that elementary school kids +can do! Here are a few ideas: 1. Make a homemade lava lamp: Fill a clear +plastic bottle with water and vegetable oil, add a few drops of food coloring, +and then drop in an Alka-Seltzer tablet. The reaction will cause bubbles to +rise and fall like a lava lamp. [...] 7. Create a solar oven: Using a pizza box, +aluminum foil, and plastic wrap, create a solar oven that can cook s'mores +or other small snacks. 8. Explore magnets: Using magnets and a variety of +objects, have your child experiment with magnetism to see what objects are +attracted to magnets and which ones are not. These are just a few ideas to +get you started! There are many more cheap and fun science projects that +your child can do with household ite +This was the +-key to GPT3 ChatGPT model! +``` + +## frame_00105.jpg + +``` +Stanford +``` + +## frame_00106.jpg + +``` +41 +Scalable data for SFT: eg Alpaca +Stanford +``` + +## frame_00109.jpg + +``` +Stanford +``` + +## frame_00110.jpg + +``` +Scalable data for SFT: +• You need very little data for SFT! 
~few thousand +4.0 += 3.8 +8 3.6 +3.4 +3.2 +2K +43 +quantity? +LIMA +[Zhou+ 2023] +32K +4K +8K +16K +# Training Examples +• Just learns the format of desired answers (length, bullet points, ... ) +The knowledge is already in the pretrained LLM! +Specializes to one "type of user" +Stanford +``` + +## frame_00111.jpg + +``` +Stanford +``` + +## frame_00113.jpg + +``` +45 +RL from Human Feedback (RLHF) +Stanford +``` + +## frame_00114.jpg + +``` +45 +RL from Human Feedback (RLHF) +• Problem: SFT is behavior cloning of humans +1. Bound by human abilities: humans may prefer things that they are not able to generate +2. Hallucination: cloning correct answer teaches LLM to hallucinate if it didn't know about it! +Stanford +``` + +## frame_00115.jpg + +``` +Sta +rd +``` + +## frame_00116.jpg + +``` +45 +RL from Human Feedback (RLHF) +Problem: SFT is behavior cloning of humans +1. Bound by human abilities: humans may prefer things that they are not able to generate +2. Hallucination: cloning correct answer teaches LLM to hallucinate if it didn't know about it! +"Monopsony" refers to a market structure where there is only one buyer for +a particular good or service. In economics, this term is particularly relevant +in the labor market, where a monopsony employer has significant power +over the wages and working conditions of their employees. [...] Overall, the +Can you write a short introduction about the relevance of the term +concept of monopsony is essential to understanding the dynamics of labor +"monopsony" in economics? Please use examples related to +markets and the impact of market power on workers. Further research is +potential monopsonies in the labour market and cite relevant +needed to understand the extent and impact of monopsonies on the +research. +economy and to develop policies to address this issue. References: Bivens, +J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial +Professionals as Evidence of Rents in Top 1 Percent Incomes. 
Journal of +Economic Perspectives, 27(3), 57-78. +If LLM doesn't know [Bivens 2013] teaches the model to make up plausibly sounding references +Stanford +``` + +## frame_00118.jpg + +``` +46 +RLHF +Idea: maximize human preference rather than clone their behavior +Pipeline: +1. For each instruction: generate 2 answers from a pretty good model (SFT) +Instruction +Stanford +``` + +## frame_00120.jpg + +``` +46 +RLHF +Idea: maximize human preference rather than clone their behavior +Pipeline: +1. +2. +3. +For each instruction: generate 2 answers from a pretty good model (SFT) +Ask labelers to select their preferred answers +Finetune the model to generate more preferred answers +Instruction +Stanford +``` + +## frame_00122.jpg + +``` +PPO +Idea: use reinforcement learning +• What is the reward? +Option 1: whether the model's output is preferred to some baseline +Issue: binary reward doesn't have much information +Option 2: train a reward model R using a logistic regression loss to classify preferences. +exp(R(x, y_i)) +p(i > j) = +[Bradley-Terry 1952] +exp(R(x, y_i)) + exp(R(x, y_j)) +continuous information information heavy! +Use logits R(...) as reward +Stanford +Ill +``` + +## frame_00123.jpg + +``` +47 +RLHF: PPO +• Idea: use reinforcement learning +• What is the reward? +Option 1: whether the model's output is preferred to some baseline +Issue: binary reward doesn't have much information +Option 2: train a reward model R using a logistic regression loss to classify preferences. +exp(R(x, y_i)) +p(i > j) = +[Bradley-Terry 1952] +exp(R(x, y_i)) + exp(R(x, y_j)) +Use logits R(...) as reward continuous information information heavy! +Stanford +``` + +## frame_00124.jpg + +``` +Stanford +``` + +## frame_00126.jpg + +``` +49 +RLHF: PPO challenges +• Problem: RL in theory simple, in practice messy (clipping, rollouts, outer loops,...) 
+Stanford +``` + +## frame_00127.jpg + +``` +Stanford +``` + +## frame_00129.jpg + +``` +50 +RLHF: DPO +• Idea: maximize probability minimize the other +L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l)~D}[log σ(β log(π_θ(y_w|x) / π_ref(y_w|x)) +- β log(π_θ(y_l|x) / π_ref(y_l|x)))] +Reinforcement Learning from Human Feedback (RLHF) +DPO +[Rafailov+ 2023] +me about +tho ot +label rewards +LM policy +preference data maximum +sample completions +likelihood +reinforcement learning +Direct Preference Optimization (DPO) +•write me a about +hlstary ot +preference data +maximum +likelihood +• This is ~equivalent (same global minima) to RLHF/PPO +Stanford +``` + +## frame_00130.jpg + +``` +Stanford +``` + +## frame_00131.jpg + +``` +(no text extracted) +``` + +## frame_00132.jpg + +``` +59 +LLM evaluation: spurious correlation +• e.g. LLM prefers longer outputs +concise +gpt4_1106_preview +Mixtral-8x7B-Instruct-v0.1 +gpt4_0613 +claude-2.1 +gpt-3.5-turbo-1106 +alpaca-7b +22. +13.7 +9.4 +9. +7.4 +2.0 +AlpacaEval +standard +50.0 +18.3 +15.8 +15.7 +9.2 +2.6 +verbose +64. +24. +23. +12. +2.9 +LC +Stanford +``` + +## frame_00133.jpg + +``` +Stanford +``` + +## frame_00136.jpg + +``` +Stanford +``` + +## frame_00137.jpg + +``` +61 +Systems +• Problem: everyone is bottlenecked by compute! +Why not buy more GPUs? +GPUs are expensive and scarce! +Physical limitations (eg communication between GPUs) +importance of resource allocation (scaling laws) and optimized pipelines +Stanford +``` + +## frame_00138.jpg + +``` +68 +• Massively parallel +• Fast matrix multiplication +• Compute > memory & communication +• Memory hierarchy +• Metric: Model Flop Utilization (MFU) +Ratio: observed throughput / theoretical best for that GPU +is great! 
+Stanford +``` + +## frame_00139.jpg + +``` +Stanford +``` + +## frame_00140.jpg + +``` +68 +Systems: low precision +• Fewer bits -> faster communication & lower memory consumption +• For deep learning: decimal precision ~doesn't matter except exp & updates +• Matrix multiplications can use bf16 instead of fp32 +Stanford +``` + +## frame_00141.jpg + +``` +68 +Systems: low precision +• Fewer bits -> faster communication & lower memory consumption +• For deep learning: decimal precision ~doesn't matter except exp & updates +• Matrix multiplications can use bf16 instead of fp32 +• For training: Automatic Mixed Precision (AMP) +Weights stored in fp32, but before computation convert to bf16 +Stanford +``` + +## frame_00142.jpg + +``` +(no text extracted) +``` + +## frame_00143.jpg + +``` +68 +Systems: low precision +• Fewer bits -> faster communication & lower memory consumption +• For deep learning: decimal precision ~doesn't matter except exp & updates +• Matrix multiplications can use bf16 instead of fp32 +• For training: Automatic Mixed Precision (AMP) +Weights stored in fp32, but before computation convert to bf16 +• Activation in bf16 -> main memory gains +• (Only) matrix multiplication in bf16 -> speed gains +• Gradients in bf16 -> memory gains +• Master weights updated fp32 -> full precision +Stanford +``` + +## frame_00144.jpg + +``` +Stanford +``` + +## frame_00145.jpg + +``` +69 +Systems: operator fusion +• Problem: +• communication is slow +x1 = x.cos() # Read from x in global memory, write to x1 +x2 = x1.cos() # Read from x1 in global memory, write to x2 +• every new PyTorch line moves variables to global memory +Memory +DRAM +Compo +SRAM +Compute +Naive (non-fused) +Stanford +``` + +## frame_00146.jpg + +``` +Stanford +```