URI: 
       [HN Gopher] A GPT in 60 Lines of NumPy
       ___________________________________________________________________
        
       A GPT in 60 Lines of NumPy
        
       Author : squidhunter
       Score  : 721 points
       Date   : 2023-02-09 16:08 UTC (6 hours ago)
        
  HTML web link (jaykmody.com)
  TEXT w3m dump (jaykmody.com)
        
       | simonw wrote:
       | This article is an absolutely fantastic introduction to GPT
       | models - I think the clearest I've seen anywhere, at least for
       | the first section that talks about generating text and sampling.
       | 
       | Then it got to the training section, which starts "We train a GPT
       | like any other neural network, using gradient descent with
       | respect to some loss function".
       | 
       | It's still good from that point on, but it's not as valuable as a
       | beginner's introduction.
        
         | tysam_and wrote:
         | That concept is not the easiest to describe succinctly inside a
          | file like this (or, while we're at it, in a
         | Hacker News post like this!), I think (especially as there are
         | various levels of 'beginner' to take into account here). This
         | is considered a very entry level concept (not as an insult --
         | simply from an information categorization/tagging perspective
         | here :D :)), and I think there might be others who would
         | consider it to be noise if logged in the code or described in
         | the comments/blogpost.
         | 
         | After all, there was a disclaimer that you might have missed up
         | front in the blogpost! "This post assumes familiarity with
         | Python, NumPy, and some basic experience training neural
          | networks." So it is in there! But in the firehose of info we
          | all get, it's easy to miss.
         | 
         | However, I'm here to help! Thankfully the concept is not too
         | terribly difficult, I believe.
         | 
         | Effectively, the loss function compresses the task we've
         | described with our labels from our training dataset into our
         | neural network. This includes (ideally, at least), 'all' the
         | information the neural network needs to perform that task well,
         | according to the data we have, at least. If you'd like to know
         | more about the specifics of this, I'd refer you to the original
         | Shannon-Weaver paper on information theory -- Weaver's
         | introduction to the topic is in plain English and accessible to
         | (I believe) nearly anyone off of the street with enough time
         | and energy to think through and parse some of the concepts.
         | Very good stuff! An initial read-through should take no more
         | than half an hour to an hour or so, and should change the way
         | you think about the world if you've not been introduced to the
         | topic before. You can read a scan of the book at a university
         | hosted link here: https://raley.english.ucsb.edu/wp-
         | content/Engl800/Shannon-We...
         | 
         | Using some of the concepts of Shannon's theory, we can see that
         | anything that minimizes an information-theoretic loss function
          | should also learn the prerequisites to the task at
         | hand (features that identify xyz, features that move
         | information about xyz from place A to B in the neural network,
         | etc). In this case, even though it appears we do not have
         | labels -- we certainly do! We are training on predicting the
          | _next words_ in a sequence, and as a consequence humans
         | have already created a very, _very_ richly labeled dataset for
         | free! In this way, getting the data is much easier and the bar
         | to entry for high performance for a neural network is very low
         | -- especially if we want to pivot and 'fine-tune' to other
         | tasks. This is because...to learn the task of predicting the
         | next word, we have to learn tons of other sub-tasks inside of
         | the neural network which overlap with the tasks that we want to
         | perform. And because of the nature of spoken/written language
         | -- to truly perform incredibly well, sometimes we have to learn
         | all of these alternative tasks well enough that little-to-no-
         | finetuning on human-labeled data for this 'secondary' task (for
         | example, question answering) is required! Very cool stuff.
         | 
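          | To make that "labels for free" point concrete, here is a rough
          | sketch in NumPy (made-up token ids and vocab size, nothing from
          | the post itself) of how next-word prediction turns a plain
          | sequence into (input, label) pairs and a cross-entropy loss:
          | 
          |     import numpy as np
          | 
          |     tokens = np.array([15, 8, 42, 7, 3])      # toy token ids
          |     inputs, labels = tokens[:-1], tokens[1:]  # labels = shift by one
          | 
          |     # pretend the model spread its next-token probabilities
          |     # uniformly over a vocab of 50
          |     probs = np.full((len(inputs), 50), 1.0 / 50)
          | 
          |     # cross-entropy: -log(prob assigned to the correct next token)
          |     loss = -np.log(probs[np.arange(len(inputs)), labels]).mean()
          |     print(loss)  # ~3.91 == log(50), what an untrained model scores
          | 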
          | This is a very rough introduction; I have not condensed it as
          | much as it could be, and certainly there are more words than
          | there should be. But it's an internet comment so this is
         | probably the most I should put into it for now. I hope this
         | helps set you forward a bit on your journey of neural network
         | explanation! :D :D <3 <3 :)))))))))) :fireworks:
         | 
         | For reference, I'm interested very much in what I refer to as
         | Kolmogorov-minimal explanations (Wikipedia 'Kolmogorov
         | complexity' once you chew through some of that paper if you're
         | interested! I am still very much a student of it, but it is a
         | fun explanation). In fact (though this repo performs several
         | functions), I made https://github.com/tysam-code/hlb-CIFAR10 as
         | beginner-friendly as possible. One does have to make some
         | decisions to keep verbosity down, and I assume a very basic
          | understanding of what's happening in neural networks here too.
         | 
         | I have yet to find a good go-to explanation of neural networks
         | as a conceptual intro (I started with Hinton -- love the man
          | but extremely mathematically technical as a foundation! D:).
         | Karpathy might have a really good one, I think I saw a zero-to-
         | hero course from him a little while back that seemed really
         | good.
         | 
         | Andrej (practically) got me into deep learning via some of his
         | earlier work, and I really love basically everything that I've
         | seen the man put out. I skimmed the first video of his from
         | this series and it seems pretty darn good, I trust his content.
         | You should take a look! (Github and first video:
         | https://github.com/karpathy/nn-zero-to-hero,
         | https://youtu.be/VMj-3S1tku0)
         | 
         | For reference, he is the person that's made a lot of cool
         | things recently, including his own minimal GPT
         | (https://github.com/karpathy/minGPT), and the much smaller
         | version of it (https://github.com/karpathy/nanoGPT). But of
         | course, since we are in this blog post I would refer you to
         | this 60 line numpy GPT first (A. to keep us on track, B.
         | because I skimmed it and it seemed very helpful! I'd recommend
         | taking a look at outside sources if you're feeling particularly
         | voracious in expanding your knowledge here.)
         | 
         | I hope this helps give you a solid introduction to the basics
         | of this concept, and/or for anyone else reading this, feel free
         | to let me know if you have any technically (or-otherwise)
         | appropriate questions here, many thanks and much love! <3 <3 <3
         | <3 :DDDDDDDD :)))))))) :)))) :))))
        
         | aulin wrote:
         | there is so much material on deep learning basics these days
         | that I think we can finally skip reintroducing gradient descent
         | in every tutorial, can't we?
        
           | CamelCaseName wrote:
           | Any favorites you can share?
        
           | Joker_vD wrote:
           | The idea of "find in which direction function decreases most
           | quickly and go that direction" is really deep, and its
           | implementation via this cutting-edge mathematical concept of
           | "gradient" also deserves a whole section as well.
        
             | sillysaurusx wrote:
             | Look no further:
             | https://jax.readthedocs.io/en/latest/autodidax.html
             | 
             | "Autodidax: JAX core from scratch" walks you through it in
             | detail.
        
             | blagie wrote:
             | It's both really shallow and really deep.
             | 
             | On one hand, you can explain it to a 5-year-old: Go in the
             | direction which improves things.
             | 
             | On the other hand, we have more than a half-century of
             | research on sophisticated mathematical methods for doing it
             | well.
             | 
              | The latter isn't really helpful for beginners, and the
              | former is easy to explain. Beginners can't make use of
              | the sophisticated algorithms anyway, so you can go with
              | something as dumb as: tweak in every direction, and go
              | where it improves most. It will work fine for toy
              | examples.
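              | 
              | A throwaway sketch of that dumb-but-workable version
              | (numerical gradient on a toy quadratic, made-up numbers):
              | 
              |     import numpy as np
              | 
              |     def f(w):  # toy loss, minimum at w = (3, -2)
              |         return (w[0] - 3)**2 + (w[1] + 2)**2
              | 
              |     w, lr, eps = np.zeros(2), 0.1, 1e-5
              |     for _ in range(100):
              |         # "tweak in every direction" to estimate the gradient
              |         g = np.array([(f(w + eps * np.eye(2)[i]) - f(w)) / eps
              |                       for i in range(2)])
              |         w -= lr * g  # step where the loss drops fastest
              |     print(w)  # ~[3, -2]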
        
         | ianbutler wrote:
         | I think FastAI lesson 3 in "Practical Deep Learning for
         | Coders", has one of the most intuitive buildups of gradient
         | descent and loss that I've seen. * Lecture [1] Book Chapter [2]
         | 
         | It doesn't go into the math but I don't think that's a bad
         | thing for beginners.
         | 
         | If you want mathematical, 3blue1brown has a great series of
         | videos [3] on the topic.
         | 
         | [1] https://www.youtube.com/watch?v=hBBOjCiFcuo&t=1932s
         | 
         | [2]
         | https://github.com/fastai/fastbook/blob/master/04_mnist_basi...
         | 
         | [3] https://www.youtube.com/watch?v=aircAruvnKk
         | 
         | * I've been messing around with this stuff since 2016 and have
         | done a few different courses like the original Andrew Ng course
         | and more.
        
           | matsemann wrote:
           | I just did the 4th chapter of the book today
           | (04_mnist_basic). Very educational.
        
         | ly3xqhl8g9 wrote:
         | For those curious about writing a "gradient descent with
         | respect to some loss function" starting from an empty .py file
         | (and a numpy import, sure), can't recommend enough Harrison
         | "sentdex" Kinsley's videos/book _Neural Networks from Scratch
         | in Python_ [1].
         | 
         | [1]
         | https://youtu.be/Wo5dMEP_BbI?list=PLQVvvaa0QuDcjD5BAw2DxE6OF...
         | https://nnfs.io
        
         | RickHull wrote:
         | Here is an introduction to gradient descent with back
         | propagation, for Ruby, based on Andrej Karpathy's micrograd:
         | https://github.com/rickhull/backprop
        
       | sva_ wrote:
       | Impressive, but only forward pass.
        
         | thwayunion wrote:
         | It's an excellent learning tool :) Doing the backward pass in
         | the same style would be a great tool for teaching.
        
         | anigbrowl wrote:
         | I think the completeness and self-contained-ness more than
         | offsets the limited scope. One of the problems in the ML field
         | is rapidly multiplying logistical complexity, and I appreciate
         | an example that is (somewhat) functional but simple enough to
         | fit on a postcard and using very basic components.
        
         | time_to_smile wrote:
          | just replace the numpy code with jax.numpy and you should have a
         | fully differentiable model ready for training!
        
           | pumanoir wrote:
            | For someone not familiar with jax: if I do the suggested
            | replacement, what'd be the little extra code to make it do
           | the backward pass? Or is it all automatic and we literally
           | would not need extra lines of code?
        
             | time_to_smile wrote:
             | Backprop is just an implementation detail when doing
             | automatic differentiation, basically setting up how you
             | would apply the chain rule to your problem.
             | 
             | JAX is able to differentiate arbitrary python code (so long
             | as it uses JAX for the numeric stuff) automatically so the
             | backprop is abstracted away.
             | 
             | If you have the forward model written, to train it all you
              | have to do is wrap it in whatever loss function you want,
              | and then use JAX's `grad` with respect to the model
              | parameters, and you can use that to find the optimum using
             | your favorite gradient optimization algorithm.
             | 
             | This is why JAX is so awesome. Differentiable programming
             | means you only have to think about problems in terms of the
             | forward pass and then you can trivially get the derivative
             | of that function without having to worry about the
             | implementation details.
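              | 
              | A minimal sketch of that workflow (a toy linear "model"
              | here, not the post's gpt function):
              | 
              |     import jax
              |     import jax.numpy as jnp
              | 
              |     def forward(params, x):  # forward pass only
              |         return x @ params["w"] + params["b"]
              | 
              |     def loss_fn(params, x, y):  # wrap it in a loss
              |         return jnp.mean((forward(params, x) - y) ** 2)
              | 
              |     params = {"w": jnp.ones(3), "b": 0.0}
              |     x, y = jnp.ones((8, 3)), jnp.zeros(8)
              |     grads = jax.grad(loss_fn)(params, x, y)  # backprop, done
              |     params = jax.tree_util.tree_map(
              |         lambda p, g: p - 0.1 * g, params, grads)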
        
               | matsemann wrote:
               | I haven't heard about JAX before, but been tinkering in
               | pytorch. Would I also be able to switch the use of np
                | arrays here to torch, and then do .backward() and get
               | kinda the same benefits of JAX, or how does it differ in
               | this regard?
        
       | insane_dreamer wrote:
       | nice and clear. a worthy contribution to the subject.
        
       | ultrasounder wrote:
       | I also learnt a ton from NLPDemystified-
       | https://www.nlpdemystified.org. In fact I used this resource
       | first before attempting Andrej Karpathy's
       | https://karpathy.ai/zero-to-hero.html. I find Nitin's voice
       | soothing and am able to focus more. I also found the pacing good
        | and the course introduces a lot of concepts at a beginner level
        | and also points to appropriate resources along the way (spaCy, for
       | instance). Overall an exciting time to be a total beginner
       | looking to grok NLP concepts.
        
       | master_yoda_1 wrote:
       | this article DOES NOT describe GPT3 and this article title is
       | misleading, and the author is lying.
        
         | moyix wrote:
         | The title doesn't say GPT3, it says GPT (unless it's been
         | edited since you posted this?).
        
       | voz_ wrote:
       | Wonderfully written, I love the amount of detail put into the
       | diagrams. Would love breakdowns like this for more stuff :)
        
       | jaykmody wrote:
        | Hey y'all, author here!
       | 
       | Thank you for all the nice and constructive comments!
       | 
       | For clarity, this is ONLY the forward pass of the model. There's
       | no training code, batching, kv cache for efficiency, GPU support,
       | etc ...
       | 
       | The goal here was to provide a simple yet complete technical
       | introduction to the GPT as an educational tool. Tried to make the
       | first two sections something any programmer can understand, but
       | yeah, beyond that you're gonna need to know some deep learning.
       | 
       | Btw, I tried to make the implementation as hackable as possible.
       | For example, if you change the import from `import numpy as np`
       | to `import jax.numpy as np`, the code becomes end-to-end
        | differentiable:
        | 
        |     def lm_loss(params, inputs, n_head) -> float:
        |         x, y = inputs[:-1], inputs[1:]
        |         output = gpt(x, **params, n_head=n_head)
        |         loss = np.mean(-np.log(output[y]))
        |         return loss
        | 
        |     grads = jax.grad(lm_loss)(params, inputs, n_head)
       | 
       | You can even support batching with `jax.vmap` (https://jax.readth
       | edocs.io/en/latest/_autosummary/jax.vmap.h...):
        | 
        |     gpt2_batched = jax.vmap(gpt2, in_axes=0)
        |     # [batch, seq_len] -> [batch, seq_len, vocab]
        |     gpt2_batched(batched_inputs)
       | 
       | Of course, with JAX comes in-built GPU and even TPU support!
       | 
        | As for training code and a KV cache for inference efficiency, I
        | leave that as an exercise for the reader lol
        
         | moconnor wrote:
         | This is beautiful. Having worked with everything from nanoGPT
         | to Megatron, sitting down and reading through picoGPT.py was
         | clear and refreshing with just the essential details. Nothing
         | left to add, nothing left to take away: perfection.
        
         | eslaught wrote:
         | > GPU support
         | 
         | If you haven't tried cuNumeric [1], you really ought to. It's a
         | drop-in NumPy wrapper for distributed GPU acceleration. Would
         | be interesting to see if it works for this.
         | 
         | [1]: https://github.com/nv-legate/cunumeric
        
           | VHRanger wrote:
           | The problem with drop-in replacements between CPU and GPU
            | code is that performant GPU code often requires rethinking
            | the dataflow -- so even if the code itself is a drop-in,
           | the "make it good" part still requires some rewriting.
           | 
           | I'd be curious how that library compares to other numeric
           | python GPU libraries
        
         | tysam_and wrote:
         | "hackable" and "simple yet complete technical introduction"
         | 
         | Music to my ears, well done and don't worry too much about the
         | negative comments! They'll come out for anything you do I
         | think.
         | 
         | I saw a tweet from someone the other day talking about how they
         | massively increased their training speed by changing part of
          | their architecture to have dimensions that were a multiple of 64
         | rather than a prime-like kind of number.
         | 
         | One of the comments below it? ~"Seems very architecture
         | specific."
         | 
         | lol.
         | 
         | So don't sweat it! <3 Great work and thanks for putting
         | yourself out there, super job! :D :D :D :D :)))))) <3 :D :D
         | :fireworks:
        
       | master_yoda_1 wrote:
       | [flagged]
        
         | master_yoda_1 wrote:
          | Also, for curious minds, here is a more authentic tutorial on
          | building GPT (the precursor to GPT-3) by Andrej Karpathy:
          | https://www.youtube.com/watch?v=kCc8FmEb1nY My point is: if you
          | want to spend time, spend it on authentic material, not some
          | bogus material.
        
         | enlyth wrote:
         | That's a bit overly harsh and missing the point of the article
        
         | LukeB42 wrote:
         | Terrible joke and COMPLETELY missing the point.
        
         | kurisufag wrote:
         | the LoC count has nothing to do with the amount of processing
         | required for training -- if anything they're probably inversely
         | related.
         | 
         | the title (correctly) describes a small LoC count, which makes
         | no assertions about the other things you mentioned.
        
           | master_yoda_1 wrote:
           | the article DOES NOT describe GPT3 and this article title is
           | misleading, and the author is lying.
        
             | kurisufag wrote:
             | gpt3 =/= gpt
             | 
             | https://cdn.openai.com/research-covers/language-
             | unsupervised...
        
             | simonw wrote:
             | What's the lie? The article title is "GPT in 60 lines of
             | NumPy".
             | 
             | I'd agree with you if it said "GPT3 in 60 lines of NumPy" -
             | but it doesn't say that.
        
             | joelfried wrote:
             | Anger... fear... aggression. The dark side are they. Easily
             | they flow, quick to join you in a fight. If once you start
             | down the dark path, forever will it dominate your destiny,
             | consume you it will, as it did Obi-Wan's apprentice. --
             | Master Yoda, Return of the Jedi
        
         | master_yoda_1 wrote:
          | I would rather use the PyTorch API than this bogus code.
         | https://pytorch.org/docs/stable/generated/torch.nn.Transform...
        
           | LukeB42 wrote:
           | Is this because you don't know how to implement the weight
           | updates in NumPy yourself?
        
         | amelius wrote:
         | Even GPT3 has better humor.
        
       | eddsh1994 wrote:
       | Why do people in ML put imports inside function definitions?
        
         | jaykmody wrote:
          | Author here. It's a design choice, but there are two reasons I
         | chose to use imports like this:
         | 
         | 1) For demonstrative purposes. The title of the post is `A GPT
          | in 60 Lines of NumPy`, so I kinda wanted to show "hey it's just
         | numpy, nothing to be scared about!". Also if an import is ONLY
         | used in a single function, I find it visually helps show that
         | "hey, this import is only used in this function" vs when it's
         | at the top of the file you're not really sure when/where and
         | how many times an import is used.
         | 
         | 2) Scoping. `load_encoder_hparams_and_params` imports
         | tensorflow, which is really slow to import. When I was testing,
         | I used randomly initialized weights instead of loading the
          | checkpoint (which is slower), so I was only making use of the
         | `gpt2` function. If I kept the import at the top level, it
         | would've slowed things down unnecessarily.
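          | 
          | Roughly the shape of it (illustrative function names, not the
          | repo's exact ones):
          | 
          |     def load_weights(checkpoint_path):
          |         import tensorflow as tf  # slow import, only paid here
          |         return tf.train.load_checkpoint(checkpoint_path)
          | 
          |     def run_with_random_weights(n_embd=768):
          |         import numpy as np  # cheap; never touches tensorflow
          |         return np.random.randn(n_embd)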
        
         | [deleted]
        
         | codethief wrote:
         | Another reason (besides the ones already mentioned in the other
         | comments) is that some imports might only be available on
         | certain operating systems or architectures. I once wrote
         | heavily optimized ML code for Nvidia Jetson Nano devices but I
         | still wanted to be able to test the overall application (the
         | non-Nvidia-specific code) on my laptop or in pipelines.
        
         | moyix wrote:
         | One reason is that some ML libraries are really slow to import,
         | so you don't want to put them at top-level unless you
         | definitely need them. E.g. if I had just one function that
         | needed to use a tokenizer from the Transformers library, I
         | wouldn't want to eat a 2 second startup cost every time:
          | 
          |     In [1]: %time import transformers
          |     CPU times: user 3.21 s, sys: 7.8 s, total: 11 s
          |     Wall time: 1.91 s
        
           | eddsh1994 wrote:
           | I didn't think about lazy loading, I also didn't know they
           | were scoped differently! I thought it was some sort of
           | organisation to keep imports close to usage. Thanks!
        
             | int_19h wrote:
             | The scoping also has some performance advantage: locals are
             | accessed by index in the bytecode, with all name resolution
             | happening at compile-time, but globals require a string
             | lookup in the module dictionary every time they're
             | accessed.
             | 
             | This isn't something that should matter even a little in
             | typical ML code. But in generic Python libraries, there are
             | cases when this kind of micro-optimization can help.
             | Similar tricks include turning methods into pre-bound
             | attributes in __init__ to skip all the descriptor machinery
             | on every call.
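              | 
              | A tiny sketch of that last trick (made-up class, just to
              | show the shape of it):
              | 
              |     class Pipeline:
              |         def __init__(self, items):
              |             self.items = items
              |             # bind once: later lookups hit the instance
              |             # dict, skipping the descriptor machinery
              |             self._clean = self.clean
              | 
              |         def clean(self, item):
              |             return item.strip().lower()
              | 
              |         def run(self):
              |             clean = self._clean  # local alias: index lookup
              |             return [clean(i) for i in self.items]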
        
             | cuteboy19 wrote:
             | They are also scoped differently, but the lazy loading is
             | the key thing here
        
         | z3t4 wrote:
          | Why not? It helps limit variable scope. The advantage of a
         | global variable without the disadvantages.
        
           | cuteboy19 wrote:
            | Why not: it's possible to get an ImportError if you make a
            | mistake in the import statement. This kind of error should
            | happen as early as possible, and you won't expect it to
            | happen during a random function call.
        
             | kelnos wrote:
             | If you're writing in a dynamically-typed, interpreted
             | language like python, I think mistyping an import inside a
             | function is really the least of your concerns when it comes
             | to mistyping things.
        
         | junon wrote:
         | Lazy loading, avoiding pollution of symbols in the root scope,
         | avoiding re-exports of symbols in the root scope, self-
         | documenting code ("this function uses these libraries"),
         | portable coding (sometimes desirable), etc.
        
         | time_to_smile wrote:
         | I'm in ML and I would also like an answer to this question.
         | 
         | I've seen a lot of Python people sprinkle imports all over the
         | place in their code. I suspect this is a bad habit learned from
         | too much time working in notebooks where you often have an "oh
         | right, I need XXX library now" and just import it as you need
         | it.
         | 
         | The aggressive aliasing I do get since in DS/ML work it's very
         | common to have the same function do slightly different things
         | depending on the library (standard deviation between numpy and
         | pandas is a good example)
         | 
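          | (Concretely: NumPy defaults to the population standard
          | deviation, ddof=0, while pandas defaults to the sample standard
          | deviation, ddof=1.)
          | 
          |     import numpy as np
          |     import pandas as pd
          | 
          |     x = [1, 2, 3, 4]
          |     np.std(x)            # 1.118..., ddof=0 (population)
          |     pd.Series(x).std()   # 1.290..., ddof=1 (sample)
          |     np.std(x, ddof=1)    # 1.290..., matches pandas
          | 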
         | But I personally like all of my imports at the top so I know
         | what this code I'm about to read is going to be doing. I do
         | seem to be in the minority in this (and would be glad to be
          | corrected if I'm making some major error).
        
           | jwilber wrote:
           | Almost every tech company will have some sort of commit hook
           | using isort to force correct import ordering at the top of
           | the file.
        
             | claytonjy wrote:
             | Ha, if only! I've been the one to introduce this at the
             | last three jobs I've had, two of which had hundreds of
             | engineers and plenty of python code before I got there.
             | 
             | "Best practices" are incredibly unevenly distributed, and I
             | suspect this is only more true for data/ML-heavy python
             | code.
        
             | clawlor wrote:
             | New (v5) isort doesn't move imports to the top of the file
             | anymore, at least not by default. There is a flag to retain
             | the old behavior, but even then I don't think it will move
             | imports from, say, inside a function body to the top of the
             | module.
        
           | matsemann wrote:
           | I often end up having to inline imports, because python
           | doesn't support circular imports.
           | 
           | Of course, "don't do circular imports". But if my Orders
           | model has OrderLines, and my OrderLines points to their
           | Order, it's damn hard to avoid without putting everything in
            | one huge file.
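            | 
            | e.g. something like this (rough sketch, made-up module layout
            | and a placeholder for_order helper):
            | 
            |     # orders.py
            |     class Order:
            |         def lines(self):
            |             # deferred import: runs at call time, after both
            |             # modules have loaded, so no cycle at import time
            |             from order_lines import OrderLine
            |             return OrderLine.for_order(self)
            | 
            |     # order_lines.py
            |     from orders import Order  # fine: no import back at load time
            | 
            |     class OrderLine:
            |         ...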
        
         | joxel wrote:
         | I'll have to do imports to change backends for matplotlib
         | sometimes
        
         | reallymental wrote:
         | Scope-dependent imports. What if a package is just required for
         | that particular function, and once that function is done, the
         | imported package is no longer required?
        
         | w0m wrote:
         | I do _sometimes_ - just depends on the context and how often
         | the function(xor library) is going to get called.
         | 
         | Here - they put `import fire` _only_ in the `if __name__ ==
         | "__main__":` - that seems reasonable to me as anyone pulling in
         | the library from elsewhere doesn't need the pollution.
        
           | theptip wrote:
           | Does that import have side effects? Are we really worried
           | about adding an entry to the imports dict if not? Or put
           | differently, what cases do we actually get a negative effect
           | from just importing at the top?
        
             | kelnos wrote:
             | Importing another module takes non-zero time and uses non-
             | zero memory, and let's face it: python is not exactly a
             | fast language. Personally I'd appreciate a library author
             | that takes steps to avoid a module load when that module is
             | only used (for example) in some uncommonly-taken code
             | paths.
             | 
             | In some (many?) cases it's probably premature optimization,
             | but it doesn't hurt, so I don't see why anyone would get up
             | in arms over it.
        
             | apetresc wrote:
             | Oh yeah, imports in Python are not just, like, extending a
             | namespace like in many other languages. They, at runtime,
             | go and run the module's __init__ and can have arbitrary
             | side effects - an entire program can run (although usually
             | shouldn't) just in the import. Imports of large modules
             | often take entire seconds.
             | 
             | It is absolutely worthwhile to avoid unnecessary imports if
             | possible.
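              | 
              | A contrived two-file example of what "an entire program
              | can run just in the import" means:
              | 
              |     # slow_module.py -- every line here runs the first
              |     # time anything imports this module
              |     import time
              |     print("probing devices, building caches...")
              |     time.sleep(2)
              | 
              |     # main.py
              |     import time
              |     t0 = time.perf_counter()
              |     import slow_module               # the 2s are paid here
              |     print(time.perf_counter() - t0)  # ~2.0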
        
               | theptip wrote:
               | I know they _can_ have side-effects, I've just never seen
               | a case where it actually mattered, and I have used Python
               | professionally for 10 years. So I'm curious if this is
               | more common in ML libraries or something.
        
               | apetresc wrote:
               | I guess it depends on your definition of "side-effects"
               | but it definitely comes up in common ML packages. For one
               | example, importing `torch` often takes seconds, or tens
               | of seconds, because the import itself needs to determine
               | if it should come up in CUDA mode or not, how many
               | OpenCL/GPU devices you have, how they're configured, etc.
        
             | Kichererbsen wrote:
             | importing is a _runtime_ operation: unless previously
             | imported, the interpreter will go and import that module,
             | executing that modules code. that can take a while. it will
             | also bind a name in the current scope to the modules name,
             | so... that might be considered pollution?
        
           | lizard wrote:
           | Right, I do this with argparse for creating simple CLIs for a
           | module generally intended to be imported and used in another
           | program. argparse has nothing to do with the actual module
            | functions and won't be needed if the module is going to be
           | used in a web app or some other context.
           | 
            | This makes even more sense for a non-standard library like
           | fire because you won't even need this dependency if you're
           | going to import the module and write your own interface
           | instead.
           | 
           | The import in main doesn't seem particularly useful in
           | context on a quick read, but considering the line
           | 
           | > utils.py contains the code to download and load the GPT-2
           | model weights, tokenizer, and hyper-parameters.
           | 
           | it seems possible some downloads are happening on import so
           | does make sense to defer until actually needed, as suggested
           | in sibling comments.
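            | 
            | The argparse version of that pattern looks something like
            | this (toy example, not from the repo):
            | 
            |     def generate(prompt):  # the importable, reusable part
            |         return prompt + " ..."
            | 
            |     if __name__ == "__main__":
            |         import argparse  # CLI-only dependency
            |         parser = argparse.ArgumentParser()
            |         parser.add_argument("prompt")
            |         args = parser.parse_args()
            |         print(generate(args.prompt))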
        
         | sega_sai wrote:
         | At least one reason to do that is to allow optional module
         | dependencies.
        
       | lspears wrote:
       | For those interested I would also check out Andrej Karpathy's
       | YouTube video on building GPT from scratch:
       | 
       | https://youtu.be/kCc8FmEb1nY
        
         | azath92 wrote:
         | Karpathy has a bunch of great resources on this front! His
         | minGPT writeup is excellent https://github.com/karpathy/minGPT
          | His more recent project nanoGPT, which references this video,
          | is a much more capable, but still learning-friendly,
          | implementation.
        
       | thomasfromcdnjs wrote:
       | This reads really well, thank you very much.
        
       | [deleted]
        
       | [deleted]
        
       | terran57 wrote:
       | From the article:
       | 
       | "Of course, you need a sufficiently large model to be able to
       | learn from all this data, which is why GPT-3 is 175 billion
       | parameters and probably cost between $1m-10m in compute cost to
       | train.[2]"
       | 
       | So, perhaps better title would be "GPT in 60 Lines of Numpy (and
       | $1m-$10m)"
        
         | 99_00 wrote:
         | Anyone know what the minimum cost for creating a model is and
         | what the limitation would be?
        
           | sharemywin wrote:
           | this is pretty small:
           | 
           | https://github.com/karpathy/nanoGPT
        
           | [deleted]
        
         | pumanoir wrote:
         | I saw this [1] presentation where they use scheme to train GPT
         | on a single consumer GPU. I've had no luck finding the 'scorch'
         | compiler they mentioned in the video.
         | 
         | 1.
         | https://youtu.be/rDke29MbKQA?list=PLyrlk8Xaylp7NvZ1r-eTIUHdy...
        
         | rvz wrote:
         | And it will be even more expensive to train it again on larger
         | amounts of data and with a model with 10 times more parameters.
         | 
         | Only Big Tech giants like Microsoft, Google, etc can afford to
         | foot the bill and throw away millions into training LLMs,
         | whilst we celebrate and hype about ChatGPT and LLMs getting
         | bigger and significantly more expensive to train when they get
         | confused, hallucinate over silly inputs and confidently
         | generate bullshit.
         | 
         | That can't be a good thing. OpenAI's ClosedAI model needs to be
         | disrupted like how Stable Diffusion challenged DALLE-2 with an
         | open source AI model.
        
           | Kranar wrote:
           | I disagree, I run a small tech company that has a group
           | that's been experimenting with stable diffusion and we
           | noticed that an extreme version of the Pareto Principle
           | applies here as well where you can get ~90% of the benefits
           | for like 5% of the cost, combined with the fact that
           | computing power is continuously getting cheaper.
           | 
            | Based on that group's success, they've recently proposed a
            | mini project inspired by GPT that I am considering funding;
            | the data it's trained on is all publicly available for free,
            | and most of it comes from Common Crawl. I suspect that it will
            | also yield similar results, where you can tailor your own
            | version of GPT and get reasonably good models for a fraction
            | of the price as well. We're nowhere close to the scale of
           | Big Tech giants, but I've noticed for the better part of 15
           | years that small companies can actually derive a great deal
           | of the benefits that larger companies have for a fraction of
           | the cost if they play it smart and keep things tight.
        
             | 99_00 wrote:
             | Do you think it is possible for the AI to request
              | information to fill in gaps in its model?
             | 
             | For example, the AI doesn't have enough information about a
              | company's process, or a regulation. It chats with an expert
             | to fill in the gaps.
             | 
             | I have no understanding of AI
        
               | TheCoreh wrote:
                | It is! You can specify in its prompt that it should
                | "request additional info via a search query, using the
                | following syntax: [[search terms here]], before coming
                | to a final conclusion", then you integrate it with a
                | traditional knowledge-base text lookup, and run it
                | again with that information concatenated.
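                | 
                | Roughly (a sketch; `complete` and `search_kb` are
                | placeholders for whatever LLM call and knowledge-base
                | lookup you're using):
                | 
                |     import re
                | 
                |     PROMPT = (
                |         "Answer the question. If you need more "
                |         "info, first emit a search query "
                |         "as [[terms]].\n"
                |     )
                | 
                |     def answer(q, complete, search_kb):
                |         text = complete(PROMPT + q)
                |         m = re.search(r"\[\[(.+?)\]\]", text)
                |         if m:  # the model asked for more info
                |             found = search_kb(m.group(1))
                |             text = complete(PROMPT + q +
                |                 "\n\nSearch results:\n" + found)
                |         return text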
        
               | HellsMaddy wrote:
               | Yes, check out LangChain [0]. It enables you to wire
               | together LLMs with other knowledge sources or even other
               | LLMs. For example, you can use it to hook GPT-3 up to
               | WolframAlpha. I'm sure you could pretty easily add a way
               | for it to communicate with a human expert, too.
               | 
               | [0]: https://github.com/hwchase17/langchain
        
               | alfor wrote:
               | Yes.
               | 
               | It's trained on completing the text.
               | 
                | If an expert writes a long text and you add "in summary: "
                | at the end, the model will complete with something
                | approximating truth (depending on model size, training,
                | etc.)
                | 
                | Humans do a similar thing. We have a model in our head
               | of the subject discussed and we can summarize, but we
               | will forget some parts, make errors, etc. GPT is very
               | similar.
        
               | simonw wrote:
               | This is happening already. The trick is to run a search
               | against an existing search engine, then copy and paste
               | the search results into the language model and ask it to
               | answer questions based on what you provide it.
               | 
               | This is how the new Bing Assistant works. It's also how
               | search engines like https://you.com/ and
               | https://www.perplexity.ai/ work - as exposed by a prompt
               | leak attack against Perplexity a few weeks ago:
               | https://simonwillison.net/2023/Jan/22/perplexityai/
               | 
               | I wrote a tutorial about one way of implementing this
               | pattern yourself here:
               | https://simonwillison.net/2023/Jan/13/semantic-search-
               | answer...
        
               | [deleted]
        
               | crosen99 wrote:
               | A small difference between the pattern you describe and
               | the one of the inquiry is where responsibility lies for
               | retrieving and incorporating the augmentation. You
               | describe the pattern where an orchestration layer sits in
               | front of the model, performs the retrieval, and then
               | determines how to serve that information down to the
               | model. The inquiry asks about whether the AI/model itself
               | can perform the retrieval and incorporation function.
               | 
               | It's a small difference, perhaps, but with some
               | significance since the retrieval and incorporation
               | occurring outside the model has a different set of trade
               | offs. I'm not specifically aware of any work where model
               | architectures are being extended to perform this function
               | directly, but I am keen to learn of such efforts.
        
               | [deleted]
        
         | zeknife wrote:
         | There are GPT-2 checkpoints small enough to run on basically
         | any modern computer
        
         | MuffinFlavored wrote:
         | Will one business model be for OpenAI to "license" out access
         | to their trained model?
         | 
         | How large is the model on disk(s) once it is trained?
        
           | hackernewds wrote:
           | They must have time traveled to today in the past and read
           | your comment, since this is precisely their business model!
        
           | theptip wrote:
           | Perhaps I'm missing your point, but isn't that what they do
           | with their API right now? You pay for text completions, and
           | can fine-tune their model with your data.
        
             | veqq wrote:
             | But you can't run the code on your own machine.
        
               | mattnewton wrote:
                | Of course, if they leaked the model weights and a local
               | inference binary for it they would lose the ability to
               | charge for it. Clones with the weights would crop up all
               | over the place.
        
           | shagie wrote:
           | From various sources, the model itself is about 800 GB on
           | disk.
        
       | barbazoo wrote:
       | So much criticism in the comments. I appreciated the write-up and
       | the code samples. For some people not in ML like myself it's hard
       | to understand the concept behind GPT and this made it a little
       | bit clearer.
        
         | tysam_and wrote:
         | I think this is a factor of putting one's self out there. I've
         | had this happen on ML projects I've put out too, though being
         | hyper-engaged in trying to thoughtfully respond to all (or as
         | many as possible of) the comments section for me has seemed to
          | lower negativity a bit just because it brings the 'person-
          | in-the-room' effect up to an online audience...at least, so I
         | think! :D
         | 
          | I thought it was a great post, and many kudos to the author for
          | putting themselves out there like that! I really appreciated
          | this; any work that puts this kind of effort into onboarding
          | people and giving them tools to understand something well has,
          | I think, some of the most long-term impact on the field.
         | 
         | Lowering barriers to entry, making resources accessible to all,
         | and decreasing experimentation cycle time I think are some of
         | the most critical components to making any progress at all in
         | the field beyond a basic pittance. Imagine if everyone had easy
         | access to, knowledge about, and rapid experimentation results
         | in things like quantum mechanics, large-algorithm testing,
         | painting arts, musical arts, etc. It would drive things so much
         | further forward at an individual and field-based level so
         | quickly. <3 :)))) :D :D ;D :D :D :))))))))
        
       | eslaught wrote:
       | I know this probably isn't intended for performance, but it would
       | be fun to run this in cuNumeric [1] and see how it scales.
       | 
       | [1]: https://github.com/nv-legate/cunumeric
        
       | adamnemecek wrote:
       | It turns out that transformers have a learning mechanism similar
       | to autodiff but better since it happens mostly within the single
       | layers as opposed to over the whole graph. I wrote a paper on
       | this recently https://arxiv.org/abs/2302.01834v1. The math is
       | crazy.
        
         | LukeB42 wrote:
         | Can you explain like I'm 5 why this matters distinctly from how
         | transformers are normally trained with autodiff and what its
         | possible applications are?
        
           | adamnemecek wrote:
           | I'm talking about attention only transformers. Those don't
           | have an autodiff but still learn. The math is actually really
           | cool.
        
         | macrolocal wrote:
         | First question: why should the attention mechanism output and
         | residual stream match?
        
           | adamnemecek wrote:
            | Match is a bad word, they don't match, they are duals. The
           | residual stream aka identity mapping needs to be the identity
           | of the attention mechanism as the attention mechanism learns.
           | 
           | But this is the same for all residual streams, not just those
           | in transformers.
           | 
           | Join my discord to discuss this further
           | https://discord.gg/mr9TAhpyBW
        
             | macrolocal wrote:
             | Wait-- the residual stream makes the attention mechanism
              | learn the _difference_ from the identity! Are you sure
              | you're not thinking about auto-encoders?
             | 
             | Edit: ok, Discord it is.
        
               | adamnemecek wrote:
               | Do you see a similarity between residual stream and Dirac
               | function?
        
               | crosen99 wrote:
               | I don't believe autodiff is finding the difference in
               | that sense. It's finding derivatives.
        
               | macrolocal wrote:
               | Well, the paper uses gradient descent to minimize that
                | _difference_, like auto-encoders do.
        
               | crosen99 wrote:
               | Gradient descent is just how neural networks (including
               | auto-encoders) optimize parameters to minimize the loss
               | function. They do this using derivatives to descend down
               | the slope of the function. Autodiff is one way to compute
               | the derivatives. Maybe we're saying the same thing.
        
         | tpoacher wrote:
         | "Combinatorial Hopf" would make an excellent beer name!
         | 
         | "Bartender! A half-pint of your finest Combinatorial Hopf, if
         | you please!"
        
       | freecodyx wrote:
       | Since most models require little code compared to big software
        | projects, why not use C++ or any other compiled language
        | directly? Python, with its magic functions and shortcuts, is just
        | hiding too much complexity, which can result in bugs and
        | performance issues. Plus the code is harder to maintain.
        
         | CaptainNegative wrote:
         | > Python with it's magic functions, shortcuts is just hiding
         | too much complexity
         | 
         | One counterpoint would be that verbosity, especially in the
         | heavy syntax style of languages such as C++, distracts the
         | reader and helps bugs hide in plain sight. For a silly example,
         | imagine trying to read and verify the correctness of an
         | academic paper from its uncompiled LaTeX source.
        
         | stavros wrote:
         | Which magic functions and shortcuts in the posted code do you
         | feel might introduce bugs?
        
           | freecodyx wrote:
           | In general, the article is fine.
        
             | aiisjustanif wrote:
             | I'm curious as well, what specific lines would cause an
              | issue or abstract too much.
        
       ___________________________________________________________________
       (page generated 2023-02-09 23:00 UTC)