Some history: ChatGPT and neural nets got a real kick in the ass when Robert Mercer decided to apply Markov chains to high-frequency trading. Markov came up with his theories shortly before being blackballed by the University of St. Petersburg for refusing to rat his students out to the Tsar. Mercer, for his part, basically figured out that Markov chains would allow his hedge fund to reverse-engineer the trading decisions of other dark pool trading bots and front-run them. This made him a lot of money that he didn't want to pay taxes on, so he hired Cambridge Analytica to destroy the world. All that and he only delayed the inevitable.

In other words, the blacker the box, the better the performance. This is important because we're talking about training a model - a representation of reality. There is no part of neural networks or Markov bots that attempts to explain that model; their sole purpose is to ape it, input to output. They will give you what you have, and hopefully allow you to predict what you'll get... assuming the future matches the past.

My sound world is governed and defined by Fourier transforms. This is applied math that argues any function, no matter how random and chaotic, can be modeled as a sum of sine waves. It's a curve fit, and for most things it's good enough. You talking into your phone becomes a collection of bits through liberal application of Fourier transforms. And most of the time it works and the world continues in its orbit, but sometimes the normies can't tell if it's yanny or laurel, at which point we need experts who can explain, in no uncertain terms, that it's fucking laurel, that the normie confusion is due to their inexperience with codecs gone bad, and that when the curve fit no longer fits the curve, the philosophical "what does green even mean, maaaaaan" discussion is fine for Medium, but if you're prepping a legal brief, "green" is generally accepted to mean 495-570 nm, full stop.

All well and good unless your video game doesn't include bicyclists at night where there are no crosswalks.

Had an interesting insight while still talking to my mother. Medicaid was paying for her physical therapy. She got "better" much faster than my sister or I anticipated - although her therapists never used the word "better." They kept using the phrase "return to baseline." At one point I asked what, precisely, "return to baseline" meant. The lead therapist cleared her throat, put on her lawyer hat, and stated that for purposes of Medicaid reimbursement, "baseline" is that level of performance at which improvement plateaus, such that qualitative measures improve no more than twenty percent over the course of X sessions, where X depends on the qualitative measure.

"What you're telling me," I said, "is that 'baseline' is not 'where was she before' but 'where does improvement stop.'"

"For purposes of Medicaid reimbursement, that is correct," she said.

Now - my mother left their tender care with a walker. She was good for 20 steps, with assistance, before being winded and in pain. Prior to the accident she was getting around without assistance.

"Flattens out at some constant value" does not mean the problem is solved; it means the model can't get any closer. Yeah - "if that value is sufficiently small, then the training can be considered successful" - but who is determining the value? "Our self-driving model has avoided running over imaginary bicyclists for 2 million runs, it'll be fine in Phoenix." Yeah, but are we good enough?
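The therapist's definition translates directly into the stopping rule every training run quietly uses. A minimal sketch, treating her numbers - twenty percent over X sessions - as the knobs; the function and variable names are illustrative, not from any real framework:

```python
def has_plateaued(history, sessions=5, threshold=0.20):
    """history: the qualitative measure per session (lower is better), oldest first."""
    if len(history) < sessions + 1:
        return False                       # not enough sessions to judge yet
    before = history[-(sessions + 1)]      # where we were X sessions ago
    now = history[-1]                      # where we are today
    if before == 0:
        return True                        # nothing left to improve
    improvement = (before - now) / before  # fractional improvement over the window
    return improvement < threshold         # "baseline": improvement has stalled

losses = [10.0, 6.0, 4.5, 4.2, 4.1, 4.05, 4.02, 4.01]
print(has_plateaued(losses))   # True: ~11% improvement over the last 5 sessions, under the 20% bar
```

Note what the function does not check: whether the number it parked at is anywhere near where you started, or anywhere near good enough.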
Remember - ELIZA was created to show what a bad fucking idea all this bullshit is IN 1965! Joseph Weizenbaum's ELIZA, running the DOCTOR script, was created to provide a parody of "the responses of a non-directional psychotherapist in an initial psychiatric interview" and to "demonstrate that the communication between man and machine was superficial."

And at some point, someone will decide that tradeoff is good enough - Microsoft figured Bing was ready, Google figured Bard was ready. To do that, they performed a sleight-of-hand that Microsoft didn't pull with Tay, but which underpins all this bullshit: large language models are trained to talk. Search engines are supposed to provide ANSWERS.

This is a great article that I would badge if I had any left. Stephen Wolfram is a straight shooter in my experience, and he's done a great job of explaining what ChatGPT is doing - it's chatting. It's not "AnswerGPT." "I'm sorry but you're mistaken, Avatar isn't out yet because the year is 2022" is a stellar answer for a chat bot. It's fucking garbage from a search standpoint. This is bad enough when we're looking up showtimes. Complications ensue when we're making staffing decisions.

This is very much like how Google made their AI marginally less racist by deleting gorillas from the model. That one sentence - "I don't think there's any particular science to this" - is why the whole thing is going to crash and burn in an extremely ugly way after doing a fuckton of damage.

"Please explain for the jury, Doctor Scientist, how your program determined my client's position should be terminated due to her work performance, rather than her inability to thrive in a racist environment."

"Thank you, Doctor Scientist. No further questions."

Title VII? 13,000 words. The theoretical, scientific underpinnings of Markov chains and neural networks will severely limit any LLM's ability to accurately reproduce law, let alone parse it. Sure - but what we do with it and how we make it isn't, QED.

The problem here is, and has always been, pareidolia. We see something that talks and we presume it has a soul. The better it talks, the more soul we assign to it. The more soul we assign to it, the more value it has, and the more value it has, the more we let it trample humans. Until, that is, we've trampled enough humans that they threaten to tear down society.

The fact that there's a lot more news about ChatGPT sucking than ChatGPT succeeding is on the one hand heartening but on the other hand deeply discouraging. Neither Microsoft nor Google care. There is no bad news. Fuckups and how they respond to them just give faceless corporations a chance to show how much they care. And the Markov bots only operate on a time horizon of a few milliseconds anyway, so we're looking for a derivative of a derivative of a derivative of a signal in order to juice the stock price.

"Dr. Wolfram, can you please explain whether these 'large language models' can separate meaningful language from meaningless gibberish?"

"In other words, Dr. Wolfram, flawed data will produce flawed responses?"

"Thank you, Dr. Wolfram. Your Honor, the prosecution rests."

_______________________________________

A Fourier transform will allow you to process an analog signal digitally. It rounds the corners off square waves, but then, so does physics. It's "good enough" for what we need most of the time - you listen to Spotify at 128kbps VBR, I mix at 48kHz 32-bit floating point unless I need 96kHz or 192kHz.
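For the curious, here's roughly what "rounds the corners off square waves" means in practice: rebuild a square wave from only its first few harmonics, which is all any band-limited chain - a codec, a 48 kHz converter - gets to keep. A toy numpy sketch; the specific numbers are illustrative:

```python
import numpy as np

fs = 48_000                       # sample rate, Hz
f0 = 1_000                        # square wave fundamental, Hz
t = np.arange(fs) / fs            # one second of samples
ideal = np.sign(np.sin(2 * np.pi * f0 * t))        # the "true" square wave

def band_limited_square(n_harmonics):
    """Fourier-series square wave built from its first n odd harmonics."""
    x = np.zeros_like(t)
    for k in range(n_harmonics):
        n = 2 * k + 1                                # 1, 3, 5, ...
        x += (4 / np.pi) * np.sin(2 * np.pi * n * f0 * t) / n
    return x

for n_harmonics in (3, 7, 12):                       # 12 keeps the top harmonic under fs/2 = 24 kHz
    approx = band_limited_square(n_harmonics)
    rms_err = np.sqrt(np.mean((approx - ideal) ** 2))
    print(f"{n_harmonics:2d} harmonics -> RMS error {rms_err:.3f}")
# The error shrinks as you keep more harmonics, but the corners stay rounded and
# ring (Gibbs), and past fs/2 there are no more harmonics to keep anyway.
```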
Even then, it tells you what is, not what will be - and the whole of what we want LLMs to do is tell us what will be. Large language models are improvisational LUTs - look-up tables. LUTs are great so long as you don't wander off the map (there's a sketch of what that looks like below). In the case of office racism, the AI knows what a stereotypical employee should do in a stereotypical environment, and anything that deviates from the stereotype is statistically rounded off.

Ergonomics and biomechanics are governed by the "5th-percentile human" and the "95th-percentile human." Your cars, your bicycles, your scissors, your coffee mugs are designed around 90% of humanity, and the other 10% cope, for better or worse. I've long said that any schlub can do 80-90% of any job; it's that last 10-20% that keeps you employed. AI is gonna be great for the stuff that requires no expertise. Unfortunately, expertise involves knowing when expertise is required, and AIs suck at that.

Google was gonna have their self-driving cars on the road by what, 2018? This is the problem marketing always has: they don't understand the difficulty of complex problems and they don't want to. Google is usually smart enough not to let marketing steer the ship, while Tesla is the opposite of that. Results were predictable. Unfortunately for big stupid tech companies, Western law has sided with "wronged individual" over "faceless corporation" every time the faceless corporation can't prove they were abiding by the law. And the Achilles heel of AI is that the more sophisticated it is, the less you can prove.

Particularly over the past decade, there've been many advances in the art of training neural nets. And, yes, it is basically an art. Sometimes—especially in retrospect—one can see at least a glimmer of a "scientific explanation" for something that's being done. But mostly things have been discovered by trial and error, adding ideas and tricks that have progressively built a significant lore about how to work with neural nets.
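The sketch promised above: a toy look-up table that behaves itself between the points it was built from and quietly makes something up when you wander off the map. Everything here is invented for the example:

```python
import numpy as np

# Table built from observations covering 0..10 only.
known_x = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
known_y = known_x ** 2          # the "reality" we happened to sample

def lut(x):
    # np.interp interpolates inside the table and clamps to the edge values outside it.
    return np.interp(x, known_x, known_y)

print(lut(5.0))    # 26.0  -- true value is 25: on the map, good enough
print(lut(15.0))   # 100.0 -- true value is 225: off the map, confidently wrong
```

The bicyclist at night where there's no crosswalk and the employee who doesn't match the stereotype are both off the map, and the table still hands back a number.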
In earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that—at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself.
And, similarly, when one’s run out of actual video, etc. for training self-driving cars, one can go on and just get data from running simulations in a model videogame-like environment without all the detail of actual real-world scenes.
And what one typically sees is that the loss decreases for a while, but eventually flattens out at some constant value. If that value is sufficiently small, then the training can be considered successful; otherwise it’s probably a sign one should try changing the network architecture.
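A toy version of that flattening, for anyone who wants to watch it happen: fit a straight line, by gradient descent, to data that is genuinely curved. The loss stops improving because the model can't get any closer, not because the fit is good. All numbers below are arbitrary:

```python
import numpy as np

x = np.linspace(-1, 1, 200)
y = x ** 2                             # the "reality": a parabola

w, b = 0.0, 0.0                        # the model: y_hat = w*x + b, a straight line
lr = 0.1                               # learning rate
for step in range(1001):
    y_hat = w * x + b
    err = y_hat - y
    loss = np.mean(err ** 2)
    if step in (0, 1, 2, 3, 5, 10, 100, 1000):
        print(f"step {step:4d}  loss {loss:.4f}")
    w -= lr * np.mean(2 * err * x)     # gradient descent step on w
    b -= lr * np.mean(2 * err)         # gradient descent step on b
# The loss drops, then parks just under 0.09: the best any straight line can do
# against a parabola. Getting lower would take a different model, not more training.
```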
In the future, will there be fundamentally better ways to train neural nets—or generally do what neural nets do? Almost certainly, I think.
Or put another way, there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable. And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.
Why does one just add the token-value and token-position embedding vectors together? I don’t think there’s any particular science to this. It’s just that various different things have been tried, and this is one that seems to work.
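In code, the step being shrugged about looks like this: two learned tables, one indexed by what the token is, one by where it sits, and their rows simply summed. A toy-sized numpy stand-in, not GPT-2's actual tables:

```python
import numpy as np

vocab_size, max_positions, d_model = 50, 16, 8
rng = np.random.default_rng(0)
wte = rng.normal(size=(vocab_size, d_model))      # token-value embedding table
wpe = rng.normal(size=(max_positions, d_model))   # token-position embedding table

token_ids = np.array([3, 17, 42, 7])              # a four-token "sentence"
positions = np.arange(len(token_ids))             # 0, 1, 2, 3

x = wte[token_ids] + wpe[positions]               # element-wise sum, shape (4, 8)
print(x.shape)
# Why addition rather than concatenation or something cleverer? Per the passage
# above: no particular science -- it's one of the things that was tried and works.
```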
But anyway, here’s a schematic representation of a single “attention block” (for GPT-2):
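Roughly: layer norm, masked multi-head self-attention with a residual connection around it, another layer norm, then an MLP with its own residual connection. A minimal PyTorch sketch with made-up dimensions - GPT-2's real blocks add dropout, specific initialization, and much bigger widths:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual connection around attention
        x = x + self.mlp(self.ln2(x))  # residual connection around the MLP
        return x

x = torch.randn(1, 10, 64)             # (batch, sequence length, embedding width)
print(AttentionBlock()(x).shape)       # torch.Size([1, 10, 64])
```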
(As a personal comparison, my total lifetime output of published material has been a bit under 3 million words, and over the past 30 years I’ve written about 15 million words of email, and altogether typed perhaps 50 million words—and in just the past couple of years I’ve spoken more than 10 million words on livestreams. And, yes, I’ll train a bot from all of that.)
So how is it, then, that something like ChatGPT can get as far as it does with language? The basic answer, I think, is that language is at a fundamental level somehow simpler than it seems.
...is there a general way to tell if a sentence is meaningful? There’s no traditional overall theory for that. But it’s something that one can think of ChatGPT as having implicitly “developed a theory for” after being trained with billions of (presumably meaningful) sentences from the web, etc.
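You can poke at that implicit "theory" directly: ask a small language model how surprised it is by a sentence. Perplexity is just "how much does this look like the text I was trained on" - which is not the same thing as true, fair, or meaningful. A sketch assuming the Hugging Face transformers package and the small public GPT-2 checkpoint:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean next-token cross-entropy
    return torch.exp(loss).item()

print(perplexity("The cat sat on the mat."))                  # familiar phrasing: should score low
print(perplexity("Mat the on sat cat the."))                  # same words scrambled: should score much higher
print(perplexity("Colorless green ideas sleep furiously."))   # scored on familiarity, not on meaning
```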
The basic concept of ChatGPT is at some level rather simple. Start from a huge sample of human-created text from the web, books, etc. Then train a neural net to generate text that’s “like this”. And in particular, make it able to start from a “prompt” and then continue with text that’s “like what it’s been trained with”.
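Stripped of the neural net, the shape of that idea fits in a few lines. A word-level Markov chain standing in - crudely, deliberately - for the real thing: tally what follows what in the training text, then take a prompt and keep emitting whatever tends to come next. It can only ever give you back what you gave it.

```python
import random
from collections import defaultdict

training_text = (
    "the model gives you what you have and hopefully lets you predict what you "
    "will get assuming the future matches the past the model gives you what you have"
)

# "Training": count which word follows which.
transitions = defaultdict(list)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

def continue_from(prompt_word, length=12, seed=0):
    random.seed(seed)
    out = [prompt_word]
    for _ in range(length):
        choices = transitions.get(out[-1])
        if not choices:          # wandered off the map: the table has nothing to say
            break
        out.append(random.choice(choices))
    return " ".join(out)

print(continue_from("the"))
```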