The whole series is good if you're familiar with some linear algebra, but I found this one particularly interesting because of the last chapter (17:00 mark onwards) discussing superposition.
I had always conceptualized the 50K neurons of GPT as 50K parameters/dimensions it could assess tokens on, not realizing that superposition exponentially increases the number of possible representations, and with it what we perceive as nuance. That also makes it effectively impossible to know what it is 'thinking', since there are far too many superposed features to account for. And, as I read somewhere else but can't find now, it also means no LLM output can ever be made fully resistant to adversarial agents.
It's also fascinating that GPTs scale so well essentially through what feels like a hack: higher-dimensional spaces can hold far more vectors once you drop the assumption that the vectors have to be exactly perpendicular. The much more nuanced language modeling of LLMs comes from their sheer size plus this quirk of linear algebra, is how I understand that part.
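You can see that quirk directly with a quick numpy sketch (my own illustration, not from the video): in n dimensions you can only fit n exactly perpendicular directions, but random unit vectors in a high-dimensional space are already *nearly* perpendicular to each other, so many more almost-independent directions can share the same space.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (10, 100, 10_000):
    # Draw far more random directions than could ever be exactly orthogonal in low dims.
    vecs = rng.normal(size=(500, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize to unit length

    # Cosine similarity between every pair; 0 would mean exactly perpendicular.
    cos = vecs @ vecs.T
    off_diag = cos[~np.eye(len(cos), dtype=bool)]

    print(f"dim={dim:>6}: mean |cos angle| = {np.abs(off_diag).mean():.3f}, "
          f"max |cos angle| = {np.abs(off_diag).max():.3f}")
```

As the dimension grows, the angles between random pairs concentrate around 90 degrees (cosine near 0), which is the loophole that superposition exploits.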