I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy
ByteDance drops OmniHuman🔥 This is peak SOTA performance - flawless natural gestures with perfect lip sync and facial expressions. This is the second time they've released SOTA level talking-heads only this time with hands and body motion. Project: https://omnihuman-lab.github.io/
The beauty in GRPO is the fact that it doesn’t care if the rewards are rule-based or learned, the hack: let the data self-normalize— trajectories in a batch compete against their mean, no value model, no extra params, just clean, efficient RL that cuts memory usage by 50%, while maintaining SOTA performance. btw it was introduced 9months prior to R1: arxiv.org/pdf/2402.03300
I updated the LLM Scientist roadmap and added a ton of new information and references. It covers training, datasets, evaluation, quantization, and new trends like test-time compute scaling.
The LLM Course has been incredibly popular (41.3k stars!) and I've been touched to receive many, many messages about how it helped people in their careers.
I know how difficult this stuff can be, so I'm super proud of the impact it had. I want to keep updating it in 2025, especially with the LLM Engineer roadmap.
minimal single script implementation of knowledge distillation in LLMs. In this implementation, we use GPT-2 (124M) as student model and GPT-2 Medium (340M) as teacher via reverse Kullback-Leibler (KL) divergence, trained on a small chunk of openwebtext.
damn I love nvidia's bullish stance on taking AI to the edge - from being the overlord of compute to cutting-edge physical AI with SOTA multiverse simulation engines that brings the scaling laws under your control!!
My favorite: Cosmos - fully opensourced, open-weight physics based video gen platform, what an incredible way to start off the year✨
nanoBLT: Simplified lightweight implementation of a character-level Byte Latent Transformer model (under 500 lines of code). The model is 2x4x2 (n_layers_encoder, n_layers_latent, n_layers_decoder) layer deep trained on ~1M bytes of tiny Shakespeare with a patch size of 4.