# Weekly Digest #8: 6th Sep 2024
Business: Alibaba Cloud's Qwen2-VL dominates video LMs, Ilya Sutskever's new start-up Safe Superintelligence garners $1 billion in funding, Anthropic's growing competition with OpenAI, the end of Apple Pay's NFC monopoly, Telegram CEO's free-speech tweet, and the problem with LMSys's Chatbot Arena
Technology: Magic's 100-million-token context window and architecture commentary
Long Read: MLCommons' machine-learning benchmark for the chip industry - the growing competition to NVIDIA in the inference space
Resources: Building LLMs from the ground up - a 3-hour workshop, the Physics of Language Models paper series, a machine-learning roadmap for 2024
AI in Business
Alibaba Cloud released the Qwen2-VL models, which achieve state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. The models' visual capabilities were evaluated across several key dimensions: complex college-level problem-solving, mathematical abilities, document and table comprehension, multilingual text-image understanding, general scenario question-answering, video comprehension, and agent-based interactions. Overall, the 72B model shows top-tier performance across most metrics, often surpassing closed-source models like GPT-4o and Claude 3.5 Sonnet. (Official Blog, QwenLM GitHub page)
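For readers who want to try the models, here is a minimal sketch of querying Qwen2-VL through its Hugging Face transformers integration; the model id follows the official release, while the image file and prompt are placeholder assumptions.

```python
# A minimal sketch of querying Qwen2-VL via Hugging Face transformers
# (requires transformers >= 4.45; image path and prompt are placeholders).
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("invoice_page.png")  # placeholder document image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the table in this document as Markdown."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the model's answer is decoded.
answer = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(answer, skip_special_tokens=True)[0])
```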
Safe Superintelligence (SSI) raises $1 billion at a $5 billion valuation. SSI was founded by OpenAI co-founder Ilya Sutskever to promote AGI safety. The startup is nascent, with no available information on its technical underpinnings or revenue model - at the moment this looks like a persona-driven and potentially speculative investment. (Kenrick Cai et al., Thomson Reuters)
Traffic to Anthropic's Claude models has increased drastically, but Anthropic's 15 million page visits a month pale in comparison with 337 million for OpenAI's ChatGPT. That is the consumer market; matters are just warming up in the enterprise segment, where Anthropic could pose a significant competitive threat to OpenAI and its ~1 million enterprise customers. (William Coulman, Sherwood News)
Apple Pay will no longer be the only NFC payment option on iPhones. After 10 years as a very successful product, Apple is being forced to allow third-party app developers to introduce their own NFC payment options. There are pros and cons: on one hand, more options and potentially better prices for customers; on the other, possible payment fraud and privacy issues from spurious app developers. (David Pierce, The Verge)
Pavel Durov (CEO of Telegram) tweeted on X days after his arrest saga in Paris on charges of fraud and trafficking. Durov speaks for freedom and privacy and aims to keep Telegram's encryption, but how much accountability he will take for genuine concerns about terrorist activity and pirated content is unclear from his post. (Pavel Durov, X tweet)
TechCrunch criticizes Chatbot Arena, the increasingly popular LLM leaderboard maintained by LMSys (a non-profit body), arguing it may not be the right metric to use. Kyle Wiggers argues that LMSys is not transparent in its evaluation approach, and that its current approach seems to favour proprietary LLMs over open-source ones. (Kyle Wiggers, TechCrunch)
Technology Updates in AI
Magic released details on ultra-long-context models that may process up to 100M-token contexts. This can be a game changer: instead of fine-tuning with your data or applying a RAG framework, you could supply the entire body of knowledge in the context and then interact meaningfully with the model - all at a fraction of the compute Llama 3.1 would need to process a 100M-token context window (from ~650 H100s down to a fraction of a single H100, i.e. ~1000x cheaper in compute).
How does Magic propose to achieve this?
The attention mechanism in traditional transformer models requires compute that grows quadratically with the number of tokens (and a key-value cache that grows linearly), so a 100M-token context is virtually impossible to run naively. Even at a 1M-token window (the context window of Gemini models, considered the longest), models tend to game evaluations by focusing on a narrow section of the document.
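A quick back-of-the-envelope script makes the quadratic scaling concrete (the 2-byte figure assumes fp16/bf16 scores; streaming kernels such as FlashAttention avoid materialising the full matrix, but prefill compute still scales with the square of the context length):

```python
# Why naive dense attention breaks down at 100M tokens: the score matrix
# grows quadratically with sequence length.
for n_ctx in (8_192, 1_000_000, 100_000_000):
    scores = n_ctx ** 2    # entries in one head's attention score matrix
    gb = scores * 2 / 1e9  # at 2 bytes (fp16/bf16) per entry
    print(f"n_ctx={n_ctx:>11,}: {scores:.1e} scores ~ {gb:,.1f} GB per head per layer")
```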
Magic's LTM models use a hash-based memory management approach (hashing forces the model to scan the entire context rather than finding a local reference and settling for it) and sequence-dimension processing (splitting the context into whole-sequence blocks). Hashing makes the information in the context incompressible but also allows for more targeted retrieval, where only specific pieces of information are stored and accessed when needed, instead of maintaining a full key-value cache for every token. This drastically reduces memory requirements and, consequently, the need for large GPU resources.
Sequence-dimension processing involves either sparse attention, segment-wise processing, or retrieval-based attention, where the model attends only to relevant portions of the sequence.
Hash-based memory management refers to a technique where information is stored and retrieved using hash functions, which generate unique, fixed-size representations (hashes) for given pieces of data; a toy sketch follows below.
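Magic has not published its actual LTM design, so the following is only a toy illustration of the general idea: facts are stored in hash buckets, and a query fetches just its own bucket, so lookup cost stays flat no matter how large the memory grows. A real system would presumably use learned or locality-sensitive hashes over embeddings rather than exact string hashing.

```python
# Toy illustration only -- NOT Magic's undisclosed LTM design.
import hashlib
from collections import defaultdict

N_BUCKETS = 1024

def bucket(text: str) -> int:
    # A hash function maps arbitrary content to a fixed-size slot index.
    return int(hashlib.sha256(text.encode()).hexdigest(), 16) % N_BUCKETS

memory = defaultdict(list)  # bucket index -> stored facts

def store(fact: str) -> None:
    memory[bucket(fact)].append(fact)

def retrieve(query: str) -> list:
    # Only one bucket is scanned instead of the entire memory. A real
    # system would use learned/locality-sensitive hashes over embeddings
    # so that *similar* (not just identical) queries land in the bucket.
    return memory[bucket(query)]

store("The payment service launched in 2014.")
print(retrieve("The payment service launched in 2014."))
```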
Magic's own estimate: the FLOPs cost of Llama 405B's attention mechanism is n_layers * n_heads * d_head * n_ctx * 2 per output token. At 100M context, Magic's mechanism is roughly 1,000 times cheaper for LTM-2-Mini; for the largest LTM-2 model, context will be roughly twice as expensive as for LTM-2-Mini, so still 500x cheaper than Llama 405B. This comparison covers only Llama's attention mechanism and the LTM mechanism's FLOPs and memory-bandwidth load; costs from other parts of the model, such as Llama's MLP, that are constant with respect to context size per decoded token are not considered.
On the memory side, the KV cache for Llama 405B at 100M tokens works out to 126 layers * 8 GQA groups * 128 d_head * 2 bytes * 2 (for k & v) * 100 million tokens ≈ 51 TB. An H100 has 80 GB of memory, and 51 TB / 80 GB = 637.5 H100s.
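A few lines of Python reproduce this arithmetic (the ~645 figure differs from 637.5 only because the blog rounds 51.6 TB down to 51 TB; the 128 query heads are Llama 3.1 405B's published configuration):

```python
# Reproducing the back-of-the-envelope numbers for Llama 405B at 100M context.
n_layers   = 126
kv_groups  = 8            # GQA key/value groups
d_head     = 128
n_ctx      = 100_000_000  # 100M tokens
bytes_fp16 = 2

# KV cache: layers * kv-groups * head-dim * 2 (k and v) * bytes * tokens
kv_bytes = n_layers * kv_groups * d_head * 2 * bytes_fp16 * n_ctx
print(f"KV cache: {kv_bytes / 1e12:.1f} TB")          # ~51.6 TB
print(f"H100s at 80 GB each: {kv_bytes / 80e9:.0f}")  # ~645 (blog rounds to 637.5)

# Per-output-token attention FLOPs, using the formula quoted above.
n_heads = 128  # Llama 3.1 405B query heads
flops = n_layers * n_heads * d_head * n_ctx * 2
print(f"Attention FLOPs per decoded token: {flops:.2e}")  # ~4.13e14
```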
(Official Blog, Magic)
Long Read
The MLPerf Inference benchmark suite, which covers both data-centre and edge systems, is designed to measure how quickly hardware systems can run AI and ML models across a variety of deployment scenarios.
Additional commentary is provided in an IEEE Spectrum article:
Nine different AI benchmarks were evaluated, covering tasks like image generation, LLM Q&A, object detection, and recommendation engines.
Nvidia’s dominance in AI training remains strong, but competition in AI inference, especially in power efficiency, is growing. Nvidia’s new Blackwell chip performed 2.5 times better than previous models on the LLM Q&A benchmark, its only submission.
Nvidia's Blackwell chip supports 4-bit floating-point precision, enhancing performance and computational speed (a generic quantization sketch follows below). Blackwell's success is also attributed to its increased memory bandwidth, nearly double that of the H200 chip.
Cerebras and FuriosaAI announced new inference chips but did not submit to the MLPerf benchmark.
(Dina Genkina, IEEE Spectrum & MLCommons)
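Nvidia's FP4 is a floating-point format whose details are not covered here; the sketch below uses generic symmetric 4-bit integer quantization merely to illustrate why dropping from 16 to 4 bits cuts memory and bandwidth by 4x at a small accuracy cost.

```python
import numpy as np

# Generic symmetric 4-bit integer quantization -- illustrative only;
# Nvidia's FP4 is a 4-bit *floating-point* format, not this scheme.
def quantize_int4(x: np.ndarray):
    scale = np.abs(x).max() / 7.0  # map the largest weight to +/-7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # a toy weight matrix
q, s = quantize_int4(w)
err = np.abs(w - dequantize(q, s)).mean()
# 4 bits vs 16 bits: 4x less memory traffic, at a small reconstruction error.
print(f"mean abs error: {err:.4f}")
```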
Resources
Building LLMs from the Ground Up: A 3-hour Coding Workshop (Sebastian Raschka, Lightning AI - deep learning educator, previously Professor at the University of Wisconsin-Madison)
Physics of Language Models series. With these papers, the authors cover fundamental explainability and engineering aspects of LLMs.
Part 2.1, Grade-School Math and the Hidden Reasoning Process
Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
(Zeyuan Allen-Zhu, Yuanzhi Li - Meta / Mohamed bin Zayed University of AI)
Learn Machine Learning Effectively: A 2024 Roadmap. A great set of resources to help with this journey at all levels (Smitha Kolan, School of Machine Learning)
Awesome NLP - everything NLP-related (Keon Kim, GitHub page)
Awesome MLOps - everything MLOps-related (Dr. Larysa Visengeriyeva, GitHub page)