I'm excited to share that the Data Phoenix Digest is back on a weekly schedule after a short break. We're turning Data Phoenix into a Data & AI community run as an educational project.
We're going to change how we do our weekly updates a little, too. We will make sure you know about all the latest news and events in our community so you can get involved. Plus, we will share key insights from the best research papers, articles, and news, helping you keep up with what's new and learn more in the Data & AI field!
Be active in our community and join our Slack to discuss the latest community news, top research papers, articles, events, jobs, and more...
Data Phoenix community news
- Webinar "Why you should move to a Lakehouse" / May 25
- Webinar "Unlocking Data Value with Large Language Models" / June 15
Video recordings of past events:
- Webinar "Making AI Easy with YOLOv8"
- Webinar "Introduction to Graph Neural Networks"
- Webinar "Evaluating XGBoost for balanced and Imbalanced datasets"
Summary of the top papers and articles
GPT in 60 Lines of NumPy
This article explains how you can implement a GPT from scratch in just 60 lines of NumPy. The pretrained GPT-2 weights are then loaded into that implementation to generate some text. If you are looking for a simple, educational introduction to GPT, this article is for you.
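To give a flavor of what such a NumPy implementation involves, here is a minimal sketch of single-head causal self-attention, one of the building blocks of a GPT. It is not taken from the article: the function names, shapes, and random weights are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    # x: [n_tokens, d_model]; w_q, w_k, w_v: [d_model, d_head] (hypothetical shapes)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    n = x.shape[0]
    # Large negative values above the diagonal keep a token from attending to later tokens.
    mask = (1.0 - np.tri(n)) * -1e10
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    weights = softmax(scores)
    return weights @ v, weights

# Random inputs stand in for real token embeddings and trained weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = causal_self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and is zero above the diagonal, so every position only mixes information from itself and earlier positions.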
Koala: A Dialogue Model for Academic Research
Koala is a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web. This article describes the dataset curation and training process of the model, and also presents the results of a user study that compares the model to ChatGPT and Alpaca.
Unleashing ML Innovation at Spotify with Ray
The goal for Spotify’s ML Platform is to create a seamless user experience for AI/ML practitioners who want to take an ML application from development to production. In this article, you can find a deep dive into how this ML Platform works. Check it out!
StackLLaMA: A hands-on guide to train LLaMA with RLHF
This article delves into the steps involved in training a LLaMA model to answer questions on Stack Exchange with RLHF, through a combination of Supervised Fine-tuning (SFT), Reward / preference modeling (RM), and Reinforcement Learning from Human Feedback (RLHF). Check it out!
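As a rough illustration of the reward modeling step, the sketch below computes a pairwise preference loss: the reward model is penalized when it scores the rejected answer above the chosen one. This is a common formulation and is assumed here, not quoted from the article; the function name and scalar inputs are hypothetical.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    # Loss = -log(sigmoid(reward_chosen - reward_rejected)),
    # rewritten as log(1 + exp(-diff)) and evaluated in a numerically stable way.
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss is small when the chosen answer scores higher, large otherwise.
loss_good = pairwise_reward_loss(2.0, 0.0)
loss_bad = pairwise_reward_loss(0.0, 2.0)
```

In a real training loop the two rewards would come from a model scoring a pair of answers to the same question, and the loss would be averaged over a batch.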
How to train your own Large Language Models
LLMs have made a significant impact in the field of AI, but most companies currently lack the ability to train these models themselves, relying instead on a few major tech firms as providers. Replit has made significant investments in developing the infrastructure necessary to train their own LLMs from scratch. In this blog post, they explain how they did it.
Papers & projects
SegGPT: Segmenting Everything In Context
SegGPT is a generalist model for segmenting everything in context. It can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks.
Transformer models: an introduction and catalog
This paper offers a somewhat comprehensive but simple catalog and classification of the most popular Transformer models. The paper also includes an introduction to the most important aspects and innovations in Transformer models.
Attending to Graph Transformers
In this paper, the authors derive a taxonomy of graph transformer architectures, bringing some order to this emerging field. They overview their theoretical properties, survey structural and positional encodings, and discuss extensions for important graph classes.
PaLM-E: An Embodied Multimodal Language Model
In this paper, the authors propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Learn how their new approach plays out!
Mask-Free Video Instance Segmentation
In this paper, the authors propose a method for removing the mask-annotation requirement in Video Instance Segmentation (VIS). MaskFreeVIS achieves highly competitive VIS performance while using only bounding box annotations for the object state. The Temporal KNN-patch Loss (TK-Loss) is used to provide strong mask supervision without any labels.
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit downstream code tasks. Such flexibility is enabled by a mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pre-training tasks.