Data Phoenix Digest - ISSUE 6.2023

Upcoming webinars about unlocking data value with LLMs and reducing NLP inference costs, testing ML models for production, how to train your own LLM, CompressGPT, FriendlyCore, 3D generation on ImageNet, guidance for diffusion models, S-NeRF, QA4RE, and more.

Dmitry Spodarets

Hey folks,

I'm excited to share that the Data Phoenix Digest is back on a weekly schedule after a short break. We're turning Data Phoenix into an education-focused community for Data & AI.

We're going to change how we do our weekly updates a little, too. We will make sure you know about all the latest news and events in our community so you can get involved. Plus, we will share key insights from the best research papers, articles, and news, helping you keep up with what's new and learn more in the Data & AI field!

Be active in our community and join our Slack to discuss the latest community news, top research papers, articles, events, jobs, and more.

Want to promote your company, conference, job, or event to the Data Phoenix community of Data & AI researchers and engineers? Click here for details.

Data Phoenix community news

Upcoming events:

Video records of past events:

Featured Article

Get ready for some thrilling updates on our Slack! By becoming a member, you'll be entered for a chance to win one of three copies of the book, "Experimentation for Engineers." But that's not all! As a special bonus, we're also offering an exclusive 35% discount code (bldataphoenix23) on all Manning Publications products in any format. Don't let this incredible opportunity slip away!

Summary of the top papers and articles


GPT in 60 Lines of NumPy
This article explains how to implement a GPT from scratch in just 60 lines of NumPy. It then loads the pretrained GPT-2 weights released by OpenAI into that implementation and generates some text. If you are looking for a simple, educational introduction to GPT, this article is for you.
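
To give a flavor of the kind of building block such an implementation relies on, here is a minimal sketch of single-head causal self-attention in NumPy. It is not taken from the article: the function names, the weight shapes, and the random inputs are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    # x: [n_tokens, d_model]; each weight matrix: [d_model, d_head] (illustrative shapes).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    n = x.shape[0]
    # Large negative values above the diagonal stop a token from attending to later tokens.
    mask = (1 - np.tri(n)) * -1e10
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    return softmax(scores) @ v

# Hypothetical random inputs, only to exercise the function.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
```

Because of the mask, the output for a given token depends only on that token and the ones before it, which is what lets a GPT-style model generate text one token at a time.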

Koala: A Dialogue Model for Academic Research
Koala is a chatbot trained by fine-tuning Meta’s LLaMA on dialogue data gathered from the web. This article describes the dataset curation and training process of the model, and also presents the results of a user study that compares the model to ChatGPT and Alpaca.

Unleashing ML Innovation at Spotify with Ray
The goal of Spotify’s ML Platform is to create a seamless user experience for AI/ML practitioners who want to take an ML application from development to production. This article offers a comprehensive deep dive into how the platform works. Check it out!

StackLLaMA: A hands-on guide to train LLaMA with RLHF
This article delves into the steps involved in training a LLaMA model to answer questions on Stack Exchange with RLHF: supervised fine-tuning (SFT), reward/preference modeling (RM), and reinforcement learning from human feedback (RLHF). Check it out!
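
As a rough illustration of the reward modeling step, reward models of this kind are commonly trained with a pairwise ranking loss that pushes the score of the preferred answer above the score of the rejected one. The sketch below uses plain NumPy; the function name and the example scores are hypothetical, and it is not the article's training code.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    # Loss is -log(sigmoid(r_chosen - r_rejected)), averaged over pairs.
    # -log(sigmoid(d)) equals log(1 + exp(-d)); logaddexp computes it stably.
    d = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.logaddexp(0.0, -d)))

# Hypothetical reward scores for two (preferred, rejected) answer pairs.
loss = pairwise_reward_loss([1.5, 0.2], [0.3, 0.9])
```

The loss shrinks as the preferred answer's score moves further above the rejected answer's score, and grows when the ordering is wrong.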

How to train your own Large Language Models
LLMs have made a significant impact in the field of AI, but most companies currently lack the ability to train these models themselves, relying instead on a few major tech firms as providers. Replit has made significant investments in developing the infrastructure necessary to train their own LLMs from scratch. In this blog post, they explain how they did it.

Papers & projects

SegGPT: Segmenting Everything In Context
SegGPT is a generalist model for segmenting everything in context. It can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks.

Transformer models: an introduction and catalog
This paper offers a relatively comprehensive but simple catalog and classification of the most popular Transformer models. It also includes an introduction to the most important aspects and innovations in Transformer models.

Attending to Graph Transformers
In this paper, the authors derive a taxonomy of graph transformer architectures, bringing some order to this emerging field. They give an overview of the architectures' theoretical properties, survey structural and positional encodings, and discuss extensions for important graph classes.

PaLM-E: An Embodied Multimodal Language Model
In this paper, the authors propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Learn how their new approach plays out!

Mask-Free Video Instance Segmentation
In this paper, the authors propose a method for removing the mask-annotation requirement in Video Instance Segmentation (VIS). Their method, MaskFreeVIS, achieves highly competitive VIS performance while using only bounding box annotations for the object state. The Temporal KNN-patch Loss (TK-Loss) is used to provide strong mask supervision without any labels.

CodeT5+: Open Code Large Language Models for Code Understanding and Generation
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit downstream code tasks. This flexibility is enabled by a mixture of pretraining objectives designed to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks.

If you enjoy our work, we would greatly appreciate your support by sharing our digest with your friends on Twitter, LinkedIn, or Facebook using the hashtag #dataphoenix. Your help in reaching a wider audience is invaluable to us!