Data Phoenix Digest - ISSUE 10.2023
Hey folks,
Welcome to this week's edition of Data Phoenix Digest! This newsletter keeps you up to date on our community news and summarizes the top research papers, articles, and events to help you keep track of trends in the Data & AI world!
Be active in our community: join our Slack to discuss the latest news, top research papers, articles, events, jobs, and more...
Data Phoenix community news
AI Events Calendar
We are happy to announce that our new AI events calendar has launched with a weekly newsletter. The calendar is already filled with exciting and valuable events, and the first issue of our newsletter, featuring a selection of upcoming events for the week, will kick off this weekend. If you're organizing webinars, workshops, meetups, conferences, or hackathons, add them to our calendar, and we'll gladly help spread the word to our community.
Upcoming webinars:
Multilingual Semantic Search
Connecting Large Language Models with embeddings and semantic search on your own data has become widely popular. But how does this work in other languages and across languages? Join me for this talk to learn why multilingual semantic search is amazing, how the respective models are trained, and which new use cases it unlocks.
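For a taste of the underlying mechanics before the talk, here is a minimal sketch of cross-lingual retrieval with the open-source sentence-transformers library. The model name and sample texts are illustrative choices on our part, not taken from the talk itself:

```python
# Minimal multilingual semantic search sketch (illustrative, not from the talk).
# Assumes the sentence-transformers library and one of its public
# multilingual checkpoints.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Documents in different languages share one embedding space.
docs = [
    "The weather is lovely today.",      # English
    "Das Wetter ist heute wunderbar.",   # German
    "Hoy hace un tiempo estupendo.",     # Spanish
]
doc_emb = model.encode(docs, convert_to_tensor=True)

# A query in one language can retrieve documents in another.
query_emb = model.encode("How is the weather?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]
best = int(scores.argmax())
print(docs[best], float(scores[best]))
```

Because all languages are embedded into the same vector space, the English query scores highly against the German and Spanish sentences too, which is the crux of cross-lingual search.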
Rise in the use of synthetic data for regulated industries
Synthetic data is evolving and becoming extremely important for organizations. This session will uncover key facts about synthetic data, walk through some of its most impactful use cases, and discuss the challenges companies face while harnessing its power.
How to use LLMs to Interface with Multiple Data Sources
Following emerging Large Language Model Operations (LLM Ops) best practices in the industry, you'll learn about the key technologies that enable Generative AI practitioners like you to build complex LLM applications. Specifically, we'll take a deep dive into "data frameworks" like LlamaIndex and demonstrate how to create state-of-the-art hierarchical indexes from different data sources. During the event, we will also show how another well-known LLM Ops framework (LangChain) underlies much of LlamaIndex's functionality. All demo code will be provided via GitHub links during and after the event!
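As a preview of what a basic LlamaIndex pipeline looks like, here is a hedged minimal sketch of indexing and querying a local folder of documents. The API names follow 2023-era llama_index releases (newer versions moved these imports to llama_index.core), and the folder path and question are placeholders; this is not the event's demo code:

```python
# Minimal llama_index sketch (illustrative, not the event's demo code).
# An OpenAI API key is assumed by the default embedding/LLM settings.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest a folder of files
index = VectorStoreIndex.from_documents(documents)     # embed and index them

query_engine = index.as_query_engine()
response = query_engine.query("What do these documents say about X?")
print(response)
```

The webinar goes well beyond this flat index into hierarchical indexes over multiple data sources, but the ingest/index/query loop above is the shape everything builds on.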
Video recordings of past events:
Summary of the top articles and papers
Articles
Time-Series Forecasting: Deep Learning vs Statistics — Who Wins?
This article provides a comprehensive and unbiased view of the application of Deep Learning to time-series forecasting, particularly focusing on pre-trained transformers borrowed from Natural Language Processing (NLP). Check it out!
Accelerating Stable Diffusion Inference on Intel CPUs
Recently, Intel introduced the latest generation of Intel Xeon CPUs (code-named Sapphire Rapids). In this article, the Hugging Face team demonstrates different techniques to accelerate Stable Diffusion models on Sapphire Rapids CPUs.
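To give a flavor of the kind of optimization the article covers, here is a hedged sketch of bfloat16 inference with Intel Extension for PyTorch (IPEX). It is simplified relative to the article, which applies several further optimizations; the model ID is simply the standard public Stable Diffusion checkpoint:

```python
# Sketch of one acceleration technique: bf16 inference via IPEX on a
# Sapphire Rapids CPU. Simplified; see the article for the full recipe.
import torch
import intel_extension_for_pytorch as ipex
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Optimize the heaviest component (the UNet) for bf16 CPU execution.
pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16, inplace=True)

with torch.inference_mode(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```

bfloat16 matters here because Sapphire Rapids ships AMX instructions that accelerate bf16 matrix math, which is where diffusion models spend most of their time.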
WALTS: Walmart AutoML Libraries, Tools and Services
WALTS is an enterprise-scale AutoML framework designed to meet the rising demand for ML in business. In this article, the authors describe how they explore models from a pool of candidates and validate the selected one on a business use case.
Introduction to mypy
The article explores how mypy, by adding type annotations and checks, can help discover bugs statically, before the code ever runs, thereby enhancing the efficiency of Python development. It takes readers from beginner level to a solid understanding of mypy through a variety of examples.
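To illustrate the core workflow, here is a tiny self-contained example of our own (not from the article) of a type-annotated function and the kind of error mypy reports:

```python
# tiny_example.py -- a hypothetical snippet showing what mypy catches.
def total_price(prices: list[float], discount: float) -> float:
    return sum(prices) * (1 - discount)

# mypy flags this call without ever executing the code:
# error: Argument 2 to "total_price" has incompatible type "str"; expected "float"
total_price([9.99, 4.50], "10%")
```

Running `pip install mypy` and then `mypy tiny_example.py` surfaces the error at check time, whereas plain Python would only fail once this line actually runs.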
Papers & projects
Llama 2: Open Foundation and Fine-Tuned Chat Models
This paper introduces Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. The fine-tuned Llama 2-Chat models, optimized for dialogue, outperform open-source chat models on multiple benchmarks. The authors comprehensively describe their fine-tuning approach, safety enhancements, and human evaluations, aiming to facilitate community engagement and responsible development of LLMs.
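The released checkpoints are available on the Hugging Face Hub (gated behind Meta's license). Assuming access has been granted, a minimal generation sketch with transformers looks like this; the model ID matches the official Hub repos, everything else is illustrative:

```python
# Minimal sketch of running a Llama 2 chat checkpoint with transformers.
# Requires approved access to the gated weights; device_map="auto" also
# assumes the accelerate package is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```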
MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset
This work introduces Volume Fusion (VF), a novel self-supervised learning strategy for 3D segmentation model pretraining using unannotated medical images. VF fuses random patches from foreground and background sub-volumes, leveraging fusion coefficients as self-supervised segmentation targets. The proposed model, pretrained on 110k unannotated 3D CT volumes, demonstrates superior performance compared to training from scratch and state-of-the-art self-supervised methods on various downstream segmentation tasks involving head and neck organs, as well as thoracic/abdominal organs.
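To make the fusion step concrete, here is a hedged toy sketch of the idea as the abstract describes it: blend patches of one unannotated sub-volume into another with a random discretized coefficient, and use that coefficient map as a free voxel-wise segmentation target. The patch sampling, class count, and shapes are illustrative guesses, not the authors' implementation:

```python
# Toy sketch of Volume Fusion (illustrative, not the authors' code).
import torch

def volume_fusion(fg: torch.Tensor, bg: torch.Tensor,
                  num_classes: int = 5, num_patches: int = 20):
    """fg, bg: (D, H, W) sub-volumes from unannotated CT scans."""
    coeff = torch.zeros_like(fg)  # class 0 = pure background everywhere
    D, H, W = fg.shape
    for _ in range(num_patches):
        # Random patch size and position inside the volume.
        d, h, w = (int(torch.randint(4, s // 2, (1,))) for s in (D, H, W))
        z, y, x = (int(torch.randint(0, s - p + 1, (1,)))
                   for s, p in zip((D, H, W), (d, h, w)))
        k = int(torch.randint(1, num_classes, (1,)))  # discrete fusion level
        coeff[z:z+d, y:y+h, x:x+w] = k / (num_classes - 1)
    fused = coeff * fg + (1.0 - coeff) * bg
    labels = (coeff * (num_classes - 1)).round().long()  # self-supervised target
    return fused, labels
```

The payoff is that a standard segmentation network can be pretrained on (fused, labels) pairs generated for free, with no manual annotation, before fine-tuning on the downstream organ-segmentation tasks.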
A Survey of Large Language Models
In this survey, the authors review recent advances in LLMs, focusing on four major aspects: pre-training, adaptation tuning, utilization, and capacity evaluation. They also summarize the available resources for developing LLMs and discuss remaining issues and future directions.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter is a lightweight adaptation method for efficiently fine-tuning LLaMA into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter introduces only 1.2M learnable parameters on top of the frozen LLaMA 7B model and takes less than one hour to fine-tune on 8 A100 GPUs.
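The "zero-init attention" in the title can be sketched as a learnable, zero-initialized gate on the adapter's attention contribution, so training starts exactly at the frozen model's behavior. The module below is our simplified illustration of that gating idea, not the authors' code; shapes and wiring are reduced relative to the paper:

```python
# Hedged sketch of zero-init gating (illustrative, not the paper's code).
import torch
import torch.nn as nn

class ZeroInitGate(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # One gate per head, initialized to zero: the adapter is silent
        # at step 0, so early training cannot disturb the frozen model.
        self.gate = nn.Parameter(torch.zeros(num_heads, 1, 1))

    def forward(self, frozen_out: torch.Tensor, adapter_out: torch.Tensor):
        # frozen_out, adapter_out: (batch, heads, seq, head_dim)
        return frozen_out + torch.tanh(self.gate) * adapter_out
```

Starting the gate at zero is what makes the method stable despite its tiny parameter budget: gradients gradually open the gate only where the adapter's signal helps.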
Animated Drawings
Animated Drawings is a system that automatically animates children's drawings of the human figure, is robust to the variance inherent in these depictions, and is simple enough for anyone to use. Here you can find the Animated Drawings Demo, a freely available public website that has been used by millions of people around the world.