Data Phoenix Digest - ISSUE 25

NEWS

What's new this week?

Digital twins for traffic. "Hey, Disney!" voice assistant. The rise of the AI inventor. AI vs. brain cancer in children. Metal 3D printing.

Funding News

  • Malbek, an AI-fueled Contract Lifecycle Management platform, raises a $15.3 million Series A funding round led by Noro-Moseley Partners.
  • Leena AI, an AI-powered conversational platform, raises $30 million in a Series B funding round led by Bessemer Venture Partners.
  • Astera Labs, the industry leader for connectivity solutions for intelligent systems, raises $50 million as part of an oversubscribed Series C funding round led by Fidelity Management and Research.

ARTICLES

GPT-4 Will Have 100 Trillion Parameters — 500x the Size of GPT-3
In this overview article, you'll learn about the potential (and limits) of GPT-4, an autoregressive language model that is designed to outperform GPT-3. It may be released next year!

How to Create an AutoML Pipeline Optimization Sandbox
In this article, we'll look at how to implement an automated machine learning pipeline optimization sandbox as a web app using Streamlit and TPOT.

Supercharge Image Classification with Transfer Learning
Hop in to learn how to leverage pretrained ResNets from TensorFlow Hub, which are easy to fine-tune on new datasets through transfer learning.

Recognizing People in Photos Through Private On-Device Machine Learning
Apple's ML research team explains their approach to recognizing people in photos in various poses and wearing extreme accessories by using private, on-device machine learning.

DagsHub — GitHub for Data Science
DagsHub is an open-source DS/ML collaboration platform that allows you to quickly build, scale, and deploy ML projects by leveraging the power of Git and DVC. Check this one out!

Multilayer Perceptron Explained with a Real-Life Example and Python Code: Sentiment Analysis
This is the first in a series of articles dedicated to Deep Learning. Learn about the Multilayer Perceptron, a neural network that learns the relationship between linear and non-linear data.

How Airbnb Enables Consistent Data Consumption at Scale
In a series of articles, Airbnb explains how they use Minerva for handling data and analytics. Check out the first post and the second post to learn the details.

Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot
In this article, you'll learn how Uber used open-source technology to build its first “near real-time” exactly-once event processing system for ads on Uber Eats.

Why Data Scientists Shouldn’t Need to Know Kubernetes
Should data scientists be good at Kubernetes, or not? This post argues that they can do their DS work with a good infrastructure abstraction tool instead of struggling to get YAML files to work.

PAPERS

LightAutoML: AutoML Solution for a Large Financial Services Ecosystem
LightAutoML is an AutoML system developed for a large European financial services company and already deployed in numerous applications. The paper presents an overview of it.

Revisiting 3D ResNets for Video Recognition
In this paper, the researchers explore training and scaling strategies for video recognition models and propose a simple scaling strategy for 3D ResNets.

Dual-Camera Super-Resolution with Aligned Attention Modules
The paper presents a novel approach to reference-based super-resolution with a focus on dual-camera super-resolution, which uses reference images to produce high-quality, high-fidelity results.

Pix2seq: A Language Modeling Framework for Object Detection
Pix2seq is a simple and generic framework for object detection. The authors cast object detection as a language modeling task conditioned on the observed pixel inputs.

Julia for Biologists
Julia is a programming language that meets the challenges in the computational biosciences, including collecting, curating, processing, and analyzing large genomic and imaging datasets.

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
TrOCR is a model pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. It outperforms other models on both printed and handwritten text recognition tasks.

Merlion: A Machine Learning Library for Time Series
Merlion is an open-source ML library for time series. It features a unified interface for many commonly used models and datasets for anomaly detection and forecasting.

FreeStyleGAN: Free-view Editable Portrait Rendering with the Camera Manifold
In this paper, the authors introduce the notion of the GAN camera manifold, the key element to precisely define the range of images that the GAN can reproduce in a stable manner.

Chemical-Reaction-Aware Molecule Representation Learning
Hongwei Wang et al. propose using chemical reactions to assist in learning molecule representations, preserving the equivalence of molecules with respect to chemical reactions in the embedding space.

PLATO-XL: Exploring the Large-Scale Pre-Training of Dialogue Generation
Siqi Bao et al. present the PLATO-XL models, with up to 11 billion parameters, trained on both Chinese and English social media conversations, to explore the limits of dialogue generation pre-training.

CHEATSHEETS

Data Science Cheatsheet 2.0
The cheat sheet is based on MIT's Machine Learning courses 6.867 and 15.072. It includes the information you need for exam reviews, interview prep, and anything in between.

CODE & TOOLS

JupyterLab Desktop App
JupyterLab App is the cross-platform standalone application distribution of JupyterLab. It is a self-contained desktop app that bundles a Python environment with several popular Python libraries, ready to use in scientific computing and data science workflows.

JOBS

Looking to feature your open positions in the digest? Kindly reach out to us at editor@dataphoenix.info for details. We'll be proud to help your business thrive!