Data Phoenix Digest - 23.09.2021

CV for workplace security, a new wave of investment in NLP, the webinar "Pachyderm in production: lessons learned", fast AutoML with FLAML + Ray Tune, improving neural network subspaces, YOLOv5 on CPUs, learning neural causal models with active interventions, videos, datasets, jobs, and more ...

Dmitry Spodarets


What's new this week?

Computer Vision for workplace security. A new wave of investment in NLP. Synthetic data. AI fights cancer and accelerates discovery of new materials.

  • AI- and Computer Vision-powered workplace safety and security systems are touted as a silver bullet solution for enterprises. But what about the privacy of workers?
  • Enterprises are increasing their investments in NLP: 60% of tech leaders say their NLP budgets grew by at least 10%, while 33% say their spending climbed by more than 30%.
  • According to research by Synthesis AI, 89% of tech decision makers who use vision data agree synthetic data is an innovative technology that organizations badly need to adopt.
  • The pharmaceuticals firm GSK partners with King’s College London to use AI to develop personalized treatments for cancer by investigating the role played by genetics in the disease.
  • Researchers at the University of Liverpool have developed an AI tool that can discover truly new materials. It has already helped to discover four new materials.

Funding News

  • HeartLab raises $2.45 million in seed funding to expand its AI-powered heart scanning and reporting platform to cardiologists in the United States.
  • Blackbird closes a $10 million Series A in a bid to prepare to launch the next version of its disinformation intelligence platform this fall.
  • Sorcero closes a $10 million Series A financing round led by CityRock Venture Partners, to support the increasing demand from new and existing customers.

The Data Phoenix Events team invites you to the next "The A-Z of Data" webinar on September 29. The topic is "Pachyderm in production: lessons learned".

In this talk, we will take a look at yet another MLOps tool, Pachyderm. This tool is gaining popularity and is uniquely suited to some use cases. The speaker will share his experience of applying Pachyderm to a real-world big data NLP project. Most importantly, we will see the hidden limitations of Pachyderm and why it's not quite the tool it claims to be.

Speaker: Oleh Lokshyn is a Machine Learning Architect at SoftServe. He has built ML workflows on GCP, Azure, and on-premises for different supervised and unsupervised models. Oleh holds several certifications: Google Cloud Professional Machine Learning Engineer, Google Cloud Professional Data Engineer, and Microsoft Certified Azure Data Scientist Associate.

Participation is free, but pre-registration is required.
Webinar language: Russian.


Probabilistic Machine Learning and Weak Supervision
In this article, the Watchful team demonstrates a proof of principle of how humans can collaborate with machines to label training data and to build machine learning models.

News Classification: Fine-Tuning RoBERTa on TPUs with TensorFlow
In this tutorial, you'll learn how to use a pre-trained RoBERTa model for a multiclass classification task with Hugging Face transformers.

Fast AutoML with FLAML + Ray Tune
FLAML is a lightweight Python library from Microsoft Research that finds accurate ML models. In this article, you'll find out how to implement and scale economical AutoML with FLAML.

Jellyfish: Cost-Effective Data Tiering for Uber’s Largest Storage System
Jellyfish, a data tiering solution for storage systems, has been successful in reducing the operating expenses and unlocking more savings for Uber. Let's learn more about it!

Scaling LinkedIn's Hadoop YARN Cluster Beyond 10,000 Nodes
LinkedIn uses Hadoop as its main platform for big data analytics and machine learning. Let's have a closer look at how the company has managed to scale Hadoop YARN.

Kedro — A Python Framework for Reproducible Data Science Project
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. Applied concepts include modularity, separation of concerns and versioning.

A Step-by-Step Guide for Detecting Causal Relationships Using Bayesian Structure Learning on Python
In this guide for beginners, you'll learn how to identify causal relationships in Python. For that, you'll use a combination of various Bayesian Structure Learning methods.

Define and Run Machine Learning Pipelines on Step Functions using Python, Workflow Studio, or States Language
In this AWS article, you'll learn how to run end-to-end ML pipelines in Step Functions using three different methods: Python, Drag and Drop, and JSON. Enjoy!

Topic Model Based Recommendation Systems
Want to get recommendation systems powered by ML? Check out this beginner-friendly guide by Jamie McGowan to learn the basics you need to know to produce recommendations.

Improving Neural Network Subspaces
In this article, you'll find an overview of Apple's approach to training a subspace of neural networks, with results that evaluate its effectiveness in terms of accuracy, calibration, and robustness.

YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Performance and a Smaller Footprint
The DeepSparse Engine combined with SparseML’s recipe-driven approach enables GPU-class performance for the YOLOv5 family of models. Try it for yourself!

Hypothesis Testing Explained
Hypothesis Testing is a form of inferential statistics that allows us to draw conclusions about an entire population based on a representative sample. Learn more about it in this 101 post!


RAMA: A Rapid Multicut Algorithm on GPU
RAMA is a highly parallel primal-dual algorithm for the multicut (i.e. correlation clustering) problem, a classical graph clustering problem widely used in machine learning and computer vision.

Learning Neural Causal Models with Active Interventions
In this paper, the researchers introduce an active intervention-targeting mechanism that enables quick identification of the underlying causal structure of the data-generating process.

Automatic Foot Ulcer Segmentation Using an Ensemble of Convolutional Neural Networks
Foot ulcers are associated with substantial morbidity and mortality. In this paper, the authors propose an ensemble approach based on LinkNet and UNet to perform foot ulcer segmentation.

NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo
In this paper, Yi Wei et al. present a new multi-view depth estimation method that utilizes both conventional SfM reconstruction and learning-based priors over the recently proposed NeRFs.

Finetuned Language Models Are Zero-Shot Learners
In this paper, the researchers propose a simple method that improves the zero-shot performance of language models on unseen tasks.


How Apple Scans Your Phone (And How to Evade It) - NeuralHash CSAM Detection Algorithm Explained
In this video, you'll learn about Apple's new algorithm that is capable of scanning images uploaded to iCloud for CSAM (child sexual abuse material). Let's theorize on methods to evade it.

Machine Learning Projects [Collection]
The playlist includes 17 machine learning projects spanning different topics and niches of ML work, from various prediction tasks to anomaly detection and face recognition.


A unique collection of datasets by Christoph Schuhmann. He claims that it is the world’s largest openly available image-text-pair dataset with 400 million samples.

100+ Open Audio and Video Datasets
A collection of audio and video datasets compiled by Twine. For each dataset, it lists the recordings, the participants involved, the languages of the speech content, the file size, and the file type.


Looking to feature your open positions in the digest? Kindly reach out to us at [email protected] for details. We'll be proud to help your business thrive!