DataScience Digest — 17.06.21

Facebook AI migrates its systems to PyTorch, metric learning tips & tricks, session-based recommender systems, AndroidEnv, materials from PyCon US 2021, and more...

Dmitry Spodarets


Facebook AI migrates its systems to PyTorch. Another self-driving car startup launches. The never-ending fight against bias, and AI systems that learn by watching YouTube. The EU mobilizes to rein in tech giants. Machine learning that speeds up simulations in materials science.

Facebook AI has migrated all of its AI systems to PyTorch. Within a year, more than 1,700 PyTorch-based inference models have reached full production at Facebook, and 93 percent of its new training models are on PyTorch.

The times are hardly perfect for self-driving car companies. Not a problem for Waabi. This Toronto-based self-driving car startup has secured $83.5 million in a Series A funding round led by Khosla Ventures. The company’s financial backers include Geoffrey Hinton, Fei-Fei Li, Pieter Abbeel, and Sanja Fidler.

Another frontier for AI is bias, and it seems that OpenAI has discovered a way to improve the “behavior” of language models with respect to ethical, moral, and societal values. Developers can now dictate the tone and personality of a model depending on the prompt it is given. And more is to come in this area: a newly released system learns to match images in videos with words by watching millions of YouTube videos with transcribed speech. This means a more “contextual” understanding of reality.

The move to make AI systems more efficient, secure, and unbiased can be interpreted from a business perspective, yet it makes sense to look at it legally as well. For example, the EU is now looking into AI-powered voice assistants. Not only will regulators reveal new regulations soon, but they will also limit the use of devices that show even the slightest “bias” issues.

And, finally, check out how AI and machine learning can help accelerate simulations in materials science, and read a review of MLOps platforms. A lot of insights there!


Building Scalable Machine Learning Pipelines for Multimodal Health Data on AWS
Machine learning is used extensively in the healthcare and life sciences industries. Among the many approaches and methods for increasing the accuracy and efficiency of ML models, multimodal ML stands out as one of the most promising. In this article, you’ll learn how to build a scalable cloud architecture for multimodal ML on health data.

Session-based Recommender Systems
In this extensive research report by Cloudera Fast Forward, you’ll learn all the ins and outs of designing, building, and managing AI/ML-powered recommender systems. The authors will demonstrate how to use specific algorithms and datasets to arrive at conclusions about the do’s and don’ts of building such systems (e.g. while using word2vec).

Metric Learning Tips & Tricks
In this article, the author presents ways of overcoming the limitations of classification, such as the number of training samples, production integration, and scaling. Specifically, he’ll explain how to train an object matching model with no labeled data and use it in production, to ensure metric learning is more scalable and flexible.

Tinkering with the Mobile Apps Dataset
In this article, the author demonstrates how you can use an open-source dataset featuring mobile apps data to build your own models. The article includes such steps as choosing a dataset, exploratory data analysis, feature engineering, and predicting with a model. The dataset and the models are available for re-use.

Testing Airflow DAGs
In this guide, the Astronomer team explores ways to test DAGs effectively. They also look into specific tests, such as validation testing, unit testing, and data and pipeline integrity testing, that they recommend to anybody running Airflow in production. The tests account for DAGs’ unique structure and relationship to other code and data.

Dynamically Generating DAGs in Airflow
In this guide, the Astronomer team looks into specific methods of dynamically generating DAGs in Airflow, from single-file methods to multiple-file methods. Every method is accompanied by code and examples. The team also presents DAG Factory, an open-source Python library for dynamically generating Airflow DAGs from YAML files.

Deep Learning for Projectile Trajectory Modeling
In this article, you’ll find a review of the paper entitled “Simulated Data Generation Through Algorithmic Force Coefficient Estimation for AI-Based Robotic Projectile Launch Modeling”, which proposes a novel method of modeling robotic launching of non-rigid objects using neural networks.

AI Traces the Origin of Metastatic Cancer Better than Humans
In this article by Receptor.AI, we’ll dive into cancer research. You’ll learn how machine learning can help doctors trace the origin of metastatic tumors in so-called CUP (Cancer of Unknown Primary) cases. The study shows that AI-based analysis could be used in various areas, from classical drug discovery to cancer diagnostics and therapy.


AndroidEnv: A Reinforcement Learning Platform for Android
In this paper, Daniel Toyama et al. introduce AndroidEnv, an open-source platform for Reinforcement Learning (RL) research built on top of the Android ecosystem. AndroidEnv allows RL agents to interact with a wide variety of apps and services commonly used by humans through a universal touchscreen interface.

Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks, but a recurring issue with this approach is the existence of trivial constant solutions. The authors propose Barlow Twins, an objective that avoids such collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible.
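
To make the idea concrete, here is a minimal sketch of such an objective in plain Python. It is an illustration under stated assumptions, not the authors' code: the function names, the list-of-rows input format, and the off_diag_weight default are all hypothetical choices made for this example. Each embedding dimension is standardized across the batch, the cross-correlation matrix between the two views is computed, and the loss penalizes any deviation from the identity matrix.

```python
import math


def standardize(column):
    """Return one embedding dimension with zero mean and unit variance over the batch."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(var) or 1.0  # a constant column has zero variance; avoid dividing by 0
    return [(v - mean) / std for v in column]


def cross_correlation(z_a, z_b):
    """Cross-correlation matrix between two batches of embeddings (lists of rows)."""
    n = len(z_a)
    cols_a = [standardize(list(c)) for c in zip(*z_a)]
    cols_b = [standardize(list(c)) for c in zip(*z_b)]
    return [
        [sum(a * b for a, b in zip(col_a, col_b)) / n for col_b in cols_b]
        for col_a in cols_a
    ]


def barlow_twins_loss(z_a, z_b, off_diag_weight=0.005):
    """Penalize the distance between the cross-correlation matrix and the identity.

    off_diag_weight is an assumed trade-off value chosen for this sketch.
    """
    c = cross_correlation(z_a, z_b)
    d = len(c)
    # Diagonal terms: matching dimensions of the two views should correlate perfectly.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Off-diagonal terms: different dimensions should not carry redundant information.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag
```

With two identical views whose dimensions are already decorrelated, the matrix equals the identity and the loss vanishes. When both dimensions carry the same signal, the off-diagonal terms are penalized, which is the redundancy-reduction part of the title.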

HateCheck: Functional Tests for Hate Speech Detection Models
In this paper, Paul Röttger et al. introduce HateCheck, a suite of functional tests for hate speech detection models. It features 29 model functionalities and test cases for each functionality. HateCheck has been tested on near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.

Domain Consensus Clustering for Universal Domain Adaptation
In this paper, Guangrui Li et al. investigate Universal Domain Adaptation (UniDA), which aims to transfer knowledge from a source domain to a target domain under an unaligned label space. The main challenge of UniDA is separating common classes from private classes. They propose Domain Consensus Clustering (DCC), which performs adaptation over the unaligned label space by encouraging discriminative target clusters. The code is available on GitHub.

Hallucination in Object Detection — A Study in Visual Part Verification
In this paper, Osman Semih Kayhan et al. introduce the first visual part verification dataset: DelftBikes, which has 10,000 bike photographs, with 22 densely annotated parts per image, where some parts may be missing. They hope that their study will help resolve the problem of hallucinating object detectors that detect missing objects.

Subdivision-Based Mesh Convolution Networks
This paper introduces a novel and flexible CNN framework (SubdivNet) for 3D triangle meshes with Loop subdivision sequence connectivity. Making an analogy between mesh faces and pixels in a 2D image allows the authors to present a mesh convolution operator that aggregates local features from adjacent faces. By exploiting face neighborhoods, this convolution can support standard 2D convolutional network concepts.

Event Materials

PyCon US 2021
This playlist features all keynotes, talks, and other materials from PyCon US 2021, a virtual conference for the community using and developing the open-source Python programming language. Over 80 videos in total!