Data Phoenix Digest - 15.07.2021

Robots that never stumble, AI looks into boiling (yes!), a 3D brain segmentation pipeline for MRI, elastic distributed training with XGBoost on Ray, ClawCraneNet, data science at the command line and more...

Dmitry Spodarets


What’s new this week?
AI as a design problem. Robots that never stumble. AI looks into boiling (yes!). China’s grip on AI-powered healthcare technology and AI’s potential in drug discovery. The autonomous journey of Saildrone Surveyor.

Many people don’t trust AI, which is understandable: for them, artificial intelligence is a mysterious black box that does something peculiar with their personal data. What can we, as engineers, do to help folks understand AI better? For instance, we could look at AI as a design problem, an integral part of real-world solutions that people can readily understand.

The robot situation is no better. People are afraid of them, maybe even more than of AI. And robots were really clumsy just a few years ago. Now, however, they can be powered by a new model for robotic locomotion that adapts in real time to any terrain it encounters. They will get you if they want.

Fear factor aside, AI can be a silver bullet for many of the problems humanity is now struggling with. For one thing, it can help researchers understand boiling better, which provides insights into the heating of almost anything, from computer chips to nuclear reactors. It has spurred a wave of healthcare technology innovations, a significant portion of which originate in China, and has given us tools to discover drugs faster and at a larger scale (think of how quickly we were able to create COVID vaccines).

The Saildrone Surveyor ship just made it to Hawaii from San Francisco. The autonomous, AI-powered boat took part in the largest attempt yet to map Earth’s undersea landscape and didn’t need a single human on board to cross half the Pacific.


Catalyst.Neuro: A 3D Brain Segmentation Pipeline for MRI
In this article, you’ll learn about Catalyst.Neuro, an advanced brain segmentation pipeline, the fundamental concepts it implements, and the different deep learning models it uses to perform brain segmentation tasks.

Effortless Distributed Training of Ultra-Wide GCNs
Graph independent subnetwork training (GIST) is a distributed training framework for large-scale graph convolutional networks (GCNs). It massively accelerates the training of GCNs for any architecture and can be used to enable training of large-scale models.

Reverse Engineering Generative Models from a Single Deepfake Image
Facebook AI in partnership with Michigan State University (MSU) presents a new method of detecting and attributing deepfakes. It relies on reverse engineering from a single AI-generated image to the generative model used to produce it.

Build End-to-End ML Workflows with Kubernetes and Apache Airflow
Halodoc’s team explains how they leveraged Apache Airflow and Kubernetes to move beyond crontab and manage batch inference workloads. In the previous article, they outlined how they built their ML platform using Kubeflow on Amazon EKS.

Overview of Deep Learning Architectures Computers Use to Detect Objects
In this article, the author provides a brief overview of deep learning architectures that help computers detect objects. It covers convolutional neural networks, object detection datasets, R-CNN, Fast R-CNN, YOLO, and more.

How to Build E(n) Equivariant Normalizing Flows, for Points with Features?
In this article, you’ll find the techniques that Emiel Hoogeboom and his team used to make E(n) Equivariant Flows. It explores Normalizing Flows, Continuous Time Normalizing Flows, E(n) Equivariant GNNs, and Argmax Flows, and shows how they tie together into E(n) Normalizing Flows.

Do You Read Excel Files with Python? There is a 1000x Faster Way.
Many Python users rely on Excel files to load and store data, because business people like to share data in Excel or CSV format. Unfortunately, Python is very slow with Excel files. In this article, you’ll find five ways to load data in Python, ending with a speedup of about three orders of magnitude.

The Importance of Layered Thinking in Data Engineering
This article will offer you guidelines to build sustainable and robust data pipelines. You’ll experiment with a real-world example and, by the end of the article, you’ll understand why a layered data engineering approach is a must.

Elastic Distributed Training with XGBoost on Ray
In this research article, the team at Uber Engineering discusses moving to distributed XGBoost on Ray and finding the right abstractions to seamlessly incorporate Ray and XGBoost Ray into Uber’s ML ecosystem.

Tuning Model Performance
Creating and maintaining a high-performing model is an iterative process. Michelangelo, Uber’s machine learning platform, provides a large catalog of features that can be used during the model development and tuning stages. Learn how to use them in this article.


ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation
In this paper, Chen Liang et al. introduce a novel approach that imitates how humans segment an object under language guidance. Extensive experiments on A2D Sentences and J-HMDB Sentences show that the method outperforms state-of-the-art methods by a large margin. Qualitative results also show that the team’s results are more explainable.

Graph Transformer Networks: Learning Meta-path Graphs to Improve GNNs
In this paper, you’ll learn about Graph Transformer Networks (GTNs), which generate new graph structures that exclude noisy connections and include connections useful for the task, while learning effective node representations on the new graphs in an end-to-end fashion. The paper also presents FastGTNs, an enhanced version that is 230x faster and uses 100x less memory than GTNs while allowing identical graph transformations.

Darker than Black-Box: Face Reconstruction from Similarity Queries
Anton Razzhigaev et al. propose a new approach to reconstructing a face by querying only the similarity scores of a black-box model. Although the algorithm operates in a more general setup, experiments show that it is query efficient and outperforms the existing methods.

Probabilistic Graph Reasoning for Natural Proof Generation
In this paper, Changzhi Sun et al. look into reasoning over natural language statements and propose PRobr, a novel approach for joint answer prediction and proof generation. Experiments on multiple datasets under diverse settings verify the effectiveness of PRobr.

Automated Graph Learning via Population Based Self-Tuning GCN
GCNs have been successfully applied to a broad range of tasks, such as node classification, link prediction, and graph classification. In this paper, the researchers propose a novel method to automate the training of GCN models through hyperparameter optimization.

Automated Evolutionary Approach for the Design of Composite Machine Learning Pipelines
The effectiveness of ML methods for real-world tasks depends on the proper structure of the modeling pipeline. Nikolay Nikitin et al. propose automating the design of composite ML pipelines by combining ideas from both automated ML and workflow management systems.


Data Science at the Command Line, 2e
Take a sneak peek at how the second edition of Data Science at the Command Line is being written. The book is scheduled to be published by O’Reilly Media in October 2021. This website offers you an opportunity to take a look around.


Machine Learning Course from University of Oxford
A collection of materials from Oxford’s 2014-15 course on Machine Learning. In total, 16 lectures, seven weeks of practicals, and four class sessions. All the required code is available on GitHub.


HANA_AutoML is a simple but powerful automated machine learning library for tabular data. It uses efficient in-memory SAP HANA algorithms to automate a wide range of routine tasks that data scientists encounter in their work.


Introducing the Habitat-Matterport 3D Research Data Set for Training Embodied AI
Facebook AI has released an open source licensed Matterport data set, the largest data set of indoor 3D scans to date. The Habitat-Matterport 3D Research Dataset (HM3D) is a collection of 1,000 Habitat-compatible 3D scans made up of accurately scaled residential spaces.