Data Phoenix Digest

DataScience Digest — 10.06.21

Machine learning in healthcare, the top 10 TED talks on AI, fraud detection in Uber, DatasetGAN, Text-to-Image generation via transformers, and more...

by Dmitry Spodarets

Updated June 10, 2021

NEWS

What’s new this week?

Machine learning in healthcare, from credibility concerns to real-world cases of solving unsolvable problems. The limits and mistakes of AI systems. Smart public transport. And an ongoing debate on AI security.

Can AI make patient care less accurate and efficient? As it turns out, yes, it can. Poor access to quality datasets, pressure to publish AI papers quickly without proper peer review, and legal constraints are major challenges that organizations and researchers face. And yet, AI can be a game changer for many; for example, for individuals with Parkinson’s disease.

AI is hardly a silver bullet, but for some it can become a bullet that kills. Did you know that the AI Incident Database, launched in late 2020, now contains 100 incidents? If you work for an AI business, make sure you don’t end up in this hall of shame; it’s not worth it. Any poorly designed AI is a problem that can cost lives, and we can’t expect AI to fix itself (at least, not yet).

Speaking of AI’s limitations… At this year’s International Conference on Learning Representations (ICLR), a team of researchers from the University of Maryland presented an attack technique meant to slow down deep learning models that have been optimized for fast, time-sensitive operations. So, while Florence, Italy, is testing AI to optimize its transit system, the proposed technique raises a question: “Should we give AI the power to truly control any system that, if hacked, may endanger human lives?”

ARTICLES

Introducing Orbit, An Open Source Package for Time Series Inference and Forecasting
Orbit (Object-ORiented BayesIan Time Series) is a general interface for Bayesian time series modeling developed by Uber Engineering. In this article, you’ll learn the ins and outs of Orbit, from the basics and use cases to a tutorial and benchmarks. Uber is going to introduce more dedicated Bayesian time series models, so the project is worth a look.

Fraud Detection: Using Relational Graph Learning to Detect Collusion
Uber’s popularity has attracted the attention of financial criminals in cyberspace. One type of fraudulent behavior is collusion, in which users cooperate to commit fraud. In this article, Uber Engineering presents a case study of applying a cutting-edge deep graph learning model called relational graph convolutional networks (RGCN) to detect such collusion.

KELM: Integrating Knowledge Graphs with Language Model Pre-training Corpora
Large pre-trained NLP models rely on natural language corpora from the Web, which limits their coverage and may cause misrepresentation of critical facts. Knowledge graphs feature structured data, but they are too hard to integrate with the existing pre-training corpora in language models. Google may have found the solution with KELM.

Airflow and Ray: A Data Science Story
In this article, you’ll learn about a Ray provider for Apache Airflow. Ray is a Python-first cluster computing framework that allows Python code, even with complex libraries or packages, to be distributed and run on clusters of any size, enabling fast transformations of Airflow DAGs into scalable machine learning pipelines.

A Checklist to Track Your Data Science Progress
Progress is fickle. You may think that you are moving forward while actually being stuck in a rut of repetition. That’s why you need a system to track your progress; for example, you can use this awesome checklist by Pascal Janetzky. Get an overview of your progress and find the next goal just by following these steps.

High-Performance Speech Recognition with No Supervision at All
AI-powered speech recognition is available only for a small fraction of languages. This is why Facebook AI developed wav2vec Unsupervised (wav2vec-U), a way to build speech recognition systems that require no transcribed data at all. It combines years of work in speech recognition, self-supervised learning, and unsupervised machine translation.

Fitness Navigator
In this article, the author shares his thoughts on the research paper Regularization for Deep Learning: A Taxonomy (2017) by J. Kukačka, V. Golkov, and D. Cremers, and discusses potential and existing ways of improving machine learning models. Dig in to learn about optimal fit, fit quality, and cost function.

Almost Free Inductive Embeddings Out-Perform Trained Graph Neural Networks in Graph Classification in a Range of Benchmarks
In this extensive research article, the author uses reported benchmarks to compare the actual performance of an Untrained Graph Convolutional Network (uGCN) with randomly assigned weights against a fully trained (end-to-end) Graph Convolutional Network (GCN) in a supervised setting. The results are quite interesting.
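
To make the idea of an untrained GCN concrete, here is a minimal sketch in plain Python. It is not the article's code: the function names, layer count, hidden size, weight initialization, and mean pooling are all assumptions chosen for illustration. It pushes node features through GCN-style layers whose weights are random and never trained, then averages the node vectors into one graph embedding that a separate classifier could consume.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Return D^{-1/2} (A + I) D^{-1/2}, the usual GCN propagation matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def untrained_gcn_embedding(adj: Matrix, features: Matrix,
                            hidden: int = 8, layers: int = 2, seed: int = 0) -> List[float]:
    """Hypothetical helper: embed a graph with random, untrained GCN layers."""
    rng = random.Random(seed)
    prop = normalized_adjacency(adj)
    h = features
    for _ in range(layers):
        in_dim = len(h[0])
        # Random weights that are never updated (assumed Gaussian initialization).
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(hidden)]
             for _ in range(in_dim)]
        # One GCN-style layer: propagate over neighbors, project, apply ReLU.
        h = [[max(0.0, v) for v in row] for row in matmul(matmul(prop, h), w)]
    # Mean-pool node vectors into a single graph-level embedding.
    return [sum(col) / len(h) for col in zip(*h)]


# A triangle graph with one-hot node features.
triangle = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(untrained_gcn_embedding(triangle, one_hot))
```

Because the seed fixes the random weights, the same graph always maps to the same embedding, so the embeddings can be computed once and fed to any off-the-shelf classifier.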

PAPERS

DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort
DatasetGAN is an automatic procedure for generating massive datasets of high-quality, semantically segmented images with minimal human effort. Presented by an international team of researchers, it outperforms all semi-supervised baselines and is on par with fully supervised methods that rely on labor-intensive annotations.

Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence
Generating long and coherent text is an important but challenging task. In this paper, the authors propose a long text generation model that represents the prefix sentences at sentence level and discourse level in the decoding process. Extensive experiments show that the model can generate more coherent texts than state-of-the-art baselines.

CogView: Mastering Text-to-Image Generation via Transformers
Text-to-Image generation is a challenging task that requires powerful generative models and cross-modal understanding. CogView is a 4-billion-parameter Transformer with a VQ-VAE tokenizer that, according to the authors, achieves a new state-of-the-art FID on blurred MS COCO, outperforming previous GAN-based models and the recent, similar DALL-E.

An Attention Free Transformer
Attention Free Transformer (AFT) is an efficient variant of Transformers that eliminates the need for dot-product self-attention. AFT-local and AFT-conv are two model variants that take advantage of locality and spatial weight sharing. AFT demonstrates competitive performance on all the benchmarks while remaining highly efficient.

ByT5: Towards a Token-Free Future with Pre-Trained Byte-to-Byte Models
In this paper, Linting Xue et al. demonstrate that a standard Transformer architecture can be used with minimal modifications to process byte sequences. They carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts.
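
As a rough illustration of what "token-free" input can look like, the sketch below maps text directly to its UTF-8 bytes instead of using a learned vocabulary. It is a simplified stand-in, not the paper's tokenizer: the helper names `encode` and `decode` and the `NUM_SPECIAL` offset for reserved IDs are assumptions made for this example.

```python
from typing import List

# Assumed number of IDs reserved for special tokens (padding, end of sequence, ...).
NUM_SPECIAL = 3


def encode(text: str) -> List[int]:
    """Map text to integer IDs, one per UTF-8 byte, shifted past the reserved IDs."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]


def decode(ids: List[int]) -> str:
    """Invert encode(); IDs below NUM_SPECIAL are treated as special and skipped."""
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode("utf-8", errors="ignore")


ids = encode("héllo")
print(len(ids))     # 6, because "é" occupies two bytes in UTF-8
print(decode(ids))  # héllo
```

The whole vocabulary is just 256 byte values plus the reserved IDs, so any language or misspelling can be represented; the price is longer sequences, which is the trade-off the paper characterizes.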

VIDEOS

Full Stack Deep Learning - UC Berkeley - 2021
This is a comprehensive course on full stack deep learning recorded at UC Berkeley by Sergey Karayev, Josh Tobin, and Pieter Abbeel. The course consists of 22 lectures covering everything from deep learning fundamentals to model deployment and monitoring.

The Top 10 TED Talks on AI
In this listicle, you’ll find short descriptions of and links to the best talks on AI and machine learning delivered on the TED platform. It includes talks by Ray Kurzweil, Fei-Fei Li, Nick Bostrom, Sam Harris, Garry Kasparov, and others.

PROJECTS

Know Your Data
Know Your Data (KYD) is a collection of 70+ TensorFlow datasets. It allows you to easily find and sort datasets by name and size, and choose the right dataset for your tasks. You can also check out the project’s documentation for more details.
