Data Phoenix Digest - ISSUE 9.2024
Upcoming changes at Data Phoenix and the scheduled webinars for July and August: Your AI has 99 problems; Building Time Series Foundational Models; Using Vector Databases for Multimodal Search and RAG; Improving LLM Quality, Speed and Cost with Dynamic Routing; Transformers need glasses!
Hi, everyone, and welcome to this month's edition of Data Phoenix Digest!
Today, I want to reveal some exciting upcoming changes at Data Phoenix and share what we have already begun implementing.
But first, I want to congratulate our team, contributors, and members on reaching the milestone of 10,000 members: great job, and a huge thank you to everyone for your contributions. Let's keep up the momentum and move forward together.
To make our online newspaper and community more engaging and beneficial, we have started implementing the following changes:
- Moving to Discord: We have launched our Discord server to address the limitation of Slack's 90-day message history. We are currently setting it up, and I invite everyone to join. If you're interested in helping with moderation and development, please message me directly. We would love to have more moderators on board.
- Launched Event Calendar: We have launched our event calendar in test mode. In addition to our events, we are gathering all AI and data-related events. If you organize events, feel free to add them to our calendar. We are happy to help you with the promotion.
- Community Survey: We are launching a survey to better understand the content you are interested in and how we can be more helpful to you. I would greatly appreciate your participation.
- Digest Changes: We will keep the monthly digest, in which I will share the latest news about Data Phoenix. In addition, we are introducing two new weekly newsletters: one dedicated to events from our calendar and the other to the latest news in the AI industry. Depending on the survey results, there may be more updates to come.
- Website Update: Together with our friends from Cyberink, we have started rebuilding our website. We aim to make it more user-friendly for our members and contributors, supporting our collective learning and growth in the AI and data world.
- Time for Data Phoenix Conference: After hosting many webinars, it's time to organize our first online conference. We are planning it for October, with the announcement scheduled for mid-July. If you would like to speak at the conference or support us as a sponsor, don't hesitate to contact me; there's no need to wait for the official announcement.
These are the key updates from Data Phoenix that I wanted to share today.
Cyberink: Where Technology Meets Creativity in Software Solutions
Are you searching for a top-notch software developer to work on your project but unsure where to begin? Look no further, as you've come to the right place. Allow me to introduce Cyberink, our IT company that excels in a wide range of technologies including cybersecurity, design, web/mobile development, and consultation services.
At Cyberink, we boast a team of experienced professionals skilled in building applications using popular technologies like React Native, React.js, CI/CD, AWS, Angular, JavaScript/TypeScript, HTML, PHP, CSS, Java, and C++. What sets us apart is our belief that every team member should have a broader engineering understanding rather than just coding skills. We strive to think creatively, provide valuable insights, and assist our clients in creating software that not only meets their needs but also drives profitability.
Still hesitant about choosing us for your IT needs? Take a moment to visit our website for more information. You can also explore our profile on the reputable platform Upwork to gain insight into our team's work history and skill sets, such as languages spoken, educational backgrounds, job scores, and overall expertise.
If you're ready to partner with our team, feel free to reach out to us via email, connect with us on LinkedIn, or request a callback on our website. We guarantee prompt communication and personalized attention to your software development needs. Cyberink is dedicated to creating software that is not only functional but also a work of art.
Now, let's dive into the upcoming webinars we have planned for July and August:
Your AI has 99 problems: Why high-quality data is the key to production-grade AI systems
- Speaker: Emmanuel Turlay (Founder/CEO of Airtrain AI (YC S22))
In AI, as in machine learning more broadly, data is at the crux of most problems. Whether you are developing GenAI apps, building AI agents, curating an evaluation set for your RAG pipeline, creating a training set to fine-tune an LLM, or trying to understand how your product is performing, you must become one with your data. In this talk, Airtrain Founder/CEO Emmanuel Turlay will discuss the importance of high-quality data for AI/ML data workflows, why and when you should take a data-first approach when choosing AI tooling, which real-life business and academic use cases benefit, and how Airtrain AI can help improve and maintain data quality.
Building Time Series Foundational Models: Past, Present and Future
- Speaker: Leo Pekelis (Chief Scientist at Gradient)
Time series data is ubiquitous across industries: a startup COO predicts customer demand; a clinician in the ICU reads medical charts; a stockbroker forecasts security prices. In the past, a technical and domain expert would build, train, and implement a new model for each task, in each industry's swim lane. This is a massive intellectual fragmentation bottleneck!
Luckily, transformer architectures, enabling zero-shot sequence modeling across modalities, are a perfect solution. We introduce a new frontier in transformer modalities - time series - where massive amounts of domain knowledge are taught to large time series models (LTSMs), forming a universal prior across forecasting, imputation, classification, and anomaly detection tasks.
Join us as we review the next frontier of AI, showcasing Gradient's LTSM: a novel architecture and a massive time series dataset that achieve state-of-the-art performance on time series tasks. Our foundational model and datasets are fully open-sourced. Finally, we preview multimodal foundational time series models, where working with time series data is as easy as prompting ChatGPT.
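To make the zero-shot framing concrete, here is a minimal Python sketch of the kind of interface a time series foundation model exposes: context in, forecast out, with no per-task training. The function name is hypothetical, and a naive seasonal baseline stands in for the pretrained LTSM so the example runs end to end; Gradient's actual API may look quite different.

```python
import numpy as np

# Sketch of the zero-shot interface a time series foundation model exposes:
# context in, forecast out, with no per-task training. The function is a
# stand-in (hypothetical, not Gradient's API); a naive seasonal repeat
# plays the role of the pretrained LTSM so the example runs end to end.
def zero_shot_forecast(context: np.ndarray, horizon: int, season: int = 24) -> np.ndarray:
    last_season = context[-season:]               # most recent daily cycle
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]   # repeat it over the horizon

# Two weeks of hourly demand with a daily cycle; forecast the next two days.
hourly_demand = 5.0 + np.sin(np.arange(24 * 14) * 2 * np.pi / 24)
prediction = zero_shot_forecast(hourly_demand, horizon=48)
print(prediction[:6])
```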
Using Vector Databases for Multimodal Search and Retrieval Augmented Generation
- Speaker: Zain Hasan (Developer Relations Engineer at Weaviate)
Many real-world problems are inherently multimodal, from the communicative modalities humans use, such as spoken language and gestures, to the force, sensory, and visual sensors used in robotics. For machine learning models to address these problems, interact more naturally and holistically with the world around them, and ultimately be more general and powerful reasoning engines, we need them to understand data across all of its corresponding image, video, text, audio, and tactile representations.
In this talk, Zain Hasan will discuss how we can use open-source multimodal embedding models in conjunction with large generative multimodal models that can see, hear, read, and feel data(!) to perform cross-modal search (searching audio with images, videos with text, etc.) and multimodal retrieval augmented generation (MM-RAG) at the billion-object scale with the help of open-source vector databases. He will also demonstrate, with live code demos, how performing this cross-modal retrieval in real time enables users to use LLMs that can reason over their enterprise multimodal data. The talk will revolve around how we can scale the usage of multimodal embedding and generative models in production.
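As a small taste of what cross-modal retrieval looks like in practice, here is a minimal sketch using an open-source CLIP checkpoint from the sentence-transformers library, which embeds text and images into a shared vector space. A brute-force cosine-similarity loop stands in for a vector database such as Weaviate, and the image file names are placeholders.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Minimal cross-modal search sketch: a CLIP model embeds text and images
# into the same vector space, so a text query can rank images directly.
# The brute-force similarity loop stands in for a vector database such as
# Weaviate; the image file names are placeholders.
model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["cat.jpg", "beach.jpg", "invoice.png"]  # hypothetical files
image_embeddings = model.encode([Image.open(p) for p in image_paths])

query_embedding = model.encode("a sunny coastline")  # text in, images out
scores = util.cos_sim(query_embedding, image_embeddings)[0].tolist()

for path, score in sorted(zip(image_paths, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```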
Improving LLM Quality, Speed and Cost with Dynamic Routing
- Speaker: Daniel Lenton (CEO at Unify)
In this talk, we explore how to improve the performance of LLM applications by dynamically routing each prompt to the most appropriate model, whilst balancing user preferences for quality, speed, and cost. Further, we will show how a custom-trained router can continually improve over time by training on the stream of data processed by the live application. We will demonstrate this for both RAG and co-pilot applications, explore how router training scales with dataset size, and examine some shortcomings of the approach. See you there!
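For intuition, here is a minimal sketch (our illustration, not Unify's implementation) of the core routing decision: each candidate model gets a score trading off predicted quality against speed and cost, weighted by user preferences, and the prompt goes to the highest scorer. In a real system the quality estimate would come from a trained router conditioned on the prompt; here it is hardcoded, and the normalization constants are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    predicted_quality: float   # would come from a trained router, in [0, 1]
    tokens_per_second: float   # observed throughput
    cost_per_1k_tokens: float  # provider pricing, USD

def route(candidates, w_quality=1.0, w_speed=0.2, w_cost=0.5):
    """Pick the model with the best weighted quality/speed/cost trade-off."""
    def score(c):
        # squash speed and cost into rough [0, 1] terms; the 100 tok/s and
        # $0.03/1k reference points are illustrative, not calibrated
        speed_bonus = min(c.tokens_per_second / 100.0, 1.0)
        cost_penalty = min(c.cost_per_1k_tokens / 0.03, 1.0)
        return w_quality * c.predicted_quality + w_speed * speed_bonus - w_cost * cost_penalty
    return max(candidates, key=score)

models = [
    Candidate("large-frontier-model", 0.92, 40, 0.0300),
    Candidate("mid-size-model",       0.80, 90, 0.0030),
    Candidate("small-fast-model",     0.65, 150, 0.0005),
]
print(route(models).name)  # "mid-size-model" under these example weights
```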
Transformers need glasses! Information over-squashing in language tasks
- Speaker: Petar Veličković (Staff Research Scientist, Google DeepMind; Affiliated Lecturer, University of Cambridge)
We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis – specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways – leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
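The low-precision point is easy to see numerically. The toy below is our illustration, not the talk's construction: if a model's sequence summary behaves like an average over tokens (as attention's convex mixing roughly does), then in float16 a single differing token in a long sequence can round away entirely, so a task like counting cannot be answered from the final representation.

```python
import numpy as np

# Toy illustration (ours, not the talk's construction): summarize a sequence
# as the mean of its token values, then round to float16, mimicking a
# low-precision activation after attention's averaging-style mixing.
def pooled_repr(tokens):
    return np.float16(np.mean(np.asarray(tokens, dtype=np.float64)))

n = 8192
seq_a = [1.0] * n        # e.g. "count the ones": the answer is n
seq_b = [1.0] * n
seq_b[17] = 0.0          # differs in exactly one token: the answer is n - 1

ra, rb = pooled_repr(seq_a), pooled_repr(seq_b)
print(ra, rb, ra == rb)  # 1.0 1.0 True: float16 cannot tell them apart
```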
Thank you for reading this digest to the end. If you have any questions, ideas on improving Data Phoenix, or potential collaboration opportunities, please don't hesitate to reach out. I look forward to connecting with you online or meeting in the SF Bay Area.
Best regards,
Dmytro Spodarets