Kolena, a startup founded by Mohamed Elgendy, Andrew Shi, and Gordon Hart, just raised $15 million in a funding round led by Lobby Capital. The startup is developing a platform that can test and validate solutions for any machine learning problem. According to TechCrunch's interview with Elgendy, this brings Kolena's total raised to $21 million, which will be spent expanding the startup's research team, developing partnerships with regulatory bodies, and increasing its marketing and sales efforts.
Elgendy stated in the same interview that what sets Kolena apart from its competitors is its focus on developing a new, comprehensive framework for model testing and validation, rather than merely simplifying current testing and evaluation practices. Because the platform is workflow-agnostic, it can do more than provide insights on data coverage or highlight the risks of deploying a specific instance of a given model: users can also create custom test cases, measure their model's performance against other models, and obtain explanations of potential issues that might be causing a model to underperform.
Kolena's strengths are showcased in a blog post series devoted to model testing and validation, focusing on OpenAI's GPT-4. In the latest post of the series, the team addresses GPT-4's apparent regression over time. In a nutshell, the issue is the following: in a surprising turn of events, some users reported, and a study found, that the June version of GPT-4 was less accurate at basic tasks it had succeeded at in March, such as identifying prime numbers and reconstructing the chain of reasoning behind a given answer. (It is striking that the June version of GPT-4 not only answers incorrectly, but also completely ignores the instruction to replicate its chain of thought.)
Following this finding, Kolena engineer Mark Chen set out to test the performance and accuracy of GPT-4 (June), GPT-4-0314 (March), and GPT-3.5. Using the Conversational Question Answering dataset (CoQA), he found that GPT-4 is the best performer overall, significantly outperforming GPT-3.5 in the number of correct answers given. The first surprising finding, however, is that while GPT-4 produces more correct answers than GPT-4-0314, it performs worse than the latter on every individual metric measured. Chen then stratified the CoQA data by data source and found that GPT-4 performs better when the data source is Wikipedia, but worse overall and for every other data source. He concluded that this test alone cannot establish whether GPT-4 was fine-tuned on Wikipedia, but that given OpenAI's goals for ChatGPT (assisting users with everyday queries for which Wikipedia is a relevant source), it would make sense to value complete comprehension of Wikipedia over a better understanding of other, less relevant and less frequent data sources.
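The core of this analysis, comparing aggregate accuracy against accuracy stratified by data source, is easy to sketch. The snippet below is a minimal illustration of the idea, not Kolena's actual implementation; the record fields (`source`, `correct`) are hypothetical names chosen for the example.

```python
from collections import defaultdict

def stratified_accuracy(results):
    """Compute overall and per-source accuracy from evaluation records.

    Each record is a dict with a 'source' key (e.g. 'wikipedia', 'news')
    and a boolean 'correct' flag. These field names are illustrative,
    not Kolena's actual schema.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in results:
        totals[record["source"]] += 1
        hits[record["source"]] += int(record["correct"])
    per_source = {src: hits[src] / totals[src] for src in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_source

# Toy records illustrating the pattern in Kolena's post: a model can
# lead on aggregate accuracy while its strength is concentrated in a
# single stratum such as Wikipedia-sourced questions.
records = [
    {"source": "wikipedia", "correct": True},
    {"source": "wikipedia", "correct": True},
    {"source": "wikipedia", "correct": True},
    {"source": "news", "correct": False},
    {"source": "news", "correct": True},
]
overall, per_source = stratified_accuracy(records)
print(overall)     # 0.8
print(per_source)  # {'wikipedia': 1.0, 'news': 0.5}
```

On these toy records the model looks strong in aggregate (80% correct), but stratification reveals that all of its errors come from the non-Wikipedia stratum, which is exactly the kind of imbalance an aggregate metric hides.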
This brings us back to another of Elgendy's statements in the TechCrunch interview: "For example, a model with 95% accuracy in detecting cars isn’t necessarily better than one with 89% accuracy. Each has their own strengths and weaknesses — e.g. detecting cars in varying weather conditions or occlusion levels, spotting a car’s orientation, etc." This sort of fine-grained, customizable testing may be just what Kolena needs to stay ahead of its competitors. Kolena's offerings become even more attractive when one considers its approach to privacy: in contrast to platforms that require users to upload their models or datasets, Kolena only stores testing results for benchmarking purposes, and even those can be deleted upon request.
Kolena is not yet fully commercially available, as the company seems to be following a per-request approach, but it plans to widen access to its framework by Q2 2024.