Less than a year ago, part of the team at the research initiative Epoch uploaded a paper to arXiv detailing the analysis that led them to conclude that high-quality training data for LLMs could run out as early as 2026. By contrast, low-quality language data may be exhausted between 2030 and 2050, and image data between 2030 and 2060. In addition, as models grow larger, the returns diminish: larger models demand more material resources, and their complexity makes it harder to optimize them and to avoid overfitting. High-quality datasets, meanwhile, have a simple yet powerful advantage: they represent the kind of language that models are expected to replicate.
As a result of the possible data shortage, developers and startups are starting to look for alternatives. These range from fine-tuning smaller models with higher-quality data and redefining the divide between high- and low-quality data to using the same dataset more than once or, as is the case with Aindo, producing synthetic data that replicates the statistical distributions and patterns found in the original real-life data. Several forecasts indicate that synthetic data will make up a substantial share of the data used to train future AI projects and that the field is commercially valuable.
Perhaps an indicator that these forecasts are on the right track is the news that Aindo has closed a €6 million Series A round led by United Ventures, with participation from Vertis SGR. The funding will allow the company to add ten employees to its team, who will continue developing solutions aligned with its mission. In particular, the money will be invested in research on synthetic data generation and data exchange, the platform's two main offerings.
Aindo specializes in developing synthetic datasets for the healthcare, financial, energy, and infrastructure sectors. Clients feed their unstructured data into the platform and receive it back in structured formats suitable for statistical analysis. The structured data can then be submitted for synthetic data generation. This process yields privacy-compliant data that preserves the statistical utility of the original dataset without revealing any sensitive information (hence its importance in the healthcare and financial sectors, for example). That data can be further analyzed to obtain novel information and insights.
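To make the idea of "preserving statistical utility" concrete, here is a minimal sketch in Python. It is not Aindo's method, which the article does not describe. It assumes a toy dataset with two numeric columns (hypothetical age and blood pressure values), fits a simple bivariate Gaussian to it, and samples new rows from that fit. The function names `fit_gaussian` and `sample_synthetic` are invented for illustration, and a simple model like this offers no formal privacy guarantee on its own.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, then normalize to get the Pearson correlation.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(params, n, seed=0):
    """Draw n new rows from the fitted bivariate Gaussian."""
    rng = random.Random(seed)
    rho = params["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = params["my"] + params["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical "real" data: age and a blood pressure value that depends on age.
_rng = random.Random(42)
real = []
for _ in range(2000):
    age = _rng.gauss(50, 12)
    pressure = 90 + 0.6 * age + _rng.gauss(0, 8)
    real.append((age, pressure))

real_params = fit_gaussian(real)
synthetic = sample_synthetic(real_params, 2000, seed=1)
synthetic_params = fit_gaussian(synthetic)

print("real:     ", {k: round(v, 2) for k, v in real_params.items()})
print("synthetic:", {k: round(v, 2) for k, v in synthetic_params.items()})
```

No synthetic row is a copy of a real one, yet the two summaries printed at the end should be close. That is the sense in which the statistical properties survive. Production systems typically rely on far richer generative models and explicit privacy checks.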
What sets Aindo apart from other companies specializing in synthetic data is that the platform enables what the company calls data exchange. Since the generated datasets are privacy-compliant and statistically representative of the originals, clients can safely share them with stakeholders or the research community, use them to find experts who meet their needs, and even profit by selling or trading them. The platform is currently available on a per-request basis.