Cartesia's $27M seed round will enable it to build better alternatives to transformer-based models
Cartesia, having raised $27 million in seed funding, is developing AI architectures for more efficient, long-memory, multimodal intelligence built on its State Space Model (SSM) technology, exemplified by its hyper-realistic Sonic voice generation model.
Cartesia recently announced that it secured a $27 million seed round led by Index Ventures, with participation from prominent investors including Lightspeed, General Catalyst, and 90 angel investors. The startup is building efficient models with a long memory to overcome a persistent limitation of transformer-based models: their inability to efficiently process more than a few minutes of audio or a few seconds of video at a time. The performance of current transformer-based models tracks their size, so handling longer multimodal inputs generally requires larger models. These, in turn, demand hardware resources few can afford.
With novel architectures like S4 and Mamba, Cartesia addresses the drawbacks of traditional transformer-based models: the new architectures' computational needs scale linearly with input size rather than quadratically, as self-attention does. Moreover, rather than having to go back over everything they have seen to retrieve a piece of data, Cartesia's models compress that information into a fixed-size state that is continuously updated.
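The idea of a fixed-size state that is updated once per input step can be illustrated with a toy linear recurrence. The following is a minimal sketch, not Cartesia's implementation and not the actual S4 or Mamba algorithm: the function name `ssm_scan` and the parameters `a`, `b`, and `c` are hypothetical, and a diagonal state transition is assumed for simplicity.

```python
def ssm_scan(xs, a, b, c):
    """Run a toy diagonal linear state-space recurrence over a sequence.

    Update rule (elementwise over the state):
        h_t = a * h_{t-1} + b * x_t
        y_t = sum(c * h_t)

    All names here are illustrative assumptions, not a real library API.
    """
    state = [0.0] * len(a)  # fixed-size state, independent of len(xs)
    ys = []
    for x in xs:  # one constant-cost update per input step
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return ys, state


# A single impulse decays through a one-dimensional state.
outputs, final_state = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])
print(outputs)  # [1.0, 0.5, 0.25]
```

Each step does the same amount of work regardless of how much input came before, so total cost grows linearly with sequence length, and the memory held between steps is only the state vector, whose size does not depend on the input length.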
Cartesia's standout achievement is Sonic, a hyper-realistic voice generation model, already in production with thousands of customers and available for testing at Cartesia's playground. Sonic resulted from the startup's research into a new SSM architecture for multi-stream models that can handle several streams of different data modalities at the same time. With this significant funding, Cartesia aims to continue pushing the boundaries of real-time, multimodal AI technologies that can compress massive contextual information and generate across multiple modalities seamlessly.