Meta has shared another round of research artifacts, including SAM 2.1 and Spirit LM

Meta's Fundamental AI Research (FAIR) team has made a habit of sharing the results of its work, whether by previewing research papers or releasing research tools, including open-source models, for the community to experiment with and build upon, all in the interest of benefiting everyone. Recently, FAIR released a new collection of open-source AI models and research artifacts, continuing its stated commitment to "innovating for the greater good".

The collection includes SAM 2.1, an updated checkpoint of the wildly popular Meta Segment Anything Model 2; Meta Spirit LM, a model that seamlessly integrates speech and text; Layer Skip, a method for accelerating LLM generation times; SALSA, code for testing the security of post-quantum cryptography standards; the model training codebase Meta Lingua; Meta Open Materials 2024, a model and dataset to drive AI-assisted inorganic materials discovery; and MEXMA, a new pre-trained cross-lingual sentence encoder.

The highlight of the research artifact release is undoubtedly SAM 2.1. The updated checkpoint comes slightly under three months after the SAM 2 launch. SAM 2.1 addresses previous limitations in handling visually similar and small objects while enhancing occlusion handling capabilities. According to FAIR, SAM 2 has been downloaded 700,000 times since its initial release, and the model has already found applications in fields ranging from medical imaging to meteorology. Simultaneously, FAIR has launched the SAM 2 Developer Suite, a package of open-source code to streamline building with SAM 2. The package includes code for fine-tuning SAM 2 with private data, as well as the front-end and back-end code for Meta's web demo.
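As a rough sketch of what building with SAM 2 looks like, the snippet below segments one object in an image from a single foreground click, using the image predictor from the open-source sam2 package. The checkpoint and config file names follow the 2.1 naming convention but may differ in a given install; example.jpg and the click coordinates are placeholders.

```python
# Minimal sketch: single-image segmentation with a point prompt via SAM 2.1.
# Checkpoint/config paths are assumptions; adjust to your local installation.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2.1_hiera_large.pt"   # assumed file name
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"   # assumed config path
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("example.jpg").convert("RGB"))  # placeholder image

with torch.inference_mode():
    predictor.set_image(image)
    # One foreground click at pixel (500, 375); label 1 marks it as foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks with scores
    )

best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```

The same predictor accepts box and mask prompts, and the Developer Suite's fine-tuning code targets the same checkpoints, so a model tuned on private data can be dropped into this flow unchanged.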

Another notable release is Meta Spirit LM, an open multimodal language model that seamlessly integrates speech and text. Unlike traditional pipelines that chain speech recognition, a text LLM, and text-to-speech, losing prosody along the way, Spirit LM is trained on interleaved speech and text data and preserves the expressive qualities of speech. The model comes in two variants: Spirit LM Base, which models speech with phonetic tokens, and Spirit LM Expressive, which adds pitch and style tokens to capture emotions such as anger or excitement and to generate speech that adheres to that tone.
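To make the interleaving idea concrete, the toy sketch below shows what a single mixed token stream might look like. It is purely illustrative, not the spiritlm library API: the sentinel tokens ([TEXT], [SPEECH]), the unit names (Hu*, Pi*, St*), and the helper function are all hypothetical, meant only to show text spans and speech-unit spans sharing one sequence.

```python
# Illustrative toy of a Spirit LM-style interleaved sequence.
# All token names and this helper are hypothetical, not the real tokenizer.
from typing import List

def interleave(text_words: List[str], speech_units: List[List[str]]) -> List[str]:
    """Alternate text spans and speech-unit spans in one token stream,
    marking each modality switch with a sentinel token."""
    seq: List[str] = []
    for word, units in zip(text_words, speech_units):
        seq.extend(["[TEXT]", word])
        seq.append("[SPEECH]")
        seq.extend(units)
    return seq

# Phonetic units (Hu*) plus the style (St*) and pitch (Pi*) tokens that the
# Expressive variant adds, interleaved with plain text words.
tokens = interleave(
    ["hello", "world"],
    [["St3", "Pi12", "Hu7", "Hu7", "Hu42"], ["Pi5", "Hu13", "Hu99"]],
)
print(tokens)
# ['[TEXT]', 'hello', '[SPEECH]', 'St3', 'Pi12', 'Hu7', ...]
```

Because one language model is trained over such mixed sequences, it can continue a prompt in either modality, which is what lets the Expressive variant carry an angry or excited tone from input speech through to generated output.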