Hugging Face recently launched HuggingSnap, an iOS application that runs SmolVLM2, a small but performant multimodal language model that accepts video, images, and text as inputs, and generates text in response. It can be used for:

  • vision understanding tasks, such as answering questions about, identifying, or describing objects within an image or video (see the sketch after this list);
  • generating text grounded in visual information, such as writing a story based on the contents of one or more images;
  • text-only tasks, as one would with a standard language model.
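
HuggingSnap wraps the model in a native iOS app, but the underlying checkpoint can also be exercised directly with the Hugging Face transformers library. The sketch below shows one way to ask SmolVLM2 a question about a single image; the model ID, image path, and generation settings are illustrative assumptions rather than details taken from HuggingSnap itself.

```python
# Illustrative sketch: asking SmolVLM2 a question about one image.
# The model ID and message format follow Hugging Face's published
# SmolVLM2 checkpoints; they are assumptions, not HuggingSnap internals.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# A multimodal chat turn: one image plus a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "photo.jpg"},  # placeholder: point at a real image
            {"type": "text", "text": "What objects are visible in this picture?"},
        ],
    }
]

# Tokenize the conversation and preprocess the image in one step,
# then match the model's device and floating-point precision.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

The same chat-message structure can, in principle, carry several images or a video entry instead of a single image, which is how the video-understanding tasks listed above are expressed.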

Apps that leverage multimodal generative AI for various vision and text-based tasks are hardly new. However, HuggingSnap's key selling point is that SmolVLM2 runs locally and efficiently: the app does not require an internet connection to work, and all data is processed on the device without sacrificing performance.

HuggingSnap can be downloaded from the App Store or built from its GitHub repository, and requires an iPhone running iOS 18.