Google open-sourced its SynthID Text watermarking tool

Google has open-sourced SynthID Text, a watermarking tool that modifies language models' token probability distributions to create detectable patterns in AI-generated text, making it available through its Responsible Generative AI Toolkit and Hugging Face's Transformers library.

by Ellie Ramirez-Camara
Credit: Google DeepMind

Google recently announced that it is open-sourcing its SynthID Text watermarking tool, making it available as part of its Responsible Generative AI Toolkit and through Hugging Face's Transformers library. Google says the DeepMind-developed technology has been integrated into its own models for some time without sacrificing the speed, quality, diversity, or creativity of their outputs.

SynthID Text leverages the underlying mechanism by which LLMs generate text: at each step, the model assigns a probability to every candidate next token based on the preceding tokens, then picks one of them to continue the output. SynthID Text creates watermarks for AI-generated text by adjusting that probability distribution at several points in the output, using one or more randomly generated watermarking functions.
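
To make the idea of adjusting a token probability distribution concrete, here is a minimal Python sketch. It is a simplified stand-in and not Google's actual algorithm: the keyed hash `g_value`, the `strength` parameter, and the toy vocabulary are all assumptions introduced for illustration.

```python
import hashlib
import math


def g_value(key: int, context: tuple, token: str) -> int:
    """Hypothetical watermarking function: a keyed hash that maps
    (key, recent tokens, candidate token) to 0 or 1."""
    data = f"{key}|{' '.join(context)}|{token}".encode()
    return hashlib.sha256(data).digest()[0] & 1


def watermark_distribution(probs: dict, key: int, context: tuple,
                           strength: float = 2.0) -> dict:
    """Reweight a next-token distribution so that tokens whose g-value
    is 1 become more likely, then renormalize so it sums to 1."""
    weights = {
        token: p * math.exp(strength * g_value(key, context, token))
        for token, p in probs.items()
    }
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


# Toy next-token distribution after the (assumed) context "the small".
probs = {"cat": 0.4, "dog": 0.3, "bird": 0.2, "fish": 0.1}
adjusted = watermark_distribution(probs, key=42, context=("the", "small"))

for token in probs:
    print(f"{token}: {probs[token]:.3f} -> {adjusted[token]:.3f}")
```

Sampling from `adjusted` instead of `probs` nudges the output toward tokens the key favors, and that nudge is the trace a detector holding the same key can look for later.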

The watermark is the resulting pattern: the word choices at the positions where a function was applied, together with the adjusted probability scores. A detector is trained to evaluate text against those watermarking functions: the more closely the pattern and scores match what the randomly generated functions would produce, the more likely the text is AI-generated.
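
The detection idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the SynthID Text detector: instead of a trained detector it uses a plain average, the keyed hash `g_value` is a hypothetical watermarking function, and `generate` is a deliberately heavy-handed generator that always picks a key-favored word when one exists.

```python
import hashlib
import random

# Toy 16-word vocabulary (an assumption for illustration).
VOCAB = ["river", "stone", "cloud", "light", "field", "storm", "glass", "paper",
         "metal", "ocean", "forest", "garden", "window", "shadow", "bridge", "valley"]


def g_value(key: int, context: tuple, token: str) -> int:
    """Hypothetical keyed function mapping (key, context, token) to 0 or 1."""
    data = f"{key}|{' '.join(context)}|{token}".encode()
    return hashlib.sha256(data).digest()[0] & 1


def generate(key: int, length: int, rng: random.Random,
             watermark: bool, ngram_len: int = 2) -> list:
    """Emit `length` tokens; when watermarking, prefer tokens with g-value 1."""
    tokens = ["the"] * ngram_len  # fixed prefix so the first context exists
    for _ in range(length):
        context = tuple(tokens[-ngram_len:])
        candidates = VOCAB
        if watermark:
            favored = [t for t in VOCAB if g_value(key, context, t) == 1]
            candidates = favored or VOCAB  # fall back if nothing is favored
        tokens.append(rng.choice(candidates))
    return tokens


def watermark_score(tokens: list, key: int, ngram_len: int = 2) -> float:
    """Mean g-value over the text; about 0.5 is expected without a watermark."""
    hits = [g_value(key, tuple(tokens[i - ngram_len:i]), tokens[i])
            for i in range(ngram_len, len(tokens))]
    return sum(hits) / len(hits) if hits else 0.5


KEY = 2024
marked = generate(KEY, 200, random.Random(0), watermark=True)
plain = generate(KEY, 200, random.Random(1), watermark=False)
marked_score = watermark_score(marked, KEY)
plain_score = watermark_score(plain, KEY)
print(f"watermarked: {marked_score:.2f}  unwatermarked: {plain_score:.2f}")
```

Text produced with the key scores well above the chance level of roughly 0.5, while text produced without it stays near that level. A real deployment would apply a far gentler bias, which is why longer passages and a trained detector are needed to separate the two cases reliably.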

The strategy has some limitations. For instance, responses to prompts that don't allow for significant variation, such as factual questions with short answers, cannot be watermarked: if the model has few plausible options to choose from, there are few opportunities to modify a token's probability distribution. On the other hand, Google says that the watermark will resist some degree of tampering, such as cropping a longer text, lightly paraphrasing it, or replacing some of its words. It will not, however, withstand heavy paraphrasing or translation into another language.
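
A small worked example shows why short factual answers leave little room for a watermark. The sketch below is illustrative only: the distributions, the `favored` token sets, and the `strength` value are made-up assumptions, and the reweighting rule is a simplified stand-in for whatever SynthID Text does internally.

```python
import math


def reweight(probs: dict, favored: set, strength: float = 2.0) -> dict:
    """Boost the favored tokens by exp(strength), then renormalize."""
    weights = {t: p * (math.exp(strength) if t in favored else 1.0)
               for t, p in probs.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def shift(probs: dict, favored: set, strength: float = 2.0) -> float:
    """Total variation distance between the original and reweighted
    distributions: how much room the watermark had to act."""
    new = reweight(probs, favored, strength)
    return 0.5 * sum(abs(new[t] - probs[t]) for t in probs)


# A factual prompt: the model is almost certain of the next token.
factual = {"Paris": 0.98, "Lyon": 0.01, "Nice": 0.01}
# An open-ended prompt: several continuations are equally plausible.
open_ended = {"quiet": 0.25, "bright": 0.25, "windy": 0.25, "cold": 0.25}

factual_shift = shift(factual, favored={"Lyon"})
open_shift = shift(open_ended, favored={"bright", "cold"})
print(f"factual: {factual_shift:.3f}  open-ended: {open_shift:.3f}")
```

With the same boost applied, the peaked distribution barely moves while the flat one shifts substantially, so the open-ended text carries far more watermark signal per token.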

Still, the availability of SynthID Text marks a significant milestone in making watermarking technology widely accessible. Developers deploying SynthID Text do not need to retrain their models, but models that use different tokenizers each need their own watermark. Models that share a tokenizer can share a watermark, provided the detector is trained on outputs from all of them. By doing this, developers can set up a reasonably functional method to detect whether a piece of text was generated using their models. More technical details can be found in the Hugging Face blog post and the Responsible Generative AI Toolkit documentation.

