Unethical scraping accusations against Perplexity AI keep piling up
Perplexity AI has been facing accusations of plagiarism and unethical web crawling practices, including ignoring the Robots Exclusion Protocol, as investigations reveal issues with its content sourcing and summarization methods.
While we wait for the story of the Recording Industry Association of America's (RIAA) lawsuits against Suno AI and Uncharted Labs to finish unfolding, TechCrunch has this week done an excellent job of compiling the plagiarism allegations leveled at Perplexity AI. The list starts with Forbes accusing Perplexity AI of plagiarizing its investigative report on former Google CEO Eric Schmidt's new venture. It runs all the way to WIRED confirming developer Robb Knight's suspicions that Perplexity AI is actively ignoring the Robots Exclusion Protocol, an industry-standard method of signaling to crawlers that a website does not want its content accessed or leveraged, and to WIRED's experiments to uncover the inner workings of the Perplexity chatbot.
The story begins with Forbes' report in early June that many of Perplexity's curated posts were too similar to the source material from Forbes, CNBC, and Bloomberg and poorly attributed to them. At that time, Perplexity CEO Aravind Srinivas noted that Perplexity Pages, the feature under fire, was only two weeks old, had "rough edges" and would be improved with the collected feedback, including addressing the matter of the easy-to-miss attributions. However, Perplexity's stance on its right to search for content and ingest it into summarizations remained largely unchanged.
Fast forward a couple of weeks, when one of developer Robb Knight's colleagues from MacStories discovered it was possible to get a quote from a specific story even after Knight had disallowed bot access using the Robots Exclusion Protocol and server-side blocking. By performing similar experiments, Knight and reporters at WIRED discovered that Perplexity was likely using automated web browsers running on an undisclosed IP address (at the time, Perplexity maintained a publicly accessible list of its IP ranges). WIRED also confirmed that the server at the IP address Knight had discovered had probably visited WIRED and other Condé Nast publications thousands of times.
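For readers unfamiliar with how the Robots Exclusion Protocol works in practice, here is a minimal sketch of the check a well-behaved crawler is expected to run before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt contents, the "PerplexityBot" and "SomeOtherBot" user-agent strings, and the example.com URL are illustrative assumptions, not details taken from Knight's or WIRED's setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
# The user-agent token "PerplexityBot" is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-post"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", page))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

The point of the sketch is that the protocol is purely advisory: the check runs on the crawler's side, and nothing stops a client from skipping it or from presenting a different user agent. That is why Knight paired robots.txt with server-side blocking, and why traffic from an undisclosed IP address could get past both.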
WIRED went on to test its hypotheses on why sometimes Perplexity's chatbot offers a vague and inaccurate summary of the indicated source. In a rather extreme setup, WIRED created a new (blocked) site containing the single sentence "I am a reporter with WIRED" and asked the chatbot to summarize it. In return, Perplexity's chatbot delivered a story about a girl who followed a trail of glowing mushrooms in a magical forest. It is unclear why the chatbot was driven to such an exotic reply, but the finding, together with Perplexity's explanation of how its chatbot works, did increase confidence in the theory that, in cases in which Perplexity does not attempt to visit a website, the chatbot cobbles together a summary using publicly accessible information like metadata or search engine previews.
The cherry on top of this saga is that mere days after WIRED published its findings about the Perplexity chatbot, one of the authors of that article asked three chatbots, Perplexity's included, to summarize the article. ChatGPT and Claude attempted to guess what the article was about based on the headline and noted they could not access the full text. Perplexity's chatbot confidently delivered a 287-word summary including the following nearly verbatim fragment from the article: "Despite monitoring the website's server logs, no evidence was found that Perplexity attempted to visit the page. Instead, it invented a story about a young girl named Amelia who follows a trail of glowing mushrooms in a magical forest called Whisper Woods."
While the plagiarism claims pile up and join cases of full-on hallucinations (WIRED names several incidents where Perplexity made up facts and attributed them to WIRED reporting), it is unclear where things should go from here. Since WIRED published its investigation, Amazon has reportedly begun looking into whether Perplexity AI's alleged scraping violates its terms of service. Moreover, Perplexity CEO Aravind Srinivas has come forward, noting that Perplexity uses third-party crawlers on top of its own, potentially offloading all the illicit scraping charges to an unidentified third party that could not be named on pain of violating an NDA. Srinivas also suggested that WIRED designed its prompts to elicit the specific behavior it reported on, and defended his product by reminding everyone that Perplexity AI has never claimed its chatbot does not hallucinate.
In short, and quite unsurprisingly, Srinivas is now defending his company by pulling all the tricks we've seen tech firms pull when questioned about their unethical and sometimes outright illicit practices. It remains to be seen whether those tricks will be as effective for Perplexity as they have been for everyone else.