Amazon’s cloud division has launched an investigation into Perplexity AI. At issue is whether the AI search startup is violating Amazon Web Services rules by scraping websites that tried to prevent it from doing so, WIRED has learned.
An AWS spokesperson, who spoke to WIRED on the condition of not being named, confirmed the company’s investigation into Perplexity. WIRED had previously found that the startup, which has the backing of the Jeff Bezos family fund and Nvidia and was recently valued at $3 billion, appears to rely on content scraped from websites that had blocked access through the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of service generally are.
The Robots Exclusion Protocol is a decades-old web standard that involves placing a plain text file (like wired.com/robots.txt) on a domain to indicate which pages should not be accessed by automated bots and crawlers. Although companies that use scrapers may choose to ignore this protocol, most have traditionally respected it. The Amazon spokesperson told WIRED that AWS customers must adhere to the robots.txt standard when crawling websites.
“AWS’s terms of service prohibit customers from using our services for any illegal activity, and our customers are responsible for complying with our terms and all applicable laws,” the spokesperson said in a statement.
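For readers unfamiliar with how a robots.txt file is read, the following is a minimal sketch using Python's standard `urllib.robotparser` module. The file contents, the bot name `ExampleBot`, and the `example.com` URLs are hypothetical and chosen only for illustration; they are not taken from any real site's configuration or from Perplexity's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It blocks one named crawler entirely and keeps every other
# crawler out of a single directory.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(agent, url))
```

A well-behaved crawler would run a check like this before each request and skip any URL for which it returns false. Nothing in the protocol enforces that; a scraper that never reads the file, or that ignores the answer, is not technically stopped by it.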
Scrutiny of Perplexity’s practices follows a June 11 report from Forbes that accused the startup of stealing at least one of its articles. WIRED’s investigations confirmed the practice and found additional evidence of scraping abuse and plagiarism by systems tied to Perplexity’s AI-powered search chatbot. Engineers at Condé Nast, WIRED’s parent company, block Perplexity’s crawler on all of its websites using a robots.txt file. But WIRED discovered that the company had access to a server using an unpublished IP address (44.221.181.252) that visited Condé Nast properties at least hundreds of times over the past three months, apparently to scrape Condé Nast’s websites.
The machine associated with Perplexity appears to be engaged in widespread crawling of news sites that block bots from accessing their content. Spokespeople for the Guardian, Forbes and the New York Times also say they detected the IP address on their servers on multiple occasions.
WIRED traced the IP address to a virtual machine known as an Elastic Compute Cloud (EC2) instance hosted on AWS, which launched its investigation after we asked whether using AWS infrastructure to scrape websites that prohibited it violated the company’s terms of service.
Last week, Perplexity CEO Aravind Srinivas responded to WIRED’s investigation by first stating that the questions we asked the company “reflect a deep and fundamental misunderstanding of how Perplexity and the Internet work.” Srinivas then told Fast Company that the secret IP address WIRED observed scraping Condé Nast websites and a test site we created was operated by a third-party company that performs web crawling and indexing services. He declined to name the company, citing a nondisclosure agreement. Asked if he would tell the third party to stop crawling WIRED, Srinivas replied: “It’s complicated.”