How to Prevent ChatGPT From Stealing Your Content & Traffic

Cyber Security Threat Summary:
ChatGPT and similar large language models (LLMs) have added further complexity to the ever-growing online threat landscape. Thanks to bots-as-a-service, residential proxies, CAPTCHA farms, and other easily accessible tools, cybercriminals no longer need advanced coding skills to execute fraud and other damaging attacks against online businesses and their customers. Now the latest technology hurting businesses' bottom line is ChatGPT. Not only have ChatGPT, OpenAI, and other LLMs raised ethical issues by training their models on data scraped from across the internet; they are also reducing enterprises' web traffic, which can be extremely damaging to business. Among the threats ChatGPT and ChatGPT plugins can pose to online businesses, there are three key risks we will focus on:

  • Content theft (or republishing data without permission from the original source) can hurt the authority, SEO rankings, and perceived value of your original content.
  • Reduced traffic to your website or app becomes problematic, as users getting answers directly through ChatGPT and its plugins no longer need to find or visit your pages.
  • Data breaches, or even the accidental broad distribution of sensitive data, are becoming more likely by the second. Not all "public-facing" data is intended to be redistributed or shared outside of the original context, but scrapers do not know the difference. The results can include anything from a loss in competitive advantage to severe damages to your brand reputation.
Depending on your business model, your company should consider ways to opt out of having your data used to train LLMs (TheHackerNews, 2023).

Security Officer Comments:
Industries most vulnerable to ChatGPT-driven attacks are those built on data privacy, unique content, and intellectual property, with revenue dependent on advertising, views, and unique visitors. These sectors include:
  • E-Commerce: Distinct product descriptions and pricing models are crucial competitive factors.
  • Streaming, Media, & Publishing: Providing exclusive, imaginative, and engaging content is paramount.
  • Classified Ads: Pay-per-click (PPC) ad revenue is vulnerable to reduced website traffic, bot issues like click fraud, and skewed analytics from scrapers.
Additionally, ChatGPT's training data is drawn from various sources, such as Common Crawl, WebText2, Books1 and Books2, and Wikipedia. The predominant source, Common Crawl, grants access to web information through an open repository of web crawl data. Businesses should not rely solely on user agent identification to detect Common Crawl's crawler bot, CCBot, because malicious bots routinely spoof user agents. To allow or block CCBot reliably, use attributes such as IP ranges or reverse DNS instead. Blocking ChatGPT requires, at a minimum, preventing traffic from CCBot.

Suggested Correction(s):
The article details several in-depth ways organizations can block ChatGPT and related attacks. First and foremost, however, it recommends the following measures:
  • Robots.txt: Since CCBot respects robots.txt files, you can block it with the following two directives: User-agent: CCBot Disallow: /
  • Blocking CCBot User Agent: You can safely block an unwanted bot by its user agent. (Note that, in contrast, allowing bot traffic by user agent is unsafe, since user agents are easily spoofed by attackers.)
  • Bot Management Software: Whether it's for ChatGPT or a dark web database, the best way to prevent bots from scraping your websites, apps, and APIs is with specialized bot protection that uses machine learning to keep up with evolving threat tactics in real time.
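The robots.txt rule from the first bullet can be sanity-checked locally with Python's standard urllib.robotparser. This is a minimal sketch (the example.com URL is a placeholder) confirming that the two directives deny CCBot everywhere while leaving other crawlers unaffected:

```python
from urllib.robotparser import RobotFileParser

# The two directives recommended above, as they would appear in robots.txt.
rules = [
    "User-agent: CCBot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# CCBot is disallowed from the entire site.
print(rp.can_fetch("CCBot", "https://example.com/pricing"))        # False
# A crawler with no matching rule is still permitted.
print(rp.can_fetch("SomeOtherBot", "https://example.com/pricing"))  # True
```

Keep in mind this only deters well-behaved crawlers: robots.txt is honored voluntarily, which is why the article pairs it with the other two measures.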
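Blocking by user agent, as the second bullet suggests, can be done at the web server or in application middleware. Below is a minimal WSGI-style sketch; the blocklist contents are an assumption for illustration, and a production deployment would normally do this at the edge (reverse proxy or WAF) instead:

```python
# Assumption: substrings of user agents you wish to deny. Extend as needed.
BLOCKED_AGENT_SUBSTRINGS = ("ccbot",)

def is_blocked(user_agent: str) -> bool:
    """Case-insensitive substring match against the blocklist."""
    ua = (user_agent or "").lower()
    return any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS)

def block_bots(app):
    """Wrap a WSGI app so blocked user agents receive 403 Forbidden."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

As the bullet's caveat notes, this check is only safe in the blocking direction: a bot that spoofs a browser user agent slips past it, so it should complement, not replace, the IP-range and reverse DNS checks discussed earlier.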
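The reverse DNS check recommended in the commentary above (forward-confirmed reverse DNS) can be sketched with Python's standard socket module. The hostname suffix to expect is deployment-specific and shown here as a placeholder assumption; the resolver functions are injectable so the logic can be tested without live DNS:

```python
import socket

def verify_crawler_ip(ip, expected_suffix,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: resolve the IP to a hostname,
    check the hostname's suffix, then resolve that hostname forward
    and confirm it maps back to the same IP. Unlike a user agent
    string, this cannot be spoofed by the client."""
    try:
        hostname = reverse(ip)[0]          # reverse lookup: IP -> name
    except OSError:
        return False
    if not hostname.endswith(expected_suffix):
        return False                       # name is outside the expected domain
    try:
        return ip in forward(hostname)[2]  # forward-confirm: name -> IPs
    except OSError:
        return False
```

A crawler claiming to be CCBot but connecting from an address that fails this check (or falls outside the operator's published IP ranges) can then be dropped regardless of what its user agent says.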