AI companies are running out of fuel: your content will fill the tank

Ryan Shenefelt

The AI Space Race has boosted the S&P 500 and NASDAQ indices, leads headlines in business and tech reporting, and has become an integral part of earnings calls, strategic planning sessions, and casual conversations around the virtual water cooler. These AI companies have quickly earned their stripes, but as they continue to learn and grow, they’re running out of fuel: data. 

To combat this, leading AI companies are getting creative about where they acquire their data and what they can “crawl” to train their Large Language Models, the technology behind top AI platforms like ChatGPT, Google Gemini, and Microsoft Copilot. Until now, these models have been trained on publicly available data.

Meta, the parent company of Facebook, Instagram, Threads, and WhatsApp, recently announced that beginning June 26, 2024, it will use public and “non-public” data (user content) from its own platforms, dating back to 2007, to train its model, LLaMA. Meta made this “announcement” via push notification to its European Union-based users. The EU’s GDPR, often considered the strictest data privacy and security regulation, is serving as a stress test for whether this practice passes European muster. In the United States, the American Privacy Rights Act is still working its way through Congress, meaning Meta’s updates are all above board, at least for now.

While legal, Meta’s latest announcement has left many U.S.-based companies and users unsure and uneasy. “Will Meta train on my personal messages?” “Will it give away trade secrets and pricing that my company has developed?” The easy answer to those questions is “No.” Still, companies should begin reassessing their content policies as we enter the new age of AI training. 

AI companies have essentially “run out of internet” to crawl and train on. That fact puts into perspective how much data goes into a typical AI response. Your data point (whether a blog post, an image, a status update, or a comment) is one tiny drop in the ocean of data. The sheer mass of data an LLM requires offers a degree of anonymity. 

As AI optimists, we are always thinking five years down the road. We seek positive business use cases for emerging technology like AI while evaluating and weighing the risks. While burying your head in the sand and avoiding AI may seem the easiest option, we encourage you to dive into that sea of data and see how it’s used. If you are concerned that Meta will use your company’s social posts to train LLaMA to make other companies’ content look like yours, remember how vast the trove of data it pulls from really is. And if you are considering posting elsewhere to keep your data from being used, it is only a matter of time before other social media networks do the same. In short, target your audience based on where their eyeballs are and on which networks will deliver the best business results. Your data is already out there in one form or another, and the cost of not putting out content and advertising far outweighs the risk of potential AI piracy. 

We can’t end this article without our take on capitalizing on AI. Generative AI can save your company the time and hassle of repetitive tasks, kick-start workflows, or get the gears turning for creative ideation. We recommend using AI to start the process, but don’t become overly dependent on it for the end result. For example, it’s a bad look for your business to use imagery that resembles a competitor’s or a written voice that is inconsistent with how you’ve “spoken” for a decade. Empower your employees to gut-check the content and make it their own. AI-generated content should be your inspiration and your base, but it should never be your final product. 

Ryan Shenefelt is an account manager and education & innovation lead at de Novo Marketing.