Industry Insights
How to Train AI Models? Start with Structured Data
Meta’s latest AI update highlights a growing truth: smarter models require smarter data. Learn how to structure yours for visibility in generative search.

Sam Davis
May 1, 2025

Meta’s latest AI training update is a reminder: smarter models start with smarter data
Meta just announced a new phase of AI training in Europe. Starting this month, its generative models will learn from public content shared by adults in the EU on Facebook and Instagram — along with user interactions with Meta AI.
It's a major move, communicated transparently and backed by an opt-out mechanism. But beyond the headlines, it reflects a deeper truth that applies to every brand:
AI is only as smart as the data it's fed.
And right now, AI is becoming a key discovery channel where customers ask questions and make decisions, often without ever clicking through to your website. That means if you want to show up in AI-driven search, chat, or recommendations, you need to make your brand data findable, trustworthy, and ready for training.
What is LLM training, and why is it required?
AI is only as smart as the data it's trained on. Training is how every large language model (LLM) learns what it knows. Think of it like giving a child thousands of books, articles, and conversations to read and learn from, so they can one day write essays, answer questions, or chat with people convincingly. These models, like ChatGPT or Meta's AI, don't "understand" language the way humans do, but they learn patterns (how words and ideas tend to follow one another) by analyzing massive amounts of text.
With Meta's announcement in mind, let's consider why it wants to train its models on content from users based in the EU. Languages aren't just about words: they come with dialects, slang, humor, cultural references, and even different ways of expressing emotion. Think about how Irish English sounds different from American English, or how a joke from Italy might not land the same way in Sweden. If an AI has only been trained on content from the U.S., it might miss the mark when talking to someone in France or Germany. By training on public posts and interactions from adults in Europe, Meta's AI learns things like:
- The different languages and dialects spoken across Europe (there are dozens!).
- Local sayings, humor, and cultural references that make conversations feel natural.
- The ways people in different countries use language online — like how formal or informal they are, or what kind of emojis they like.
This means that when you ask Meta AI a question, it can respond in a way that feels relevant and familiar, whether you're in Lisbon, Warsaw, or Helsinki.
Without regional examples of how people talk, what questions they ask, and what's culturally relevant, the model can't deliver locally relevant answers. It needs more data. Better data. Local data.
So if you're wondering how generative AI knows what to say — whether it's Meta AI, Google Gemini, or ChatGPT — the answer is simple: it's trained on what it can find.
What makes your brand's data useful to AI?
At Yext, we've said: AI is your new customer. And that customer is asking you for four things:
- Clean, structured information it can trust: entity content, schema markup, listings, FAQs, product details, and other structured data (a minimal example follows this list)
- Consistent information on every platform, visible across hundreds of global and EU-specific publishers
- Updated details that reflect what's true about your brand, products, and services today
- Content that sounds like how people actually talk
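To make the first point concrete, here's a minimal sketch of what "structured information AI can trust" could look like on a page: a couple of FAQs expressed as schema.org FAQPage markup and injected as JSON-LD with a few lines of TypeScript. The brand, questions, and answers are invented for illustration; the point is the pattern, not the specifics.

```typescript
// A minimal sketch: two FAQs expressed as schema.org FAQPage markup so crawlers
// and AI models can read them as structured data rather than free text.
// The business, questions, and answers below are hypothetical examples.
const faqSchema = {
  "@context": "https://schema.org",
  "@type": "FAQPage",
  mainEntity: [
    {
      "@type": "Question",
      name: "Do you offer same-day guitar repairs?",
      acceptedAnswer: {
        "@type": "Answer",
        text: "Yes, most setups and minor repairs are completed the same day at our central London workshop.",
      },
    },
    {
      "@type": "Question",
      name: "What are your opening hours?",
      acceptedAnswer: {
        "@type": "Answer",
        text: "We're open Monday to Saturday, 9am to 6pm.",
      },
    },
  ],
};

// Embed it in the page head as JSON-LD, the format most schema tooling expects.
const faqTag = document.createElement("script");
faqTag.type = "application/ld+json";
faqTag.textContent = JSON.stringify(faqSchema);
document.head.appendChild(faqTag);
```

Notice that the answers are written the way a person would actually say them, which is exactly the conversational tone the fourth point asks for.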
Meta's AI (and all other models) can only be helpful if it learns from data that reflects real-world language and local context.
Callout: Meta's update is just one example — but it reinforces the reality: if your data isn't structured, fresh, and accurate, AI won't use it. And customers won't see you.
Train the model — or it'll train without you
You can't control what Meta, OpenAI, or any other model provider trains on. But you can control whether your brand is feeding those models useful data. Here's how:
- Add schema markup so AI tools can interpret your content (a sketch follows this list)
- Keep listings and business data synced across platforms using a knowledge graph
- Regularly update core attributes such as hours, service details, menu information, FAQs, etc.
- Write content aligned to E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness), matching natural, conversational queries. Better still, if you are a brand with a local presence, think "Local E-E-A-T"! These are "Thing+Place" content strategies, where the content helps answer hyper-local questions such as:
- Product + Place, e.g. "Men's blue jeans near Oxford Street"
- Service + Place, e.g. "Guitar refretting service near central London" (sorry, I had to put a guitar playing reference in here somewhere!)
- People + Place, e.g. "Financial advisor specialising in retirement near Hammersmith"
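As a rough sketch of the first bullet, and of a "Thing+Place" page, here is what schema markup for a hypothetical store near Oxford Street could look like: a schema.org ClothingStore entity with a name, address, hours, and description that a query like "Men's blue jeans near Oxford Street" could match against. The store, address, and URL are invented for illustration.

```typescript
// A rough sketch of "Thing+Place" schema markup for a hypothetical clothing
// store near Oxford Street. The name, address, hours, and URL are invented.
const storeSchema = {
  "@context": "https://schema.org",
  "@type": "ClothingStore",
  name: "Example Denim Co.",
  description: "Men's jeans and denim specialists near Oxford Street, London.",
  address: {
    "@type": "PostalAddress",
    streetAddress: "12 Example Street",
    addressLocality: "London",
    postalCode: "W1D 1AA",
    addressCountry: "GB",
  },
  openingHours: "Mo-Sa 09:00-18:00",
  url: "https://www.example.com/london-oxford-street",
};

// Same pattern as the FAQ example: serialise to JSON-LD and add it to the page head.
const storeTag = document.createElement("script");
storeTag.type = "application/ld+json";
storeTag.textContent = JSON.stringify(storeSchema);
document.head.appendChild(storeTag);
```

Keeping attributes like openingHours and address accurate in markup like this is the same job as keeping your listings fresh; the markup is just the machine-readable version of it.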
All of this makes your brand's information more retrievable, understandable, and trustworthy — which makes it more likely to be used in generative answers.
Smarter inputs = smarter outcomes
Meta's announcement is the latest reminder: smarter AI comes from smarter data.
As AI becomes a primary discovery channel, the best thing you can do is structure your information in ways AI can find and understand. Feed the model — or risk being invisible to the new generation of search.
The model is training either way. Make sure it's learning from you.
Make your data work for you. Get the checklist for AI search readiness.