Access Innovations Leaders on Semantic Enrichment and Why the Scholarly Publishing Model Is the Blueprint for AI Readiness
And the Days of XML Are Numbered

The AI revolution is exposing a fundamental truth that the scholarly publishing world has known for decades: unstructured data is a liability, not an asset. As enterprises race to build AI-powered tools and chatbots, many are discovering, often painfully, that dumping raw content into a language model produces hallucinations, policy violations, and fabricated citations.
In this Serious Insights interview, Access Innovations founder and Chief Scientist Marjorie (Margie) Hlava and VP of Business Development Veronica Showers make the compelling case that the rigorous content architecture long practiced in scholarly publishing (semantic enrichment, controlled vocabularies, and low-level chunking) is exactly the blueprint every organization needs to make AI work reliably. If you are wrestling with AI readiness or wondering why your early AI pilots underperformed, this conversation delivers both the diagnosis and the prescription.
Top 3 Takeaways
- AI readiness requires chunking content into small units (200–800 tokens), tagging them with a controlled vocabulary, and storing them in vector databases before any AI tool can reliably retrieve the right information.
- Controlled vocabularies and semantic enrichment solve the disambiguation problem AI cannot handle alone, preserving meaning, context, and provenance through the chunking and ingestion process.
- Humans remain essential stewards of their knowledge domains; removing them from the semantic enrichment process risks AI outputs built on poorly grounded, domain-agnostic representations.
The Access Innovations Interview
Sheri McLeish: I’ve referred to the scholarly publishing model as a blueprint that enterprise marketers and other publishers must adopt to survive the AI revolution. Could you provide a high-level overview of what this model actually entails and why it works so well?
Veronica Showers: Because of GenAI, we are transitioning from a resource economy to an answer economy. In an answer economy, it’s not about just finding the right documents; it’s about finding the right sections within each document to determine an answer. In this new economy, we need to learn how to function in ways that are optimized for language models, and they don’t like to think in big documents.
They really think in smaller units, anywhere between 200 and 800 tokens. Anything beyond that introduces a lot of noise, so you want to keep things small. The data you have on hand needs to be pre-processed into those smaller units, tagged properly, and stored in a vector database. That is what I call the "brain," and it has to exist before you can build the tools.
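Veronica's pre-processing recipe, splitting content into 200–800 token units, tagging each unit, and storing it for retrieval, can be sketched in a few lines. This is only an illustration of the idea, not Access Innovations' tooling: whitespace-delimited words stand in for model tokens, and the `prepare_for_ingestion` helper and its record fields are hypothetical.

```python
def chunk_text(text, max_tokens=800, min_tokens=200):
    """Split text into units of roughly min_tokens..max_tokens.

    Words stand in for model tokens here; a real pipeline would
    count with the target model's tokenizer.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + max_tokens])
        start += max_tokens
    # Fold a too-small trailing fragment into its predecessor so no
    # chunk falls below the minimum (it may then slightly exceed the max).
    if len(chunks) > 1 and len(chunks[-1]) < min_tokens:
        chunks[-2].extend(chunks.pop())
    return [" ".join(c) for c in chunks]

def prepare_for_ingestion(doc_id, text, tags):
    """Attach identity and subject tags to every chunk before it is
    embedded and loaded into a vector database."""
    return [
        {"doc_id": doc_id, "chunk_no": i, "text": chunk, "tags": tags}
        for i, chunk in enumerate(chunk_text(text))
    ]
```

A 1,000-word document, for example, would come out as one 800-word chunk and one 200-word chunk, each carrying the same subject tags and a pointer back to its source document.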
Margie Hlava: People are thinking, “Well, I just got my data into XML, and it’s costing me a mint, and now you want me to chunk that up?” Well, yeah, I do. XML provided a structural backbone for production, but it described the form rather than the actual meaning of the content, and you actually didn’t do very much in terms of tagging it with subject metadata from a controlled vocabulary… We’re chunking our data, we’re tagging it at a low level so that we know how to attribute it, and we know what the meaning is.
Sheri McLeish: Why did simply dumping unstructured data into early large language models cause so many headline-making failures?
Veronica Showers: When organizations dumped their data without any real training, the language models had to rely on surface-level semantic similarity, which resulted in chatbots telling users how to break policies or making up falsified references. The models lacked the structured instruction needed to successfully retrieve and apply the right information.
Sheri McLeish: If chunking and low-level semantic tagging are done correctly, does that eliminate the need for heavy XML overhead?
Margie Hlava: Yes, I think if these processes go right, there might not be a need for all that XML overhead. There is a significant movement toward linked data and content profiles that prioritize this lower-level chunking and tagging instead.
Starting Your AI Readiness Efforts
Sheri McLeish: For organizations overwhelmed by a massive backlog of unstructured content, where is the best place to start proving value?
Margie Hlava: You should look at where your unstructured information is costing you the most in manpower, or exposing you to litigation liabilities because you cannot easily access your data. Automatically indexing that high-cost data addresses what hits the organization in the pocketbook, and that is a serious place to start.
Sheri McLeish: Is there a different approach to prioritizing content if an organization's goal is to build a specific AI tool, like a client-facing chatbot?
Veronica Showers: In that case, you have to work backward from your specific goal. Determine exactly what kind of queries the chatbot needs to answer, and then figure out what specific data needs to be fed into the tool to fulfill it.
Sheri McLeish: In the agency world, we encourage defining content models for clients so their content marketing, product information and customer service materials can dynamically flow into different outputs without manual revisions. How does this modular approach fit into AI readiness?
Veronica Showers: It fits perfectly because the smaller components allow language models to package and repackage that data in many different ways. Tagging data correctly at the chunk level establishes the semantic structure that allows those modular building blocks to be dynamically aggregated and served back to people effectively.
The Importance of Semantic Enrichment and Vocabulary
Sheri McLeish: Taxonomies constantly evolve, so how do you keep these controlled vocabularies accurate over time?
Margie Hlava: You monitor incoming content streams to see if new terms were indexed appropriately or if there was nothing there that the taxonomy could latch onto, which indicates a gap. Terms need to be added to the taxonomy to cover the gaps. For rapidly moving fields like news, you have to concentrate on the topical areas and cannot afford to fall a day behind.
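Margie's monitoring step, checking whether incoming content matched the taxonomy or found nothing to latch onto, can be illustrated with a toy matcher. Everything below is a hypothetical sketch: real indexing engines match multi-word phrases, synonyms, and rules, but the gap-flagging logic is the same in spirit.

```python
def index_document(text, vocabulary):
    """Match a document against a controlled vocabulary.

    Returns the vocabulary terms found in the text and, if none
    matched, flags the document as a coverage gap so a taxonomist
    can review it and add the missing terms.
    """
    words = set(text.lower().split())
    matched = sorted(t for t in vocabulary if t.lower() in words)
    return {"terms": matched, "gap": not matched}
```

Running every incoming document through a check like this turns taxonomy maintenance into a feedback loop: matched terms confirm coverage, while a steady stream of gap-flagged documents signals a field the vocabulary has not caught up with yet.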
Sheri McLeish: Why is establishing a structured vocabulary so important for disambiguation when feeding AI?
Margie Hlava: To a computer, homonyms look exactly the same. Structuring your vocabulary ensures that the different meanings of words are recognized in their proper context. Adding tags, keywords, or concept labels to the content provides context, enables discovery, and ensures that the meaning in the writing is preserved. English has words with many meanings, and words taken out of context lead to sometimes amusing, but often incredibly incorrect, interpretations of the information presented.
Words have different meanings in different domains. "Mercury," for example, can be an element in chemistry, a planet in astronomy, a god in mythology, an automobile, a messenger, a plant, etc. "Lead" can be a management term, something you use to walk the dog, the inlet of a river to a larger body of water, or an element on the periodic table.
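One common way to implement the disambiguation Margie describes is to give each sense of a homonym a set of domain cue words and pick the sense whose cues best overlap the surrounding text. The sense inventory below is hypothetical and deliberately tiny; a production vocabulary would carry far richer context.

```python
# Hypothetical sense inventory: each sense of "mercury" carries
# cue words drawn from its domain.
SENSES = {
    "mercury": [
        ("Mercury (element)", {"element", "toxic", "thermometer", "chemistry"}),
        ("Mercury (planet)", {"planet", "orbit", "astronomy", "solar"}),
        ("Mercury (deity)", {"god", "mythology", "roman", "messenger"}),
    ],
}

def disambiguate(term, text):
    """Pick the sense whose cue words overlap the surrounding text most.

    Returns None when there is no evidence for any sense, so the
    term is left for human review rather than tagged by guesswork.
    """
    context = set(text.lower().split())
    candidates = SENSES.get(term.lower(), [])
    if not candidates:
        return None
    label, cues = max(candidates, key=lambda s: len(s[1] & context))
    return label if cues & context else None
```

Note the fallback: when no cue words appear, the sketch declines to tag rather than guess, which mirrors the human-in-the-loop stance taken later in this interview.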
Words also often change labels or meanings quickly in modern discourse, leaving the earlier writings unfindable, buried in old terminology. Take the case of homeless, unsheltered, unhoused, street people, and earlier, hobos, drifters, vagrants. We came up with at least 57 synonyms for this. Laws and research exist for every one of those terms. That is a big search parameter! Or look at when COVID appeared on the scene as Coronavirus, SARS, SARS-CoV-2, Covid-19, etc. How do we keep track of these changes and ensure that we are really doing a full scan of the available research data?
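The terminology drift Margie describes is typically handled with a synonym ring: every variant maps to one preferred concept, and a query on any variant is expanded to cover all of them, so older literature stays findable. The small vocabulary below is a hypothetical sketch, not a real thesaurus.

```python
# Hypothetical synonym ring: each variant maps to one preferred concept.
SYNONYM_RING = {
    "covid-19": "COVID-19",
    "coronavirus": "COVID-19",
    "sars-cov-2": "COVID-19",
    "homeless": "homelessness",
    "unhoused": "homelessness",
    "unsheltered": "homelessness",
    "street people": "homelessness",
}

def preferred_term(term):
    """Map any known variant to its preferred concept label."""
    return SYNONYM_RING.get(term.lower(), term)

def expand_query(term):
    """Return every variant sharing the query term's concept, so a
    search for 'unhoused' also scans older 'homeless' literature."""
    concept = preferred_term(term)
    variants = {v for v, c in SYNONYM_RING.items() if c == concept}
    variants.add(term.lower())
    return sorted(variants)
```

With control like this in place, laws and research published under any of the older labels remain in scope for a single query, which is exactly the "full scan of the available research data" the interview calls for.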
The value of semantic enrichment for AI is this: when data is chunked, tokenized, and fed into vector databases, the links between a word and its usage (its meaning and context) are lost, unless we tag the data at the start so that the terminology control semantic enrichment provides holds it together throughout the ingestion process. The prediction of which word might come next is powerful; it is even more powerful with the guardrails of a taxonomy or other vocabulary control.
Sheri McLeish: Why is establishing the provenance of these information chunks becoming so critical?
Margie Hlava: Establishing the source or provenance of information, such as using a DOI, is becoming increasingly important for overall data accuracy. It prevents misinterpretation and ensures the AI outputs are grounded in authoritative expertise rather than generalized automation. When items are linked back and attributed to an author, it helps preserve their original intent and keeps the meaning intact. Throughout my career, including my work with the original Dublin Core group, I have focused on establishing syntax like DOIs and contributor role designations (CRediT) to ensure we know exactly who contributed to a paper and why their name is on it.
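A minimal sketch of what carrying provenance through ingestion can look like: each chunk record keeps its DOI and authors, so any answer assembled from it can be attributed back to the source. The record shape and `cite` helper are hypothetical, and the DOI shown is the DOI system's placeholder example, not a real document.

```python
def make_chunk(text, doi, authors, tags):
    """A chunk record that carries provenance alongside its content,
    so attribution survives embedding and retrieval."""
    return {"text": text, "doi": doi, "authors": authors, "tags": tags}

def cite(chunk):
    """Format an attribution string so a generated answer can point
    back to the people and paper it was derived from."""
    return f"{', '.join(chunk['authors'])} (https://doi.org/{chunk['doi']})"
```

Because the DOI travels with every chunk, a retrieval system can always resolve a passage back to its authoritative source instead of presenting it as anonymous text.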
Sheri McLeish: What are your impressions of the market demand for structured content, considering its differing maturity in domains like marketing compared to technical publishing?
Veronica Showers: The work itself is universal for any organization that wants to create a product related to AI. No matter whether you’re a publisher or a marketing firm, you still have to go through that process of taking every document, breaking it up into small components, and then tagging that document properly to instruct the language model on how to retrieve the right portions of documents and how each component should be understood.
Sheri McLeish: With AI automating so many repetitive tasks, what is the ongoing role of the human in the loop?
Margie Hlava: AI is a sophisticated pattern recognition system, but it lacks deductive reasoning and the ability to understand if information is truly complete. While automated indexing tools can certainly handle repetitive tasks, keeping a human in the loop is essential because human expertise is absolutely required to make intellectual decisions and apply common-sense reasoning.
When you are preparing content for AI, the semantic enrichment and structuring really need to be done by the people who own and understand the knowledge domain. It is human expertise that preserves the conceptual architecture, logic, and complex relationships that a specific discipline depends on. If you remove the human from this process and let a vendor’s model guess at the meaning of your field, you run the severe risk that AI outputs will be derived from poorly grounded representations that lack true domain expertise.
Ultimately, in a world where AI is increasingly mediating how knowledge is discovered and interpreted, keeping a human in the loop to safeguard the true meaning of the discipline remains our most important responsibility.
Both Margie and Veronica emphasized that the shift toward AI is changing how organizations capture internal expertise, turning everyday work into structured data.
The Past and Future of Knowledge and the Human Role
Sheri McLeish: Margie, having followed technology changes from nine-track tapes to where we are today, do you see historical similarities in how people are reacting to AI?
Margie Hlava: Yes, technology is currently acting as either a positive or a disruptive influence, depending on how you look at it. Right now, people are curious, cautious, and quite afraid of what it will do to their publishing models. Interestingly, back in 1964, in response to the space race after Sputnik went up in 1957, the Council on Scientific and Technical Information (COSATI) report outlined a great deal of the information structures we are seeing today; they just didn’t have the computing horsepower back then.
Sheri McLeish: How does the rapid emergence of AI compare to past technological shifts we’ve experienced?
Veronica Showers: It reminds me of the dot-com era. In both cases, the technology was here to stay, and those who learned how to use it were going to succeed. I started learning how AI processes information and how to build agents, which brought me to Access Innovations. Because of how crucial data structuring is for these systems, I predict there will be an absolute boom in knowledge management roles specifically related to AI projects to ensure AI is grounded in structured knowledge rather than unstructured content.
Margie Hlava: We are entering a moment where organizations and publishers must stop thinking of themselves as simply providers of articles and start recognizing themselves as stewards of knowledge. Furthermore, having this structured knowledge provides a massive advantage for new, junior employees, allowing them to soak up all available internal documentation to quickly become a reliable contributing part of the organization.
Sheri McLeish: How is this shift toward semantic enrichment changing internal knowledge management, especially when it comes to retaining expertise as employees are let go or retire?
Margie Hlava: Capturing the knowledge of the people who are currently working, or those who are leaving or retiring, is an incredibly important field. The focus has to be on structuring the metadata rather than just the data itself, as this information is now being packaged and repackaged in many different ways.
Veronica Showers: Organizations are taking a taxonomic or tagging approach to move away from big blobs of text toward structured outlines. The best framework for capturing this knowledge is for employees to document what they are doing as they go, similar to how research and development firms use lab manuals. Even everyday internal assets like transcripts, emails, and podcasts can be tagged and chunked in an AI setting to be incorporated into an internal knowledge base.
Sheri McLeish: This was a great conversation, and I appreciate the time that you have spent with me. I think there is a lot that we were able to dig into, and hopefully, this assists those who are looking for this type of expertise to find you.
Veronica Showers: Good to talk to you, Sheri, thank you.
About Marjorie (Margie) Hlava and Veronica Showers
Access Innovations

Marjorie (Margie) Hlava is the founder, Chairman, and Chief Scientist of Access Innovations. She began her career as an information engineer at NASA and has helped develop around 600 taxonomies.
Veronica Showers is the VP of Business Development at Access Innovations, bringing over 20 years of experience in scholarly publishing and specializing in preparing content for generative AI through ontology application.

Access Innovations has long been a powerhouse in the scholarly publishing industry, a sector where structured content, semantic enrichment, and rigorous taxonomies have been acknowledged as necessities for decades. In this Serious Insights interview, we discuss why the scholarly publishing model is the blueprint that modern marketers and enterprise publishers must adopt to survive the AI revolution.
About Sheri McLeish

For more serious insights on AI, click here.
Did you find this interview with Marjorie Hlava and Veronica Showers useful? If so, please like, share or comment. Thank you!
The cover image is AI-generated (Adobe Firefly) from a Serious Insights prompt referencing source photos provided by the participants.
