The Owl and the Bat
Knowledge Production on Wikimedia Projects with Artificial Intelligence (whatever that means)
There’s no doubt that what we refer to as “Artificial Intelligence” is changing knowledge production. English Wikipedia recently adopted a policy that prohibits the use of Large Language Models to generate or rewrite article content. A little before that, German Wikipedia started a Request for Comment on a comparable policy. In 2026, we typically don’t encounter AI as a field of computer science research, but in the context of exploitative business practices that – among other effects on the real world – put considerable strain on Wikimedia’s infrastructure and thus on the Knowledge Commons.
But is there a place for computational knowledge? Can humans and machines work hand in hand in a way that can be called “Artificial Intelligence” even if it differs from what we typically call AI nowadays? I think there are possibilities worth exploring. These forms of computational knowledge have – both technically and structurally – very little to do with the LLMs of “AI” corporations.
Knowledge Bases and AI Winters
At the risk of total oversimplification, it can be said that there are two approaches. One is what is known as “Good Old-Fashioned AI.” Confusingly, this approach is neither very old – although it goes back to the late 1950s – nor necessarily inherently good, so it’s a bit of a misnomer. It’s roughly the approach that was pursued most intensively in the 1980s, up until neural networks took over. Back then, AI looked like this: you created a knowledge base for an expert system. In that knowledge base, you tried to model the world and everything that is the case, and you defined lots of rule sets about how reality works. An ambitious endeavor, to say the least.
At some point, however, research funding for these projects in the university ivory towers ran out, and students earning some money as research assistants could no longer be paid for entering all the rules, such as “All humans are mortal” or “Socrates is a human.” And thus the whole approach basically hit a wall. Enter another AI Winter.
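To get a feel for how such a system worked, here is a deliberately tiny sketch in Python: a hand-entered fact, one hand-written rule, and a forward-chaining loop that keeps deriving new facts until nothing changes. The fact and the rule are illustrative only; real expert systems used dedicated languages and vastly larger rule bases.

```python
# A toy expert system: hand-entered facts plus a hand-written rule, and a
# forward-chaining loop that derives everything it can from them.
# The fact and the rule here are illustrative, not drawn from any real system.

facts = {("human", "Socrates")}   # "Socrates is a human"
rules = {"human": "mortal"}       # "All humans are mortal"

changed = True
while changed:
    changed = False
    for predicate, subject in list(facts):
        if predicate in rules and (rules[predicate], subject) not in facts:
            facts.add((rules[predicate], subject))
            changed = True

print(facts)   # {('human', 'Socrates'), ('mortal', 'Socrates')}
```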
The newer approach takes a different view. Now it’s all about neural networks, and especially since the paper “Attention Is All You Need”, which in 2017 introduced the transformer architecture that generative pre-trained transformers (GPTs) are built on, it appears that the matter is now settled. Neural networks work under this premise: knowledge in this world is chaotic. So we throw it into a computer in its oh-so-messy form, and it should figure out its own rules from that.
And that premise works remarkably well. Machine translation using neural networks performs far better than anything that had previously been attempted with other AI approaches.
The first approach is like an owl sitting in a tree with branches and leaves: a structured system of knowledge with categories and branches that subdivide further, where we make precise conceptual distinctions and build ontologies, that is, structured knowledge systems. Humans create these, and machines can then use them in machine-readable form.

What is mainly used nowadays with neural networks follows more of a bat principle. We move through the darkness, orient ourselves via echoes, don’t try to categorize anything, and simply assume that if a ping comes back, then something must be there and we can work with it.
Wikidata
One example of the first system is Wikidata. It’s a collaboratively built, free knowledge base in over 300 languages, intended to eventually contain the sum of human knowledge and all its entities.
It works like this: there is a knowledge item, Q42, which represents the author Douglas Adams. Its label and description can be displayed in any supported language. You then attach properties like “Which university did this person attend?” and “What sources verify this?”
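Reading such an item programmatically is straightforward. Here is a minimal sketch, assuming Python with the requests library, that fetches Q42 from the public Wikidata API and prints its English label plus the item IDs recorded under property P69 (“educated at”):

```python
import requests

# Fetch the Wikidata item Q42 (Douglas Adams) via the public wbgetentities API.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbgetentities", "ids": "Q42", "format": "json", "languages": "en"},
    headers={"User-Agent": "owl-and-bat-demo/0.1 (example)"},
    timeout=30,
)
entity = resp.json()["entities"]["Q42"]

print(entity["labels"]["en"]["value"])                     # Douglas Adams

for claim in entity["claims"].get("P69", []):              # P69 = "educated at"
    print(claim["mainsnak"]["datavalue"]["value"]["id"])   # item IDs of the schools
```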
You can also run queries. For example, if you look at the relationships for the entity San Francisco, there’s a link to the mayor and another to its geolocation. You can then ask questions like: In which U.S. city was Gavin Newsom mayor, and how many inhabitants does that city have?
For more complex questions, Wikidata offers a query language called SPARQL. With it, you can ask something like: “What are the twenty largest cities in the world that have a female mayor?” You first retrieve all cities, sort them by population, then check who leads each city government and what gender is recorded for that person, and filter accordingly.
This produces answers you won’t find directly in Wikipedia – maybe implicitly, but you can’t just go to Wikipedia and ask for a list of the 20 largest cities with a female mayor. With a structured knowledge base, however, you can.
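Here is a sketch of what such a query could look like, sent to the public Wikidata query endpoint from Python with the requests library. The SPARQL is a simplified variant of the well-known “largest cities with a female mayor” example query; P31/P279 stand for “instance of”/“subclass of”, Q515 is “city”, P6 “head of government”, P1082 “population”, P21 “sex or gender”, and Q6581072 “female”.

```python
import requests

# Simplified "largest cities with a female mayor" query against Wikidata.
QUERY = """
SELECT ?cityLabel ?mayorLabel ?population WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 ;
        wdt:P6 ?mayor ;
        wdt:P1082 ?population .
  ?mayor wdt:P21 wd:Q6581072 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 20
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "owl-and-bat-demo/0.1 (example)"},
    timeout=60,
)
for row in response.json()["results"]["bindings"]:
    print(row["cityLabel"]["value"], "-", row["mayorLabel"]["value"])
```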
Abstract Wikipedia
We can go even further on that path towards computational knowledge with Abstract Wikipedia.
This project uses functions created by volunteers and stored in the Wikifunctions project, together with the structured data from Wikidata, to create language-independent articles. Anyone who has ever embarked on a journey to search for the “perfect language” or played a philosophical game with Ludwig Wittgenstein will recognize the concept. Combine that with computer science, and that’s essentially what Abstract Wikipedia is.
For large language editions like English or German Wikipedia, this will not make a huge difference. But for the many smaller languages among the more than 300 other language versions, it provides a starting point. Take, say, an article about a year in history: nobody has to write it manually. It is generated from the facts in the knowledge base, supported by functions written by others. And it works, although in a way that will need refinement and human care in the coming years.
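As a toy illustration only, and emphatically not actual Wikifunctions code, the idea can be sketched like this: one language-independent bundle of facts, a small lexeme table, and per-language renderer functions that turn both into sentences.

```python
# A toy illustration of the Abstract Wikipedia idea, not actual Wikifunctions
# code: language-independent facts, a lexeme table, and per-language renderers.

facts = {"label": "Douglas Adams", "occupation": "writer", "birth_year": 1952}

lexemes = {"writer": {"en": "writer", "de": "Schriftsteller"}}

def render_en(f):
    return f"{f['label']} was a {lexemes[f['occupation']]['en']} born in {f['birth_year']}."

def render_de(f):
    return f"{f['label']} war ein {f['birth_year']} geborener {lexemes[f['occupation']]['de']}."

for language, render in {"en": render_en, "de": render_de}.items():
    print(language, render(facts))
```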

Zelph: Reasoning over the Knowledge Graph
We can also go back to where we started and look at world knowledge as a graph. Zelph is a semantic knowledge system that represents facts and rules as a unified graph of nodes and relationships, allowing it to perform automated logical reasoning, infer new knowledge, and detect contradictions within its data. Within Zelph, all relationships are treated as nodes, enabling deeper meta-reasoning about how concepts connect and interact. Zelph is built to operate on very large datasets such as the entire Wikidata knowledge graph (1.7 TB), where it can analyze, extend, and validate complex webs of structured information.
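The “relationships as nodes” idea can be sketched in a few lines. This is a hypothetical illustration, not Zelph’s actual data model or API: every statement gets its own node ID so that other statements can talk about it, one rule pass derives new facts, and a simple check flags contradictions.

```python
# Hypothetical sketch of statements-as-nodes (reification); not Zelph's real API.
nodes = {}

def assert_statement(subject, predicate, obj):
    """Store the statement as a node so other statements can refer to it."""
    node = len(nodes) + 1
    nodes[node] = (subject, predicate, obj)
    return node

s1 = assert_statement("Socrates", "instance_of", "human")
s2 = assert_statement("human", "subclass_of", "mortal")
s3 = assert_statement("Socrates", "instance_of", "immortal")
assert_statement(s3, "source", "unknown")   # a statement about another statement

# Rule: instance_of(x, a) and subclass_of(a, b)  =>  instance_of(x, b)
derived = {
    (xs, "instance_of", bo)
    for (xs, xp, xo) in nodes.values() if xp == "instance_of"
    for (bs, bp, bo) in nodes.values() if bp == "subclass_of" and bs == xo
}
print("derived:", derived)   # {('Socrates', 'instance_of', 'mortal')}

# Contradiction check: the same subject cannot be both mortal and immortal.
all_facts = set(nodes.values()) | derived
contradictions = (
    {s for (s, p, o) in all_facts if p == "instance_of" and o == "mortal"}
    & {s for (s, p, o) in all_facts if p == "instance_of" and o == "immortal"}
)
print("contradictions:", contradictions)   # {'Socrates'}
```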
Wikidata Embedding Project: Beyond the structured knowledge of graphs and ontologies

We have looked at the model of our feathered friend the owl to represent knowledge, but what about the bat, flying through the unordered darkness of knowledge and orienting itself by pings that will, hopefully, be reflected back?
The Wikidata Embedding Project is a vector-based semantic search for Wikidata by Wikimedia Deutschland that uses modern machine learning models and scalable vector databases to enable more intelligent and context-aware information retrieval. It aims to support the open-source community in building AI applications while leveraging Wikidata’s multilingual and inclusive knowledge graph to ensure broad, diverse, and accessible data coverage.
A vector is essentially a list of numbers like [0.12, -0.44, 0.98, ...], where each number represents a coordinate in a high-dimensional space, often spanning hundreds or even thousands of dimensions. Complex data such as text, images, or audio can be translated into these vectors, turning meaning into geometry. Once in this space, similar items naturally cluster close together, allowing systems to compare and retrieve related content based on distance rather than exact matches.
An embedding is the process of converting content like text into a numerical vector, for example turning “I love pizza” into something like [0.23, -0.81, 0.44, ...]. These vectors capture the meaning of the content, so texts with similar meanings produce similar vectors, allowing systems to understand and compare information based on semantic matching rather than simple word matching. Things that are semantically close are vector-space neighbors.
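A minimal sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (any sentence-embedding model would do): semantically similar sentences end up with nearby vectors, unrelated ones do not.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["I love pizza",
             "Pizza is my favourite food",
             "The stock market fell sharply today"]
vectors = model.encode(sentences)        # one 384-dimensional vector per sentence
print(vectors.shape)                     # (3, 384)

# Semantically similar sentences sit close together in vector space.
print(cos_sim(vectors[0], vectors[1]))   # high similarity
print(cos_sim(vectors[0], vectors[2]))   # much lower similarity
```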
A vector database stores large numbers of embeddings and enables fast retrieval of similar content by comparing their positions in a high-dimensional space. When a query is made, it is first converted into an embedding; then the system searches for nearby vectors to find the most relevant matches, returning documents with similar meaning. For example, a question like “Why do cats purr?” leads the system to automatically retrieve related cat-oriented content, a process known as semantic search.
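And here is a brute-force stand-in for the vector database itself, under the same assumptions as above: embed a tiny corpus, embed the query, and return the nearest neighbour by cosine similarity. Real vector databases do the same thing, but with approximate-nearest-neighbour indexes so it scales to millions of vectors.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Cats purr when they are content, but also to soothe themselves.",
    "The Golden Gate Bridge opened in 1937.",
    "Domestic cats communicate with humans through meows and body language.",
]
corpus_vectors = model.encode(corpus)             # "index" the documents

query_vector = model.encode("Why do cats purr?")  # embed the query the same way
scores = cos_sim(query_vector, corpus_vectors)[0]

# The highest-scoring document is the closest in meaning, not in wording.
best = scores.argmax().item()
print(corpus[best])   # the sentence about purring
```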
Now what?
On its 25th birthday, Wikipedia stands at a turning point: its infrastructure is under increasing strain, and its community-driven model does not scale as easily as automated systems. As a result, both the production of knowledge and access to it are beginning to shift, shaped by the growing influence of AI. The future likely lies in a hybrid approach that combines symbolic AI and knowledge graphs with vector-based methods like embeddings. As ever, the Knowledge Commons and everything in and around it is likely to remain chaotic, problematic, hopeful, and very, very interesting.
#ai #abstractwikipedia #wikidata #wikifunctions #knowledgeproduction