The Encyclosphere: A Technical Proposal
In this post, I want to break down the basic technology of the Encyclosphere—but bear in mind this is more of a proposal for review than a finished concept. Bear also in mind that “Introducing the Encyclosphere” is a better source for general reflections about the Encyclosphere and how the Knowledge Standards Foundation will help organize it.
As a technical description of the Encyclosphere, again, this is just a rough draft. A finalized proposal will need detailed input from many experts. I do not propose to move quickly, or to move without a lot of feedback from all potential users of the standard. The top-level, hard-to-change architecture of the Encyclosphere is precisely what we must make sure to get right.
An excellent way forward would be for people to pick one topic here, write a long blog post about just that topic, and then we’ll talk it all over.
First, let us consider a concept that is similar to the Encyclosphere but easier to understand.
Suppose you wanted to make a single proprietary database of all encyclopedia articles online, with the information in the database in a common format, so that, regardless of the original source, all the articles looked uniform and great. Suppose the database also supported data about authors, publishers, ratings by users, and users (including raters). Then there could be a single “super encyclopedia” encompassing the whole, with multiple articles per topic and the articles arranged in order of rating. If user information included user categories, then you could sort and re-sort articles according to different categories.
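To make the "sort and re-sort according to different categories" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Rating` fields, the category names, the numeric score scale, and the choice of a simple mean as the ranking rule. Nothing here is part of any agreed specification.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rating:
    article_id: str      # hypothetical article identifier
    rater_category: str  # hypothetical user category, e.g. "expert"
    score: float         # hypothetical numeric score, higher is better


def rank_articles(article_ids, ratings, category=None):
    """Order article_ids by mean score, highest first.

    If category is given, only ratings from raters in that category count.
    Articles with no counted ratings are placed last, in their input order.
    """
    scores = defaultdict(list)
    for r in ratings:
        if category is None or r.rater_category == category:
            scores[r.article_id].append(r.score)

    def sort_key(article_id):
        counted = scores.get(article_id)
        # Rated articles come first (0), unrated last (1).
        return (0, -mean(counted)) if counted else (1, 0.0)

    return sorted(article_ids, key=sort_key)


# Three competing articles on one topic, rated by two kinds of users.
articles = ["a", "b", "c"]
ratings = [
    Rating("a", "expert", 5), Rating("a", "general", 2),
    Rating("b", "expert", 3), Rating("b", "general", 5),
]
print(rank_articles(articles, ratings))
print(rank_articles(articles, ratings, category="expert"))
```

The point of the sketch is only that the same set of articles can come out in a different order depending on whose ratings are counted.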
That is sort of like what I want us to build. By the way, what do I mean by “us,” you ask? I mean a global decentralized movement—not a nonprofit organization like the KSF, not a corporation, not an app—that anyone can participate in using stuff they fully own.
I wouldn’t be a big fan of the above idea. It is similar to what the Encyclosphere will be, but importantly different from it.
There would be too many practical—technical, legal, logistical, and other—difficulties with that concept. One big technical difficulty would be that scrapers would constantly break, meaning that translating a zillion different encyclopedia article formats into one common format would be a gargantuan task for any one team. The biggest legal difficulty, of course, would be copyright violations and securing and renewing permissions. There would also be extreme logistical difficulties in getting enough people to agree to use your system to rate articles. People will go to the considerable trouble of rating articles only if they can be sure their data is equally available to everyone. By the same token, if it’s a centralized, proprietary system, then not many people will want to write new articles for it. Google, with its old “Knol” project, tried and failed to do that.
But suppose we change the concept so that, instead of a centralized database, we have a series of standards-compliant feeds, i.e., constantly updating lists of article versions, metadata, and ratings data, all formatted similarly according to a common encyclopedia specification. We make both the standard and the feeds open, i.e., anyone can study them, use them, and build on top of them. Then a few feed aggregators monitor those feeds constantly and load them into a distributed database such as IPFS, or maybe many different (independent) servers with open APIs. Finally, encyclopedia readers would be apps that draw on those distributed databases to build their own selections of articles. Does all this sound familiar? It should. It is roughly the same basic architecture as the Blogosphere.
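The feed, aggregator, and reader roles just described can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions: `FeedEntry`, `Aggregator`, and `reader_select` are hypothetical names, the "database" is a plain dictionary rather than IPFS or a server with an open API, and the rule "keep the highest version number per source and article ID" is one possible policy, not a settled one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedEntry:
    source: str      # hypothetical publisher identifier
    article_id: str  # the publisher's internal article ID
    version: int     # assumed to increase with each new article version
    title: str


class Aggregator:
    """Toy aggregator: merges entries from many feeds into one store."""

    def __init__(self):
        self._latest = {}  # (source, article_id) -> FeedEntry

    def ingest(self, feed):
        """Load a feed; return how many entries were new or newer."""
        changed = 0
        for entry in feed:
            key = (entry.source, entry.article_id)
            current = self._latest.get(key)
            if current is None or entry.version > current.version:
                self._latest[key] = entry
                changed += 1
        return changed

    def entries(self):
        """All stored entries, ordered by title and then by source."""
        return sorted(self._latest.values(),
                      key=lambda e: (e.title.lower(), e.source))


def reader_select(aggregator, query):
    """Toy reader: pick every article whose title contains the query."""
    q = query.lower()
    return [e for e in aggregator.entries() if q in e.title.lower()]


feed_a = [FeedEntry("enc-a", "1", 1, "Photosynthesis"),
          FeedEntry("enc-a", "1", 2, "Photosynthesis")]
feed_b = [FeedEntry("enc-b", "x9", 1, "Photosynthesis"),
          FeedEntry("enc-b", "x10", 1, "Plato")]

agg = Aggregator()
agg.ingest(feed_a)
agg.ingest(feed_b)
for entry in reader_select(agg, "photo"):
    print(entry.source, entry.version, entry.title)
```

Note that the reader ends up with two articles on the same topic from two different publishers, which is exactly the situation the ratings data is meant to handle.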
To achieve that would be very difficult. For one thing, you’d have to translate articles in a bunch of different formats (such as MediaWiki markup), with different CSS classes, etc., into a common format. But that much seems possible, considering that encyclopedia articles do, after all, have much in common: they have metadata such as titles, authors, internal article IDs, publication dates, version numbers; content elements such as paragraphs, images and captions, tables, “infoboxes,” footnotes, bibliographies, and more; and it doesn’t take much to imagine complementary user data, including categories and users’ article ratings, perhaps built on top of something like this developing identity system.
Let me give an idea of what the specification document(s) will likely cover, in the opinion of someone who admittedly isn’t exactly an expert on standards. In general, I anticipate three main parts, concerning (1) article metadata, (2) article content, and (3) user data and user ratings. These strike me as being of increasing difficulty.
Article metadata includes things displayed on the page that are not part of the article proper, such as title, author name, source or publication, publication date, license, etc. It would also include things not displayed on the page, such as URL, some sort of ID, version number, etc.
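As a rough illustration of what one metadata record in a feed might look like, here is a minimal Python sketch. The field names, the use of JSON, and the example values (including the `example.org` URL) are all assumptions for discussion; the actual specification would have to settle each of them.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ArticleMetadata:
    # Displayed on the page
    title: str
    authors: List[str]
    source: str       # the encyclopedia or publication
    published: str    # assumed ISO 8601 date string
    license: str
    # Not displayed on the page
    url: str
    article_id: str   # "some sort of ID"; the scheme is undecided
    version: int


def to_feed_item(meta):
    """Serialize one record as a JSON string with stable key order."""
    return json.dumps(asdict(meta), sort_keys=True)


def from_feed_item(text):
    """Parse a JSON string produced by to_feed_item."""
    return ArticleMetadata(**json.loads(text))


record = ArticleMetadata(
    title="Photosynthesis",
    authors=["A. Author"],
    source="Example Encyclopedia",
    published="2019-10-01",
    license="CC BY-SA 4.0",
    url="https://example.org/wiki/Photosynthesis",
    article_id="example:photosynthesis",
    version=3,
)
print(to_feed_item(record))
```

A record this small is enough to link back to the original page, which is why metadata alone could already support search.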
As I will explain below, this is probably the lowest-hanging fruit, one that we can get to work on soonest. By dumping feeds of relatively simple metadata about encyclopedia articles into a database, we could support surprisingly full-featured search engines (not just one). This doesn’t mean it will be easy. Simply scraping the data reliably from existing sources and persuading publishers to publish feeds themselves would be significant enough challenges to last us a while. One difficult issue is finding a decentralized way to determine that two articles are about the same topic, without determining what
Article content includes the actual paragraphs, tables, images, etc., that make up the content of an encyclopedia article; we might also consider including other categories of (stand-alone) reference information, such as tables, graphs, lists, galleries, and so forth. This is necessary to include in the specification if anyone wants an app that includes “local copies” of articles, rather than just having links to pages, or pages in an <iframe>. Local copies would be nice to have not just because the result would look nicer; in fact, that is a relatively unimportant and trivial feature. The real advantage of including article content in the specification (and feeds) is the ability to creatively compile, search, manipulate, archive, and edit. This would support a genuinely decentralized system of reading and editing, not just searching.
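To show why structured local copies matter more than looks, here is a small Python sketch of a format-neutral content model. The `Paragraph`, `Image`, and `Article` types are hypothetical and cover only two kinds of content element; the sketch deliberately takes no position on whether the eventual format should be MediaWiki markup, HTML5, or Markdown.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List, Union


@dataclass
class Paragraph:
    text: str


@dataclass
class Image:
    url: str
    caption: str


@dataclass
class Article:
    title: str
    blocks: List[Union[Paragraph, Image]] = field(default_factory=list)


def plain_text(article):
    """Flatten an article to text, e.g. for a reader app's search index."""
    parts = [article.title]
    for block in article.blocks:
        if isinstance(block, Paragraph):
            parts.append(block.text)
        else:
            parts.append(block.caption)
    return "\n".join(parts)


def to_html(article):
    """Render the same article uniformly, whatever its original source."""
    out = [f"<h1>{escape(article.title)}</h1>"]
    for block in article.blocks:
        if isinstance(block, Paragraph):
            out.append(f"<p>{escape(block.text)}</p>")
        else:
            out.append(
                f'<figure><img src="{escape(block.url)}" alt="">'
                f"<figcaption>{escape(block.caption)}</figcaption></figure>"
            )
    return "\n".join(out)


article = Article("Tea", [
    Paragraph("Tea & milk."),
    Image("https://example.org/tea.png", "A cup"),
])
print(plain_text(article))
print(to_html(article))
```

The same structured copy feeds both the search text and the rendered page, which a link or an embedded frame could not do.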
Perhaps the biggest hurdle, when it comes to settling on a common standard for article content, is to determine what format should be used for the content of the article. Some will argue that MediaWiki markup should be adopted; but after discussion, we might decide that that language is too complex and idiosyncratic. Another possibility is to put the content specification in terms of HTML5, with lots of extra rules regarding such things as titles and classes. A third possibility is Markdown, a markup language known to many developers; but it would have to be greatly expanded to become as expressive as full-featured encyclopedia articles (and, possibly, other reference information) require.
User data and ratings. You might wonder why I would mention “user data” in a discussion of encyclopedia standards. Mostly it is because, if we have multiple articles per topic, we really need to
Outline of the rest:
- More technical explanation of what the protocol/standards are.
- The system components: feeds, aggregators, distributed databases, and readers (above).
- Picture/diagram of how the system will work.
- How we’ll roll it out (three stages).
- Where to discuss and develop further.
All of these technical details are very much open to debate. I might not have understood some important considerations that make this proposal unrealistic, or that make some other proposal much superior. So by all means do not spare my feelings. I want us to be as sure as we possibly can that, from a technical point of view, this is the best way to create a centerless, leaderless network of encyclopedias and encyclopedia articles.
DRAFT UNDER CONSTRUCTION