The Encyclosphere: A Technical Proposal

In this post, I want to break down the basic technology of the Encyclosphere—but bear in mind this is more of a proposal for review than a finished concept. Bear also in mind that “Introducing the Encyclosphere” is a better source for general reflections about the Encyclosphere and how the Knowledge Standards Foundation will help organize it.

As a technical description of the Encyclosphere, again, this is just a rough draft. A finalized proposal will need detailed input from many experts. I do not propose to move quickly, or to move at all without a lot of feedback from all potential users of the standard. The top-level, hard-to-change architecture of the Encyclosphere is precisely what we must make sure to get right.

An excellent way forward would be for people to pick one topic here to discuss, write a long blog post about just it, and we’ll talk it all over.

First, let us consider a concept that is similar to the Encyclosphere but easier to understand.

Suppose you wanted to make a single proprietary database of all encyclopedia articles online, with the information in the database in a common format, so that, regardless of the original source, all the articles looked uniform and great. Suppose the database also supported data about authors, publishers, ratings by users, and users (including raters). Then there could be a single “super encyclopedia” encompassing the whole, with multiple articles per topic and the articles arranged in order of rating. If user information included user categories, then you could sort and re-sort articles according to different categories.
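To make the re-sorting idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Rating` record, the 1-to-5 score scale, the user categories, and ranking by mean score are not part of any agreed design. It only shows that once ratings and rater categories live in the same data set, ordering the articles on a topic by a different category of rater is a small filtering step.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Rating:
    """One user's rating of one article (hypothetical 1-5 scale)."""
    article_id: str
    user_id: str
    score: int


def rank_articles(
    ratings: List[Rating],
    user_categories: Dict[str, Set[str]],
    category: Optional[str] = None,
) -> List[str]:
    """Return article IDs ordered by mean score, highest first.

    If `category` is given, only ratings from users in that category count.
    Ties are broken alphabetically by article ID so the order is stable.
    """
    scores: Dict[str, List[int]] = defaultdict(list)
    for rating in ratings:
        rater_cats = user_categories.get(rating.user_id, set())
        if category is None or category in rater_cats:
            scores[rating.article_id].append(rating.score)
    return sorted(scores, key=lambda a: (-mean(scores[a]), a))


# Made-up sample data: two competing articles on the same topic.
USER_CATEGORIES = {
    "alice": {"biologist"},
    "bob": {"student"},
    "carol": {"biologist", "student"},
}
RATINGS = [
    Rating("cell-a", "alice", 5),
    Rating("cell-a", "bob", 2),
    Rating("cell-b", "alice", 3),
    Rating("cell-b", "bob", 5),
    Rating("cell-b", "carol", 4),
]

print(rank_articles(RATINGS, USER_CATEGORIES))               # ['cell-b', 'cell-a']
print(rank_articles(RATINGS, USER_CATEGORIES, "biologist"))  # ['cell-a', 'cell-b']
```

The same two articles come out in a different order depending on whose ratings are counted, which is the point of keeping user categories alongside ratings.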

That is sort of like what I want us to build. By the way, what do I mean by “us,” you ask? I mean a global decentralized movement—not a nonprofit organization like the KSF, not a corporation, not an app—that anyone can participate in using stuff they fully own.

I wouldn’t be a big fan of the above idea, although it is similar to, but importantly different from, what the Encyclosphere will be.

There would be too many practical—technical, legal, logistical, and other—difficulties with that concept. One big technical difficulty would be that scrapers would constantly break, meaning that translating a zillion different encyclopedia article formats into one common format would be a gargantuan task for any one team. The biggest legal difficulty, of course, would be copyright violations and securing and renewing permissions. There would also be extreme logistical difficulties in getting enough people to agree to use your system to rate articles. People will go to the considerable trouble of rating articles only if they can be sure their data is equally available to everyone. By the same token, if it’s a centralized, proprietary system, then not many people will want to write new articles for it. Google, with its old “Knol” project, tried and failed to do that.

But suppose we change the concept so that, instead of a centralized database, we have a series of standards-compliant feeds, i.e., constantly updating lists of article versions, metadata, and ratings data, all formatted similarly according to a common encyclopedia specification. We make both the standard and the feeds open, i.e., anyone can study them, use them, and build on top of them. Then a few feed aggregators monitor those feeds constantly and load them into a distributed database such as IPFS, or maybe many different (independent) servers with open APIs. Finally, encyclopedia readers would be apps that draw on those distributed databases to build their own selections of articles. Does all this sound familiar? It should. It is roughly the same basic architecture as the Blogosphere.
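The feed, aggregator, and reader roles can be sketched in a few lines of Python. This is a toy under stated assumptions, not a proposed design: `FeedEntry`, `Aggregator`, and their fields are hypothetical names, the feeds are plain in-memory lists rather than anything fetched over a network, and an ordinary dictionary stands in for the distributed database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class FeedEntry:
    """One article version as it might appear in a publisher's feed (hypothetical)."""
    source: str      # which encyclopedia published it
    article_id: str  # the publisher's own internal ID
    version: int
    title: str


class Aggregator:
    """Collects entries from many feeds, keeping the newest version of each article."""

    def __init__(self) -> None:
        # A dict stands in for the distributed database here.
        self._store: Dict[Tuple[str, str], FeedEntry] = {}

    def ingest(self, feed: Iterable[FeedEntry]) -> int:
        """Load a feed; return how many entries were new or newer than what we had."""
        changed = 0
        for entry in feed:
            key = (entry.source, entry.article_id)
            current = self._store.get(key)
            if current is None or entry.version > current.version:
                self._store[key] = entry
                changed += 1
        return changed

    def search(self, term: str) -> List[FeedEntry]:
        """What a reader app might call: all stored articles whose title contains `term`."""
        term = term.lower()
        hits = [e for e in self._store.values() if term in e.title.lower()]
        return sorted(hits, key=lambda e: (e.source, e.article_id))


# Two made-up publishers, each with its own feed.
feed_a = [
    FeedEntry("alpha.example", "1", 1, "Photosynthesis"),
    FeedEntry("alpha.example", "1", 2, "Photosynthesis"),
]
feed_b = [FeedEntry("beta.example", "x9", 1, "Photosynthesis in algae")]

aggregator = Aggregator()
aggregator.ingest(feed_a)
aggregator.ingest(feed_b)
for hit in aggregator.search("photosynthesis"):
    print(hit.source, hit.title, "v" + str(hit.version))
```

Because the aggregator only depends on the shape of a feed entry, anyone could run one, and a reader app could query several of them; nothing in the sketch requires a single central operator.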

To achieve that would be very difficult. For one thing, you’d have to translate articles in a bunch of different formats (such as MediaWiki markup), with different CSS classes, etc., into a common format. But that much seems possible, considering that encyclopedia articles do, after all, have much in common: they have metadata such as titles, authors, internal article IDs, publication dates, version numbers; content elements such as paragraphs, images and captions, tables, “infoboxes,” footnotes, bibliographies, and more; and it doesn’t take much to imagine complementary user data, including categories and users’ article ratings, perhaps built on top of something like this developing identity system.
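As a rough illustration of what "a common format" could mean, the following Python sketch models an article as a title, authors, and an ordered list of typed content elements. The element types, field names, and the plain-text renderer are all assumptions for the sake of the example; an actual specification would need many more element kinds (tables, infoboxes, bibliographies) and would be decided by the community.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Paragraph:
    text: str


@dataclass
class Image:
    url: str
    caption: str


@dataclass
class Footnote:
    number: int
    text: str


# Only three element kinds are modeled here; this is deliberately incomplete.
Element = Union[Paragraph, Image, Footnote]


@dataclass
class Article:
    title: str
    authors: List[str]
    elements: List[Element] = field(default_factory=list)


def render_plain(article: Article) -> str:
    """Render an article uniformly as plain text, whatever its original source."""
    lines = [article.title, "by " + ", ".join(article.authors), ""]
    for element in article.elements:
        if isinstance(element, Paragraph):
            lines.append(element.text)
        elif isinstance(element, Image):
            lines.append(f"[Image: {element.caption}]")
        elif isinstance(element, Footnote):
            lines.append(f"[{element.number}] {element.text}")
    return "\n".join(lines)


# A made-up article, as a scraper or publisher feed might produce it.
article = Article(
    title="Mitosis",
    authors=["A. Author"],
    elements=[
        Paragraph("Mitosis is how a cell divides."),
        Image("https://example.org/mitosis.png", "Stages of mitosis"),
        Footnote(1, "See any cell biology text."),
    ],
)
print(render_plain(article))
```

Once two sources have been translated into the same structure, a single renderer (or editor, or search indexer) works for both, which is what would let articles from different publishers look uniform.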

Let me give an idea of what the specification document(s) will likely cover, in the opinion of someone who admittedly isn’t exactly an expert on standards. In general, I anticipate three main parts, concerning (1) article metadata, (2) article content, and (3) user data and user ratings. These strike me as being of increasing difficulty.

Article metadata includes things displayed on the page that are not part of the article proper, such as title, author name, source or publication, publication date, license, etc. It would also include things not displayed on the page, such as URL, some sort of ID, version number, etc.
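Here is one way such a metadata record might look as a feed item, sketched in Python with JSON as the wire format. The field names, the choice of JSON, and the split into "displayed" and "not displayed" fields simply mirror the lists above; none of it is a settled part of the standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class ArticleMetadata:
    # Displayed on the page, but not part of the article proper
    title: str
    author: str
    source: str
    published: str  # ISO 8601 date string, e.g. "2019-11-05"
    license: str
    # Not displayed on the page
    url: str
    article_id: str
    version: int


DISPLAYED_FIELDS = ("title", "author", "source", "published", "license")


def to_feed_item(meta: ArticleMetadata) -> str:
    """Serialize one record as a JSON string suitable for a feed."""
    return json.dumps(asdict(meta), sort_keys=True)


def from_feed_item(text: str) -> ArticleMetadata:
    """Parse a feed item back into a record; raises TypeError on missing or extra fields."""
    return ArticleMetadata(**json.loads(text))


def displayed(meta: ArticleMetadata) -> Dict[str, object]:
    """Return only the fields a reader app would show next to the article."""
    record = asdict(meta)
    return {name: record[name] for name in DISPLAYED_FIELDS}


# A made-up record for illustration.
meta = ArticleMetadata(
    title="Photosynthesis",
    author="A. Author",
    source="Alpha Encyclopedia",
    published="2019-11-05",
    license="CC BY-SA 4.0",
    url="https://alpha.example/photosynthesis",
    article_id="alpha-1",
    version=2,
)
item = to_feed_item(meta)
print(item)
print(from_feed_item(item) == meta)  # True
```

A record this small is cheap to publish and to aggregate, which is part of why metadata looks like the easiest place to start.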

As I will explain below, this is probably the lowest-hanging fruit, one that we can get to work on soonest. By dumping feeds of relatively simple metadata about encyclopedia articles into a database, we could support surprisingly full-featured search engines (not just one). This doesn’t mean it will be easy. Simply scraping the data reliably from existing sources and persuading publishers to publish feeds themselves would be significant enough challenges to last us a while. One difficult issue is finding a decentralized way to determine that two articles are about the same topic, without determining what

Article content includes the actual paragraphs, tables, images, etc., that make up the content of an encyclopedia article; we might also consider including other categories of (stand-alone) reference information, such as tables, graphs, lists, galleries, and so forth. This is necessary to include in the specification if anyone wants an app that includes “local copies” of articles, rather than just having links to pages, or pages in an <iframe>. Local copies would be nice to have not just because the result would look nicer; in fact, that is a relatively unimportant and trivial feature. The real advantage of including article content in the specification (and feeds) is the ability to creatively compile, search, manipulate, archive—and edit. This would support a genuinely decentralized system of reading and editing, not just searching.

Perhaps the biggest hurdle, when it comes to settling on a common standard for article content, is determining what format should be used for the content of the article. Some will argue that MediaWiki markup should be adopted; but after discussion, we might decide that that language is too complex and idiosyncratic. Other possibilities include putting the content specification in terms of HTML5, with lots of extra rules regarding such things as titles and classes. Another possibility is Markdown, the markup language known to many developers; but it would have to be greatly expanded to be as expressive as full-featured encyclopedia articles (and, possibly, other reference information) require.

User data and ratings. You might wonder why I would mention “user data” in a discussion of encyclopedia standards. Mostly it is because, if we have multiple articles per topic, we really need to

Outline of the rest:

  • More technical explanation of what the protocol/standards are.
  • The system components: feeds, aggregators, distributed databases, and readers (above).
  • Picture/diagram of how the system will work.
  • How we’ll roll it out (three stages).
  • Where to discuss and develop further.

All of these technical details are very much open to debate. I might not have understood some important considerations that make this proposal unrealistic, or that make some other proposal much superior. So by all means do not spare my feelings. I want us to be as sure as we possibly can that, from a technical point of view, this is the best way to create a centerless, leaderless network of encyclopedias and encyclopedia articles.

DRAFT UNDER CONSTRUCTION

By Larry Sanger


6 comments

  1. Larry, Great starting document! (BTW, at least in Debian, Mozilla can’t lose the sidebar Activity/Post, etc. so I copy-n-pasted to read it.)

    I’m sure I’m going to mis-step but are you saying we would create search results for a limited slice of online content known as encyclopedias? That is we aren’t going to try to compete with Google and boil about 15% of the ocean but select a sub-set that is amenable to further processing. Yes?

    Without even looking I know that sub-set numbers in the thousands but the ones I know of are fairly stodgy so there would be change but not at a break neck pace. Some even have APIs but if they are in CommonCrawl results, might be easier just to mine there.

    In terms of data scope, would archive.org and arXiv.org be in or out as encyclopedias? Or perhaps supplemental data to encyclopedias? Matching across authors + titles. Just curious. I suppose that awaits development of an inclusion criteria for “encyclopedias.”

    Thanks again for kicking this off!

    1. Thanks very much for the feedback, Patrick!

      You’re certainly right that we could try to apply this concept to all different kinds of content. I’ve thought quite a bit about that. But if there are to be ratings, then there must be clear editorial standards for ratings…and that requires having different categories of content, it seems to me. So if we can get this to work for encyclopedia articles, then we can start defining “knowledge standards” and ways of settling upon definite topics for other kinds of web pages and websites.

      By the time the encyclopedia standards are well settled, the way forward should be clear.

      I don’t think archive.org is an encyclopedia. An encyclopedia contains articles that give introductions to what is known about specific named topics.

      Yes, at some point there will need to be an agreed-upon definition of “encyclopedia article” (not necessarily “encyclopedia”).

  2. Article content format is going to be very difficult. If you’re going to try to import preexisting content, then the conversion is bound to be lossy. Note that blog feeds (RSS/Atom) did not have to be a complete representation of the raw blog post because they linked back to the original in its own format. You could skim the feed, and a lot of the time you could read the whole article from the feed, but some of the time you would need to follow the link to get the full content.

    The vision for the encyclosphere seems to be that a “reader” is all you need to read articles, you don’t need to fall back to an upstream copy. That’s ambitious.

  3. There are a couple sentences here that seem cut off, and the paragraph ends without a period:

    * “One difficult issue is finding a decentralized way to determine that two articles are about the same topic, without determining what ”

    * “Mostly it is because, if we have multiple articles per topic, we really need to ”

    As for the content, I would definitely be an advocate of Markdown as a format if it’s an option. Ideally, though, my vision would be to *not have* a required content type for encyclosphere articles. I think that’s the only perfect way to get around Anomaly UK’s concern about converting existing content to another format. Though, I’m new here, so I don’t know if this has already been discussed and shot down.

  4. Idk if it’s already been discussed, but I have a few things to say-

    – The 3 points mentioned in the article here appear similar to what one does in academic research-
    1. We have the meta details which we use to identify the publication- title, author, publication-date, source or doi etc.
    2. Then we have the publication content itself having all the details
    3. And the credibility is based upon the citations of the publications.. in our case the ratings are relevant.
    We have this template (and the publication industry’s whole set of standards) to learn from as a case study for what we are doing.

    – We can also think about org-mode format of articles instead of markdown. In my opinion, org-mode in emacs can handle the complexity that is required for our needs. And it is also plain-text just like markdown.

    – The next problem will be to format all the output across the platforms in a consistent way. To convert the articles on different websites to our standards, we can work to develop some tool like pandoc (or extend it further) that can convert multiple document types into one another.
