Guest post: Protocol Areas

The goal of KSF is to provide protocols for online encyclopedias.

At the top level, this means:

  • How to publish an article
  • How to read a published article
  • How to search published articles
  • How to apply a rating of some kind to an article

Some of this is at the level of transports: how articles are moved around the network (webservers, torrents, IPFS, …). We probably want to support multiple transports. I’m not going to address them here.

Searching is also going to be open-ended: the goal is to enable multiple reader applications that can be used to search and navigate a virtual encyclopedia of articles selected according to some criteria that their makers consider good. Therefore many search and navigation interfaces should blossom. KSF needs to provide the underlying search primitives to make that possible.

Search ultimately has to be driven by two things: the content of articles, and claims that more or less trusted sources have made about particular articles. The first technical task is to define the format of article content, article metadata, and third-party claims, so as to enable search.

I’m not going to propose a data model in this post. Instead, I’m going to try to list the requirements that the data model will have to meet.

Article Data Model

  • Article Content
  • Article Title
  • Article ID
  • Article Revision ID
  • Article Parent revision
  • Article Author (e.g. DIF value)
  • Article Topic (ontology, tags, keywords, …)
  • Article Status (“encyclopedia entry”, “discussion draft”, “blog post”, …)
  • Related articles, by relation (“alternative to”, “includes content from”, “refers to”, “elaborates on”, …)
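To make the list above concrete, here is a minimal sketch of how those fields might map onto a data structure. It is not a proposed data model: the class names, field names, types and defaults are all assumptions made for illustration, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    kind: str       # e.g. "alternative to", "refers to", "elaborates on"
    target_id: str  # Article ID of the related article


@dataclass(frozen=True)
class Article:
    content: str
    title: str
    article_id: str
    revision_id: str
    parent_revision: Optional[str]  # None for the first revision (assumption)
    author: str                     # e.g. a DIF value; plain string here
    topics: tuple[str, ...] = ()    # ontology terms, tags, keywords, ...
    status: str = "discussion draft"
    related: tuple[Relation, ...] = ()


# Hypothetical example values, for illustration only.
first = Article(
    content="Water is a compound of hydrogen and oxygen.",
    title="Water",
    article_id="article-1",
    revision_id="rev-1",
    parent_revision=None,
    author="author-key-xyz",
    topics=("Chemistry",),
    status="encyclopedia entry",
)
```

Making the record immutable is a deliberate choice in this sketch: a published revision should not change, and any edit would produce a new revision that points back at its parent.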


Rating Data Model

  • Rating target type (article, collection, revision, author, delegated rater, …)
  • Rating target ID
  • Rating date
  • Collection of:
      • Rating classification (appropriateness, format, accuracy, …)
      • Rating parameters (topic, ?)
      • Rating value
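As a rough illustration, the rating fields could be laid out as below. This reads the classification, parameters and value bullets as the fields of each item in the “Collection of” entry, which is my interpretation of the list. All names and types are assumptions, and the numeric value scale is invented.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RatingEntry:
    classification: str  # "appropriateness", "format", "accuracy", ...
    value: float         # the scale is an assumption; the post does not fix one
    parameters: dict = field(default_factory=dict)  # e.g. {"topic": "Chemistry"}


@dataclass
class Rating:
    target_type: str  # "article", "collection", "revision", "author", ...
    target_id: str
    rated_on: date
    entries: list = field(default_factory=list)  # list of RatingEntry


# Hypothetical example: one rater scoring a revision on two axes.
rating = Rating(
    target_type="revision",
    target_id="rev-1",
    rated_on=date(2020, 1, 1),
    entries=[
        RatingEntry("accuracy", 0.9, {"topic": "Chemistry"}),
        RatingEntry("format", 0.5),
    ],
)
```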

All these need to be included in data structure definitions that people can write software against, and which can be serialised, probably as XML. Lots of them are hard:

The article content format needs to be something that casual users can author. That rules out the higher-fidelity formats available (LaTeX, DocBook, …). Probably it means some subset of HTML, like MediaWiki. I am tempted to wish for a choice of two formats, an easy-edit format plus a high-fidelity format, but there would be very limited interchange between the two, and there is a good argument for insisting on a single portable canonical format. Note that it should be possible to render the articles from the canonical format into different views for different readers. That is not too hard. But multi-directional lossless transformations are probably not feasible: there must be a single canonical source for each entry.

Article ID sounds easy but it isn’t. Zooko’s Triangle comes into play. Article IDs must either be non-human-readable opaque blobs (e.g. a hash of the initial published version), or be named according to some centralized hierarchy (e.g. as owner of a DNS domain, I and only I can name an article “ksf://”). It is essential that, given an article ID, a search tool can reliably find the correct content for that ID. I think we should go with opaque article IDs, and attach scoped names at the level of ratings, but that’s TBD.
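The opaque-ID option can be sketched in a few lines. This assumes SHA-256 over the raw bytes of the initial published version and a made-up “sha256-” prefix; neither the hash function nor the ID format has been decided, and the function names are hypothetical.

```python
import hashlib


def opaque_article_id(initial_revision: bytes) -> str:
    """Derive an opaque article ID from the initial published version."""
    # The "sha256-" prefix is an illustrative convention, not a KSF format.
    return "sha256-" + hashlib.sha256(initial_revision).hexdigest()


def matches_id(article_id: str, initial_revision: bytes) -> bool:
    """Check that content offered for an ID really hashes to that ID."""
    return opaque_article_id(initial_revision) == article_id


original = b"Water is a compound of hydrogen and oxygen."
article_id = opaque_article_id(original)
```

With an ID of this kind, a search tool can check whatever content it is handed against the ID itself, without trusting the transport that delivered it. The cost is the one the triangle predicts: the ID means nothing to a human reader.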

Article Topic is the general concept of classifying articles according to content. This is hugely difficult but there’s a ton of prior art. I hope we have or will have experts in the domain to work on it. However, it’s debatable whether classification really belongs in the article, or as something that can be attached to the article, or both. If I publish an article, but someone else thinks they can classify it more accurately, do they need to republish a revision, or can they just attach a classification?

Articles need to be signed cryptographically by their author. We need to decide on a signing protocol, but in principle that’s relatively easy. (Although once you start to deal with things like delegating authority to an app, or revoking keys, it stops being easy.) An author “identity” has to include a public key to verify that the article is written by the author. It may or may not include more than that.

Ratings are the most important part of the system. When some reader finds an article, that will be because some person or group has rated it as worth reading. Ratings are signed in the same way as articles.

An article rating will always reference a specific revision of an article. But it may be defined in such a way that it will still be considered to apply to, for instance, a later revision of the same article published by the same author. Or you could apply a rating directly to an author, or to another rater (delegated rating). For example: “Author with public key XYZ writes good articles under the topic of Chemistry”.
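One possible carry-over rule, sketched under assumptions: a rating of a revision also applies to a later revision if that later revision descends from the rated one and every revision on the path has the same author. The rule itself, the `Revision` record and the `rating_applies` function are all hypothetical; the real policy is undecided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Revision:
    revision_id: str
    parent_revision: Optional[str]  # None for the first revision
    author: str                     # stand-in for the author's public key


def rating_applies(rated_id: str, candidate_id: str, revisions: dict) -> bool:
    """Return True if a rating of rated_id carries over to candidate_id."""
    rated = revisions[rated_id]
    seen = set()  # guards against malformed, cyclic parent links
    current = revisions.get(candidate_id)
    while current is not None and current.revision_id not in seen:
        seen.add(current.revision_id)
        if current.author != rated.author:
            return False  # someone else edited along the way
        if current.revision_id == rated_id:
            return True   # walked back to the rated revision
        current = revisions.get(current.parent_revision)
    return False


# Hypothetical history: r1 -> r2 by author A, then r3 by author B.
revs = {
    "r1": Revision("r1", None, "A"),
    "r2": Revision("r2", "r1", "A"),
    "r3": Revision("r3", "r2", "B"),
}
```

Here a rating of r1 would carry over to r2 but not to r3, and a rating of r2 would not reach back to r1.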




By Anomaly UK

Software Developer and blogger

1 comment

  1. There’s another API object that needs to be defined, a “collection” such as a feed or a browseable website.

    That’s fundamentally just a collection of articles and ratings, and some metadata. One important piece of metadata is the “normal location”: a URI at which you can find the latest version of this feed.

    We can’t rely on a hash-addressed store like a torrent for that purpose, because those can’t be updated.
