Guest post: Encyclosphere – possible structure and article format
A vision of some of the details of Encyclosphere from the perspective of the author.
“Everyone is entitled to his own opinion, but not to his own facts.” – Daniel Patrick Moynihan
While a clever truism, this statement glosses over many real-life complexities. In reality, a given group’s set of ‘accepted facts’ is more uncertain than the confidence with which those facts are stated would imply, and surprisingly often they are outright false. In addition, many things positioned as facts are merely speculation, rumor, or opinion.
On top of this, articles and statements are created by people and people have bias due to their specific set of beliefs. This is unavoidable. Whenever presenting information, authors choose to include details they consider important and exclude others they do not. This involves a value judgement. Even if all the facts in an article are agreed upon, which details are chosen to be highlighted and which are chosen to be ignored will reflect the bias of the authors. The end result is that articles do not present a neutral view of reality no matter the stated goals of such authors to do so. It is the nature of knowledge that a neutral point of view is impossible.
Wikipedia has become the dominant online encyclopedia, but Wikipedia does not allow for multiple points of view, and doing so is likely not possible within a single organization anyway. By its nature an organization will reflect the views of its owners, the majority of its members, or the governing body that controls it. Dissenting views will naturally be suppressed. As in Wikipedia, a clique of similarly minded members will rise to the top as gatekeepers, and censorship will be done through mechanisms such as ‘notability’ or ‘trusted source’ requirements, where those in control are the arbiters of ‘what’ is notable and ‘who’ is trusted. Everything stated may appear true and reasonable, while behind the scenes contradicting information or alternative views have been suppressed, giving an exaggerated impression of consensus, of certainty, and of what is important.
Wikipedia’s notability requirements are problematic for other reasons as well. With nearly unlimited storage and well-established, scalable search techniques, there is no reason every person on Earth couldn’t have their own Wikipedia page should they choose to, or that niche hobbyists and other special interest groups couldn’t have pages that are of great interest to them but that Wikipedia gatekeepers consider unworthy. Wikipedia is not a book that one is going to browse from beginning to end. Given current technology, artificial notability requirements feel unnecessary and political in nature. Similarly, Wikipedia’s “no original research” requirement blocks possibly important contributions, as these contributions must first pass the filter of “reliable sources” through vetted external gatekeepers before they are considered valid.
Encyclosphere’s core goal is to facilitate an encyclopedia-style representation of knowledge with the ability to accommodate multiple points of view on any subject a given group or individual wishes to write about. To do so, such a system needs to be decentralized. No central author. No owner. No one group with majority say. No gatekeepers. And, in order to be useful, such a decentralized system needs standard ways to communicate, which is where the Knowledge Standards Foundation (KSF) comes in.
Below is a conceptual view of the Encyclosphere ecosystem
Elements of the ecosystem
Raw data is any source of data used to create structured encyclopedia articles. Providers convert raw data into structured articles. For example, the current Wikipedia data is ‘raw data’ because for the most part it is weakly structured and not machine readable/processable. Providers do the work of converting Wikipedia articles into structured articles. There may be many different attempts and versions. Providers may also choose to omit or alter Wikipedia articles as fits their own values and mandates.
Other sources of raw data are the same sources used to back Wikipedia articles: public databases, news organizations and articles, blogs, websites, and so on.
Providers are individuals or organizations who create content. They may create very niche content (e.g. advanced medical articles, niche hobbies, political interests, wikis, fandoms or other), or very general content aimed at the general population (Wikipedia, government bodies, IMDb, etc). It’s completely up to them.
Have editing tools for their users to contribute information.
Handle author login and administration.
Provide Internet accessible feeds for this information over HTTP, IPFS, torrent or other.
Aggregators are search engines that visit and extract updates from various providers, provide a query mechanism, and present information in a format suitable for reading on a computer, phone, etc.
Extract information from Provider feed and index it.
Provide query mechanisms for users to search the information.
Present the information for viewing/reading.
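As a rough illustration of the three aggregator responsibilities listed above, the Python sketch below ingests provider feeds, builds a word index, and answers queries. The Aggregator class, its method names, and the idea that a feed is a simple mapping of article IDs to plain text are all assumptions made for this example; nothing here is part of any KSF specification.

```python
from collections import defaultdict


class Aggregator:
    """Toy aggregator: ingests provider feeds, indexes words, answers queries."""

    def __init__(self):
        self.articles = {}             # (provider, article_id) -> article text
        self.index = defaultdict(set)  # lowercase word -> set of article keys

    def ingest(self, provider, feed):
        """Index a feed, assumed here to be a dict of article id -> text."""
        for article_id, text in feed.items():
            key = (provider, article_id)
            self.articles[key] = text
            for word in text.lower().split():
                self.index[word.strip(".,;:!?")].add(key)

    def search(self, query):
        """Return the sorted keys of articles containing every query word."""
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        return sorted(hits)


agg = Aggregator()
agg.ingest("provider-a", {"a1": "Queen Victoria reigned from 1837."})
agg.ingest("provider-b", {"b1": "Victoria is a city in Canada."})
print(agg.search("queen victoria"))
```

Because each key records which provider an article came from, the same subject can appear once per provider, which is how differing points of view could sit side by side in the results.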
Users are those people searching for encyclopedia-style information. They visit the aggregators they trust, those that have selected and curated the available information in a way that is closest to their own values and biases.
Enter queries to search for the information they want.
Read the presented information.
Combine and organize the information for display to others (graphs, tables, their own articles on blogs, news sites, social media and so on).
Knowledge Standards Foundation (KSF)
The KSF provides standards and tools for authors to write and distribute articles in a consistent and compatible way.
Provides formatting standards for articles and distribution.
Provides examples and tools to help authors and aggregators get started.
Receives feedback from authors and aggregators to incorporate and adjust the standards as necessary.
Advertises and promotes the Encyclosphere effort to gain awareness and growth.
While the providers, aggregators, authors and users are presented as separate areas of the ecosystem, in practice it is expected there will be overlap of responsibilities in many areas. Many providers will likely choose to provide a way to query and view their content directly and so also take on the functions of aggregator. Authors and users will of course overlap as well.
The most important role of KSF is to outline a standard way for authors to publish their information. Such a standard needs to:
Provide a minimum set of consistent semantics and syntax. What are the elements of an article and what do they mean?
Be machine readable in such a way that computers can also understand the semantic meaning of core elements of a document.
Be extendable both by KSF and by publishers and aggregators outside of KSF for niche functionality.
Have as little inherent bias as possible.
It is proposed that the article format be XML-based with a combination of custom elements and RDF-inspired attributes. HTML might initially seem like a good option due to its broad usage and familiarity; however, HTML is not semantic in origin. Attempting to extract meaning from generic HTML is an exercise in heuristic, error-prone guesswork. HTML has also grown quite vast in size and complexity, containing well over 100 tags, 100 attributes, dozens of style settings and programming hooks, the vast majority of which are not applicable to general read-only encyclopedic-style articles.
For now it is proposed that there is no schema or DTD and that tags are ‘convention-based’ (described in words rather than in a machine-verifiable schema language). KSF would propose a base set of tags and their roles, but providers would be free to add custom tags as they like, and aggregators could choose to support or ignore custom tags as they like. Through common use these tags may then become de facto standards without any need for official sanctioning by KSF. Early HTML evolved through a similar organic growth of tags, as did other standards such as RSS.
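One way an aggregator might cope with convention-based tags is to keep the text inside any tag it does not recognize while noting the tag name for later review. The Python sketch below is only an illustration of that idea: the KNOWN_TAGS subset, the render function, and the custom breed tag are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed subset of the base tags this particular aggregator supports.
KNOWN_TAGS = {"article", "section", "p", "emphasis", "entity"}


def render(elem, unknown):
    """Return the readable text of elem, recording unrecognized tag names."""
    if elem.tag not in KNOWN_TAGS:
        unknown.add(elem.tag)  # ignore the tag itself, keep its contents
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, unknown))
        parts.append(child.tail or "")
    return "".join(parts)


doc = (
    "<article><p>Java is a <breed>Labrador</breed> and a good "
    "<emphasis>pet</emphasis>.</p></article>"
)
unknown_tags = set()
print(render(ET.fromstring(doc), unknown_tags))
print(unknown_tags)
```

A provider's custom tag therefore never breaks the article for readers, and an aggregator can count how often each unknown tag appears to decide which ones are worth supporting.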
Possible XML tags:
article – the article root element
id – the ID of this article representing the main entity, category or concept of this article.
repository – a link and namespace of the repository of IDs used by the ontological elements in this document. There can be multiple repositories.
meta – information about the article that is valuable for indexing and/or providing a summary or info-box.
section – indicates a block of related content. Sections can contain sections i.e. subsections.
emphasis – indicates that a word or phrase has more weight in a sentence.
code(language="java, cpp, python") – used to signal that the content differs from regular natural-language text and likely should be parsed or presented a certain way depending on the language specified.
list – a list of items. Can contain li list items.
li – an item inside of a list
p – paragraph
math – Used to indicate equations
quote(url) – To quote an external source verbatim. Optional url attribute points to the original source.
media(url, mimetype) – used to point to media such as images, audio, video, vector drawings, interactive media, html and possibly others. The type of media would be determined from the extension of the url e.g. *.mp4 for an MP4 encapsulated video file. In the case of ambiguous url extensions the mime type could be stated explicitly. The presentation layer may optionally embed this media directly or show it as a link.
data(type="csv, xml, json") – used to indicate raw data that ‘may’ be presentable in a table, graph or other format.
label – used to add a label to article, section, list, math, quote, media and data elements
entity(id) – used to indicate a ‘thing’, person, place, category or concept. Basically anything that can be used as a noun part-of-speech in a sentence. The contents inside an entity tag are for display and can be any value that is synonymous with the id of this entity. The contents can also be empty if no display is desired.
property(id) – used to indicate a relationship or property. The contents inside a property tag are for display and can be any value that is synonymous with the id of this property. The contents can also be empty if no display is desired.
value(id, v) – used to indicate a measurable value along a specific dimension (time, distance, mass, temperature, value in USD, location, etc). The id is the entity that defines the dimension and its measurement standards, as well as a canonical description of the value representation (likely as a regular expression). The v attribute is the canonical representation of the value. The contents of the value are for display. The attribute ‘v’ can be omitted if the contents are shown in canonical form. If ‘v’ is specified the contents can also be empty.
time(id, value) – A time range, or specific instance in time for which a statement is applicable. Time range can be either a named time period (an entity) or a value quantity.
location(id, value) – An area or specific point in space (typically somewhere on the surface of the Earth) where a statement is applicable. Location is specified by either a named location (an entity) or by a coordinate value.
z – used to indicate the end of a sentence where there would otherwise be ambiguity.
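To make the value(id, v) rule from the list above concrete, here is a minimal Python sketch of how an aggregator might recover the canonical form of a value: use the v attribute if present, otherwise fall back to the contents, and accept the result only if it matches the dimension's pattern. The dimension id "date", its regular expression, and the canonical_value function are hypothetical; in practice the pattern would come from whatever repository defines the dimension.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical registry: dimension id -> regex describing the canonical form.
CANONICAL_FORMS = {"date": re.compile(r"\d{4}-\d{2}-\d{2}")}


def canonical_value(elem):
    """Return the canonical value of a <value> element, or None if invalid."""
    raw = elem.get("v")
    if raw is None:
        # No v attribute: the contents themselves must be in canonical form.
        raw = (elem.text or "").strip()
    pattern = CANONICAL_FORMS.get(elem.get("id"))
    if pattern is None or not pattern.fullmatch(raw):
        return None
    return raw


with_v = ET.fromstring('<value id="date" v="1819-05-24">May 24, 1819</value>')
without_v = ET.fromstring('<value id="date">January 22, 1901</value>')
print(canonical_value(with_v))     # display text differs, v carries the value
print(canonical_value(without_v))  # contents are not canonical and v is absent
```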
For those familiar with typical document structures, most tags are fairly straightforward. The entity, property, value, time and location tags are special ontology tags, used to add machine-readable meaning to encyclopedia documents. The ideas here are borrowed from the Resource Description Framework (RDF) model and from Wikidata (which also uses concepts from RDF). RDF is complex and verbose and is not intended to be written or read directly by people. Here I try to capture some of the ideas of RDF in a simplified manner that can be presented in a compact fashion and easily read and understood by humans.
‘entity’ – is an identifiable object, category, or concept e.g. ‘dog’. An entity can be a specific instance e.g. ‘my dog Java’, a specific or general category e.g. ‘Labrador Retriever’, or an abstract ‘concept’ e.g. ‘pet’. Entities are assigned a unique id to avoid ambiguity across synonyms and different languages. In Wikidata, the ID Q38726 is used to stand for the category of ‘Labrador Retriever’. The entity tag is used to unambiguously refer to an object, category, or concept by linking a label to a unique ID. For example, using the Wikidata entity namespace, we might write: My pet dog Java is a <entity id="Q38726">Labrador Retriever</entity>.
‘property’ – is an attribute of a subject/object e.g. ‘color’. Properties can be used to relate an entity to other entities (‘on’, ‘in’, ‘beside’, ‘part-of’, ‘instance-of’, ‘member-of’ and so on) or to relate an entity to a quantity along a dimension.
‘value’ – is what an entity is related to through a property. It can be an entity, or a quantity value along a dimension e.g. the entity ‘red’ or the color value ‘#FF0000’. Dimensions are things like position, distance, area, velocity, mass, energy, price in USD. They may be continuous or quantized.
‘time’ – is a range of time or a specific instance in time for which a statement is valid e.g. <entity id="Q9439">Queen Victoria</entity> <property id="P31">was the</property> <entity id="Q19643">queen</entity> of <location id="Q21">England</location> from <time value="1837–1901">1837 to 1901</time>.
‘location’ – is an area or specific point in space for which a statement is valid. See the example above for ‘time’ which contains a location example.
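A standard XML parser is enough to separate the machine-readable side of a tagged sentence from its display text. The Python sketch below uses the Queen Victoria sentence from the ‘time’ definition; the wrapping p element, the ONTOLOGY_TAGS set, and the function names are assumptions made for this example, and the sentence has to be wrapped because a bare fragment is not a well-formed XML document.

```python
import xml.etree.ElementTree as ET

ONTOLOGY_TAGS = {"entity", "property", "value", "time", "location"}

# The sentence from the text, wrapped in <p> so that it parses as one document.
SENTENCE = (
    '<p><entity id="Q9439">Queen Victoria</entity> '
    '<property id="P31">was the</property> '
    '<entity id="Q19643">queen</entity> of '
    '<location id="Q21">England</location> from '
    '<time value="1837–1901">1837 to 1901</time></p>'
)


def annotations(xml_text):
    """Return (tag, id-or-value, display text) for each ontology tag."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag in ONTOLOGY_TAGS:
            key = elem.get("id") or elem.get("value")
            found.append((elem.tag, key, (elem.text or "").strip()))
    return found


def plain_text(xml_text):
    """Return what a human reader sees once all tags are stripped."""
    return "".join(ET.fromstring(xml_text).itertext())


for row in annotations(SENTENCE):
    print(row)
print(plain_text(SENTENCE))
```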
Having the specific qualifiers ‘time’ and ‘location’ is a value judgement. Why not just have a general qualifier restriction? That is what Wikidata has chosen to do. The problem with general qualifiers is that it’s hard for a machine to know what to do with this information, such as how to use it to assist in search. Since most qualifiers are either time or space, I am proposing making them explicit to keep things as simple as possible for authors and aggregators to use and to do something useful with. There is always a trade-off between flexibility and interoperability. One could simply specify “use any tag name you want with any attribute name and any content”. Or even more generally, “content is a sequence of bits”. While extremely flexible, such general requirements are unusable without more detail. It is the struggle of any standard to try to hit an acceptable balance between flexibility and interoperability.
Keywords and Statements
There are two main usages for the ontological tags. One is to uniquely and canonically identify key terms across synonyms and languages. Language is ambiguous and the entity tag provides a way to unambiguously refer to an object, category or concept by referring to an ID that applies to all of its synonyms. For example “Dr. Mary Jane”, “Mary”, “Jane, Mary”, “Mrs. Jane” may all refer to the same person while “Dr. Mary Jane” may refer to a different person with exactly the same name and title. These can be equated or differentiated by using an explicit ID to refer to a specific individual e.g. <entity id="Q52345">Dr. Mary Jane</entity> equals <entity id="Q52345">Mary</entity> and does not equal <entity id="Q62345">Dr. Mary Jane</entity>.
This is the simplest of usages for Encyclosphere articles. Although simple, it provides great value in helping search engines to properly identify key search terms and to provide better results.
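A short Python sketch of how a search engine might exploit this: collect every display label under its entity ID, so that a search by ID finds all synonyms and never confuses two people who share a name. The sample paragraph reuses the illustrative IDs from the text; the labels_by_id function is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def labels_by_id(xml_text):
    """Map each entity id to the set of display labels used for it."""
    labels = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter("entity"):
        labels[elem.get("id")].add("".join(elem.itertext()).strip())
    return dict(labels)


TEXT = (
    '<p><entity id="Q52345">Dr. Mary Jane</entity> spoke first. Later '
    '<entity id="Q52345">Mary</entity> met '
    '<entity id="Q62345">Dr. Mary Jane</entity>.</p>'
)

# Two labels collapse onto Q52345, while the namesake stays under Q62345.
print(labels_by_id(TEXT))
```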
The second main usage of ontological tags is to make knowledge statements. <entity id="Q9439">Queen Victoria</entity> <property id="P31">was the</property> <entity id="Q19643">queen</entity> of <location id="Q21">England</location> from <time value="1837–1901">1837 to 1901</time>.
This can then be used across articles for more sophisticated searches such as “monarchs of England between 1700-1800”.
Statements are the combination of entity, property, value and optionally time or location within a single sentence. When parsing a sentence would be ambiguous (such as occurs when there are abbreviations in the sentence), the end of a sentence is indicated directly by using the <z/> tag.
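The grouping of tags into statements can be sketched in a few lines of Python. This version makes two simplifying assumptions that go beyond the text: every sentence is closed explicitly with a z tag (rather than only the ambiguous ones), and the ontology tags are direct children of the paragraph. The statements function and the sample paragraph are hypothetical, with IDs reused from the examples in this article.

```python
import xml.etree.ElementTree as ET

ONTOLOGY_TAGS = {"entity", "property", "value", "time", "location"}


def statements(paragraph):
    """Group ontology tags into statements, closing one at each <z/>."""
    current, result = [], []
    for child in paragraph:
        if child.tag == "z":
            if current:
                result.append(current)
            current = []
        elif child.tag in ONTOLOGY_TAGS:
            current.append((child.tag, child.get("id") or child.get("value")))
    if current:  # trailing statement with no closing <z/>
        result.append(current)
    return result


PARAGRAPH = ET.fromstring(
    '<p><entity id="Q9439">Queen Victoria</entity> '
    '<property id="P263">resided at</property> '
    '<entity id="Q42646">Windsor Castle</entity>.<z/> '
    '<entity id="Q9439">She</entity> <property id="P31">was the</property> '
    '<entity id="Q19643">queen</entity> of '
    '<location id="Q21">England</location>.<z/></p>'
)

for statement in statements(PARAGRAPH):
    print(statement)
```

A fuller implementation would also need to detect ordinary sentence endings from punctuation, which is exactly the ambiguity the z tag exists to resolve.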
Statements can be placed inside meta tags at the beginning of a document such as:
<entity id="Q9439">Queen Victoria</entity>
<property id="P31">was the</property>
<time value="1837–1901">1837 to 1901</time>
<entity id="Q9439">Queen Victoria</entity>
<property id="P263">resided at</property>
<entity id="Q42646">Windsor Castle</entity>
<time value="1819–1901">1819 to 1901</time>
<entity id="Q9439">Queen Victoria</entity>
<value id="Q12138">May 24, 1819</value>
<time value="1819-05-24" />
<location id="Q207385">Kensington Palace</location>
<entity id="Q9439">Queen Victoria</entity>
<value id="Q12138">January 22, 1901</value>
<time value="1901-01-22" />
<location id="Q565155">Osborne House</location>
Statements can also be embedded throughout the document. Statements in both the meta area and embedded in the document can optionally be extracted by aggregators to present as summaries or in infoboxes. Information extracted from statements can also be combined to show information that cuts across documents, such as a table of “monarchs of England sorted by time”.
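Once statements have been extracted from several articles, a cross-document query such as “monarchs of England between 1700-1800” reduces to filtering and sorting on the time qualifier. The Python sketch below assumes the statements have already been flattened into simple dictionaries, which is a hypothetical intermediate format. The Queen Victoria row comes from the example above; the Anne and George III rows are added here purely for illustration.

```python
def parse_range(value):
    """Split a 'start–end' year range as written in the time tag examples."""
    start, end = value.replace("–", "-").split("-")
    return int(start), int(end)


def overlapping(statements, first, last):
    """Return (start, end, entity) rows whose range overlaps [first, last]."""
    rows = []
    for statement in statements:
        start, end = parse_range(statement["time"])
        if start <= last and end >= first:
            rows.append((start, end, statement["entity"]))
    return sorted(rows)  # sorted by start year, i.e. "sorted by time"


MONARCHS = [
    {"entity": "Queen Victoria", "time": "1837–1901"},
    {"entity": "George III", "time": "1760–1820"},  # illustrative row
    {"entity": "Anne", "time": "1702–1714"},        # illustrative row
]

for row in overlapping(MONARCHS, 1700, 1800):
    print(row)
```

This is the practical payoff of making time an explicit qualifier: the aggregator knows which attribute to compare without having to guess what a general-purpose qualifier means.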
Repositories and standards
While publishers should be able to create their own entities and properties specific to their domain, there should also be a ‘base’ set of IDs for the most established entities and properties. KSF should play that role, providing an initial base repository of IDs for entities and properties. It would also be possible and reasonable to use Wikidata’s repository; however, Wikidata does not aim to provide a base, but a comprehensive, all-inclusive set of IDs. This makes Wikidata a centralized gatekeeper of IDs, which conflicts with KSF’s goal of having no gatekeepers. Pointing to and using the KSF repository would be optional, and multiple repositories are allowed for a single document, as specified in the repository tags.
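How IDs from several repositories might coexist in one document can be sketched as a simple prefix lookup in Python. Everything about the syntax here is an assumption: the article does not define how a namespace is attached to an ID, so the "prefix:local" form, the resolve function, and the example.org address for a KSF repository are hypothetical. The Wikidata entity address follows Wikidata's own URI pattern.

```python
# Hypothetical contents of an article's repository tags: namespace -> link.
REPOSITORIES = {
    "wd": "https://www.wikidata.org/entity/",
    "ksf": "https://example.org/ksf/",  # placeholder, not a real KSF address
}


def resolve(entity_id, repositories, default=None):
    """Turn 'prefix:local' (or a bare id plus default prefix) into a link."""
    prefix, sep, local = entity_id.partition(":")
    if not sep:
        # No namespace given: fall back to the document's default repository.
        prefix, local = default, entity_id
    base = repositories.get(prefix)
    return None if base is None else base + local


print(resolve("wd:Q9439", REPOSITORIES))
print(resolve("Q9439", REPOSITORIES, default="ksf"))
print(resolve("other:123", REPOSITORIES))  # unknown repository
```

Because an unknown prefix simply fails to resolve, an aggregator can ignore IDs from repositories it does not trust or support, in the same spirit as ignoring unknown tags.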