The Old Encyclopedia Digitization Project

How much knowledge is buried in old encyclopedias?
We aim to find out, and to make it accessible through modern search engines, readers, and free files on the encyclosphere, with the Old Encyclopedia Digitization Project.

Since last summer, the Knowledge Standards Foundation has been working on an evolving project we call the Old Encyclopedia Digitization Project. We began with this thought: we’ve made a file format for encyclopedia articles (the ZWI file format); wouldn’t it be awesome if we digitized a bunch of old, public-domain encyclopedias and put them into ZWI files?

The vision. Imagine searching hundreds of old encyclopedias from a single search engine, rather than browsing the books one at a time, whether on paper or in the clunky, slow-to-load PDF format. The value should be obvious for students of history, and also for fields with significant historical components, such as theology, philosophy, and the humanities generally. There are many topics where an older perspective reflects different values, concerns, and traditions; the information is, of course, not “up-to-date” in terms of the latest thinking, but then, the latest thinking on some topics (especially in the humanities) mostly reinterprets older information in light of current intellectual trends.

Besides, working with paper-to-ZWI articles will refine our notions about what the ZWI file format needs to support (and has already started doing so: the relevant metadata about paper encyclopedias is different from that about digital-only encyclopedias).

So far: (1) Scanning. After some cursory searching, we emerged with the impression that very few old encyclopedias had been scanned well enough to do an adequate job of OCR (optical character recognition: moving from a digital image of a page to an editable text file). To make a long story short, we learned a fair bit about the scanning process, only to discover in the end that, in fact, there are hundreds and hundreds of nicely scanned volumes available. You just need to know where to look.

So far: (2) OCR. We adopted ABBYY FineReader PDF 15 OCR Editor and satisfied ourselves that the HTML output was editable. We spent quite a few weeks (a process that is still ongoing) refining and documenting our nascent editorial process. (I can send you “Old Encyclopedia Digitization Project: Policies and Procedures” (v.1.1) if you are interested.) I went through the ABBYY process and prepared the HTML of the article, and our lead developer, Dr. Sergei Chekanov, outputted a ZWI file. You can read the result, for what it is worth, here.

So far: (3) zwiformat.rb and more editorial work. The HTML that ABBYY outputs is really bad: messy, with a lot of poor HTML and CSS. As a programmer, one thing I don’t suck at is text manipulation and scraping. So I wrote over 2,800 lines of Ruby (and counting) that take ABBYY’s ugly HTML, images, and the original scans as input, and will soon produce a complete, properly formatted ZWI file as output. Here is the repo, not that it’s of use to anyone but me at this point; it’s definitely a work in progress.
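
To give a flavor of the kind of text manipulation involved, here is a minimal Ruby sketch of cleaning up OCR-generated HTML. It is not taken from zwiformat.rb: the method name `clean_ocr_html` and the sample markup are hypothetical, and it simply assumes the OCR output is cluttered with inline `style` and `class` attributes, `<span>` wrappers, and words hyphenated across line breaks.

```ruby
# Hypothetical sketch: tidy messy OCR-generated HTML using only Ruby's core library.
# This is NOT the actual zwiformat.rb; the method name and sample input are assumptions.

def clean_ocr_html(html)
  cleaned = html.dup
  # Drop inline style and class attributes from tags.
  cleaned.gsub!(/\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*')/i, "")
  # Unwrap <span> and <font> tags, keeping the text inside them.
  cleaned.gsub!(%r{</?(?:span|font)\b[^>]*>}i, "")
  # Rejoin words hyphenated across a line break ("encyclo-\npedia" becomes "encyclopedia").
  cleaned.gsub!(/(\p{L})-\s*\n\s*(\p{L})/, '\1\2')
  # Collapse runs of whitespace (including newlines) into a single space.
  cleaned.gsub!(/\s+/, " ")
  # Remove paragraphs left empty by the steps above.
  cleaned.gsub!(%r{<p>\s*</p>}i, "")
  cleaned.strip
end

# Illustrative input, invented for this example.
sample = <<~HTML
  <p class="s1" style="font-size:9pt;"><span style="color:#231f20;">An encyclo-
  pedia is a   work of reference.</span></p>
  <p style="margin:0;"><span class="s2"> </span></p>
HTML

puts clean_ocr_html(sample)
```

For the sample above, this prints `<p>An encyclopedia is a work of reference.</p>`. Regular expressions keep the sketch dependency-free, but they are fragile on real-world markup, and the hyphen rule would also merge genuine compounds that happen to break at a line end; a production tool would more likely use a proper HTML parser and a dictionary check.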

What’s next. We have already put one encyclopedia into our brand-spanking-new Oldpedia (Encyclopedia Britannica 11th edition; but that is not based on our own OCR work). But, hopefully sooner rather than later, we will start adding many newly OCR’d and ZWI-ified encyclopedia articles there. From there, they will migrate to EncycloSearch and EncycloReader. Oldpedia runs on the aggregator software that runs EncycloSearch, newly dubbed EncycloEngine. (By the way, if you want to install and run EncycloEngine, and run your own aggregator, you can. We’ll even help you do so.)

A future volunteer program? Already we must thank Dr. Nancy Hildebrandt and Denis Boyles (author of Everything Explained That Is Explainable, a history of the EB11) for giving their time toward testing and settling some editorial standards for the OEDP. Nancy is our test editor for ABBYY work and has helped me test zwiformat.rb; we appreciate her help. We are just starting to engineer a process for a future volunteer program in which others would use ABBYY and possibly zwiformat.rb (if I can get it working well enough) to convert high-quality, library-scanned encyclopedias into new additions to Oldpedia and thereby to the world, free for everyone forever.

Partnerships? If any group of theologians, historians, or what have you would like to work with us on ZWI file preparation based on old paper encyclopedias that have not yet been properly digitized, please get in touch.

Do you support this work?
If so, great. We think it’s important—but we need your financial support. Please consider donating regularly.

We will not accept donations from governments, from large corporations, from reference publishers, or from any organization or individual that we feel represents a threat to the independence of this endeavor. Does this mean we’ll have more trouble raising money? Fine, I don’t care.

So, yes, we do need your support. We would love to see larger donations from wealthy freedom-loving individuals and foundations, as long as the donations are without strings; and any amount over $5,000 means that I will have to fly out to meet you face-to-face. If I’m going to take that much money from you, I need to know you.

We would love to have your $5, $25, or $100. DONATE HERE. Every little bit helps.


Dr. Larry Sanger, President, Knowledge Standards Foundation
The Knowledge Standards Foundation is building a #decentralized network of encyclopedias where anyone, anywhere can access all the free encyclopedias easily. We are a 501(c)(3) non-profit organization.

