How artificial intelligence is cracking the code of the Vatican Secret Archives

New computer software and crowdsourcing may make centuries' worth of handwritten Latin documents available online.

The Archivum Secretum Vaticanum, or the Secret Vatican Archives (SVA), sounds like something from the conspiracy theories of Dan Brown’s potboiler fiction.

In reality, the SVA is simply the private (from the Latin word “secretum”) archives of the pope. In fact, since Pope Leo XIII opened the archives up to researchers in 1881, they haven’t even been private. Once Vatican documents are 75 years old, scholars are free to peruse them to their hearts’ content.

In theory, everything from the 8th century onward is available to researchers, including historical documents, acts promulgated by the Vatican, account books, and the correspondence of the popes.

The only problem: the sheer volume of the archives makes them virtually inaccessible.

According to an article by Sam Kean in The Atlantic, of the 53 linear miles of shelving in the Vatican Secret Archives, only “a few millimeters’ worth” of pages have been scanned, transcribed, and made available for computer searches online.

Enter In Codice Ratio, a research project that is using artificial intelligence and optical character recognition (OCR) to automatically transcribe the contents of the Vatican archives.

As Kean points out in his article for The Atlantic, OCR works great on typeset documents, but it can’t handle handwritten text. The letters tend to run together and are not always “nice, clean examples” of the letters they are supposed to represent.

Here’s where artificial intelligence comes in. Researchers recruited Italian high school students without any knowledge of Medieval Latin to help them. Presented with examples of letters that the OCR software identified, the students would see if those letters were correct matches. All the students had to do was match visual patterns. The software noted the corrections the high school students made, and learned from its mistakes.
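To make the idea concrete, here is a minimal sketch of how yes/no judgments from many students might be pooled into training labels. It is not the project's actual code: the function name `aggregate_votes`, the vote format, and the `min_votes` and `threshold` settings are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_votes(votes, min_votes=3, threshold=0.6):
    """Pool crowd judgments into labels (hypothetical helper, not the project's code).

    votes: iterable of (sample_id, proposed_letter, is_match) tuples, where
    is_match is True when a student agreed that the image shows that letter.
    Returns {sample_id: (proposed_letter, accepted)} for every sample that
    received at least min_votes judgments.
    """
    tally = defaultdict(lambda: [0, 0])  # (sample, letter) -> [yes votes, total votes]
    for sample_id, letter, is_match in votes:
        counts = tally[(sample_id, letter)]
        counts[0] += int(is_match)
        counts[1] += 1

    labels = {}
    for (sample_id, letter), (yes, total) in tally.items():
        if total >= min_votes:
            # Accept the proposed letter only if enough students agreed.
            labels[sample_id] = (letter, yes / total >= threshold)
    return labels
```

Labels accepted this way could be fed back as positive training examples and rejected ones as corrections, which is one plausible reading of how the software "learned from its mistakes."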

When they first began the project, “the idea of involving high-school students was considered foolish,” Paolo Merialdo, a scientist behind In Codice Ratio, told Kean. “But now the machine is learning thanks to their efforts. I like that a small and simple contribution by many people can indeed contribute to the solution of a complex problem.”

Transcribing these ancient handwritten documents by computer has hardly been smooth sailing, and results have been mixed. One-third of the words contained typos, which makes for an annoying reading experience but is still seen as a great advance.

“Imperfect transcriptions can provide enough information and context about the manuscript at hand” to be useful, Merialdo told Kean.

What’s more, the scientists behind the project expect the software to improve over time: the more examples it learns from, the better its transcriptions should become.
