Digitalization: OCR & Text Searching

I remember when the internet became accessible to households. It changed people’s lives. Over the years, search engines became the main tool for finding web pages. You would enter keywords, and the engine would sift through thousands of references and URLs to give you the best matching results.

Could we apply this concept to Digital History? If you choose any book or newspaper collection published before 1920, for example, it has most likely been digitalized; you will be able to read millions of pages in databases 1. With that much material, the only way not to get lost is to use text searching to filter the sources and focus your research on the period of history you are working on.

Instead of searching paper archives for hours to find one memo or one article, we can now obtain the same result through online text searching, and more precisely through Optical Character Recognition (OCR) technology. According to Milligan (2013), it is “a process that takes an image, recognizes shapes that are in the forms of letters, and writes the output in plain text 2.” For example, if I search for the word “China”, it will be highlighted wherever it is mentioned in the digitalized newspaper or book.
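To make this concrete, here is a minimal sketch in Python of what a keyword search over OCR output might look like. It is only an illustration: the function name `highlight_keyword`, the `**` markers, and the sample page are all invented for this example, and real newspaper databases rely on their own, far more elaborate search tools.

```python
import re


def highlight_keyword(ocr_text, keyword, marker="**"):
    """Return (line_number, highlighted_line) for every line containing keyword.

    Hypothetical helper for illustration; not taken from any real database.
    """
    # Match the keyword as a whole word, ignoring upper/lower case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        if pattern.search(line):
            # Wrap each occurrence in markers to "highlight" it.
            highlighted = pattern.sub(
                lambda match: marker + match.group(0) + marker, line
            )
            hits.append((number, highlighted))
    return hits


# Invented sample text standing in for the plain-text output of OCR.
sample_page = """TRADE NEWS FROM THE EAST
Shipments of tea from China arrived this week.
Local council debates the new bridge.
Merchants expect prices in china to rise."""

for number, line in highlight_keyword(sample_page, "China"):
    print(number, line)
```

The search returns only the two lines that mention the keyword and skips everything else on the page, which is exactly the kind of narrow focus that text searching encourages.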

Back when you read the newspaper in hand, you would first take it in as a whole. A quick look across the different headlines gave you a more global understanding of the issue and told you what to expect from it.

Text searching changed the way we engage with information. OCR brings the focus onto small fragments of text within newspapers or articles. When reading a digitalized source such as a newspaper, one no longer sees it as a whole but focuses on one article at a time. On the one hand, it is great to be able to target words and focus the research; on the other hand, there is a risk of missing important clues. Those clues might only be seen by understanding the source in its entirety. Articles can be related to one another, and information may need to be cross-referenced in order to understand the “atmosphere” at the time of the artifact.

The use of digital media has democratized access to knowledge and to our history. Research has become much faster, which has helped raise new research questions.

There are also downsides to digitalizing a source. The touch and feel are lost. According to Tony Guidone (2019), the Nazi coins produced in 1940 are heavier than the ones from 1943 because the German army needed as much metal as possible for weapons, so less metal went into the coins in 1943. This fact cannot be deduced from a digital source, only by handling the coins. The original context of a source, which could bring more meaning to the artifact, can be lost as well when it is digitalized.

Digital projects may create significant social change in the future. People can quickly and freely access knowledge on the web, and they can share it. Digital projects are great vectors of social interaction 3, such as sharing information or commemorating an event after major disasters like “Hurricane Sandy” or “9/11” 4.

Digitalizing sources and using efficient tools like OCR for text searching bring many strengths to Digital History. Now we can wonder: do you think digitalized sources will completely replace the non-digitalized ones? Personally, I think they are complementary.

  1. According to Tony Guidone’s presentation at GMU, September 9, 2019.
  2. From “Online Databases, Optical Character Recognition, and Canadian History, 1997–2010” by I. Milligan, December 2013, The Canadian Historical Review, Volume 94, Number 4, University of Toronto Press.
  3. See “The Legacies of George Mason” project, http://masonslegacies.org/exhibits, and also “The Enslaved Children of George Mason”, https://ecgm.omeka.net/
  4. See an example of a digital project: http://911digitalarchive.org/
