Shawn Peters

Mar 21, 2020 4 min read

OCR and Text Mining pt. 1

The conversation started out simply enough at lunch one day:

“Dude have you ever been to prnt.sc? There’s this tool called Lightshot that uploads screenshots and just indexes the URLs. You could easily write a script to loop through them.”

I had done something similar several years ago in college when I wrote a script to randomly generate Imgur URLs for a drinking game. Cat was one drink, celebrities were three or something like that. The rules were variable.

Almost a decade later, I am far less interested in figuring out fun ways to get drunk and far more interested in data mining.

We walked and talked after lunch and discussed what made Lightshot so interesting and whether we could pull anything cool from it. The next day at lunch my friend had a script written and had watched a loop of a few hundred images.

There were text message conversations; there were IP addresses; email addresses of world leaders; there were Bitcoin exchange usernames and passwords.

Holy shit. Alright, let’s see what we can do with this potential treasure trove.

LightShot Details

What is LightShot? Before we talked about it at lunch, I hadn’t heard of it. After some research into the platform, here is an overview of what we are dealing with.

LightShot is a screen capture and editing tool that lets users quickly take a screenshot, apply some markup to it, and have it uploaded automatically to a server from which it can be shared.

There is a desktop application for Mac and Windows (and Ubuntu, kind of) as well as browser plugins for Chrome, Firefox, Internet Explorer, and Opera[1].

LightShot was developed by Skillbrains, a developer whose only footprint online appears to be LightShot-related. The third page of Google results for a “Skillbrains” search finally yielded a WordPress site last updated in 2010.

https://skillbrains.wordpress.com/

There is no developer website for Skillbrains that I could find. No links to a dev page off of the old blog above, and browsing directly to https://skillbrains.com redirects to https://app.prntscr.com/en/.

.sc is the internet country code top-level domain for Seychelles, an archipelagic island country in the Indian Ocean. According to Wikipedia, the domain’s intended use is for entities connected with the country of Seychelles, but in actuality it is used for a random assortment of sites. It is not unusual, then, for Skillbrains to use it for LightShot, since prnt.sc reads as shorthand for “Print Screen”. This is corroborated by their ownership of prntscr.com as well.

Skillbrains does appear to be a Russian developer, though. A domain name lookup of prntscr.com returned:

And the application author in the App Store is a Russian software developer who now lives and works in the United States. The LinkedIn details for prntscr.com pin it down to the Novosibirsk Region of Russia.

The origin of the app developers is not important outside of trying to understand where the webapp and the images are stored. Different laws obviously apply in different countries for what can and cannot be hosted, and different law enforcement agencies respond to illegal activity.

Legality is a topic that we will return to a few times throughout this process, and this will be one of the touchstones.

What are we going to do with it?

Enough about the app itself; what are we going to do with it? There are two major boons of the LightShot application:

  1. It is all public, and
  2. The URLs are indexed.

LightShot’s privacy policy states that:

“Every Image can always be accessed and viewed by anyone who types in that exact URL. No image uploaded to this website is ever completely hidden from public view.”

So per the developers’ own policy, we can always find every image that is uploaded to the servers, from video game screenshots to family photos to bank account information.

URL indexing is the portion that makes this very exciting, because it makes image viewing and data collection very easy. Every URL of every image on prnt.sc is of the form:

https://prnt.sc/xxxxxx

where “xxxxxx” is a string of alphanumeric characters. This gives us 36^6 (2,176,782,336) potential URL combinations. In ten years, approximately three-quarters of that number of images have been uploaded.
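For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the alphabet is the 26 lowercase letters plus the 10 digits (which is what a base of 36 implies), and it treats the upload count as exactly three-quarters of the keyspace, matching the roughly 75% and 1.6 billion figures in this post. The variable names are mine, not anything from LightShot.

```python
import string

# Assumption: IDs use lowercase letters and digits only (26 + 10 = 36 symbols).
ALPHABET = string.ascii_lowercase + string.digits
ID_LENGTH = 6

# Total number of distinct 6-character IDs.
keyspace = len(ALPHABET) ** ID_LENGTH

# Rough upload estimate: three-quarters of the keyspace (an approximation).
estimated_uploads = keyspace * 3 // 4

print(f"keyspace:          {keyspace:,}")           # 2,176,782,336
print(f"estimated uploads: {estimated_uploads:,}")  # 1,632,586,752
```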

1.6 billion data points is an intimidating number regardless of the content. However, since every image on the site is identified by a 6-character string, and a randomly generated string hits a real image more than 75% of the time, we can easily loop through images by any number of means without wasting much time.
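Generating candidate URLs takes only a few lines. This is a sketch of the idea rather than the script my friend wrote: the function names `random_id` and `candidate_urls` are made up for illustration, the alphabet is assumed to be lowercase letters plus digits, and nothing here actually requests the pages.

```python
import random
import string

# Assumption: IDs use lowercase letters and digits only.
ALPHABET = string.ascii_lowercase + string.digits
BASE_URL = "https://prnt.sc/"


def random_id(rng, length=6):
    """Return one random ID of the given length drawn from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def candidate_urls(count, seed=None):
    """Return `count` candidate image URLs; pass a seed for repeatable runs."""
    rng = random.Random(seed)
    return [BASE_URL + random_id(rng) for _ in range(count)]


# Print a handful of candidates (no network access happens here).
for url in candidate_urls(3, seed=42):
    print(url)
```

If the hit rate really is around 75%, we would expect to need about 1/0.75, or roughly 1.33, random draws per real image.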

The most interesting bits of data on these images are pieces of text.

Things like usernames and passwords, email addresses, and more.

How much of the content that is publicly available on prnt.sc is personally identifiable information or otherwise could compromise someone’s security?

We will spend the next several blog posts finding out and detailing the steps taken to do so.

Thanks for reading and stay tuned!

[1] https://app.prntscr.com/en/download.html