Shawn Peters

Aug 9, 2020

Data mining PII via optical character recognition on publicly hosted image sites pt. 1

Cybercrime is expected to reach $6 trillion in damages in 2021 (Herjavec Group, 2019), and has an average yearly impact on the global economy of $450 billion (Harper et al., 2018). Many instances of identity theft are the result of simple phishing campaigns, where a malicious actor tricks a target into entering their username and password into a fake version of a trusted website or into downloading an attachment containing malware. These phishing schemes usually create a sense of urgency and convince the target that they need to act quickly and without thinking. When the target follows a link in an email and “authenticates” to confirm their identity, the attacker obtains those credentials.

Once the attacker has the user credentials, they can take over the account. If the target has used the same username and password for other accounts, the attacker could take those over as well. If the attacker is able to take over the target’s email account, then they could have full control over any account the target has created using that email address. The attacker can simply change the password for the email account to lock the victim out, and then reset passwords for any other accounts the target has associated with that email. Confirmation notifications for account-level changes are often validated by email, so the attacker could continue to pivot and change passwords to other accounts belonging to that user.

By themselves, usernames and passwords are not especially useful. The details that we protect using that security mechanism are. Personally Identifiable Information (PII) is defined by Johnson of the U.S. General Services Administration as “information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual” (Johnson, 2007).

More people and more devices are connected to the internet today than at any other time, and this trend will continue for many years. In her article addressing the right to be forgotten, Dr. Fiona Brimblecombe notes that people “are uploading personal and private data to the Web to improve their job prospects (for example, by using LinkedIn), their sex lives (Tinder and Grindr) and their social and family lives (Facebook and Instagram)” (Brimblecombe, 2020). These web applications and thousands more all store information that is potentially private or sensitive.

Even high profile data breaches, such as Ashley Madison’s* compromise of 37 million user accounts, have been the result of poor credentials management (Lord, 2017). In this case, there were credentials hard-coded into the service’s source code which allowed attackers to compromise the database. After the database was compromised, poor encryption practices within the database allowed the attackers to decrypt 11 million passwords in less than a week.

The company also did not use any verification to set up an account, which resulted in many people who did not actually create accounts having their email addresses leaked. It put strain on many relationships when spouses had to explain that they did not create an Ashley Madison account, even though their email address was associated with one.

*Ashley Madison’s Wikipedia entry describes it as: “The Ashley Madison Agency, is a Canadian online dating service and social networking service marketed to people who are married or in relationships. Its slogan is Life is short. Have an affair.”

Problem Statement

The popularity of social media has resulted in thousands of sites that allow users to upload text, images, and other media. While many corporations such as Facebook and Google have imposed measures to give users control over their content, not everyone uses these tools. Ibrahim and Tan analyzed users’ awareness of and engagement with privacy tools on social media sites and found a broad range of attitudes (Ibrahim & Tan, 2019). Most users knew about the privacy tools and used them to manage who had access to their data, but as many as 27% still shared information publicly.

There are also many other popular but less-regulated sites that do not offer privacy tools at all. Malicious actors can take advantage of this type of unprotected data to cause harm. Users might upload screenshots without realizing that compromising data is also on their screen; malicious users might share data they have stolen; and remote collaborators might assume it is safe to share a screenshot of their database config because it will just be a needle in a haystack.

Project Purpose

The purpose of this project is to investigate the textual personally identifiable information contained in a large, public image set. The results of this exploratory investigation will be analyzed in a report with the intent of exposing the dangers of public hosting tools.

In this study, an image-hosting site, https://prnt.sc, will be analyzed. The site hosts over 1.7 billion images uploaded via a client application named Lightshot. Lightshot is a mobile and PC application that is used to take a screenshot, automatically upload the image, and generate an indexed public URL to easily share the image.

Prnt.sc is used worldwide and had nearly 18 million visits in May 2020 (SimilarWeb, 2020). Over the six-month span between December 2019 and May 2020, the site was visited over 99 million times. The same report details that 14% of the latest month’s traffic originated from Russia, while the second-largest share (8.12%) came from the United States.

The data was collected with a script that randomly generates prnt.sc URLs and then uses an optical character recognition (OCR) library to extract the text from the images at those URLs. Over the course of several weeks in March and April 2020, approximately 1.18 million images were scraped and cataloged. The data collection for this research accounted for about 2.5% of all traffic to the website during those two months, but did not cause a significant deviation from the totals of the surrounding months.
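To make the sampling step concrete, here is a minimal sketch of how candidate URLs could be generated. It is written in Python for brevity and is not the script used in this study. The ID format (six characters drawn from lowercase letters and digits) is an assumption, and the function names are hypothetical. Fetching the images and running OCR are deliberately left out.

```python
import random
import string

# Assumption: image IDs are six characters, lowercase letters and digits.
ID_ALPHABET = string.ascii_lowercase + string.digits
ID_LENGTH = 6
BASE_URL = "https://prnt.sc/"

# Number of distinct IDs under the assumed format.
KEYSPACE = len(ID_ALPHABET) ** ID_LENGTH


def random_image_url(rng):
    """Return one candidate URL whose ID is drawn uniformly at random."""
    image_id = "".join(rng.choice(ID_ALPHABET) for _ in range(ID_LENGTH))
    return BASE_URL + image_id


def sample_urls(count, seed=0):
    """Return `count` distinct candidate URLs, sorted.

    A fixed seed makes the sample repeatable, which helps when a
    collection run has to be resumed or audited later.
    """
    rng = random.Random(seed)
    urls = set()
    while len(urls) < count:
        urls.add(random_image_url(rng))
    return sorted(urls)


if __name__ == "__main__":
    for url in sample_urls(3, seed=42):
        print(url)
```

Under the assumed format the ID space holds 36^6 (about 2.18 billion) values. Drawing IDs uniformly gives a simple random sample of that space, although not every generated ID necessarily maps to an existing image.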

The resulting data set will be analyzed to understand the amount of PII data that is available through this image hosting service. The intent is to help describe the dangers of sharing personal data via internet-connected applications, demonstrate the ease with which such data can be gathered, and prescribe a solution to protect both users and hosts.

Objectives

Explain the data set and how it was collected and prepared, as well as alternative and scalable methods of collecting the same data.
Perform exploratory analysis of the data set using MongoDB queries and Node.js scripts. Using custom keyword lists, characterize the composition of financial, address, account, and other PII. Examine the associated images for potential trends or other contextual details.
Extract as much personally identifiable information as possible from the collected data set.
Assess the responsibilities of users and hosts with respect to protecting data.
Recommend ways to avoid these and other pitfalls for improved personal online safety.
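
The keyword-list objective above can be illustrated with a short sketch. It is written in Python for brevity, whereas the study itself names MongoDB queries and Node.js scripts. The keyword lists and function names below are hypothetical placeholders, since the lists used in the study are not given here. The idea is to tag each piece of OCR text with the PII categories whose keywords it contains, then count documents per category.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    "financial": ["credit card", "iban", "routing number", "cvv"],
    "address": ["street", "zip code", "postal code"],
    "account": ["username", "password", "login"],
}


def categorize(text):
    """Return the set of categories whose keywords appear in the text."""
    lowered = text.lower()
    found = set()
    for category, words in KEYWORDS.items():
        for word in words:
            # Word boundaries avoid matches inside longer words.
            if re.search(r"\b" + re.escape(word) + r"\b", lowered):
                found.add(category)
                break
    return found


def summarize(documents):
    """Count how many documents fall into each category."""
    counts = Counter()
    for text in documents:
        counts.update(categorize(text))
    return counts


if __name__ == "__main__":
    sample = [
        "Username: alice Password: ********",
        "Ship to 12 Main Street, ZIP code 90210",
        "lunch menu",
    ]
    print(summarize(sample))
```

Keyword matching of this kind only flags candidates. OCR errors and benign uses of a word (a login page with no credentials filled in, for example) mean the flagged text still needs manual review.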

Significance

This study aims to underline the ease with which sharing seemingly mundane details online can compromise personal data. The data set under analysis consists of text scraped from images using rudimentary scripting tools and a computer built in 2012. Modern distributed systems, a concerted development effort, or simply ill intent could each increase the impact of this type of exercise. All three combined could quickly and easily dump the most sensitive pieces of information into a pastebin for anyone with a link to take advantage of.

References

Brimblecombe, F. (2020). The Public Interest in Deleted Personal Data? The Right to Be Forgotten’s Freedom of Expression Exceptions Examined through the Lens of Article 10 ECHR. Journal of Internet Law, 23(10), 1–29.

Harper, A., Sims, S., Baucom, M., Regalado, D., Spasojevic, B., Eagle, C., Linn, R., Martinez, L., & Harris, S. (2018). Gray Hat Hacking: The Ethical Hacker’s Handbook. McGraw-Hill Education.

Herjavec Group. (2019). 2019 Official Annual Cybercrime Report. https://www.herjavecgroup.com/the-2019-official-annual-cybercrime-report/

Ibrahim, S., & Qing Tan. (2019). A Sudy [sic] on Information Privacy Issue on Social Networks. ISeCure, 11(3), 19–27.

Johnson III, C. (2007). Safeguarding Against and Responding to the Breach of Personally Identifiable Information (Report No. M-07-16). https://www.whitehouse.gov/sites/whitehouse.gov/files/omb/memoranda/2007/m07-16.pdf
