big-data | frustrated robot

All Posts

Data mining PII via optical character recognition on publicly hosted image sites pt. 4

Results

The hypothesis that personally identifiable text data can be extracted from publicly hosted images was confirmed. The research focused narrowly on usernames and passwords because that information can be exploited to gain more data from other accounts. From a data set of 1.18 million records, a focused subset of 6081 images was identified as having high potential for containing compromising information. From those 6081 images, 1044 usernames and passwords, as well as 18 social security numbers, were extracted.
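To put the figures above in proportion, here is a minimal Python sketch of the yield arithmetic. The function name `yield_rates` is hypothetical, and 1.18 million is taken as the rounded figure quoted in the text, so the percentages are approximate. The second ratio is extracted credentials per flagged image, which is not necessarily the share of flagged images that contained a credential.

```python
def yield_rates(total_records: int, flagged_images: int, credentials: int) -> tuple[float, float]:
    """Return (share of records flagged, credentials per flagged image) as percentages."""
    flagged_share = 100 * flagged_images / total_records
    credentials_per_flagged = 100 * credentials / flagged_images
    return flagged_share, credentials_per_flagged


# Figures quoted in the excerpt; 1.18 million is a rounded total.
flagged_share, credentials_per_flagged = yield_rates(1_180_000, 6081, 1044)
print(f"Flagged images: {flagged_share:.2f}% of all records")
print(f"Credentials extracted per flagged image: {credentials_per_flagged:.1f}%")
```

In other words, roughly half a percent of the collected records were flagged, and the flagged subset yielded about one credential pair for every six images.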

Jan 27, 2021 - 15 min read

Data mining PII via optical character recognition on publicly hosted image sites pt. 3

Data Collection & Methodologies

The data for this project was collected from https://prnt.sc using a Node.js script and the OCR package TextBoxes.

Data Collection

All screenshots taken using the Lightshot application are hosted at a public URL of the form https://prnt.sc/****** where the trailing six characters are the unique identifier of the uploaded image. The string is alphanumeric, so there are 36^6 (about 2.18 billion) possible combinations. On the homepage, the site reports that there are just over 1.
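The identifier arithmetic can be sketched in a few lines of Python. This is an illustration only, not the author's Node.js script: it assumes the identifier alphabet is the ten digits plus the 26 lowercase letters, and the names `ALPHABET`, `keyspace_size`, and `index_to_id` are hypothetical. It computes the size of the six-character keyspace and shows how an integer index maps to an identifier in base 36.

```python
import string

# Assumption: identifiers use digits and lowercase letters (36 symbols),
# matching the "alphanumeric" description in the text.
ALPHABET = string.digits + string.ascii_lowercase
ID_LENGTH = 6


def keyspace_size(alphabet_size: int = len(ALPHABET), length: int = ID_LENGTH) -> int:
    """Number of distinct identifiers of the given length."""
    return alphabet_size ** length


def index_to_id(index: int) -> str:
    """Map an integer in [0, 36^6) to a six-character base-36 identifier."""
    if not 0 <= index < keyspace_size():
        raise ValueError("index out of range")
    chars = []
    for _ in range(ID_LENGTH):
        index, remainder = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


print(keyspace_size())                     # 2176782336
print(index_to_id(0))                      # 000000
print(index_to_id(keyspace_size() - 1))    # zzzzzz
```

The exact count is 2,176,782,336, which is where the "about 2.18 billion" figure comes from.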

Nov 14, 2020 - 17 min read

Data mining PII via optical character recognition on publicly hosted image sites pt. 2

Introduction

On June 14, 2016, Ellen Nakashima of The Washington Post published a story reporting that the Democratic National Committee (DNC) had been infiltrated by two teams of state-sponsored Russian hackers (Greenberg, 2019). According to Nakashima, one of the groups, named Cozy Bear*, gained access to the DNC’s email and chat communications and had been monitoring those channels for over a year. The other group, Fancy Bear, gained access to DNC servers in April 2016 and exfiltrated opposition research documents.

Aug 15, 2020 - 13 min read

Data mining PII via optical character recognition on publicly hosted image sites pt. 1

Cybercrime is expected to reach $6 trillion in damages in 2021 (Herjavec Group, 2019), and has an average yearly impact on the global economy of $450 billion (Harper et al., 2018). Many instances of identity theft are the result of simple phishing campaigns, in which a malicious actor tricks a target into entering their username and password into a fake version of a trusted website or into downloading an attachment containing malware.

Aug 9, 2020 - 7 min read

Text Messages Sentiment Analysis

Introduction

Personal communications have always been a research-rich resource. Theodore Roosevelt’s journal and correspondence have been studied by historians, the FBI wants a backdoor into iPhones, and even metadata about our phone calls is extremely valuable to the intelligence community. Sentiment analysis is a method used to study the subjectivity of collections of text. It can be used in a variety of applications, from analyzing song lyrics, to product reviews, to assessing public opinion through trends on Twitter.

Mar 21, 2020 - 6 min read

OCR and Text Mining pt. 1

The conversation started out simply enough at lunch one day: “Dude, have you ever been to prnt.sc? There’s this tool called Lightshot that uploads screenshots and just indexes the URLs. You could easily write a script to loop through them.” I had done something similar several years ago in college when I wrote a script to randomly generate Imgur URLs for a drinking game. A cat was one drink, celebrities were three, or something like that.

Mar 21, 2020 - 4 min read