Data mining PII via optical character recognition on publicly hosted image sites pt. 4

Results
The hypothesis that personally identifiable text data can be extracted from publicly hosted images was confirmed. This research focused narrowly on usernames and passwords because that information can be exploited to gain more data from other accounts. From a data set of 1.18 million records, a focused subset of 6081 images was identified as having high potential to contain compromising information. From those 6081 images, 1044 usernames and passwords, as well as 18 social security numbers, were extracted.
Among the usernames and passwords extracted, a significant number came from images that appear to have been planted as honeypots by an unknown entity, possibly a government or a malicious actor. It is likely that these honeypot images, such as the “jiratrade” emails, were uploaded in bulk to try to trick someone who happened across them into going to the listed website and resetting the username and password. This would be an extremely roundabout way to try to harvest credentials, however, and it raises more questions than it answers.
What is the likelihood that someone finds one of these “jiratrade” email images? There were 916 of these images out of a total of 1,189,294 scraped, which gives a 0.08% chance that a random image inspected on the website is a “jiratrade” email containing credentials. The concentration of these types of images is anomalous, but the reasoning behind them lies beyond the scope of this research.
One question that arose during the data analysis was how many of the images were hosted on imgur.com rather than prnt.sc. From the collected data, 387,923 were hosted on imgur.com and 801,361 were hosted by Skillbrains on prnt.sc, which means that almost one third of all images are hosted externally.
The data set was large and randomly selected, so it can be expected to be representative of the full population of 1.7 billion images hosted on https://prnt.sc. If 1044 usernames and passwords can be extracted from every 1.18 million images, then scaling that rate to the full population suggests that approximately 1.5 million usernames and passwords sit in clear text on the website.
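The rates and the extrapolation in this section can be checked with a few lines of Python. This is a minimal sketch that uses only the figures reported above; the variable and function names are illustrative. Using the exact sample size instead of the rounded 1.18 million shifts the estimate slightly, but it stays close to 1.5 million.

```python
# Figures reported in this article.
SAMPLE_SIZE = 1_189_294       # images scraped
CREDENTIALS_FOUND = 1_044     # usernames and passwords extracted
JIRATRADE_IMAGES = 916        # suspected honeypot images
IMGUR_HOSTED = 387_923        # images served from imgur.com
POPULATION = 1_700_000_000    # approximate number of images on prnt.sc


def percentage(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return count / total * 100


def extrapolate(count: int, sample_size: int, population: int) -> int:
    """Scale a count observed in a sample up to the full population."""
    return round(count / sample_size * population)


jiratrade_pct = percentage(JIRATRADE_IMAGES, SAMPLE_SIZE)
imgur_pct = percentage(IMGUR_HOSTED, SAMPLE_SIZE)
estimated_credentials = extrapolate(CREDENTIALS_FOUND, SAMPLE_SIZE, POPULATION)

print(f"jiratrade share of sample: {jiratrade_pct:.2f}%")
print(f"imgur share of sample:     {imgur_pct:.1f}%")
print(f"estimated credentials:     {estimated_credentials:,}")
```

The extrapolation assumes the sample is representative, which is the same assumption the paragraph above makes.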
Recommendations & Conclusion
Lightshot is an undeniably popular application that is used worldwide. Its ease of use makes it an attractive mechanism for quickly taking and sharing screenshots. Users do not need to worry about how to host their own images, the site dynamically generates shortened URLs to make them social media-friendly, and it is free. Over 1.7 billion images have been uploaded to prnt.sc using Lightshot since January 2010. When the image count passes 2.1 billion, all of the six-character URLs will be taken. Will the service just bump the character count to seven, or perhaps start back from the first index?
The site attracts millions of visitors each month, and its use of ads likely generates significant revenue. Some of that money must pay the hosting costs for the site and its image collection, but over 30% of Lightshot images are actually hosted on Imgur.
The financials matter only because enough money is being made to maintain a site that has stored billions of uncompressed images for over 10 years. This level of popularity and success essentially obligates the developers to invest in and provide safeguards.
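The 2.1 billion ceiling can be reproduced with a short calculation. The sketch below assumes the six-character identifiers are drawn from lowercase letters and digits (36 symbols); the article does not state the alphabet, but that assumption gives a figure consistent with the one quoted above.

```python
import string

# Assumption: URL identifiers use lowercase letters and digits only.
ALPHABET = string.ascii_lowercase + string.digits  # 36 symbols

IMAGES_UPLOADED = 1_700_000_000  # approximate count reported in this article


def url_space(length: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of distinct identifiers of the given length."""
    return alphabet_size ** length


six_char_urls = url_space(6)    # 2,176,782,336
seven_char_urls = url_space(7)  # 78,364,164,096
remaining = six_char_urls - IMAGES_UPLOADED

print(f"six-character URLs:   {six_char_urls:,}")
print(f"still unused:         {remaining:,}")
print(f"seven-character URLs: {seven_char_urls:,}")
```

Under this assumption, moving to seven characters would multiply the available address space by 36.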
Host Recommendations
There are three recommended steps for the developers of this application to improve protections for its users:
- Require user accounts.
- Provide privacy settings.
- Pre-process images.
In addition to these recommendations, Skillbrains support was contacted with a list of URLs of the images that contained credentials with a request that they be removed from the site. There was no response and the images are still available at the time of this writing.
Require User Accounts
Lightshot does allow users to create accounts, but an account is not required to engage with any of the website’s functionalities. On the other hand, requiring a user to create an account to upload images would add a level of accountability to an otherwise completely anonymous process.
The Terms of Service specifically state that everything uploaded to the site is public, reasoning that this ensures the site is not used for illegal purposes:
We do not “collect” anything that you post: images, comments, messages, etc., and do not “process” or determine any purposes for processing of any information that you post. In particular, every image uploaded (even if it was uploaded to your gallery) to this website is public – whether uploaded directly without going through a user account, or uploaded via a user account – and has its own URL. Every image can always be accessed and viewed by anyone who types in that exact URL. No image uploaded to this website is ever completely hidden from public view. This is to ensure that this website will not be used as a platform for illegality.
This argument is problematic because there is no accountability if illegal images are uploaded or the platform is otherwise used for illegal purposes. The best case scenario would be that someone else sees the image and reports it to the administrators, who then take it down, but what is to stop the original poster from simply re-uploading it?
That would also require the developers to actively respond to reports of misuse of the platform, which did not happen when the compromising images discovered during this research were reported. Requiring every user to create an account to be able to use the features of the site would be a good step toward accountability.
Anyone can see any image that is hosted on prnt.sc. That is a great selling point for the site on its surface, because it means it is easy. User experience as a discipline is driven by the principle of making products as easy to use as possible. Unfortunately, that ease of use comes at a price with this service.
Provide Privacy Settings
It is not enough to require that all users of Lightshot have an account to interact with the site. Account configuration must also include security and privacy settings. YouTube’s approach to who is allowed to view an uploaded video could be a guide here. Options such as “public”, “private”, and “only with a shared link” would be simple and vastly improve control over how images are shared.
Giving users control over their images is a safeguard that essentially every major hosting site offers. If there are truly instances where a person needs to share delicate information, and Lightshot is their only means of doing so, then they should not need to allow the entire world to see it.
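The three visibility options suggested above could be modeled along the following lines. This is a hypothetical sketch, not a description of how Lightshot or YouTube work; the `Image` class, the `can_view` function, and the share-token scheme are all assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    LINK_ONLY = "only with a shared link"


@dataclass
class Image:
    """Hypothetical record for an uploaded image."""
    owner_id: str
    visibility: Visibility
    # Unguessable token that is handed out as part of a shared link.
    share_token: str = field(default_factory=lambda: secrets.token_urlsafe(32))


def can_view(image: Image, viewer_id: Optional[str] = None,
             token: Optional[str] = None) -> bool:
    """Decide whether a request may see the image."""
    if viewer_id is not None and viewer_id == image.owner_id:
        return True  # owners can always see their own uploads
    if image.visibility is Visibility.PUBLIC:
        return True
    if image.visibility is Visibility.LINK_ONLY and token is not None:
        # Constant-time comparison avoids leaking the token through timing.
        return secrets.compare_digest(token, image.share_token)
    return False  # private images are visible to the owner only


screenshot = Image(owner_id="alice", visibility=Visibility.LINK_ONLY)
print(can_view(screenshot))                                # False
print(can_view(screenshot, token=screenshot.share_token))  # True
```

The point of the design is that a long random token, unlike a short sequential URL, cannot be enumerated by a scraper.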
Pre-process Images
By including a pre-processing step in the image upload procedure, Skillbrains could programmatically ensure that their website is not used as a platform for illegality. Facebook has advanced algorithms that analyze every image that is uploaded to their site to ensure that it does not violate certain rules. Investments in machine learning, both financial and intellectual, have pushed the technology forward exceptionally in the past decade.
Granted, Facebook’s algorithms are proprietary and not available for free, but an investment in development by Skillbrains would make their platform safer and provide a better user experience. The platform already hosts nearly two billion images, all of which are free for the developer to use as a training set.
This research has proven how trivial it is to analyze images for their text contents. Certainly a filter could be put in place to prevent pornographic images, images with excessive profanity, and more from being uploaded in the first place.
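A very simple version of such a filter might look like the sketch below. It assumes the text has already been extracted from the image by an OCR engine and only screens that text; the keywords and patterns are illustrative placeholders, not the keyword set used in this research.

```python
import re
from typing import List

# Illustrative patterns only; a production filter would need a much broader set.
CREDENTIAL_KEYWORDS = re.compile(
    r"\b(user(?:name)?|login|pass(?:word)?|pwd)\b", re.IGNORECASE
)
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def screen_text(ocr_text: str) -> List[str]:
    """Return the reasons an upload should be held for review (empty if none)."""
    reasons = []
    if CREDENTIAL_KEYWORDS.search(ocr_text):
        reasons.append("credential keyword")
    if SSN_PATTERN.search(ocr_text):
        reasons.append("SSN-like number")
    return reasons


# Text as it might come back from an OCR pass over a screenshot.
sample = "Welcome!\nUsername: jdoe\nPassword: hunter2"
print(screen_text(sample))  # ['credential keyword']
```

An upload that returns any reasons could be blocked, or the uploader could be warned before the image is published.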
These three suggestions are the minimum that the host can do to help protect their users and themselves. The first two proposed steps are industry standard practices, and the third is in line with many of the requirements of their own terms of service.
The European Union’s General Data Protection Regulation (GDPR) is blatantly violated by the structure of this application and website as well. A citizen of the EU has the legal right to be forgotten and have all of their data removed from a site on demand (Kročil & Pospíšil, 2020). While many complexities exist in the law, Lightshot clearly violates this regulation by providing no mechanism by which users may request that their data be made available to them, and no infrastructure to act on a right to be forgotten request.
GDPR has changed the way that companies handle user data, and forced many organizations to provide greater control of personal data to their users (Li et al., 2019). Russia is not a member of the EU, and so occupies a less well-defined area for legal action. Analysis of law is outside of the scope of this research as well, but there has been no visible attempt by the developers to comply with GDPR.
Lightshot has existed for over a decade and has been a successful platform for image sharing, but the host has taken on next to no responsibility for ensuring the service is used in a safe and legal manner.
User Recommendations
Users should not trust that hosts have their best interests at heart when creating a website or service. News about poor security practices is commonplace, and users should be vigilant and proactive about personal safety when using the internet. To help avoid the shortcomings of Lightshot’s scant security measures, and for general individual best practice, the following three recommendations should be adopted.
- Use strong passwords.
- Use unique passwords.
- Understand that there is no true security by obscurity.
Strong Passwords
Password strength is dictated by its complexity. Unfortunately, something that is complex for a human is not always complex for a computer. Computers are very good at rapidly looping through lists, and so are perfectly suited for entering passwords by brute force.
Many times, when people think of a complex password they think of something another human is unlikely to guess. There are very few hackers attempting to brute force passwords manually, but people who know personal details about users have an advantage in potentially guessing a weak password. So users need to create passwords that are too complex for either a human or a computer to guess.
Many solutions have been proposed to help users secure their accounts via strong and memorable passwords. For example, in his article Security analysis of Game Changer Password System, Brumen discusses the pros and cons of game changer passwords, highlighting the constant tradeoff between security and memorability (Brumen, 2019).
The idea of a game changer password security method is that a user places pieces on a game board, such as chess or Monopoly, rather than having a word or a phrase to authenticate. In the chess example, a user selects four chess pieces and places them on an 8 x 8 chess board. Each piece and each square on the board has an identifier; for example, a black king (BK) placed on b6 and a white bishop (WB) placed on c5 represent the partial password “BKb6WBc5”. Clearly, placing four pieces on a chessboard is easier than memorizing a seemingly random 16-character string.
A Monopoly board is conceptually the same, but with different pieces and different square names.
Unfortunately, Brumen goes on to demonstrate that this method can easily be overcome by brute force. The chess example provides users with 347,892,350,976 possibilities for passwords. Depending on the hashing function used, Brumen shows that the entire space of chess passwords can be searched in 51 seconds.
The small number of possible combinations, approximately 3.48 x 10^11, is five orders of magnitude smaller than recommended (Grassi et al., 2017). That weakness is compounded by the issue that plagues all aspects of security: the human factor. Brumen’s study showed that people are more likely to choose some pieces and squares over others in chess, and the issue was even more prevalent with a Monopoly board.
Monopoly’s boot piece was selected nearly 25% of the time, and “Jail” and “Go” were each chosen in at least 6% of trials. Similarly, the corner squares on the chess board were far more likely to be selected.
In order for this solution to truly increase user safety, it needs to generate more complex passwords and have a more normalized distribution of piece and square combinations.
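The size of the chess password space quoted above can be reproduced with a short calculation. The sketch below is one reading that matches the published figure: it assumes 12 distinct pieces (six types in two colors), 64 squares, and four ordered placements in which repeats are allowed. The guess rate is derived from the 51-second figure and is not a number taken directly from Brumen's article.

```python
PIECE_TYPES = 6 * 2   # king, queen, rook, bishop, knight, pawn in two colors
SQUARES = 8 * 8       # squares on the board
PLACEMENTS = 4        # pieces the user places

# Assumption: each placement is an independent (piece, square) choice.
choices_per_placement = PIECE_TYPES * SQUARES          # 768
password_space = choices_per_placement ** PLACEMENTS   # 347,892,350,976

REPORTED_SEARCH_SECONDS = 51
implied_guesses_per_second = password_space / REPORTED_SEARCH_SECONDS

print(f"possible chess passwords: {password_space:,}")
print(f"scientific notation:      {password_space:.2e}")
print(f"implied guess rate:       {implied_guesses_per_second:,.0f} per second")
```

The implied rate of several billion guesses per second is what makes exhaustive search practical against a space of this size.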
Passphrases are one of the best solutions to the problem of complexity and memorability. Whereas a password is a single word that is a few characters long, a passphrase is a collection of words (optimally separated by spaces) that is 25 to 50 characters in length.
Arnold Reinhold designed a method of generating secure passphrases using a list of 7,776 short words, each preceded by a five-digit number, and a set of five dice (Reinhold, 2020). His method, called Diceware, has the user roll the five dice and select the word from the list that corresponds to the numbers rolled. An excerpt of the list is available in Table 5.
For online authentication portals that have mechanisms in place to prevent brute force attacks, a shorter passphrase is generally acceptable. If it is possible that an offline attack can be staged, such as against a password manager, then it is advisable to increase the length of the passphrase generated using Diceware to about eight words.
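The Diceware procedure can be sketched in Python as follows. This is an illustration under stated assumptions: software dice from the `secrets` module stand in for physical dice, and the word list is a placeholder built from synthetic entries rather than the real Diceware list. The entropy estimate assumes every word is chosen uniformly at random.

```python
import math
import secrets
from itertools import product
from typing import Dict

DICE_PER_WORD = 5
LIST_SIZE = 6 ** DICE_PER_WORD  # 7,776 entries, one per possible roll


def roll_key() -> str:
    """Simulate five dice and return the rolls as a five-digit key."""
    return "".join(str(secrets.randbelow(6) + 1) for _ in range(DICE_PER_WORD))


def generate_passphrase(wordlist: Dict[str, str], word_count: int) -> str:
    """Look up one word per roll and join the words with spaces."""
    return " ".join(wordlist[roll_key()] for _ in range(word_count))


def entropy_bits(word_count: int) -> float:
    """Entropy of a passphrase whose words are picked uniformly at random."""
    return word_count * math.log2(LIST_SIZE)


# Placeholder list: synthetic entries standing in for the real Diceware words.
placeholder_wordlist = {
    "".join(key): "word" + "".join(key)
    for key in product("123456", repeat=DICE_PER_WORD)
}

passphrase = generate_passphrase(placeholder_wordlist, word_count=8)
print(passphrase)
print(f"{entropy_bits(8):.1f} bits of entropy for eight words")
```

Each word contributes a little under 13 bits, so an eight-word passphrase lands slightly above 100 bits.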
Table 5
Unique Passwords
It is not sufficient to have one secure passphrase and then use it for all accounts. Though the risk of a brute force compromise is significantly reduced, a single leak of that password or passphrase can still compromise any account it was used to secure. Again we encounter the issue of memorability: how can a person memorize a unique passphrase for tens or hundreds of accounts?
Password managers are tools that can help to solve this problem. Modern browsers generally have basic password managers built into them, allowing users to store and automatically populate credentials for websites; however, browser-based solutions have historically offered weaker protection for the passwords stored on the local device.
Standalone password managers are a safer option than their browser-based counterparts. There is a variety to choose from, with features to suit nearly every user’s needs.
A good password manager generates, stores, and populates user credentials for all the websites and applications that a person uses. It incorporates the concept of a single “master password” to unlock the password vault, which then automatically fills in the username and password for the website. The major advantages of this method are that users only need to remember a single complex password, a unique password can be generated for every account, and all the credentials are encrypted and stored safely either locally or in the cloud.
Optimally, a user should utilize the Diceware method to create a strong passphrase and then use that as the master password for the password manager.
Security by Obscurity
Finally, society must stop believing in security by obscurity, or the idea that an individual is simply a needle in a haystack online. This research has demonstrated that with a basic script and minimal computing power, personally identifiable information can easily be mined out of a collection of 1.2 million images. Computers can crack passwords at rates of 6 billion per second (Brumen, 2019) with some hash types, and lists of thousands of common passwords are a Google search away: https://www.passwordrandom.com/most-popular-passwords.
Credential stuffing, account pivoting, and web scraping are all processes that can be run automatically. Anyone with basic scripting skills and Kali Linux can perpetrate very powerful attacks at a large scale. A needle in a haystack is no longer inconspicuous if the hay can be filtered away. Instead, all users of the internet should be proactive in securing their online accounts just as they are in the physical world.
Conclusion
Malicious actors are everywhere on the web, and their techniques for compromising unsuspecting users are evolving at a rapid pace. This research used a human component to mine and analyze images, but even that component could likely be automated with a good classification algorithm. We, as users, have a responsibility to be conscientious about our own safety, and providers of online services should be similarly cautious.
A study by a group of German students found that adware, in which attackers hijack ad services on websites, is a rapidly emerging threat to privacy and safety online (Urban et al., 2019). Requests made via these adware injections contained personal information in approximately 37% of cases. This highlights the fact that any service or vector that a bad actor can find to use for potential gain will be used, and in many cases the victim will never see or know about it.
Users must, then, be cautious with the data that they do have control over. Banks are required to comply with “Know Your Customer” regulations, and all users of the internet should operate similarly. A bank will not open accounts for clients that cannot pass background checks or prove what purpose their bank accounts will serve. It is necessary for an applicant to demonstrate where their money will be flowing from. Users of the internet should consider where their data will be going. Andrew Lewis stated, “If you are not paying for it, you’re not the customer; you’re the product being sold.” Do you know the customers and consumers of your personal data?
Technology is constantly evolving, and it is impossible to stay ahead of malicious actors when developing systems. There are many success stories, though. In 2006, Instant Messaging was a key inroad for attackers because it was an emerging technology that was still developing safeguards (O’Sullivan, 2006). As technologies evolve and are iterated upon by developers, they get safer. Security researchers are becoming much more prevalent at organizations, but many users’ general attitudes towards security have not changed.
In this project it has been demonstrated that a non-trivial number of credentials are publicly available on an image hosting site. There were over 400,000 keyword matches on a keyword set that ranged from home and IP address details to video games. In the world of cybercrime data equals money and power, and we, as a society of internet users, must do a better job of protecting ourselves.
References
Brumen, B. (2019). Security analysis of Game Changer Password System. International Journal of Human-Computer Studies, 126, 44–52. https://doi.org/10.1016/j.ijhcs.2019.01.004
Grassi, P. A., Fenton, J. L., Newton, E. M., Perlner, R. A., Regenscheid, A. R., Burr, W. E., Richer, J. P., Lefkovitz, N. B., Danker, J. M., Choong, Y.-Y., Greene, K. K., & Theofanos, M. F. (2017). NIST Special Publication 800-63B Digital Identity Guidelines: Authentication and Lifecycle Management. https://doi.org/10.6028/NIST.SP.800-63b
Li, H., Yu, L., & He, W. (2019). The Impact of GDPR on Global Technology Development. Journal of Global Information Technology Management, 22(1), 1–6. https://doi.org/10.1080/1097198X.2019.1569186
O’Sullivan, S. (2006). Instant Messaging vs. instant compromise. Network Security, 2006(7), 4–6. https://doi.org/10.1016/S1353-4858(06)70408-1
Reinhold, A. (2020). The Diceware Passphrase Homepage. https://theworld.com/~reinhold/diceware.html
Urban, T., Tatang, D., Holz, T., & Pohlmann, N. (2019). Analyzing leakage of personal information by malware. Journal of Computer Security, 27(4), 459–481. https://doi.org/10.3233/JCS-191287