Seeing the signature green padlock and “https” in the browser bar means one thing for most internet users: safety. However, is this sense of security justified? The short answer is a loud, resounding no!
To start, let’s define what “https” really means: the connection to the website is encrypted, so all information exchanged with the site is protected by encryption keys that allow only the user and the site’s server to read it. A site served over https also presents a file, called a web certificate, with information about the owner of the encryption key (the subject) and the organization that issued the certificate (the issuer).
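To make the subject and issuer fields concrete, here is a minimal Python sketch that reads them from a certificate. It assumes the dictionary layout returned by the standard library's ssl.SSLSocket.getpeercert(); the sample values, the helper names (flatten_name, describe, fetch_cert) and the host are hypothetical and are not taken from the research described here.

```python
import socket
import ssl

# Made-up certificate in the shape returned by ssl.SSLSocket.getpeercert().
SAMPLE_CERT = {
    "subject": (
        (("countryName", "US"),),
        (("organizationName", "Example Corp"),),
        (("commonName", "www.example.com"),),
    ),
    "issuer": (
        (("countryName", "US"),),
        (("organizationName", "Example CA"),),
        (("commonName", "Example CA Root"),),
    ),
    "notBefore": "Jan  1 00:00:00 2018 GMT",
    "notAfter": "Jan  1 00:00:00 2019 GMT",
}


def flatten_name(name):
    """Turn the nested tuple structure of a subject or issuer into a flat dict."""
    return {key: value for rdn in name for key, value in rdn}


def describe(cert):
    """Return the subject and issuer of a certificate as plain dicts."""
    return {
        "subject": flatten_name(cert.get("subject", ())),
        "issuer": flatten_name(cert.get("issuer", ())),
    }


def fetch_cert(host, port=443, timeout=5.0):
    """Fetch a live certificate (needs network access; not called below)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


info = describe(SAMPLE_CERT)
print("Subject:", info["subject"]["commonName"])
print("Issuer: ", info["issuer"]["commonName"])
```

Note that fetch_cert uses the default verification settings, so it only returns certificates that already pass standard validation. That is exactly the case discussed in this article: a certificate can be valid and still belong to a malicious site.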
Modern internet browsers use this information to tell users if a webpage is trustworthy. However, cybercriminals have found a way to take advantage of users’ innate trust in encrypted websites and use them for nefarious activities. Knowing this, Cyxtera’s research team recently set out to investigate how deep neural networks can be used to uncover malicious web certificates in the wild.
A False Sense of Security
A 2018 survey by a leading industry analyst firm asked users what they understood the security indicator in their browser to mean regarding the safety of the website. Unsurprisingly, most of the respondents incorrectly indicated that they equate the symbol and the https in the URL with trustworthiness and safety.
This misunderstanding is the result of a mantra that has existed in the cybersecurity world since the 1990s: “Only trust websites when you see https in your address bar.” Though this was valid in the past, it no longer holds in a world where cybercriminals are able to harness the very same technology and make it work in their favor.
The Browser Protection Approach
When a website is accessed, modern browsers validate the information contained in its web certificate and, based on that, tell the user whether they should be careful when accessing that particular website. If a website is marked suspicious or is reported, it is added to a blacklist. Unfortunately, blacklisting is reactive and introduces delays in detecting malicious activity, as not all dangerous websites with valid certificates are immediately detected by a browser or reported by a user. This gives threat actors a window in which to deceive users into divulging sensitive information; even in a short period of time, they are capable of causing a lot of damage.
Left: Example of a phishing site with a valid certificate shown as “secure” in the browser; Right: Example of a legitimate website shown as “not secure” in the browser
Analyzing the Data in Web Certificates
With all of this in mind, the Cyxtera research team began their experiment to see if they could improve malicious certificate detection using deep learning. To start, they gathered 1 million certificates from legitimate websites, 3,000 certificates from phishing sites, and 3,000 certificates from malware-infected websites. From this data, they learned that almost all malicious websites use self-generated certificates that can be obtained for free; fewer than 10 percent use purchased certificates that contain fake information. Fifty-five percent of legitimate businesses also use self-generated certificates, but with one big difference: their certificates contain real, verifiable information. Further, many legitimate websites are flagged by browsers as suspicious because their web certificates were improperly generated or have been allowed to expire.
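Two of the properties mentioned above can be checked mechanically. The Python sketch below is an illustration only, not the team's tooling. It assumes that “self-generated” can be approximated as “subject equals issuer”, which the article does not confirm, and it reuses the dictionary layout returned by ssl.SSLSocket.getpeercert(). The sample certificate and the helper names are hypothetical.

```python
import ssl

# Made-up certificate whose subject and issuer are identical.
SAMPLE_CERT = {
    "subject": ((("commonName", "shop.example.test"),),),
    "issuer": ((("commonName", "shop.example.test"),),),
    "notBefore": "Jan  5 09:34:43 2017 GMT",
    "notAfter": "Jan  5 09:34:43 2018 GMT",
}


def flatten_name(name):
    """Turn the nested tuple structure of a subject or issuer into a flat dict."""
    return {key: value for rdn in name for key, value in rdn}


def looks_self_issued(cert):
    """True when subject and issuer carry the same fields.

    Assumption: this is a cheap proxy for a self-signed certificate. A real
    check would also verify the signature against the certificate's own key.
    """
    return flatten_name(cert["subject"]) == flatten_name(cert["issuer"])


def is_expired(cert, now):
    """True when `now` (seconds since the epoch) is later than notAfter."""
    return now > ssl.cert_time_to_seconds(cert["notAfter"])


not_after = ssl.cert_time_to_seconds(SAMPLE_CERT["notAfter"])
print("self-issued:", looks_self_issued(SAMPLE_CERT))
print("expired one day after notAfter:", is_expired(SAMPLE_CERT, not_after + 86400))
```

As the paragraph above points out, neither check separates good sites from bad ones on its own: legitimate sites also use self-generated certificates and sometimes let them expire, which is why the team looked beyond simple rules like these.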
Letting the Algorithm Learn by Itself
Armed with this information, the research team set out to make a tool capable of distinguishing whether a web certificate is being used for legitimate purposes or for malicious activities, while avoiding the high number of false positives and negatives generated by browser detection systems. To do so, they first processed the certificates and extracted the main characteristics that differentiate malicious sites from legitimate ones. Then, they set out to create their algorithm.
Other research teams that have tackled the same problem with machine learning have used a Support-Vector Machine (SVM) algorithm, with moderate success. The Cyxtera team chose a deep-learning algorithm in the hope of achieving a higher level of accuracy: a Long Short-Term Memory (LSTM) network, a type of recurrent neural network that can learn by itself from the text content of web certificates. This algorithm is capable of discovering new patterns on its own, without help from its programmers, meaning that it is able to investigate beyond what humans are able to infer from certificate content.
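An LSTM consumes sequences of numbers, so certificate text has to be encoded before it can be fed to the network. The article does not say how the team did this; the Python sketch below shows one common option, character-level encoding with padding to a fixed length. The sample strings, the reserved ids and the function names are assumptions made for illustration.

```python
PAD = 0  # id reserved for padding short sequences
UNK = 1  # id reserved for characters missing from the vocabulary


def build_vocab(texts):
    """Map every character seen in `texts` to an integer id starting at 2."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx for idx, ch in enumerate(chars, start=2)}


def encode(text, vocab, max_len):
    """Convert text to a list of exactly `max_len` ids (truncated or padded)."""
    ids = [vocab.get(ch, UNK) for ch in text[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Made-up subject strings standing in for certificate text.
texts = ["CN=www.example.com,O=Example Corp", "CN=localhost"]
vocab = build_vocab(texts)
batch = [encode(text, vocab, max_len=16) for text in texts]

for text, ids in zip(texts, batch):
    print(text, "->", ids)
```

Each row of batch has the same length, so the rows can be stacked into a single array and passed to an embedding layer followed by an LSTM layer in whichever deep-learning framework is used.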
To test the new algorithm, the research team created an SVM algorithm and compared its performance to that of the LSTM algorithm when used on the same dataset to detect malware and phishing certificates. Though the SVM performed adequately, the deep-learning algorithm was able to outperform it by 5 percent in the case of malware and 3 percent in the case of phishing.
By employing a deep learning approach to identify when a web certificate is being abused, the Cyxtera research team was able to detect malicious certificates with a significant level of accuracy, avoiding the need to rely on slow browser detection systems.
Using this deep-learning algorithm to identify potential abuse of TLS certificates has improved the performance of Cyxtera’s Digital Threat Detection platform, reducing the time needed to detect new attacks from malicious websites.