Last year we published an #AskSecPro series explaining how signature-based malware analysis works and how traditional signatures are created. One area we don’t often discuss in public channels, but which has played a pivotal role in SiteLock becoming a global leader in website security solutions, is our research and development work on new security technologies. In addition to our more traditional approaches to malware detection, SiteLock continues to explore new technologies that push the field of security research forward. For some time, SiteLock has been developing machine learning mechanisms as part of its process for automatically discovering new malware iterations. Our research shows that machine learning promises to be an important part of early malware detection and preliminary identification. One of our most significant breakthroughs in applying machine learning to malware detection and signatures has been in feature-based signature analysis.
Feature-based signatures differ from traditional signatures in that their purpose is not to find known malware but to find malware that has never been seen before. One limitation of a traditional signature approach is that it cannot detect never-before-seen malware in the wild; it can only detect malware that has already been identified and classified in a signature database. With traditional signatures, you execute your malware search by asking a yes-or-no question: “Does this code match what we know to be malware?” In feature-based signature analysis, we leave behind strictly defined program instructions and, in effect, encourage the machine to form the questions we don’t yet know to ask.
The term “feature-based” refers to analyzing code based on its features, that is, its actions, mechanisms, and behavior.
Generating new traditional signatures typically relies on a large staff of analysts who dissect website code to define exactly what is and isn’t malware, then design a safe way to surgically remove the problem code. This works exceptionally well for finding and documenting new malware, provided your staff can scale to meet the volume of code being analyzed. However, the scalability of this arrangement comes into question when you’re the largest website security provider in the world in terms of volume. Enter feature-based signature analysis, which, like a human auditor, focuses on the behavior of the application being inspected. Based on that behavior, feature-based signature analysis can determine, on a sliding scale of certainty, whether the application is up to no good.
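The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not SiteLock's implementation: the feature names, regular expressions, weights, and the hash-based signature database are all hypothetical, and a real system would learn its features and weights from data rather than hard-code them.

```python
import hashlib
import re

# Hypothetical signature database: SHA-256 digests of files that analysts
# have already classified as malware.
KNOWN_BAD = {hashlib.sha256(b"<?php eval($_POST['x']); ?>").hexdigest()}


def signature_match(source: bytes) -> bool:
    """Traditional approach: a yes/no answer for exactly-known malware."""
    return hashlib.sha256(source).hexdigest() in KNOWN_BAD


# Hypothetical behavioral features with hand-picked integer weights.
# These values are assumptions made purely for illustration.
FEATURES = {
    "dynamic_eval": (re.compile(rb"\beval\s*\("), 40),
    "decodes_payload": (re.compile(rb"\b(base64_decode|gzinflate|str_rot13)\s*\("), 30),
    "reads_request_input": (re.compile(rb"\$_(POST|GET|REQUEST|COOKIE)\b"), 20),
    "runs_shell_command": (re.compile(rb"\b(system|shell_exec|passthru)\s*\("), 30),
}


def extract_features(source: bytes) -> set[str]:
    """Return the names of the behaviors the code appears to exhibit."""
    return {name for name, (pattern, _) in FEATURES.items() if pattern.search(source)}


def suspicion_score(source: bytes) -> float:
    """Feature-based approach: a score from 0.0 (benign) to 1.0 (suspicious)."""
    total = sum(FEATURES[name][1] for name in extract_features(source))
    return min(total, 100) / 100


known = b"<?php eval($_POST['x']); ?>"
variant = b"<?php eval(base64_decode($_POST['cmd'])); ?>"
benign = b'<?php echo "Hello, world"; ?>'

for sample in (known, variant, benign):
    print(signature_match(sample), suspicion_score(sample))
```

In this toy setup the variant is missed by the exact signature lookup because it was never catalogued, yet it still scores high on the feature scale because it combines dynamic evaluation, payload decoding, and request input.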
Feature-based signature analysis is a scalable solution to data analysis, but it is really only feasible on a massive scale, because an enormous data set is required to produce patterns of any tangible value. At SiteLock, we perform malware audits on over one billion files per day, which gives us a substantial data set to analyze through machine learning. As of today, we’re able to evaluate over 13.8 duovigintillion behavioral variations on every file we audit using feature-based signature analysis. To put things in perspective, if we were able to employ every single one of the 7.125 billion living humans on earth to perform this analysis on a daily basis, each person would need to perform over 29.1 vigintillion points of analysis per second to match the load of our feature-based signature analysis system.
A vigintillion is a one followed by sixty-three zeroes.
A duovigintillion is a one followed by sixty-nine zeroes.
These numbers are so large, I had to look them up on the internet to put them into words!
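For readers who want to check the scale of the per-person figure, here is a rough Python sketch of the arithmetic. The article only states "over one billion" files per day, so the exact daily volume is unknown; the 1.3 billion value below is an assumption, chosen only because it reproduces a figure close to the quoted one.

```python
VIGINTILLION = 10**63

# 13.8 duovigintillion (13.8 * 10**69) variations per file, kept as an exact integer.
VARIATIONS_PER_FILE = 138 * 10**68
PEOPLE = 7_125_000_000
SECONDS_PER_DAY = 86_400


def per_person_per_second(files_per_day: int) -> int:
    """Points of analysis each person would need to perform every second."""
    total_per_day = VARIATIONS_PER_FILE * files_per_day
    return total_per_day // (PEOPLE * SECONDS_PER_DAY)


# 1.0 billion is the stated lower bound; 1.3 billion is an assumed volume.
for files_per_day in (1_000_000_000, 1_300_000_000):
    rate = per_person_per_second(files_per_day)
    print(f"{files_per_day:,} files/day: {rate / VIGINTILLION:.1f} vigintillion per person per second")
```

At exactly one billion files per day the result is about 22.4 vigintillion per person per second; at the assumed 1.3 billion files per day it is about 29.1 vigintillion.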
While we are able to analyze a massive number of variations, the majority of new malware we’ve found to date has been located in a comparatively narrow corridor of about 80,000 possible combinations.
Feature-based signature analysis is just one of the many Skunkworks projects that SiteLock is currently developing. Through the use of cutting-edge machine learning technologies, SiteLock carries on the fight in the arms race of application security. We strive to protect the web from malicious adversaries by continuing to bring new and emerging technologies into our defense arsenal. Stay tuned for future articles on our technology as it develops.