Security researchers at security firms like SiteLock® audit code that has been flagged as suspicious, either by individuals or by an automated system performing behavioral analysis (which we’ll talk more about in the next section), to determine whether or not the code is actually malicious. If a file or piece of code is deemed malicious by the security researcher, it enters the database, typically as either a file match signature, or a code snippet signature.
When a file is found to contain malware and only malware, a file match signature will be created based on the unique characteristics of the file. Often file match signatures will contain a message digest of the file, also known as a ‘checksum’ or a ‘hash,’ for increased process speed and efficiency. By using hashes, the scanner is able to avoid the computationally-intensive route of reading the entire contents of every single file against the entire contents of every single iteration of malware ever discovered, reducing a process which could take days or weeks down to a process that runs in minutes or hours.
Hash-like identification logic can be seen in the form of using license plates on automobiles to identify them. If you were tasked with identifying every unique attribute of a specific car, what makes it different from the other thousands of cars with the same make and model, you probably could, but it would be an incredibly time-consuming process and wouldn’t be a very practical method for identification. Instead, many parts of the world have adopted the use of fixed-length license plates as a more efficient method for identification. Just like a car could have any number of documentable characteristics, an individual file in the wild could have any arbitrary length and size. Like license plates, hashes have fixed lengths, such as 128 bits in the popular MD5 format or 160 bits in SHA-1 format, which allow for the quick and practical identification of malicious files on WordPress sites.
When a security researcher has found a legitimate file that has been compromised by malware, for example where malicious code has been injected to an existing web page, it will typically be entered into the database as a code snippet signature in the form of either plaint text or a regular expression. A regular expression is a character representation that defines a search pattern, and thus another method for increasing scan speed and efficiency by reducing the computational tax of the scan operation.
Signatures often follow a uniform naming standard and will look something like “SiteLock-PHP-BACKDOOR-GENERIC-MD5” which helps tell us the background at a glance:
- SiteLock is the name of the database the signature is contained within;
- PHP is the language the malware was written in;
- BACKDOOR tells us that the malware being documented is a backdoor;
- GENERIC tells us that the malware specimen is probably somewhat run-of-the-mill, meaning the malware may simply be a new iteration of a popular malware distribution;
- MD5 indicates that the signature is a file match signature that has been stored in MD5 hash format.
By using this classification format, the security mechanisms are able to organize and reference individual signatures even while sourcing multiple databases.