Researchers who study stylometry—the statistical analysis of linguistic style—have long known that writing is a unique, individualistic process. The vocabulary you select, your syntax, and your grammatical decisions leave behind a signature. Automated tools can now accurately identify the author of a forum post, for example, as long as they have adequate training data to work with. But newer research shows that stylometry can also apply to artificial language samples, like code. Software developers, it turns out, leave behind a fingerprint as well.
Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, have found that code, like other forms of stylistic expression, is not anonymous. At the DefCon hacking conference Friday, the pair will present a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. Their work could be useful in a plagiarism dispute, for instance, but it also has privacy implications, especially for the thousands of developers who contribute open source code to the world.
How To De-Anonymize Code
Here's a simple explanation of how the researchers used machine learning to uncover who authored a piece of code. First, the algorithm they designed identifies all the features found in a selection of code samples. That's a lot of different characteristics. Think of every aspect that exists in natural language: There are the words you choose, the way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to only include the ones that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50.
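The winnowing step can be illustrated with a toy sketch. This is not the researchers' actual pipeline; the authors, samples, and features below are invented for illustration. The idea is to extract candidate features from labeled code samples, then keep only the features whose usage actually differs between authors.

```python
from collections import Counter

# Toy corpus: code samples labeled by (hypothetical) author.
samples = [
    ("alice", "for i in range(10): total += i"),
    ("alice", "for x in range(n): s += f(x)"),
    ("bob",   "while i < 10:\n    total = total + i\n    i = i + 1"),
    ("bob",   "while j < n:\n    s = s + g(j)\n    j = j + 1"),
]

def extract_features(code):
    """Count simple lexical features (a stand-in for richer syntactic ones)."""
    return Counter(code.replace("\n", " ").split())

# Aggregate feature counts per author.
per_author = {}
for author, code in samples:
    per_author.setdefault(author, Counter()).update(extract_features(code))

# Keep only features whose counts differ between authors -- the
# discriminative ones; features used identically by everyone carry
# no authorship signal and get trimmed away.
authors = list(per_author)
all_feats = set().union(*per_author.values())
discriminative = sorted(
    f for f in all_feats
    if len({per_author[a][f] for a in authors}) > 1
)
print(discriminative)
```

Here the `for` and `while` tokens survive the trim because the two toy authors favor different loop constructs; a real system would score features more carefully (for example, by information gain) rather than with this simple differs-at-all test.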
The researchers don't rely on low-level features, like how code was formatted. Instead, they create "abstract syntax trees," which reflect code's underlying structure, rather than its arbitrary components. Their technique is akin to prioritizing someone's sentence structure, instead of whether they indent each line in a paragraph.
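Python's built-in `ast` module gives a feel for what an abstract syntax tree captures. This is an illustration, not the researchers' tooling: two snippets that compute the same value with different structure yield different trees, while reformatting a snippet's whitespace changes its tree not at all.

```python
import ast

# Two functionally equivalent snippets with different structure.
loop_version = "total = 0\nfor i in range(5):\n    total += i"
comp_version = "total = sum(i for i in range(5))"

# The same loop with extra whitespace and a blank line: different
# formatting, identical structure.
reformatted = "total = 0\n\nfor i in range( 5 ):\n        total += i"

# The AST reflects structure, not formatting.
print(ast.dump(ast.parse(loop_version)) == ast.dump(ast.parse(comp_version)))
print(ast.dump(ast.parse(loop_version)) == ast.dump(ast.parse(reformatted)))
```

The first comparison is false (a `for` loop and a generator expression produce different trees) and the second is true, which is exactly why AST-based features ignore "arbitrary components" like indentation.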
The method also requires examples of someone's work to teach an algorithm to recognize when it spots another of their code samples. If a random GitHub account pops up and publishes a code fragment, Greenstadt and Caliskan wouldn't necessarily be able to identify the person behind it, because they only have one sample to work with. (They could possibly tell that it was a developer they hadn't seen before.) Greenstadt and Caliskan, however, don't need your life's work to attribute code to you. It only takes a few short samples.
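That training requirement can be sketched as a nearest-centroid classifier, a toy stand-in for the researchers' actual machine learning model, with invented authors and feature values: average each known author's feature vectors, then attribute a new sample to whichever average it sits closest to.

```python
import math

# Toy feature vectors (e.g., normalized counts of syntactic features)
# for known samples by two hypothetical authors.
training = {
    "alice": [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5]],
    "bob":   [[0.1, 0.9, 0.7], [0.2, 0.8, 0.6]],
}

def centroid(vectors):
    """Average one author's known feature vectors into a single profile."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def attribute(sample, centroids):
    """Attribute a new sample to the author with the nearest profile."""
    return min(centroids, key=lambda a: math.dist(sample, centroids[a]))

centroids = {author: centroid(vs) for author, vs in training.items()}
print(attribute([0.85, 0.15, 0.45], centroids))  # prints "alice"
```

With only a single sample from an unknown account there is no profile to match against, which is why a lone code fragment resists attribution while a handful of samples does not.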
For example, in a 2017 paper, Caliskan, Greenstadt, and two other researchers demonstrated that even small snippets of code on the repository site GitHub can be enough to differentiate one coder from another with a high degree of accuracy.
Most impressively, Caliskan and a team of other researchers showed in a separate paper that it’s possible to de-anonymize a programmer using only their compiled binary code. After a developer finishes writing a section of code, a program called a compiler turns it into a series of 1s and 0s, called binary, that can be read by a machine. To humans, it mostly looks like nonsense.
Caliskan and the other researchers she worked with can decompile the binary back into the C++ programming language, while preserving elements of a developer’s unique style. Imagine you wrote a paper and used Google Translate to transform it into another language. While the text might seem completely different, elements of how you write are still embedded in traits like your syntax. The same holds true for code.
“Style is preserved,” says Caliskan. “There is a very strong stylistic fingerprint that remains when things are based on learning on an individual basis.”
To conduct the binary experiment, Caliskan and the other researchers used code samples from Google’s annual Code Jam competition. The machine learning algorithm correctly identified a group of 100 individual programmers 96 percent of the time, using eight code samples from each. Even when the sample size was widened to 600 programmers, the algorithm still made an accurate identification 83 percent of the time.
Plagiarism and Privacy Implications
Caliskan and Greenstadt say their work could be used to tell whether a programming student plagiarized, or whether a developer violated a noncompete clause in their employment contract. Security researchers could potentially use it to help determine who might have created a specific type of malware.
More worryingly, an authoritarian government could use the de-anonymization techniques to identify the individuals behind, say, a censorship circumvention tool. The research also has privacy implications for developers who contribute to open source projects, especially if they consistently use the same GitHub account.
“People should be aware that it’s generally very hard to 100 percent hide your identity in these kinds of situations,” says Greenstadt.
For example, Greenstadt and Caliskan have found that some off-the-shelf obfuscation methods, tools used by software engineers to make code more complicated and thus more secure, aren't successful in hiding a developer's unique style. The researchers say that in the future, however, programmers might be able to conceal their styles using more sophisticated methods.
“I do think as we proceed, one thing we’re going to discover is what kind of obfuscation works to hide this stuff,” says Greenstadt. “I’m not convinced that the end point of this is going to be everything you do forever is traceable. I hope not, anyway.”
In a separate paper, for instance, a team led by Lucy Simko at the University of Washington found that programmers could craft code with the intention of tricking an algorithm into believing it had been authored by someone else. The team found that a developer may be able to spoof their "coding signature," even if they're not specifically trained in creating forgeries.
Greenstadt and Caliskan have also uncovered a number of interesting insights about the nature of programming. For example, they have found that experienced developers appear easier to identify than novice ones. The more skilled you are, the more unique your work apparently becomes. That might be in part because beginner programmers often copy and paste code solutions from websites like Stack Overflow.
Similarly, they found that code samples addressing more difficult problems are also easier to attribute. Using a sample set of 62 programmers, who each solved seven "easy" problems, the researchers were able to de-anonymize their work 90 percent of the time. When the researchers used seven "hard" problem samples instead, their accuracy bumped to 95 percent.
In the future, Greenstadt and Caliskan want to understand how other factors might affect a person’s coding style, like what happens when members of the same organization collaborate on a project. They also want to explore questions like whether people from different countries code in different ways. In one preliminary study, for example, they found they could differentiate between code samples written by Canadian and by Chinese developers with over 90 percent accuracy.
There’s also the question of whether the same attribution methods could be used across different programming languages in a standardized way. For now, the researchers stress that de-anonymizing code is still a mysterious process, though so far their methods have been shown to work.
“We’re still trying to understand what makes something really attributable and what doesn't,” says Greenstadt. “There’s enough here to say it should be a concern, but I hope it doesn’t cause anybody to not contribute publicly on things.”