Preferred Name


Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.


Date of Graduation


Document Type


Degree Name

Master of Science (MS)


Department of Computer Science


Xunhua Wang

Nathan Sprague

Brett Tjaden


Code stylometry is the application of analysis techniques to a collection of source code or binaries to identify variations in style. The extracted variations are often used to identify the author of the code or to differentiate one piece from another.

In this research, we created a multi-input deep learning model that accurately categorizes and groups code from multiple projects. The model takes three inputs: word-based tokenization of the code comments, character-based tokenization of the source code text, and the metadata features described by A. Caliskan-Islam et al. Using these three inputs, we achieved 90% validation accuracy with a loss value of 0.1203 on 12 projects consisting of 5,877 files. Finally, we analyzed the Bitcoin source code with our model, which showed a high-probability match to the OpenSSL project.
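
To make the three inputs concrete, the following is a minimal sketch of how a single C-style source file might be split into a word-token sequence for comments, a character-token sequence for code, and a small metadata vector. It is not the thesis implementation: the function names, the regular expression, the sequence lengths, and the four layout statistics are all assumptions chosen for illustration, and the layout statistics are only a stand-in for the much richer feature set of Caliskan-Islam et al.

```python
import re
from collections import Counter

# Assumption: C-style comments only; this naive pattern ignores string literals.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

def split_comments(source):
    """Return (comment_text, code_text) for one source file."""
    comments = " ".join(COMMENT_RE.findall(source))
    code = COMMENT_RE.sub("", source)
    return comments, code

def word_tokens(comment_text):
    """Word-based tokenization for comments (lowercased)."""
    return re.findall(r"[a-z0-9']+", comment_text.lower())

def char_tokens(code_text):
    """Character-based tokenization for code, whitespace included."""
    return list(code_text)

def build_vocab(token_lists):
    """Map tokens to integer ids; 0 is reserved for padding, 1 for unknown."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common())}

def encode(tokens, vocab, length):
    """Convert tokens to ids, then truncate or zero-pad to a fixed length."""
    ids = [vocab.get(tok, 1) for tok in tokens][:length]
    return ids + [0] * (length - len(ids))

def layout_features(source):
    """Hypothetical stand-in for the metadata input: four layout statistics."""
    lines = source.splitlines() or [""]
    n_chars = max(len(source), 1)
    return [
        source.count("\t") / n_chars,                            # tab density
        source.count(" ") / n_chars,                             # space density
        sum(1 for ln in lines if not ln.strip()) / len(lines),   # blank-line ratio
        sum(len(ln) for ln in lines) / len(lines),               # mean line length
    ]

def prepare_inputs(source, word_vocab, char_vocab, n_words=8, n_chars=32):
    """Build the three per-file inputs a multi-input model could consume."""
    comments, code = split_comments(source)
    return (
        encode(word_tokens(comments), word_vocab, n_words),
        encode(char_tokens(code), char_vocab, n_chars),
        layout_features(source),
    )

sample = "// add two ints\nint add(int a, int b) {\n\treturn a + b;\n}\n"
comments, code = split_comments(sample)
word_vocab = build_vocab([word_tokens(comments)])
char_vocab = build_vocab([char_tokens(code)])
words, chars, meta = prepare_inputs(sample, word_vocab, char_vocab)
```

In a full pipeline, each of the three outputs would feed its own branch of the network (for example, an embedding layer for each token sequence and a dense layer for the metadata vector) before the branches are merged for classification. The vocabularies would be built from the whole training corpus rather than from a single file as in this toy usage.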


