Those who are well-versed in authentication know that three kinds of things may be used as an authentication factor:
- Something we know (PINs, passwords).
- Something we have (OTP pad).
- Something we are (iris, fingerprints).
The need for eliminating single-factor authentication
In many cases (including Gmail), a single password is all that stands between an attacker and our account. So, what happens if a person with malicious intent gets hold of it? We get a notification that our account has been accessed from a new browser, but that's it. The malicious person gets access to our account and all is lost.
There are many ways to address this problem. One of them is 2FA, which may be a password/biometric or password/OTP combination. Keystroke dynamics is another, and it can help prevent a person from accessing your account even if they know your password.
What is Keystroke Dynamics?
Keystroke dynamics is a type of behavioral biometric. It relies on the assumption that every person has a unique typing pattern. This typing pattern is described by timing samples, such as how long each key is held down and the time between key presses, among other features.
If a malicious person knows your password, keystroke dynamics can help prevent access to your system because the malicious person most likely has a different typing pattern than you. In the world of cyber-cells and cyber-police, just preventing access is not enough. It is also vital to know the identity of the person trying to hack into the system.
The concept of keystroke dynamics can be used in two ways:
- User authentication - involves comparing input data with the authorized user template.
- User identification - involves comparing input data with ALL the registered user templates.
In general, user identification is more resource-intensive and more error-prone, because the input has to be compared with every registered template rather than just one. In this post, I'll set up a simple keystroke-dynamics-enabled user identification system using Python 2.7, which analyzes the typing pattern of the user using simple statistics and identifies the most probable user if possible.
It is important to note that simple statistical analysis of timing samples leads to identification systems that are not very accurate. Authentication systems built on the same kind of statistical analysis, however, can be reliable.
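To make the difference between the two modes concrete, here is a minimal sketch written for Python 3 (not the Python 2.7 code from the repository). The function names, the mean-absolute-difference distance, the threshold, and the sample timings are all assumptions made for illustration; they are not taken from my system.

```python
from statistics import mean


def distance(sample, template):
    """Mean absolute difference between two equal-length timing vectors."""
    return mean(abs(s - t) for s, t in zip(sample, template))


def authenticate(sample, template, threshold):
    """1:1 check: compare the input with the claimed user's template only."""
    return distance(sample, template) <= threshold


def identify(sample, templates):
    """1:N search: rank every registered user by closeness to the input."""
    return sorted(templates, key=lambda user: distance(sample, templates[user]))


# Made-up timing vectors in seconds (hypothetical users).
templates = {
    "alice": [0.10, 0.12, 0.09],
    "bob": [0.20, 0.25, 0.18],
}
sample = [0.11, 0.12, 0.10]

print(authenticate(sample, templates["alice"], threshold=0.02))
print(identify(sample, templates))
```

Authentication does one comparison per attempt, while identification does one comparison per registered user, which is why identification costs more and has more opportunities to be wrong.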
Which features are we using?
In my system, I'll be using three simple features:
- Dwell time - the time interval between a key being pressed down and released.
- Flight time - the time interval between key press and next key press.
- Key affinity - user preference to use shift or caps lock keys for special or uppercase characters.
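The three features can be sketched as follows, assuming each keystroke has already been captured as a `(key, press_time, release_time)` tuple with times in milliseconds. That event format, the key names `"shift"` and `"capslock"`, and the function names are hypothetical and chosen only for this Python 3 illustration.

```python
def dwell_times(events):
    """Time each key was held down: release time minus press time."""
    return [release - press for _key, press, release in events]


def flight_times(events):
    """Time from one key press to the next key press."""
    presses = [press for _key, press, _release in events]
    return [later - earlier for earlier, later in zip(presses, presses[1:])]


def key_affinity(events):
    """Return which modifier the user relied on, or None if neither was used."""
    keys = {key for key, _press, _release in events}
    if "shift" in keys:
        return "shift"
    if "capslock" in keys:
        return "capslock"
    return None


# Made-up events: (key, press time in ms, release time in ms).
events = [(".", 0, 90), ("z", 150, 230), ("shift", 400, 700), ("B", 520, 600)]

print(dwell_times(events))   # [90, 80, 300, 80]
print(flight_times(events))  # [150, 250, 120]
print(key_affinity(events))  # shift
```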
Overview
System Training Overview
This algorithm uses the simplest method of training - static training.
We'll be using a single password, .zoroBen1, which each user types thirty times. With this password, we should be able to extract ten dwell time samples and nine flight time samples. I've ignored two flight time samples because of their unstable nature:
- the flight time between a character and the caps-lock or shift key. However, the flight time between the caps-lock or shift key and the next character is considered.
- the flight time between the last entered character and the carriage-return (Enter) key.
A flag marks the difference between the usage of shift or caps lock keys. This flag value combined with the timing samples forms the template for a particular user.
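One possible shape for such a template is sketched below in Python 3. I am assuming here that the repeated timing vectors are summarized by a per-position mean and standard deviation; the dictionary layout, the function name, and the numbers are illustrative and may differ from what the repository actually stores.

```python
from statistics import mean, stdev


def build_template(samples, affinity_flag):
    """Summarize repeated timing vectors into a per-position mean and std.

    samples: list of equal-length timing vectors, one per typed repetition.
    affinity_flag: which modifier the user prefers (e.g. "shift").
    """
    columns = list(zip(*samples))  # group the values for each key position
    return {
        "mean": [mean(column) for column in columns],
        "std": [stdev(column) for column in columns],
        "affinity": affinity_flag,
    }


# Three made-up repetitions with two timing values each (milliseconds).
samples = [[100, 200], [110, 190], [90, 210]]
template = build_template(samples, "shift")
print(template)
```

With thirty repetitions per user, each entry in the mean and standard deviation lists would be computed from thirty values instead of three.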
System Testing Overview
When testing, the input data flows through a statistical layer where it is compared with all the user templates available in the system. The system then outputs a list of usernames ordered from the most probable to the least probable. The performance of this kind of ranked output is measured with the Cumulative Matching Curve (CMC), a rank-based measure of how often the correct user appears at or above a given rank.
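As a rough Python 3 illustration of how a CMC is computed, suppose we record, for every test attempt, the position of the true user in the output list (0 for the top spot). The function name and the ranks below are made up for the example and are not results from my experiments.

```python
def cmc(true_ranks, max_rank):
    """Fraction of attempts where the true user is at rank k or better,
    for k = 0 .. max_rank - 1 (rank 0 is the top of the list)."""
    total = len(true_ranks)
    return [
        sum(1 for rank in true_ranks if rank <= k) / total
        for k in range(max_rank)
    ]


# Made-up ranks of the true user across ten test attempts.
ranks = [0, 0, 2, 1, 0, 4, 0, 1, 0, 0]
curve = cmc(ranks, 5)
print(curve)  # [0.6, 0.8, 0.9, 0.9, 1.0]
```

The first value is the plain identification rate, and the last value is the fraction of attempts where the true user shows up in the top five.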
Algorithm Flow
I've described the flow at a very high level here. If you need more information, please refer to the published paper, which is linked in the Readings section.
- The first step is to initialize global variables as part of object attributes. This includes - sample password string, number of expected dwell time samples, number of expected flight time samples, accepted deviation, etc.
- I've used Python's pygame module to capture the password and calculate the associated timing samples.
- A simple string check is executed to check if the entered password is correct.
- Depending on the script action, there are two paths the script can take:
  - if the system is in training mode, it stores the key-affinity flag value and timing samples in a CSV file.
  - if the system is in testing mode, it moves into the comparison layer.
- If the code flow reaches this point, then it is in testing mode. At this point, the templates for all users are read from multiple .csv files.
- For each character's timing sample, the Euclidean distance between the user input and the user template is calculated (for a single timing value, this is simply the absolute difference). If this is less than a certain multiple of the standard deviation for that character for that specific user template, then the input timing sample for that character is considered a hit (or a match).
- Now that we have the comparison data between the user input and user templates, we proceed to calculate the scores. This score is basically a measure of the closeness between the user input and the various user templates.
- This score is then combined with the caps-lock | shift flag to form the final rank list of users.
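The comparison and ranking steps above can be sketched roughly as follows in Python 3. Several things here are my assumptions for the sake of a short example rather than the exact logic in the repository: the template layout (per-position mean and standard deviation plus an affinity flag), the multiplier `k`, scoring by the fraction of hits, and using the affinity flag only as a tie-breaker.

```python
def hit_ratio(sample, template, k=2.0):
    """Fraction of timing values within k standard deviations of the template mean."""
    hits = sum(
        1
        for value, mu, sigma in zip(sample, template["mean"], template["std"])
        if abs(value - mu) <= k * sigma
    )
    return hits / len(sample)


def rank_users(sample, affinity, templates, k=2.0):
    """Order users from most to least probable.

    Assumption: the hit ratio is the main score and a matching
    caps-lock/shift flag only breaks ties between equal scores.
    """
    def sort_key(user):
        template = templates[user]
        return (hit_ratio(sample, template, k), template["affinity"] == affinity)

    return sorted(templates, key=sort_key, reverse=True)


# Made-up templates (milliseconds) for three hypothetical users.
templates = {
    "alice": {"mean": [100, 200, 150], "std": [10, 10, 10], "affinity": "shift"},
    "bob": {"mean": [140, 260, 150], "std": [10, 10, 10], "affinity": "capslock"},
    "carol": {"mean": [105, 195, 300], "std": [10, 10, 10], "affinity": "capslock"},
}
sample = [102, 205, 148]

print(rank_users(sample, "shift", templates))  # ['alice', 'carol', 'bob']
```

Here every value in the sample falls within two standard deviations of alice's template, two of three fall within carol's, and only one falls within bob's, which gives the order shown.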
Results
With the help of friends and colleagues at NetApp, I was able to get training samples of 25 users. Each typed the sample password thirty times under a semi-controlled environment.
When ten users typed the password, the system was able to identify them correctly (i.e. at rank-0) 76% of the time and in the top 5, 100% of the time. When twenty users tried typing the password, the system identified them correctly 62% of the time and in the top 5, 92% of the time.
As expected, the system is not very accurate, and accuracy drops as the number of users grows. The most probable cause is the level of statistics used. As I said before, simple statistics are not sufficient for identification systems.
Readings
Machine learning is well suited to this application, as explained in this paper: https://ieeexplore.ieee.org/document/7833085/
The paper for the system I described was published at http://www.ijemr.net/DOC/AStudyOfPersonIdentificationUsingKeystrokeDynamicsAndStatisticalAnalysis.pdf
The GitHub repository for this system is at: https://github.com/nikhilh-20/keystroke_dynamics
Take away
Keystroke dynamics is a reasonably good behavioral biometric for authentication or identification. However, it does have its inherent drawbacks.
- A person's typing pattern may change when they are given a new laptop, and again as they get used to it.
- Timing samples may vary between keyboards and processors.
- Unpredictable environmental disturbances may affect the typing pattern.
- With billions of people on the planet, there is a very high chance that more than one person has a similar typing pattern.
Demo
Here's a demo on the working of the system:






