1. Introduction

Recurrent neural networks (RNNs) are gaining popularity in applications like next-word prediction, speech recognition, machine translation, and image captioning. In this blog, I look at a simple example, GenPredict: name-based gender prediction. GenPredict can be considered a 'Hello World' example of the sequence classification task. RNNs work because of their ability to recognize patterns within the training data, and this blog tries to understand which patterns the network actually picks up. A demo video for the project can be found here.

2. Dataset

The dataset was obtained from the Social Security Administration website, which lists the first names of all applicants for a social security card from 1880 to 2015. A total of 105,386 unique names (39,728 male and 65,658 female) were obtained. A CSV file of the dataset can be found here.

3. Understanding the data

Before we jump into the recurrent neural network section of the blog, let's gain some insights into the dataset. Note that the analysis below is done on the entire dataset, but the train dataset shows the same properties.

3.1 Length of the name

The average lengths of male and female names in the data are as below:

Average length of male names : 6.23
Average length of female names : 6.60

As the difference is relatively small, it can be inferred that the length of the name doesn't play a major role in the gender prediction task. A quick way to compute these averages is sketched below.
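For reference, here is a minimal sketch of how these averages can be computed with pandas, assuming the CSV has 'name' and 'gender' columns (the actual column names may differ):

    import pandas as pd

    # Assumed layout: one row per unique name, with columns 'name' and 'gender' ('M'/'F').
    df = pd.read_csv("names.csv")
    df["length"] = df["name"].str.len()

    print(df.groupby("gender")["length"].mean().round(2))
    # Expected output given the averages above:
    # gender
    # F    6.60
    # M    6.23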

3.2 First & Last Character of the name

Below is the frequency count for the first and last character of male and female names.

Last Character : It can be seen that about 29% of male names end with 'n', whereas about 29% of female names end with 'a'. Also, the last character is a vowel (a, e, i, o, u) in 67% of female names compared to 33% of male names.

             MALE - Last Character     
CHAR  COUNT  PERCENT
n     9179   28.99%   
e     2857   9.02 %   
s     2492   7.87 %   
y     2190   6.92 %   
l     2176   6.87 %   
r     2013   6.36 %   
o     1667   5.26 %   
i     1401   4.42 %   
d     1269   4.01 %   
h     1117   3.53 %   
a     919    2.9  %   
m     829    2.62 %   
k     807    2.55 %   
t     718    2.27 %   
z     325    1.03 %   
g     279    0.88 %   
u     231    0.73 %   
c     228    0.72 %   
b     226    0.71 %   
f     133    0.42 %   
x     127    0.4  %   
v     117    0.37 %   
w     108    0.34 %   
j     105    0.33 %   
p     95     0.3  %   
q     59     0.19 %   
             FEMALE - Last Character     
CHAR  COUNT  PERCENT
a     10611  28.86%    
e     5498   14.95%    
n     5013   13.63%    
h     3925   10.67%    
y     3546   9.64 %    
i     2767   7.52 %    
l     1379   3.75 %    
s     950    2.58 %    
r     788    2.14 %    
t     494    1.34 %    
o     474    1.29 %    
d     263    0.72 %    
m     202    0.55 %    
z     185    0.5  %    
u     173    0.47 %    
g     132    0.36 %    
k     98     0.27 %    
x     71     0.19 %    
c     46     0.13 %    
b     39     0.11 %    
w     28     0.08 %    
j     28     0.08 %    
p     24     0.07 %    
f     14     0.04 %    
v     12     0.03 %    
q     12     0.03 %    

First Character : The first character shows much less variation; the most frequent first character does not dominate the distribution the way the most frequent last character does.

             MALE - First Character     
CHAR  COUNT  PERCENT
j     3952   9.95 %   
a     3512   8.84 %   
d     3357   8.45 %   
k     3029   7.62 %   
m     2608   6.56 %   
t     2567   6.46 %   
s     2493   6.28 %   
c     2244   5.65 %   
r     2085   5.25 %   
l     1782   4.49 %   
b     1670   4.2  %   
e     1517   3.82 %   
n     1161   2.92 %   
h     1107   2.79 %   
g     1083   2.73 %   
z     804    2.02 %   
o     714    1.8  %   
p     657    1.65 %   
f     584    1.47 %   
w     568    1.43 %   
i     561    1.41 %   
y     555    1.4  %   
v     550    1.38 %   
q     295    0.74 %   
u     149    0.38 %   
x     124    0.31 %   
             FEMALE - First Character     
CHAR  COUNT  PERCENT
a     7381   11.24%    
s     6022   9.17 %    
m     5507   8.39 %    
k     4905   7.47 %    
j     4596   7.0  %    
l     4440   6.76 %    
t     4264   6.49 %    
c     4137   6.3  %    
d     3899   5.94 %    
r     2784   4.24 %    
n     2476   3.77 %    
e     2436   3.71 %    
b     2234   3.4  %    
g     1390   2.12 %    
h     1127   1.72 %    
z     1113   1.7  %    
p     1079   1.64 %    
v     1078   1.64 %    
y     1054   1.61 %    
i     1050   1.6  %    
f     832    1.27 %    
o     708    1.08 %    
w     561    0.85 %    
q     274    0.42 %    
u     166    0.25 %    
x     145    0.22 %    

The above analysis suggests that the last character should play a larger role in gender prediction than the first character. In other words, one can say with higher confidence that a name ending with 'n' is male (28.99% of male names end with 'n') than that a name starting with 'j' is male ('j' is the most frequent first character in male names, at only 9.95%).
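For anyone who wants to reproduce these counts, here is a small sketch using collections.Counter (male_names and female_names are assumed to be plain lists of name strings):

    from collections import Counter

    def char_frequency(names, position):
        """Relative frequency (%) of the character at `position` (0 = first, -1 = last)."""
        counts = Counter(name[position].lower() for name in names if name)
        total = sum(counts.values())
        return {ch: round(100.0 * c / total, 2) for ch, c in counts.most_common()}

    # char_frequency(male_names, -1)   -> {'n': 28.99, 'e': 9.02, ...}
    # char_frequency(female_names, 0)  -> {'a': 11.24, 's': 9.17, ...}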

3.3 N-gram (2, 3, 4) suffixes and prefixes of the name

Here are the top 5 most frequent bi-gram, tri-gram and 4-gram suffixes and prefixes of the names in each class.

             MALE - 2-gram Prefix     
NGRAM  COUNT  PERCENT
ja     1992   5.01 %   
ma     1394   3.51 %   
da     1172   2.95 %   
de     1126   2.83 %   
ka     989    2.49 %   
             FEMALE - 2-gram Prefix     
NGRAM  COUNT  PERCENT
ma     3049   4.64 %    
sh     2808   4.28 %    
ja     2220   3.38 %    
ka     1957   2.98 %    
la     1842   2.81 %    
             MALE - 2-gram Suffix     
NGRAM  COUNT  PERCENT
on     3317   8.44 %   
an     2644   6.73 %   
en     1691   4.3  %   
el     1360   3.46 %   
er     1287   3.28 %   
             FEMALE - 2-gram Suffix     
NGRAM  COUNT  PERCENT
na     5244   8.09 %    
ia     4776   7.37 %    
ah     3516   5.42 %    
ne     2772   4.28 %    
ie     2460   3.79 %    
             MALE - 3-gram Prefix     
NGRAM  COUNT  PERCENT
mar     484    1.22 %   
sha     411    1.04 %   
jer     266    0.67 %   
cha     256    0.65 %   
bra     246    0.62 %   
             FEMALE - 3-gram Prefix     
NGRAM  COUNT  PERCENT
sha     1722   2.63 %    
mar     1262   1.92 %    
cha     728    1.11 %    
she     586    0.89 %    
car     377    0.57 %    
             MALE - 3-gram Suffix     
NGRAM  COUNT  PERCENT
ell     616    1.56 %   
ion     567    1.43 %   
ton     539    1.36 %   
son     434    1.1  %   
ian     429    1.08 %   
             FEMALE - 3-gram Suffix     
NGRAM  COUNT  PERCENT
sha     1657   2.53 %    
nna     1408   2.15 %    
ana     1278   1.95 %    
ina     1001   1.53 %    
yah     976    1.49 %    
             MALE - 4-gram Prefix     
NGRAM  COUNT  PERCENT
john     116    0.3  %   
chri     111    0.29 %   
mark     95     0.24 %   
jama     93     0.24 %   
marc     79     0.2  %   
             FEMALE - 4-gram Prefix     
NGRAM  COUNT  PERCENT
shan     374    0.58 %    
mari     335    0.52 %    
char     290    0.45 %    
shar     289    0.44 %    
sher     227    0.35 %    
             MALE - 4-gram Suffix     
NGRAM  COUNT  PERCENT
rick     231    0.59 %   
rius     227    0.58 %   
rion     225    0.58 %   
ious     191    0.49 %   
vion     151    0.39 %   
             FEMALE - 4-gram Suffix     
NGRAM  COUNT  PERCENT
anna     803    1.24 %    
isha     721    1.11 %    
elle     664    1.02 %    
lynn     647    1.0  %    
iyah     535    0.82 %    

Three observations can be drawn from the above tables.

  1. Suffixes more important than prefixes: The most frequent suffix n-grams cover a larger share of names than the most frequent prefix n-grams in both classes, making suffixes more informative than prefixes for the gender prediction task.
  2. Dominant suffixes differ between male and female names: The most frequent suffixes for the male class and the female class are different, but the most frequent prefixes are largely the same in both classes. For example, "sha", "mar" and "cha" are among the most frequent 3-gram prefixes in both classes.
  3. Patterns prevail in female names: Comparing the top suffixes and prefixes, the most frequent ones cover a higher percentage of female names than of male names. For example, the most frequent female 3-gram suffix "sha" (2.53%) covers a larger share than the most frequent male 3-gram suffix "ell" (1.56%). This suggests that female names tend to follow patterns more strongly than male names. (A small sketch for computing these n-gram counts follows this list.)
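As referenced above, here is a small sketch of how such n-gram counts can be produced (whether the original tables divide by all names or only by names of length >= n is my guess; this version uses the latter):

    from collections import Counter

    def top_ngrams(names, n, kind="suffix", k=5):
        """Top-k n-gram prefixes or suffixes with their relative frequency (%)."""
        grams = [name[-n:] if kind == "suffix" else name[:n]
                 for name in names if len(name) >= n]
        counts = Counter(grams)
        return [(g, round(100.0 * c / len(grams), 2)) for g, c in counts.most_common(k)]

    # top_ngrams(female_names, 3, "suffix")  -> [('sha', 2.53), ('nna', 2.15), ...]
    # top_ngrams(male_names, 2, "prefix")    -> [('ja', 5.01), ('ma', 3.51), ...]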

3.4 Phoneme representation of the name

Generally, male names sound more 'voiced' than female names; here is an interesting paper discussing this, indicating that phonemes do play some role in the name-gender relation.

If the data were modified to incorporate the phoneme representation of the names, I suspect the accuracy of the model would increase. This was not performed as part of the experiments here, but it would be an interesting study to perform. A phoneme representation of a name can be generated using the following python library (link).
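For illustration, here is a minimal sketch using the `pronouncing` package, which wraps the CMU Pronouncing Dictionary. This is not necessarily the library linked above, just one way to get phonemes; note that many uncommon names are simply missing from CMUdict.

    import pronouncing

    def phonemes(name):
        candidates = pronouncing.phones_for_word(name.lower())
        return candidates[0].split() if candidates else None   # None if the name is not in CMUdict

    print(phonemes("John"))   # e.g. ['JH', 'AA1', 'N']
    print(phonemes("Maria"))  # e.g. ['M', 'ER0', 'IY1', 'AH0']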

4. Deep learning model based on Recurrent Neural Network

4.1 Model Architecture

A character-level recurrent neural network architecture was used to accomplish the gender prediction task. The figure below gives an overview of the network. It has 4 major components:

  1. Embedding Lookup Table
  2. GRU (or LSTM) Cell
  3. Logits Layer
  4. Softmax Layer

Each name is split into individual characters and padded with a 'pad' symbol to obtain a constant length across all names. The individual characters are then fed into the network one after another. Each character is embedded into a vector using the embedding lookup table and fed into the GRU cell. The state of the GRU cell gets updated after consuming each character embedding; this state represents the knowledge of all the characters observed so far, with higher emphasis on the recently observed ones. The final state of the GRU cell (called the 'recurrent embedding', r_emb, in the rest of the blog) is fed into the logits layer, which calculates the score for the 2 classes, male and female. The logits layer is a fully connected layer with a weight matrix (calling it W_logits) and a bias (B). Scores for the classes are calculated by multiplying W_logits with r_emb and adding B:

scores = matrix_multiplication(W_logits, r_emb) + B

The softmax layer calculates the actual probabilities for the 2 classes from the scores. The cross-entropy function is used to calculate the cost (or loss) of the model, and the stochastic gradient descent algorithm reduces this cost to train the network. To read more about the cross-entropy cost function, I would highly recommend reading this blog. Note that an LSTM cell can also be used instead of the GRU cell. Dropout and learning rate decay were used as regularization techniques.
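To make the pipeline concrete, here is a minimal Keras sketch of the same architecture. This is not the original implementation; choices such as mask_zero and plain SGD are my assumptions based on the description and the hyper-parameters in section 4.3.

    import tensorflow as tf

    # Hyper-parameters from section 4.3; the vocabulary is the 26 letters plus the 'pad' symbol.
    vocab_size, input_size, cell_size, num_classes = 26 + 1, 128, 32, 2

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, input_size, mask_zero=True),  # embedding lookup table
        tf.keras.layers.GRU(cell_size, dropout=0.2),   # final state = recurrent embedding (keep_prob 0.8)
        tf.keras.layers.Dense(num_classes),            # logits layer: W_logits and bias B
        tf.keras.layers.Softmax(),                     # class probabilities
    ])

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy())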

4.2 Training and Accuracy

To train the deep neural network model, the data was randomly divided into two sets: train and test.

Train dataset:
  Male name count: 27,809 (70% of the total male names)
  Female name count: 27,849
Test dataset:
  Male name count: 11,919
  Female name count: 37,809

The count of training samples in each class was kept approximately equal to avoid biasing the model toward a particular class. The train and test data for each class can be found here.

Accuracy was calculated based on the correct classification of the input examples; in mathematical terms, it is the ratio of the number of correctly classified examples to the total number of examples (a small sketch is given below). Here are the accuracies for the best model on the train dataset, the test dataset, and disaggregated by class on the test examples.
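In code, the metric is simply the following (y_test and y_pred are hypothetical label/prediction arrays, with 0 = female and 1 = male as in section 5.2):

    import numpy as np

    def accuracy(y_true, y_pred):
        # Fraction of examples whose predicted class matches the label.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return float((y_true == y_pred).mean())

    # accuracy(y_test, y_pred)                              -> overall test accuracy
    # accuracy(y_test[y_test == 1], y_pred[y_test == 1])    -> accuracy on male examples
    # accuracy(y_test[y_test == 0], y_pred[y_test == 0])    -> accuracy on female examples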

  Accuracy on train dataset: 80.78%
  Accuracy on test dataset: 80.44%
  Accuracy on male examples in test dataset: 79.15%
  Accuracy on female examples in test dataset: 81.02%

Observe that the female examples have higher accuracy even though the number of female test examples (37,809) is much larger than the number of male examples (11,919), strengthening our observation from the data analysis about patterns prevailing in female names.

Note: I also trained a model where the dataset was split in a 70:30 ratio of training to test examples. This naturally resulted in a higher number of total training examples, with female names dominating the training dataset. The overall test accuracy for this model was ~82%. The patterns found in both cases, 1) equal count of examples in each class and 2) the 70:30 split, were the same. The model analysis below is performed on the model whose training dataset had an equal count of examples in each class.

4.3 Model Parameters

Below are the hyper-parameters that worked best.

    learning_rate = 0.1
    layers        = 1
    cell_size     = 32
    max_epoch     = 20
    batch_size    = 20
    vocab_size    = 26
    num_classes   = 2
    cell_type     = "gru"
    input_size    = 128
    keep_prob     = 0.8
    decay_epoch   = 5
    lr_decay      = 0.5
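For illustration, this is how I read the decay settings above as a schedule (the exact schedule used in training is my assumption): the learning rate starts at 0.1 and is halved every 5 epochs.

    learning_rate, lr_decay, decay_epoch, max_epoch = 0.1, 0.5, 5, 20

    for epoch in range(max_epoch):
        lr = learning_rate * lr_decay ** (epoch // decay_epoch)
        if epoch % decay_epoch == 0:
            print(f"epoch {epoch}: lr = {lr}")   # 0.1, 0.05, 0.025, 0.0125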

5. Model Analysis

Let's try to analyse the network now that it has learned to predict the gender.

5.1 Logits layer weight matrix (W_logits)

The table below shows the logits weight matrix W_logits for the trained model.

  NEURON_NUM      W_logits_col1     W_logits_col2     DIFFERENCE        OPPOSITE POLARITY
       1          0.4388            -0.4485           0.0097            TRUE      
       2          -0.1946           0.197             0.0024            TRUE      
       3          0.1188            -0.132            0.0132            TRUE      
       4          0.1818            -0.182            0.0002            TRUE      
       5          -0.3941           0.3957            0.0016            TRUE      
       6          0.3168            -0.3286           0.0118            TRUE     
       7          0.0564            -0.062            0.0056            TRUE      
       8          0.0759            -0.0744           0.0015            TRUE      
       9          0.1469            -0.1517           0.0048            TRUE      
      10          -0.0011           0.0009            0.0002            TRUE      
      11          0.0585            -0.0475           0.011             TRUE      
      12          0.1869            -0.1757           0.0112            TRUE      
      13          -0.3448           0.3475            0.0027            TRUE      
      14          -0.1291           0.1311            0.002             TRUE      
      15          -0.0417           0.0552            0.0135            TRUE      
      16          -0.4021           0.412             0.0099            TRUE      
      17          -0.2899           0.2843            0.0056            TRUE      
      18          -0.1002           0.0902            0.01              TRUE      
      19          0.3009            -0.2924           0.0085            TRUE      
      20          -0.2298           0.2205            0.0093            TRUE      
      21          -0.299            0.2821            0.0169            TRUE      
      22          0.3892            -0.3973           0.0081            TRUE      
      23          -0.0403           0.052             0.0117            TRUE      
      24          0.1606            -0.1574           0.0032            TRUE      
      25          -0.1912           0.2003            0.0091            TRUE      
      26          -0.4405           0.4259            0.0146            TRUE      
      27          -0.2547           0.2566            0.0019            TRUE      
      28          -0.6039           0.5969            0.007             TRUE      
      29          -0.1535           0.1589            0.0054            TRUE     
      30          0.2634            -0.2724           0.009             TRUE      
      31          0.2251            -0.2266           0.0015            TRUE      
      32          -0.5256           0.5211            0.0045            TRUE      

Column 1 of the logits weight matrix is multiplied with the recurrent embedding from the GRU cell to produce the female score, whereas column 2 produces the male score. Each dimension of the recurrent embedding is considered a neuron, and a neuron's activation is the value of that particular dimension.

female_score = dot_product(r_emb, W_logits_col1) = Σ (neuron_activation[i] * W_logits_col1[i])
male_score   = dot_product(r_emb, W_logits_col2) = Σ (neuron_activation[i] * W_logits_col2[i])

r_emb = recurrent embedding vector of dimension 32 = [neuron_activation[1], neuron_activation[2], ..., neuron_activation[32]]

An easier way to think about the above equations is that each neuron in the recurrent embedding is multiplied by the corresponding weight from column 1 of W_logits to produce a female subscore, and by the weight from column 2 to produce a male subscore. The sum of these subscores over all 32 neurons, plus the bias term for each class, produces the final score (see the sketch below).
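Here is a numpy sketch of the same computation, using random stand-in values for r_emb, W_logits and the bias:

    import numpy as np

    rng = np.random.default_rng(0)
    r_emb = rng.normal(size=32)           # stand-in for the final GRU state of one name
    W_logits = rng.normal(size=(32, 2))   # stand-in for the trained 32x2 matrix above
    b = rng.normal(size=2)                # bias for the two classes

    scores = r_emb @ W_logits + b         # [female_score, male_score]

    # Per-neuron view: each neuron contributes one subscore per class.
    subscores = r_emb[:, None] * W_logits             # shape (32, 2)
    assert np.allclose(scores, subscores.sum(axis=0) + b)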

Observations
  1. Polarity of values in a row: Logits values in a row have opposite polarity for column 1 (female logits weights) and column 2 (male logits weights). Depending on the polarity of the neuron activation, a row of W_logits increases the score for one class and, at the same time, decreases the score for the other. For example, for neuron 1, a positive activation increases the female score while decreasing the male score.
  2. Difference between the values in a row: The absolute difference between the magnitudes of the two entries in a row is quite low. This means that roughly the same score gets added to one class and subtracted from the other.
  3. Neuron importance based on the absolute value in W_logits: The logits weights differ across neurons, so each neuron carries a different importance. A higher weight in W_logits adds a higher score, making the corresponding neuron more important for the prediction. For example, neuron 1 is more important than neuron 2, since for the same activation value it contributes a larger score to be added or subtracted. This is also a useful signal for hyper-parameter tuning: if the number of important neurons is small, the recurrent embedding size can be reduced. The recurrent embedding size of 32 used here was obtained in this way; I started with 128 dimensions and saw that more than half the rows in W_logits had negligible values, motivating me to reduce the size to 32. (A small sketch of this importance measure follows the list.)
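As mentioned in the last observation, one simple way to measure this neuron importance is the magnitude of each row of W_logits (a sketch with stand-in weights, not necessarily the exact criterion used here):

    import numpy as np

    rng = np.random.default_rng(0)
    W_logits = rng.normal(size=(32, 2))   # stand-in for the trained 32x2 logits matrix above

    # Rough per-neuron importance: average magnitude of a neuron's two logits weights.
    # Neurons whose weights are near zero barely move either class score.
    importance = np.abs(W_logits).mean(axis=1)
    print(np.argsort(importance)[::-1][:5])   # indices of the five most influential neurons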

5.2 Character Embedding - Embedding Lookup Table v/s Recurrent Embedding Representation

There are two ways to obtain character embedding.
  1. Embedding Lookup Table: The embedding lookup table converts an input character into a vector. A T-SNE plot of these character embeddings is shown below. It can be observed that the characters which occur mostly in female name suffixes, like 'a', 'e', 'i' and 'y', cluster together.
  2. Recurrent Embedding: When a single character is fed as the entire input, the recurrent embedding contains information about only that character. Hence, the character representations obtained from the recurrent embedding show a similar alignment to those from the embedding lookup table. This can be observed in the T-SNE plot of the 32-dimensional recurrent embeddings for single-character inputs: the characters (a, e, i, y) lie close to each other in the second plot too. The table gives the prediction, along with its probability, for each single-character input; 'a', 'e', 'i' and 'y' are predicted as the female class, whereas all other characters are predicted as the male class. (A sketch for producing such T-SNE plots follows the list.)
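Here is a sketch of how such T-SNE plots can be produced with scikit-learn and matplotlib; extracting the actual embeddings is model-specific, so a random stand-in matrix is used below.

    import string
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # char_vectors: one vector per letter, taken either from the embedding lookup table
    # or from the single-character recurrent embeddings. Random stand-in here.
    char_vectors = np.random.default_rng(0).normal(size=(26, 32))

    coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(char_vectors)

    plt.scatter(coords[:, 0], coords[:, 1])
    for ch, (x, y) in zip(string.ascii_lowercase, coords):
        plt.annotate(ch, (x, y))
    plt.show()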

T-SNE : embedding lookup table.

T-SNE : recurrent embedding of char-input.
  input-character               predicted-gender (0 = female, 1 = male)    probability
       a                             0                                  0.932037
       b                             1                                  0.89787
       c                             1                                  0.773074
       d                             1                                  0.847266
       e                             0                                  0.716494
       f                             1                                  0.925287
       g                             1                                  0.761032
       h                             1                                  0.529182
       i                             0                                  0.715751
       j                             1                                  0.869343
       k                             1                                  0.923236
       l                             1                                  0.694704
       m                             1                                  0.729725
       n                             1                                  0.647062
       o                             1                                  0.848708
       p                             1                                  0.873483
       q                             1                                  0.834077
       r                             1                                  0.727535
       s                             1                                  0.763149
       t                             1                                  0.612793
       u                             1                                  0.695273
       v                             1                                  0.781005
       w                             1                                  0.7693
       x                             1                                  0.673478
       y                             0                                  0.621023
       z                             1                                  0.703863

5.3 Recurrent embedding of N-gram prefixes & suffixes

Below are the T-SNE plots of the recurrent embedding (the final state of the GRU cell) when the inputs are the top 30 most frequent prefixes and suffixes in male and female names. Red points correspond to male names and blue points to female names.


    top 30 most frequent 3-gram prefixes

    top 30 most frequent 3-gram suffixes

    top 30 most frequent 4-gram prefixes

    top 30 most frequent 4-gram suffixes
Observations
  1. Suffix clusters : The top 30 most frequent 3-gram and 4-gram suffixes for male names (red points) and female names (blue points) form quite distinct clusters, whereas the prefixes show some clustering effect but nothing as distinct. It can therefore be inferred that the network learns to classify the input name based on the suffix rather than the prefix. A probable reason the network favors suffixes over prefixes is that the most frequent suffixes cover a higher percentage of names and are more distinct between the classes than the prefixes.

6. Conclusion and Future Work

In conclusion, I think the patterns that the neural network learns are based on 2 factors: frequency, and how distinct the patterns are between the classes. For example, in GenPredict, suffixes were learnt by the network because the most frequent suffixes cover a higher percentage of names than the most frequent prefixes, and because suffixes are more distinct between the classes than prefixes. For example, "leen" is mostly a female suffix, whereas the prefix "chris" occurs frequently in both classes. Note that the network recognizes this information just by looking at the examples, and that there might be other patterns the network uses for classification that my data and model analysis did not surface. Here's a list of future work that could easily be performed:
  1. Predicting ethnicity, age, etc. : Names can also provide other information like the ethnicity of the person or the age of the person based on the popularity of the name in a particular decade.
  2. Using the phoneme representation of the name : The phoneme representation can surface hidden patterns that are not evident from the spelling, potentially giving a slight increase in accuracy.
  3. Predicting demographic features based on installed apps : It would be interesting to predict the demographics of a person like age, gender, etc. based on the apps they have installed on their phones.