Recurrent neural networks (RNNs) are gaining popularity in various interesting applications like next-word prediction, speech recognition, machine translation, and image captioning. In this blog, I look at a simple example, GenPredict: name-based gender prediction. GenPredict could be considered a 'Hello World' example of the sequence classification task. We know that RNNs work because of their ability to recognize patterns in the training data. This blog tries to understand which patterns are recognized by the network. A demo video for the project can be found here.
2. Dataset
The dataset was obtained from the Social Security Administration website, which contains the first name of every applicant for a Social Security card from 1880 to 2015. A total of 105,386 unique names (39,728 male and 65,658 female) were obtained. A CSV file of the dataset can be found here.
3. Understanding the data
Before we jump into the recurrent neural network section of the blog, let's gain some insight into the dataset. Note that the analysis below is done on the entire dataset, but the train dataset shows the same properties.
3.1 Length of the name
The average lengths of male and female names in the data are as below:

Average length of male names: 6.23
Average length of female names: 6.60

As the difference is relatively small, it can be inferred that the length of the name doesn't play a major role in the gender prediction task.
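The averages above can be computed in a few lines of Python; the name lists here are toy placeholders standing in for the actual SSA dataset:

```python
from statistics import mean

# Toy name lists standing in for the SSA dataset (hypothetical examples).
male_names = ["john", "michael", "david"]
female_names = ["maria", "jennifer", "elizabeth"]

def avg_length(names):
    """Mean character length of the names in a list."""
    return mean(len(name) for name in names)

print(f"Average length of male names  : {avg_length(male_names):.2f}")
print(f"Average length of female names: {avg_length(female_names):.2f}")
```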
3.2 First & Last Character of the name
Below is the frequency count for the first and last character of male and female names.
Last character: About 28% of male names end with 'n', whereas 28% of female names end with 'a'. Also, the last character is a vowel (a, e, i, o, u) in 67% of female names, compared to 33% of male names.
MALE - Last Character

| Character | Count | Percent |
|-----------|-------|---------|
| n | 9179 | 28.99% |
| e | 2857 | 9.02% |
| s | 2492 | 7.87% |
| y | 2190 | 6.92% |
| l | 2176 | 6.87% |
| r | 2013 | 6.36% |
| o | 1667 | 5.26% |
| i | 1401 | 4.42% |
| d | 1269 | 4.01% |
| h | 1117 | 3.53% |
| a | 919 | 2.90% |
| m | 829 | 2.62% |
| k | 807 | 2.55% |
| t | 718 | 2.27% |
| z | 325 | 1.03% |
| g | 279 | 0.88% |
| u | 231 | 0.73% |
| c | 228 | 0.72% |
| b | 226 | 0.71% |
| f | 133 | 0.42% |
| x | 127 | 0.40% |
| v | 117 | 0.37% |
| w | 108 | 0.34% |
| j | 105 | 0.33% |
| p | 95 | 0.30% |
| q | 59 | 0.19% |
FEMALE - Last Character

| Character | Count | Percent |
|-----------|-------|---------|
| a | 10611 | 28.86% |
| e | 5498 | 14.95% |
| n | 5013 | 13.63% |
| h | 3925 | 10.67% |
| y | 3546 | 9.64% |
| i | 2767 | 7.52% |
| l | 1379 | 3.75% |
| s | 950 | 2.58% |
| r | 788 | 2.14% |
| t | 494 | 1.34% |
| o | 474 | 1.29% |
| d | 263 | 0.72% |
| m | 202 | 0.55% |
| z | 185 | 0.50% |
| u | 173 | 0.47% |
| g | 132 | 0.36% |
| k | 98 | 0.27% |
| x | 71 | 0.19% |
| c | 46 | 0.13% |
| b | 39 | 0.11% |
| w | 28 | 0.08% |
| j | 28 | 0.08% |
| p | 24 | 0.07% |
| f | 14 | 0.04% |
| v | 12 | 0.03% |
| q | 12 | 0.03% |
First character: The most frequent first character dominates the other characters far less than the most frequent last character does.
MALE - First Character

| Character | Count | Percent |
|-----------|-------|---------|
| j | 3952 | 9.95% |
| a | 3512 | 8.84% |
| d | 3357 | 8.45% |
| k | 3029 | 7.62% |
| m | 2608 | 6.56% |
| t | 2567 | 6.46% |
| s | 2493 | 6.28% |
| c | 2244 | 5.65% |
| r | 2085 | 5.25% |
| l | 1782 | 4.49% |
| b | 1670 | 4.20% |
| e | 1517 | 3.82% |
| n | 1161 | 2.92% |
| h | 1107 | 2.79% |
| g | 1083 | 2.73% |
| z | 804 | 2.02% |
| o | 714 | 1.80% |
| p | 657 | 1.65% |
| f | 584 | 1.47% |
| w | 568 | 1.43% |
| i | 561 | 1.41% |
| y | 555 | 1.40% |
| v | 550 | 1.38% |
| q | 295 | 0.74% |
| u | 149 | 0.38% |
| x | 124 | 0.31% |
FEMALE - First Character

| Character | Count | Percent |
|-----------|-------|---------|
| a | 7381 | 11.24% |
| s | 6022 | 9.17% |
| m | 5507 | 8.39% |
| k | 4905 | 7.47% |
| j | 4596 | 7.00% |
| l | 4440 | 6.76% |
| t | 4264 | 6.49% |
| c | 4137 | 6.30% |
| d | 3899 | 5.94% |
| r | 2784 | 4.24% |
| n | 2476 | 3.77% |
| e | 2436 | 3.71% |
| b | 2234 | 3.40% |
| g | 1390 | 2.12% |
| h | 1127 | 1.72% |
| z | 1113 | 1.70% |
| p | 1079 | 1.64% |
| v | 1078 | 1.64% |
| y | 1054 | 1.61% |
| i | 1050 | 1.60% |
| f | 832 | 1.27% |
| o | 708 | 1.08% |
| w | 561 | 0.85% |
| q | 274 | 0.42% |
| u | 166 | 0.25% |
| x | 145 | 0.22% |
The above analysis suggests that the last character plays a bigger role in gender prediction than the first character. In other words, one can say with more confidence that a name ending with 'n' is male than that a name starting with 'j' is male: 'n' is the most frequent last character (28.99%) and 'j' only the most frequent first character (9.95%) in male names.
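The frequency counts above reduce to a simple character tally; a minimal sketch, with a toy name list standing in for the full dataset:

```python
from collections import Counter

# Toy sample standing in for the full list of male names (hypothetical).
male_names = ["john", "brian", "steven", "aaron", "james"]

def char_frequency(names, position):
    """Percentage frequency of the character at `position` (0 = first, -1 = last)."""
    counts = Counter(name[position] for name in names)
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.most_common()}

print(char_frequency(male_names, -1))  # last-character distribution
print(char_frequency(male_names, 0))   # first-character distribution
```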
3.3 N-gram (2, 3, 4) suffixes and prefixes of the name
Here are the top 5 most frequent bi-gram, tri-gram and 4-gram suffixes and prefixes of the names in each class.
MALE - 2-gram Prefix

| Prefix | Count | Percent |
|--------|-------|---------|
| ja | 1992 | 5.01% |
| ma | 1394 | 3.51% |
| da | 1172 | 2.95% |
| de | 1126 | 2.83% |
| ka | 989 | 2.49% |

FEMALE - 2-gram Prefix

| Prefix | Count | Percent |
|--------|-------|---------|
| ma | 3049 | 4.64% |
| sh | 2808 | 4.28% |
| ja | 2220 | 3.38% |
| ka | 1957 | 2.98% |
| la | 1842 | 2.81% |

MALE - 2-gram Suffix

| Suffix | Count | Percent |
|--------|-------|---------|
| on | 3317 | 8.44% |
| an | 2644 | 6.73% |
| en | 1691 | 4.30% |
| el | 1360 | 3.46% |
| er | 1287 | 3.28% |

FEMALE - 2-gram Suffix

| Suffix | Count | Percent |
|--------|-------|---------|
| na | 5244 | 8.09% |
| ia | 4776 | 7.37% |
| ah | 3516 | 5.42% |
| ne | 2772 | 4.28% |
| ie | 2460 | 3.79% |
MALE - 3-gram Prefix

| Prefix | Count | Percent |
|--------|-------|---------|
| mar | 484 | 1.22% |
| sha | 411 | 1.04% |
| jer | 266 | 0.67% |
| cha | 256 | 0.65% |
| bra | 246 | 0.62% |

FEMALE - 3-gram Prefix

| Prefix | Count | Percent |
|--------|-------|---------|
| sha | 1722 | 2.63% |
| mar | 1262 | 1.92% |
| cha | 728 | 1.11% |
| she | 586 | 0.89% |
| car | 377 | 0.57% |

MALE - 3-gram Suffix

| Suffix | Count | Percent |
|--------|-------|---------|
| ell | 616 | 1.56% |
| ion | 567 | 1.43% |
| ton | 539 | 1.36% |
| son | 434 | 1.10% |
| ian | 429 | 1.08% |

FEMALE - 3-gram Suffix

| Suffix | Count | Percent |
|--------|-------|---------|
| sha | 1657 | 2.53% |
| nna | 1408 | 2.15% |
| ana | 1278 | 1.95% |
| ina | 1001 | 1.53% |
| yah | 976 | 1.49% |
MALE - 4-gram Prefix

| Prefix | Count | Percent |
|--------|-------|---------|
| john | 116 | 0.30% |
| chri | 111 | 0.29% |
| mark | 95 | 0.24% |
| jama | 93 | 0.24% |
| marc | 79 | 0.20% |

FEMALE - 4-gram Prefix

| Prefix | Count | Percent |
|--------|-------|---------|
| shan | 374 | 0.58% |
| mari | 335 | 0.52% |
| char | 290 | 0.45% |
| shar | 289 | 0.44% |
| sher | 227 | 0.35% |

MALE - 4-gram Suffix

| Suffix | Count | Percent |
|--------|-------|---------|
| rick | 231 | 0.59% |
| rius | 227 | 0.58% |
| rion | 225 | 0.58% |
| ious | 191 | 0.49% |
| vion | 151 | 0.39% |

FEMALE - 4-gram Suffix

| Suffix | Count | Percent |
|--------|-------|---------|
| anna | 803 | 1.24% |
| isha | 721 | 1.11% |
| elle | 664 | 1.02% |
| lynn | 647 | 1.00% |
| iyah | 535 | 0.82% |
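Tables like the ones above come from a straightforward n-gram tally; a minimal sketch, with a toy female-name sample standing in for the full list:

```python
from collections import Counter

# Toy female-name sample (hypothetical); the blog uses the full SSA list.
female_names = ["shana", "shanna", "mariana", "karina", "tina"]

def top_ngrams(names, n, kind="suffix", k=5):
    """Top-k n-gram prefixes or suffixes of the names, with counts."""
    grams = [name[-n:] if kind == "suffix" else name[:n]
             for name in names if len(name) >= n]
    return Counter(grams).most_common(k)

print(top_ngrams(female_names, 3, "suffix"))
print(top_ngrams(female_names, 3, "prefix"))
```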
Three observations can be drawn from the above tables.
- Suffixes more important than prefixes: Suffix n-grams are more frequent in both male and female names than prefix n-grams, making suffixes more important than prefixes for the gender prediction task.
- Dominant suffixes differ between male and female names: The most frequent suffixes for the male class and the female class are different, but the most frequent prefixes are the same in both classes. For example, "sha", "mar" and "cha" are among the most frequent prefixes in both classes.
- Patterns prevail in female names: Comparing the top suffixes and prefixes, the most frequent ones account for a higher percentage in female names than in male names. For example, the most frequent female 3-gram suffix "sha" has a higher percentage (2.53%) than the most frequent male 3-gram suffix "ell" (1.56%). This implies that female names tend to follow patterns more than male names.
3.4 Phoneme representation of the name
Generally, male names are more voiced in sound than female names; here is an interesting paper discussing this, indicating that phonemes do play some role in the name-gender relation.
If the data were modified to incorporate the phoneme representation of the names, I feel the accuracy of the model would increase. This was not performed as part of the experiments here, but it would be an interesting study. Phoneme representations of names can be generated using the following Python library (link).
4. Deep learning model based on Recurrent Neural Network
4.1 Model Architecture
A character-level recurrent neural network architecture was used to accomplish the gender prediction task. The figure below gives an overview of the network. It has 4 major components:
- Embedding Lookup Table
- GRU (or LSTM) Cell
- Logits Layer
- Softmax Layer
Each name is split into individual characters and padded with a 'pad' symbol to obtain a constant length across all names. The characters are then fed into the network one after another. Each character is embedded into a vector using the embedding lookup table and fed into the GRU cell. The state of the GRU cell is updated after consuming each character embedding; it represents the knowledge of all the characters observed so far, with a high emphasis on the most recently observed ones. The final state of the GRU cell (called the 'recurrent embedding', r_emb, in the rest of the blog) is fed into the logits layer, which calculates the scores for the 2 classes, male and female. The logits layer is basically a single-layer perceptron with a weight matrix (W_logits) and a bias (B). Scores for the classes are calculated by multiplying W_logits with r_emb and adding B.
scores = matrix_multiplication(W_logits, r_emb) + B

The softmax layer calculates the actual probabilities for the 2 classes from the scores. The cross-entropy function is used to calculate the cost (or loss) of the model, and the stochastic gradient descent algorithm is used to reduce this cost and train the network. To read more about the cross-entropy cost function, I would highly recommend reading this blog. Note that an LSTM cell can also be used instead of a GRU cell. Dropout and learning rate decay were used as regularization techniques.
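The logits-plus-softmax step can be sketched in a few lines of NumPy; random values stand in for a trained recurrent embedding and trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

cell_size, num_classes = 32, 2
r_emb = rng.standard_normal(cell_size)               # final GRU state ("recurrent embedding")
W_logits = rng.standard_normal((num_classes, cell_size))
B = np.zeros(num_classes)                            # bias term

# Logits layer: scores = W_logits . r_emb + B
scores = W_logits @ r_emb + B

# Softmax layer: turn the two scores into class probabilities
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(scores, probs)  # probs sums to 1; argmax gives the predicted class
```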
4.2 Training and Accuracy
To train the deep neural network model, the data was randomly divided into two sets: train and test.
Train dataset: Male name count: 27,809 (70% of the total male names); Female name count: 27,849
Test dataset: Male name count: 11,919; Female name count: 37,809

The count of training samples in each class was kept equal to avoid creating a bias towards a particular class. The train and test data for each class can be found here.
Accuracy was calculated based on the correct classification of the input examples; in mathematical terms, it is the ratio of the number of correctly classified examples to the total number of examples. Here is the accuracy of the best model on the train dataset, on the test dataset, and disaggregated by class on the test dataset.
Accuracy on train dataset: 80.78%
Accuracy on test dataset: 80.44%
Accuracy on male examples in test dataset: 79.15%
Accuracy on female examples in test dataset: 81.02%
Observe that the female examples have higher accuracy even though the number of female test examples (37,809) is large compared to male examples (11,919), strengthening our observation from the data analysis that patterns prevail in female names.
Note: I also trained a model where the dataset was divided in a 70:30 ratio of training to test examples. This naturally resulted in a higher number of total training examples, with female names dominating the training dataset. The overall test accuracy for the model with this train dataset was ~82%. The patterns in both cases, 1) equal counts of examples in each class and 2) the 70:30 split, were the same. The model analysis below is performed on the model whose training dataset had equal counts of examples in each class.
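The accuracy metric used throughout this section is just the fraction of correct classifications; a minimal sketch with hypothetical labels:

```python
def accuracy(predicted, actual):
    """Fraction of examples whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels: 0 = female, 1 = male
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```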
4.3 Model Parameter
Below are the hyper-parameters that worked best.
```
learning_rate = 0.1
layers        = 1
cell_size     = 32
max_epoch     = 20
batch_size    = 20
vocab_size    = 26
num_classes   = 2
cell_type     = "gru"
input_size    = 128
keep_prob     = 0.8
decay_epoch   = 5
lr_decay      = 0.5
```
5. Model Analysis
Let's try to analyse the network now that it has learned to predict the gender.
5.1 Logits layer weight matrix (W_logits)
The table below shows the logits weight matrix W_logits for the trained model.
| Neuron | W_logits_col1 | W_logits_col2 | Difference | Opposite polarity |
|--------|---------------|---------------|------------|-------------------|
| 1 | 0.4388 | -0.4485 | 0.0097 | TRUE |
| 2 | -0.1946 | 0.1970 | 0.0024 | TRUE |
| 3 | 0.1188 | -0.1320 | 0.0132 | TRUE |
| 4 | 0.1818 | -0.1820 | 0.0002 | TRUE |
| 5 | -0.3941 | 0.3957 | 0.0016 | TRUE |
| 6 | 0.3168 | -0.3286 | 0.0118 | TRUE |
| 7 | 0.0564 | -0.0620 | 0.0056 | TRUE |
| 8 | 0.0759 | -0.0744 | 0.0015 | TRUE |
| 9 | 0.1469 | -0.1517 | 0.0048 | TRUE |
| 10 | -0.0011 | 0.0009 | 0.0002 | TRUE |
| 11 | 0.0585 | -0.0475 | 0.0110 | TRUE |
| 12 | 0.1869 | -0.1757 | 0.0112 | TRUE |
| 13 | -0.3448 | 0.3475 | 0.0027 | TRUE |
| 14 | -0.1291 | 0.1311 | 0.0020 | TRUE |
| 15 | -0.0417 | 0.0552 | 0.0135 | TRUE |
| 16 | -0.4021 | 0.4120 | 0.0099 | TRUE |
| 17 | -0.2899 | 0.2843 | 0.0056 | TRUE |
| 18 | -0.1002 | 0.0902 | 0.0100 | TRUE |
| 19 | 0.3009 | -0.2924 | 0.0085 | TRUE |
| 20 | -0.2298 | 0.2205 | 0.0093 | TRUE |
| 21 | -0.2990 | 0.2821 | 0.0169 | TRUE |
| 22 | 0.3892 | -0.3973 | 0.0081 | TRUE |
| 23 | -0.0403 | 0.0520 | 0.0117 | TRUE |
| 24 | 0.1606 | -0.1574 | 0.0032 | TRUE |
| 25 | -0.1912 | 0.2003 | 0.0091 | TRUE |
| 26 | -0.4405 | 0.4259 | 0.0146 | TRUE |
| 27 | -0.2547 | 0.2566 | 0.0019 | TRUE |
| 28 | -0.6039 | 0.5969 | 0.0070 | TRUE |
| 29 | -0.1535 | 0.1589 | 0.0054 | TRUE |
| 30 | 0.2634 | -0.2724 | 0.0090 | TRUE |
| 31 | 0.2251 | -0.2266 | 0.0015 | TRUE |
| 32 | -0.5256 | 0.5211 | 0.0045 | TRUE |
Column 1 of the logits weight matrix is multiplied with the recurrent embedding from the GRU cell to produce the female score, and column 2 produces the male score. Each dimension of the recurrent embedding is considered a neuron; the neuron activation is the value of that particular dimension.
```
female_score = dot_product(r_emb, W_logits_col1) = Σ(neuron_activation[i] * W_logits_col1[i])
male_score   = dot_product(r_emb, W_logits_col2) = Σ(neuron_activation[i] * W_logits_col2[i])

r_emb = recurrent embedding vector of dimension 32
      = [neuron_activation[1], neuron_activation[2], ..., neuron_activation[32]]
```
An easier way to think about the above equations is that each neuron in the recurrent embedding is multiplied by the corresponding weight from column 1 of W_logits to produce a female subscore, and by the weight from column 2 to produce a male subscore. The sum of these subscores over all 32 neurons of the recurrent embedding, along with the bias term for each class, produces the final score.

Observations
- Polarity of values in a row: Logits values in a row have opposite polarity in column 1 (female logits weights) and column 2 (male logits weights). Depending on the polarity of the neuron activation, a W_logits row increases the score of one class while decreasing the score of the other. For example, for neuron 1, when the neuron activation is positive, the female score is increased while the male score is simultaneously decreased.
- Difference between the values in a row: The absolute difference between the female column entry and the corresponding male column entry is quite low. This means roughly the same amount of score is added to one class and subtracted from the other.
- Neuron importance based on the absolute value in W_logits: The logits values differ across neurons, so each neuron carries a different importance. A higher weight in W_logits contributes a higher score, making the corresponding neuron more important for the prediction. For example, neuron 1 is more important than neuron 2, as it generates a larger score (added or subtracted) for the same activation value. This is a useful heuristic for parameter tuning: if the number of important neurons is small, the recurrent embedding size can be reduced. In our case, the recurrent embedding size of 32 was obtained this way. I started with 128 dimensions and saw that more than half the rows in W_logits had negligible values, motivating me to reduce the recurrent embedding size to 32.
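The subscore decomposition described in the observations above can be sketched in NumPy; random values stand in for the trained activations and weights, and "column 0 / column 1" correspond to the two class columns of W_logits:

```python
import numpy as np

rng = np.random.default_rng(1)
r_emb = rng.standard_normal(32)           # neuron activations (hypothetical values)
W_logits = rng.standard_normal((32, 2))   # the two class columns of the logits matrix

# Per-neuron subscores: activation[i] * weight[i] for each class column
subscores = r_emb[:, None] * W_logits     # shape (32, 2)
scores = subscores.sum(axis=0)            # summing over neurons recovers the dot products
assert np.allclose(scores, r_emb @ W_logits)

# Neurons with the largest |subscore| contribute most to the decision
top = np.argsort(-np.abs(subscores[:, 0]))[:5]
print("most influential neurons for column 0:", top + 1)
```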
5.2 Character Embedding: Embedding Lookup Table v/s Recurrent Embedding Representation
There are two ways to obtain a character embedding.
- Embedding lookup table: The embedding lookup table converts an input character into a vector. A t-SNE plot of the character embeddings is shown below. It can be observed that the characters which occur mostly in female name suffixes, like 'a', 'e', 'i', 'y', gather around each other.
- Recurrent embedding: When just a single character is fed as input, the recurrent embedding contains information about that character alone. Hence, the character representations obtained from the recurrent embedding have a similar alignment to those from the embedding lookup table. This can be observed in the t-SNE plot of the 32-dimensional recurrent embeddings for single-character inputs: the characters (a, e, i, y) lie close to each other in the second plot too. The table gives the prediction, along with its probability, for each single-character input. 'a', 'e', 'i' and 'y' are predicted as the female class, whereas all other characters are predicted as the male class.
T-SNE : embedding lookup table.
T-SNE : recurrent embedding of char-input.
| Input character | Predicted gender (0 = female, 1 = male) | Probability |
|-----------------|-----------------------------------------|-------------|
| a | 0 | 0.932037 |
| b | 1 | 0.897870 |
| c | 1 | 0.773074 |
| d | 1 | 0.847266 |
| e | 0 | 0.716494 |
| f | 1 | 0.925287 |
| g | 1 | 0.761032 |
| h | 1 | 0.529182 |
| i | 0 | 0.715751 |
| j | 1 | 0.869343 |
| k | 1 | 0.923236 |
| l | 1 | 0.694704 |
| m | 1 | 0.729725 |
| n | 1 | 0.647062 |
| o | 1 | 0.848708 |
| p | 1 | 0.873483 |
| q | 1 | 0.834077 |
| r | 1 | 0.727535 |
| s | 1 | 0.763149 |
| t | 1 | 0.612793 |
| u | 1 | 0.695273 |
| v | 1 | 0.781005 |
| w | 1 | 0.769300 |
| x | 1 | 0.673478 |
| y | 0 | 0.621023 |
| z | 1 | 0.703863 |
5.3 Recurrent embedding of N-gram prefixes & suffixes
Below are t-SNE plots of the recurrent embedding (the final state of the GRU cell) when the inputs are the top 30 most frequent prefixes and suffixes of male names and female names. Red points correspond to male names and blue points to female names.
top 30 most frequent 3-gram prefixes
top 30 most frequent 3-gram suffixes
top 30 most frequent 4-gram prefixes
top 30 most frequent 4-gram suffixes
- Suffix clusters: The top 30 most frequent 3-gram and 4-gram suffixes of male names (red points) and female names (blue points) form quite distinct clusters, whereas the prefixes show some clustering effect but nothing as distinct. It can therefore be inferred that the network learns to classify the input name based on the suffix rather than the prefix. One probable reason the network learns suffixes over prefixes is that the most frequent suffixes have a higher percentage and are more distinct between the classes than the prefixes.
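Plots like these can be produced with scikit-learn's t-SNE implementation; a minimal sketch in which random 32-dimensional vectors stand in for the actual recurrent embeddings of the n-grams (note that t-SNE's perplexity must be smaller than the number of points):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# Random 32-d vectors standing in for recurrent embeddings of 60 n-grams
embeddings = rng.standard_normal((60, 32))

# Project to 2-D for plotting; perplexity kept small for a small point set
points = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
print(points.shape)  # (60, 2)
```

In the blog's plots, each 2-D point would then be colored red or blue by the class of its n-gram.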
6. Conclusion and Future Work
In conclusion, I think the patterns the neural network learns are based on 2 factors: frequency, and how distinct the patterns are between the classes. For example, in GenPredict suffixes were learnt by the network because the most frequent suffixes occur more often than the most frequent prefixes, and suffixes are more distinct between the classes than prefixes: "leen" is mostly a female suffix, whereas the prefix "chris" occurs frequently in both classes. Note that the network recognizes this information just by looking at the examples, and that there might be other patterns the network has recognized for classification that were not found in my data or model analysis. Here's a list of future work that could easily be performed:
- Predicting ethnicity, age, etc. : Names can also provide other information like the ethnicity of the person or the age of the person based on the popularity of the name in a particular decade.
- Using phoneme representations of the names: Phoneme representations can surface hidden patterns that are not evident from the spelling, potentially giving a slight increase in accuracy.
- Predicting demographic features based on installed apps : It would be interesting to predict the demographics of a person like age, gender, etc. based on the apps they have installed on their phones.