Gender Classification From Text Using ML
By Nóra Ambróz on May 14, 2019



The drive behind the project

As part of my ongoing learning process in the field of ML/DL, I took on this task to play around with NLP. This is my first attempt at NLP, and I wanted to share what I did. The main thing I wish to take from this task is how to preprocess free text and convert it into features.

So what do you say? Let's begin...

Git: omerugi/Gender_Text_Analysis

The Task

To better understand human behavior, we can analyze texts that people have written and gain deeper knowledge of how they act based on a profile we build. There are many aspects that can contribute, but in this project we will focus on one: determining the gender of the author based on the text they wrote!

We will use ML to do so, with a pipeline that preprocesses the data, extracts features from the text, and then classifies using ML models.

Dataset

The dataset contains 754 samples, where:

  • gender - the target column, representing the author's gender: m = man, f = woman.
  • story - the text the author wrote.

Dataset Analysis

[Charts: number of samples per gender label in the dataset]

We can see from the charts that there is a big gap between the number of men and the number of women in the data. This is something we should address later on and fix so that the data is balanced between the labels.

Preprocessing

Step 1 - Regex for text cleaning

# regular expressions
import re        

Regular expressions (AKA regex) are an in-text pattern searcher: given a pattern, the regex engine iterates over the text and tries to match the pattern to sub-sequences of it.

To learn more and to test regular expressions interactively: https://regexr.com/
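As a quick illustration (the sample string below is made up, not from the dataset), re.sub replaces every match of a pattern with the given replacement:

import re

text = "Hello!! I was born in 1993."
print(re.sub(r'[^\w]|[\d]', ' ', text))  # punctuation and digits are replaced with spaces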

To use regex we created two functions:

  1. clean_doc - iterates over all the samples in the dataset and overrides each of them with the cleaned text from clean_sample.
  2. clean_sample - cleans a single sample using regex, removing anything that is not a letter (punctuation and digits), dropping single-letter tokens, and collapsing double spaces.

def clean_doc(docs_org):
    # copy the documents and clean each one with clean_sample
    docs = docs_org.copy()
    for i in range(len(docs_org)):
        docs[i] = clean_sample(docs[i])
    return docs

def clean_sample(s):
    s = re.sub(r'[^\w]|[\d]', ' ', s)  # replace non-letter characters and digits with a space
    s = re.sub(r"\s\w\s", " ", s)      # drop single-letter tokens
    while "  " in s:                   # collapse repeated spaces into one
        s = s.replace("  ", " ")
    return s
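A minimal usage sketch (the sample strings are made up; in the project the input is the story column of the dataset):

samples = [
    "I was 25 when we moved, & it was snowing!!",
    "We drove for 3 hours... then the car broke down.",
]
print(clean_doc(samples))  # cleaned copies of the two strings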

Step 2 - TF-IDF and N-Grams for feature extraction

from sklearn.feature_extraction.text import TfidfVectorizer        

Term frequency-inverse document frequency (AKA TF-IDF) "is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus." (Wikipedia)

Basically, it gives each word a score based on its rarity in the text and in the whole corpus.

TF - Term frequency

tf(t,d) = count of t in d / number of words in d        

Counts the number of appearances of the word "t" in document "d".

Note: We normalize the results so longer documents won't be more "powerful" when compared to shorter ones.

DF - Document Frequency

df(t) = occurrence of t in N documents        

Counts the number of documents that "t" appears in.

IDF - Inverse Document Frequency

idf(t) = N/df        

IDF is the inverse of the DF. It will give a low number to terms that are more frequent and higher numbers to ones that are less frequent.

To keep the value from exploding when the number of documents is high, we take the log.

idf(t) = log(N/(df + 1))        

Note: we add 1 to avoid dividing by zero.

TF-IDF

tf-idf(t, d) = tf(t, d) * log(N/(df + 1))        

By multiplying TF by IDF, we get the TF-IDF score!
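A tiny worked example with made-up numbers: suppose the word "winter" appears 3 times in a 300-word story and shows up in 10 of the N = 754 stories:

tf("winter", d)  = 3 / 300 = 0.01
idf("winter")    = log(754 / (10 + 1)) ≈ 4.23
tf-idf("winter") ≈ 0.01 * 4.23 ≈ 0.042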

N-Grams

In our case, it refers to a sequence of N words treated together as one term. Sometimes a sequence of words has a different meaning than each word on its own and should also be taken into account, so in sklearn one of the parameters determines the length of the word sequences the TF-IDF vectorizer should try.

1-Gram = a single word, 2-Gram = a pair of words, and so on.

vec = TfidfVectorizer(
        lowercase=True,      # convert everything to lowercase
        max_features=200,    # keep at most 200 n-gram features
        max_df=0.7,          # ignore terms that appear in more than 70% of the documents
        min_df=3,            # ignore terms that appear in fewer than 3 documents
        ngram_range=(1, 3))  # from 1-Gram to 3-Gram
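Assuming the cleaned stories sit in a pandas DataFrame df under a story column (the variable names here are my own, not necessarily the ones in the notebook), fitting the vectorizer looks roughly like this:

import pandas as pd

docs = clean_doc(df["story"].tolist())
tfidf = vec.fit_transform(docs)  # sparse matrix: one row per story, up to 200 columns
# get_feature_names_out() is the scikit-learn >= 1.0 name; older versions use get_feature_names()
df_tf = pd.DataFrame(tfidf.toarray(), columns=vec.get_feature_names_out())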

Step 3 - Standard Scaler

from sklearn.preprocessing import StandardScaler        

This method normalizes the features individually so that each feature has mean = 0 and var = 1.
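A sketch of that step applied to the TF-IDF features (continuing the hypothetical variable names from above):

scaler = StandardScaler()
tfidf_scaled = scaler.fit_transform(df_tf)  # every column now has mean 0 and variance 1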

Step 4 - PCA

from sklearn.decomposition import PCA        

Principal component analysis?(AKA PCA) is the process of computing the principal components and using them to perform a change of basis?on the data, sometimes using only the first few principal components and ignoring the rest.

The main idea is to obtain lower-dimensional data while preserving as much of the data's variation as possible. We use it to reduce the number of features. For more info: A One-Stop Shop for Principal Component Analysis

I used this method because I wanted to reduce the dimension of the data after the TF-IDF. I could have picked fewer features to begin with, but I preferred to extract a decent number of features and then find a better representation of them using PCA.
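A hedged sketch of that step; the number of components below is illustrative, not necessarily the value used in the project:

pca = PCA(n_components=50)                  # keep the first 50 principal components (illustrative)
tfidf_pca = pca.fit_transform(tfidf_scaled)
print(pca.explained_variance_ratio_.sum())  # fraction of the variance the kept components preserve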

Step 5 - Label Encoder

from sklearn.preprocessing import LabelEncoder        

We use the encoder to change our labels from characters to numeric values. The label encoder finds how many unique labels there are and gives each of them a number. In our case, the target feature becomes f = 0, m = 1.
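For example, assuming the raw labels are in df["gender"]:

le = LabelEncoder()
y = le.fit_transform(df["gender"])  # classes are sorted alphabetically, so 'f' -> 0 and 'm' -> 1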

Step 6 - Over Sampling

Because we have a small number of women compared to men, we used oversampling. We duplicated the female samples whose text length is around the average, so we don't add noise to the data and keep the distribution centered.

temp = []
# n is the number of samples; word_counter[i] holds the word count of story i
for i, row in zip(range(n), df_tf_pca.itertuples(index=False)):
    # duplicate female samples (label 0) whose story length is around the average
    if row[-1] == 0 and 250 <= word_counter[i] <= 350:
        temp.append(list(row))
        temp.append(list(row))
    else:
        temp.append(list(row))
df_oversamp = pd.DataFrame(temp, columns=df_tf_pca.columns)
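A quick sanity check is to compare the label counts before and after (assuming the encoded gender column of df_tf_pca is named "gender", as in the split below):

print(df_tf_pca["gender"].value_counts())    # before oversampling
print(df_oversamp["gender"].value_counts())  # after: the gap between 0 (f) and 1 (m) should shrink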

Step 7 - Train Test Split

from sklearn.model_selection import train_test_split        

We split the samples into train and test sets, where 0.33 of the data is used for testing, and we used stratify to keep an even ratio of labels in the split.

x = df_oversamp.drop(["gender"], axis = 1)
y = df_oversamp["gender"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42, stratify=y)        

Models

For each model, we used Grid Search to find the best hyperparameters. We tried only two models - KNN and MLP - to see if there is a difference between them.

We took KNN as it is one of the most common and simple ML algorithms, and MLP as a basic NN, to compare their performance.

KNN

from sklearn.neighbors import KNeighborsClassifier        

The k-nearest neighbors algorithm (AKA k-NN) determines a sample's label by a vote among its k nearest neighbors, using a distance function over all dimensions to find those neighbors.

The simplified idea is that "if I'm closer to more neighbors that are labeled X, there is a high probability that I'm also X".

from sklearn.model_selection import GridSearchCV

n_neighbors_knn = [5, 10, 12, 15, 20, 50, 100]
weights_knn = ["uniform", "distance"]
hyperparameters_knn = dict(n_neighbors=n_neighbors_knn, weights=weights_knn)
grid_search_knn = GridSearchCV(KNeighborsClassifier(), hyperparameters_knn, cv=5, scoring='f1_micro', n_jobs=-1)
grid_search_knn.fit(X_train, y_train)
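The best parameters and the test-set report for KNN can then be inspected the same way as for the MLP below; this snippet simply mirrors that call and is not from the original notebook:

import sklearn.metrics

print(grid_search_knn.best_params_)
print(sklearn.metrics.classification_report(y_test, grid_search_knn.predict(X_test)))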

MLP

from sklearn.neural_network import MLPClassifier        

A multilayer perceptron (AKA MLP) is a basic NN containing input, output, and hidden layers. Each layer is built from perceptrons with an activation function, and the output of each layer is fed forward to the next.

import sklearn.metrics

mlp = MLPClassifier()
hidden_layer_sizes_mlp = [(50,20), (100,50,20), (150,100,50)]
max_iter_mlp = [50, 100, 150, 200, 300]
random_state_mlp = [20, 30, 40]
hyperparameters_mlp = dict(hidden_layer_sizes=hidden_layer_sizes_mlp, max_iter=max_iter_mlp, random_state=random_state_mlp)
grid_search_mlp = GridSearchCV(mlp, hyperparameters_mlp, cv=5, scoring='f1_micro', n_jobs=-1)
grid_search_mlp.fit(X_train, y_train)
print(sklearn.metrics.classification_report(y_test, grid_search_mlp.predict(X_test)))
