Gender Classification From Text Using ML
The drive behind the project
As part of my ongoing learning in the field of ML/DL, I took on this task to play around with NLP. This is my first attempt at NLP, and I wanted to share what I did. The main thing I wish to take from this task is how to preprocess free text and convert it into features.
So what do you say? Let's begin...
The Task
To better understand human behavior, we can analyze texts that people have written and, based on the profile we build, gain deeper knowledge of how they act. Many aspects can contribute to such a profile, but in this project we will focus on one: determining the gender of the author based on the text they wrote!
We will use ML to do so: a pipeline that preprocesses the data and extracts features from the text, followed by ML models that perform the classification.
Dataset
The dataset contains 754 samples, where:
- gender - the target column, representing the author's gender: m = male, f = female.
- story - the text the author wrote.
Dataset Analysis
From the analysis we can see that there is a big gap between the number of men and the number of women in the data. This is something we should address later on so that the data will be balanced between the labels.
Preprocessing
Step 1 - Regex for text cleaning
# regular expressions
import re
Regular expressions (AKA regex) are an in-text pattern searcher: given a pattern, the regex engine iterates over the text and tries to match the pattern to sub-sequences of it.
To learn more and test regular expressions: https://regexr.com/
To use regex we created two functions:
- clean_doc - iterates over all the samples in the dataset and overrides each one with the cleaned text from clean_sample.
- clean_sample - cleans a single sample using regex, removing non-letter characters, digits, single letters, and repeated spaces.
def clean_doc(docs_org):
    # copy so the original collection isn't modified in place
    docs = docs_org.copy()
    for i in range(len(docs_org)):
        docs[i] = clean_sample(docs[i])
    return docs

def clean_sample(s):
    s = re.sub(r'[^\w]|[\d]', ' ', s)   # replace non-word characters and digits with spaces
    s = re.sub(r"\s\w\s", " ", s)       # drop single letters surrounded by spaces
    while "  " in s:                    # collapse repeated spaces
        s = s.replace("  ", " ")
    return s
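For a quick sanity check, here is a hypothetical example sentence (made up for illustration, not from the dataset) run through the cleaner:
sample_text = "I wrote 3 stories in 2020, and a friend read them!"
print(clean_sample(sample_text))
# punctuation and digits become spaces, single letters between spaces are dropped,
# and repeated spaces are collapsed into one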
Step 2 - TF-IDF and N-Grams for feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
Term frequency-inverse document frequency (AKA TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (Wikipedia)
Basically, it gives each word a score based on its rarity in the text and in the whole corpus.
TF - Term frequency
tf(t,d) = count of t in d / number of words in d
Counts the number of appearances of the term "t" in document "d", normalized by the number of words in "d".
Note: We normalize the results so longer documents won't be more "powerful" when compared to shorter ones.
DF - Document Frequency
df(t) = occurrence of t in N documents
Counts the number of documents that "t" appears in.
IDF - Inverse Document Frequency
idf(t) = N/df
IDF is the inverse of the DF. It will give a low number to terms that are more frequent and higher numbers to ones that are less frequent.
To keep the value from exploding when the number of documents is high, we use a log.
idf(t) = log(N/(df + 1))
Note: we add 1 to avoid dividing by zero.
TF-IDF
tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
By multiplying TF by IDF, we get the TF-IDF score!
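As a quick toy illustration (not part of the project's pipeline), the score for one term can be computed by hand following the formulas above:
import math

docs = ["the cat sat", "the dog barked", "the cat and the dog played"]  # N = 3 toy documents
term = "barked"
d = docs[1].split()

tf = d.count(term) / len(d)                     # count of t in d / number of words in d
df = sum(term in doc.split() for doc in docs)   # number of documents containing t
idf = math.log(len(docs) / (df + 1))            # log(N / (df + 1))
print(tf * idf)                                 # tf-idf(t, d) ~= 0.135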
N-Grams
In our case, it refers to a sequence of N words treated as a single token. Sometimes a sequence of words has a different meaning than each word on its own and should also be taken into account, so in sklearn one of the parameters determines the lengths of word sequences we want the TF-IDF vectorizer to try.
1-Gram = a single word, 2-Gram = a pair of words, and so on...
vec = TfidfVectorizer(
        lowercase=True,      # convert everything to lowercase
        max_features=200,    # maximum number of N-gram features to keep
        max_df=0.7,          # ignore terms that appear in more than 70% of the documents
        min_df=3,            # ignore terms that appear in fewer than 3 documents
        ngram_range=(1, 3))  # from 1-Gram to 3-Gram
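A minimal sketch of applying the vectorizer to the cleaned stories (the variable names df and clean_docs here are assumptions on my part, not the original code):
clean_docs = clean_doc(df["story"].tolist())   # assuming df is the loaded dataset
tfidf_features = vec.fit_transform(clean_docs)
print(tfidf_features.shape)                    # (number of samples, up to 200 N-gram features)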
Step 3 - Standard Scaler
from sklearn.preprocessing import StandardScaler
This method normalizes the features individually so that each feature has mean = 0 and var = 1.
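A minimal sketch of the scaling step, assuming tfidf_features is the TF-IDF matrix from the previous step (converted to a dense array here, since StandardScaler cannot center sparse input - an assumption, not necessarily how the original code did it):
scaler = StandardScaler()
x_scaled = scaler.fit_transform(tfidf_features.toarray())  # each column now has mean 0 and var 1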
Step 4 - PCA
from sklearn.decomposition import PCA
Principal component analysis (AKA PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
The main idea is to obtain lower-dimensional data while preserving as much of the data's variation as possible. We use it to reduce the number of features. For more info: A One-Stop Shop for Principal Component Analysis
I used this method because I wanted to reduce the dimensionality of the data after TF-IDF. I could have picked fewer features in the vectorizer, but I preferred to extract a decent number of features and then find a better representation of them using PCA.
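A minimal sketch of the PCA step, assuming x_scaled is the scaled feature matrix (the number of components here is illustrative, not the value used in the original project):
pca = PCA(n_components=0.95)        # keep enough components to explain ~95% of the variance
x_pca = pca.fit_transform(x_scaled)
print(x_pca.shape)                  # fewer columns than the original feature matrix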
Step 5 - Label Encoder
from sklearn.preprocessing import LabelEncoder
We use the encoder to change our label from characters to numeric values. The label encoder finds the unique labels and gives each of them a number. In our case the target feature becomes f = 0, m = 1.
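A minimal sketch of the encoding, assuming df["gender"] holds the m/f characters:
le = LabelEncoder()
y_encoded = le.fit_transform(df["gender"])   # classes are sorted alphabetically, so f -> 0, m -> 1
print(le.classes_)                           # ['f' 'm']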
Step 6 - Over Sampling
Because we have a small number of female samples compared to male ones, we used oversampling: we duplicated female samples whose text length is around the average, so we don't add noise to the data and the distribution stays centered.
temp = []
# word_counter holds the word count of each story; the last column of df_tf_pca is the encoded gender
for i, row in zip(range(n), df_tf_pca.itertuples(index=False)):
    if row[-1] == 0 and 250 <= word_counter[i] <= 350:
        # a female sample with average-length text - append it twice
        temp.append(list(row))
        temp.append(list(row))
    else:
        temp.append(list(row))
df_oversamp = pd.DataFrame(temp, columns=df_tf_pca.columns)
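A quick sanity check (not shown in the original article) is to compare the label counts before and after the duplication:
print(df_tf_pca["gender"].value_counts())    # before oversampling
print(df_oversamp["gender"].value_counts())  # after oversampling - the gap should shrink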
Step 7 - Train Test Split
from sklearn.model_selection import train_test_split
Splitting the samples into train and test sets - 0.33 of the samples go to the test set, and we used stratify to keep the label proportions even across the split.
x = df_oversamp.drop(["gender"], axis = 1)
y = df_oversamp["gender"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42, stratify=y)
Models
For each model, we used Grid Search to find the best hyperparameters. We tried only two models - KNN and MLP - to see if there is a difference between them.
We took KNN because it is one of the most common and simple ML algorithms, and MLP as a basic NN to see how it performs.
KNN
from sklearn.neighbors import KNeighborsClassifier
The k-nearest neighbors algorithm (AKA k-NN) determines a sample's label by holding a vote among its K nearest neighbors, using a distance function over all dimensions to find who those neighbors are.
The simplified idea is that "if I'm closer to more neighbors that are labeled X, there is a high probability that I'm also X".
from sklearn.model_selection import GridSearchCV

n_neighbors_knn = [5, 10, 12, 15, 20, 50, 100]
weights_knn = ["uniform", "distance"]
hyperparameters_knn = dict(n_neighbors=n_neighbors_knn, weights=weights_knn)
grid_search_knn = GridSearchCV(KNeighborsClassifier(), hyperparameters_knn, cv=5, scoring='f1_micro', n_jobs=-1)
grid_search_knn.fit(X_train, y_train)
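Once the search finishes, we can inspect the winning hyperparameters and the test-set performance (a sketch using sklearn's classification_report, similar to what is shown for the MLP below):
from sklearn.metrics import classification_report

print(grid_search_knn.best_params_)   # the n_neighbors / weights combination that won the cross-validation
print(classification_report(y_test, grid_search_knn.predict(X_test)))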
MLP
from sklearn.neural_network import MLPClassifier
A multilayer perceptron (AKA MLP) is a basic NN containing an input layer, an output layer, and hidden layers. Each layer is built from perceptrons with activation functions, and each layer passes its output forward to the next.
mlp = MLPClassifier()
hidden_layer_sizes_mlp = [(50,20),(100,50,20),(150,100,50)]
max_iter_mlp = [50,100,150,200,300]
random_state_mlp = [20,30,40]
hyperparameters_mlp = dict(hidden_layer_sizes=hidden_layer_sizes_mlp, max_iter=max_iter_mlp,random_state=random_state_mlp)
grid_search_mlp = GridSearchCV(mlp, hyperparameters_mlp, cv=5, scoring='f1_micro', n_jobs=-1)
grid_search_mlp.fit(X_train, y_train)
from sklearn.metrics import classification_report
print(classification_report(y_test, grid_search_mlp.predict(X_test)))