Gender Classification From Text Using ML
The drive behind the project
As part of my ongoing learning in the field of ML/DL, I took on this task to play around with NLP. This is my first attempt at NLP, and I wanted to share what I did. The main thing I wish to take from this task is how to preprocess free text and convert it into features.
So what do you say? Let's begin...
The Task
To better understand human behavior, we can analyze texts that people have written and, based on the profile we build, gain deeper knowledge of how they act. Many aspects can contribute to such a profile, but in this project we will focus on one: determining the gender of the author based on the text they wrote!
We will use ML to do so: a pipeline that preprocesses the data and extracts features from the text, followed by ML models that perform the classification.
Dataset
The dataset contains 754 samples, where:
- gender - the target column, representing the author's gender: m = male, f = female.
- story - the text the author wrote.
Dataset Analysis
From the analysis we can see that there is a big gap between the number of men and the number of women in the data. This is something we should address later on so that the data will be balanced between the labels.
Preprocessing
Step 1 - Regex for text cleaning
# regular expressions
import re
Regular expressions (AKA regex) are an in-text pattern searcher: given a pattern, the regex engine iterates over the text and tries to match the pattern to sub-sequences of it.
To learn more and test regular expressions: https://regexr.com/
To use regex we created two functions:
- clean_doc - iterates over all the samples in the dataset and overrides each one with the cleaned text from clean_sample.
- clean_sample - cleans a single sample using regex, removing non-letter characters, digits, single letters, and repeated spaces.
def clean_doc(docs_org):
    # copy so the original collection isn't modified in place
    docs = docs_org.copy()
    for i in range(len(docs_org)):
        docs[i] = clean_sample(docs[i])
    return docs

def clean_sample(s):
    s = re.sub(r'[^\w]|[\d]', ' ', s)   # replace non-word characters and digits with spaces
    s = re.sub(r"\s\w\s", " ", s)       # drop single letters surrounded by spaces
    while "  " in s:                    # collapse repeated spaces
        s = s.replace("  ", " ")
    return s
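For a quick sanity check, here is a hypothetical example sentence (made up for illustration, not from the dataset) run through the cleaner:
sample_text = "I wrote 3 stories in 2020, and a friend read them!"
print(clean_sample(sample_text))
# punctuation and digits become spaces, single letters between spaces are dropped,
# and repeated spaces are collapsed into one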
Step 2 - TF-IDF and N-Grams for feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
Term frequency-inverse document frequency (AKA TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (Wikipedia)
Basically, it gives each word a score based on its rarity in the text and in the whole corpus.
TF - Term frequency
tf(t,d) = count of t in d / number of words in d
Counts the number of appearances of the term "t" in document "d", normalized by the number of words in "d".
Note: We normalize the results so longer documents won't be more "powerful" when compared to shorter ones.
DF - Document Frequency
df(t) = occurrence of t in N documents
Counts the number of documents that "t" appears in.
IDF - Inverse Document Frequency
idf(t) = N/df
IDF is the inverse of the DF. It will give a low number to terms that are more frequent and higher numbers to ones that are less frequent.
To keep the value from exploding when the number of documents is high, we use a log.
idf(t) = log(N/(df + 1))
Note: we add 1 to avoid dividing by zero.
TF-IDF
tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
By multiplying TF by IDF, we get the TF-IDF score!
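As a quick toy illustration (not part of the project's pipeline), the score for one term can be computed by hand following the formulas above:
import math

docs = ["the cat sat", "the dog barked", "the cat and the dog played"]  # N = 3 toy documents
term = "barked"
d = docs[1].split()

tf = d.count(term) / len(d)                     # count of t in d / number of words in d
df = sum(term in doc.split() for doc in docs)   # number of documents containing t
idf = math.log(len(docs) / (df + 1))            # log(N / (df + 1))
print(tf * idf)                                 # tf-idf(t, d) ~= 0.135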
N-Grams
In our case, it refers to a sequence of N words treated as a single token. Sometimes a sequence of words has a different meaning than each word on its own and should also be taken into account, so in sklearn one of the parameters determines the lengths of word sequences we want the TF-IDF vectorizer to try.
1-Gram = a single word, 2-Gram = a pair of words, and so on...
vec = TfidfVectorizer(
        lowercase=True,      # convert everything to lowercase
        max_features=200,    # maximum number of N-gram features to keep
        max_df=0.7,          # ignore terms that appear in more than 70% of the documents
        min_df=3,            # ignore terms that appear in fewer than 3 documents
        ngram_range=(1, 3))  # from 1-Gram to 3-Gram
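A minimal sketch of applying the vectorizer to the cleaned stories (the variable names df and clean_docs here are assumptions on my part, not the original code):
clean_docs = clean_doc(df["story"].tolist())   # assuming df is the loaded dataset
tfidf_features = vec.fit_transform(clean_docs)
print(tfidf_features.shape)                    # (number of samples, up to 200 N-gram features)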
Step 3 - Standard Scaler
from sklearn.preprocessing import StandardScaler
This method normalizes the features individually so that each feature has mean = 0 and var = 1.
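A minimal sketch of the scaling step, assuming tfidf_features is the TF-IDF matrix from the previous step (converted to a dense array here, since StandardScaler cannot center sparse input - an assumption, not necessarily how the original code did it):
scaler = StandardScaler()
x_scaled = scaler.fit_transform(tfidf_features.toarray())  # each column now has mean 0 and var 1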
Step 4 - PCA
from sklearn.decomposition import PCA
Principal component analysis (AKA PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
The main idea is to obtain lower-dimensional data while preserving as much of the data's variation as possible. We use it to reduce the number of features. For more info: A One-Stop Shop for Principal Component Analysis
I used this method because I wanted to reduce the dimensionality of the data after TF-IDF. I could have picked fewer features in the vectorizer, but I preferred to extract a decent number of features and then find a better representation of them using PCA.
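A minimal sketch of the PCA step, assuming x_scaled is the scaled feature matrix (the number of components here is illustrative, not the value used in the original project):
pca = PCA(n_components=0.95)        # keep enough components to explain ~95% of the variance
x_pca = pca.fit_transform(x_scaled)
print(x_pca.shape)                  # fewer columns than the original feature matrix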
Step 5 - Label Encoder
from sklearn.preprocessing import LabelEncoder
We use the encoder to change our label from characters to numeric values. The label encoder finds the unique labels and gives each of them a number. In our case the target feature becomes f = 0, m = 1.
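A minimal sketch of the encoding, assuming df["gender"] holds the m/f characters:
le = LabelEncoder()
y_encoded = le.fit_transform(df["gender"])   # classes are sorted alphabetically, so f -> 0, m -> 1
print(le.classes_)                           # ['f' 'm']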
Step 6 - Over Sampling
Because we have a small number of female samples compared to male ones, we used oversampling: we duplicated female samples whose text length is around the average, so we don't add noise to the data and the distribution stays centered.
temp = []
# word_counter holds the word count of each story; the last column of df_tf_pca is the encoded gender
for i, row in zip(range(n), df_tf_pca.itertuples(index=False)):
    if row[-1] == 0 and 250 <= word_counter[i] <= 350:
        # a female sample with average-length text - append it twice
        temp.append(list(row))
        temp.append(list(row))
    else:
        temp.append(list(row))
df_oversamp = pd.DataFrame(temp, columns=df_tf_pca.columns)
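A quick sanity check (not shown in the original article) is to compare the label counts before and after the duplication:
print(df_tf_pca["gender"].value_counts())    # before oversampling
print(df_oversamp["gender"].value_counts())  # after oversampling - the gap should shrink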
Step 7 - Train Test Split
from sklearn.model_selection import train_test_split
Splitting the samples into train and test sets - 0.33 of the samples go to the test set, and we used stratify to keep the label proportions even across the split.
x = df_oversamp.drop(["gender"], axis = 1)
y = df_oversamp["gender"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42, stratify=y)
Models
For each model, we used Grid Search to find the best hyperparameters. We tried only two models - KNN and MLP - to see if there is a difference between them.
We took KNN because it is one of the most common and simple ML algorithms, and MLP as a basic NN to see how it performs.
KNN
from sklearn.neighbors import KNeighborsClassifier
The k-nearest neighbors algorithm (AKA k-NN) determines a sample's label by holding a vote among its K nearest neighbors, using a distance function over all dimensions to find who those neighbors are.
The simplified idea is that "if I'm closer to more neighbors that are labeled X, there is a high probability that I'm also X".
from sklearn.model_selection import GridSearchCV

n_neighbors_knn = [5, 10, 12, 15, 20, 50, 100]
weights_knn = ["uniform", "distance"]
hyperparameters_knn = dict(n_neighbors=n_neighbors_knn, weights=weights_knn)
grid_search_knn = GridSearchCV(KNeighborsClassifier(), hyperparameters_knn, cv=5, scoring='f1_micro', n_jobs=-1)
grid_search_knn.fit(X_train, y_train)
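Once the search finishes, we can inspect the winning hyperparameters and the test-set performance (a sketch using sklearn's classification_report, similar to what is shown for the MLP below):
from sklearn.metrics import classification_report

print(grid_search_knn.best_params_)   # the n_neighbors / weights combination that won the cross-validation
print(classification_report(y_test, grid_search_knn.predict(X_test)))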
MLP
from sklearn.neural_network import MLPClassifier
A multilayer perceptron (AKA MLP) is a basic NN containing an input layer, an output layer, and hidden layers. Each layer is built from perceptrons with activation functions, and each layer passes its output forward to the next.
mlp = MLPClassifier()
hidden_layer_sizes_mlp = [(50,20),(100,50,20),(150,100,50)]
max_iter_mlp = [50,100,150,200,300]
random_state_mlp = [20,30,40]
hyperparameters_mlp = dict(hidden_layer_sizes=hidden_layer_sizes_mlp, max_iter=max_iter_mlp,random_state=random_state_mlp)
grid_search_mlp = GridSearchCV(mlp, hyperparameters_mlp, cv=5, scoring='f1_micro', n_jobs=-1)
grid_search_mlp.fit(X_train, y_train)
from sklearn.metrics import classification_report
print(classification_report(y_test, grid_search_mlp.predict(X_test)))