Reducing the Rates of Recidivism for People of Color Through Machine Learning


We know that the prison industrial system disproportionately harms ethnic minorities. The system presents several interlocking challenges: mass incarceration of Black and Latinx people, unfair and inconsistent criminal punishment, and prison labor that benefits private companies. All of this was important context for our project, because we needed to understand the difficulties built into the system and the conditions that push a former prisoner back into it. In legal terms, recidivism is measured by criminal acts that result in re-arrest, reconviction, or return to prison without a new sentence. Limited job opportunities, the struggles of parole, and long institutionalization mean that many inmates become more comfortable inside the system than at home.

1) Data Collection:

Our data collection originally started with four datasets. The first was recidivism data beginning in 2008, which tracked inmates over a five-year period to see who returned to prison and who did not. Its features included County of Indictment, Release Year, Age at Release, and Return Status. County of Indictment covered all 62 New York counties, plus an Unknown value when an inmate's county of origin could not be identified; it records the county in which the person is believed to have committed the crime. Age at Release is the age at which the prisoner left the system. One limitation of not being able to track inmates individually is that we cannot see, within the five-year window, in which year someone returned. Knowing whether someone returned in the first year versus the third would let us draw finer conclusions about the likelihood of return, and in turn about how parole supervision and inmate support should be handled year by year. All the features in this dataset are categorical or ordinal.

The other dataset we looked at was Inmates Under Custody beginning in 2008, which is updated yearly with information about each inmate under custody. It covers current age, housing facility, latest admission type, most serious crime, gender, security level, race/ethnicity, and facility security level. The latest admission type feature carried no unique information, since every inmate was listed as a new court commitment. The security level gave us insight into the type of crimes committed: higher-security prisons are more likely to hold inmates convicted of Class A felonies, while lower-security prisons generally hold people convicted of less serious crimes. The housing facility is the facility within the county where the prisoner is held. This matters because inmates often move between locations, and the prison system rarely keeps digitized, modernized records; some inmates may have been transferred to another state, and some facilities in New York State may hold inmates from other states. Age was an important feature: with a mean of 37, most of the dataset consists of individuals below that age, and this feature turned out to have a significant influence on our models. With only 4.32% of values missing for the Race/Ethnicity feature, the dataset had no significant amount of missing data, so we did not implement any technique to handle missing values.
Instead, we opted to bring in data from other sources, such as census demographics for each county, to factor socio-economic statistics into our predictions, and to merge features across the datasets we were already working with, in the interest of finding correlations between recidivism and factors such as poverty level, race, or gender.

In order to classify each crime under Most Serious Crime in our datasets, we had to find a way to connect each crime to a penal-law class. For example, we wanted to connect Adultery to its appropriate class under the New York State Penal Law, in this case a class B misdemeanor. To achieve this, we found a PDF file listing these crimes in the format used by the New York State Department of Corrections and Community Supervision. That file was missing the class for each crime, but it did contain each crime's authority code, so we found another file listing the Penal Law offenses with their class and code. By joining the two on the code, we were able to add features containing a full description of each crime and its appropriate class to our datasets.
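The join described above can be sketched in a few lines. The field names and the code-to-class lookup here are illustrative placeholders (only the Adultery example comes from the text), not the actual files we used:

```python
# One source lists crimes with an authority code but no class;
# the other maps codes to their New York Penal Law classes.
crimes = [
    {"crime": "ADULTERY", "code": "255.17"},
    {"crime": "ASSAULT 2ND", "code": "120.05"},
]

code_to_class = {
    "255.17": "B Misdemeanor",
    "120.05": "D Felony",
}

def attach_class(rows, lookup):
    """Join the crime list with its penal-law class via the shared code."""
    merged = []
    for row in rows:
        merged.append({**row, "class": lookup.get(row["code"], "Unknown")})
    return merged

merged = attach_class(crimes, code_to_class)
```

Unmatched codes fall back to "Unknown" rather than dropping the row, mirroring how the datasets handle unidentified values.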


One of the challenges of data collection was finding datasets that allowed us to track individual inmates across all sources, to see their chances of returning and understand how they are impacted. Due to privacy concerns, however, no dataset provided a unique identifier shared across all of them. This made it difficult to test our original hypothesis, since we wanted to track inmates individually and predict whether a given person would return. Without per-inmate data, we could neither build unique identifiers nor assume that all datasets covered the same inmates. This shifted the project toward predicting recidivism from a set of aggregate features instead.

The next dataset we examined was the Jail Population By County data. This file contains a census of daily counts within each jail, submitted by the State Commission of Correction, and was last updated on February 3rd, 2022. The dataset covered New York City and the rest of the state, but it provided no facility or county detail beyond a column indicating whether a jail is inside NYC (NYC or non-NYC). It records the year of the census and several daily-average counts. Census is the total number of inmates the facility is responsible for, whether or not they are housed at the jail. Boarded Out is the daily average number of individuals under a facility's jurisdiction who are boarded out to another county's correctional facilities; these individuals are not counted in the In-House Census. Boarded In is the daily average of individuals from other jurisdictions housed at the facility, excluding those in federal custody. The In-House Census is the daily average of individuals the county is responsible for housing, broken down into sentenced, civil, federal, technical parole violators, and others; Sentenced counts individuals who have been convicted, sentenced to a jail term, and are physically present at the facility. We did not pursue this dataset because, with no unique identifier for tracking inmates, the only usable information was county-level data, which we could already get from the housing facilities and county demographics. While this dataset could have shown how many people board in and out, it does not let us draw conclusions about how many are returning through recidivism.
Additionally, while all our other datasets reach back to 2008, this one goes back to 1997. Even after noting that we could simply drop the earlier years, we agreed the dataset was not a good reflection of our data.


Lastly, we looked at the Prison Admissions Beginning 2008 dataset, which contains admission type, gender, county, and related fields. Admission Year is the year the inmate entered the prison system, and Admission Month is the month of admission, encoded numerically from January through December. Gender records the gender the person identifies with, and Age at Admission is their age on entering the system. Most Serious Crime records the crime committed. Other features include the county of commitment, last known residence, and admission type, where admission type distinguishes new court commitments, parole violations, and returns to court, the latter two serving as a representation of recidivism. This dataset was very similar to the recidivism dataset, except that it let us compare the county of last known residence with the county of indictment. We did not pursue it because it introduced no new information beyond the counties and the ability to locate inmates geographically.

2) Exploratory Data Analysis (including visualizations)

When conducting exploratory data analysis, we also needed to understand what recidivism looks like in reality. Within New York State, the recidivism rate is 43% as of 2022, and African Americans are more likely to return to the system: among the 3,863 African American ex-prisoners reported, the recidivism rate was 50.6%. We began the exploratory analysis with an ER model to understand the correlations we would be building into our models. The ER model gave us an idea of how the datasets connect, although we did not pursue every relationship listed in it, and it helped us identify the county and age values most common in the data.

Across our datasets we encountered no null values and nothing that needed to be dropped; instead, we needed to add features. The one exception was in the recidivism dataset, where the single county listed as Unknown had to be dropped because it could not be matched when merging. In this dataset, the most common age of recidivism was 35, which tells us something about the ages most represented in the system. Most people in the dataset are aged 20-35, and the single largest group is 18-20-year-olds. This suggests that those who enter the prison industrial complex young are more likely to experience it again, while older people are least likely to return; younger people become accustomed to the system and, given the difficulty of re-entry, commit further crimes. This dataset does not, however, represent other states or all categories of crime. The largest bias we noticed was the gender imbalance: there were 176,808 males in the dataset and only 11,482 females, roughly 6% of the records. Although we are aware that males commit more crimes than females, we feel it would be beneficial for future work to examine male and female facilities separately, to see whether there are distinctions between male and female crime.
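As a quick sanity check on the gender imbalance, the female share follows directly from the two counts above:

```python
# Counts taken from the recidivism dataset described above.
males, females = 176_808, 11_482
total = males + females
female_share = females / total  # roughly 6% of the records are women
```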



This is a quick summary of the dataset, showing the distribution of its largest features, Release Year and Age at Release. A total of 188,650 inmates were tracked. Releases span 2008-2015, and since each release is followed for five years, the recidivism follow-up window extends through 2020.


This displays return status grouped across individuals, showing how many returned. From it we see that a total of 78,647 people underwent recidivism, and 110,003 did not return.
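From these two counts, the observed five-year return rate in the dataset follows directly:

```python
returned, not_returned = 78_647, 110_003
total = returned + not_returned   # 188,650 tracked releases
rate = returned / total           # about 0.417, i.e. ~42% returned
```

This is consistent with the statewide recidivism figure cited earlier.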


Caption: We see no correlation between age at release and release year; every year shows a similarly large number of releases.


The incarcerated individuals under custody dataset represents individuals in the custody of the New York State Department of Corrections and Community Supervision. It is updated at the end of the fiscal year, on March 31, and its features cover admission type, county, gender, age, race/ethnicity, crime, and housing facility.

The dataset contains 722,087 rows and 9 columns. See below a quick summary of the dataset:


As previously mentioned, this dataset had no significant missing values. See below a table with the percentage of missing values for each feature.


Looking at features such as Race/Ethnicity and Gender, we see that the inmates in our dataset are predominantly male (96%) and Black (50%).





We also wanted to see which counties are most prevalent under the County of Indictment feature (see chart below), which tells us where most of the inmates were indicted.



The dataset contained a large number of unique values for Most Serious Crime, which forced us to find an alternative way to classify these values before feeding them into our model. That process is explained in detail in the data preparation section of this report.



3) Preliminary research and/or models - including any dead-ends you hit or anything you spent substantial time on that didn't make it into your final models.

Because we decided not to proceed with some datasets, our biggest challenge was finding ways to track inmates. Since county data was a common thread across the datasets, we pulled county demographics to determine whether county characteristics help explain recidivism, though county alone was not a strong feature. We did this using the U.S. Census. The demographic we felt best captured recidivism risk is poverty, so we looked at the poverty rate over the last 12 months for each of the 62 counties. Due to formatting, we had to encode the dataset; its features covered every county plus a United States total, population, percentage below poverty, age subsets (>18, 18-20), gender, poverty by gender, and poverty by age.

We also merged the Penal Law Offenses and felony class into the Inmates Under Custody dataset, hot-encoding the felony class. The Penal Law offenses let us see how each crime is weighted within the system and give our features legal grounding. Felony classes run from A to E: class A is the most severe (crimes such as manslaughter) and class E the least severe, which lets us reason about sentence length and the severity of the judgment handed down.

4) Data Preparation (Building Our Model)

Before we got to our final model, there were a couple of feature extractions to do. One was building a severity map. Our dataset offered several ways to gauge the severity of a crime, but New York attaches no numeric weight to severity, so before settling on any balanced weighting we used the felony classes to derive a severity weight. We wrote a function, get_standardized_penal_data, that returns the standardized weight for a given crime code. We did not pursue sentiment analysis, which labels data as positive, negative, or neutral: a crime cannot reasonably be called neutral or positive in anyone's eyes, but based on penal offenses one crime can certainly be weighted more heavily than another.
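A minimal sketch of such a severity map. The linear weights below are illustrative assumptions rather than the values we shipped; only the ordering (class A most severe, class E least) comes from the text:

```python
# Felony classes ordered from most to least severe.
SEVERITY = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

def get_standardized_penal_data(felony_class):
    """Return a numeric severity weight for a NY penal-law felony class.

    Unknown or missing classes map to 0 so they can be filtered later.
    """
    return SEVERITY.get(felony_class.upper(), 0)
```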


Another improvement was normalizing the county population. We used min-max scaling to normalize each county's value relative to the others:

(value - min(pop)) / (max(pop) - min(pop))

We also binarized gender, because once we merged our datasets the models had difficulty reading the male and female strings; our encoding was 1: Female, 0: Male. For our final model we worked with two merged datasets, though most of our testing focused on recidivism and county statistics: get_inmate_data() merges the penal law data with the inmate data, and the recidivism frame merges the county of indictment with the recidivism data.
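The normalization and encoding steps above can be sketched as follows; the sample populations are made up for illustration:

```python
def min_max(value, lo, hi):
    """Min-max scaling: maps value into [0, 1] relative to lo/hi."""
    return (value - lo) / (hi - lo)

def binarize_gender(g):
    """1 for female, 0 for male, matching the encoding described above."""
    return 1 if g.strip().upper().startswith("F") else 0

# Hypothetical county populations, scaled relative to one another.
pops = [10_000, 55_000, 100_000]
scaled = [min_max(p, min(pops), max(pops)) for p in pops]
```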



Our final dataset consisted of 123,853 rows and 31 columns after merging with the recidivism data.



5) Models:

All of the models we utilized were statistical classifiers, each providing some form of prediction.

The first model we constructed was a decision tree. Decision trees make a decision about a process by following a learned decision algorithm, and they can be used for either regression or classification; in our case we used classification. A decision tree models probable outcomes and their consequences, and we assumed that a tree of sufficient depth could spot and split sections of the data based on combinations and ranges of features that would be too laborious to investigate manually. A challenge we ran into was balancing informative features against useless ones, since extra features slowed the model down considerably. To address this, we built a function called prune_features, which tests the modeling function on progressively smaller subsets of the features to find the best results, and returns the model, its accuracy, and the feature list that produced it. We also used this helper to cross-validate all models and see which feature selections worked best. We held out 20% of the dataset for testing.
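A sketch of what a prune_features-style helper might look like. This greedy backward-elimination version and the toy scorer are assumptions for illustration; our actual helper tested progressively smaller subsets against the real modeling function:

```python
def prune_features(features, score_fn):
    """Greedy backward elimination over feature subsets.

    Repeatedly drops the single feature whose removal yields the best
    score, keeping the best subset seen overall.
    score_fn(list_of_features) -> accuracy-like float.
    """
    best_subset, best_score = list(features), score_fn(features)
    current = list(features)
    while len(current) > 1:
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score_fn)
        score = score_fn(current)
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Toy scorer: pretend only "age" and "county" carry signal, and every
# extra feature adds a small noise penalty.
def toy_score(subset):
    signal = 0.4 * ("age" in subset) + 0.2 * ("county" in subset)
    return 0.5 + signal - 0.02 * len(subset)

best, score = prune_features(["age", "county", "gender", "noise1"], toy_score)
```

Under the toy scorer, the noise features are pruned away and the informative pair survives.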




Next, we constructed a naive Bayes model. Naive Bayes is a probabilistic machine learning model used in a variety of classification tasks. It applies Bayes' theorem to calculate conditional probabilities, under the assumption that the features are independent and equally important: each feature receives the same weight, and none is assumed to depend on another. With naive Bayes we wondered whether a broader statistical and probabilistic analysis would reveal patterns in recidivism. The model was fairly straightforward, and like the decision tree it was tested on a 20% holdout.
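To make the independence assumption concrete, here is a tiny categorical naive Bayes written from scratch. The toy age bands and labels are made up for illustration; this is not our production model:

```python
from collections import Counter, defaultdict

def train_nb(rows, label_key):
    """Fit a categorical naive Bayes: class priors plus per-class,
    per-feature value counts."""
    priors = Counter(r[label_key] for r in rows)
    counts = defaultdict(Counter)  # (label, feature) -> value counts
    for r in rows:
        for k, v in r.items():
            if k != label_key:
                counts[(r[label_key], k)][v] += 1
    return priors, counts

def predict_nb(model, row):
    """Score each label as prior * product of smoothed conditionals."""
    priors, counts = model
    total = sum(priors.values())
    best, best_p = None, -1.0
    for label, n in priors.items():
        p = n / total
        for k, v in row.items():
            c = counts[(label, k)]
            # Add-one smoothing so unseen values never zero out the score.
            p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)
        if p > best_p:
            best, best_p = label, p
    return best

data = [
    {"age_band": "18-20", "returned": "yes"},
    {"age_band": "18-20", "returned": "yes"},
    {"age_band": "50+",   "returned": "no"},
    {"age_band": "50+",   "returned": "no"},
]
model = train_nb(data, "returned")
pred = predict_nb(model, {"age_band": "18-20"})
```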


We also constructed a COMPAS-style model. COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is used to predict recidivism risk, rating defendants from lowest to highest risk of re-offending. It provides a category-based evaluation and typically draws on 137 factors. We chose this approach because COMPAS is actually used in New York courts to inform decisions about recidivism risk while a defendant awaits trial, and it is reported to predict recidivism correctly about 61% of the time. The model is usually evaluated as a binary classifier, and its accuracy score folds in both kinds of errors: a false positive is a defendant predicted as medium/high risk who does not re-offend, and a false negative is a defendant predicted as low risk who does re-offend. We used the COMPAS idea a bit differently: we trained our decision tree and naive Bayes models on the Gender feature by splitting the data into females who did and did not return and males who did and did not return, then used each model to predict who is most likely to return given their recidivism data. The objective was to estimate the percentage of likely returners by gender. To do this, we built a test-classification function that separates the data into four subsets by gender and return status; for each data frame it returns the subset tied to a percentage, with the original data frame representing accuracy.
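The four-way split described above can be sketched as follows; the field names and the 0/1 encodings are assumptions for illustration:

```python
def split_by_gender_and_return(rows):
    """Partition records into the four evaluation groups described above:
    {Female, Male} x {Returned_0, Returned_1}."""
    groups = {
        "Female_Returned_0": [], "Female_Returned_1": [],
        "Male_Returned_0": [],   "Male_Returned_1": [],
    }
    for r in rows:
        sex = "Female" if r["gender"] == 1 else "Male"
        groups[f"{sex}_Returned_{r['returned']}"].append(r)
    return groups

sample = [
    {"gender": 1, "returned": 0},
    {"gender": 1, "returned": 1},
    {"gender": 0, "returned": 1},
]
groups = split_by_gender_and_return(sample)
```

Each of the four groups can then be scored separately by the decision tree and naive Bayes models.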



We also constructed a random forest model, with the objective of using some of the features that had no meaningful correlation with one another, since a random forest is fundamentally a good approach for datasets with many low-correlation features. To test, and as a rough cross-validation against other values, we evaluated how much each feature contributes to the model's predictive power. For the random forest trained on recidivism data and demographics, "Age at Release" had the highest feature importance at approximately 0.7, while most of the remaining features scored below 0.05. See the figure below for the complete results.
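Ranking features by their share of the model's predictive power can be sketched like this; apart from the ~0.7 score for "Age at Release" reported above, the values and names are placeholders:

```python
# Illustrative importances; only the "Age at Release" figure comes
# from our results, the rest are made-up placeholders.
importances = {
    "Age at Release": 0.70,
    "Gender": 0.04,
    "County Poverty Rate": 0.03,
    "Felony Class": 0.02,
}

def rank_features(imp):
    """Sort features from most to least important."""
    return sorted(imp.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_features(importances)
```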


The fourth model was the k-nearest neighbors algorithm, another widely used classification technique. KNN works by computing the distance between a query and every example in the data, selecting the K examples closest to the query, and voting for the most frequent label (in our case: Returned Parole Violation, New Felony Offense, or Not Returned). We first chose K by trying several values and picking the one with the highest accuracy, which was K = 69 at 0.58 accuracy. We then implemented a loop over multiples of 63 (the 62 NY counties plus Unknown) that runs until the K with the highest accuracy is found, obtaining K = 272 as the best value. With a large K everything tends to be classified as the most probable class, so the error rate increases; in our case, the small number of features in the training set directly limits the model's effectiveness.
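A from-scratch sketch of the KNN procedure and the K search described above, on a deliberately tiny made-up dataset (our real search stepped through larger K values on the full data):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote over its k nearest neighbours
    (squared Euclidean distance on the feature tuples)."""
    by_dist = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query)),
    )
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

def best_k(train, held_out, k_values):
    """Return the K with the highest held-out accuracy (our search loop)."""
    def accuracy(k):
        hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
        return hits / len(held_out)
    return max(k_values, key=accuracy)

train = [
    ((0.0,), "Not Returned"), ((0.1,), "Not Returned"),
    ((0.9,), "Returned Parole Violation"), ((1.0,), "Returned Parole Violation"),
]
held_out = [((0.05,), "Not Returned"), ((0.95,), "Returned Parole Violation")]
k = best_k(train, held_out, [1, 3])
```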



6) Evaluation of Results:

In evaluating our results, the COMPAS-style model worked with separate subsets of females and males. We did not binarize the outcome ourselves; instead the decision tree and naive Bayes models produced a score on a 0-1 scale, where 1 means most likely to return and 0 least likely. Females who did not return are represented by Female_Returned_0 and those who returned by Female_Returned_1, and likewise for males. Female_Returned_0 scored 1.0 under naive Bayes, which in this circumstance is a false positive: a high predicted risk of re-offending with no actual re-offense.


On the other hand, those who did return were measured inaccurately, making them false negatives. For males, the model likewise produced a false positive at 99%, with males who did not return being false negatives. Comparing the two models, naive Bayes had the lowest accuracy in predicting recidivism and in classifying features correctly. The decision tree model also produced false positives and false negatives, but with higher accuracy across females and males; males who had already returned were given only a 22% risk of returning. This led us to conclude that COMPAS-style models should not be used in courts and trials, as they are not an accurate measure: our accuracy came in below the 61% figure reported for New York courts.



Next, our decision tree was measured on F1 score and accuracy. The F1 score is the harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0; the relative contribution of precision and recall depends on the weighting set by the average parameter. Our F1 score with decision trees was not high, at 0.35 with an accuracy of 60%.
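To show how an accuracy near 60% can coexist with an F1 of 0.35, here is the harmonic-mean computation on illustrative error counts (chosen so F1 lands near 0.35; these are not our actual confusion counts):

```python
def precision_recall_f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: decent precision, poor recall.
p, r, f1 = precision_recall_f1(tp=30, fp=20, fn=90)
```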



Our naive Bayes model, on the other hand, had a lower F1 score of 33% and an accuracy of 58%. While its accuracy was close to the decision tree's, this reinforced our cross-validation finding: under the COMPAS-style evaluation we likewise saw a drop in accuracy and a lower F1 score for naive Bayes.



Our KNN classifier was trained to distinguish between new felony offenses, not returned, and parole violations. A confusion matrix shows the result of testing the algorithm:



For 100% accuracy, all correct predictions would lie on the diagonal of the table and every off-diagonal value would be zero. We can see from the matrix that the model struggles with the labels that have low representation: out of 5 New Felony Offense cases, it predicted 4 as Not Returned and 1 as Returned Parole Violation, which is reflected in a precision of 0.00 for that class. Not Returned obtained the highest precision at 0.59, followed by Parole Violation at 0.42.
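Building such a confusion matrix is straightforward; the sketch below reproduces the New Felony Offense row described above (the other rows are omitted, so they stay zero):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true label, columns the predicted label."""
    index = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

labels = ["New Felony Offense", "Not Returned", "Returned Parole Violation"]
# The 5 New Felony Offense cases from the text: 4 predicted Not Returned,
# 1 predicted Returned Parole Violation, none correct.
y_true = ["New Felony Offense"] * 5
y_pred = ["Not Returned"] * 4 + ["Returned Parole Violation"]
m = confusion_matrix(y_true, y_pred, labels)
```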

For our random forest model we got an accuracy of 0.57. This was somewhat unexpected, as this type of model should be relatively effective on our dataset, given its modest number of largely uncorrelated features.


As the classification report above shows, there is not a large difference in precision across classes, meaning our model is not especially discriminating when predicting recidivism; this balance is reflected in the F1 score.

7) Improvements (different data preparation, different algorithms, hyperparameter tuning, etc)

There are many ways in which we believe our models could be improved. One is analyzing and correcting biases in our dataset, which has an overall gender bias, with far more males than females represented. Analyzing males and females individually would give further insight into how each is treated within the system, and the gender imbalance suggests the data is not fully representative of the inmate population.

Another improvement would be finding approaches that avoid hot-encoding the datasets, and using a rate of recidivism already calculated within the data; the DART model, for instance, is a tool that builds such predictions directly. Lastly, we found that we had to merge many datasets in order to preserve the accuracy of the recidivism and length-of-stay information. Merging let us build our final model, but it also restricted which data could and could not be used.


8) Conclusion, results on the test set, a summary of improvements, with visualizations

While there were many struggles in training and building our models, this project allowed us to replicate and extend an analysis of how the criminal system operates in New York. It taught us the importance of replicating data work and calculating risk. Where other studies have approached this problem by building individual profiles, we tried our best to minimize bias. Replication matters in machine learning: making models more accurate reduces human error. Our models came close to the accuracy seen elsewhere, around 67% in prior work versus 60% for ours. We concluded that, among the attributes we considered, age was the largest contributor to recidivism. In the future, we hope to measure socio-economic factors more specifically and see how they impact recidivism.
