Jazmia Henry: Expanding Equity in Natural Language Processing
Stanford Institute for Human-Centered Artificial Intelligence (HAI)
Advancing AI research, education, policy, and practice to improve humanity.
By Beth Jensen
As a child growing up in Atlanta, Jazmia Henry was confused by what she saw. Her Caribbean-born father, a professor with an MBA, frequently spoke in his native Virgin Islands Creole. Her mother could also speak a creole — or blended — language called Gullah Geechee, found in the coastal regions of the Carolinas and Georgia. Regardless of what language her parents spoke, both were consistently treated with respect by their community.
That wasn’t the case with others in the family’s circle — many of whom also held advanced academic degrees — when they chose to speak in African American Vernacular English, or AAVE.
“Everyone seemed to identify that my parents were speaking a different language; no one doubted their intelligence,” Henry says. “But for people speaking AAVE, the second there were no longer just Black folks in the room, or AAVE speakers, they immediately switched to Standard English. There’s a shame and a stigma attached to AAVE; a sense that if we use this language outside, we’re going to be judged as being less intelligent, even by others who also use it. As long as that’s the case, there’s always going to be a problem.”
NLP does not reflect the broader community
For the past 18 months as a fellow at the Stanford Institute for Human-Centered Artificial Intelligence and the Center for Comparative Studies in Race and Ethnicity, Henry has focused on that problem by looking for ways to include AAVE in the large natural language processing models that increasingly affect the lives and outcomes of individuals.
“My project began with the idea to collect AAVE data, bring it together into one reservoir, and incorporate it into some of these models that have been trained on other languages, with the goal of both improving the performance of those models, and to begin to better understand this language,” she says.
Large language models tend to perform poorly when trying to understand or generate words in AAVE, and often pick up and codify biased information. When such models are commercialized and used by companies to make decisions that affect the lives of AAVE speakers, that bias is not only present, but amplified across society. The impact can be wide-reaching and include individuals who are unfairly targeted and edited on social media sites; denied opportunity in employment, housing, and banking; or unjustly treated in the law enforcement and judicial systems.
“Right now, the field of NLP not only does not reflect the broader community, it actively discriminates against that community,” Henry says.
Her effort to incorporate AAVE into NLP models, however, ran into obstacles early on. Compiling and defining AAVE for these models is problematic, in part because AAVE evolves more quickly than most languages, and because much of the vocabulary flips Standard English on its head. For example, the word “mad” is often defined as meaning “angry.” In AAVE, however, it’s frequently used to mean “very,” as in “mad funny.”
“And those are just the beginning issues,” Henry says. “How do we even define AAVE? We know it when we hear it, but writing out a calculation to show a model how to identify it and pick it up is a lot harder to do. Also, much of AAVE is defined tonally by the speaker and by the situation, things that NLP models don’t take into consideration. So we’re faced with the problem of how to identify speech within context in a way that’s fair to the language, but also works for the model.”
Compiling a vocabulary for change
To move forward, Henry opted to create a dataset of AAVE vocabulary to be used by the engineers, academics, and builders who create NLP models. The corpus she built is divided into four collections: the lyrics of 15,000 songs by 105 artists spanning nearly a century; a selection of books; leadership speeches from 34 orators ranging from Frederick Douglass and Sojourner Truth to Ketanji Brown Jackson; and thousands of social media video transcripts, tweets, and blogs posted by Black thought leaders. Together, the open-source collection encompasses over 141,000 words and serves as a starting point for further study.
“My hope is that by starting with this, I can inspire more researchers to enter this space and push it forward, so that AAVE can be represented when we talk about languages included in NLP,” she says.
Henry will be starting a new job as a senior applied engineer at Microsoft this fall, where she’ll be working to develop deep reinforcement learning models designed to help machines perform manufacturing tasks more safely. She hopes the project she began at HAI will not only help researchers eventually incorporate AAVE into language processing models but provide its speakers — and everyone else — with a new appreciation of its history and value.
“Growing up, I lived for a time in Stone Mountain, Georgia, which is the headquarters of the KKK,” she says. “Every day we learned about Black history and had conversations about the things that were taken from our enslaved ancestors and from us as their descendants. Understanding the link between African American Vernacular English and African languages can better help us understand ourselves. This new language could take away shame and inject pride. It could be the evidence that we persevered in the way we communicate with each other, and that there’s nothing that can destroy us as a people, as a culture, and as a language.”
This article is part of the People of HAI series, which spotlights our community of scholars, faculty, students, and staff coming from different backgrounds and disciplines.
Stanford HAI's mission is to advance AI research, education, policy, and practice to improve the human condition.