Programs to find banned words
#nsf #bannedwords #maga #python #genai
Some have reported that content submitted to the NSF (National Science Foundation) is being flagged. TikTok is also "shadow banning" certain words.
Other departments in the federal government have their own lists of banned words, such as this one reportedly from within the CDC (Centers for Disease Control):
I'm leaving discussion of the veracity, rationality, and morality of this subject to others.
The focus of this article is the opportunity to write a utility that alerts you when YOUR document contains such words.
My initial code-generation prompt
To begin, I want the text-scanning logic in a program I can run in a Command Terminal app. I'll get to GUI drag-and-drop features later, after I experiment with text scanning.
Here is my initial prompt within Perplexity.ai:
Write a Python CLI program to identify whether text contains a list of words loaded from a word list in an external CSV file.
PROTIP: The word list is stored in a CSV file which many programs can edit. Such a list will certainly be updated in the future.
The result, after some enhancements, is at:
The program makes use of Python's built-in "csv" library for reading the CSV file at
Currently, it contains only a few words (starting with "a").
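As an illustration only (the actual file at the link may differ), a hypothetical scan-ban.csv could hold one word per row, read with the built-in csv module like this:

import csv

# Hypothetical scan-ban.csv contents (one word per row), e.g.:
#   advocate
#   ...

def load_word_list(csv_path: str) -> list[str]:
    """Return the banned words from the first column of each CSV row, lowercased."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[0].strip().lower() for row in csv.reader(f) if row and row[0].strip()]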
The program was also generated to make use of the "typer" 3rd-party library at https://pypi.org/project/typer/
I added Python code to ensure that the required third-party libraries are installed, using CLI commands.
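That code isn't reproduced here, but a common pattern (a sketch, assuming pip is available) is to try the import and, if it fails, run pip as a CLI command from within Python:

import importlib
import subprocess
import sys

def ensure_package(package_name: str) -> None:
    """Install a third-party package with pip if it cannot be imported."""
    try:
        importlib.import_module(package_name)
    except ImportError:
        subprocess.run([sys.executable, "-m", "pip", "install", package_name], check=True)

ensure_package("typer")   # the 3rd-party CLI library used below
import typer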
To run the program, the usage text in Terminal recommends this command:
./scan-bad.py "sample text is advocate for children" scan-ban.csv
The response should be:
The input string contains the following words from the list: advocate
A future version of this would read a document instead of a short text string.
Use of "@app.command()" is a stylistic choice to enable Typer features:
app = typer.Typer()
A Python "list comprehension" is used to find words from the word list that are present in the input string (case-insensitive).
With AI, I can quickly generate equivalent programs in different languages (shell CLI, Python, Node.js).
Not The Same as Sentiment Analysis
It would not be appropriate to use "AI Sentiment Analysis" tools to identify specific words, because Sentiment Analysis generally outputs only a general rating of how many positive or negative words appear in the text provided to the AI.
However, the need here is to identify references to concepts considered "thoughtcrimes" -- a term from George Orwell's dystopian novel Nineteen Eighty-Four.
Azure AI Content Moderation
Microsoft has a "Content Moderation" service as part of their "AI Content Safety" SaaS offering running within their Azure cloud. See https://www.youtube.com/watch?v=zmvkFbjsTrc
See:
Tutorials about Content Safety ("Responsible AI") are at:
Programming Documentation for the Content Safety service is at:
The service monitors both user-generated and AI-generated content.
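As a rough sketch only (it assumes the azure-ai-contentsafety Python SDK and your own Azure Content Safety resource; response field names can differ between SDK versions), analyzing a piece of text looks roughly like this:

# pip install azure-ai-contentsafety   (assumed)
import os
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

# Endpoint and key come from your own Azure Content Safety resource.
client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

response = client.analyze_text(AnalyzeTextOptions(text="sample text to check"))

# Field names below follow the 1.x SDK shape and may differ in other versions.
for result in response.categories_analysis:
    print(result.category, result.severity)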
Word Replacement
Content Moderation utilities may not offer replacement suggestions the way spell checkers such as Grammarly do.
Comments in social media posts mention replacement words such as "non-male" instead of "female", "non-caucasian", etc.
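Such replacements could be scripted with a small substitution table. A sketch (the mapping only echoes the example above and is not a recommendation):

import re

# Substitution table echoing the replacement mentioned above.
REPLACEMENTS = {
    "female": "non-male",
    # add further pairs as needed
}

def replace_words(text: str) -> str:
    """Swap each flagged word for its suggested replacement, matching whole words only."""
    for word, substitute in REPLACEMENTS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", substitute, text, flags=re.IGNORECASE)
    return text

print(replace_words("Grants for female founders"))  # -> "Grants for non-male founders"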
Concepts in context, not just words
Just looking for the appearance of specific words from a list may cause a high number of false negatives. Plural and other inflected forms would need to be added explicitly or matched programmatically, as in the sketch below.
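A sketch of one way to catch simple plural forms without enlarging the CSV (regular expressions with optional suffixes; real inflection handling would need a stemmer such as the one in NLTK):

import re

def find_banned(text: str, word_list: list[str]) -> list[str]:
    """Match each banned word as a whole word, allowing simple -s / -es plurals."""
    found = []
    for word in word_list:
        pattern = rf"\b{re.escape(word)}(s|es)?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(word)
    return found

print(find_banned("Advocates for children", ["advocate", "equity"]))  # ['advocate']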
The intent of bans is to identify all thought and expression around certain topics.
Here is where GenAI LLMs (Large Language Models) differ from the programmatic approach of just finding exact word matches. RAG (Retrieval Augmented Generation) techniques can be used to provide the context of concepts referenced by prompt requests to LLMs. Unlike a list of words, LLMs and RAG represent concepts as multi-dimensional vectors.
Most humans can only imagine three dimensions, as illustrated above. But LLMs work with embedding vectors of hundreds or thousands of dimensions (and billions of parameters), with each word (token) represented as a vector in that high-dimensional space. Vectors enable calculating the extent to which one word is conceptually related to other words as the distance (or angle) between their vectors.
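A toy illustration of that distance calculation, using made-up 3-dimensional vectors and cosine similarity (real embeddings come from an embedding model and have far more dimensions):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Higher values mean the two vectors (concepts) point in similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only.
advocate = [0.9, 0.2, 0.1]
champion = [0.8, 0.3, 0.1]
granite  = [0.1, 0.1, 0.9]

print(cosine_similarity(advocate, champion))  # close to 1.0 -> related concepts
print(cosine_similarity(advocate, granite))   # much lower   -> unrelated concepts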
However, there are tools that help. Several databases can serve as the vector store for RAG: MongoDB Atlas, Pinecone, Weaviate, etc.
LangChain and Haystack utilities can be used in a locally run program that loads a RAG database and attaches retrieved context to the prompts used to analyze text.
Online, LlamaHub, AWS Textract, and Azure AI Document Intelligence provide connectors and document loaders to build a RAG pipeline based on user-defined parameters such as chunk size, overlap, and retrieval depth (e.g., top-k results). For example, LangChain’s WebBaseLoader extracts content from HTML web pages (given a URL) and FileLoader extracts from local files. pdftotext and pandoc extract text from files in PDF, DOCX, PPTX, and other specific formats.
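A sketch of that loading and chunking step with LangChain (import paths are an assumption, since they move between LangChain versions; pip install langchain-community langchain-text-splitters beautifulsoup4 is assumed):

# Sketch: load a web page and split it into chunks for a RAG vector store.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the HTML content of a page (the URL is a placeholder).
loader = WebBaseLoader("https://example.com/policy-page")
documents = loader.load()

# User-defined chunking parameters: chunk size and overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

print(f"{len(chunks)} chunks ready to embed into a vector database")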
After that, is there an LLM that rewrites text to avoid the thought-crime policing?
TBD
Comment (1 week ago): Hey Wilson! I did a similar project some years ago. But it was only 7 words, inspired by George Carlin. And I had them all encrypted.
Comment (1 week ago): Can it work with X?
Comment: International comparative statistics are important (an index of freedom of expression, for example).