Programs to find banned words
Image: NikonUSA.com

#nsf #bannedwords #maga #python #genai

Some have reported that content submitted to the NSF (National Science Foundation) is being flagged. TikTok is also reported to be "shadow banning" certain words.

https://www.npr.org/2025/02/13/nx-s1-5295043/sen-ted-cruzs-list-of-woke-science-includes-self-driving-cars-solar-eclipses

From https://www.tiktok.com/@turnthepage3/video/7468718543194885407?is_from_webapp=1&sender_device=pc&web_id=7480886948300260907

Other departments in the federal government have their own lists of banned words, such as this one, reportedly from within the CDC (Centers for Disease Control and Prevention):

From https://www.youtube.com/watch?v=6_P0HT-Pya4&t=52s

I'm leaving discussion of the veracity, rationality, and morality of this subject to others.

The focus of this article is the opportunity to write a utility that reports whether YOUR document contains such words.

My initial GenAI programming prompt

To begin, I'd like the logic that scans text within a program I can run in a Command Terminal app. I'll get to the GUI drag-and-drop features later after I experiment with text scanning.

Here is my initial prompt within Perplexity.ai:

Write a Python CLI program to identify whether text contains a list of words loaded from a word list in an external CSV file.        

PROTIP: The word list is stored in a CSV file which many programs can edit. Such a list will certainly be updated in the future.

The result, after some enhancements, is at:

https://github.com/wilsonmar/python-samples/blob/main/scan-ban.py

The program uses Python's built-in "csv" module to read the CSV file at

https://github.com/wilsonmar/python-samples/blob/main/scan-ban.csv

Currently, it contains only a few words (starting with "a").
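Here is a minimal sketch of how such a word list could be loaded with the "csv" module. It is not necessarily how scan-ban.py does it. The function name load_word_list and the sample words are my assumptions, and an in-memory string stands in for the real scan-ban.csv file.

```python
import csv
import io


def load_word_list(csv_file):
    """Return every non-empty cell of an open CSV file, stripped and lowercased."""
    words = []
    for row in csv.reader(csv_file):
        for cell in row:
            cell = cell.strip().lower()
            if cell:
                words.append(cell)
    return words


# Hypothetical stand-in for scan-ban.csv (the real file's layout may differ).
sample_csv = io.StringIO("advocate\nactivism, accessible\n")
banned_words = load_word_list(sample_csv)
print(banned_words)
```

With a real file, the same function could be called as load_word_list(open("scan-ban.csv", newline="", encoding="utf-8")).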

The program was also generated to make use of the "typer" 3rd-party library at https://pypi.org/project/typer/

I added Python code to ensure that the libraries are loaded using CLI commands.
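One way such a check could look is sketched below, assuming the goal is to install a missing library with pip before importing it. The helper name ensure_installed is hypothetical and the actual code in scan-ban.py may take a different approach.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(module_name, pip_name=None):
    """Return True if module_name is already importable.

    Otherwise run "pip install" for it (using the current interpreter)
    and return False to signal that an install was needed.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    return False


# The standard library "csv" module is always present, so nothing is installed here.
print(ensure_installed("csv"))
# For a third-party library the call would be: ensure_installed("typer")
```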

To run the program, its usage text recommends this command in Terminal:

./scan-ban.py "sample text is advocate for children" scan-ban.csv

The response should be:

The input string contains the following words from the list: advocate        

A future version of this would read a document instead of a short text string.
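A rough sketch of what that future version might do is below, assuming plain-text documents and single-word list entries. The function name scan_file and the temporary sample file are illustrative only.

```python
import re
import tempfile
from pathlib import Path


def scan_file(path, banned_words):
    """Return (line_number, word) pairs for each listed word found in a text file."""
    banned = {w.lower() for w in banned_words}
    hits = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            # Split each line into lowercase word tokens, ignoring punctuation.
            for token in re.findall(r"[a-z']+", line.lower()):
                if token in banned:
                    hits.append((line_number, token))
    return hits


# Demonstrate with a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "sample.txt"
    doc.write_text("First line is neutral.\nWe advocate for children.\n", encoding="utf-8")
    hits = scan_file(doc, ["advocate"])
print(hits)
```

Reporting line numbers makes it easier to find the flagged word in a long document. Formats such as DOCX or PDF would need a text-extraction step first.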

Use of "@app.command()" is a stylistic choice to enable Typer features:

app = typer.Typer()        

A Python "list comprehension" is used to find words from the word list that are present in the input string (case-insensitive).
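A list comprehension of that kind might look like the sketch below. The names find_listed_words and word_list are my own, and the generated program may match differently (for example by substring rather than by whole word).

```python
import re


def find_listed_words(text, word_list):
    """Return the words from word_list that appear in text, ignoring case."""
    # Tokenize once so that "Advocate." still matches "advocate".
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return [word for word in word_list if word.lower() in tokens]


found = find_listed_words("sample text is advocate for children", ["activism", "advocate"])
print("The input string contains the following words from the list:", ", ".join(found))
```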

With AI, I can quickly generate equivalent programs in different languages (shell CLI, Python, Node.js).

Not The Same as Sentiment Analysis

It would not be appropriate to use "AI Sentiment Analysis" tools to identify specific words, because Sentiment Analysis generally outputs only an overall rating of how positive or negative a group of words is.

However, the need here is to identify references to concepts considered "thought crimes" -- a term from George Orwell's dystopian novel Nineteen Eighty-Four.

Azure AI Content Moderation

Microsoft has a "Content Moderation" service as part of their "AI Content Safety" SaaS offering running within their Azure cloud. See https://www.youtube.com/watch?v=zmvkFbjsTrc

  1. Try the Text Moderation browser GUI on Microsoft's "AI Foundry" website at: https://ai.azure.com/explore/contentsafety/text
  2. Sign in using your Azure subscription AND select/define a project. (Charges begin accruing from here).
  3. Select "Azure AI Services" resource.
  4. Click on "Violent content with misspelling" provided as an example. That text should appear under "2. Test". Click "Run test".
  5. When I saw the pop-up error "Your account does not have access to this resource, please contact your resource owner to get access." I gave up. Do you know who I can contact to get around this?

See:

https://learn.microsoft.com/en-us/azure/ai-services/content-safety/quickstart-text?tabs=visual-studio%2Cwindows&pivots=programming-language-python

Tutorials about Content Safety ("Responsible AI") are at:

https://learn.microsoft.com/en-us/training/modules/responsible-ai-studio/

Programming Documentation for the Content Safety service is at:

https://learn.microsoft.com/en-us/azure/ai-services/content-safety/studio-quickstart?pivots=content-safety-studio

The product page for the service, which monitors user-generated and AI-generated content, is at:

https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety/


Word Replacement

Content Moderation utilities may not suggest replacements the way spell checkers and tools such as Grammarly do.

Comments in social media posts mention replacement words such as "non-male" instead of "female", "non-caucasian", etc.
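A simple replacement pass could be sketched as follows. Only the "female" to "non-male" pair comes from the comments mentioned above. The function name and the dictionary structure are my assumptions, and a real tool would need to handle capitalization and grammar better than this.

```python
import re

# One pair taken from the social media comments; a real mapping would be longer.
REPLACEMENTS = {"female": "non-male"}


def suggest_replacements(text, replacements):
    """Replace whole-word, case-insensitive matches with their mapped alternative."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in replacements) + r")\b",
        re.IGNORECASE,
    )
    # The replacement text is inserted as-is, so original capitalization is lost.
    return pattern.sub(lambda m: replacements[m.group(0).lower()], text)


rewritten = suggest_replacements("Female participants were surveyed.", REPLACEMENTS)
print(rewritten)
```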

Concepts in context, not just words

Just looking for the appearance of specific words from a list may cause a high number of false negatives. For example, plural forms of words would be missed unless they are added to the list explicitly.
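As a partial workaround, a regular expression can accept simple plural endings without listing them. This is only a sketch under the assumption that plurals end in "s" or "es". Irregular plurals and other word forms (such as "advocated") are still missed, and the function names are hypothetical.

```python
import re


def build_pattern(word_list):
    """Compile a pattern matching each listed word with an optional s/es ending."""
    alternatives = "|".join(re.escape(word) for word in word_list)
    return re.compile(rf"\b({alternatives})(?:es|s)?\b", re.IGNORECASE)


def find_with_plurals(text, word_list):
    """Return the sorted base words whose singular or simple plural form appears in text."""
    pattern = build_pattern(word_list)
    return sorted({m.group(1).lower() for m in pattern.finditer(text)})


matches = find_with_plurals(
    "Advocates met with several activists.", ["advocate", "activist", "bias"]
)
print(matches)
```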

The intent of bans is to identify all thought and expression around certain topics.

Here is where GenAI LLMs (Large Language Models) differ from the programmatic approach of just finding exact word matches. RAG (Retrieval Augmented Generation) techniques can be used to provide the context of concepts referenced by prompt requests to LLMs. Unlike a list of words, LLMs and RAG represent concepts as multi-dimensional vectors.

Most humans can only visualize three dimensions. But LLMs represent each word (token) as a vector with hundreds or thousands of dimensions (the billions usually quoted for LLMs count parameters, not dimensions). Vectors make it possible to calculate how closely one word is conceptually related to another as the distance between their vectors.
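To make the distance idea concrete, here is a small sketch using cosine similarity, one common way to compare vectors. The three-dimensional vectors are made-up toy values, not real embeddings, which would come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration only.
toy_vectors = {
    "advocate": [0.9, 0.1, 0.0],
    "champion": [0.8, 0.2, 0.1],
    "eclipse": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_vectors["advocate"], toy_vectors["champion"])
unrelated = cosine_similarity(toy_vectors["advocate"], toy_vectors["eclipse"])
print(f"advocate vs champion: {related:.3f}")
print(f"advocate vs eclipse:  {unrelated:.3f}")
```

A higher score means the two vectors point in a more similar direction, which is how a concept-based scan could flag a synonym that is not on the word list.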

However, there are tools such as L

https://enjalot.github.io/latent-scope/us-federal-laws


Several databases can store and search vectors for RAG: MongoDB Atlas, Pinecone, Weaviate, etc.

LangChain and Haystack utilities can be used in a locally run program that loads and then attaches a RAG database along with the prompts to analyze text.

See https://lakefs.io/blog/what-is-rag-pipeline/

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept/retrieval-augmented-generation?view=doc-intel-4.0.0

Online: LlamaHub, AWS Textract, and Azure AI Document Intelligence provide connectors and document loaders to create a RAG based on user-defined parameters such as chunk size, overlap, and retrieval depth (e.g., top-k results). For example, LangChain’s WebBaseLoader extracts content from HTML web pages (based on a URL) and FileLoader extracts from local files. pdftotext and pandoc extract text from files in PDF, DOCX, PPTX, and other specific formats.
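The parameters named above (chunk size, overlap, and top-k) can be illustrated without any RAG library. In this sketch the function names are hypothetical, and the word-overlap scoring is a deliberately crude stand-in for the vector similarity search a real pipeline would use.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into character chunks of chunk_size that share overlap characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the final chunk is not just a repeat of the overlap.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def top_k_chunks(query, chunks, k):
    """Return the k chunks sharing the most words with the query (toy scoring)."""
    query_words = set(query.lower().split())

    def score(chunk):
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:k]


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
print(top_k_chunks(
    "solar eclipse",
    ["self driving cars", "a solar eclipse today", "solar panels"],
    k=2,
))
```

Overlap keeps a sentence that straddles a chunk boundary from being lost to both chunks, at the cost of storing some text twice.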

After that, is there an LLM that rewrites text to avoid the thought crime policing?

TBD

