Intelligent Document Processing: Comparing AWS GenAI and ML Services (Part II)

Hi, we will continue with the second part of the article.

Tokens

A token is a unit of text extracted or identified during the process of tokenization. Tokenization is the task of breaking a sequence of text down into smaller units, which can be words, sub-words, characters, or even phrases. These smaller units are the tokens. For typical English text, a token is approximately four characters long.

As an example, the Azure API supports a maximum of 4,000 tokens shared between the prompt (including system message, examples, message history, and user query) and the model's response. As API calls are charged per token, and you can set a maximum limit for response tokens, you should monitor the current token count to ensure the conversation does not exceed the maximum response token limit.
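As a rough sketch of that monitoring, the snippet below applies the 4,000-token shared limit and the four-characters-per-token heuristic from the text above; a real deployment would count tokens with the provider's actual tokenizer, and the 1,000-token response cap is an assumption for illustration.

```python
# Rough token budgeting for a 4,000-token limit shared between prompt and
# response, using the ~4 characters per token heuristic (approximation only).

MAX_TOKENS = 4000          # shared between prompt and response
CHARS_PER_TOKEN = 4        # rough average for English text

def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return max(1, round(len(text) / CHARS_PER_TOKEN))

def response_budget(prompt: str, max_response_tokens: int = 1000) -> int:
    """Tokens left for the response after the prompt, capped at the response limit."""
    remaining = MAX_TOKENS - estimate_tokens(prompt)
    return max(0, min(remaining, max_response_tokens))
```

Tracking the budget this way before each call keeps the conversation from silently truncating the model's response.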

Token Generation

The first step in analyzing a corpus is to break it down into tokens. For simplicity, you can think of each distinct word in the training text as a token, although in reality tokens can be generated for partial words or for combinations of words and punctuation. The number of tokens in a document is what drives the price calculation.

For example, the word "hamburger" is split into the tokens ham, bur, and ger, while a short, common word like "pear" is a single token.
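To make the idea concrete, here is a toy greedy longest-match tokenizer over a tiny hypothetical vocabulary; real sub-word tokenizers (BPE, WordPiece) learn their vocabularies from data, so both the vocabulary and the matching strategy below are illustrative only.

```python
# Toy greedy longest-match tokenizer over a hypothetical sub-word vocabulary,
# illustrating how "hamburger" can split into ham / bur / ger.

VOCAB = {"ham", "bur", "ger", "pear"}  # made-up vocabulary for the example

def tokenize(word: str) -> list[str]:
    """Split a word into the longest known pieces, with a per-character fallback."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("hamburger"))  # → ['ham', 'bur', 'ger']
print(tokenize("pear"))       # → ['pear']
```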

We won't delve deeply into the machine-learning text classification technique in this article, because its complexity (text preparation, logistic regression, and so on) would take considerable time to cover.

LangChain

LangChain is an open-source framework for building applications based on large language models (LLMs). LLMs are large deep learning models, pre-trained on vast amounts of data, that can generate responses to user queries, such as answering questions or creating images from text-based prompts.

Intelligent Document Processing:

In our problem, we will focus on some of those steps. For the same function, we will "compare" solutions with and without generative AI: in this case, AWS Bedrock versus the equivalent Textract + Comprehend services. In either case, the documents arrive in S3 (on AWS), and an event is triggered when a new document lands. This event starts and executes the IDP steps, which are self-explanatory:
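The event trigger described above can be sketched as a small AWS Lambda handler subscribed to `s3:ObjectCreated` events; the bucket/key parsing follows the standard S3 event notification shape, while the pipeline steps themselves are stubbed placeholders.

```python
# Minimal sketch of the S3-triggered entry point for the IDP pipeline.
# The event shape is the standard S3 event notification; the IDP steps
# (classification, extraction, enrichment) are only stubbed here.

def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

def handler(event: dict, context=None):
    for bucket, key in extract_s3_objects(event):
        # 1. classify the document, 2. extract fields, 3. enrich/redact
        print(f"New document: s3://{bucket}/{key}")
```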

  • Doc Classification Step: we need two functions at this step, classification training and classification inference, for any kind of document. In our architecture, these are the functions of this service that we need:

Using AWS ML services:

  1. Textract APIs: to extract data: structured, semi-structured, and unstructured, based on queries for invoices (NF-e) and receipts (for example) or identity documents
  2. Comprehend APIs: for entity identification: identity detection, training of custom entities, or detection of custom entities
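As a sketch of how the Comprehend side might be consumed, the helper below filters a `DetectEntities`-style response by confidence; the response shape mirrors Comprehend's documented output, the `boto3` call appears only as a comment, and the 0.9 score threshold is an assumption.

```python
# Filter entities from a Comprehend DetectEntities-style response.
# In a real deployment the response would come from:
#   boto3.client("comprehend").detect_entities(Text=text, LanguageCode="en")

def confident_entities(response: dict, min_score: float = 0.9) -> dict[str, list[str]]:
    """Group entity texts by type, keeping only high-confidence detections."""
    out: dict[str, list[str]] = {}
    for ent in response.get("Entities", []):
        if ent["Score"] >= min_score:
            out.setdefault(ent["Type"], []).append(ent["Text"])
    return out

sample = {"Entities": [
    {"Type": "PERSON", "Text": "Ricardo", "Score": 0.99},
    {"Type": "DATE", "Text": "yesterday", "Score": 0.42},
]}
print(confident_entities(sample))  # → {'PERSON': ['Ricardo']}
```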

Using AWS GenAI services:

  1. Bedrock APIs: we choose among all the available LLM options, such as Titan. We chose Anthropic's Claude as the better fit for our case, and we used LangChain and Python to access Claude.
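A sketch of that classification call follows. The prompt wording, label set, and model name are illustrative assumptions, not the article's exact setup, and the LangChain invocation (assuming the `langchain-anthropic` integration) is left as a comment since it needs credentials.

```python
# Build a zero-shot document-classification prompt for Claude.
# The labels and wording below are hypothetical examples.

LABELS = ["invoice", "receipt", "identity document", "other"]

def build_classification_prompt(doc_text: str, labels: list[str] = LABELS) -> str:
    options = ", ".join(labels)
    return (
        "Classify the following document into exactly one of these "
        f"categories: {options}.\n"
        "Answer with the category name only.\n\n"
        f"Document:\n{doc_text}"
    )

# With LangChain (assumed integration package, requires an Anthropic API key):
#   from langchain_anthropic import ChatAnthropic
#   llm = ChatAnthropic(model="claude-3-haiku-20240307")
#   label = llm.invoke(build_classification_prompt(text)).content.strip()
```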


  • Doc Extraction Step: the same applies at this step.

In our architecture, and following the function of this service, we need:

Using AWS ML services:

  1. Textract APIs: extract data from the document: structured, semi-structured, and unstructured, based on queries, invoices and receipts (where applicable), and identity documents
  2. Comprehend APIs: identification of entity names of interest from the extracted text: identity detection, training of custom entities, and detection of custom entities

Using AWS GenAI services:

  1. Bedrock APIs: same functions as above
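To illustrate the query-based extraction on the Textract side, the sketch below pairs QUERY blocks with their QUERY_RESULT answers. The block shape follows Textract's `AnalyzeDocument` response, the `boto3` call is shown only as a comment, and the sample values are made up.

```python
# Pair QUERY blocks with their QUERY_RESULT answers from a Textract
# AnalyzeDocument-style response. Real responses come from:
#   boto3.client("textract").analyze_document(..., QueriesConfig=...)

def query_answers(response: dict) -> dict[str, str]:
    """Map each query's text to the text of its answer block."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for b in blocks.values():
        if b["BlockType"] != "QUERY":
            continue
        for rel in b.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for rid in rel["Ids"]:
                    answers[b["Query"]["Text"]] = blocks[rid]["Text"]
    return answers

sample = {"Blocks": [
    {"Id": "q1", "BlockType": "QUERY", "Query": {"Text": "What is the total?"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "$42.00"},
]}
print(query_answers(sample))  # → {'What is the total?': '$42.00'}
```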


  • Doc Enrichment Step: the same applies at this step.

In our architecture, and following the function of this service, we need:


  • Using both the ML services and GenAI, the functions will be:

    • Redaction of PII (personally identifiable information) and PHI data (where applicable)
    • Tagging of the document
    • Enrichment of the document's metadata
    • Legal retention
    • With Bedrock, additionally: summarization, normalization, and Q&A
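A minimal sketch of the redaction function follows, using regular expressions for two illustrative PII patterns (e-mail addresses and Brazilian CPF numbers); in production this step would be driven by Comprehend's PII detection or an equivalent service rather than hand-written patterns.

```python
import re

# Hypothetical regex-based PII redaction for two example patterns.
# Comprehend's detect_pii_entities would normally drive this step.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),  # Brazilian tax ID
}

def redact(text: str) -> str:
    """Replace each detected PII span with a [TYPE] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact ana@example.com, CPF 123.456.789-09"))
# → Contact [EMAIL], CPF [CPF]
```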

Comparing generative AI with traditional Machine Learning (ML)

When comparing Generative AI, such as models like GPT (Generative Pre-trained Transformer), with traditional Machine Learning (ML) approaches, there are several advantages:

  1. Versatility and Adaptability: Generative AI models are pre-trained on large datasets and can adapt to a wide range of tasks without task-specific training. Traditional ML models often require more customization and specific feature engineering for each task.
  2. Contextual Understanding: Generative models, especially language models, have a better grasp of contextual information. They can understand and generate human-like text based on context, making them suitable for natural language understanding and generation tasks.
  3. Few-shot and Zero-shot Learning: Generative models can perform few-shot and zero-shot learning, meaning they can make accurate predictions with very few examples or even without specific examples for a task. Traditional ML models may struggle with limited data.
  4. Continuous Learning: Generative models can be fine-tuned on new data to adapt to specific domains or tasks, allowing for continuous learning and improvement over time. Traditional ML models might require retraining from scratch.
  5. Creativity and Novelty: Generative models can be creative in generating new content or ideas. They are capable of producing novel outputs, making them valuable for creative tasks such as content generation, art, and brainstorming.
  6. Language Understanding: For language-related tasks, Generative AI excels in understanding and generating text. It can handle context, nuances, and varying sentence structures better than traditional ML models.
  7. Reduced Feature Engineering: Generative models often require less explicit feature engineering compared to traditional ML models. They can learn complex patterns and representations from data on their own.

While Generative AI has these advantages, it's important to note that traditional ML approaches still have their place in scenarios with well-defined tasks, large labeled datasets, and where interpretability or explainability is critical. The choice between Generative AI and traditional ML depends on the use case.

Price comparison:

Let's perform the comparison on a large set of documents: 18k docs × 53 pages on average = 954,000 pages per month. Working through the exercise:

  • Textract + Comprehend = $322,071 + $4,293 = $326,364 for the entire monthly cycle
  • Bedrock: 470 tokens per doc × 18k docs per month = 8,460,000 tokens × 1.5 output factor = 12,690,000 tokens × $0.00240 per token = US$30,456 per month
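The arithmetic above can be checked in a few lines; the figures are taken directly from the estimate, reading the rate as $0.00240 per token.

```python
# Reproduce the cost estimate from the comparison above.

ml_cost = 322_071 + 4_293            # Textract + Comprehend, monthly
tokens = 470 * 18_000                # tokens per doc x docs per month
tokens_with_output = tokens * 1.5    # 1.5x output factor
bedrock_cost = tokens_with_output * 0.00240

print(ml_cost)                           # → 326364
print(round(bedrock_cost))               # → 30456
print(round(ml_cost / bedrock_cost, 1))  # → 10.7
```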

In conclusion:

Bedrock comes out almost 10 times cheaper, on top of all the benefits described above. (Note that, at the time of writing, some of the services in both options are not yet available in the São Paulo region.)





