Intelligent Document Processing: Comparing AWS GenAI and ML Services (Part II)
Hi, we will continue with the second part of the article.
Tokens
A token is a unit of text extracted or identified during tokenization, the task of breaking a sequence of text down into smaller units, which can be words, sub-words, characters, or even phrases. These smaller units are the tokens. For typical English text, a token is approximately four characters long.
As an example, the Azure API supports a maximum of 4,000 tokens shared between the prompt (including system message, examples, message history, and user query) and the model's response. As API calls are charged per token, and you can set a maximum limit for response tokens, you should monitor the current token count to ensure the conversation does not exceed the maximum response token limit.
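Since API calls are billed per token and the prompt and response share a single limit, it is useful to estimate token counts before sending a request. A minimal sketch, using the ~4 characters/token rule of thumb mentioned above (the function names and the 4,000-token default are illustrative, not an official API):

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / chars_per_token))

def fits_in_context(prompt: str, max_response_tokens: int,
                    context_limit: int = 4000) -> bool:
    """Check whether the prompt plus the reserved response tokens
    fit inside the shared context limit."""
    return estimate_tokens(prompt) + max_response_tokens <= context_limit
```

A real application should use the model's actual tokenizer rather than this heuristic, but the estimate is enough to decide when to trim message history before a call.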
Token Generation
The first step in analyzing a corpus is to break it down into tokens. For simplicity, you can think of each distinct word in the training text as a token, although in reality tokens can be generated for partial words or for combinations of words and punctuation. The number of tokens in a document is what pricing is based on.
For example, the word "hamburger" is split into the tokens ham, bur, and ger, while a short, common word like "pear" is a single token.
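The subword splitting above can be illustrated with a toy greedy longest-match tokenizer. This is a deliberately simplified sketch with a hand-built vocabulary, not the actual algorithm (such as BPE) used by production models:

```python
def greedy_subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match subword split; falls back to single characters
    when no vocabulary entry matches."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"ham", "bur", "ger", "pear"}
print(greedy_subword_tokenize("hamburger", vocab))  # → ['ham', 'bur', 'ger']
print(greedy_subword_tokenize("pear", vocab))       # → ['pear']
```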
We won't delve deeply into machine learning techniques for text classification in this article, because their complexity (text preparation, logistic regression, and so on) would take considerable time to cover.
LangChain
LangChain is an open-source framework for building applications based on large language models (LLMs). LLMs are large deep learning models, pre-trained on vast amounts of data, that can generate responses to user queries, such as answering questions or creating images from text prompts.
Intelligent Document Processing:
In our problem, we will focus on some of those steps. For the same function, we will compare solutions with and without generative AI: in this case, AWS Bedrock versus the equivalent Textract + Comprehend services. In either case, the documents arrive in S3 (in AWS), and an event is triggered whenever a new document lands. This event starts and executes the IDP steps, which are self-explanatory:
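The S3 event trigger described above can be sketched as a Lambda handler. The event payload shape (`Records[].s3.bucket.name` / `Records[].s3.object.key`) is the standard S3 notification format; the downstream pipeline call is a hypothetical placeholder:

```python
import urllib.parse

def parse_s3_event(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 ObjectCreated event payload."""
    docs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications (spaces become '+')
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        docs.append((bucket, key))
    return docs

def lambda_handler(event, context):
    """Entry point: fan each new document out to the IDP pipeline steps."""
    for bucket, key in parse_s3_event(event):
        # start_idp_pipeline(bucket, key)  # hypothetical downstream call
        print(f"New document s3://{bucket}/{key}")
    return {"statusCode": 200}
```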
Using AWS ML services:
Using AWS GenAI services:
In our architecture, and following the function of this service, we need:
- Redaction of PII (personally identifiable information) and PHI data (if applicable)
- Tagging of the document
- Enrichment of the document's metadata
- Legal retention
- Bedrock + summarization, normalization, Q&A
Comparing generative AI with traditional Machine Learning (ML)
When comparing Generative AI, such as models like GPT (Generative Pre-trained Transformer), with traditional Machine Learning (ML) approaches, there are several advantages:
While Generative AI has these advantages, it's important to note that traditional ML approaches may still have their place in scenarios with well-defined tasks, large labeled datasets, and where interpretability or explainability is critical. The choice between Generative AI and traditional ML depends on the case.
Price comparison:
Let's perform the comparison on a large set of documents: 18,000 docs × 53 pages on average = 954,000 pages per day. Let's do the exercise:
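The volume arithmetic above, plus a helper for plugging in per-page rates, can be written out as follows. The rates themselves are deliberately left as parameters: they are placeholders, not real AWS prices, which vary by region and service:

```python
# Daily page volume from the article's figures
docs_per_day = 18_000
avg_pages_per_doc = 53
pages_per_day = docs_per_day * avg_pages_per_doc
print(pages_per_day)  # → 954000

def daily_cost(pages: int, rate_per_page: float) -> float:
    """Daily cost for a pipeline given a (hypothetical) per-page rate."""
    return pages * rate_per_page
```

Comparing the two pipelines is then a matter of calling `daily_cost(pages_per_day, rate)` with each pipeline's effective per-page rate taken from the current AWS pricing pages.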
In conclusion:
Bedrock is almost 10 times cheaper (in both options, some of the services aren't available in São Paulo at this time), plus it has all the benefits described above.