A Ninja’s approach to Data Extraction

Ninjas have always had an amazing way of dealing with things. The classical stories abound with how they were able to overcome incredible odds and accomplish impossible missions. Some even speak of fantastic powers that ninjas possessed, stealthy and even magical sometimes.

However, if you really understand the ninja’s approach to any problem, it is centered on understanding and really focusing on the fundamentals of what needs to be done. The impossible assignment could be unraveled to its fundamental tasks and would then show the true challenge. The idea would then be to accomplish as much of the mission as possible before even reaching the impossible point.

A Ninja would take any goal that was given to him and...

  • Identify and group the different objectives that needed to be accomplished
  • Map out and mark the important places where action had to take place
  • Gather all the tools, people involved and ideas of the plan
  • Connect the people, tools and ideas together into missions
  • Turn the missions into real stories

Taking the Ninja approach to the challenge of obtaining data from the unorganized and seemingly unquantifiable complexity of documents requires looking at certain fundamentals.

Classify documents/data into meaningful sets

  • Gather the different types of documents and data that need to be extracted. Separate each type of ‘documents+data’ into meaningful sets. This way, each mission to extract data has one central idea that is easy to understand and connects everything together. This also simplifies the quality control complexity later on.
  • Don’t force unconnected or loosely connected data items or document types together. They will only make things more complicated and end up consuming time and energy. Once the simpler missions are accomplished, they can always be connected to the larger goal of full data extraction and connection.
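As a minimal sketch of this classification step, the Python below groups documents into sets using simple keyword rules. The document types, keywords and function names (`RULES`, `classify`, `group_documents`) are hypothetical and chosen only for illustration; a real project would define its own sets and would likely use a richer classifier.

```python
from collections import defaultdict

# Hypothetical rules: each document type lists keywords that must all appear.
RULES = {
    "invoice": ("invoice number", "amount due"),
    "bank_statement": ("opening balance", "closing balance"),
}


def classify(text):
    """Return the first document type whose keywords all appear in the text."""
    lowered = text.lower()
    for doc_type, keywords in RULES.items():
        if all(keyword in lowered for keyword in keywords):
            return doc_type
    # Loosely connected documents are kept apart rather than forced into a set.
    return "unclassified"


def group_documents(docs):
    """Map each document type to the names of the documents assigned to it."""
    groups = defaultdict(list)
    for name, text in docs.items():
        groups[classify(text)].append(name)
    return dict(groups)
```

Each resulting set can then be handled as its own extraction mission, and anything left in `unclassified` can be reviewed separately instead of being pushed into a set where it does not belong.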

Identify the location of the data

  • For each document of a given type, find a way to identify the page and the part of the page where the relevant data might reside. This gives you the location specificity to narrow down the probabilities and also to improve performance.
  • Know the type of location specificity that is being addressed: is it the page, sentence, table, section, paragraph or zone? Determine the most effective location specificity for any given set of fields.
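One simple way to picture location specificity is an anchor phrase that marks where the relevant zone starts. The sketch below assumes each page is already available as plain text; the function name `locate`, the anchor phrase and the `window` parameter are illustrative assumptions, not part of any particular product.

```python
def locate(pages, anchor, window=3):
    """Find the first page and zone that contain an anchor phrase.

    `pages` is a list of page texts. Returns (page_number, zone_lines), where
    zone_lines is the anchor line plus up to `window` lines after it, or None
    if the anchor does not appear on any page.
    """
    needle = anchor.lower()
    for page_number, page in enumerate(pages, start=1):
        lines = page.splitlines()
        for index, line in enumerate(lines):
            if needle in line.lower():
                return page_number, lines[index : index + 1 + window]
    return None
```

Narrowing the search to a page and a small zone means later steps look at a handful of lines instead of the whole document.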

Collect all the pieces of the data

  • Identify and collect all the pieces of the data needed to accomplish the extraction. At this point, the goal is to make sure you have all the relevant pieces and not to worry about having extra candidate values that you might consider noise.
  • It is still too early to think about noise, as much of that clarity will only arise out of the connection between the different pieces. This step is deliberately more inclusive in order to give the maximum chance for the next step to succeed.
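The inclusive collection described above can be sketched with a few regular expressions that pick up every plausible value in a zone. The patterns and the `collect_candidates` name are assumptions made for this example; real documents would need patterns suited to their own formats.

```python
import re

# Hypothetical, deliberately loose patterns: the aim is recall, not precision.
PATTERNS = {
    "amount": re.compile(r"\d+\.\d{2}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def collect_candidates(lines):
    """Collect every pattern match in the given lines, keeping its context."""
    candidates = []
    for line_number, line in enumerate(lines):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                candidates.append(
                    {
                        "kind": kind,
                        "value": match.group(),
                        "line": line_number,
                        "text": line,
                    }
                )
    return candidates
```

Keeping the surrounding line with each candidate is a design choice: it preserves the context needed later to decide which candidates are real fields and which are noise.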

Connect all the pieces together

  • Identify how the pieces are connected together. This is where candidate values turn into fields and records. The relationship between these different elements will automatically eliminate much of the noise.
  • Any remaining noise or other issues can then be addressed in a more pointed manner. Most times, this comes down to a very specific aspect of the use case that shapes the solution. But remember, most of the journey is already covered, and this last bit may require that superhuman effort to make the mission a success.
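To illustrate how connections remove noise, the sketch below turns candidate values into a record by requiring each value to sit on the same line as a field label. The candidate structure, the `FIELD_LABELS` table and the `connect` function are hypothetical; same-line label matching is only one of many possible relationships.

```python
# Hypothetical mapping: field name -> (label text, expected candidate kind).
FIELD_LABELS = {
    "amount_due": ("amount due", "amount"),
    "due_date": ("due date", "date"),
}


def connect(candidates):
    """Build a record from candidates whose line contains the field's label."""
    record = {}
    for field, (label, kind) in FIELD_LABELS.items():
        for candidate in candidates:
            if candidate["kind"] == kind and label in candidate["text"].lower():
                record[field] = candidate["value"]
                break  # keep the first candidate that satisfies the relationship
    return record


candidates = [
    {"kind": "amount", "value": "120.50", "text": "Amount Due: 120.50"},
    {"kind": "date", "value": "2024-01-31", "text": "Due Date: 2024-01-31"},
    {"kind": "amount", "value": "99.00", "text": "Ref 99.00 internal"},
]
record = connect(candidates)
```

Here the stray amount on the reference line never becomes a field, because it has no relationship to any label; nothing had to filter it out explicitly.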

Transform into the final form

  • The final step is to convert the records and their values into a form that business processes can consume conveniently and quickly. The transformation, too, has to be intelligent enough to deliver the equivalent of a finished car that you can drive away from the showroom, rather than just a seat bolted to an engine with a cockpit of panels.
  • Automated data transformation can also save time and energy by simplifying consumption, verification and validation of the output. At the same time, it allows analytics to be plugged directly into this workstream to give a more turnkey feel to the solution.
  • Many basic and common transformation effects for an industry can be provided out of the box to speed things up during deployment and expansion. The important part is to recognize that transformations happen at both value and final format levels.
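The two levels of transformation mentioned above can be sketched separately: one function converts raw strings into typed values, and another renders the typed record in a final output format. The field names and the choice of JSON as the target format are assumptions for this example.

```python
import json
from datetime import date
from decimal import Decimal


def transform_values(record):
    """Value-level transformation: turn extracted strings into typed values."""
    return {
        "amount_due": Decimal(record["amount_due"]),
        "due_date": date.fromisoformat(record["due_date"]),
    }


def to_json(typed_record):
    """Format-level transformation: render the typed record as a JSON string."""
    return json.dumps(
        {
            "amount_due": str(typed_record["amount_due"]),
            "due_date": typed_record["due_date"].isoformat(),
        },
        sort_keys=True,
    )
```

Separating the two levels means the same typed record could be rendered in another format, such as CSV, without touching the value conversions.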

In summary, any data extraction challenge can be resolved to some fundamental level of problem definitions. Once we have this clarity, the solution approach will become clear, as will the more difficult challenges of the task. With this approach, we may be able to cover most of the journey with already available approaches and optimize how we work around the tougher bits.


By Kiran Kumar - CTO, CPO and Co-Founder at UniQreate
