Improve data quality in ODK using a standardized taxonomy


Taxonomy for Social Impact Data

A taxonomy is a hierarchical classification system for organizing information into categories and subcategories. The United Nations Office for the Coordination of Humanitarian Affairs (UNOCHA) has compiled a list of humanitarian terms, available at https://reliefweb.int/taxonomy-descriptions. AGROVOC is a multilingual thesaurus developed and maintained by the Food and Agriculture Organization (FAO). It is one of the most comprehensive taxonomies for organizing and retrieving agricultural and related information (Figure 1).

Figure 1: AGROVOC Multilingual Thesaurus

Source:

Taxonomy for AI

Taxonomies play a crucial role in AI and Large Language Models (LLMs) by organizing information, improving the quality of training data, and enabling domain-specific knowledge. For instance, LLMs can leverage taxonomies to assign labels to data and documents, allowing these labels to more accurately represent the underlying information. This enhances the model's ability to retrieve and reference relevant data points or documents in response to user prompts.

In advanced data collection forms or analytics, where LLMs may hallucinate or produce suboptimal results, a well-structured taxonomy can serve as an effective inferencing aid during training. It guides the model’s understanding and reasoning, improving the accuracy and relevance of its outputs.

Data Collection using ODK

ODK (Open Data Kit) is an open-source suite of software tools for data collection, management, and analysis in field-based environments. It is widely used in sectors like international development, humanitarian aid, research, and public health for gathering structured data in the field. As of 2024, ODK has a substantial global user base, with over 2 million users worldwide, who collectively submit more than 200 million data entries annually.

ODK uses a simple data structure, as shown in the ODK XLS form definition (Figure 2). The main columns are:

  • Type: Defines the type of input (e.g., text, integer, date, select_one, select_multiple, etc.).
  • Name: The variable name for the question (this will be used to store the response in the data).
  • Label: The text that appears to the user as the prompt for the question.
  • Hint: (Optional) A helper text that guides the user on how to answer the question.
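
As a rough illustration of the four columns above, the following Python sketch writes a tiny survey sheet as CSV text. The question names, labels, and the function name survey_to_csv are made up for illustration and do not come from any ODK tool. A real form definition is normally kept in a spreadsheet; CSV is used here only to keep the example free of extra dependencies.

```python
import csv
import io

# Hypothetical questions; names, labels, and hints are illustrative only.
SURVEY_ROWS = [
    {"type": "text", "name": "respondent_name",
     "label": "What is your name?", "hint": "Full name as on ID"},
    {"type": "integer", "name": "respondent_age",
     "label": "How old are you?", "hint": "Age in completed years"},
    {"type": "date", "name": "interview_date",
     "label": "Date of interview", "hint": ""},
]

# The four columns described in the list above.
COLUMNS = ["type", "name", "label", "hint"]


def survey_to_csv(rows):
    """Serialize survey rows to CSV text, one question per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(survey_to_csv(SURVEY_ROWS))
```

Each row is one question: the name column becomes the variable under which responses are stored, which is why consistent naming matters for later analysis.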

Figure 2: Sample ODK XLS form definition

ODK Survey form XLS example

Users often assign arbitrary variable names. Even projects that collect longitudinal data continue to collect it without reference to a taxonomy. Before we propose using a taxonomy, let us consider the overhead of creating and maintaining one.

Creating a Draft Taxonomy

In a given year, most nonprofits work on three to five projects in their focus areas. Most LLMs, including those from OpenAI as well as Claude and Llama, should be able to generate reasonably accurate and comprehensive taxonomies that nonprofits can use as a baseline draft. For example, I prompted an LLM to create a list of categories and keywords for 'Individual demographics' (Figure 3). These keywords, covering identity documents, gender, age, marital status, and related attributes, help describe and collect personal information. Users can pilot this draft taxonomy in their ODK forms and continue to improve it.

Figure 3: Illustrative list of keywords under the category of individual demography

keyword,keywordTitle,description,mainKeywordId
demo_individual_basic,Basic Information,"Core personal identifiers including name, age, and contact details.",root_demographics
demo_individual_identity,Identity Documents,"Government-issued identification documents and legal recognition status.",root_demographics
demo_individual_age,Age Information,"Age-related data including date of birth and age group classification.",root_demographics
demo_individual_gender,Gender Information,"Gender identity and related demographic characteristics.",root_demographics
demo_individual_marital,Marital Status,"Current marital status and relationship information.",root_demographics
demo_individual_social,Social Category,"Social classification including caste, tribe, and minority status.",root_demographics
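
One way to pilot such a draft is to check form variable names against it. The sketch below is an assumption about how that could be done and is not part of ODK: it parses a subset of the rows from Figure 3 with Python's csv module and reports which question names are not taxonomy keywords. The function names load_taxonomy and unknown_names, and the sample question name q7_age, are hypothetical.

```python
import csv
import io

# A subset of the draft taxonomy rows from Figure 3.
TAXONOMY_CSV = """keyword,keywordTitle,description,mainKeywordId
demo_individual_basic,Basic Information,"Core personal identifiers including name, age, and contact details.",root_demographics
demo_individual_age,Age Information,"Age-related data including date of birth and age group classification.",root_demographics
demo_individual_gender,Gender Information,"Gender identity and related demographic characteristics.",root_demographics
"""


def load_taxonomy(csv_text):
    """Parse taxonomy CSV text into a dict keyed by the keyword column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["keyword"]: row for row in reader}


def unknown_names(question_names, taxonomy):
    """Return the question names that are not keywords in the taxonomy."""
    return [name for name in question_names if name not in taxonomy]


if __name__ == "__main__":
    taxonomy = load_taxonomy(TAXONOMY_CSV)
    # Hypothetical form: one name follows the taxonomy, one is ad hoc.
    form_names = ["demo_individual_age", "q7_age"]
    for name in unknown_names(form_names, taxonomy):
        print(f"Not in taxonomy: {name}")
```

A check like this could run before a form is published, so that ad hoc names are caught while they are still cheap to change.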

Next Steps

In the next steps, we will discuss making the taxonomy programmatically available for ODK form development and for analytics.

Related article

Developing a Simple Taxonomy API using Node.js, Prisma ORM, and GraphQL for use with ODK

References

AGROVOC. (2025, January 2). AGROVOC: farmer field schools. https://agrovoc.fao.org/browse/agrovoc/en/page/c_331069

TrustRadius. (2025, January 2). ODK Overview. TrustRadius. https://www.trustradius.com/products/odk-open-data-kit

