Improve data quality in ODK using a standardized taxonomy
Taxonomy for Social Impact Data
A taxonomy is a hierarchical classification system for organizing information into categories and subcategories. The United Nations Office for the Coordination of Humanitarian Affairs (UNOCHA) has compiled a list of humanitarian terms, available at https://reliefweb.int/taxonomy-descriptions. AGROVOC is a multilingual thesaurus developed and maintained by the Food and Agriculture Organization (FAO). It is one of the most comprehensive taxonomies for organizing and retrieving agricultural and related information (Figure 1).
Figure 1: AGROVOC Multilingual Thesaurus
Taxonomy for AI
Taxonomies play a crucial role in AI and Large Language Models (LLMs) by organizing information, improving the quality of training data, and enabling domain-specific knowledge. For instance, LLMs can leverage taxonomies to assign labels to data and documents, allowing these labels to more accurately represent the underlying information. This enhances the model's ability to retrieve and reference relevant data points or documents in response to user prompts.
In the development of advanced data collection forms or analytics, where LLMs may hallucinate or produce suboptimal results, a well-structured taxonomy can serve as an effective reference framework during training. It guides the model's understanding and reasoning, improving the accuracy and relevance of its outputs.
Data Collection using ODK
ODK (Open Data Kit) is an open-source suite of software tools for data collection, management, and analysis in field-based environments. It is widely used in sectors like international development, humanitarian aid, research, and public health for gathering structured data in the field. As of 2024, ODK has a substantial global user base, with over 2 million users worldwide, who collectively submit more than 200 million data entries annually.
ODK uses a simple data structure, as shown in the sample ODK XLS form definition (Figure 2).
Figure 2: Sample ODK XLS form definition
In practice, users often give form fields arbitrary variable names. Even projects that collect longitudinal data continue to do so without reference to a taxonomy. Before proposing a taxonomy, let us consider the overhead of creating and maintaining one.
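To make the naming problem concrete, here is a minimal sketch of how one might flag arbitrary field names in a form definition. The function name `flag_arbitrary_names`, the regular expression, and the sample names are all illustrative assumptions, not part of ODK; a real audit would likely need a broader rule set.

```python
import re

# Assumed heuristic: a name is "arbitrary" if it is only a number,
# or a generic prefix (q, v, var, field, question) followed by a number.
GENERIC_NAME = re.compile(r"^(q|v|var|field|question)?_?\d+$", re.IGNORECASE)


def flag_arbitrary_names(names):
    """Return the field names that carry no descriptive meaning."""
    return [name for name in names if GENERIC_NAME.match(name)]


if __name__ == "__main__":
    # Hypothetical values from the 'name' column of a form's survey sheet.
    sample = ["q1", "var_2", "age", "field3", "marital_status", "17"]
    print(flag_arbitrary_names(sample))
```

A check like this only detects names that are obviously generic; it cannot tell whether a descriptive name is used consistently across forms, which is the gap a shared taxonomy is meant to close.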
Creating a Draft Taxonomy
In a given year, most nonprofits tend to work on three to five projects in their focus areas. Most LLMs, including OpenAI's models, Claude, and Llama, should be able to generate reasonably accurate and comprehensive taxonomies that nonprofits can use as a baseline draft. For example, I prompted an LLM to create a list of categories and keywords for 'Individual demographics' (Figure 3). These keywords, covering identity documents, gender, age, marital status, and related attributes, can help describe and collect personal information. Users can pilot this draft taxonomy in their ODK forms and continue to improve it.
Figure 3: Illustrative list of keywords under the category of individual demographics
keyword,keywordTitle,description,mainKeywordId
demo_individual_basic,Basic Information,"Core personal identifiers including name, age, and contact details.",root_demographics
demo_individual_identity,Identity Documents,"Government-issued identification documents and legal recognition status.",root_demographics
demo_individual_age,Age Information,"Age-related data including date of birth and age group classification.",root_demographics
demo_individual_gender,Gender Information,"Gender identity and related demographic characteristics.",root_demographics
demo_individual_marital,Marital Status,"Current marital status and relationship information.",root_demographics
demo_individual_social,Social Category,"Social classification including caste, tribe, and minority status.",root_demographics
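As a rough illustration of how the draft taxonomy in Figure 3 could be piloted in a form, the sketch below parses the CSV, groups keywords under their parent, and builds survey rows whose names are prefixed with a taxonomy keyword. The `type`, `name`, and `label` columns follow the XLSForm survey sheet; the helper names (`load_taxonomy`, `children_by_parent`, `survey_row`) and the `<keyword>_<field>` naming convention are assumptions made for this example, not an ODK feature.

```python
import csv
import io
from collections import defaultdict

# The Figure 3 taxonomy, embedded so the example is self-contained.
TAXONOMY_CSV = '''keyword,keywordTitle,description,mainKeywordId
demo_individual_basic,Basic Information,"Core personal identifiers including name, age, and contact details.",root_demographics
demo_individual_identity,Identity Documents,"Government-issued identification documents and legal recognition status.",root_demographics
demo_individual_age,Age Information,"Age-related data including date of birth and age group classification.",root_demographics
demo_individual_gender,Gender Information,"Gender identity and related demographic characteristics.",root_demographics
demo_individual_marital,Marital Status,"Current marital status and relationship information.",root_demographics
demo_individual_social,Social Category,"Social classification including caste, tribe, and minority status.",root_demographics
'''


def load_taxonomy(text):
    """Parse taxonomy CSV text into {keyword: row_dict}."""
    return {row["keyword"]: row for row in csv.DictReader(io.StringIO(text))}


def children_by_parent(taxonomy):
    """Group keywords under their parent (mainKeywordId)."""
    tree = defaultdict(list)
    for keyword, row in taxonomy.items():
        tree[row["mainKeywordId"]].append(keyword)
    return dict(tree)


def survey_row(taxonomy, keyword, field, question_type, label):
    """Build one survey-sheet row named '<keyword>_<field>' (assumed convention)."""
    if keyword not in taxonomy:
        raise KeyError(f"'{keyword}' is not in the taxonomy")
    return {"type": question_type, "name": f"{keyword}_{field}", "label": label}


if __name__ == "__main__":
    taxonomy = load_taxonomy(TAXONOMY_CSV)
    print(children_by_parent(taxonomy))
    # Hypothetical questions tied to taxonomy keywords.
    print(survey_row(taxonomy, "demo_individual_age", "dob", "date", "Date of birth"))
    print(survey_row(taxonomy, "demo_individual_marital", "status", "text", "Marital status"))
```

Rejecting keywords that are not in the taxonomy is a deliberate choice here: it forces new concepts to be added to the taxonomy first, so field names stay traceable across forms and survey rounds.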
Next Steps
As a next step, we will discuss how to make the taxonomy programmatically available for use in ODK form development and across analytics.
References
AGROVOC. (2025, January 2). AGROVOC: farmer field schools. https://agrovoc.fao.org/browse/agrovoc/en/page/c_331069
TrustRadius. (2025, January 2). ODK Overview. TrustRadius. https://www.trustradius.com/products/odk-open-data-kit