Natural Keys vs. Surrogate Keys
@cdisc

Natural Keys vs. Surrogate Keys

A natural key is a set of data that uniquely identifies an entity and distinguishes it from any other row in the table. The advantage of natural keys is that they exist already, so there’s no need to introduce a new value to the data schema. However, one of the difficulties in choosing a natural key is that just about any natural key one can think of has the potential to change. Because they have business meaning, natural keys are effectively coupled to the business, and they may need to be reworked when business requirements change. For example, the addition of a position or location that becomes a key in a new study, but which was not collected in previous studies, could require reworking the natural key in clinical trials data.

A surrogate key is a single-part, artificially established identifier for a record. Surrogate key assignment is a special case of derived data, one where a portion of the primary key is derived. A surrogate key is immune to changes in business needs. In addition, the key depends on only 1 field, so it is compact. A common way of deriving surrogate key values is to assign integer values sequentially. The --SEQ variable in the SDTM datasets is an example of a surrogate key for most datasets; in some instances, however, --SEQ might be a part of a natural key as a replacement for what might have been a key (e.g., a repeat sequence number) in the sponsor’s database .

  • #CDISC
  • #ClinicalDataInterchangeStandardsConsortium
  • #SDTM
  • #SDTMImplementationGuide
  • #ClinicalTrials
  • #ClinicalDataManagement
  • #DataStandardization
  • #DataQuality
  • #DataIntegrity
  • #NaturalKeys
  • #SurrogateKeys

Mazi ?? Ntintelo Sunil Gupta Athenkosi Nkonyeni CDISC 赛仕软件


Allan B.

SAS App Migration, Modernisation, and Manifestation

1 年

UUIDs make better surrogate keys when the table is large, and the loads are mainly inserts rather than updates (no need to calculate or maintain the highest integer) Surrogate keys are also invaluable when your natural keys contain null values

Jozef Aerts

Passionate about standards in clinical research and healthcare, and their implementation in IT systems.

1 年

--SEQ should essentially not be in SDTM. It needs to be re-calculated (post-execution) each time the number of records is increased or decreased. It also is used as a "lazy way" to connect records in different datasets with each other (--SPID is a better way) and especially to connect with the "non-standard variable" records in the corresponding SUPPxx. The latter should also not exist. They are only there as (the tools of) many reviewers are not capable to distinguish between a standard variable and a non-standard variable. And then there is also the 200-character limitation of XPT ... But as long as reviewers are uncabable to use natural keys (though defined in the define.xml) I am afraid we will need surrogate keys in SDTM, SEND and ADaM...

要查看或添加评论,请登录

? Daniel Wanjiru的更多文章

社区洞察

其他会员也浏览了