GPT May Help Developers Understand Code Base

Getting a unified view of a complex tech stack is hard. Here is why.

Name variations occur

Name variations of a concept signal in a complex code stack (for example, an ML application) cause a lot of pain when trying to understand that signal's behavior. If we want to add "article count" as a new signal, then from input parsing to training data preparation, all the way to model deployment, we would have to touch many different modules, and the same article count would have a different representation in each environment: for example, article_count, articleCount, ArticleCount, ARTICLE_COUNT, articlecount, etc. Often, for compatibility purposes, there may also be version info such as article_count_v1. And there may be mapping logic in a config file such that article count is mapped to an entirely different name, like KnowledgeArticleVoteCount.
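To see why this is painful, here is a minimal sketch (all identifiers are hypothetical) showing that naive string normalization collapses the case, underscore, and version variants, but cannot unify a config-level alias like KnowledgeArticleVoteCount:

```python
import re

# Hypothetical variants of the same "article count" signal across the stack.
variants = [
    "article_count", "articleCount", "ArticleCount",
    "ARTICLE_COUNT", "articlecount", "article_count_v1",
]

def naive_normalize(name: str) -> str:
    """Lowercase, strip underscores, and drop a trailing version suffix."""
    name = re.sub(r"_?[vV]\d+$", "", name)  # article_count_v1 -> article_count
    return name.replace("_", "").lower()

# Case/underscore/version variants all collapse to one key...
assert {naive_normalize(v) for v in variants} == {"articlecount"}

# ...but a config-level alias does not, so pure string rules are not enough.
assert naive_normalize("KnowledgeArticleVoteCount") != "articlecount"
```

This is exactly the gap a semantics-aware model can fill: the alias is only connectable through meaning, not spelling.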

Name variations cause friction

The different identifier strings of the same entity usually have strong semantic similarities; after all, referring to the same entity in a consistent manner improves code readability. However, traditional code search does not recognize these similarities, so it is tricky to surface all related code in a single view. The resulting segmented views of the same entity cause visibility problems and make code understanding harder.

For a manual code study of a single entity, this may not be a big problem, but it makes automation much harder. Suppose one wants to study many such signals, each with its own variations: how can we spot outlier signals, given that an outlier can only be judged by the emergent properties of a signal across the stack? Without effectively stitching the contexts together under one unified identity, it is really hard.

The Need for Text Simplification

Suppose we want to study N conceptual signals that span M code components, and suppose each component has its own way of referring to the same concept signal. Then there are N*M different representations that refer to N unique things.

To effectively study the N signals, we cannot work with N*M representations; rather, we need to find their connections and their normative representations.
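Concretely, once a raw-to-normative mapping exists, the N*M strings collapse back to N keys. A hypothetical sketch (the identifier names and components are illustrative, not from a real code base):

```python
# Map each of the N*M raw identifiers to its normative representation.
raw_to_normative = {
    "article_count": "ArticleCount",   # e.g. input parsing
    "articleCount":  "ArticleCount",   # e.g. training data prep
    "ARTICLE_COUNT": "ArticleCount",   # e.g. deployment config
    "file_name":     "FileName",
    "fileNameV1":    "FileName",
}

# Studying the N signals now means iterating over N keys, not N*M strings.
signals = set(raw_to_normative.values())
assert signals == {"ArticleCount", "FileName"}  # N = 2, not N*M = 5
```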

GPT Helps with Text Simplification

I tried the text-to-command model in the OpenAI Playground:

article_count normative representation is ArticleCount.
articleCount normative representation is ArticleCount.
articleCountV1 normative representation is ArticleCount.
fileNameV1 normative representation is [FileName.]

(Brackets represent the auto-completed text by GPT.)

fileNameV1 is correctly auto-completed to FileName, given my pre-staged text.

It is straightforward to run the same auto-completion over the rest of the N*M representations, so that they are linked together into N things rather than remaining N*M isolated things.
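The few-shot setup above can be scripted. Here is a sketch that builds the same kind of priming prompt; the actual call to a completion endpoint is left out because the client API varies by SDK version, and the example names are the ones from the Playground session:

```python
# Few-shot examples priming the model with the normalization pattern.
EXAMPLES = [
    ("article_count", "ArticleCount"),
    ("articleCount", "ArticleCount"),
    ("articleCountV1", "ArticleCount"),
]

def build_prompt(raw_name: str) -> str:
    """Assemble the priming text plus an open-ended line for raw_name,
    which the model is expected to complete with the normative name."""
    lines = [f"{raw} normative representation is {norm}."
             for raw, norm in EXAMPLES]
    lines.append(f"{raw_name} normative representation is")
    return "\n".join(lines)

prompt = build_prompt("fileNameV1")
# Sent to a completion model, the last line should be finished as "FileName."
```

Looping build_prompt over all N*M raw identifiers and collecting the completions yields the raw-to-normative mapping automatically.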

When Complexity Fades, Emergent Properties Shine

After one correlates the same thing across contexts, a 360-degree view of that thing can be established.

This 360-degree view can take many forms:

  1. A central repo that tracks each signal's cross-stack life cycle
  2. Flow charts that track each signal's cross-stack life cycle
  3. Or even AI summarization of each signal's story, based on its cross-stack appearances
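As a minimal sketch of form 1, grouping cross-stack occurrences by normative name is all it takes to assemble a per-signal view (the file names and mapping here are hypothetical):

```python
from collections import defaultdict

# (raw identifier, component) pairs, e.g. gathered by plain code search.
occurrences = [
    ("article_count", "input_parser.py"),
    ("articleCount",  "feature_pipeline.py"),
    ("ARTICLE_COUNT", "deploy_config.yaml"),
]

# The GPT-derived raw-to-normative mapping.
normalize = {
    "article_count": "ArticleCount",
    "articleCount":  "ArticleCount",
    "ARTICLE_COUNT": "ArticleCount",
}

# One entry per signal, listing every component it touches.
view = defaultdict(list)
for raw, component in occurrences:
    view[normalize[raw]].append(component)

assert view["ArticleCount"] == [
    "input_parser.py", "feature_pipeline.py", "deploy_config.yaml",
]
```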

With that, we can spend more of our time studying the emergent properties of the full system:

  1. Understanding a signal's SLA
  2. Security review
  3. Architectural soundness

Takeaway

  • Name variations cause friction in code reading, and they can block advanced code study across the tech stack.
  • Using GPT correctly, we can cost-effectively eliminate name variations and thereby build connections across a siloed code base.
  • Emergent properties only show up with a unified view.
