ControlFlag: What Can We Learn From One Billion Lines of Code?


Justin Gottschlich is a 20-year veteran in computer software and hardware research and development. He founded and directs the Machine Programming Research (MPR) team at Intel Labs. Below, Justin shares insights into his team's innovative research aimed at inventing tomorrow's technology.

Highlights:

  • ControlFlag is a novel self-supervised machine programming system from Intel Labs that can autonomously detect coding anomalies in source code control structures, and works with any programming language that has control structures.
  • By applying the emerging concept of semi-trust, ControlFlag makes fuller use of self-supervised learning, enabling it to learn from over one billion lines of unlabeled source code.

In a world increasingly run by software, developers spend a disproportionate amount of time fixing bugs rather than coding new concepts. ControlFlag, a novel self-supervised machine programming (MP) research system that autonomously detects errors in control structure code, shows promise as a powerful productivity tool to assist software developers. In preliminary tests conducted at Intel Labs, ControlFlag trained on more than one billion unlabeled lines of semi-trusted, production-quality code and learned to detect novel defects.

Niranjan Hasabnis, a senior research scientist in our Intel Labs’ Machine Programming Research group, presented our paper ControlFlag: A Self-supervised Idiosyncratic Pattern Detection System for Software Control Structures at the ML for Systems workshop at NeurIPS 2020.

As heterogeneous computing architectures become more ubiquitous, the software required to fully exercise such heterogeneous hardware is becoming increasingly complex, creating a greater likelihood for bugs. A study from Gartner found that the IT industry spent an estimated $3.9 trillion on software development in 2020, while research from the University of Cambridge showed that roughly 50% of the IT budget is spent on debugging code.

Machine programming, which focuses on the automation of software and hardware development, aims to help programmers effectively keep pace with this heterogeneity revolution. When fully realized, ControlFlag (and other systems like it) could help alleviate the bug drain on resources by automating testing, monitoring, and program repair. In addition to identifying bugs in the control structures of high-level programming languages, ControlFlag produces suggestions as corrections for the potential errors. In the future, the ControlFlag team aims to generalize ControlFlag to find latent bugs in all software, so it can be used in any programming language, even languages used to design hardware.

Self-supervised Machine Programming

Public GitHub open-source software repositories are often key in providing the data needed to effectively train MP systems for various code analysis systems. However, such repositories often lack the labels that are generally required for more traditional machine learning systems, such as those used in supervised learning systems. Because of this, many of today’s code reasoning systems can only train on up to tens of millions of lines of code – the ones with manually added labels.

Through research, our team found that certain environmental variables can be combined into a consensus-based approach that assigns a confidence level of trust to unlabeled code. Using this consensus formula, ControlFlag trains only on GitHub repositories that meet the user's minimum required confidence level. This has the byproduct of increasing the number of lines of code that can be learned from by more than two orders of magnitude over most state-of-the-art MP systems: from the tens of millions of lines typically used for training to over a billion. Consequently, ControlFlag is able to mine patterns from more numerous and more diverse repositories, helping it avoid code reasoning bias, where a single repository may incorrectly steer the system's learning. Code reasoning bias can limit generalized learning when certain repositories are learned from in isolation, such as the datasets commonly used in today's MP research (e.g., POJ-104, LeetCode, and Google Code Jam).
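As a rough illustration of this idea, the sketch below combines a few repository signals into a single confidence score and keeps only repositories above a user-chosen threshold. The specific signals, weights, and caps are hypothetical and are not ControlFlag's actual consensus formula.

```cpp
// A minimal sketch (not ControlFlag's real formula) of a consensus-based
// semi-trust score: several repository signals are normalized and combined,
// and only repositories above a user-chosen confidence level are kept.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct RepoSignals {
    std::string name;
    int stars;
    int forks;
    int contributors;
};

// Hypothetical weighted consensus: each signal contributes a capped,
// normalized vote toward the overall confidence in the repository's code.
double semi_trust_confidence(const RepoSignals& r) {
    double star_vote    = std::min(r.stars, 1000) / 1000.0;
    double fork_vote    = std::min(r.forks, 500) / 500.0;
    double contrib_vote = std::min(r.contributors, 100) / 100.0;
    return 0.5 * star_vote + 0.3 * fork_vote + 0.2 * contrib_vote;
}

int main() {
    std::vector<RepoSignals> candidates = {
        {"widely-used-lib", 4200, 900, 150},
        {"weekend-project", 3, 0, 1},
    };
    const double min_confidence = 0.25;  // user-supplied threshold

    for (const auto& repo : candidates) {
        double c = semi_trust_confidence(repo);
        if (c >= min_confidence)
            std::cout << "train on " << repo.name << " (confidence " << c << ")\n";
        else
            std::cout << "skip " << repo.name << " (confidence " << c << ")\n";
    }
}
```

The key design point is that trust is graded rather than binary: no single repository is assumed to be perfectly correct, but a large collection of reasonably trusted code is sufficient to learn which patterns dominate.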

ControlFlag's approach recasts typographic coding errors as anomalies: a self-supervised system trained on a large enough set of semi-trusted code will, if successful, automatically learn which idiosyncratic patterns are acceptable (i.e., non-anomalous) and which are not (i.e., anomalous), without any labeled supervision.

Intel has started evaluating ControlFlag internally to identify bugs in its own software and firmware product development. It is a key element of Intel's Rapid Analysis for Developers (RAD) project, led by principal engineer Jim Baca and research scientist Joe Tarango. RAD aims to accelerate software development and debugging by providing expert-level automated assistance. RAD currently targets firmware with our MP debugging systems, given that firmware bugs can be highly complex, sometimes requiring weeks of debugging effort to track down and fix a single bug.

How ControlFlag Works


ControlFlag was designed to focus solely on software defects found in programming-language control structures. These structures alter a program's execution behavior, typically based on whether a logical condition is satisfied. Control structures were chosen as the focus of ControlFlag after the team evaluated hundreds of bugs and found that a large majority of them were errors in control structures, rather than in non-controlling code (i.e., normal program instructions). Moreover, because ControlFlag is not given any rules about the types of errors it should identify, it can learn to identify a potentially unlimited number of defect types. This has led to some rather surprising results that even the ControlFlag development team wasn't anticipating, such as ControlFlag learning, entirely on its own, how memory is accessed through pointers in the C programming language.
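For intuition, here is a small, hypothetical C++ example of the kind of control-structure typo such a system can surface. The variable name and constant are made up; the point is that the assignment form of the condition is statistically rare in production code relative to the equality form, so a pattern-based detector can flag it as anomalous.

```cpp
#include <iostream>

int main() {
    int retries = 0;

    // Idiosyncratic and suspicious: an assignment inside the condition.
    // Because "if (x == N)" overwhelmingly dominates "if (x = N)" in
    // production code, a pattern-based detector can flag this as anomalous.
    if (retries = 7) {  // likely a typo: always true, and silently overwrites retries
        std::cout << "giving up after " << retries << " retries\n";
    }

    // The far more common (non-anomalous) idiom the tool would suggest:
    if (retries == 7) {
        std::cout << "giving up after " << retries << " retries\n";
    }
}
```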

ControlFlag consists of two main phases: (1) code pattern mining (i.e., machine learning training) and (2) code pattern scanning (i.e., machine learning inference). The code pattern mining phase learns the common (and uncommon) idiosyncratic coding patterns found in the user-specified GitHub repositories and, when complete, generates a precedence dictionary that separates acceptable from unacceptable idiosyncratic patterns based on user-supplied anomaly thresholds. The code pattern scanning phase then takes what ControlFlag has learned and applies it to other repositories where anomalous idiosyncratic patterns may exist.
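The sketch below is a minimal, hypothetical rendering of that two-phase flow: mining counts how often each condition pattern occurs in the training corpus, and scanning flags any pattern whose share of the dictionary falls below a user-supplied anomaly threshold. Representing patterns as raw condition strings is a simplification for illustration; the real system works over parsed representations of the conditions rather than literal text.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Phase 1 (mining): count how often each condition pattern appears in the
// semi-trusted training corpus, producing a simple precedence dictionary.
std::unordered_map<std::string, int>
mine_patterns(const std::vector<std::string>& training_conditions) {
    std::unordered_map<std::string, int> dictionary;
    for (const auto& cond : training_conditions) ++dictionary[cond];
    return dictionary;
}

// Phase 2 (scanning): flag patterns whose share of the dictionary falls
// below a user-supplied anomaly threshold.
void scan(const std::unordered_map<std::string, int>& dictionary,
          const std::vector<std::string>& target_conditions,
          double anomaly_threshold) {
    long total = 0;
    for (const auto& entry : dictionary) total += entry.second;

    for (const auto& cond : target_conditions) {
        auto it = dictionary.find(cond);
        double share = (it == dictionary.end())
                           ? 0.0
                           : static_cast<double>(it->second) / total;
        if (share < anomaly_threshold)
            std::cout << "anomalous pattern: " << cond << "\n";
    }
}

int main() {
    // Toy corpus: in practice, patterns abstract away variable names.
    std::vector<std::string> training = {
        "if (x == N)", "if (x == N)", "if (x == N)",
        "if (ptr != NULL)", "if (ptr != NULL)", "if (x = N)"};
    auto dictionary = mine_patterns(training);

    std::vector<std::string> target = {"if (x == N)", "if (x = N)"};
    scan(dictionary, target, /*anomaly_threshold=*/0.2);
}
```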

For training, our team chose the top 6,000 open-source GitHub repositories that used C/C++ as their primary programming language and had received a minimum of 100 stars. In our initial version of ControlFlag, GitHub stars were used as the mechanism to infer quality, or semi-trust, of the code used for training, but many other environmental factors could also serve as signals in a semi-trust consensus, such as the number of repository forks, number of repository commits, number of contributors, and so forth.

ControlFlag’s code pattern scanning phase consists of analyzing a given source code repository against the learned idiosyncratic patterns dictionary. When anomalous patterns are identified, ControlFlag notifies the user and provides them with possible alternatives.

Some Early ControlFlag Results

ControlFlag’s parallel learning system, currently using dozens of worker threads executed on a distributed Intel Xeon CPU cluster, enable it to learn idiosyncratic coding patterns from over one billion lines of source code in only about two hours of training time. By analyzing C/C++ code, ControlFlag was about to identify previously undetected anomalies (as well as some unusual programming styles) in many widely used open source projects, such as OpenSSL and CURL, which are of production-quality grade. We believe these early results are promising.

One such example was an anomalous condition ControlFlag identified in CURL. We reported it to the CURL team after we reviewed the finding. The CURL development team then analyzed the report and agreed that the code was not only obfuscated but was erroneously used in other locations in CURL. The CURL team has since redesigned this segment of code and integrated the redesign into the latest version, correcting the issue that ControlFlag identified.

The team plans to open source a beta version of the ControlFlag system later this year.

The Future

The field of machine programming is rapidly evolving due to advances in machine learning, formal methods, data availability, and computing efficiency. In addition to automatic bug detection, other recent MP tools include automatic code generators, code recommendation and semantic similarity systems, automatic generation of performance regression tests, and automatic completion of program constructs in integrated development environments. Our team is deeply investigating each of these areas.

With ControlFlag and other MP systems like it, I envision a future where programmers spend less time on debugging and more time doing what I believe they do best: expressing their creative intentions to machines. This notion follows from our MP team's vision paper (joint with MIT) called The Three Pillars of Machine Programming. The three pillars are intention, invention, and adaptation. ControlFlag can be thought of as a tool that aims to comprehend human intention as it has been expressed in source code, a form of adapted intention (i.e., adaptation). Once that adapted intent is learned, ControlFlag attempts to identify deviations from it, as expressed through anomalous code. While ControlFlag is currently limited to learning basic syntactic anomalies, the team is working to expand its feature set to incorporate semantics learning through our code semantic similarity system, MISIM (joint with MIT and Georgia Tech). If successful, ControlFlag may be able to reason about and identify more complex deviations from the programmer's intent.

The three pillars have become fundamental to all the research and engineering we do in machine programming at Intel Labs. To our pleasant surprise, we are also seeing it adopted by many teams inside and outside of Intel.
