
Test Driving Code Llama-2 & To Be or not to Be: What is Understanding?


Recent developments in the space of Large Language Models have been both turbulent and exciting. During this time, I have been fortunate to work extensively with a fusion of deep-learning graphs and Large Language Models. I have also long held that in IT the era of coding is in its twilight years and that the era of models is upon us. My view has been that models will, in many areas, subsume code outright, rather than that we will need models to generate ever more code. I remember being fascinated when, shortly after GPT was first released years ago, I managed to coax a compilable "Hello World" in C++ out of it. Since then we have seen the emergence of GitHub's Copilot and, just last month (August 2023), Meta's announcement of Code Llama under a permissive commercial license.

There are many issues surrounding A.I.-based code generation, not least of all that under U.S. copyright law only creative works of authorship (by a human) are protectable under the relevant copyright statutes. This leaves generated code out in the cold. In February of this year the US Copyright Office declared that images created for the comic book Zarya of the Dawn using the AI-powered Midjourney image generator should not have been granted copyright protection, and that the images' copyright protection would be revoked. Japan ruled in the middle of this year (2023) that datasets used in training A.I. models are outside the scope of copyright. This raises a host of questions in the corporate sphere, where employers normally seek to own the copyright of work created by their employees. What, for instance, happens if a corporation's intellectual property gets "out into the wild" and into the hands of third parties who have signed no agreements around confidential information, the last line of defence outside of patents and copyright? Such parties would not be bound by copyright statutes, presumably even in the presence of so-called "license agreements." This situation already exists with respect to the licenses readily attached to models on Hugging Face, including Meta's Llama license. The prevailing view seems to be: "Everyone has a license. We need one too. Who knows how courts will rule..."

A research paper entitled "Machine Learning Models Under the Copyright Microscope" (Max Planck Institute for Innovation & Competition Research Paper No. 21-02) has this to say: "All proprietary and open-source software licensing relies on copyright protection. In most open licenses, if the license is applied to something that is not protected by copyright (or related rights) the license is not triggered." This has potentially wide-ranging implications.

Another legal framework which might apply is contract law. Contract law, however, envisages the triad of offer, acceptance and consideration for any agreement to actually form a contract. Clicking "Accept" on its own does not create a contract. The wording of the GNU GPL is informative here: "5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License." In short, the simple act of embedding a license agreement in a subfolder of a software archive does not create any obligation on the recipient of that license. Rather, the license provides the only avenue for the recipient to not violate copyright law: either they have accepted it or they have violated the relevant copyright law. Non-applicability of copyright law to machine learning models would be a really big deal. There are already over 120 thousand models on Hugging Face, likely all in a legal vacuum, along with the content (here code) generated by these models.

There are other practical considerations from a software engineering point of view, such as the fact that most time on a project is spent not on writing code but on architecting, debugging and maintaining it. Tools which make code generation easy while remaining oblivious to architectural considerations may not be economical over the long run of a project. Is anyone going to ask Copilot to regenerate the entire project?

All this said, we must acknowledge the incredible feat of Large Language Models in being able to generate even reasonable-looking code from human natural language prompts. Symbolic A.I. never got this far. We have, indeed, never been closer to the notion of A.G.I., or Artificial General Intelligence, than we are with Large Language Models.

How good, then, are Large Language Models really at generating code? What are their limitations? Which limitations are due to model size or quantization, and which are inherent? Is the code correct? Is it elegant? Is it efficient? As a C++ programmer of many years, I have my favourite puzzles from Project Euler. C++, of course, exists for efficiency. Efficiency is paramount. We don't want average; we want the top. That is, after all, why we use the language.

All experiments were undertaken using Hugging Face's Code Llama Playground.

Let's start.

Goldbach's Other Conjecture

Code Llama Chat Session Follows:

Evaluation: While the verbal proof may sound convincing to the uninitiated, it is also wrong. In mathematics, a composite number is a positive integer that is the product of two smaller positive integers, so it can be neither 1 nor a prime. As 37 is prime, it cannot be the answer. Worse, the elaboration makes clear that a prime factorization of 37 is indeed sought: "while loop to find the prime factorization of 37". A non-trivial factorization of a prime was never going to be found! Furthermore, the answer does not attempt to illuminate the question. A model answer would include something like a sieve of Eratosthenes to check a range of numbers for primality and then test the candidates to arrive at a solution. Instead, the solution offered by Code Llama proposes an answer verbally first (a conjecture) and then introduces a C++ test to support that conjecture, which turns out to be false. But the test failed to detect that. It all sounded so confident...
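For illustration, a minimal sketch of the sieve-based approach just described might look like the following. The search bound of 10,000 is my own assumption and not part of the puzzle statement.

#include <iostream>
#include <vector>

// Smallest odd composite that cannot be written as a prime plus twice a
// square. A sieve of Eratosthenes answers primality queries in O(1) after
// an O(n log log n) precomputation.
int main()
{
    const int limit = 10000;                      // assumed search bound
    std::vector<bool> isPrime(limit, true);
    isPrime[0] = isPrime[1] = false;
    for (int i = 2; i * i < limit; ++i)
        if (isPrime[i])
            for (int j = i * i; j < limit; j += i)
                isPrime[j] = false;

    for (int n = 9; n < limit; n += 2)            // odd numbers from 9 upwards
    {
        if (isPrime[n]) continue;                 // only composites qualify
        bool expressible = false;
        for (int k = 1; n - 2 * k * k > 0; ++k)
            if (isPrime[n - 2 * k * k]) { expressible = true; break; }
        if (!expressible)
        {
            std::cout << n << std::endl;          // the smallest counterexample
            return 0;
        }
    }
    std::cout << "No counterexample below " << limit << std::endl;
    return 0;
}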

When this problem is offered to candidates during job interviews, it is hoped that the candidate demonstrates an efficient treatment of primality and modern use of the C++ language to obtain the best big-O solution possible. C++ is employed where optimum efficiency is desired or required, and the benefit of using the language is therefore negated when naïve solutions are employed.

Double-Base Palindromes

Code Llama Chat Session Follows:

Evaluation: The first observation and question would be why one might build a one-million-element data structure simply to loop over it. Only one integer as a loop variable is required here, as the solution offered is imperative, not functional. The declaration uses a functional idiom; the implementation is an imperative one. This is precisely one million times as inefficient in terms of memory allocation as the model answer. If this sounds pedantic, remember that we are only looking at a very small example. How would this unawareness of scale bear out when used on non-trivial inputs? The second comment would be that in idiomatic C++ the problem cries out for a templated function predicate rather than two separate functions. One might imagine how this code would "blow out" if 'n' other bases were added to the problem instead of just two. A C++ idiomatic solution for the isPalindromic predicate might be something like the sketch shown below:

Idiomatic Predicate for isPalindromic
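A minimal sketch of such a predicate follows; the choice of std::string as the carrier type for the digit representations and the test value 585 are merely illustrative assumptions.

#include <bitset>
#include <iostream>
#include <string>

// Generic palindrome predicate: works for any type T that supports reverse
// iteration, construction from an iterator range, and equality comparison.
template <typename T>
bool isPalindromic(const T& m)
{
    T n(m.rbegin(), m.rend());   // n is the reverse of m
    return n == m;               // a palindrome equals its own reverse
}

int main()
{
    unsigned value = 585;        // palindromic in base 10 and in base 2
    std::string base10 = std::to_string(value);
    std::string base2  = std::bitset<32>(value).to_string();
    base2.erase(0, base2.find('1'));              // strip leading zeros
    std::cout << std::boolalpha
              << isPalindromic(base10) << ' '
              << isPalindromic(base2)  << std::endl;   // prints: true true
    return 0;
}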


The above provides a generic isPalindromic predicate implemented in only two lines of logic, requiring only reverse iteration on type T. The base 2 case becomes as simple as utilizing std::bitset. Since the C++ compiler infers the type at the call site, this function may be used as-is without explicit instantiation of a templated type for each variant of the function. The logic is clear and succinct in only two lines of code: we instantiate "n" as the reverse of "m" (the definition of a palindrome) and return the equality test of n == m.

The Code Llama solution is very much procedural in style --- how a Python coder might approach the problem.


Largest Collatz Sequence

Code Llama Chat Session Follows:


Evaluation: The key to this puzzle is realizing that it exhibits overlapping subproblems. A brute-force approach will therefore re-explore the same Collatz sub-sequences again and again across the one million starting values, as is done in the above offering. This is a classic use case for dynamic programming: a model solution would utilize memoization (not memorization) in the pursuit of O(n) execution.
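A minimal sketch of such a memoized approach, assuming the usual Project Euler bound of one million starting values, might look like this:

#include <cstdint>
#include <iostream>
#include <vector>

// Longest Collatz sequence for starting values below one million.
// Chain lengths already computed are cached, so no sub-sequence is ever
// walked twice from scratch (memoization / dynamic programming).
int main()
{
    const std::uint64_t limit = 1000000;
    std::vector<std::uint32_t> cache(limit, 0);
    cache[1] = 1;

    std::uint64_t bestStart = 1, bestLength = 1;
    for (std::uint64_t start = 2; start < limit; ++start)
    {
        std::uint64_t n = start, steps = 0;
        // Walk the sequence until we reach a value whose length is cached.
        while (n >= limit || cache[n] == 0)
        {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            ++steps;
        }
        std::uint64_t length = steps + cache[n];
        cache[start] = static_cast<std::uint32_t>(length);
        if (length > bestLength) { bestLength = length; bestStart = start; }
    }
    std::cout << bestStart << " yields a chain of length "
              << bestLength << std::endl;
    return 0;
}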

For our next adventure, recall that deep learning has a small-data problem. In short, deep learning models require Big Data to converge on answers, especially where billion-parameter models are concerned. The bigger the model, the more this is true. Problems for which many answers exist in the public domain are very amenable to this. Of course, application programmers really ought to be integrating existing frameworks rather than writing yet another incarnation of a merge sort or Fibonacci sequence from scratch. Indeed, Code Llama will generate a Fibonacci sequence even for esoteric languages like Lisp, as shown below:

Fibonacci Sequence in Common Lisp


The above offering calls for a digression. Code Llama was "smart" enough to employ recursion, but in doing so produced a solution which will "blow up" the call stack in most languages, or, in the case of Lisp, which is smarter than most languages, simply never return in a reasonable amount of time. There is kudos in order, though: Code Llama tested its own solution on a small sample, albeit only one data point.


Code Llama testing its own solution.


To be fair, having tested its solution is impressive for an A.I. chatbot. Alas, had it tried another number, one only slightly larger, it would never have seen its own answer, and neither would I. Below is the timed execution on Apple silicon with ten cores, first on (fib 10), then on (fib 1000).

It got stuck. What happened? The sample input of 10 completed in less than 1 millisecond, yielding 55 as the answer. The sample input of 1000 simply never returned, until a very discouraged user force-quit the application. Why? Because the solution, while recursive, isn't tail recursive. The call stack must still perform the addition + of the left branch (fib (- n 1)) and the right branch (fib (- n 2)) after both recursive calls have returned. Since each call spawns two further calls to (fib), the number of calls roughly doubles with every level of recursion. In short, it grows exponentially: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024... By the tenth level there are already over a thousand pending calls to track. Therefore even relatively modest inputs, like 1000, will simply result in a frozen program.
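For reference, the doubly-recursive shape under discussion looks like this when rendered as a C++ sketch (an illustration of the pattern, not the model's literal Lisp output):

#include <cstdint>

// Each call spawns two further calls, and the pending addition keeps both
// branches alive until they return: not a tail call, exponential work.
std::uint64_t naiveFib(std::uint64_t n)
{
    if (n < 2) return n;
    return naiveFib(n - 1) + naiveFib(n - 2);   // '+' is still pending here
}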

A tail recursive solution would alleviate this problem. Few programmers leverage this technique, so the model may be forgiven for not having produced this answer: once again, such a solution would steer away from consensus implementations, even though the "consensus" in the wild is also non-functional for all but trivial inputs. The model will have learned incorrectness by virtue of its design. The more such data it is trained on, the more incorrect it will be!

A tail recursive implementation carries the context which is otherwise accumulated by the stack as an additional parameter to the function, and ensures that the recursive call is simultaneously the return value of the expression. In the Code Llama solution, the last expression was the addition of the two "fibs," which caused the blow-out. In our solution shown below, there is only one (fib) call on the last line, and it is simultaneously the last term in the overall expression. Our solution has the added benefit of being able to commence computation anywhere within the sequence rather than at the start, but this isn't central.

Tail Recursive Implementation of Fibonacci
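A C++ analogue of this accumulator-passing idea might be sketched as follows. Note that the 64-bit integers assumed here, unlike Lisp's arbitrary-precision integers, wrap around for (fib 1000); the point of the sketch is the call pattern, not the digits.

#include <cstdint>
#include <iostream>

// Tail-recursive Fibonacci: the running pair (previous, current) is carried
// as accumulator parameters, so the recursive call is the entire return
// expression and no pending additions pile up. Most optimizing compilers
// turn this into a loop, and even without that the recursion depth is only
// n, never exponential.
std::uint64_t fib(std::uint64_t n,
                  std::uint64_t previous = 0,
                  std::uint64_t current = 1)
{
    if (n == 0) return previous;
    return fib(n - 1, current, previous + current);   // the tail call
}

int main()
{
    std::cout << fib(10)   << std::endl;   // 55, immediately
    std::cout << fib(1000) << std::endl;   // Fibonacci(1000) modulo 2^64, immediately
    return 0;
}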

How will this fare? Here we are starting at 1 with an accumulator of 0.

We observe that the runtime for (fib 10) is on par for both the Code Llama solution and the solution we offered. Now we attempt (fib 1000).

We note that the human-curated solution completes in under 1 millisecond, whereas the Code Llama solution simply results in a frozen program. Here we also recall the emphasis on the order of complexity (big-O notation) in the Collatz sequence problem. Tail recursion and the correct order of complexity are not a matter of performance obsession; they represent the difference between a working and a non-working solution on non-trivial inputs, due to exponential resource use. This ends the digression which we started by seeking solutions in the esoteric programming languages space.

Let's continue that adventure. We recall that large language models, and deep learning in general, depend on Big Data inputs. What if the domain does not have a lot of data? Or the data is not published, as is often the case in industries like finance, which jealously guard working solutions?

Let's give Code Llama a more business-oriented problem with high-level business requirements rather than classic and well-known algorithms.

Code Llama Answer:

...

This concludes Code Llama's attempt at the question. The full answer was too long to be replicated here, hence the ellipsis. Crucially, we observe: no code. Code Llama relapsed to its regular chat mode as plain Llama-2. As a side note, Code Llama's reply could be interpreted as a reasonable high-level business case analysis. I hope I am qualified to comment here, in that I am the author of an issued United States patent for this very business case.

The model answer in CSP is as shown below:

Copyright 2016 Christoph Kohlhepp, US Patent 10152760

I was a little disappointed to find no code in Code Llama's answer, as the above is the subject of both a United States patent and copyright ownership, and conversations would have had to be had with Meta had the code been produced here.

Indeed, patent law has the potential to be far more ruinous to Large Language Models than the controversies surrounding copyright law. To show that a violation of copyright law may have taken place, a copyright holder would need to demonstrate that the model was trained on a corpus that included their copyrighted material and that the model reasonably reproduces said material. Patent law, by contrast, grants a monopoly right to use a particular method or embodiment of an idea. If the model employs this method without the model owner having negotiated a patent license and paid a fee, then the model owner has patently (pun intended) violated patent law. Typically, in the realm of software patents, and because abstract ideas are not patentable, patent authors tie the method or idea inextricably to a computer by citing embodiments like a CPU and computer memory. In that case, where a model has used a computer, a CPU, computer memory and the protected method or idea, the model is violating the law, regardless of the corpus it has been trained on. Of course, the mere recitation of an algorithm, such as occurs when the model produces code analogous to a patent-protected method, does not amount to executing that method or idea. That does happen, however, if test cases are run against the model output, either to select from a large range of outputs or because the model itself runs a test case as part of the code generation process. Then the method has been exercised in the manner the patent prescribes, and a patent law violation would have taken place. A patent holder need merely visit a model website, ask it to produce the code and test it. Then ring their lawyer...

The Balance:

Five test cases. One strikeout, one wrong answer, three suboptimal solutions. Suboptimal means dysfunctional on all but trivial inputs: 55 - OK; 1000 - the program freezes.

Summary: As noted at the beginning, the fact that Code Llama writes reasonable-looking code is a monumental achievement in terms of machine learning and natural language processing. It also demonstrates that complex statistical relationships can mask non-understanding or incomplete understanding. What is intelligence? As for C++, the overarching feel is that the C++ solutions offered lean on Pythonic, procedural styles. No predominantly functional or predominantly templated styles were observed, even where C++11 was requested. One wonders whether the relatively large representation of Python in the Code Llama foundation model produces a bias which bleeds into other languages, or alternatively whether the tendency towards statistical convergence inherent in large language models simply produces average-quality code without varying the temperature of the model and inducing "hallucinations" in the answers. It would appear logical that an emphasis on correctness by way of conservative temperature settings inherently induces a tendency towards answers with the greatest statistical consensus, hence average rather than better answers. This would be diametrically opposed to the use case of a language like C++, which is only used where extreme optimizations are warranted. Moreover, the tension between "Cheap and Easy" and the desire to excel in a competitive marketplace, i.e. to be better than average, inherently has implications for other domains.

I have the feeling that the Large Language Models of today will afford low-cost jump starts across a number of domains, but will a) fail to excel and b) come at the cost of forgoing ownership under today's copyright frameworks.

One point worth contemplating is that "Cheap and Easy" is also the dominant tenor of the input data on which Large Language Models are trained. We may not even be in a position to hope for a statistical convergence on a domain's average quality in the output of Large Language Models where these models are trained on publicly available data. According to an article on "Jupyter in the Emacs universe", a study which collected a corpus of approximately 1.16 million Jupyter notebooks from GitHub found that only 3.17% could be executed to reproduce the results in their GitHub publications. This makes the relatively poor test results of this article rather unsurprising. Roughly 97% of the GitHub notebook content on which Code Llama will have been trained did not reproduce its own published results. Code Llama successfully learned to predict this incorrectness. Yet the multi-billion-parameter nature of Large Language Models invariably creates an insatiable hunger for data volumes which can only be found in public and uncurated corpora. This is a systemic issue for machine learning and Large Language Models.

Another interesting thought to ponder is this: as tools like GitHub Copilot become more ubiquitous, will we see an emerging phenomenon of diminishing diversity, as the authorship of millions is subsumed by a relatively small number of competing code generators, retraining and fine-tuning on one another's outputs, all the while converging on statistical consensus?

To be or not ....

Do you understand Shakespeare's deeper motivations, or do you just know how to complete the above headline?

One interesting question to ponder is: "What is understanding?" If we are to say that Large Language Models do not truly understand, and the blogosphere is awash with this notion, then we should first "understand understanding."


"Understand understanding" is of course circular and a higher-order question at least --- or meta-physical at worst. Wikipedia has this to say: "Understanding and knowledge are both words without unified definitions." This is not encouraging, but even without unified definitions, and as scientists, we should attempt to "triangulate" the concept. According to Wikipedia "Gregory Chaitin propounds a view that understanding is a kind of data compression. In his essay The Limits of Reason, he argues that understanding something means being able to figure out a simple set of rules that explains it." Simplified explanations then! Central to that concept are reduction and abstraction. How might we get from next word prediction in Large Language Models to understanding, meaning how do we start at sequence to sequence modelling and arrive at compression and abstraction?

Ilya Sutskever, chief scientist at OpenAI, said in a recent interview: "It may look on the surface that we are just learning statistical correlations in text, but it turns out that to just learn the statistical correlations in text, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world. The neural network learns more and more aspects of the world, of people, of the human conditions, their hopes, dreams, and motivations, their interactions in the situations that we are in. And the neural network learns a compressed, abstract, usable representation of that. This is what's being learned from accurately predicting the next word."

In short, Large Language Models do learn both compression and abstraction. In systems theory, we further learn that sufficiently complex systems begin to exhibit emergent behaviours.

Emergent Behaviour in Complex Systems

Hence, when critics argue that statistical prediction in Large Language Models based on next-word sequence modelling does not in itself constitute understanding, they are right; but the emergent behaviour of Large Language Models does qualify according to some of our definitions of "understanding." Yet we perceive there is obviously still something missing, as this article has illustrated. Industry leaders concur.

Why do we perceive that the understanding exhibited by LLMs is somehow illusory?

My personal view is this: integral to understanding is the notion of awareness. The abstraction represented by LLMs corresponds to the function performed by the brain's hippocampus (see "Space and Time: The Hippocampus as a Sequence Generator"). Yet as humans, we can do so much more. The hippocampus is not the whole brain. Our frontal lobe is responsible for awareness, self-monitoring and introspection. With this comes the ability to independently (without being exposed to new experiences or training) ponder what we have learned and, if it is contrary to our goals and imperatives, to unlearn it. As the primary and only sensory organ of a Large Language Model is the Transformer, with understanding but an emergent property, it is inevitable that its purported understanding will be judged deficient. Nonetheless, as humans can conceptualise only three dimensions while Large Language Models can conceptualise billions of dimensions, Large Language Models are predestined to excel where humans cannot.

The future is exciting.
