
The Case of the Suspicious Code

Eliora Horst

Digital Forensics Enthusiast | MS in Computer Science

Published: September 19, 2024

In the bleak back end of November 2022 I was holed up in my Chicago apartment, hiding from the mix of Midwest sleet, snow, and sun. I blocked out the rattling thunder of the nearby Red Line with noise-cancelling headphones and put my nose to the grindstone, grading student submissions for an Introduction to Object-Oriented Programming course for which I served as a teaching assistant.

While the assignment was simple, the course included five class sections, and I was wearily trudging through the submissions of nearly 200 students when a strange pattern began to emerge.

Consider a freshman or sophomore in college, new to the world of computer programming, grappling with this new terrain with a new way to solve problems. What comes naturally to the new learner? What takes practice?

In my countless hours of grading, the predominant characteristic displayed among these new programmers was a lack of efficiency: long, overly complicated variable names; multiple statements to complete a single task when one line may have sufficed; extraneous statements floating in the ether of the white space.

Then there came a solution to the assignment that was so neat, so tidy, it could not have been any more efficient if I had coded it myself - variable names were simple, statements had been condensed down to their most efficient form, and not a single line of extraneous white space was to be seen.

The first assignment I received with these characteristics I passed - after all, there were some students who weren't entirely new to the realm of object oriented programming, and they might well be equipped with the skills of clean code already. But then I ran into an assignment with the exact same code as the first, down to the variable names and number of lines. Now this raised my suspicion even further. But I reserved judgement - this assignment was simple enough in nature that two students of a similar temperament may have fallen together in their pattern of completing the requirements. By the third identical assignment, I pulled back and reevaluated what I had before me. So many students having identical, or near identical, code was cause for alarm.

My mind, weary with the repetition of lines of Java, was now alert. I went back and began paging through the assignments, pulling aside those that waved forth flags of doubt.

These flags included:

Use of identical/near identical variable names - near identical included things like shortening a name (numbers -> num, num -> n), or replacing the variable name with a single character (usually a, b, c, etc.).
Coded statements in the same order - while there is a logical progression to how statements should be organized in a program, having the precise same order for things like initial variable declaration is a marker for plagiarism.
Unusual operators/functions - being an introductory course, and only a handful of assignments in, it would be highly unusual to find complicated functions taken from libraries, or use of operators not yet introduced. For example, shortening the statement 'x = x/y' to 'x /= y' with a compound assignment operator.
Identical white space - whether between lines or between brackets, the use of identical white space is another marker.
Code fails to compile - This is not, in and of itself, an unusual characteristic to come across in beginner code; however, the cause for a program not compiling may be reason enough to pause. In the course for which I TA'd, the students were learning object-oriented programming through Java, and as such they used an IDE (Integrated Development Environment) for Java. This particular one had certain requirements in order to compile and run a Java program, and the professors required that a program run successfully in the IDE to receive any sort of grade.
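
The first flag above (shortened names such as numbers -> num) lends itself to a simple automated check. What follows is a minimal Java sketch, not the method I used while grading; the class and method names are hypothetical. It assumes a name counts as a possible abbreviation when its letters appear, in the same order, inside the longer name. A rename to an unrelated single letter would only match by coincidence, so this check could not stand in for the other flags.

```java
final class NameFlagSketch {
    private NameFlagSketch() {}

    /**
     * Returns true when every letter of candidate appears, in order,
     * within original (case-insensitive). Hypothetical helper for illustration.
     */
    static boolean isOrderedAbbreviation(String original, String candidate) {
        String source = original.toLowerCase();
        String target = candidate.toLowerCase();
        // An empty or longer name cannot be an abbreviation of the original.
        if (target.isEmpty() || target.length() > source.length()) {
            return false;
        }
        int matched = 0;
        // Walk the original once, advancing through the candidate on each hit.
        for (int i = 0; i < source.length() && matched < target.length(); i++) {
            if (source.charAt(i) == target.charAt(matched)) {
                matched++;
            }
        }
        return matched == target.length();
    }
}
```

For instance, NameFlagSketch.isOrderedAbbreviation("digit", "dig") returns true, while a pair like "number" and "total" returns false.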

Now to my own sensibilities, I knew the code I found in front of me was not genuine. However, I needed evidence before I could present my suspicions to the professors I worked for.

I used several different search queries to find the font of programming wisdom these students had copied from. I first searched for answers to this specific textbook problem, as the program name was also the name of the assignment. I found several GitHub repositories containing solutions to every assignment in our textbook, and by going into their files and finding the specific assignment, I found the code that so many of my students had used.

There were still some assignments that I had flagged for potential plagiarism that did not match this code, so I continued searching. Search engines like Google sometimes have a hard time using snippets of code in the search term, so I had to be selective with what parts of the code I looked up. The key ended up being putting parts of statements or lines into quotation marks, which brought up exact matches in places like Chegg and StackOverflow.

Below, the code on the right shows a solution from a GitHub repository containing all the solutions to the assignments in this particular textbook; the code on the left shows an example of potential plagiarism in a fictional student's code (inspired by real student submissions).

Solution to programming problem on right, suspicious code on left

Plagiarism flags found in the suspicious code:

Missing the keyword static in the program declaration, which would lead to a compilation error for a student in this class
Line 1 - Shortened variable name: number -> num
Lines 2, 3, 4 - Identical code; and use of num instead of number, which would cause an error
Line 5 - use of advanced operator
Line 8 - shortened var name again: digit -> dig
Lines 10-14 - Identical if/if else statement order when the reverse order would also be valid
Lines 16, 19 - Identical code
Identical whitespace

All of these markers collectively, combined with looking at the student's previous work, led me to believe the submitted assignment was plagiarized.

After analyzing all the data, I felt I did not have the proper authority to issue zeros to these plagiarized assignments, nor deliver the serious academic verdict of plagiarism. So I compiled a detailed report of my findings, including images of copied code and where I found it online, and escalated the issue to the professors in charge of the course (three professors spanning five sections).

After I had concluded the Case of the Suspicious Code, I began to ponder the development of a program that could be used to flag suspicious code and relay that information to a grader. It would be straightforward enough to code something workable in Java, though with potential expansion and the need to access and process hundreds of files, Java may not be the most efficient language for the job. However, since my programming knowledge is strongest in Java, I began the pseudo-code process with Java in mind.

If I were to design a program that could be fed an assignment to check for plagiarism, this is how I could go about it:

Create a library of solutions - The object oriented programming course I was working with used a specific textbook with specific problem sets. That being the case, there are multiple online sources for solutions, including GitHub, Chegg, and StackOverflow. A library would need to be created for this textbook, containing all possible solutions found online. Each file would include not only the code, but the URL and attribution for use if/when plagiarism is found, so a source can be provided.
Create structures/objects that define what variables are in that particular language - for instance, a variable declaration in Java could be recognized by a regular expression (RegEx) where the left-hand side of the statement contains a variable type, then a space, followed by a name, followed by an equals sign.
Having a definition of a variable, compare variable names - there would be multiple checks in this phase, including: identical match of variable name and type, and partial match (this would check for letters in sequence, not overall characters).
Going line by line, checking for exact matches, excluding variable names.
Using the characteristics listed above, create a score as it relates to how closely the input code matches.
Threshold score - Users could set the program to only display a potential case of plagiarism when this percentage score is over a given value.
Users would also input which exercise from the coding book they are checking against - this would pull the relevant files from the library and have them ready for comparison.
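
The declaration, line-matching, scoring, and threshold steps above can be sketched in Java roughly as follows. This is only one possible reading of those steps: the class name, method names, and the short list of recognized types are my own assumptions. It also assumes that each non-blank submission line is compared against the set of solution lines rather than position by position, and that whitespace is collapsed before comparing, so the whitespace flag would need a separate check.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimilaritySketch {
    private SimilaritySketch() {}

    // Assumed declaration shape: a type, whitespace, a name, then an equals sign.
    // The type list is illustrative and far from complete.
    private static final Pattern DECLARATION = Pattern.compile(
            "\\b(int|long|double|float|boolean|char|String)\\s+([A-Za-z_]\\w*)\\s*=");

    /** Collects the names of variables declared with an initializer. */
    static Set<String> declaredNames(List<String> lines) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher m = DECLARATION.matcher(line);
            while (m.find()) {
                names.add(m.group(2));
            }
        }
        return names;
    }

    /** Collapses whitespace and replaces each known variable name with VAR. */
    static String normalize(String line, Set<String> names) {
        String result = line.trim().replaceAll("\\s+", " ");
        for (String name : names) {
            result = result.replaceAll("\\b" + Pattern.quote(name) + "\\b", "VAR");
        }
        return result;
    }

    /** Percentage of non-blank submission lines that match a solution line. */
    static double percentMatch(List<String> submission, List<String> solution) {
        Set<String> solutionNames = declaredNames(solution);
        Set<String> solutionLines = new HashSet<>();
        for (String line : solution) {
            if (!line.trim().isEmpty()) {
                solutionLines.add(normalize(line, solutionNames));
            }
        }
        Set<String> submissionNames = declaredNames(submission);
        int total = 0;
        int matched = 0;
        for (String line : submission) {
            if (line.trim().isEmpty()) {
                continue; // blank lines are not counted in this sketch
            }
            total++;
            if (solutionLines.contains(normalize(line, submissionNames))) {
                matched++;
            }
        }
        return total == 0 ? 0.0 : 100.0 * matched / total;
    }

    /** Only scores strictly over the user's threshold would be displayed. */
    static boolean exceedsThreshold(double score, double thresholdPercent) {
        return score > thresholdPercent;
    }
}
```

Because every declared name is replaced by the same placeholder, a submission that only renames number to num still matches line for line. The trade-off is that two genuinely different variables also become indistinguishable, which may inflate the score on very short assignments.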

Any final determination of plagiarism, especially at a college level, should be reviewed by a professor, so this program should not give a verdict on the likelihood of copied code. Instead, there could be an overall percent match score. The program should give the grader using it the assignment flagged for plagiarism, its plagiarism score, and the library file that it identified as being copied.
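
As a rough illustration of that output, the Java sketch below pairs a flagged assignment with its score and the library entry it most closely matched, including the URL and attribution the library would store. All class, field, and method names are hypothetical, and the scoring function is passed in as a parameter because the design does not fix how the score is computed.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.ToDoubleBiFunction;

/** One stored online solution, with its source kept for the report. */
final class LibraryEntry {
    final String sourceUrl;   // where the solution was found online
    final String attribution; // who published it
    final List<String> lines; // the solution code, one element per line

    LibraryEntry(String sourceUrl, String attribution, List<String> lines) {
        this.sourceUrl = sourceUrl;
        this.attribution = attribution;
        this.lines = lines;
    }
}

/** What the grader sees: a score and a source, never a verdict. */
final class MatchReport {
    final String assignmentName;
    final double percentMatch;
    final LibraryEntry matchedEntry;

    MatchReport(String assignmentName, double percentMatch, LibraryEntry matchedEntry) {
        this.assignmentName = assignmentName;
        this.percentMatch = percentMatch;
        this.matchedEntry = matchedEntry;
    }

    String summary() {
        return String.format(Locale.ROOT, "%s: %.1f%% match with %s (%s)",
                assignmentName, percentMatch,
                matchedEntry.sourceUrl, matchedEntry.attribution);
    }
}

final class ReportSketch {
    private ReportSketch() {}

    /**
     * Returns a report for the highest-scoring library entry, or an empty
     * Optional when no entry scores strictly over the threshold.
     */
    static Optional<MatchReport> flag(
            String assignmentName,
            List<String> submission,
            List<LibraryEntry> library,
            ToDoubleBiFunction<List<String>, List<String>> scorer,
            double thresholdPercent) {
        MatchReport best = null;
        for (LibraryEntry entry : library) {
            double score = scorer.applyAsDouble(submission, entry.lines);
            if (score > thresholdPercent
                    && (best == null || score > best.percentMatch)) {
                best = new MatchReport(assignmentName, score, entry);
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Returning an empty Optional when nothing crosses the threshold keeps the tool quiet about ordinary submissions, and the summary line reports a percentage and a source rather than a conclusion, leaving the judgement to the professor.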

Areas for expansion:

Ability to choose different courses - the goal would be software made for a specific university (e.g. my alma mater, Loyola University Chicago), so professors/TAs would be able to select which course they were grading for, and the program would load the appropriate language settings and source library.
Ability to load all the assignments from one course - being able to input a single folder/library containing every assignment from that exercise, the program would then pull in one assignment at a time, and deliver a report to the user once it completed.
A GUI (Graphical User Interface) - The easier the interaction is between the user and the computer, the more efficient the experience is for the user. This could potentially include things like highlighting the code where the program has identified issues, issuing a more human-friendly report with graphics, and the ability to generate a report to email to students directly from the program.
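
Of these, loading every assignment from one folder is the simplest to sketch. The Java below is a minimal illustration under my own assumptions: submissions are single .java files sitting directly in one folder, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BatchLoaderSketch {
    private BatchLoaderSketch() {}

    /**
     * Reads every .java file directly inside folder and returns its lines,
     * keyed by file name. Subfolders and other file types are skipped.
     */
    static Map<String, List<String>> loadSubmissions(Path folder) throws IOException {
        List<Path> files;
        // Files.list holds a directory handle open, so close it promptly.
        try (Stream<Path> listing = Files.list(folder)) {
            files = listing.collect(Collectors.toList());
        }
        // TreeMap keeps the eventual report in a stable, alphabetical order.
        Map<String, List<String>> submissions = new TreeMap<>();
        for (Path file : files) {
            String name = file.getFileName().toString();
            if (Files.isRegularFile(file) && name.endsWith(".java")) {
                submissions.put(name, Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        }
        return submissions;
    }
}
```

Each entry of the returned map could then be checked one at a time against the solution library, with the results gathered into a single report once the whole folder has been processed.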
