The Crux of the Problem
Carson Whitsett
Software Architect, Hardware / Embedded Firmware Developer / Biotech Engineer, Unity 3D Developer, Lead iOS Developer.
Time is synonymous with money. Take either one: you can spend it, waste it, lose it, have too much or too little of it, donate it, borrow it and invest it. But the most important thing you can do with it is SAVE it!
In engineering, problems crop up and when they do, it's prudent to solve them quickly. The longer it takes to fix a problem, the more it's going to cost somebody. Through the years, I have refined my skills at sleuthing out that particular point of difficulty, or crux of the problem, causing a failure or malfunction in a system under development. What I learned bridges two barely related fields (musical instrument repair and software design) but reveals a common approach to solving any technical problem in any field quickly. So how do we quickly get to the crux of the problem? Let's start with a little story...
It was 1988 and I had just graduated from ITT Technical Institute and moved to San Diego, California. I tagged along with a friend as he went to a local music store to get something repaired. I was just starting my job search and asked if they needed anyone to repair music gear. Their response was, "When can you start?" Soon I found myself sitting at a work bench with a hot soldering iron, a dead piece of music gear and no prior experience professionally fixing anything! I managed to get through four pieces of music gear on the first day and my confidence was building. The second day started out great. I breezed through a few guitar amps, a Yamaha DX7 keyboard and then I got an Ampeg V4 bass guitar amp that had a crackly sound in the speaker. I spent hours on that thing, probing around with a Simpson voltmeter and an oscilloscope. I could not find the source of that crackle. It seemed like it was everywhere!
I even stayed an extra hour that day but eventually gave up and went home defeated. My confidence waned and I realized this repair stuff wasn't going to be a walk in the park. Maybe I was just lucky with those first few repairs. Was I really cut out for this? The next morning I returned to the battlefield with nary a plan. Perhaps I would just start desoldering individual components and testing them out of circuit. When I returned to my bench, the daunting Ampeg was still there. I flipped my soldering iron's power switch to ON and then noticed a sticky note on the amp. It said, "Hi Carson, I replaced a bad capacitor after you left. It's good to go. Have a great day! Dave". This left me with mixed feelings. While I was certainly relieved that the problem that had eluded me for hours the day prior was fixed, I was troubled that it took someone else an hour or less to fix it. Clearly I could do better.
Five years went by. We logged each of our repairs on a log sheet and I numbered mine in the top-left corner. I had accumulated well over 200 of these log sheets amassing over 5,000 successful electronic repairs.
I had become a master of fixing anything that was broken. I could fix TVs, VCRs, tape machines, synthesizers, rack mount effects units, guitar amps, mixing consoles, power amps, computers, electric pianos (Fender Rhodes and Wurlitzer), fog machines, microwave ovens, washing machines... You name it, I've probably fixed it. Within those years of gaining all of that experience, I had developed a pattern for diagnosing and repairing ailing gadgets. It was a two-step process. The first step was to observe...
OBSERVE
The first thing I would do when a new patient arrived was to observe intently. I would scrutinize the device inside and out. Was there any mechanical damage? Did the unit suffer a fall?
Inside the unit, I would scan the circuit board looking for anything out of the ordinary. Were there any blown components? Components that have the potential for high current to flow through them, such as voltage regulators and power transistors, can sometimes detonate! Here is a picture of a voltage regulator that blew off part of its plastic casing.
I would also look for discoloration. This could be caused by heat, for instance. A component that repeatedly got too hot would usually cause some discoloration on the fiberglass printed circuit board beneath it. The component itself might look browned or charred. My keen eye would zero in on these components and they would be placed at the top of my suspect list. Another source of discoloration is the presence of a foreign substance, usually a liquid.
Sometimes certain types of capacitors, called electrolytic capacitors, will leak their inner fluid, called electrolyte. This image shows a slightly soiled circuit board (highlighted by the red arrow) due to C123's electrolyte leakage. Another clue is the tiny green dot of copper corrosion that's barely visible on the far right side of the picture, below the two silver capacitors. A repair of this board, especially without the aid of a schematic, might take hours or even days to accomplish, but picking up on the subtle clues often leads you right to the culprit.
Other things I would look for are black arc marks engraved into the circuit board, an indication that electricity has carved an illegal low-impedance path between copper traces.
A careful study of the underside of the printed circuit board can reveal clues to the problem too. I would look for stress fractures in the board, possibly severing copper traces, caused by a fall or some other high-impact event. Sometimes I would find a blown trace. A copper trace is very thin and cannot handle a large amount of electrical current. I've seen cases where a nut would come loose or a penny would find its way through a ventilation slot and short something out. All it takes is a brief short circuit in the right place to blow a trace like a fuse.
Fresh shiny solder connections would also catch my attention. This may have been an area of recent repair so I would check that the soldering was correct and that there were no solder bridges. I'd also make sure the correct components were installed (in the right direction).
Intermittent solder connections were common. Solder contains lead, which is very malleable. Two things can cause a bad solder connection to develop: repeated cycles of heating and cooling, which cause thermal expansion and contraction, and repeated cycles of applied mechanical force.
Here is a 1/4" headphone jack out of a music keyboard. I spotted those telltale rings around the pins of the jack caused by repeated plugging and unplugging of the headphones.
Once I had finished my thorough observation, if the problem hadn't become apparent, I would move on to the second step of my two-step process. Test...
TEST
Up until this point, I would not have powered the unit on. I had used my senses to detect anything out of the ordinary, but sometimes nothing looked (or smelled) funny. So I would power the unit on and begin testing things.
The first thing I would test is the power supply. If proper power isn't being supplied to the electronics, all bets are off. The unit could exhibit problems ranging from not turning on at all to strange intermittent glitches or resets. I would make sure all DC voltages were within an acceptable range with minimal ripple. With a digital multimeter, I would measure the voltage drop across diodes (it should be about 0.7V for silicon diodes), and I'd make sure transistors and MOSFETs weren't shorted. With the power off, I would measure resistors to make sure their values were within tolerance, sometimes desoldering a leg out of circuit to get a more accurate reading.
During the observation phase, I might have noticed that a problem only exhibits itself after the unit has had time to warm up. Or vice versa where the unit only exhibits a problem when it's cold and then resolves once it has had time to warm up. Thermal issues can be hunted down rather quickly when you're armed with the right tools: Freeze spray and a heat gun. By freezing then heating suspect components, you can quickly zero in on the offending component, as my friend Dave did when he fixed the Ampeg V4 amplifier that had dumbfounded me so early in my career.
The oscilloscope is an invaluable tool for getting a peek at what's going on inside a circuit. My experience has taught me that sometimes a very faint, almost invisible glitch in the trace can be a clue to the underlying problem. A careful assessment of the oscilloscope trace can sometimes reveal clues that point you directly to the problem. For instance, digital signals typically have two states: low and high. The green trace in the image below ventures slightly above the low state which could cause something to falsely trigger. It would be easy to overlook without that giant red arrow.
TRANSITION
I had been doing repairs for five and a half years and reached a point where it was no longer a challenge. I was quick to diagnose and fix devices that had gone awry, and seldom did something come along that provoked a good head-scratching. It was time for a change.
I had taken an interest in video games and in my spare time was teaching myself the C programming language. I set out to program my own ball-and-paddle brick bashing game inspired by Atari's Breakout and Taito's Arkanoid. Since I was using an Apple Macintosh IIci, I called it MacBrickout. During the day, I would fix broken things and on nights and weekends, I would cultivate my little game, sometimes adding features but mostly fixing bugs. A fellow tech at the music store had left and gotten a job at another company. I met with him at his new company one day and showed him my game. He was impressed and introduced me to a friend of his who was starting a new game company here in San Diego. I showed him MacBrickout and before I knew it, I had a new job porting a motorcycle racing game called Road Rash to the Sega CD game console. I can still remember the overwhelming feelings of questioning and doubt that went along with throwing myself into an entirely new and unfamiliar work environment. Just as when I first started working at the music store without a lick of technician experience, I sat at the trailhead of a mass of 68000 assembly code for an already commercially successful video game asking myself, "Am I really cut out for this?" I persevered, the company grew and we released Road Rash for the Sega CD.
We added more people and released a new title, this time for the Sega Saturn and Sony PlayStation, called Courier Crisis. It was a 3D bicycle messenger game similar to the old arcade game Paperboy.
As the years and projects went by, I was accumulating a decent amount of software development knowledge. I had learned to work on a multidisciplinary team, grasped the concepts of 3D math, polygon rendering, sorting and optimization techniques (this was before OpenGL was readily available and there were no depth buffers so we had to carefully manage and sort the polygons such that they would render in the correct order, furthest to nearest). There was always a healthy supply of bugs to fix too.
Five years had passed and the video game job had run its course. I found myself on a new career track as a hardware / software design engineer. I had the opportunity to learn about microcontrollers and started gaining experience designing small microcontroller-based hardware devices and writing the firmware for them. The Observe-and-Test troubleshooting pattern that had worked so well for me in the electronic repair days continued to bear fruit on these hardware projects, and as my software skills improved, I noticed I was spending a lot less time fixing bugs.
SOFTWARE
Troubleshooting software can be a lot like debugging hardware. You essentially have a black box with a lot going on inside and in order to reveal the cause of the errant behavior, you need to poke and prod in the right places to reveal valuable clues. For software, the same Observe-and-Test pattern can be applied to reveal those clues and reduce debugging time.
OBSERVE
The first thing you'll want to do is watch how the software behaves. When does the malfunction happen? It will typically fall into one of five categories:
- It fails right away. Lucky for you, these are generally very easy to find and fix.
- It fails after a certain amount of time. Take note of how much time elapses. Is it always the same amount of time? If so, what other process within the software also takes that amount of time? Perhaps they are connected. Maybe the failure happens after a random amount of time. Collect some data. Understand what the shortest and longest failure times are. You might uncover a clue.
- Periodic failure. In this case, everything will work fine for a while, then fail for some amount of time, then start working fine again. The cycle repeats over and over. This is usually caused by two seemingly unconnected events running concurrently and periodically interfering with each other. A good example of this would be in a video game loop. The display refreshes 60 times a second. You set up a timer to fire at 60Hz and whenever it fires, you submit your polygons to the display engine to be rendered. Most of the time everything will work perfectly but eventually the two timers will drift such that the display refresh happens exactly when your 60Hz timer fires. There's not enough time to render the scene and a frame gets skipped causing the rendering to appear chunky. This continues for a little bit until the two timers drift far enough apart so that there is again enough time to submit and render your polygons.
- Failure after a specific chain of events. It may not be immediately obvious that it takes a particular chain of events to trigger this failure. It's your job to observe with suspicion. Maybe the problem doesn't happen until after the device goes to sleep one time and then wakes up. Maybe it takes a certain type of packet to be received over the network before the problem exhibits itself. Careful observation will uncover the event chain that leads to this failure. Be patient!
- Sporadic failure. This is the hardest software bug to find because it happens very rarely and at seemingly random times. Most likely, though, this is actually a manifestation of the previous category, and the specific chain of events just hasn't been identified yet. The challenge here is to find ways to make it happen more frequently. Time spent waiting for a bug to happen is 100% wasted. I was once trying to track down an elusive bug that only happened on one screen in an app. To get to that screen, I needed to launch the app, log in, navigate four screens deep, then press a start button. To find the bug faster, I temporarily modified the app to bypass the login and jump directly to that fourth screen, shaving off tons of repetitive manual setup time.
TEST
Admittedly, I seldom encounter tough bugs anymore. Any complex procedures or algorithms I write are usually preceded by a flow chart. Errors in logic can often be discovered and fixed in the flow chart before any code is written (see my article on estimating accurately and completing on time; about halfway through the article there's a section that outlines the benefits of flow charting). When that head-pounder does rear up though, it becomes necessary to probe inside the black box and see what's going on. The best tool I've found for doing that is the debug log.
Almost all platforms these days have the ability to send simple "printf" statements to a debug console. I use this method almost exclusively to understand what's going on under the hood. If you want to understand how often functions are getting called, or how they're getting called in relation to other functions, simply emit a single unique character using a printf statement at the start of each function. Then let the program run for a while and examine the console log. It will give a clear picture as to what's going on. Do you see the same character repeated twice? Is it normal for that particular function to be called twice in a row?
You can also output time stamps with some data. This is extremely helpful when you're trying to run down a critical timing bug. You can plot this data in Excel and see exactly what is happening and when. I was working on a project recently that sent audio packets over a WiFi connection in real time. I was getting breakups in the audio, so I output pertinent information to the debug console as comma-separated values, then plotted it in Excel. It showed that packets were accumulating in the WiFi router and then being sent out in a burst some time later. This resulted in lengthy periods (over 130 ms) when no audio packets were being sent over the air. Once I understood what was happening, I knew my audio buffer sizes needed to be increased appropriately.
You can also emit special characters such as "X" or "!" that signify an error condition of some sort such as buffer full or buffer empty. If you're tracking down a bug and you notice it happens about as often as you see an "X" in the log, then you're on to something.
THE CRUX
Ideally your engineering project progresses without a hitch, but when that awful, potentially time-consuming bug does crop up, it's essential to find it and squash it as fast as possible. Time is money, but taking some time to first observe the behavior and understand the nature of the events leading to the failure will often reveal enough clues to zero in on the problem, saving you lots of debugging time in the long run. Additionally, the right tests will get you the resolution you need. For hardware engineering, ensure your power supply rails are within spec and clean (free of any glitches or ripple). The oscilloscope is an excellent tool for seeing what's going on within the device; just be sure to pay close attention to the trace(s). Scrutinize every pixel and be sure you understand the cause of every deviation you see. Sometimes very faint glitches or very slight deviations in timing are the clues you need to pinpoint the problem.
For software, the debug log is my go-to tool for understanding what's going on and when. You can format the data so that it can be plotted and graphed visually, which often reveals the underlying nature of the problem. If you find yourself spending a lot of time repeating the setup to get a bug to manifest, consider what you can do to eliminate that setup time (such as how I temporarily modified my app to jump directly to the area of the code where the problem was happening).
The faster you can get to the crux of the problem, the faster you'll crush the problem! Observe patiently, test smartly, save time and money!