Crashing the Student Computer Lab
In my last year of graduate school at Notre Dame, I used over 1,000,000 computer hours or just over 114 years of compute time. Only once did I inadvertently crash the engineering computing lab.
I was using a distributed compute grid called Condor. It was installed on most computers in the computer labs across the engineering building, and later expanded to the entire university. It would only use spare compute cycles, and it was supposed to stop if someone logged in. One would have a script to send jobs and commands to these machines, and it would dump the results in a nice little folder.
The figure below is from a paper describing how one would set up the all-vs-all comparison for face recognition experiments. These were the types of experiments I was running.
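To give a feel for why all-vs-all experiments eat compute so quickly, here is a rough sketch of my own (not taken from the paper) of how the comparisons could be enumerated and split into batches for a grid to run. The scan names, the batch size, and the assumption that matching A against B is the same as matching B against A are all made up for illustration.

```python
from itertools import combinations, islice


def all_vs_all_pairs(scan_ids, symmetric=True):
    """Yield every pair of scans that needs to be compared.

    symmetric=True assumes match(a, b) == match(b, a), so each unordered
    pair is produced once. That is an assumption of this sketch.
    """
    if symmetric:
        yield from combinations(scan_ids, 2)
    else:
        for a in scan_ids:
            for b in scan_ids:
                if a != b:
                    yield (a, b)


def chunk(pairs, size):
    """Group the stream of pairs into lists of at most `size` items."""
    it = iter(pairs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Hypothetical scan names; a real experiment would read these from disk.
scans = [f"scan_{i:03d}" for i in range(5)]

# Each batch would become one job handed to the grid.
jobs = list(chunk(all_vs_all_pairs(scans), 4))
```

With N scans there are N*(N-1)/2 unordered pairs, so the number of comparisons grows quadratically with the size of the data set, and each comparison is itself a full matching run.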
I needed to use Condor because I was using Iterative Closest Point (ICP) to do face matching. At the time, it was one of the best techniques, but it was computationally expensive, something like O(m*n*log n * iterations), where m is the number of points in the model and n is the number of points in the comparison model. The number of iterations made a difference, and you could do some optimization, but it was still slow. However, for failure analysis, ICP was visually appealing and understandable relative to other pattern recognition techniques.
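For readers who have not seen ICP, here is a minimal toy sketch of the idea. It is not the code I ran: it works on 2D points instead of 3D face scans, uses a brute-force nearest-neighbor search, and all function names are invented for this example. The loop is the important part: match each model point to its closest target point, solve for the best rigid motion, apply it, and repeat.

```python
import math


def nearest(point, cloud):
    """Brute-force nearest neighbor: scans every point in `cloud`."""
    return min(cloud, key=lambda q: (q[0] - point[0]) ** 2 + (q[1] - point[1]) ** 2)


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def best_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    cs, cd = centroid(src), centroid(dst)
    cos_sum = sin_sum = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        sx, sy, dx, dy = sx - cs[0], sy - cs[1], dx - cd[0], dy - cd[1]
        cos_sum += sx * dx + sy * dy
        sin_sum += sx * dy - sy * dx
    theta = math.atan2(sin_sum, cos_sum)
    c, s = math.cos(theta), math.sin(theta)
    tx = cd[0] - (c * cs[0] - s * cs[1])
    ty = cd[1] - (s * cs[0] + c * cs[1])
    return theta, tx, ty


def transform(points, theta, tx, ty):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp_2d(model, target, max_iterations=50, tol=1e-9):
    """Align `model` to `target`; returns (aligned points, RMS error)."""
    current = list(model)
    prev_err = err = float("inf")
    for _ in range(max_iterations):
        # Step 1: closest-point correspondences (the expensive part).
        matches = [nearest(p, target) for p in current]
        # Step 2: best rigid motion for those correspondences.
        theta, tx, ty = best_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
        # Step 3: stop once the error no longer improves.
        err = math.sqrt(
            sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(current, matches))
            / len(current)
        )
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return current, err


# Toy data: an L-shaped target and a slightly rotated, shifted copy of it.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
model = transform(target, math.radians(5), 0.1, -0.05)
aligned, rms_error = icp_2d(model, target)
```

The brute-force search here costs on the order of m*n distance checks per iteration; a spatial index such as a k-d tree is the usual way to speed up the closest-point step. Either way, multiply that by the number of iterations and by every pair in an all-vs-all experiment and the hours add up fast.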
Running large experiments was almost always bottlenecked by resources, so weekends and nights were when most of the jobs got done. I also had to compete for resources with two other grad students who were using Condor almost as much as I was. After I left, Condor got really busy. Below is a utilization chart pulled from whatever I could find as an example. It happens to have me (rmckeon) as the top user, and coincidentally, the time period it covers includes Spring Break, which is when the computer lab lost!
Spring break of 2009 came (see chart above), and suddenly I was dominating all of the resources. I was thrilled until I got the email. Some poor student came to the lab over the break and tried to log in to a computer. The login just spun and spun. He tried multiple machines, but I was on all of them through Condor, twice over (each machine was dual-core, so two of my jobs per machine).
My jobs didn’t give up their priority as they should have, and as a result, my jobs had rendered the labs useless to anyone else unless they hard rebooted all the machines. Even on reboot, my jobs would get pushed to those machines and take over if the user didn't log in quickly enough!
I had to kill off all of my jobs, and the professor in charge of Condor had to fix the bug. I didn’t want to dominate an entire computer lab, but it was pretty funny.