Crashing the Student Computer Lab

In my last year of graduate school at Notre Dame, I used over 1,000,000 computer hours or just over 114 years of compute time. Only once did I inadvertently crash the engineering computing lab.
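As a quick sanity check on that conversion, here is a minimal sketch. It assumes a 365-day year of round-the-clock computing; the variable names are purely illustrative.

```python
# Assumption: one "compute year" is a single core running 24 hours a day
# for a 365-day year.
HOURS_PER_YEAR = 24 * 365

compute_hours = 1_000_000
compute_years = compute_hours / HOURS_PER_YEAR

print(f"{compute_years:.1f} years")  # prints "114.2 years"
```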

I was using a distributed compute grid called Condor. It was installed on most computers in the computer labs across the engineering building, and it later expanded to the entire university. It would only use spare compute cycles, and it would stop if someone logged in. A script would send jobs and commands to these machines and dump the results into a nice little folder.
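The "spare cycles only" policy can be pictured with a toy model. This is a rough sketch, not Condor's actual API or scheduling algorithm: the `Machine` class, the `schedule` function, and the two-core default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Machine:
    """Hypothetical lab machine; not a real Condor data structure."""
    name: str
    cores: int = 2                      # assumed core count
    user_logged_in: bool = False
    running: List[str] = field(default_factory=list)


def schedule(machines: List[Machine], queue: List[str]) -> None:
    """One scheduling pass: evict grid jobs from machines in use,
    then fill the idle cores of free machines from the queue."""
    for machine in machines:
        if machine.user_logged_in:
            # A person takes priority: push the jobs back to the front of the queue.
            queue[:0] = machine.running
            machine.running = []
        else:
            while len(machine.running) < machine.cores and queue:
                machine.running.append(queue.pop(0))


lab = [Machine("lab-01"), Machine("lab-02")]
jobs = [f"job{i}" for i in range(5)]

schedule(lab, jobs)            # both machines idle: four jobs start, one waits
lab[0].user_logged_in = True
schedule(lab, jobs)            # lab-01 is now in use: its two jobs are evicted
print(lab[0].running, jobs)
```

In this toy version, a login always wins on the next scheduling pass, which is the behavior the grid was supposed to guarantee.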

Below is material from a paper on how one would set up an all-vs-all comparison for face recognition experiments. These were the types of experiments I was running.

I needed to use Condor because I was using Iterative Closest Point (ICP) to do face matching. At the time, it was one of the best techniques, but it was computationally expensive, roughly O(m * n * log n * iterations), where m is the number of points in the model and n is the number of points in the comparison model. The number of iterations made a difference, and you could do some optimization, but it was still slow. However, for failure analysis, ICP was visually appealing and understandable relative to other pattern recognition techniques.
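To get a feel for why this needed a grid, here is a back-of-the-envelope sketch that combines the cost expression above with an all-vs-all experiment. The function names are hypothetical, the sample sizes are made-up placeholders rather than figures from my experiments, and it assumes each unordered pair of scans is matched exactly once.

```python
import math


def icp_match_cost(m: int, n: int, iterations: int) -> float:
    """Rough operation count for one match, using the m * n * log n * iterations
    expression from the text (log base 2 is an arbitrary choice here)."""
    return m * n * math.log2(n) * iterations


def all_vs_all_pairs(num_scans: int) -> int:
    """Number of unordered scan pairs, excluding a scan matched against itself."""
    return num_scans * (num_scans - 1) // 2


# Placeholder values purely for illustration.
scans = 1_000
points_per_scan = 10_000
iterations = 20

pairs = all_vs_all_pairs(scans)
total_ops = pairs * icp_match_cost(points_per_scan, points_per_scan, iterations)
print(f"{pairs:,} matches, roughly {total_ops:.2e} operations")
```

The number of matches grows quadratically with the number of scans, and each match is itself expensive, so even a modest dataset quickly outgrows a single machine.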

Running large experiments was usually bottlenecked by resources, so weekends and nights were when most of the jobs got done. I also had to compete for resources with two other grad students who were using Condor almost as much as I was. After I left, Condor got really busy. Below is a utilization chart pulled from whatever I could find as an example. It happens to have me (rmckeon) as the top user, and coincidentally, the time period it covers includes Spring Break, which is when the computer lab lost!

Spring break of 2009 came (see chart above), and suddenly I was dominating all of the resources. I was thrilled until I got the email. Over the break, some poor student came to the lab and tried to log in to a computer. The login just spun and spun. He tried multiple machines, but I was on all of them through Condor, twice over (each machine was dual-core, so two of my jobs per machine).

My jobs didn’t give up their priority as they should have, and as a result, they had rendered the labs useless to anyone else unless they hard rebooted all the machines. Even on reboot, my jobs would get pushed to those machines and take over if the user didn't log in quickly enough!

I had to kill off all of my jobs, and the professor in charge of Condor had to fix the bug. I didn’t want to dominate an entire computer lab, but it was pretty funny. 
