Tracing My Data Science Path
Copyright (c) Dmitry Vostokov

Monsieur Jourdain: These forty years now I’ve been speaking in prose without knowing it! (Molière, “The Bourgeois Gentleman”)

People are quite bad at debugging. Debugging is a complicated skill, kind of like medical diagnosis. And it’s not as well codified for debugging computer systems as it is for medical diagnosis, where one knows there’s this percent probability of this or that, but it’s the same kind of idea, [testing theories one by one] and so on. But the thing people don’t do very much, and they really really should, is data science for debugging. (Stephen Wolfram, in “How to Think Like Stephen Wolfram” [1])

 

Throughout my long engineering career, I have periodically discovered that the activities I was doing already had names. For example, in 2012, I realized that I was doing software diagnostics (although I knew from the beginning that I was doing forensic work). Since then, I have implicitly assumed that it was something like traditional data analysis, as evidenced by the DA+TA abbreviation (Dump Analysis + Trace Analysis) that I coined. I used regression for prediction long before I became familiar with artificial neural networks 15 years ago (mathematics, C++ implementation), and I even tried other traditional AI approaches (expert systems, PROLOG for memory dump analysis inference). Still, only recently have I caught up with the latest ML frameworks and approaches, and over the last few years I have been building a substantial library for exploration and learning.

My path to data science had the following segments, with some overlap:

  1. Relational database design and SQL programming
  2. Exploratory data analysis and mining insights
  3. Data visualization
  4. Best practices
  5. Pattern mining
  6. Cross-disciplinary pattern transfer
  7. Developing interdisciplinary scientific and philosophical foundations for software diagnostics
  8. Software data science
  9. Anomaly detection
  10. Data engineering
  11. Data science

Let’s reminisce a little about the points above. The full multithreaded path can be traced in Theoretical Software Diagnostics[2]; additional information can be found in Memory Dump Analysis Anthology volumes[3], including the latest developments in volume 12, and various seminar transcripts[4].

  1. It is often said that SQL skills are essential for data science. I learned relational database design and SQL programming in the second half of the 1990s, became twice certified in Microsoft T-SQL and SQL Server, and even had exposure to Oracle. At that time, I also learned and used various client-side libraries for accessing databases from C++ and Java.
  2. Still, those were software engineering activities. What dramatically changed my career was memory dump analysis. Memory dump data used for exploratory analysis and mining insights is usually a mix of structured and unstructured data, where the latter results from sparse, overlapping execution histories.
  3. Memory visualization (2D and 3D approaches) was another of my activities from the beginning[5].
  4. To get the best from memory dumps, you need some organizing best practices, and I introduced typical data analysis workflow elements such as data cleaning, checklists and scripts, and common mistakes to avoid. Common analysis approaches and techniques applied in specific contexts to common recurrent problems went into analysis patterns and problem patterns.
  5. As more and more analysis and problem patterns were discovered, the process came to resemble pattern mining.
  6. When traces and logs were added, pattern mining became a cross-disciplinary activity, utilizing insights from humanities, mathematics, medicine, and natural sciences. Inside the software data domain, the same patterns were transferred to malware narratives and network traces.
  7. Pattern-oriented and systemic approaches became the foundation for software diagnostics and forensics, and software narratology became the foundation for log analysis and software-related stories in general.
  8. I also realized that most analysis patterns could be applied in the broader context of software data[6], including source code and other software artefacts. I also proposed software artefact annotations for software data analysis[7].
  9. Recent experience with digital pathology made me realize that what I had been doing in the past is also called anomaly detection and analysis[8].
  10. Since software is its own model[9], data engineering became my focus for the simulation, training, and validation of analysis patterns. This is also supported by my long-time software defect modelling experience.
  11. Finally, I came to the realization that I had been doing data science all along, going back to 2003.

References: 

[1] https://www.lifehacker.com.au/2019/04/how-to-think-like-stephen-wolfram/

[2] Theoretical Software Diagnostics: Collected Articles, Second Edition (https://www.dumpanalysis.org/theoretical-software-diagnostics-book)

[3] Advanced Software Diagnostics and Debugging Reference (https://www.dumpanalysis.org/advanced-software-debugging-reference)

[4] Principles of Memory Dump Analysis: The Collected Seminars (https://www.dumpanalysis.org/principles-memory-dump-analysis-book) and Software Diagnostics: The Collected Seminars (https://www.dumpanalysis.org/software-diagnostics-seminars-book)

[5] Memory Dump and Live Memory Visualization and Picture Extraction (https://www.dumpanalysis.org/memory-dump-live-memory-visualization)

[6] Principles of Pattern-Oriented Software Data Analysis (https://www.dumpanalysis.org/pattern-oriented-data-analysis)

[7] Coding and Articoding (https://www.dumpanalysis.org/articoding)

[8] Trace and Log Analysis: A Pattern Reference for Diagnostics and Anomaly Detection (https://www.dumpanalysis.org/trace-log-analysis-pattern-reference)

[9] The Scope of Software Diagnostics (https://www.dumpanalysis.org/scope-software-diagnostics)


