My data story
Jayanti Prasad Ph.D.
Associate Principal - Data Sciences @ LTIMindtree | Ph.D. in Astronomy
I strongly believe that every data scientist has a data story to share - one that captures the motivation and circumstances that led them to a career in data science. Here I am presenting my own story.
My story starts back in the early 2000s, when I began my Ph.D. in Physics/Astrophysics at the Harish-Chandra Research Institute in Allahabad (India). At that time giga was not common and kilo was still relevant, and floppy disks were very much in use. The first pen drive I used had a massive storage capacity of 256 MB! Most computers ran on MHz clocks, cyber cafes were the usual places to check email (outside the workplace), and laptops were not affordable. So what was the problem I was working on? Let us discuss that.
After completing a set of advanced courses in Physics (Mathematical Physics, Classical and Quantum Mechanics, Particle Physics and Quantum Field Theory, Statistical Physics, General Relativity, Computational Physics and Astrophysics) I decided to work in a sub-field of Astronomy & Astrophysics called Cosmology. It was an interesting area to work in, with the objective of building theories (models) about the origin and evolution of the Universe within the framework of Physics. The problems I worked on were quite specific and concerned how galaxies (which are considered the building blocks of the Universe) form, evolve and cluster with time. The problem became narrower with time, and I decided to work on the clustering of galaxies in an expanding Universe for my thesis.
Galaxy formation is a messy subject and needs inputs from many branches of Physics such as the General Theory of Relativity, Hydrodynamics and Electrodynamics, which have a reputation for hosting some very beautiful equations with messy or no solutions! The systems for which the equations are exactly solvable in these fields are far from the real systems we have in the Universe, so numerical solutions are the only way out. It is no surprise that computers have been used in physics and astronomy research from the very beginning. Just to give one example: long before modern GPUs were born, a chip named 'GRAPE-1' was built in Japan in 1989 with 310 MFlops of computing power, running on an 8 MHz clock, for computing the gravitational force between stars!
What makes the task of understanding galaxy clustering a bit easier is the approximation that galaxies can be treated as point particles interacting with each other only through the gravitational force. Other features related to gas and radiation physics can be added later, once the gravitational dynamics is understood.
It turns out that we can model the clustering of galaxies in an expanding universe by considering just two sets of equations - one related to the expansion of the Universe and the other related to the dynamics of massive particles in the gravitational field. The expansion of the Universe is well described by Hubble's law (the relative velocity of two galaxies grows in proportion to the distance between them). I am not writing down these equations here in order to avoid making the article technical and boring, but they are pretty simple. Here I will focus on Newton's law of gravitation, which is used to compute the gravitational force acting between a set of massive objects.
As mentioned above, we can strip away all the properties of galaxies except their mass and try to understand how they cluster in an expanding background using Newton's laws of motion. There are very strong (observational) reasons to believe that some time back (billions of years ago!) the galaxies in the Universe were not as clustered as they are today and were distributed almost uniformly.
The title of my Ph.D. thesis was “Gravitational Clustering in an Expanding Universe”, and it was about understanding some aspects of the clustering of galaxies - a highly non-linear process. In particular, I focused on the effect that structures (sub-structures) at small scales have on clustering at large scales. This was an important problem to understand because in the Universe clustering happens at small scales first and then takes place at larger and larger scales. The clustering of galaxies leads to highly non-linear objects such as the groups, clusters, super-clusters, filaments and voids we observe in the Universe.
There is no way to solve the equations of motion given by Newton's law for a million particles (representing galaxies) analytically - with just pen and paper. In fact, these equations cannot be solved exactly even for three particles - you can read about the three-body problem elsewhere. The main approach to this problem is numerical and is called Gravitational N-body simulation, in which the equations of motion are solved numerically on a digital computer.
The equations used in N-body simulations are not very complex, but they need a lot of computing power and storage. The computational cost grows as O(N^2) if we make no approximations. This means that with one million objects (particles) we need to compute on the order of a trillion pairwise forces at every step of the simulation. Computation is also needed to update the positions and velocities of the particles, but that can be ignored since it grows only linearly with the number of particles.
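To make the O(N^2) scaling concrete, here is a minimal direct-summation force routine in Python. My actual simulation codes were written in Fortran, so this is only an illustrative sketch; the softening parameter eps and the unit masses in the toy usage are assumptions made for the example.

```python
import numpy as np

def direct_forces(pos, mass, G=1.0, eps=1e-2):
    """Direct-summation gravity: every one of the N(N-1)/2 pairs is visited,
    so the cost grows as O(N^2) with the number of particles N."""
    n = len(pos)
    force = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            dr = pos[j] - pos[i]
            r2 = np.dot(dr, dr) + eps**2      # softening keeps the force finite as r -> 0
            fij = G * mass[i] * mass[j] * dr / r2**1.5
            force[i] += fij                   # Newton's third law: equal and opposite
            force[j] -= fij
    return force

# Toy usage: 100 random particles of unit mass in a unit box.
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 1.0, (100, 3))
print(direct_forces(pos, np.ones(100))[0])
```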
Algorithms such as Particle Mesh (PM) and Tree (Barnes–Hut) have been proposed that bring the computational complexity down from O(N^2) to O(N log N) by making simplifications that compromise accuracy within acceptable limits. If we want to track the evolution of particles in the simulation, we must save the positions and velocities of all the particles at all epochs - or at least at some epochs - and that requires a lot of storage. I do not recall anyone calling the data generated in N-body simulations ‘Big Data’, but it is not just ‘Big Data’ - it can actually be very Big Data.
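As a rough illustration of the PM idea (not the production codes I used), here is a sketch that assigns particles to a grid and solves Poisson's equation with FFTs. The grid size, box size and nearest-grid-point mass assignment are assumptions made to keep the example short; real PM codes use higher-order assignment schemes and periodic force interpolation.

```python
import numpy as np

def pm_potential(positions, masses, ngrid, box_size, G=1.0):
    """Particle-Mesh style potential: assign mass to a grid and solve Poisson's
    equation with FFTs, at a cost of roughly O(N + Ng^3 log Ng) instead of O(N^2)."""
    cell = box_size / ngrid
    # Nearest-grid-point mass assignment (real PM codes use CIC or TSC).
    idx = np.floor(positions / cell).astype(int) % ngrid
    density = np.zeros((ngrid, ngrid, ngrid))
    np.add.at(density, (idx[:, 0], idx[:, 1], idx[:, 2]), masses)
    density /= cell**3

    # Solve nabla^2 phi = 4 pi G rho in Fourier space: phi_k = -4 pi G rho_k / k^2.
    k = 2.0 * np.pi * np.fft.fftfreq(ngrid, d=cell)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                     # placeholder to avoid division by zero
    phi_k = -4.0 * np.pi * G * np.fft.fftn(density) / k2
    phi_k[0, 0, 0] = 0.0                  # drop the mean (k = 0) mode
    return np.real(np.fft.ifftn(phi_k))

# Toy usage: a handful of random particles in a periodic box.
rng = np.random.default_rng(1)
pos = rng.uniform(0.0, 100.0, (1000, 3))
phi = pm_potential(pos, np.ones(1000), ngrid=32, box_size=100.0)
print(phi.shape, phi.mean())
```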
The first 'big data' I worked with was a few GB (up to 5) and was generated in an N-body simulation. The biggest simulation I ran was for a system of 512^3 = 134,217,728 particles and required around one week of computing time on a Linux cluster with around 64 cores (Intel Xeon processors @ 2 GHz). Knowledge of the Message Passing Interface (MPI) and job schedulers was crucial for launching jobs on a Linux cluster. The programming language I started with was Fortran 77, and later on all the codes were converted to Fortran 90 at different points in time.
The N-body projects I worked on had three main components - the equations, the computation and the statistics. The equations and computation I have covered, so now I will describe the statistics part. The main statistic we used is the family of N-point correlation functions, which measure the probability of finding N objects in N cells of volume dV1, dV2, ..., dVN around positions r1, r2, ..., rN. If there are no correlations between the positions of particles, then the joint probability of finding one particle in volume dV1 and another in volume dV2, separated by a distance r, does not depend on r and simply factorizes: P(1,2) = P(1) P(2). If there are correlations, extra terms appear, and the leading extra term defines the two-point correlation function.
If we have a set of N objects, we can write down the joint probability P(1,2,...,N) in terms of the N-point correlation functions. There are strong reasons to believe that galaxies in the Universe formed from density fluctuations in the early Universe which were Gaussian random. This means that we do not need to go to higher order than the two-point correlation function, or its Fourier-space counterpart, the power spectrum. All the data I created and analysed in my projects was interpreted in terms of the N-point correlation functions. I do not want to go into technical details here; for that you can check my relevant publications.
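For readers who want a feel for what a two-point correlation measurement looks like in practice, here is a minimal brute-force pair-counting sketch in Python using the simple DD/RR - 1 estimator. This is a toy version, not the estimators or codes used in my thesis work; the bin edges, box size and random-catalogue size in the usage are assumptions.

```python
import numpy as np

def xi_estimator(data, box_size, bins, n_random=None, seed=0):
    """Two-point correlation xi(r) via the simple DD/RR - 1 estimator:
    compare pair counts in the data with pair counts in a random catalogue."""
    rng = np.random.default_rng(seed)
    n_random = n_random or len(data)
    randoms = rng.uniform(0.0, box_size, (n_random, 3))

    def pair_counts(points):
        # Brute-force distance matrix; fine for small samples only.
        d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
        iu = np.triu_indices(len(points), k=1)
        counts, _ = np.histogram(d[iu], bins=bins)
        return counts / (len(points) * (len(points) - 1) / 2.0)   # normalised counts

    dd, rr = pair_counts(np.asarray(data)), pair_counts(randoms)
    return dd / np.where(rr > 0, rr, np.nan) - 1.0

# Toy usage: a random (hence uncorrelated) point set should give xi(r) close to 0.
rng = np.random.default_rng(1)
points = rng.uniform(0.0, 100.0, (300, 3))
print(xi_estimator(points, box_size=100.0, bins=np.linspace(1.0, 30.0, 7)))
```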
Now I will discuss another part of my data story, which is about collecting huge volumes of data about the sky using the Giant Metrewave Radio Telescope (GMRT) near Pune. If you do not know what the GMRT is, let me tell you: it is a set of 30 giant (45-metre diameter), fully steerable parabolic dish antennas located around 100 km from Pune. The facility is one of the most sensitive telescopes in the world in its frequency range (from a few hundred MHz up to about 2 GHz). The receivers on these antennas collect signals coming from the sky, and the signals can be combined in many different ways. With some adjustments all the antennas can be pointed in one particular direction and a very high angular resolution can be achieved, but this formation limits the area of sky that can be covered in an allocated observation time. In another common formation, different antennas look in different directions and the whole sky (as geographical and physical constraints allow) can be covered in a short period of time. In any case, a huge volume of data can be generated in just a few hours.
Most objects in the sky do not change their brightness with time (after removing the changes due to the motion of the Earth), but there is a class of objects called “pulsars” which emit signals of very high temporal regularity - we can even set our clocks by these signals. In fact, when they were first discovered it initially looked as if the signals were being sent by extraterrestrials! The signals coming from pulsars can be used not only to find the location and other properties of the pulsars themselves but also to probe the properties of the medium through which they pass - such as the ionization state of the matter. Pulsar astronomy is a very active area of research and has led to Nobel Prizes as well.
Apart from the static sky and pulsars, there are many other objects in the sky which show variations in their brightness over different time scales. The research group I was working with at NCRA was interested in a class of objects called radio transients. These transients can change their brightness over very short periods of time (down to nanoseconds). This means that we needed to record the data very frequently so that we could catch these extremely fast events. I still remember observations during which we were receiving data at such a huge rate that we were quickly running out of disk space. The data we collected then had to be shipped to Australia for analysis on the supercomputers at Swinburne University. Back then (2008-2009) network bandwidth was poor, so we received huge boxes filled with hard disks of TB capacity from Australia for the project. We used to remove the filled disks and insert new ones during the observations. The details of this project are published in one of my papers, so I will not discuss it further here except to mention that it was very time consuming to inspect all the data for quality and remove the bad data points. For this purpose, I wrote a software package named “Flagcal” with my supervisor that could do this task automatically. The package was written in 'C' and had provision for processing the data in parallel using OpenMP. You can check out the original version of the code on my GitHub page and the corresponding paper on arXiv. Back then we collected and analyzed terabytes of data, but none of us called ourselves data scientists! Again, the data we collected was big only in terms of volume, so we can call it ‘Big Data’. There is a lot to write about the relationship between data, software and radio astronomy, but I will keep that for later.
Of the various types of waves that come from the far corners of the Universe, it is mainly radio waves and gravitational waves that reach the Earth without being much affected by the matter in between. I will write about my association with gravitational waves at the end of the article, but first I want to discuss another kind of radiation, the Cosmic Microwave Background (CMB), which is considered one of the earliest snapshots of the Universe we have.
The CMB is a rich source of information about what the Universe was like just a few hundred thousand years after its creation. The CMB has now been studied in detail for more than 50 years since its discovery in 1965 by Penzias and Wilson, for which they were awarded the Nobel Prize in Physics. One of the main challenges for CMB observations is that many other forms of radiation contaminate it badly, so we must put our receivers somewhere where we can collect it without contamination. There is one such place 1.5 million kilometres away from Earth, called ‘L2’, where detectors can be placed. NASA studied the CMB from space with its COBE mission (launched in 1989, which observed from Earth orbit) and then placed WMAP near L2 in the 2000s. The European Space Agency (ESA) followed in the 2010s and put a detector called Planck close to the same point. In these missions very sophisticated instruments with enclosures were used which could maintain very low temperatures to minimize the noise. The WMAP and Planck missions provided a lot of very good quality data which was made publicly available by NASA and ESA. WMAP data was available in the middle of the 2000s, but I started to work on it only after joining IUCAA in 2010.
As is common in Physics and many other sciences, data is used to rule out theoretical models which are inconsistent with it (or to validate other models), and this is what was done with the WMAP and Planck data as well. Apart from model validation, WMAP was also used to obtain very precise numbers (in terms of uncertainty) about the Universe, such as its age, composition and geometrical structure.
What I did with the WMAP and later the Planck data was Bayesian parameter estimation. The main exercise was to compute the likelihood functions for various theoretical models and try to constrain their parameters. In one of our publications we put constraints on the parameters of a set of theoretical models. This exercise is pretty standard, and anyone who has done some work in Bayesian analysis can easily follow it. Just to mention, we used Markov Chain Monte Carlo (MCMC) sampling employing the Metropolis–Hastings algorithm. One thing is important to mention here: the data volume from the WMAP and Planck missions was not huge - it was less than 10 GB - but the processing needed was considerable, because the likelihood functions in a high-dimensional parameter space (up to 25 dimensions) were computationally expensive. To speed up the processing, multiple Monte Carlo chains were launched on different processors and the data was shared using the Message Passing Interface (MPI). Here again my experience of working with MPI and job schedulers in earlier projects was quite useful.
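For readers unfamiliar with MCMC, here is a minimal random-walk Metropolis–Hastings sketch in Python. This is a toy illustration with a made-up two-parameter Gaussian likelihood, not the CosmoMC-style machinery we actually used; the step size, chain length and burn-in cut are arbitrary assumptions.

```python
import numpy as np

def metropolis_hastings(log_likelihood, theta0, n_steps=10000, step=0.1, seed=0):
    """Random-walk Metropolis sampler: propose a Gaussian step and accept it
    with probability min(1, L_new / L_old), evaluated in log space."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    logl = log_likelihood(theta)
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.normal(0.0, step, size=theta.shape)
        logl_new = log_likelihood(proposal)
        if np.log(rng.uniform()) < logl_new - logl:   # accept/reject step
            theta, logl = proposal, logl_new
        chain.append(theta.copy())
    return np.array(chain)

# Toy usage: recover the "true" parameters (1, 2) of a 2-parameter Gaussian model.
chain = metropolis_hastings(lambda t: -0.5 * np.sum((t - [1.0, 2.0])**2), [0.0, 0.0])
print(chain[2000:].mean(axis=0))   # close to (1, 2) after discarding burn-in
```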
I was not only able to run my own MCMC jobs, I also helped a good number of other researchers run theirs. In some cases my involvement became more than just helping to run the jobs, and my contribution was acknowledged with authorship on the published papers. This helped me diversify my work. Thanks to IUCAA for providing good high-performance computing power (in teraflops), and special thanks to my supervisor for giving me full charge of a Cray CX1 supercomputer (with 72 cores of Intel Xeon processors, 1 Nvidia Tesla K40 card and 4 TB of storage). I used that Cray CX1 like my personal desktop for almost 6 years!
Two of the CMB projects I worked on at IUCAA are worth mentioning. As was well known at the time, MCMC sampling in Bayesian analysis was not only very expensive in terms of computing power, it was also very user-unfriendly! For example, the commonly available code (CosmoMC) needed a covariance matrix in advance before trying a new model. The WMAP and Planck data sets came with covariance matrices for around 100 theoretical models which people could try, with very poor provision for any new models. As one can see, there is no point in re-running the job for a model which has already been tried out, so I would say all those 100 covariance matrices were pretty useless. I found a way to cheat by creating fake covariance matrices and improving them iteratively, but I was seriously looking for another approach that was quick and robust and gave results as good as MCMC. My supervisor suggested that I try Particle Swarm Optimization. For almost one year I worked on this scheme and came up with an alternative approach to cosmological parameter estimation based on Particle Swarm Optimization. Our scheme was not only less computationally hungry, it was also robust and accurate. Our work was accepted for publication in Physical Review D, which was very encouraging. I still see other groups drawing inspiration from our work and implementing PSO for other problems. Just to mention, our work is still well received outside India, as is clear from the citations the paper gets.
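To give a flavour of how PSO works, here is a minimal sketch in Python of a swarm minimising a toy chi-square surface. The inertia and acceleration coefficients, swarm size and test function are assumptions made for illustration, not the settings from our paper.

```python
import numpy as np

def pso(cost, bounds, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimiser: each particle is pulled towards its own
    best position (pbest) and the best position found by the whole swarm (gbest)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    dim = lo.size
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_cost = x.copy(), np.array([cost(p) for p in x])
    gbest = pbest[pbest_cost.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)                 # keep particles inside the bounds
        costs = np.array([cost(p) for p in x])
        improved = costs < pbest_cost
        pbest[improved], pbest_cost[improved] = x[improved], costs[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    return gbest, pbest_cost.min()

# Toy usage: minimise a 2-D quadratic "chi-square" surface with minimum at (0.3, -1.2).
best, best_cost = pso(lambda t: np.sum((t - [0.3, -1.2])**2), ([-5, -5], [5, 5]))
print(best, best_cost)
```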
When I was working on the problem of cosmological parameters, I was seriously bothered by the approach some authors were taking of adding extra parameters just to lower the chi-square. In one of our works we argued that one should not look only at the chi-square when comparing the goodness of a fit; we proposed using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) together with the chi-square. With one of my colleagues, who was working on his Ph.D., I worked on a regularization scheme to address the issue of over-fitting in a reconstruction (inverse) problem. We ended up with a very good scheme based on Maximum Entropy regularization, and our work was well received and published in Physical Review D. The work has high visibility and is still cited by important research groups in the field.
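The idea behind these criteria is easy to state: for Gaussian errors, AIC is roughly chi^2_min + 2k and BIC is roughly chi^2_min + k ln(n), where k is the number of parameters and n the number of data points, so extra parameters are explicitly penalised. A tiny illustrative sketch (the numbers below are made up, not values from our papers):

```python
import numpy as np

def aic(chi2_min, k):
    # AIC = chi^2_min + 2k (for Gaussian errors, -2 ln L_max equals chi^2_min up to a constant).
    return chi2_min + 2 * k

def bic(chi2_min, k, n):
    # BIC = chi^2_min + k ln(n): the penalty on extra parameters grows with the data size.
    return chi2_min + k * np.log(n)

# A 5-parameter model with a slightly lower chi-square does not automatically
# beat a 3-parameter model once the penalty is included.
print(aic(52.0, 3), aic(50.5, 5))            # 58.0 vs 60.5 -> the simpler model wins
print(bic(52.0, 3, 100), bic(50.5, 5, 100))  # the gap is even larger under BIC
```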
The last part of my data story is quite interesting in terms of the exposure and experience I gained. This part is longer, mainly because a lot of what I did during this period was different from plain, simple research. It involved many different technologies which were unique to the projects I worked on. Since it is very unlikely that I will use most of these in the future, I am writing them down in some detail here, which will also help me keep track of things.
I was doing quite interesting stuff in CMB research back in 2012 when I was approached by my mentors to accept a new responsibility: helping to set up a new data center for the LIGO Scientific Collaboration (LSC) at IUCAA. The LSC was a group of around one thousand scientists spread over the entire globe with a single mission - to detect gravitational waves from binary objects (black holes and neutron stars). This involved building, maintaining and using a pair of enormous detectors (each with two 4 km arms) at two locations in the US - Hanford in Washington State and Livingston in Louisiana.
LIGO already had far bigger data centers than what we were planning, in many places - the biggest ones were in Germany, at Caltech and at the University of Wisconsin–Milwaukee. It was made clear to me that there were no financial resources and that I would need to support myself by other means, and I was promised every possible help in that matter. I sent a research proposal to the Department of Science & Technology of the Government of India, and my proposal was granted around 2,500,000 INR to support me for three years. This amount included 500,000 INR to buy a very high-end server with a GPU (Nvidia Tesla K40).
Being a member of the LIGO Scientific Collaboration was quite exciting, mainly because this was one of the most ambitious scientific projects ever conceived. The LIGO detectors aimed to push the sensitivity of the instrument to the limit of one part in 10^22! My first assignment was a field tour of the LIGO observatory at Hanford and of the computing infrastructure at several locations in the USA - Caltech, Hanford and Milwaukee. The scale of the computing infrastructure I saw at Caltech was overwhelming - there were storage facilities in the form of enclosures with racks full of tapes, their reads and writes managed by robotic arms.
When I had meetings with the technical teams at Caltech I was bombarded with many unfamiliar abbreviations. Let me share some of the funnier ones: AY - Actual Year, BCWP - Budgeted Cost of Work Performed, CD - Conceptual Design, DASWG - Data Analysis Software Working Group. Apart from these home-cooked abbreviations there were other standard technical terms whose meanings I understood only later, when I actually worked on projects related to them: GLOBUS, CONDOR, PEGASUS, SHIBBOLETH, KERBEROS, SINGLE SIGN-ON, X.509, GANGLIA, CVMFS, etc.
During my association with the LIGO group at IUCAA I worked on multiple fronts simultaneously: (1) finding out what needed to be set up by constantly interacting with the LIGO teams at Caltech and elsewhere, (2) setting up and bringing up systems and services, (3) being the first user of every service I set up, (4) documenting whatever services were made available and writing FAQs, (5) setting up authentication and authorization schemes for various resources (LDAP, Shibboleth, grid-mapfiles), (6) setting up multiple web servers for various purposes with the common LIGO authentication system, (7) transferring terabytes of data (using GridFTP and Globus Online) to IUCAA, (8) troubleshooting, (9) helping members launch scientific workflows and testing them myself, and (10) managing network security. For the first three years I did this job voluntarily - in the sense that I was not paid for it, but I was given full membership with authorship on all the papers of the collaboration. For the next two years (2016-2018) I received appropriate compensation for my work.
Although I worked on multiple projects during my association with LIGO, here I will highlight just three: the first related to identity management, the second to data management and the third to workflow management.
Within the LIGO Grid, password-based authentication was not considered secure, so an X.509 certificate-based authentication mechanism was adopted. This required users to obtain a digital certificate from their respective grid certification authorities (in India, the Indian Grid Certification Authority, or IGCA) and send its public key to LIGO. This public key was then used for all authentication purposes. My role was to guide users through the process and manage the certificates in an automated way. Since all the computational resources available within the grid were presented to users as a single pool, making sure that authentication did not break down was extremely important.
Before writing about the data management plan, it is important to know how the LIGO data is organized. At the topmost level the data was divided on the basis of observation runs, then the data for a single observation run was split by detector, and then for every detector we had data at different levels: L1, L2 and L3. The raw data, called L1, had around a hundred thousand channels in the form of time-ordered data with kHz sampling. The number of channels (representing the output of thousands of different sensors) becomes smaller as we go from L1 to L3, and so the data volume also reduces. Grid computing needs data in the form of a large number of small files rather than a small number of large files, so that the data can move easily within the grid. What is interesting to note is that out of the hundred thousand channels only a few carry the actual (science) data with signals coming from the sky; most of the channels carry information about the health of the detector and its environment. I will not go into more detail here, but anyone interested can easily find this information on the Internet.
The data management scheme was straightforward but became complex because of the need to give restricted access to different groups of users. It had three steps: (1) transfer the data over GridFTP, (2) host it on a data server - called the LIGO Data Replicator - and (3) set up clients to discover the data in a user-friendly way. The software used for this purpose had outdated documentation, so I needed to do a lot of reverse engineering and experimentation to make things work. I was able to transfer around 100 TB of data, host it, and make it available to the users. Before discussing workflow management, I would like to say something about grid computing, which is needed to understand how workflows run over a grid.
Grid computing is a way to utilize computing and storage resources spread over many different geographical locations and connected to each other over the Internet. To a common user it may look like a cluster of Linux clusters, which need not have the same configuration or architecture. The ideal workflow for a grid consists of millions of small jobs which can be executed independently. The Globus toolkit is commonly used for data transfer across different geographical locations.
With around a dozen data centers, the LIGO Data Grid forms the backbone of the LIGO computing infrastructure. At some point a senior member of the Open Science Grid (OSG) - a large grid of more than 100 computing facilities (data centers), mostly in the USA - was visiting our group. He suggested that our group join the OSG and invited me for a meeting at the University of San Diego, where he was working. I visited in 2017, had many meetings with the technical teams there, and learned a lot about the contribution of CERN - the largest physics lab in the world - to this field.
After coming back with some training in OSG systems, I worked on connecting our data center with the OSG, and that finally happened. It was made clear to me at the very beginning by the OSG team that we would need to host CVMFS (the CernVM File System) for this purpose. This is a file system which can be used to make software available on the fly! In conventional grid computing the host machines are required to have all the software components needed to execute the grid jobs, while the data can come on the fly; with CVMFS, even the host need not have the software installed. This approach is more or less what is used in many other projects as well, such as SETI@home and Einstein@Home.
LIGO has its own full software stack, consisting of many components related to physics, data management, authentication, workflow creation and presentation of results. At the heart of the analysis is a library written entirely by scientists over a period of decades to solve the mathematical equations related to the dynamics of binary objects such as black holes and neutron stars. This library is called the LIGO Algorithm Library, or LAL, and is mainly written in 'C'. Around LAL many other Python packages have been written for fetching the data, creating the workflows, and so on. Just to mention, the workflows are in the form of graphs (directed acyclic graphs, or DAGs). Once the graphs have been created they can be mapped onto any system of resources. This approach is very similar to the one adopted in TensorFlow; the difference may be that the LIGO workflows are huge - with millions of nodes - and can be launched on a system like a grid. A minimal sketch of the DAG idea follows; after that, let me describe very briefly how the signals corresponding to the merger of binary objects are discovered in the LIGO data.
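Here is a toy illustration, in Python, of the basic guarantee a DAG-based workflow planner has to provide - that every parent job runs before its children. The job names are made up; real LIGO workflows are built with tools such as Pegasus and HTCondor DAGMan, not with this snippet.

```python
from collections import defaultdict, deque

def topological_order(edges):
    """Order workflow jobs so that every parent runs before its children."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    queue = deque(n for n in nodes if indegree[n] == 0)   # jobs with no parents
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for c in children[node]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: not a DAG")
    return order

# Toy workflow: fetch data, run two template banks in parallel, then combine the results.
print(topological_order([("fetch", "bank_A"), ("fetch", "bank_B"),
                         ("bank_A", "combine"), ("bank_B", "combine")]))
```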
The main technique used by LIGO is called matched filtering, in which millions of templates of the expected signal types are created for many different physical parameters (such as mass, spin, location, orientation, etc.). These templates are matched against the observational data. When a match better than a pre-defined threshold is found, a detection is claimed and is cross-checked with the other detectors, which might have the same signal in their data if they were live during that period. The whole process is much more complex than what I have discussed here.
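As a toy illustration of the matched-filtering idea (nothing like the real LIGO pipelines, which work in the frequency domain with detailed noise models), here is a sketch in Python that hides a chirp-like template in white noise and recovers it by sliding the template over the data. The template shape, noise level and injection location are all assumptions made for the example.

```python
import numpy as np

def matched_filter_snr(data, template, noise_std=1.0):
    """Slide the template over the data and compute a normalised correlation;
    peaks above a threshold would flag candidate events."""
    template = template - template.mean()
    norm = np.sqrt(np.sum(template**2)) * noise_std      # unit-variance noise output
    return np.correlate(data, template, mode="valid") / norm

# Toy usage: bury a "chirp"-like template in white noise and recover it.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 512)
template = np.sin(2 * np.pi * (10 + 30 * t) * t) * np.exp(-3 * (1 - t))
data = rng.normal(0.0, 1.0, 8192)
data[3000:3000 + template.size] += 4 * template          # inject the signal at sample 3000
snr = matched_filter_snr(data, template)
print("peak SNR", snr.max(), "at sample", snr.argmax())  # peak should sit near 3000
```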
The LIGO team was very busy during 2014-2015 upgrading the detectors. It was speculated that once the detectors came online with better sensitivity, the chances of finding something very interesting - most probably a binary black hole or neutron star merger - were very high. No one knew that the pleasant surprise would come so soon. Within 24 hours of the detectors coming online, on September 14, 2015 at 09:50:45 UTC, something remarkable was spotted in the data. This kept the full collaboration busy for the next several months confirming and re-confirming that a binary black hole merger had been recorded.
The secret was closely guarded, and the discovery was announced through a set of high-profile press conferences across the world; simultaneously, the discovery papers and other material were made available to the public on February 11, 2016. The event was widely covered by almost all the major media outlets the next day, and the members of LIGO became celebrities overnight. Stories about many members of the collaboration (including me) were published in the regional newspapers of their respective regions. By that time the data center I was managing had everything ready, so I got the data, rediscovered the historic signal myself, and shared my excitement with the other members of the group. For the next few weeks most of the members of our group were busy giving public talks, interviews, etc. I gave at least half a dozen public talks about the discovery in many places, including COEP and MIT Pune. Within one week of the discovery announcement the Government of India approved a proposal to build a gravitational wave detector (LIGO-India) in India, and that changed everything. What happened after that is just history. At some point I started to think about switching gears and doing something different, and that ended with accepting a job as a data scientist.
When someone asks me how many years of experience I have in data science, I get completely confused. Does one become a data scientist only when one has the job title 'data scientist'? Are my fellow scientists - who inhale and exhale data, work on some of the most powerful supercomputers, and have decades-long training in mathematics and statistics - data scientists? At present it may look as if all of computer technology was developed by Google and Facebook, but if we dig deep into history we will find that the contributions of Bell Labs and CERN may dwarf Google's! I will write about data science and Physics/Astrophysics in another post. If you have read this far, I appreciate your patience and sincerely thank you!