Data science vs. Geophysics: Two intertwined fields
The deepest artificial hole on Earth, the Kola Superdeep Borehole, is only about 12 km deep, whereas the Earth has a radius of 6,371 km. Given the extreme pressures and temperatures in the Earth’s interior, it is not possible to investigate its structure directly beyond this depth, so geophysicists must rely on indirect observations to understand it. Similarly, in data science, there is no direct information about the relationship between the features and the target variable, especially in a multivariate setting.
To overcome this problem and model the deeper Earth, researchers use geophysical techniques. One of these is seismic tomography, which infers the structure of the Earth’s interior from recordings made at its surface: data from seismic waves travelling through the Earth are gathered using seismometers and used to produce tomographic images of the interior. These data are then analysed using machine learning techniques to extract useful information about the Earth’s interior. The observables, in this case, are seismograms, which can be treated as unstructured data (seismic waveforms) or structured data (arrival times, i.e. the time taken for a seismic wave to arrive at the Earth’s surface). If we take data science to be the “field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data”, geophysical problems match this definition: we want to extract knowledge and insights about the Earth’s subsurface from structured and/or unstructured data, exploiting only indirect information obtained at the surface. Thus, both data science and geophysics utilise scientific methods to transform the information contained in data into useful insights.
In practice, both data science and geophysical problems often require a cost function to be minimised in order to reconcile the observed data with the modelled data by adjusting the model parameters (coefficients). This cost function can range from a simple ordinary least squares misfit to a more complicated one, such as the loss used to train a deep neural network. In geophysics, the inverse problem usually includes a physical forward model, such as a solution of the wave equation, that predicts the data from the model parameters. In data science, the observations consist of features and a target to be modelled, and the model parameters are the coefficients to be adjusted. This physical forward model is a significant difference between the two fields: in data science there might not be a deterministic mathematical relationship linking the modelled data to the observed data.
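As a minimal sketch of the idea above, the following toy example minimises an ordinary least squares cost function for a linear problem. The matrix `G`, the "true" model `m_true` and the noise level are hypothetical values chosen purely for illustration; a real geophysical forward model (for example, a wave-equation solver) would be far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear forward problem d = G m (illustrative values only).
G = rng.normal(size=(50, 3))           # 50 observations, 3 model parameters
m_true = np.array([2.0, -1.0, 0.5])    # known only because this is a toy case
d_obs = G @ m_true + 0.01 * rng.normal(size=50)  # observations with small noise


def cost(m):
    """Least-squares misfit between observed and modelled data."""
    residual = d_obs - G @ m
    return 0.5 * residual @ residual


# Ordinary least squares: the model parameters that minimise the cost function.
m_est, *_ = np.linalg.lstsq(G, d_obs, rcond=None)

print("cost at zero model:", cost(np.zeros(3)))
print("cost at estimate:  ", cost(m_est))
print("estimated model:   ", m_est)
```

The same pattern applies in a regression setting if `G` is read as the feature matrix, `d_obs` as the target and `m_est` as the fitted coefficients.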
The inversion step involves adjusting the model parameters so that the model predictions match the data observations, i.e. making inferences from data. Minimising the cost function often requires an optimisation technique such as gradient descent or the conjugate gradient method. Adjusting the model parameters involves numerous challenges, including solution non-uniqueness and the non-linear nature of the inverse problem. In geophysical problems, regularisation terms such as damping and smoothing are introduced to tackle non-uniqueness by reducing a potentially infinite pool of data-fitting models to one that retains the most desirable properties. These ad hoc parameters are often selected at the point of maximum curvature of an L-curve, which plots the model norm against the data misfit for a range of regularisation strengths. If the data fit were allowed to match the observations freely, without explicitly choosing a damping parameter, then because the problem is underdetermined the solution would probably converge to a local minimum that satisfies, for example, only a subset of the data observations. Similarly, in data science, a damping factor (regularisation term) is introduced to reduce the variance of the model, which otherwise leads to overfitting to a particular subset of the observations (the data are often split into training and test sets). However, the high-variance problem in data science is usually tackled differently, using K-fold cross-validation and/or hyperparameter tuning.
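The effect of damping on an underdetermined problem can be sketched as follows. This is an illustrative toy example, not a tomography code: the problem size, the noise level, the range of damping values and the function name `damped_least_squares` are all assumptions. It computes the two quantities that would be plotted on an L-curve (data misfit and model norm) but leaves the choice of the corner to the reader, and it checks that plain gradient descent reaches the same damped solution as the closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical underdetermined problem: 10 observations, 30 unknowns.
G = rng.normal(size=(10, 30))
m_true = rng.normal(size=30)
d_obs = G @ m_true + 0.05 * rng.normal(size=10)


def damped_least_squares(G, d, damping):
    """Minimise ||d - G m||^2 + damping^2 * ||m||^2 in closed form."""
    n_params = G.shape[1]
    lhs = G.T @ G + damping**2 * np.eye(n_params)
    return np.linalg.solve(lhs, G.T @ d)


# Trade-off between data misfit and model norm (the two axes of an L-curve).
dampings = np.logspace(-2, 2, 9)
misfit_norms, model_norms = [], []
for damping in dampings:
    m = damped_least_squares(G, d_obs, damping)
    misfit_norms.append(np.linalg.norm(d_obs - G @ m))
    model_norms.append(np.linalg.norm(m))
    print(f"damping={damping:8.2f}  misfit={misfit_norms[-1]:.4f}  "
          f"model norm={model_norms[-1]:.4f}")

# Gradient descent on the same damped cost function, for one damping value.
damping = 1.0
hessian = G.T @ G + damping**2 * np.eye(G.shape[1])
step = 1.0 / np.linalg.eigvalsh(hessian).max()  # safe step size for a quadratic
m_gd = np.zeros(G.shape[1])
for _ in range(2000):
    gradient = hessian @ m_gd - G.T @ d_obs
    m_gd = m_gd - step * gradient

m_closed = damped_least_squares(G, d_obs, damping)
print("max difference, gradient descent vs closed form:",
      np.abs(m_gd - m_closed).max())
```

Stronger damping gives a smaller model norm at the price of a larger data misfit. In a data science setting the same penalty is known as ridge (L2) regularisation, and the damping value would typically be chosen by cross-validation rather than from an L-curve.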
Last but not least, domain knowledge is extremely important in both data science and geophysical problems. It adds value to the findings and helps ensure that the data processing was performed correctly, and hence is vital for extracting meaningful information. For example, understanding the Earth’s structure is one of the most significant problems that scientists have to study in further detail. With such understanding, phenomena such as earthquake locations could be forecast and precautions could be taken to avoid dangerous consequences. Similarly, in data science, devising meaningful features (feature engineering) requires a very good understanding of the domain, so that the features have as much predictive power as possible. However, many of the applications based on the results of geophysical studies may have only an indirect impact on the real world. In any case, both data science and geophysics can have a real-world impact or can contribute to further understanding of a particular field.
In conclusion, given the formulation of the inverse problem described above, we can observe consistent similarities between geophysical problems and data science, both of which make use of various machine learning techniques. Applying machine learning techniques to learn insights from data without being explicitly programmed provides common ground between the two fields. The use of scientific methods to infer information indirectly from structured or unstructured data, together with the importance of understanding the domain, makes data science and geophysical problems intertwined.