Typical data science challenges in advanced engineering

Data science as part of an engineering product development cycle is not the same as it is for financial or actuarial problems. While financial analysts have to fight with massive data sets, engineers often encounter:

  • limited data, which forces them to make statistical assumptions
  • models of physical systems that can contain unexpected non-linearities, bifurcations, or discontinuities
  • a limited number of design variable changes they can explore before the number of possible combinations becomes too large (the curse of dimensionality)
  • the risk associated with the lack of data (avoiding rare events or Black Swans like the mishaps Boeing has been encountering recently)
LIMITED DATA AND MAKING ASSUMPTIONS

Let's start with the easiest one: obtaining measurement data generally costs a lot of money. In the aviation industry, it is common to repeat measurements five times. While this is enough to estimate mean and variance, it is generally not enough to calculate statistical failure probabilities or perform a rare event analysis.
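To make this concrete, here is a tiny Python sketch of my own (the distribution, threshold and numbers are invented, not taken from any real measurement campaign): five repeats give a usable mean and standard deviation, but a direct estimate of a roughly 1-in-10,000 failure probability from those same five points is hopeless.

```python
# Minimal sketch, assuming a hypothetical measurement that happens to be Gaussian.
import numpy as np

rng = np.random.default_rng(42)
true_mean, true_std = 100.0, 5.0      # invented "true" behaviour of the quantity
failure_threshold = 118.6             # chosen so that the true P(failure) is ~1e-4

samples = rng.normal(true_mean, true_std, size=5)   # five repeated measurements
print("estimated mean:", samples.mean())
print("estimated std :", samples.std(ddof=1))

# Direct Monte Carlo estimate of the failure probability from only 5 samples:
# almost certainly 0.0, which says nothing about the true 1e-4 risk.
print("estimated P(failure):", np.mean(samples > failure_threshold))
```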

Limited measurement data generally means that the engineer in charge needs to make assumptions about the more general behavior of the system.

The majority of methods in the field of machine learning assume that a sufficiently large number of samples is available to correctly infer the probability distributions underlying the variable parameters. For industrial cases, however, time and cost constraints often mean that there is not enough data to determine these distributions properly. Constructing a reliable probability distribution consequently involves a lot of assumptions and subjectivity.

Errors caused by assuming the wrong input probability distribution are often underestimated. The figure below shows the impact the choice of distribution has on the results of a study performed on a 2D CFD simulation of a heat exchanger, run with QuickerSim, a really simple Matlab CFD solver. Both the Gaussian and the uniform distribution are acceptable assumptions if you have little data; the resulting variance in the temperature field is, however, quite different.
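The effect can be reproduced in miniature without any CFD at all. The sketch below is my own toy illustration, with an invented nonlinear function standing in for the heat-exchanger solver: two input distributions that are both plausible for the same five measurements lead to noticeably different output variances.

```python
# Hedged sketch with a toy model; the data values and the model are invented.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([0.48, 0.52, 0.55, 0.45, 0.50])   # five invented measurements of an inlet condition

def toy_model(x):
    # stand-in for the expensive CFD solver: smooth but nonlinear response
    return np.exp(2.0 * x) + 0.5 * np.sin(8.0 * x)

n = 100_000
# Assumption 1: Gaussian fitted to the sample mean and standard deviation
gauss_in = rng.normal(data.mean(), data.std(ddof=1), n)
# Assumption 2: Uniform over a range consistent with the observed spread
unif_in = rng.uniform(data.min(), data.max(), n)

print("output variance (Gaussian input):", toy_model(gauss_in).var())
print("output variance (Uniform input) :", toy_model(unif_in).var())
```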

Another difficulty is that trends can be misleading, because we want to find patterns. When we see a measurement fluctuate between 12 and 15 volts, we are quick to say it is 12 or it is 15. Take, for example, the random process in the figure below and assume that only the data up to the division line has been observed. The process seems to be periodic. Looking at the actual realisation beyond that line, however, one can see that this was a false assumption. Engineers in particular are better trained at recognising patterns than at dealing with stochastic processes, which is why I have always felt these things need to be simplified for practical engineering applications.
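If you want to play with this effect yourself, the following sketch (my own, not the process in the figure) generates a purely stochastic AR(2) process that looks convincingly periodic, fits a sinusoid to the observed segment and then extrapolates it past the division line; the prediction error typically turns out much larger than the fit error, because the "period" was never deterministic.

```python
# Illustrative sketch: an AR(2) process with poles near the unit circle looks quasi-periodic.
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = np.zeros(n)
a1, a2 = 1.8, -0.95                      # AR(2) coefficients, complex poles close to the unit circle
for t in range(2, n):
    x[t] = a1 * x[t - 1] + a2 * x[t - 2] + rng.normal(scale=0.5)

obs, fut = x[:100], x[100:]              # "division line" after 100 samples

# "Identify" the dominant frequency of the observed segment ...
freqs = np.fft.rfftfreq(obs.size)
spectrum = np.abs(np.fft.rfft(obs - obs.mean()))
f0 = freqs[spectrum.argmax()]

# ... fit a sinusoid at that frequency by linear least squares ...
t_obs = np.arange(obs.size)
A = np.column_stack([np.sin(2 * np.pi * f0 * t_obs), np.cos(2 * np.pi * f0 * t_obs), np.ones_like(t_obs)])
coef, *_ = np.linalg.lstsq(A, obs, rcond=None)

# ... and extrapolate it past the division line.
t_fut = np.arange(obs.size, n)
A_fut = np.column_stack([np.sin(2 * np.pi * f0 * t_fut), np.cos(2 * np.pi * f0 * t_fut), np.ones_like(t_fut)])
pred = A_fut @ coef

print("fit error on observed data :", np.sqrt(np.mean((A @ coef - obs) ** 2)))
print("prediction error afterwards:", np.sqrt(np.mean((pred - fut) ** 2)))
```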

While doing research with Rolls-Royce and Airbus, I have often heard that the maximum acceptable time frame for a simulation or optimisation analysis in industry is about two weeks. Ideally, the results of a simulation should be available within a few working days.

Receiving reliable results on a standard workstation this fast is only possible with a very efficient, intelligently automated method and a reasonably fast model. At the same time, computing resources, in particular access to high-performance computing, are becoming more and more common throughout many industries. Computational design exploration methods will benefit greatly from this development, because parallelising a Monte Carlo analysis is trivial: individual simulations are always independent and can be submitted to different machines. Nevertheless, efficiency still matters when you look at the computational cost, since most models are too expensive to allow thousands of simulation runs.
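A minimal sketch of what "trivially parallel" means in practice is shown below; the simulation function is just a placeholder for an expensive solver call, and the pool of eight local processes stands in for jobs submitted to separate HPC nodes.

```python
# Sketch of an embarrassingly parallel Monte Carlo analysis; the "simulation" is a placeholder.
import numpy as np
from multiprocessing import Pool

def run_simulation(sample):
    # stand-in for one expensive solver run with one set of input parameters
    x, y = sample
    return np.sin(x) * np.cos(y) + 0.1 * x * y

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    samples = rng.normal(size=(10_000, 2))        # 10,000 independent input samples

    with Pool(processes=8) as pool:               # 8 workers; scale to whatever hardware you have
        results = pool.map(run_simulation, samples)

    print("mean response    :", np.mean(results))
    print("response variance:", np.var(results))
```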

NON-LINEARITIES, BIFURCATIONS AND DISCONTINUITIES

Even more frustrating than not being able to use normal distributions for everything is the fact that, when sampling a design space, one can miss critical features of it. When building an aircraft, this can be dangerous, for instance by leaving flutter undetected.
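A toy example of how this happens: the invented response below has a narrow critical spike (think of a thin flutter boundary), and an evenly spaced sweep with eleven design points steps straight over it.

```python
# Illustrative sketch with an invented response function.
import numpy as np

def response(x):
    # smooth background plus a narrow "dangerous" spike around x = 0.62
    return np.sin(3 * x) + 8.0 * np.exp(-((x - 0.62) / 0.01) ** 2)

coarse = np.linspace(0.0, 1.0, 11)      # 11 evenly spaced design points
fine = np.linspace(0.0, 1.0, 10_001)    # near-exhaustive reference sweep

print("max response seen by coarse sweep:", response(coarse).max())
print("actual maximum response          :", response(fine).max())
```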

In the above example, the problem is not too severe, but some computational models contain instabilities, bifurcations or sharp gradients across certain regions of their computational domain. An important example from supersonic aircraft engineering and turbomachinery is compressible flow simulation, where discontinuities can be encountered.

The left graph in the figure below shows a system of shocks in transonic flow within a gas turbine. The right graph shows a corresponding model response surface in which a shock divides the domain.
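The modelling difficulty this creates can be illustrated with a toy example (mine, not the turbine case above): a smooth global surrogate, here a plain least-squares polynomial, cannot represent a shock-like jump and smears or overshoots around the discontinuity instead.

```python
# Hedged sketch: a polynomial surrogate fitted across an invented shock-like jump.
import numpy as np

def shock_response(x):
    # toy stand-in for a flow quantity with a jump at x = 0.5
    return np.where(x < 0.5, 1.0 + 0.2 * x, 0.4 + 0.2 * x)

x_train = np.linspace(0.0, 1.0, 21)
y_train = shock_response(x_train)

# smooth global surrogate: degree-9 polynomial fitted by least squares
surrogate = np.poly1d(np.polyfit(x_train, y_train, deg=9))

x_test = np.linspace(0.0, 1.0, 9)
for x, t, s in zip(x_test, shock_response(x_test), surrogate(x_test)):
    print(f"x = {x:.3f}   true = {t:.3f}   surrogate = {s:.3f}")
```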

RISK, RELIABILITY AND RARE EVENTS

Risk analysis using Gaussian distributions is the most popular approach in engineering. However, it can give a false sense of security: the assumption of normality removes the possibility of accounting for rare events in the model. Events that lie five or more standard deviations from the mean are extremely unlikely under a normal distribution, often far more unlikely than they are in reality. If such rare events have catastrophic consequences, they are called Black Swans, as suggested by Taleb. Black Swans have so far been studied mostly in mathematical finance to explain catastrophic market crashes. The reason can be seen in the figure below: to the naked eye, fat-tailed probability distributions look almost identical to Gaussian distributions. When examined closely, however, as in the right graph, one can see that rare outliers, while unlikely, are very much possible.
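A quick numerical illustration with SciPy makes the difference tangible; the Student-t distribution with three degrees of freedom is an arbitrary but common stand-in for a fat-tailed distribution.

```python
# Tail probabilities at the same abscissa for a Gaussian and a fat-tailed distribution.
from scipy import stats

p_gauss = stats.norm.sf(5)      # P(X > 5) for a standard normal
p_fat = stats.t.sf(5, df=3)     # P(X > 5) for a Student-t with 3 degrees of freedom

print(f"Gaussian tail  P(X > 5): {p_gauss:.2e}")   # roughly 3e-07
print(f"Student-t tail P(X > 5): {p_fat:.2e}")     # roughly 8e-03, several orders of magnitude larger
```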

While physical laws and even aleatory input conditions can be discovered and incorporated into the simulation process, some unknown factors inevitably remain outside the scope of simulation entirely.  

THE CURSE OF DIMENSIONALITY

One of the major problems in design exploration in engineering is the so-called curse of dimensionality. It is a problem that engineering shares with many other mathematical disciplines that need to analyse high-dimensional spaces, such as numerical analysis, optimisation and machine learning. Although the applications vary, the core problem is always the same: the amount of data needed to obtain reliable results grows exponentially with the number of dimensions. With three samples per dimension, the number of samples required for twenty dimensions is already higher than 3.4 billion. For up to three dimensions this growth can still be visualised; the figure below shows the number of design collocation points that would be needed to explore a one-, two- or three-dimensional design space. A tip from me: sparse designs of experiment, such as the widely used Smolyak rule, can alleviate the curse considerably.
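The arithmetic is easy to check:

```python
# A full tensor grid with 3 points per dimension grows as 3^d,
# which exceeds 3.4 billion points at 20 dimensions.
for d in (1, 2, 3, 5, 10, 20):
    print(f"{d:2d} dimensions -> {3 ** d:,} grid points")
```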
