REVERSE ENGINEERING USING: LINEAR REGRESSION ANALYSIS & Microsoft Excel 2000
REVERSE ENGINEERING USING LINEAR REGRESSION ANALYSIS

REVERSE ENGINEERING USING: LINEAR REGRESSION ANALYSIS & Microsoft Excel 2000

LINEAR REGRESSION

Analysis

by FRED FINKELSTEIN, P.E. MSCS, BSME BSCS EIT

Iron Designer LLC

Engineering Services Co.

Feb 11th 2021


ABSTRACT:


Linear Regression is a mathematical method that is used by engineers frequently to reverse engineer data from which comes from an unknown source. Such data collection from an engineering system, machine, function, prototype signal or vibrational mechanism for example requires that through engineering the data be analyzed for system level predictive analysis to within a tolerance band. An example is shown in Figure A of such a set of data that an engineer would collect and use linear regression to determine a predictive model within a tolerance band to assist the engineer in understanding and designing the system response elements, i.e. a damper, spring, sensing unit, communications module, etc. Though many other forms of regression exist, the introductory example is the linear example. This paper will provide the derivation of the equations and an example of how a spread sheet, such as Microsoft Excel can assist the engineer in developing the model equation (red line) to provide the engineer with the useful equation of the predictive model and compute the tolerance band (upper and lower) parallel band lines to determine the envelope of design for the engineer’s final result.??


Linear Regression Analysis Example Diagram:

No alt text provided for this image


Figure A: an example of data incorporating the model of linear regression as a predictor of trend, slope and central limit of “best” fit.

Citation: https://en.wikipedia.org/wiki/Linear_regression


In linear regression, the relationships are modeled using?linear predictor functions ?whose unknown model?parameters ?are?estimated ?from the?data . Such models are called?linear models .[3] ?Most commonly, the?conditional mean ?of the response given the values of the explanatory variables (or predictors) is assumed to be an?affine function ?of those values; less commonly, the conditional?median ?or some other?quantile ?is used. Like all forms of?regression analysis , linear regression focuses on the?conditional probability distribution ?of the response given the values of the predictors, rather than on the?joint probability distribution ?of all of these variables.


Citation: https://en.wikipedia.org/wiki/Linear_regression


Derivation of Linear Regression Equations:


The basic equation is the algebraic equation of a straight line:

No alt text provided for this image


The model of the regression line must lay with a slope and an intercept such that the residuals; the vertical (positive & negative) distances between the data points and the model line at any given point x that corresponds to a data point (x,y) is minimized and the sum of all residuals is = zero.


The residual is given by the equation:

No alt text provided for this image


Next the sum of the residuals must be computed by the following:

No alt text provided for this image

We must find the constants of regression of our model:???ao & a1

We must take the derivative of each summation, such that each derivative has a slope of “zero”, meaning that there is no “gain” or “loss” in the residual and the balance of the residual summation is equal.

No alt text provided for this image


Now, we have 2 equations and 2 unknowns;

No alt text provided for this image

So, this is solvable first using the “chain rule” in calculus and then using algebra to solve for the variables (ao & a1), we will close by using a matrix and Cramer’s rule to solve the final equations.

No alt text provided for this image

Cancel the –2 in both equations and move factors across the “=” to balance equations:

No alt text provided for this image


Next step is to correct equation (1). Since the summation is to “n”, then n * a0??is the summation of this portion, since a0?is a constant. And we can factor out?a1??since this too is a constant.

No alt text provided for this image


Using Matrix Algebra:


No alt text provided for this image



Using?Cramer’s Rule to solve the matrix:


No alt text provided for this image



No alt text provided for this image
No alt text provided for this image

are the average values, as denoted by the division of n in the equations, therefore a0 is the difference of the mean values, which derives the final equation of the trendline as follows:

y = a0 + a1x

the equation of the trendline.

-end derivation of the trendline from the derivatives of the residual sum of squares summations, i.e. the least squares function -

Simple Example:??“ORIGINAL COLLECTED DATA”

No alt text provided for this image


Figure 1: simple data table with 3 pairs of points


No alt text provided for this image


Figure 2: graph of the data set in Figure 1


Question is, how was the equation (y = 0.25x + 1.5) developed in the Microsoft Excel Trend Line (red line) and how did they develop the residual R2 in the equation. The residual gives the confidence of the Trend Line (red line) to the data, as an approximation of Least Squares to the Trend Line. This paper will explain the procedure to develop the Trend Line Equation for this set of data and any set of data points (pairs).



No alt text provided for this image

Create a table as shown, the headers will remain the same for any type of linear regression model.

The value in this example in the upper left corner 3.000, above the n is just the value for the number of pairs in this model, i.e. 3, but it also is called “k” in some text books, and in this example it is called “n”.


NOTE: the columns x3 and x4, i.e. log(x) & log(y) are not need for linear regression, but would be helpful for exponential regression, we will leave them here since we already have them, but note, they do not play a role in linear regression for tis type of modeling.


Also, NOTE: the column “copy-x was just used to more easily graph the data and is also not mathematically required, it is just there to simplify graphing.


Step 2:

NOTE: Several variable names are just randomly chosen and have no effect on the algorithm, the student can choose any variable names they choose.

NOTE: The student should understand that substitutions of names such as SUMX = sum(x) and SUMX2 = sum(x2) = sum(x^2), all mean the same thing, so be careful to NOTICE and make note of these substitutions **, as they will interchange often in this paper.


** i.e. k = n = K = N = 3 = 3.000, all mean the same thing in this example.

No alt text provided for this image




FORMULAS AND EQUATIONS FOR THE REMAINDER OF THE

LINEAR REGRESSION ANALYSIS

?

  1. ST3 = (K * SUMX2) – (SUMX)^2
  2. ST4 =?[ (SUMX2 * SUMY) – (SUMX * SUMXY) ]?/ ST3
  3. ST6 = ST4
  4. ST5 = (( SUMXY * K ) – ( SUMX * SUMY))??/?ST3
  5. ST1 = (SUMY * ST4) + (ST5 * SUMXY)
  6. ST2 = (SUMY)^2 / K
  7. ST9 = ST1 – ST2
  8. ST0 = SUMY2 – ST2
  9. STT = ST9 / ST0 = RR = R2 = R^2 = R2


?

Step 3:?“TEST EXAMPLE”

?

  1. ST3 = (3 * 14) – (6)^2 = 6
  2. ST4 = (14 * 6) – (6 * 12.5)??/???(6)??=??1.5
  3. ST6 = ST4 = 1.5
  4. ST5 = (12.5) * (3) – (6) * (6)???/???(6)???= 0.25
  5. ST1 = (6)(1.5) + (.25)(12.5) = (9) + 3.125)?=?12.125
  6. ST2 = (6)^2??/?3?=?36/3??=?12
  7. ST9?=?12.125??-?12??=?0.125
  8. STT = 0.125 / .5??= 0.25??=?R2??=??RR?= R2

?

?

Step 4:?“VERIFICATION of TRENDLINE”


NOTES:??

·??????In the more familiar: Y = mx + b, make note of the following to prevent confusion.

·??????b = A = ST6

·??????m = B = ST5

?

y = 1.5 +?(0.25)(x)

Equation: 3a – “equation of the TREND LINE”


Student’s note: the student should be able to distinguish between X & x and also Y and y, not that it makes any difference or distinction, other than symbols are usually “overloaded” and that x & y can mean different things in different contexts, the student must be careful to know which “x” is which and which “y” is which.


Step 5:??“TEST VALUES in a TABLE”, note: these values test the TREND LINE to verify that it “maps” and corresponds to the computed values in Step 3: and Step 4: ( a good student will ALWAYS “back check” all of their work, and leave nothing to chance or believe anything they are told, until they check it, 100%)

No alt text provided for this image


Step 6: Graphical Verification of the TREND LINE

No alt text provided for this image
No alt text provided for this image
No alt text provided for this image


Step 8:??CONCLUSIONS


Well, everything looks ok, right? After all the vertical deltas [(y2-y1) = 0], so if its zero, it must be right. Right? Not exactly. What’s wrong then ?


My Analysis:

  1. Simply because we plotted a “line” that had a “balanced” vertical-delta’s doesn’t mean that we could not have found a multitude of lines to satisfy that mathematically “weak” argument that the “sum of the vertical delta’s = 0, therefore proves we have a valid TRENDLINE. No, rigorous mathematicians would laugh at that and would most likely prove the contrary a million different ways.
  2. Our model is ridiculously simple. We only used 3 data pairs. How do we know that this algorithm would hold up under, 500 data pairs, or maybe 5 billion data pairs, we don’t have enough confidence, especially as given by the R2 value of?0.25, we have?computed by the algorithm itself to have much confidence in the TRENDLINE we have computed, nor do we fully understand the inner workings of the computation of R2 to make any kind of determination on the rigorousness of the result, so we have the R2?value in question as well.
  3. we would need to test larger data pair sets to actually determine if increasing the number of data pair sets actually increases the value of?R2 , knowing full well as good students that?R2 =1, would mean a 100% confidence level in our?TRENDLINE equation, but in this example, we only have a 25% confidence level in our TRENDLINE.
  4. What if making “vertical” measurements to satisfy the LEAST SQUARES Theorem is NOT the correct way to look at the problem, and actually many text books, suggest that lines PERPENDICULAR to the LINEAR REGRESSION line, would actually make more sense in terms of LEAST SQUARES, after all a PERPENDICULAR line to a POINT, is the SHORTEST actual path and fits more closely into the model of LEAST SQUARES, does it not ?
  5. Lastly, we simply wanted to find a TRENDLINE, or as Economist would say a “MODEL”, so we derived this equation (y = 1.5 + .25x) and we call this the TRENDLINE, but is it not TRUE that this equation is a TRENDLINE to an INFINITE family of data points, is it not TRUE that we could devise any number of 3 pairs, that would result in this TRENDLINE, so therefore this TRENDLINE is in fact NOT unique to our data set example, nor is it unique to any unique data set example, and all we would have to do is just fine one single other 3 pair data set that results in the same TRENDLINE equation and therefore, we would mathematically prove, we do NOT have a unique TRENDLINE to represent our data, and that if only had this TRENDLINE to start with, we would run into an ambiguity situation, where we could NOT reverse engineer the original data 3 pair set, and thus would NOT know if the TRENDLINE fits our particular MODEL.


-- end of paper --?

@copyright 2021

all rights reserved, Microsoft Excel 2000 was used to complete this analysis.

Iron Designer Engineering Services LLC

this report may not be copied (owner: Fred Finkelstein, P.E.)

*** if your company needs help with mathematical simulations, or business analysis, I have over 35 years of experience in Analytical design. Feel FREE to reach out to me to discuss a contract with IRONDesigner LLC @2021.

No alt text provided for this image

If your company is looking for an executive level board member to provide guidance, direction, business tactic, strategies, mergers, acquisitions, or to grow or dissolve, feel free to contact me for inclusion in your executive team

?

-------------

DISCLAIMER

Fair Use Notice: This blog may contain some copyrighted material whose use has not been authorized by the copyright owners. We believe that this not-for-profit, educational, and/or criticism or commentary use on the Web constitutes a fair use of the copyrighted material (as provided for in section 107 of the US Copyright Law. If you wish to use this copyrighted material for purposes that go beyond fair use, you must obtain permission from the copyright owner. Fair Use notwithstanding we will immediately comply with any copyright owner who wants their material removed or modified, wants us to link to their web site, or wants us to add their photo.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了