No, you cannot. These “fake” data points don’t actually add anything to the model.
Let us demonstrate in Julia. First, we fit a simple linear model (lm) to 10 random \((x,y)\) data points, with each coordinate drawn from 1 to 100.
using DataFrames, GLM

# Draw 10 points, with x and y sampled independently and uniformly from 1:100.
x = rand(1:100, 10);
y = rand(1:100, 10);
df = DataFrame(X = x, Y = y)
## 10×2 DataFrame
##  Row │ X      Y
##      │ Int64  Int64
## ─────┼──────────────
##    1 │    67     32
##    2 │    13     87
##    3 │    66     11
##    4 │    64     15
##    5 │    43     54
##    6 │    49     74
##    7 │    70     83
##    8 │    33     17
##    9 │    67     64
##   10 │    26     50
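A side note on reproducibility: rand is unseeded above, so the exact numbers, and with them every fitted value below, will differ from run to run. To make a run repeatable, you could seed the global RNG before drawing; a minimal sketch using the Random standard library, with an arbitrary seed:
using Random
Random.seed!(2023);    # any fixed seed; place this before the rand calls
x = rand(1:100, 10);   # these draws are now the same on every run
y = rand(1:100, 10);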
m1 = lm(@formula(Y ~ X), df)
## StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
##
## Y ~ 1 + X
##
## Coefficients:
## ──────────────────────────────────────────────────────────────────────────
##                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
## ──────────────────────────────────────────────────────────────────────────
## (Intercept)    67.2677     25.6817   2.62    0.0307    8.04564     126.49
## X            -0.372846    0.480944  -0.78    0.4605    -1.4819   0.736213
## ──────────────────────────────────────────────────────────────────────────
We now have a line of best fit for \(y\) as a function of \(x\). Now we try to find a “better” model by averaging every point \(\left( x_i , y_i \right)\) against every other point \(\left( x_j , y_j \right)\). We could do this with the combinations function of the Combinatorics package (a sketch of that route appears after the matrices below), but it is cooler to see it as a matrix. Symbolically, the matrix of pairwise combinations \(C\) from the vector \(\vec{x}\) is constructed with the midpoint function \(m\) as:
\[ C = \begin{pmatrix} m(x_1, x_1) & m(x_1, x_2) & m(x_1, x_3) & \dots & m(x_1, x_n) \\ m(x_2, x_1) & m(x_2, x_2) & m(x_2, x_3) & \dots & m(x_2, x_n) \\ m(x_3, x_1) & m(x_3, x_2) & m(x_3, x_3) & \dots & m(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ m(x_n, x_1) & m(x_n, x_2) & m(x_n, x_3) & \dots& m(x_n, x_n) \end{pmatrix} \]
m(a, b) = (a + b) / 2;    # midpoint of two values

# Apply f to every ordered pair of elements of v, giving an n×n matrix.
c(v, f) = [f(v[i], v[j]) for i in eachindex(v), j in eachindex(v)];
x2 = c(x, m)
## 10×10 Matrix{Float64}:
## 67.0 40.0 66.5 65.5 55.0 58.0 68.5 50.0 67.0 46.5
## 40.0 13.0 39.5 38.5 28.0 31.0 41.5 23.0 40.0 19.5
## 66.5 39.5 66.0 65.0 54.5 57.5 68.0 49.5 66.5 46.0
## 65.5 38.5 65.0 64.0 53.5 56.5 67.0 48.5 65.5 45.0
## 55.0 28.0 54.5 53.5 43.0 46.0 56.5 38.0 55.0 34.5
## 58.0 31.0 57.5 56.5 46.0 49.0 59.5 41.0 58.0 37.5
## 68.5 41.5 68.0 67.0 56.5 59.5 70.0 51.5 68.5 48.0
## 50.0 23.0 49.5 48.5 38.0 41.0 51.5 33.0 50.0 29.5
## 67.0 40.0 66.5 65.5 55.0 58.0 68.5 50.0 67.0 46.5
## 46.5 19.5 46.0 45.0 34.5 37.5 48.0 29.5 46.5 26.0
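Note that \(C\) is symmetric and that its diagonal is just the original data, since \(m(x_i, x_i) = x_i\); the diagonal entries are duplicates of the real observations rather than new points. A quick check (issymmetric and diag come from the LinearAlgebra standard library, which is also loaded further down):
using LinearAlgebra

issymmetric(x2)    # true, because m(xᵢ, xⱼ) = m(xⱼ, xᵢ)
diag(x2) == x      # true, the diagonal reproduces the original x values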
y2 = c(y, m)
## 10×10 Matrix{Float64}:
## 32.0 59.5 21.5 23.5 43.0 53.0 57.5 24.5 48.0 41.0
## 59.5 87.0 49.0 51.0 70.5 80.5 85.0 52.0 75.5 68.5
## 21.5 49.0 11.0 13.0 32.5 42.5 47.0 14.0 37.5 30.5
## 23.5 51.0 13.0 15.0 34.5 44.5 49.0 16.0 39.5 32.5
## 43.0 70.5 32.5 34.5 54.0 64.0 68.5 35.5 59.0 52.0
## 53.0 80.5 42.5 44.5 64.0 74.0 78.5 45.5 69.0 62.0
## 57.5 85.0 47.0 49.0 68.5 78.5 83.0 50.0 73.5 66.5
## 24.5 52.0 14.0 16.0 35.5 45.5 50.0 17.0 40.5 33.5
## 48.0 75.5 37.5 39.5 59.0 69.0 73.5 40.5 64.0 57.0
## 41.0 68.5 30.5 32.5 52.0 62.0 66.5 33.5 57.0 50.0
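For reference, the combinations route mentioned earlier would look something like the sketch below. It visits each unordered pair exactly once, so it corresponds to the strictly-upper-triangle variant used further down; the names pairs, x_mid, y_mid, and df_mid are my own.
using Combinatorics

# All unordered index pairs {i, j} with i < j: 45 pairs for 10 points.
pairs = collect(combinations(1:length(x), 2));
x_mid = [m(x[i], x[j]) for (i, j) in pairs];
y_mid = [m(y[i], y[j]) for (i, j) in pairs];
df_mid = DataFrame(X = x_mid, Y = y_mid);    # the same 45 points as df3 below, up to row order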
df2 = DataFrame(X = vec(x2), Y = vec(y2))
## 100×2 DataFrame
##  Row │ X        Y
##      │ Float64  Float64
## ─────┼──────────────────
##    1 │    67.0     32.0
##    2 │    40.0     59.5
##    3 │    66.5     21.5
##    4 │    65.5     23.5
##    5 │    55.0     43.0
##    6 │    58.0     53.0
##    7 │    68.5     57.5
##    8 │    50.0     24.5
##    ⋮ │       ⋮        ⋮
##   94 │    45.0     32.5
##   95 │    34.5     52.0
##   96 │    37.5     62.0
##   97 │    48.0     66.5
##   98 │    29.5     33.5
##   99 │    46.5     57.0
##  100 │    26.0     50.0
##         85 rows omitted
Now let’s run a linear model and see what happens.
m2 = lm(@formula(Y ~ X), df2)
## StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
##
## Y ~ 1 + X
##
## Coefficients:
## ─────────────────────────────────────────────────────────────────────────
##                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
## ─────────────────────────────────────────────────────────────────────────
## (Intercept)    67.2677      7.0947   9.48    <1e-14    53.1885    81.3469
## X            -0.372846    0.137413  -2.71    0.0079  -0.645537  -0.100155
## ─────────────────────────────────────────────────────────────────────────
We get the same linear model! We even get the same \(R^2\) value:
r2(m1)
## 0.06987483032862085
r2(m2)
## 0.06987483032862063
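A quick programmatic check of the coefficients themselves (coef is exported by GLM; ≈ allows for floating-point noise):
coef(m1) ≈ coef(m2)    # should return true: identical intercept and slope
The algebra explains why. Write \(a_i = x_i - \bar{x}\) and \(b_i = y_i - \bar{y}\). The \(n^2\) midpoints have the same means \(\bar{x}\) and \(\bar{y}\) as the original data, and their centered cross-products sum to
\[ \sum_{i,j} \frac{a_i + a_j}{2} \cdot \frac{b_i + b_j}{2} = \frac{1}{4} \left( n \sum_i a_i b_i + 2 \left( \sum_i a_i \right) \left( \sum_j b_j \right) + n \sum_j a_j b_j \right) = \frac{n}{2} \sum_i a_i b_i , \]
because the centered sums \(\sum_i a_i\) and \(\sum_j b_j\) are zero. Replacing \(b\) with \(a\) shows that the centered sum of squares is scaled by the same factor \(n/2\), so the least-squares slope \(\operatorname{Cov}(X,Y) / \operatorname{Var}(X)\), and with it the intercept, is unchanged; only the apparent sample size grows. The same bookkeeping with sums over \(i < j\) (or \(i \le j\)) covers the upper-triangle variants that follow.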
We can even get the same model by taking only the upper triangle of the \(X_2\) and \(Y_2\) matrices, with or without the diagonal. The code below drops the diagonal, keeping each of the 45 distinct pairs exactly once; a sketch of the with-diagonal variant follows the fitted model.
using LinearAlgebra

# Zero out the lower triangle and the diagonal, then keep the remaining entries.
# (All midpoints here are positive, so filter(>(0), ...) drops only the zeros.)
x3 = filter(>(0), x2 - tril(x2));
y3 = filter(>(0), y2 - tril(y2));
df3 = DataFrame(X = vec(x3), Y = vec(y3))
## 45×2 DataFrame
##  Row │ X        Y
##      │ Float64  Float64
## ─────┼──────────────────
##    1 │    40.0     59.5
##    2 │    66.5     21.5
##    3 │    39.5     49.0
##    4 │    65.5     23.5
##    5 │    38.5     51.0
##    6 │    65.0     13.0
##    7 │    55.0     43.0
##    8 │    28.0     70.5
##    ⋮ │       ⋮        ⋮
##   39 │    46.0     30.5
##   40 │    45.0     32.5
##   41 │    34.5     52.0
##   42 │    37.5     62.0
##   43 │    48.0     66.5
##   44 │    29.5     33.5
##   45 │    46.5     57.0
##         30 rows omitted
m3 = lm(@formula(Y ~ X), df3)
## StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
##
## Y ~ 1 + X
##
## Coefficients:
## ──────────────────────────────────────────────────────────────────────────
##                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
## ──────────────────────────────────────────────────────────────────────────
## (Intercept)    67.2677      10.669   6.30    <1e-06    45.7515    88.7839
## X            -0.372846    0.207446  -1.80    0.0793  -0.791201  0.0455092
## ──────────────────────────────────────────────────────────────────────────
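For completeness, the with-the-diagonal variant mentioned above can be built the same way. A minimal sketch (the names x4, y4, and m4 are my own); it should reproduce the same coefficients yet again:
# Keep the upper triangle including the diagonal: the 45 pair midpoints plus
# the 10 original points (since m(xᵢ, xᵢ) = xᵢ), 55 rows in total.
x4 = [x2[i, j] for i in 1:size(x2, 1) for j in i:size(x2, 2)];
y4 = [y2[i, j] for i in 1:size(y2, 1) for j in i:size(y2, 2)];
m4 = lm(@formula(Y ~ X), DataFrame(X = x4, Y = y4))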
We should observe, however, that the \(t\)-statistics, standard errors, and confidence intervals differ across the three models. This is because the residual degrees of freedom change with the number of observations. The machinery therefore reports an unjustified level of confidence for the artificially inflated data sets: the extra rows add observations, but no information.
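To see this directly, compare the residual degrees of freedom and the reported standard errors of the three fits (dof_residual and stderror are exported by GLM):
dof_residual(m1), dof_residual(m2), dof_residual(m3)    # 8, 98, and 43, respectively
stderror(m1)    # the honest uncertainty, based on the 10 real observations
stderror(m2)    # much smaller, because the fit "sees" 100 observations
stderror(m3)    # likewise shrunk relative to m1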