No, you cannot. These “fake” data points don’t actually add anything to the model.
Let us demonstrate in Julia. First, we fit a simple linear model (lm) to 10 random \((x,y)\) data points, with each coordinate drawn from 1 to 100.
using DataFrames, GLM

# Draw 10 points, with x and y sampled independently and uniformly from 1:100.
x = rand(1:100, 10);
y = rand(1:100, 10);
df = DataFrame(X = x, Y = y)
## 10×2 DataFrame
##  Row │ X      Y
##      │ Int64  Int64
## ─────┼──────────────
##    1 │    67     32
##    2 │    13     87
##    3 │    66     11
##    4 │    64     15
##    5 │    43     54
##    6 │    49     74
##    7 │    70     83
##    8 │    33     17
##    9 │    67     64
##   10 │    26     50
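A side note on reproducibility: rand is unseeded above, so the exact numbers, and with them every fitted value below, will differ from run to run. To make a run repeatable, you could seed the global RNG before drawing; a minimal sketch using the Random standard library, with an arbitrary seed:
using Random
Random.seed!(2023);    # any fixed seed; place this before the rand calls
x = rand(1:100, 10);   # these draws are now the same on every run
y = rand(1:100, 10);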
m1 = lm(@formula(Y ~ X), df)
## StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
##
## Y ~ 1 + X
##
## Coefficients:
## ──────────────────────────────────────────────────────────────────────────
##                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
## ──────────────────────────────────────────────────────────────────────────
## (Intercept)    67.2677     25.6817   2.62    0.0307    8.04564     126.49
## X            -0.372846    0.480944  -0.78    0.4605    -1.4819   0.736213
## ──────────────────────────────────────────────────────────────────────────
We now have a line of best fit for \(y\) as a function of \(x\). Now we try to find a “better” model by averaging every point \(\left( x_i , y_i \right)\) against every other point \(\left( x_j , y_j \right)\). We could do this with the combinations function of the Combinatorics package (a sketch of that route appears after the matrices below), but it is cooler to see it as a matrix. Symbolically, the matrix of pairwise combinations \(C\) from the vector \(\vec{x}\) is constructed with the midpoint function \(m\) as:
\[ C = \begin{pmatrix} m(x_1, x_1) & m(x_1, x_2) & m(x_1, x_3) & \dots & m(x_1, x_n) \\ m(x_2, x_1) & m(x_2, x_2) & m(x_2, x_3) & \dots & m(x_2, x_n) \\ m(x_3, x_1) & m(x_3, x_2) & m(x_3, x_3) & \dots & m(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ m(x_n, x_1) & m(x_n, x_2) & m(x_n, x_3) & \dots& m(x_n, x_n) \end{pmatrix} \]
m(a, b) = (a + b) / 2;    # midpoint of two values

# Apply f to every ordered pair of elements of v, giving an n×n matrix.
c(v, f) = [f(v[i], v[j]) for i in eachindex(v), j in eachindex(v)];
x2 = c(x, m)
## 10×10 Matrix{Float64}:
## 67.0 40.0 66.5 65.5 55.0 58.0 68.5 50.0 67.0 46.5
## 40.0 13.0 39.5 38.5 28.0 31.0 41.5 23.0 40.0 19.5
## 66.5 39.5 66.0 65.0 54.5 57.5 68.0 49.5 66.5 46.0
## 65.5 38.5 65.0 64.0 53.5 56.5 67.0 48.5 65.5 45.0
## 55.0 28.0 54.5 53.5 43.0 46.0 56.5 38.0 55.0 34.5
## 58.0 31.0 57.5 56.5 46.0 49.0 59.5 41.0 58.0 37.5
## 68.5 41.5 68.0 67.0 56.5 59.5 70.0 51.5 68.5 48.0
## 50.0 23.0 49.5 48.5 38.0 41.0 51.5 33.0 50.0 29.5
## 67.0 40.0 66.5 65.5 55.0 58.0 68.5 50.0 67.0 46.5
## 46.5 19.5 46.0 45.0 34.5 37.5 48.0 29.5 46.5 26.0
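Note that \(C\) is symmetric and that its diagonal is just the original data, since \(m(x_i, x_i) = x_i\); the diagonal entries are duplicates of the real observations rather than new points. A quick check (issymmetric and diag come from the LinearAlgebra standard library, which is also loaded further down):
using LinearAlgebra

issymmetric(x2)    # true, because m(xᵢ, xⱼ) = m(xⱼ, xᵢ)
diag(x2) == x      # true, the diagonal reproduces the original x values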
y2 = c(y, m)
## 10×10 Matrix{Float64}:
## 32.0 59.5 21.5 23.5 43.0 53.0 57.5 24.5 48.0 41.0
## 59.5 87.0 49.0 51.0 70.5 80.5 85.0 52.0 75.5 68.5
## 21.5 49.0 11.0 13.0 32.5 42.5 47.0 14.0 37.5 30.5
## 23.5 51.0 13.0 15.0 34.5 44.5 49.0 16.0 39.5 32.5
## 43.0 70.5 32.5 34.5 54.0 64.0 68.5 35.5 59.0 52.0
## 53.0 80.5 42.5 44.5 64.0 74.0 78.5 45.5 69.0 62.0
## 57.5 85.0 47.0 49.0 68.5 78.5 83.0 50.0 73.5 66.5
## 24.5 52.0 14.0 16.0 35.5 45.5 50.0 17.0 40.5 33.5
## 48.0 75.5 37.5 39.5 59.0 69.0 73.5 40.5 64.0 57.0
## 41.0 68.5 30.5 32.5 52.0 62.0 66.5 33.5 57.0 50.0
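For reference, the combinations route mentioned earlier would look something like the sketch below. It visits each unordered pair exactly once, so it corresponds to the strictly-upper-triangle variant used further down; the names pairs, x_mid, y_mid, and df_mid are my own.
using Combinatorics

# All unordered index pairs {i, j} with i < j: 45 pairs for 10 points.
pairs = collect(combinations(1:length(x), 2));
x_mid = [m(x[i], x[j]) for (i, j) in pairs];
y_mid = [m(y[i], y[j]) for (i, j) in pairs];
df_mid = DataFrame(X = x_mid, Y = y_mid);    # the same 45 points as df3 below, up to row order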
df2 = DataFrame(X = vec(x2), Y = vec(y2))
## 100×2 DataFrame
##  Row │ X        Y
##      │ Float64  Float64
## ─────┼──────────────────
##    1 │    67.0     32.0
##    2 │    40.0     59.5
##    3 │    66.5     21.5
##    4 │    65.5     23.5
##    5 │    55.0     43.0
##    6 │    58.0     53.0
##    7 │    68.5     57.5
##    8 │    50.0     24.5
##    ⋮ │       ⋮        ⋮
##   94 │    45.0     32.5
##   95 │    34.5     52.0
##   96 │    37.5     62.0
##   97 │    48.0     66.5
##   98 │    29.5     33.5
##   99 │    46.5     57.0
##  100 │    26.0     50.0
##         85 rows omitted
Now let’s run a linear model and see what happens.
m2 = lm(@formula(Y ~ X), df2)
## StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
##
## Y ~ 1 + X
##
## Coefficients:
## ─────────────────────────────────────────────────────────────────────────
##                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
## ─────────────────────────────────────────────────────────────────────────
## (Intercept)    67.2677      7.0947   9.48    <1e-14    53.1885    81.3469
## X            -0.372846    0.137413  -2.71    0.0079  -0.645537  -0.100155
## ─────────────────────────────────────────────────────────────────────────
We get the same linear model! We even get the same \(R^2\) value:
r2(m1)
## 0.06987483032862085
r2(m2)
## 0.06987483032862063
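A quick programmatic check of the coefficients themselves (coef is exported by GLM; ≈ allows for floating-point noise):
coef(m1) ≈ coef(m2)    # should return true: identical intercept and slope
The algebra explains why. Write \(a_i = x_i - \bar{x}\) and \(b_i = y_i - \bar{y}\). The \(n^2\) midpoints have the same means \(\bar{x}\) and \(\bar{y}\) as the original data, and their centered cross-products sum to
\[ \sum_{i,j} \frac{a_i + a_j}{2} \cdot \frac{b_i + b_j}{2} = \frac{1}{4} \left( n \sum_i a_i b_i + 2 \left( \sum_i a_i \right) \left( \sum_j b_j \right) + n \sum_j a_j b_j \right) = \frac{n}{2} \sum_i a_i b_i , \]
because the centered sums \(\sum_i a_i\) and \(\sum_j b_j\) are zero. Replacing \(b\) with \(a\) shows that the centered sum of squares is scaled by the same factor \(n/2\), so the least-squares slope \(\operatorname{Cov}(X,Y) / \operatorname{Var}(X)\), and with it the intercept, is unchanged; only the apparent sample size grows. The same bookkeeping with sums over \(i < j\) (or \(i \le j\)) covers the upper-triangle variants that follow.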
We can even get the same model by taking only the upper triangle of the \(X_2\) and \(Y_2\) matrices, with or without the diagonal. The code below drops the diagonal, keeping each of the 45 distinct pairs exactly once; a sketch of the with-diagonal variant follows the fitted model.
using LinearAlgebra

# Zero out the lower triangle and the diagonal, then keep the remaining entries.
# (All midpoints here are positive, so filter(>(0), ...) drops only the zeros.)
x3 = filter(>(0), x2 - tril(x2));
y3 = filter(>(0), y2 - tril(y2));
df3 = DataFrame(X = vec(x3), Y = vec(y3))
## 45×2 DataFrame
##  Row │ X        Y
##      │ Float64  Float64
## ─────┼──────────────────
##    1 │    40.0     59.5
##    2 │    66.5     21.5
##    3 │    39.5     49.0
##    4 │    65.5     23.5
##    5 │    38.5     51.0
##    6 │    65.0     13.0
##    7 │    55.0     43.0
##    8 │    28.0     70.5
##    ⋮ │       ⋮        ⋮
##   39 │    46.0     30.5
##   40 │    45.0     32.5
##   41 │    34.5     52.0
##   42 │    37.5     62.0
##   43 │    48.0     66.5
##   44 │    29.5     33.5
##   45 │    46.5     57.0
##         30 rows omitted
m3 = lm(@formula(Y ~ X), df3)
## StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
##
## Y ~ 1 + X
##
## Coefficients:
## ──────────────────────────────────────────────────────────────────────────
##                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
## ──────────────────────────────────────────────────────────────────────────
## (Intercept)    67.2677      10.669   6.30    <1e-06    45.7515    88.7839
## X            -0.372846    0.207446  -1.80    0.0793  -0.791201  0.0455092
## ──────────────────────────────────────────────────────────────────────────
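For completeness, the with-the-diagonal variant mentioned above can be built the same way. A minimal sketch (the names x4, y4, and m4 are my own); it should reproduce the same coefficients yet again:
# Keep the upper triangle including the diagonal: the 45 pair midpoints plus
# the 10 original points (since m(xᵢ, xᵢ) = xᵢ), 55 rows in total.
x4 = [x2[i, j] for i in 1:size(x2, 1) for j in i:size(x2, 2)];
y4 = [y2[i, j] for i in 1:size(y2, 1) for j in i:size(y2, 2)];
m4 = lm(@formula(Y ~ X), DataFrame(X = x4, Y = y4))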
We should observe, however, that the \(t\)-statistics, standard errors, and confidence intervals differ across the three models. This is because the residual degrees of freedom change with the number of observations. The machinery therefore reports an unjustified level of confidence for the artificially inflated data sets: the extra rows add observations, but no information.
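To see this directly, compare the residual degrees of freedom and the reported standard errors of the three fits (dof_residual and stderror are exported by GLM):
dof_residual(m1), dof_residual(m2), dof_residual(m3)    # 8, 98, and 43, respectively
stderror(m1)    # the honest uncertainty, based on the 10 real observations
stderror(m2)    # much smaller, because the fit "sees" 100 observations
stderror(m3)    # likewise shrunk relative to m1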