Understanding R-Squared and the Problem with Looping Through a DataFrame
R-squared, often abbreviated as R² or r², is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable. In simpler terms, it measures how well a linear regression model fits the data.
Given this context, the question at hand revolves around calculating the R-squared value for increasingly larger numbers of rows in a dataframe using Python and the scikit-learn library.
Calculating R-Squared Without Looping Through a DataFrame
When working with regression models, particularly linear regression, we often use the fit method to train the model on our data. The fit method takes our feature (or independent variable) matrix X and our target (or dependent variable) vector y, then adjusts its parameters to minimize the difference between predictions and actual values.
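As a minimal sketch of the fit call described above, the snippet below trains a model on a tiny made-up dataset (the values are an assumption chosen so that y = 2x + 1 exactly). The point to notice is the shape of the inputs: X has one row per sample and one column per feature, while y has one entry per sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (an assumption for this sketch): y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (4, 1): 4 samples, 1 feature
y = np.array([3.0, 5.0, 7.0, 9.0])          # shape (4,): one target per sample

model = LinearRegression().fit(X, y)

print(model.intercept_)   # fitted intercept
print(model.coef_)        # fitted slope, stored as a length-1 array
print(model.score(X, y))  # R-squared on the training data
```

Because the data lie exactly on a line, the fitted intercept and slope should come out very close to 1 and 2, and the score very close to 1.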
The formula used by linear regression models to predict a value is:
[ \hat{y} = b_0 + b_1x ]
where (b_0) and (b_1) are coefficients learned during the training process. The R-squared value is then calculated using:
[ r^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} ]
where ( y_i ) are the actual values of our target variable, ( \hat{y}_i ) are the values predicted by our model, ( \bar{y} ) is the mean of the actual target values, and ( n ) is the number of observations.
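The formula translates almost term by term into NumPy. The helper below is a sketch, and the name r_squared and the four-point dataset are assumptions made for illustration. It checks two boundary cases: predictions that match the targets exactly, and predictions that always equal the mean of the targets.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # numerator: residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # denominator: total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                  # predictions equal the targets
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
print(perfect, baseline)
```

A perfect fit gives 1, while a model that does no better than predicting the mean gives 0.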
Creating a DataFrame with Sample Data
First, we need to create a sample dataframe df with two numerical columns, ‘a’ and ‘b’. In this example, column ‘a’ increases from 1 to 10 in uneven steps, while column ‘b’ increases from 2 to 13.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame()
a = [1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10]
b = [2, 4, 6, 7, 8, 9, 10, 11, 12, 13]
df['a'] = a
df['b'] = b
print(df)
Understanding and Correcting the Original Code
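The error in the original code surfaces around np.column_stack, but its root is the expression [df['a'].loc[0+n]].values(). Indexing with .loc[0+n] selects a single element of df['a'], so the result is one number rather than a Series. Wrapping that number in square brackets produces a plain Python list, and a list has no .values to call. On a pandas Series, .values is also an attribute rather than a method, so it takes no parentheses.
To calculate R-squared correctly, the model needs a two-dimensional feature matrix X with one row per observation, and a target vector y holding the matching values of ‘b’. An explicit column of ones is not required, because LinearRegression fits the intercept itself by default (fit_intercept=True). R-squared is also not defined for a single observation, so each fit needs at least two rows.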
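The difference between selecting one cell and selecting a block of rows is easy to inspect directly. The snippet below is a sketch using the same sample data; the variable names single, X and y are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10],
    'b': [2, 4, 6, 7, 8, 9, 10, 11, 12, 13],
})

single = df['a'].loc[0]                   # one cell: a scalar, not a Series
print(type(single))
print(hasattr([single], 'values'))        # a plain list has no .values

X = df[['a']].iloc[:3].to_numpy()         # double brackets keep the column dimension
y = df['b'].iloc[:3].to_numpy()           # matching targets for the same rows
print(X.shape, y.shape)
```

Selecting with df[['a']] rather than df['a'] is what keeps X two-dimensional, which is the shape scikit-learn estimators expect for features.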
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
df = pd.DataFrame()
a = [1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10]
b = [2, 4, 6, 7, 8, 9, 10, 11, 12, 13]
df['a'] = a
df['b'] = b
plt.scatter(x=df['a'], y=df['b'])
lr = LinearRegression()
# Fit on the first n rows, for n = 2 .. len(df); R-squared is undefined for one row
for n in range(2, len(df) + 1):
    # X must be 2-D: one row per sample, one column per feature
    X = df[['a']].iloc[:n].to_numpy()
    # y holds the values of 'b' for the same rows
    y = df['b'].iloc[:n].to_numpy()
    model = lr.fit(X, y)
    print(f'R Squared: {model.score(X, y)}')
R-Squared Calculation Without Looping Through DataFrames
To calculate R-squared for the entire dataset, we can use LinearRegression to fit a line through all rows of data once. However, since the goal here is to understand how the value is calculated rather than to rely on the library, we can also compute it directly from the data.
import numpy as np
# Sample data
a = np.array([1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10])
b = np.array([2, 4, 6, 7, 8, 9, 10, 11, 12, 13])
# Calculate means
mean_a = a.mean()
mean_b = b.mean()
# Least-squares slope (b1) and intercept (b0)
b1 = np.sum((a - mean_a) * (b - mean_b)) / np.sum((a - mean_a) ** 2)
b0 = mean_b - b1 * mean_a
# Predict every row at once using the linear regression formula b0 + b1*x
predicted_b = b0 + b1 * a
# Residual sum of squares and total sum of squares
ss_res = np.sum((b - predicted_b) ** 2)
ss_tot = np.sum((b - mean_b) ** 2)
# R-squared is one minus their ratio
r_squared = 1 - ss_res / ss_tot
print(r_squared)
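One way to sanity-check a hand-computed R-squared is to compare it with what scikit-learn reports for the same data. The sketch below does this for the sample columns; it uses np.polyfit for the manual least-squares line, which is a choice made here for brevity rather than something the original question used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

a = np.array([1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10], dtype=float)
b = np.array([2, 4, 6, 7, 8, 9, 10, 11, 12, 13], dtype=float)

# Manual route: least-squares line, then 1 - SS_res / SS_tot
slope, intercept = np.polyfit(a, b, deg=1)   # coefficients come highest degree first
pred = intercept + slope * a
manual = 1 - np.sum((b - pred) ** 2) / np.sum((b - b.mean()) ** 2)

# scikit-learn route: score() and r2_score() should agree with each other
X = a.reshape(-1, 1)                          # 2-D feature matrix with one column
model = LinearRegression().fit(X, b)
via_score = model.score(X, b)
via_metric = r2_score(b, model.predict(X))

print(manual, via_score, via_metric)
```

All three numbers should agree to within floating-point error, at roughly 0.96 for this dataset.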
Creating a Linear Regression Model with Increasing Number of Rows
However, the question asks us to loop through increasingly larger numbers of rows in our dataframe. This requires modifying our approach slightly.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
# Create the sample DataFrame; the loop below fits on a growing subset of its rows
a = [1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10]
b = [2, 4, 6, 7, 8, 9, 10, 11, 12, 13]
df = pd.DataFrame()
df['a'] = a
df['b'] = b
plt.scatter(x=df['a'], y=df['b'])
lr = LinearRegression()
n_rows = 2  # R-squared needs at least two rows to be defined
r_squared_values = []
while n_rows <= len(df):
    # Select the first n_rows rows for each iteration of the loop
    df_subset = df.iloc[:n_rows]
    # 2-D feature matrix; LinearRegression fits the intercept itself
    X = df_subset[['a']]
    y = df_subset['b']
    model = lr.fit(X, y)
    r_squared = model.score(X, y)
    print(f'R Squared ({n_rows} rows): {r_squared}')
    r_squared_values.append(r_squared)
    n_rows += 1
# The last fit used every row, so its score is the R-squared for the entire dataset
print("R-Squared Value:", r_squared_values[-1])
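For a single feature with an intercept, R-squared equals the squared Pearson correlation between the feature and the target, so the same sequence of values can be obtained without an explicit loop. The sketch below relies on that identity and on pandas' expanding window; it is an alternative shortcut, not the approach from the original question, and it does not carry over to models with several features.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10],
    'b': [2, 4, 6, 7, 8, 9, 10, 11, 12, 13],
})

# Correlation of 'a' and 'b' over the first 2, 3, ..., len(df) rows, then squared.
# min_periods=2 leaves the first entry as NaN, since one row has no defined R-squared.
expanding_r2 = df['a'].expanding(min_periods=2).corr(df['b']) ** 2
print(expanding_r2)
```

The entry at position k holds the R-squared of a line fitted to rows 0 through k, so the final entry corresponds to the full dataset.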
Last modified on 2023-11-23