Linear regression for housing data using cross-validation, grid search, randomized search, and all methods combined
Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. In this article we apply it to housing data, where linear regression can be used to predict the median house value from features such as the number of rooms, the location, and the age of the house. By analyzing these features and their impact on the target variable, we can build a linear regression model that predicts the median house value.
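To make this concrete, a linear model predicts the target as an intercept plus a weighted sum of the features. Here is a minimal sketch; the coefficient and feature values below are hypothetical placeholders (assuming standardized features, as used later in the article), not values estimated from the housing data:
import pandas as pd  # not needed for this sketch, but used throughout the article
# Minimal sketch of what a fitted linear regression model computes.
# All numbers here are hypothetical placeholders.
intercept = 0.0
coefficients = {'median_income': 0.7, 'housing_median_age': 0.1, 'total_rooms': -0.05}
house = {'median_income': 0.5, 'housing_median_age': -1.2, 'total_rooms': 0.3}
predicted_median_house_value = intercept + sum(
    coefficients[name] * house[name] for name in coefficients)
print(predicted_median_house_value)  # 0.7*0.5 + 0.1*(-1.2) + (-0.05)*0.3 = 0.215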
In this particular case, we have a dataset with 20640 rows and 10 columns, with the target variable being the median_house_value. Our goal is to use linear regression to build a model that accurately predicts the median house value based on the other features in the dataset. By analyzing the data, selecting relevant features, and training the model, we can create a powerful tool for predicting housing prices and making informed decisions in the real estate industry.
Columns: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity.
Here's an outline of the article:
I. Introduction
II. Train a linear regression model on the training set
III. Train a linear regression model using cross-validation
IV. Train a linear regression model using grid search and cross-validation
V. Train a linear regression model using randomized search and cross-validation
VI. Analyze results
VII. Combine all code in one cell
VIII. Visualize all plots in one cell
IX. Conclusion
import pandas as pd
# Load housing data from CSV file
housing_data = pd.read_csv('housing.csv')
# Display some basic information about the data
print("Number of rows:", len(housing_data))
print("Data types:\n", housing_data.dtypes)
print("Column names:\n", housing_data.columns)
# Display summary statistics about the data
print("Summary statistics:\n", housing_data.describe())
# Quick look at the data structure
print("First 5 rows of the data:\n", housing_data.head())
# Count the occurrences of each category in the ocean_proximity column
ocean_proximity_counts = housing_data['ocean_proximity'].value_counts()
Here's a step-by-step explanation of the code:
1. `import pandas as pd`: This line imports the pandas library and assigns it the alias 'pd', which is a common convention. Pandas is a powerful data manipulation and analysis library that provides data structures and functions needed to work with structured data.
2. `housing_data = pd.read_csv('housing.csv')`: This line reads the contents of a CSV file named 'housing.csv' using the `pd.read_csv()` function and stores the data in a DataFrame called 'housing_data'. A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table.
3. `print("Number of rows:", len(housing_data))`: This line prints the number of rows in the 'housing_data' DataFrame. The `len()` function is used to find the length of the DataFrame, which represents the number of rows.
4. `print("Data types:\n", housing_data.dtypes)`: This line prints the data types of each column in the 'housing_data' DataFrame. The `dtypes` attribute is used to retrieve this information.
5. `print("Column names:\n", housing_data.columns)`: This line prints the names of the columns in the 'housing_data' DataFrame. The `columns` attribute is used to retrieve this information.
import matplotlib.pyplot as plt
# Create a bar chart of the counts
plt.bar(ocean_proximity_counts.index, ocean_proximity_counts.values)
# Add labels and title
plt.title('Counts of Ocean Proximity Categories')
plt.xlabel('Ocean Proximity')
plt.ylabel('Count')
# Display the chart
plt.show()
# Loop over each column in the DataFrame
for column in housing_data.columns:
    # Check if the column contains numerical data
    if pd.api.types.is_numeric_dtype(housing_data[column]):
        # Create a histogram of the column
        plt.hist(housing_data[column], bins=50)
        # Add labels and title
        plt.title(f'Histogram of {column}')
        plt.xlabel(column)
        plt.ylabel('Count')
        # Display the chart
        plt.show()
import seaborn as sns
# Select the numerical variables in the housing data
numerical_variables = housing_data.select_dtypes(include=['int64', 'float64'])
# Create a pair plot of the numerical variables
sns.pairplot(numerical_variables)
plt.show()
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(housing_data, test_size=0.25, random_state=42)
# Display the number of rows in the training and testing sets
print("Number of rows in training set:", len(train_data))
print("Number of rows in testing set:", len(test_data))
# Plot the distribution of median_house_value in the training set
plt.hist(train_data['median_house_value'], bins=50, alpha=0.5, label='Training set')
# Plot the distribution of median_house_value in the testing set
plt.hist(test_data['median_house_value'], bins=50, alpha=0.5, label='Testing set')
# Add labels and title
plt.title('Distribution of Median House Value in Training and Testing Sets')
plt.xlabel('Median House Value ($)')
plt.ylabel('Count')
plt.legend()
# Display the chart
plt.show()
The above code performs the following tasks:
- It imports the `train_test_split` function from the `sklearn.model_selection` module to split the dataset into training and testing sets.
- It splits the `housing_data` into training and testing sets with a test size of 0.25 (i.e., 25% of the data is used for testing) and sets a random seed of 42 using `random_state`.
- It prints the number of rows in the training and testing sets.
- It creates two histograms of the `median_house_value` column in the training and testing sets using `plt.hist()` from the `matplotlib.pyplot` module.
- It adds labels, a title, and a legend to the histograms using `plt.title()`, `plt.xlabel()`, `plt.ylabel()`, and `plt.legend()`.
- It displays the histograms using `plt.show()`.
## Train a linear regression model on the training set
To analyze and make predictions about housing data, the first step is to collect and organize the data. This can be done using a CSV file that contains information on various features of different houses. Once the data is collected, it is important to ensure that it is complete and accurate. In order to do this, any missing values in the data can be filled with the most frequent value in each column. This will help to ensure that there are no gaps in the data, which could lead to inaccurate predictions later on.
The next step in preparing the housing data for analysis is to convert any categorical data into numerical data using one-hot encoding. This technique creates a binary indicator column for each category, making the data easier to analyze and compare, as the short sketch below illustrates. Once the data has been converted, the numerical columns should be scaled using StandardScaler, which puts all features on a common scale; this makes coefficients comparable and helps many algorithms behave predictably.
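Here is a minimal sketch of what one-hot encoding produces on a toy column; the rows are hypothetical, but the categories match those that appear in ocean_proximity:
import pandas as pd
# Toy example of one-hot encoding: each category becomes its own 0/1
# indicator column (rows here are hypothetical).
toy = pd.DataFrame({'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND']})
print(pd.get_dummies(toy, columns=['ocean_proximity']))
# Output has columns ocean_proximity_INLAND and ocean_proximity_NEAR BAY.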
After the data has been collected, organized, and prepared, it is time to split the data into training and testing sets. This allows the machine learning model to learn from the data and then be tested on a separate set of data to evaluate its accuracy. The features and target variable must also be defined for the model, so that it knows which variables to focus on when making predictions.
Once the data is prepared and the model is set up, it is time to train the model using linear regression, a popular machine learning technique that models the target variable as a linear function of the features. After training on the training set, evaluating the model on that same data gives a first measure of fit; performance on the held-out test set, checked later, is what shows how well the model generalizes to new housing data.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load housing data from CSV file
housing_data = pd.read_csv('housing.csv')
# Fill missing values with the most frequent value in each column
imputer = SimpleImputer(strategy='most_frequent')
housing_data = pd.DataFrame(imputer.fit_transform(housing_data), columns=housing_data.columns)
# Convert categorical data to numerical data using one-hot encoding
housing_data = pd.get_dummies(housing_data, columns=['ocean_proximity'])
# Scale the numerical data using StandardScaler
# (note: strictly speaking, fitting the scaler on the full dataset before the
# split leaks test-set statistics; fitting on the training split only is safer)
scaler = StandardScaler()
numeric_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                   'population', 'households', 'median_income', 'median_house_value']
housing_data[numeric_columns] = scaler.fit_transform(housing_data[numeric_columns])
# Split the data into training and testing sets
train_data, test_data = train_test_split(housing_data, test_size=0.25, random_state=42)
# Define the features and target variable for the model
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
            'population', 'households', 'median_income', 'ocean_proximity_<1H OCEAN',
            'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
            'ocean_proximity_NEAR OCEAN']
target = 'median_house_value'
# Train a linear regression model on the training set
model = LinearRegression()
model.fit(train_data[features], train_data[target])
# Evaluate the model on the training set (squared=False returns the RMSE)
train_predictions = model.predict(train_data[features])
train_rmse = mean_squared_error(train_data[target], train_predictions, squared=False)
print('Training set RMSE:', train_rmse)
Result:
Training set RMSE: 0.5944623639468501
## Train a linear regression model using cross-validation
To get a more reliable picture of how well the linear regression model will perform on unseen data, it is recommended to use cross-validation. This technique splits the training data into several folds, repeatedly training the model on all but one fold and validating it on the held-out fold. The spread of the resulting scores shows how stable the model's performance is across different subsets of the data.
To train a linear regression model using cross-validation, the first step is to define the model. This can be done with scikit-learn's LinearRegression class. Once the model is defined, the next step is to use the cross_val_score() function to score the model on the training data, using the negative mean squared error as the scoring metric.
The cross_val_score() function takes several parameters, including the model to be trained, the features and target variable to be used, the number of folds for the cross-validation (in this case, cv=5), and the scoring metric to be used (in this case, 'neg_mean_squared_error').
After training the model on each subset of data, the next step is to calculate the root mean squared error (RMSE) of the model using np.sqrt(-scores). This will help to evaluate the accuracy of the model on the training data.
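The sign flip is scikit-learn's convention: scorers are always maximized, so the mean squared error is reported as a negative number. A minimal sketch with hypothetical fold scores shows the conversion:
import numpy as np
# Hypothetical output of cross_val_score with scoring='neg_mean_squared_error':
# the values are negative MSEs, so negate them before taking the square root.
scores = np.array([-0.36, -0.36, -0.35, -0.35, -0.36])
rmse_scores = np.sqrt(-scores)
print(rmse_scores)  # per-fold RMSE values, e.g. [0.6 0.6 0.5916 0.5916 0.6]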
Cross-validation does not by itself make the model more accurate, but it provides a trustworthy estimate of its accuracy on new data, which is essential when comparing models and tuning hyperparameters.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load housing data from CSV file
housing_data = pd.read_csv('housing.csv')
# Fill missing values with the most frequent value in each column
imputer = SimpleImputer(strategy='most_frequent')
housing_data = pd.DataFrame(imputer.fit_transform(housing_data), columns=housing_data.columns)
# Convert categorical data to numerical data using one-hot encoding
housing_data = pd.get_dummies(housing_data, columns=['ocean_proximity'])
# Scale the numerical data using StandardScaler
scaler = StandardScaler()
numeric_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                   'population', 'households', 'median_income', 'median_house_value']
housing_data[numeric_columns] = scaler.fit_transform(housing_data[numeric_columns])
# Define the features and target variable for the model
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
            'population', 'households', 'median_income', 'ocean_proximity_<1H OCEAN',
            'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
            'ocean_proximity_NEAR OCEAN']
target = 'median_house_value'
# Split the data into training and testing sets
train_data, test_data = train_test_split(housing_data, test_size=0.25, random_state=42)
# Train a linear regression model using cross-validation
model = LinearRegression()
scores = cross_val_score(model, train_data[features], train_data[target], cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
# Print the cross-validation scores
print('Cross-validation RMSE scores:', rmse_scores)
print('Mean RMSE score:', rmse_scores.mean())
# Train a linear regression model on the training set
model.fit(train_data[features], train_data[target])
# Evaluate the model on the training set
train_predictions = model.predict(train_data[features])
train_rmse = mean_squared_error(train_data[target], train_predictions, squared=False)
print('Training set RMSE:', train_rmse)
# Evaluate the model on the test set
test_predictions = model.predict(test_data[features])
test_rmse = mean_squared_error(test_data[target], test_predictions, squared=False)
print('Test set RMSE:', test_rmse)
Result:
Cross-validation RMSE scores: [0.60320196 0.60033186 0.59034827 0.58958832 0.59734325]
Mean RMSE score: 0.5961627314011231
Training set RMSE: 0.5944623639468501
Test set RMSE: 0.5997029527266164
## Train a linear regression model using grid search and cross-validation
Another way to potentially improve the linear regression model is to use grid search in combination with cross-validation. This technique trains the model on every combination of the candidate hyperparameters and keeps the combination with the best cross-validated score.
To train a linear regression model using grid search and cross-validation, the first step is to define the model and the hyperparameters to be tested. The model can be created with scikit-learn's LinearRegression class, and the candidate hyperparameter values are listed in a param_grid dictionary.
The next step is to use the GridSearchCV class to perform the grid search. Its constructor takes several parameters, including the model to be trained, the hyperparameter grid to be tested, the number of folds for the cross-validation (in this case, cv=5), and the scoring metric to be used (in this case, 'neg_mean_squared_error').
After performing the grid search, the best parameters and the corresponding score can be printed using the best_params_ and best_score_ attributes of the grid_search object.
To evaluate the accuracy of the model on the training and test sets, the next step is to use the best_estimator_ attribute of the grid_search object to predict the target variable using the test and training data. The root mean squared error (RMSE) of the predictions can then be calculated using the mean_squared_error() function in Python.
By combining grid search with cross-validation, every hyperparameter combination is evaluated on held-out folds, so the selected settings are the ones that generalize best rather than the ones that merely fit the training data.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load housing data from CSV file
housing_data = pd.read_csv('housing.csv')
# Fill missing values with the most frequent value in each column
imputer = SimpleImputer(strategy='most_frequent')
housing_data = pd.DataFrame(imputer.fit_transform(housing_data), columns=housing_data.columns)
# Convert categorical data to numerical data using one-hot encoding
housing_data = pd.get_dummies(housing_data, columns=['ocean_proximity'])
# Scale the numerical data using StandardScaler
scaler = StandardScaler()
numeric_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                   'population', 'households', 'median_income', 'median_house_value']
housing_data[numeric_columns] = scaler.fit_transform(housing_data[numeric_columns])
# Define the features and target variable for the model
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
            'population', 'households', 'median_income', 'ocean_proximity_<1H OCEAN',
            'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
            'ocean_proximity_NEAR OCEAN']
target = 'median_house_value'
# Split the data into training and testing sets
train_data, test_data = train_test_split(housing_data, test_size=0.25, random_state=42)
# Train a linear regression model using grid search and cross-validation
model = LinearRegression()
param_grid = {'fit_intercept': [True, False], 'positive': [True, False]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(train_data[features], train_data[target])
# Print the best parameters and the corresponding score
print('Best parameters:', grid_search.best_params_)
print('Best score:', np.sqrt(-grid_search.best_score_))
# Evaluate the model on the training set
train_predictions = grid_search.best_estimator_.predict(train_data[features])
train_rmse = mean_squared_error(train_data[target], train_predictions, squared=False)
print('Training set RMSE:', train_rmse)
# Evaluate the model on the test set
test_predictions = grid_search.best_estimator_.predict(test_data[features])
test_rmse = mean_squared_error(test_data[target], test_predictions, squared=False)
print('Test set RMSE:', test_rmse)
Result:
Best parameters: {'fit_intercept': True, 'positive': False}
Best score: 0.5961871129758642
Training set RMSE: 0.5944623639468501
Test set RMSE: 0.5997029527266164
## Train a linear regression model using randomized search and cross-validation
Another approach to tuning hyperparameters for a linear regression model is to use randomized search in conjunction with cross-validation. This technique randomly samples hyperparameter combinations, trains the model on each sample, and keeps the combination that produces the best results.
To perform this method, the first step is to define the linear regression model to be used and the hyperparameters that will be tested. The candidate hyperparameters are listed in a param_distributions dictionary, and the model is created with scikit-learn's LinearRegression class.
The next step is to use the RandomizedSearchCV class, which performs the randomized search. It requires several parameters, including the model to be trained, the hyperparameters to be tested, the number of cross-validation folds to be used (in this case, cv=5), the number of random combinations to test (in this case, n_iter=20), and the scoring metric to be used (in this case, 'neg_mean_squared_error').
After the randomized search is completed, the best parameters and their corresponding scores can be obtained using the best_params_ and best_score_ attributes of the random_search object, respectively.
Because randomized search evaluates only a sample of the hyperparameter space, it is cheaper than an exhaustive grid when the space is large, while still giving a good chance of finding a strong combination.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load housing data from CSV file
housing_data = pd.read_csv('housing.csv')
# Fill missing values with the most frequent value in each column
imputer = SimpleImputer(strategy='most_frequent')
housing_data = pd.DataFrame(imputer.fit_transform(housing_data), columns=housing_data.columns)
# Convert categorical data to numerical data using one-hot encoding
housing_data = pd.get_dummies(housing_data, columns=['ocean_proximity'])
# Scale the numerical data using StandardScaler
scaler = StandardScaler()
numeric_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                   'population', 'households', 'median_income', 'median_house_value']
housing_data[numeric_columns] = scaler.fit_transform(housing_data[numeric_columns])
# Define the features and target variable for the model
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
            'population', 'households', 'median_income', 'ocean_proximity_<1H OCEAN',
            'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
            'ocean_proximity_NEAR OCEAN']
target = 'median_house_value'
# Split the data into training and testing sets
train_data, test_data = train_test_split(housing_data, test_size=0.25, random_state=42)
# Train a linear regression model using randomized search and cross-validation
model = LinearRegression(copy_X=False)
param_distributions = {'fit_intercept': [True, False], 'positive': [True, False],
                       'copy_X': [True, False], 'n_jobs': [1, 2, 3, 4, -1]}
random_search = RandomizedSearchCV(model, param_distributions, cv=5, scoring='neg_mean_squared_error', n_iter=20)
random_search.fit(train_data[features], train_data[target])
# Print the best parameters and the corresponding score
print('Best parameters:', random_search.best_params_)
print('Best score:', np.sqrt(-random_search.best_score_))
# Evaluate the model on the training set
train_predictions = random_search.best_estimator_.predict(train_data[features])
train_rmse = mean_squared_error(train_data[target], train_predictions, squared=False)
print('Training set RMSE:', train_rmse)
# Evaluate the model on the test set
test_predictions = random_search.best_estimator_.predict(test_data[features])
test_rmse = mean_squared_error(test_data[target], test_predictions, squared=False)
print('Test set RMSE:', test_rmse)
Result:
Best parameters: {'positive': False, 'n_jobs': 4, 'fit_intercept': True, 'copy_X': False}
Best score: 0.5961871129758642
Training set RMSE: 0.5944623639468501
Test set RMSE: 0.5997029527266164
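As the identical scores suggest, LinearRegression exposes only a few discrete hyperparameters, so neither search can beat the default settings here. Randomized search pays off when hyperparameters are continuous and can be sampled from distributions such as scipy.stats uniform. Here is a minimal, hypothetical sketch using Ridge regression on synthetic data; it is not part of this article's pipeline, only an illustration of sampling a continuous hyperparameter:
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# Synthetic data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
# Ridge's regularization strength alpha is drawn from a continuous uniform
# distribution rather than enumerated on a grid
param_distributions = {'alpha': uniform(loc=0.0, scale=10.0)}
random_search = RandomizedSearchCV(Ridge(), param_distributions, cv=5,
                                   n_iter=20, scoring='neg_mean_squared_error',
                                   random_state=42)
random_search.fit(X, y)
print(random_search.best_params_)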
## Combine all code in one cell
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def linear_regression_cv(train_data, features, target):
    model = LinearRegression()
    scores = cross_val_score(model, train_data[features], train_data[target], cv=5, scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-scores)
    return rmse_scores

def linear_regression_grid_search(train_data, features, target):
    model = LinearRegression()
    param_grid = {'fit_intercept': [True, False], 'positive': [True, False]}
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(train_data[features], train_data[target])
    return grid_search

def linear_regression_randomized_search(train_data, features, target):
    model = LinearRegression(copy_X=False)
    param_distributions = {'fit_intercept': [True, False], 'positive': [True, False],
                           'copy_X': [True, False], 'n_jobs': [1, 2, 3, 4, -1]}
    random_search = RandomizedSearchCV(model, param_distributions, cv=5, scoring='neg_mean_squared_error', n_iter=20)
    random_search.fit(train_data[features], train_data[target])
    return random_search

# Load housing data from CSV file
housing_data = pd.read_csv('housing.csv')
# Fill missing values with the most frequent value in each column
imputer = SimpleImputer(strategy='most_frequent')
housing_data = pd.DataFrame(imputer.fit_transform(housing_data), columns=housing_data.columns)
# Convert categorical data to numerical data using one-hot encoding
housing_data = pd.get_dummies(housing_data, columns=['ocean_proximity'])
# Scale the numerical data using StandardScaler
scaler = StandardScaler()
numeric_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
                   'population', 'households', 'median_income', 'median_house_value']
housing_data[numeric_columns] = scaler.fit_transform(housing_data[numeric_columns])
# Define the features and target variable for the model
features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
            'population', 'households', 'median_income', 'ocean_proximity_<1H OCEAN',
            'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
            'ocean_proximity_NEAR OCEAN']
target = 'median_house_value'
# Split the data into training and testing sets
train_data, test_data = train_test_split(housing_data, test_size=0.25, random_state=42)
# Train a linear regression model using cross-validation
rmse_scores = linear_regression_cv(train_data, features, target)
print('Cross-validation RMSE scores:', rmse_scores)
print('Mean RMSE score:', rmse_scores.mean())
# Train a linear regression model using grid search and cross-validation
grid_search = linear_regression_grid_search(train_data, features, target)
print('Best parameters (grid search):', grid_search.best_params_)
print('Best score (grid search):', np.sqrt(-grid_search.best_score_))
# Train a linear regression model using randomized search and cross-validation
random_search = linear_regression_randomized_search(train_data, features, target)
print('Best parameters (randomized search):', random_search.best_params_)
print('Best score (randomized search):', np.sqrt(-random_search.best_score_))
# Fit a plain linear regression model for the cross-validation variant
# (cross_val_score does not return a fitted model, so one is trained here)
model_cv = LinearRegression()
model_cv.fit(train_data[features], train_data[target])
# Evaluate the models on the training set
train_predictions_cv = model_cv.predict(train_data[features])
train_rmse_cv = mean_squared_error(train_data[target], train_predictions_cv, squared=False)
print('Training set RMSE (cross-validation):', train_rmse_cv)
train_predictions_gs = grid_search.best_estimator_.predict(train_data[features])
train_rmse_gs = mean_squared_error(train_data[target], train_predictions_gs, squared=False)
print('Training set RMSE (grid search):', train_rmse_gs)
train_predictions_rs = random_search.best_estimator_.predict(train_data[features])
train_rmse_rs = mean_squared_error(train_data[target], train_predictions_rs, squared=False)
print('Training set RMSE (randomized search):', train_rmse_rs)
# Evaluate the models on the test set
test_predictions_cv = model_cv.predict(test_data[features])
test_rmse_cv = mean_squared_error(test_data[target], test_predictions_cv, squared=False)
print('Test set RMSE (cross-validation):', test_rmse_cv)
test_predictions_gs = grid_search.best_estimator_.predict(test_data[features])
test_rmse_gs = mean_squared_error(test_data[target], test_predictions_gs, squared=False)
print('Test set RMSE (grid search):', test_rmse_gs)
test_predictions_rs = random_search.best_estimator_.predict(test_data[features])
test_rmse_rs = mean_squared_error(test_data[target], test_predictions_rs, squared=False)
print('Test set RMSE (randomized search):', test_rmse_rs)
Results:
Cross-validation RMSE scores: [0.60320196 0.60033186 0.59034827 0.58958832 0.59734325]
Mean RMSE score: 0.5961627314011231
Best parameters (grid search): {'fit_intercept': True, 'positive': False}
Best score (grid search): 0.5961871129758642
Best parameters (randomized search): {'positive': False, 'n_jobs': 2, 'fit_intercept': True, 'copy_X': False}
Best score (randomized search): 0.5961871129758642
Training set RMSE (cross-validation): 0.5944623639468501
Training set RMSE (grid search): 0.5944623639468501
Training set RMSE (randomized search): 0.5944623639468501
Test set RMSE (cross-validation): 0.5997029527266164
Test set RMSE (grid search): 0.5997029527266164
Test set RMSE (randomized search): 0.5997029527266164
## Analyzing the Best Models and Their Errors
-> Model A:
- Method: Train a linear regression model using grid search and cross-validation.
- Best parameters: {'fit_intercept': True, 'positive': False}.
- Best score (CV): 0.5962.
- Training set RMSE: 0.5945.
- Test set RMSE: 0.5997.
-> Model B:
- Method: Train a linear regression model using randomized search and cross-validation.
- Best parameters: {'positive': False, 'n_jobs': 4, 'fit_intercept': True, 'copy_X': False}.
- Best score (CV): 0.5962.
- Training set RMSE: 0.5945.
- Test set RMSE: 0.5997.
-> Model C:
- Method: Train a linear regression model using cross-validation only.
- Cross-validation RMSE scores: [0.6032, 0.6003, 0.5903, 0.5896, 0.5973].
- Mean RMSE score (CV): 0.5962.
- Training set RMSE: 0.5945.
- Test set RMSE: 0.5997.
All three models have similar performance. The best score (mean RMSE) from cross-validation is almost the same for all three models (around 0.596). Additionally, the training and test set RMSEs are very close for all three models, which indicates that there is no significant overfitting or underfitting in any of them.
In conclusion, all three models have comparable performance. The choice between them should depend on other factors such as simplicity, computational resources, and preference for search strategy (grid search, randomized search, or no search). If you prefer a simpler model with fewer hyperparameters, Model A or C might be a better choice. On the other hand, if you want to explore a wider range of hyperparameters and have the computational resources to do so, Model B could be a better option.
## Visualize all plots in one cell
import matplotlib.pyplot as plt
import numpy as np
# Define data for the bar plots
labels = ['Best Score (CV)', 'Training RMSE', 'Test RMSE']
model_a = [0.5962, 0.5945, 0.5997]
model_b = [0.5962, 0.5945, 0.5997]
model_c = [0.5962, 0.5945, 0.5997]
x = np.arange(len(labels))  # the label locations
width = 0.25  # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width, model_a, width, label='Model A', color='blue')
rects2 = ax.bar(x, model_b, width, label='Model B', color='green')
rects3 = ax.bar(x + width, model_c, width, label='Model C', color='orange')
# Add some text for labels, title, and custom x-axis tick labels
ax.set_ylabel('RMSE Scores')
ax.set_title('RMSE Scores by Model')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
# Display the bar chart
plt.show()
Q&A:
Q1: What are some techniques used to prepare housing data for analysis, and why are they important?
A1: Some techniques used to prepare housing data for analysis include filling missing values with the most frequent value in each column, converting categorical data to numerical data using one-hot encoding, and scaling the numerical data using StandardScaler. These techniques are important to ensure that the data is complete, accurate, and on a common scale, which is necessary for accurate analysis and prediction.
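For reference, StandardScaler standardizes each column to zero mean and unit variance, i.e. z = (x - mean) / std. A minimal sketch with a hypothetical toy column:
import numpy as np
from sklearn.preprocessing import StandardScaler
# Toy column with hypothetical values; StandardScaler subtracts the column
# mean and divides by the column standard deviation.
x = np.array([[1.0], [2.0], [3.0]])
print(StandardScaler().fit_transform(x).ravel())  # [-1.2247  0.  1.2247]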
Q2: How can hyperparameter tuning be used to improve the accuracy of a linear regression model, and what are some methods for implementing it, such as grid search and randomized search?
A2: Hyperparameter tuning can be used to improve the accuracy of a linear regression model by adjusting the hyperparameters to find the combination that produces the best results. Grid search and randomized search are two methods for performing hyperparameter tuning. Grid search involves training the model on multiple combinations of hyperparameters, while randomized search involves training the model on a random subset of hyperparameters. Both methods utilize cross-validation to ensure that the model is accurate and can be used to make accurate predictions on new data.
#housingdata #dataanalysis #missingvalues #categoricaldata #numericaldata #onehotencoding #scaling #StandardScaler #trainingset #testingset #features #targetvariable #linearregression #crossvalidation #gridsearch #randomizedsearch #hyperparametertuning #bestparameters #RMSE #mean_squared_error #accuracy #predictions #machinelearning #Python #datascience
End of article.