Data Analysis and Linear Regression
In the first class, after discussing the course structure, we started our first project: applying linear regression to the CDC diabetes data. The data consists of features such as FIPS, COUNTY, OBESITY, INACTIVITY, and DIABETES. We must first analyze the data and find the relationships among the features. Because this is real data, we examined it by finding the correlations among features such as obesity, inactivity, and diabetes. I also learned about different measures such as skewness, kurtosis, and heteroscedasticity.
Skewness: A statistical measure of how asymmetrical a distribution is. It tells us about the shape of the distribution of the data points: positive skewness means a longer right tail, and negative skewness means a longer left tail.
Kurtosis: A measure of how heavy the tails of a distribution are, often described as how flat or peaked the distribution is. It tells us about the tails of the distribution rather than its center.
Heteroscedasticity: A situation where the variance of the error term in a regression model is not constant across all values of the independent variable.
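To make the first two definitions concrete, here is a minimal sketch that computes skewness and kurtosis from their moment definitions. The function names and the sample values are my own illustrations (not taken from the CDC data), and this uses the population formulas, so a statistics library may report slightly different bias-corrected numbers.

```python
def central_moment(values, k):
    """Average of (value - mean) ** k over the data (population version)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment: about 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Made-up toy samples, only to show how the measures behave.
symmetric_sample = [1, 2, 3, 4, 5]
right_skewed_sample = [1, 1, 1, 2, 10]

print("skewness (symmetric):", skewness(symmetric_sample))
print("skewness (right-skewed):", skewness(right_skewed_sample))
print("kurtosis (symmetric):", kurtosis(symmetric_sample))
```

Both functions divide by the variance, so they would fail on a column where every value is identical; a real analysis would need to guard against that.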
I have also learned about linear regression.
Put simply, linear regression is a statistical model that describes the relationship between variables and lets us make predictions.
The mathematical formula for linear regression is Y = b0 + b1X + c, where:
X is the independent variable
Y is the dependent variable
b0 is the intercept; b1 is the slope
c is the error term
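The intercept and slope in this formula are usually estimated by ordinary least squares. The sketch below assumes that method and uses made-up numbers (not the CDC data); the function name is hypothetical. It also computes the residuals, which are the estimates of the error term and are what one would inspect when checking for heteroscedasticity.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates of the intercept b0 and slope b1 in Y = b0 + b1*X."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = (co-variation of x and y) / (variation of x).
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # The fitted line always passes through the point (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Made-up toy data that lies exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_simple_regression(x, y)

# Residual = observed value minus predicted value (the estimated error term).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("intercept b0:", b0)
print("slope b1:", b1)
print("residuals:", residuals)
```

On real data the residuals will not be zero; plotting them against X is a common informal way to see whether their spread stays roughly constant.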
Here, in this dataset, we are going to predict %diabetes with the help of %obesity and %inactivity.
So our dependent (target) variable Y is %diabetes, and in the simple model the independent variable X is %inactivity.
In multiple linear regression we can use several independent variables. Since our dataset has multiple variables, we can also predict diabetes using two independent variables: inactivity and obesity.
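With two independent variables the model can be written as Y = b0 + b1X1 + b2X2 + c. As a rough sketch, assuming ordinary least squares again, the coefficients for the two-predictor case have a closed form built from centered sums. The function name and the numbers are hypothetical; X1 and X2 stand in for %inactivity and %obesity, and Y for %diabetes, but the values are invented and not from the CDC data.

```python
def fit_two_predictor_regression(x1, x2, y):
    """Least-squares estimates (b0, b1, b2) for Y = b0 + b1*X1 + b2*X2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

    # Center every variable around its mean.
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]

    # Sums of squares and cross-products of the centered variables.
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))

    # Solve the 2x2 normal equations; det is 0 when X1 and X2 are perfectly collinear.
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Made-up toy data generated from y = 1 + 2*x1 + 3*x2 with no noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [9.0, 8.0, 19.0, 18.0, 29.0]

coef_b0, coef_b1, coef_b2 = fit_two_predictor_regression(x1, x2, y)
print("b0:", coef_b0, "b1:", coef_b1, "b2:", coef_b2)
```

For more than two predictors this closed form gets unwieldy, and a linear algebra or statistics library would be the more practical choice.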
