
Statistics

Contains questions on statistics required for machine learning, as asked in interviews.

Which hypothesis test to perform?

What is the bias-variance trade-off?

Bias refers to an error from an estimator that is too general and does not learn relationships from a data set that would allow it to make better predictions.

Variance refers to error from an estimator being too specific and learning relationships that are specific to the training set but will not generalize to new observations well.

👉 In short, the bias-variance trade-off is the trade-off between underfitting and overfitting. As you decrease variance, you tend to increase bias; as you decrease bias, you tend to increase variance.

👉 Generally speaking, the goal is to minimize overall error through careful model selection and tuning so that bias and variance are balanced: the model is general enough to make good predictions on new data, yet specific enough to pick up as much signal as possible.
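The trade-off can be sketched with NumPy (the data and polynomial degrees below are hypothetical choices, not from the original): a degree-1 fit to noisy sine data underfits (high bias), while a degree-15 fit on only 20 points overfits (high variance), driving training error down without helping on fresh observations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)              # noisy training data
x_new = np.linspace(0.02, 0.98, 50)
y_new = np.sin(2 * np.pi * x_new) + rng.normal(0, 0.3, x_new.size)  # fresh observations

def mse_for_degree(deg):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x, y, deg)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    return train_mse, test_mse

train_lo, test_lo = mse_for_degree(1)    # high bias: too general, underfits
train_hi, test_hi = mse_for_degree(15)   # high variance: too specific, overfits
```

The high-degree model always achieves lower training error (it can memorize the noise), which is exactly why training error alone cannot be used for model selection.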

Variance

Variance gives us an idea of how the data are spread around the mean; from this spread we can estimate where an unknown data point is likely to fall.

Why is the sample variance divided by (n-1)?

Khan Academy - Variance
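As a quick illustration (the data below are hypothetical), Python's standard library exposes both conventions: `pvariance` divides by n, while `variance` divides by n - 1 (Bessel's correction), which removes the downward bias that comes from measuring deviations around the sample mean instead of the unknown population mean.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical sample, mean = 5

pop_var = statistics.pvariance(data)   # sum of squared deviations / n
samp_var = statistics.variance(data)   # sum of squared deviations / (n - 1)

# 32 / 8 = 4.0 versus 32 / 7 ≈ 4.571: dividing by (n - 1) inflates the
# estimate slightly, compensating for the fact that deviations from the
# sample mean are systematically smaller than deviations from the true mean.
```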

Covariance and Correlation:

Covariance:
Formula: cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Assume X = Height (cm), Y = Weight (kg), Z = Age (yrs).

cov(X, Y) = 100 cm·kg
cov(Y, Z) = 150 kg·yrs

However, the two covariance results cannot be compared, since each is expressed in different units.

Correlation:
Formula: cor(X, Y) = cov(X, Y) / (σ_X σ_Y), with −1 ≤ cor ≤ 1

cor(X, Y) = 0.5
cor(X, Z) = 1

Here it can be said that X and Z are more strongly correlated than X and Y.
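The distinction can be checked with NumPy (the heights and weights below are made-up numbers): covariance carries the product of the two variables' units and depends on their scale, while correlation is dimensionless and always bounded in [-1, 1].

```python
import numpy as np

height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # cm (hypothetical)
weight = np.array([50.0, 58.0, 65.0, 75.0, 82.0])       # kg (hypothetical)

cov_hw = np.cov(height, weight)[0, 1]       # units: cm * kg, scale-dependent
cor_hw = np.corrcoef(height, weight)[0, 1]  # dimensionless, always in [-1, 1]
```

Rescaling height from cm to metres would shrink `cov_hw` by a factor of 100 but leave `cor_hw` unchanged, which is why correlations are comparable across variable pairs while covariances are not.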

F-tests

Compares the variances of two different populations.
Assumption: the populations from which the samples are drawn must be normal.

  1. F-test for testing equality of variance is used to test the hypothesis of the equality of two population variances. The height example above requires the use of this test.

  2. F-test for testing equality of several means. The test for equality of several means is carried out by the technique called ANOVA.
    For example, suppose that an experimenter wishes to test the efficacy of a drug at three levels: 100 mg, 250 mg and 500 mg. A test is conducted among fifteen human subjects taken at random, with five subjects being administered each level of the drug.
    To test if there are significant differences among the three levels of the drug in terms of efficacy, the ANOVA technique has to be applied. The test used for this purpose is the F-test.

  3. F-test for testing significance of regression is used to test the significance of the regression model. The appropriateness of the multiple regression model as a whole can be tested by this test. A significant F value indicates a linear relationship between Y and at least one of the Xs.
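Case 2 can be sketched with SciPy's one-way ANOVA, using made-up efficacy scores for the three dose levels (five subjects per level, as in the example above):

```python
from scipy import stats

# hypothetical efficacy scores for five subjects at each dose level
dose_100 = [4.1, 3.8, 4.5, 4.0, 3.9]
dose_250 = [5.2, 5.5, 4.9, 5.1, 5.3]
dose_500 = [6.1, 6.4, 5.9, 6.2, 6.0]

# one-way ANOVA: H0 says all three dose levels have the same mean efficacy.
# The F statistic compares between-group variance to within-group variance.
f_stat, p_value = stats.f_oneway(dose_100, dose_250, dose_500)

# a small p-value rejects H0: at least one dose level differs in efficacy
```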

Z-test and T-test

Both tests are used to determine whether the means of two samples differ from each other.

A z-test is used when the standard deviation of the population is known; a t-test is used when it is not known and must instead be estimated from the sample standard deviation.

Assumptions of a t-test:

  1. Random sampling: each observation must be independent of the others, which is reasonable when the sample is drawn at random and (if sampling without replacement) is less than 10% of the population.
  2. The sample is drawn from a normally distributed population.

In practice, a z-test is used when the sample size is greater than 30, since the sample standard deviation is then a reliable estimate of the population's.
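A two-sample t-test sketch with SciPy (the samples are simulated; the means, scale, and sizes are arbitrary choices). `ttest_ind` estimates the standard deviation from the samples themselves, matching the t-test case described above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_a = rng.normal(loc=5.0, scale=1.0, size=25)  # simulated group A
sample_b = rng.normal(loc=7.0, scale=1.0, size=25)  # simulated group B, higher mean

# the population sigma is treated as unknown and estimated from each sample;
# H0 says the two groups share the same mean
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
```

With a true mean difference of two standard deviations and n = 25 per group, the test comfortably rejects the null hypothesis.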

Multicollinearity and VIF

VIF is explained with a Python notebook in the link.
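Since the notebook is not reproduced here, a minimal sketch of VIF computed directly with NumPy (the predictors are simulated; `x2` is deliberately constructed to be nearly collinear with `x1`). VIF for predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the others; a common rule of thumb flags VIF above 5 (or 10) as problematic multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                        # independent of the rest
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# x1 and x2 get large VIFs; the independent x3 stays near 1
vifs = [vif(X, j) for j in range(X.shape[1])]
```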
