In real-world scenarios, data analysis tasks often involve complex, multi-dimensional data. We analyse the data and try to find various patterns in it.
Here, the dimensions are the features of each data point x. As the number of dimensions increases, so does the difficulty of visualizing the data and performing computations on it. So, how do we reduce the dimensions of a data set?
- Remove the redundant dimensions
- Only keep the most important dimensions
To reduce the dimensions of the data we use Principal Component Analysis (PCA). Before we dive into how PCA works, let's understand some key terminology that we will use later.
Variance: It is a measure of variability; it simply measures how spread out the data set is. Mathematically, it is the average squared deviation from the mean. We use the following formula to compute the variance var(x):

var(x) = Σ (xi − x̄)² / N
Covariance: It is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. The covariance of x and y, denoted cov(x, y), is computed as:

cov(x, y) = Σ (xi − x̄)(yi − ȳ) / N
Here, xi and yi are the ith values of x and y, and x̄ and ȳ denote the corresponding mean values.
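The two formulas above can be checked directly in NumPy. This is a minimal sketch on a hypothetical toy data set; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical paired observations of x and y
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Variance: average squared deviation from the mean
var_x = np.mean((x - x.mean()) ** 2)

# Covariance: average product of paired deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print(var_x)   # 5.0
print(cov_xy)  # 5.0
```

These match `np.var(x)` and `np.cov(x, y, bias=True)[0, 1]`, which use the same population (divide-by-N) convention.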
One way to observe the covariance is how interrelated two data sets are.
Positive, negative and zero covariance:
Positive covariance means X and Y are positively related, i.e. as X increases, Y also increases. Negative covariance depicts the exact opposite relation. Zero covariance means there is no linear relation between X and Y.
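The three cases are easy to see on made-up data (the arrays below are hypothetical examples, not from the text):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x               # moves with x
y_neg = -2 * x              # moves against x
y_zero = np.full(5, 3.0)    # does not move with x at all

def cov(a, b):
    # Population covariance: average product of deviations from the means
    return np.mean((a - a.mean()) * (b - b.mean()))

print(cov(x, y_pos))   # 4.0  (positive)
print(cov(x, y_neg))   # -4.0 (negative)
print(cov(x, y_zero))  # 0.0
```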
Eigenvectors and Eigenvalues:
To better understand these concepts, let’s consider the following situation. We are provided with 2-dimensional vectors v1, v2, …, vn. Then, if we apply a linear transformation T (a 2×2 matrix) to our vectors, we will obtain new vectors, called b1, b2,…,bn.
Some of them (at most as many as the number of dimensions), though, have a very interesting property: once the transformation T is applied, they change length but not direction. Those vectors are called eigenvectors, and the scalar that gives the factor by which an eigenvector is scaled is called its eigenvalue.
Thus, each eigenvector has a corresponding eigenvalue.
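We can verify this property numerically. Below is a sketch with an arbitrary 2×2 matrix chosen for illustration; for each eigenvector v, applying T is the same as scaling v by its eigenvalue.

```python
import numpy as np

# A hypothetical 2x2 linear transformation T
T = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `eigenvectors` are the eigenvectors of T
eigenvalues, eigenvectors = np.linalg.eig(T)

# Each eigenvector only changes length under T, not direction:
# T @ v equals eigenvalue * v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(T @ v, eigenvalues[i] * v)

print(np.sort(eigenvalues))  # the eigenvalues of this T are 1 and 3
```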
When should I use PCA?
- If you want to reduce the number of variables, but aren't able to identify variables to remove from consideration entirely.
- If you want to ensure your variables are independent from each other.
- To avoid overfitting your model.
- If you are comfortable making your independent variables less interpretable.
- PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables.
- It is also sometimes known as general factor analysis.
- Where regression determines a line of best fit to a data set, factor analysis determines several orthogonal lines of best fit to the data set.
- Orthogonal means “at right angles”.
- Actually the lines are perpendicular to each other in n-dimensional space.
- N-dimensional Space is the variable sample space.
- There are as many dimensions as there are variables, so in a data set with 4 variables the sample space is 4-dimensional.
- Here we have some data plotted along two features, x and y.
- We can add an orthogonal line. Now we can begin to understand the components!
- The components come from a linear transformation that chooses a new coordinate system for the data set such that the greatest variance of the data set comes to lie on the first axis.
- The second greatest variance on the second axis, and so on.
- This process allows us to reduce the number of variables used in an analysis.
- We can continue this analysis into higher dimensions.
- If we use this technique on a data set with a large number of variables, we can compress the amount of explained variation to just a few components.
- The most challenging part of PCA is interpreting the components.
For our work with Python, we'll walk through an example of how to perform PCA with scikit-learn. We usually want to standardize our data before applying PCA, so we'll cover how to do this as well.
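Here is a minimal sketch using scikit-learn's `StandardScaler` and `PCA` on a hypothetical random data set (the data and component count are illustrative, not from the text):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data set: 100 samples, 4 variables
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))

# Standardize each variable (zero mean, unit variance) before PCA
X_std = StandardScaler().fit_transform(X)

# Keep the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```

`explained_variance_ratio_` is sorted in decreasing order, which is how you judge how many components are worth keeping.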
Under the hood, PCA performs the following steps:
- Calculate the covariance matrix of the data points X.
- Calculate its eigenvectors and corresponding eigenvalues.
- Sort the eigenvectors according to their eigenvalues in decreasing order.
- Choose the first k eigenvectors; these will span the new k dimensions.
- Transform the original n-dimensional data points into k dimensions.
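The steps above can be sketched directly in NumPy. The function name `pca` and the random test data are my own illustrative choices:

```python
import numpy as np

def pca(X, k):
    """Reduce n-dimensional points X (rows = samples) to k dimensions."""
    # Center the data, then compute the covariance matrix
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # Eigenvectors/eigenvalues (eigh suits the symmetric covariance matrix)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort eigenvectors by eigenvalue in decreasing order, keep the first k
    order = np.argsort(eigenvalues)[::-1]
    top_k = eigenvectors[:, order[:k]]

    # Project the original points onto the top-k eigenvectors
    return X_centered @ top_k

# Hypothetical 4-dimensional data reduced to 2 dimensions
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
print(pca(X, 2).shape)  # (50, 2)
```

The first output column carries the most variance, the second the next most, matching the ordering described above.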