In my understanding, PCA can be performed only for continuous features. But while trying to understand the difference between one-hot encoding and label encoding, ...
Basically, PCA finds and eliminates less informative (redundant) information in the feature set and reduces the dimensionality of the feature space. In other words, imagine an N-dimensional hyperspace; PCA finds the M (M < N) directions along which the data varies most, so the data can be represented as M-dimensional feature vectors. Mathematically, it is essentially an eigenvalue/eigenvector decomposition of the covariance matrix of the feature space.
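To make that concrete, here is a minimal sketch of PCA via eigendecomposition of the covariance matrix, assuming a small NumPy array X of shape (n_samples, n_features) with made-up values; it is an illustration, not a production implementation:

    import numpy as np

    # Hypothetical data matrix: 6 samples, 3 features (values are made up).
    X = np.array([
        [2.5, 2.4, 0.5],
        [0.5, 0.7, 2.1],
        [2.2, 2.9, 0.3],
        [1.9, 2.2, 0.8],
        [3.1, 3.0, 0.1],
        [2.3, 2.7, 0.4],
    ])

    # Center the data so the covariance reflects variance around the mean.
    X_centered = X - X.mean(axis=0)

    # Covariance matrix of the features (3 x 3 here).
    cov = np.cov(X_centered, rowvar=False)

    # Eigendecomposition: eigenvectors are the principal axes,
    # eigenvalues are the variance captured along each axis.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Sort by decreasing eigenvalue and keep the top M = 2 components.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:2]]

    # Project the N-dimensional data onto the M-dimensional subspace.
    X_reduced = X_centered @ components
    print(X_reduced.shape)  # (6, 2)

Nothing in this procedure requires the columns of X to be continuous; it only needs a covariance matrix.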
So, it is not important whether the features are continuous or not.
PCA is used widely in many applications, mostly for eliminating noisy, less informative data coming from sensors or other hardware before classification/recognition.
Edit:
Statistically speaking, binary-encoded categorical features can be seen as discrete random variables taking values in {0, 1}. The expectation E{X} and variance E{(X-E{X})^2} are still valid and meaningful for discrete random variables, so I still stand by the applicability of PCA in the case of categorical features.
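As a quick sanity check, here is a small sketch (with a made-up 0/1 column) showing that the sample mean and variance of a binary feature are perfectly well defined and, for a 0/1 variable, the variance reduces to p(1 - p):

    import numpy as np

    # Hypothetical 0/1 encoded categorical column (e.g. "went to work" per day).
    x = np.array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0])

    p = x.mean()                 # sample estimate of E{X}
    var = ((x - p) ** 2).mean()  # sample estimate of E{(X - E{X})^2}

    print(p, var)                # for a 0/1 variable, var equals p * (1 - p)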
Consider a case where you would like to predict whether it is going to rain on a given day. You have a categorical feature X, "Do I have to go to work on the given day", encoded as 1 for yes and 0 for no. Clearly the weather does not depend on our work schedule, so P(R|X) = P(R). Assuming 5 work days per week, we have more 1s than 0s for X in a randomly collected dataset. PCA would probably lead to dropping this low-variance dimension from your feature representation.
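A rough simulation of this scenario, using hypothetical feature names and scikit-learn's PCA (the numbers and features are invented purely to illustrate the low-variance argument):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_days = 365

    # Hypothetical weather-related features, plus the binary
    # "have to go to work" column: 1 on roughly 5 of 7 days, 0 otherwise.
    humidity = rng.normal(60, 15, n_days)
    pressure = rng.normal(1013, 8, n_days)
    work_day = (rng.random(n_days) < 5 / 7).astype(float)

    X = np.column_stack([humidity, pressure, work_day])

    pca = PCA(n_components=3).fit(X)

    # The last component is dominated by the low-variance work_day column
    # and explains almost none of the total variance, making it a natural
    # candidate to drop.
    print(pca.explained_variance_ratio_)

Note that on raw (unstandardized) scales the variance comparison is what drives this result; if you standardize every column first, the picture can change.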
At the end of the day, PCA is for dimensionality reduction with minimal loss of information. Intuitively, we rely on the variance of the data along a given axis to measure its usefulness for the task. I don't think there is any theoretical limitation to applying it to categorical features; the practical value depends on the application and the data, which is also the case for continuous variables.