Is it a good idea to exclude noisy data from the dataset to train the model?

Question

Is it a good idea to exclude noisy data (which may reduce model accuracy or cause unexpected output on the test dataset) from a dataset when generating the training and validation datasets?

Assumption: the noisy data is known to us in advance.

Any suggestion is deeply appreciated!

Answer 1:

It depends on your application. If the noisy data is valid, then definitely include it to find the best model.

However, if the noisy data is invalid, then it should be cleaned out before fitting your model.

Noise is a broad term; it is better to think of your data points as inliers or outliers instead.

Most outlier detection algorithms assign each observation a score, rank the outlier candidates by that score, and apply a threshold. In that case, you can choose to remove the most extreme values, for example those more than 3 standard deviations from the mean (assuming, of course, that your data set is roughly Gaussian-distributed).
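
To make the 3-standard-deviation rule concrete, here is a minimal sketch using only the Python standard library. The function name `three_sigma_outliers` and the sample readings are made up for illustration, and the rule only makes sense if the data is roughly Gaussian.

```python
import statistics

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        return []  # all values are identical, so nothing can be extreme
    return [v for v in values if abs(v - mean) > k * std]

# Hypothetical measurements: 20 typical readings plus one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [500]

outliers = three_sigma_outliers(data)
cleaned = [v for v in data if v not in outliers]
print(outliers)      # [500]
print(len(cleaned))  # 20
```

One caveat: in a very small sample, a single extreme value inflates the standard deviation so much that it may never cross the 3-sigma cut, so this check needs a reasonable number of observations.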

So my suggestion is to base your judgement on two things:

1. Your business concept and logic about validity vs. invalidity. For example: a house's size, area or price cannot be a negative number.
2. Your mathematical / algorithmic logic. For example: detect extreme values based on some threshold to decide (together with or independently of point no. 1) whether an observation is valid or not.
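
The two checks can be combined in a few lines. This is only a sketch under assumptions: the field names `size_m2` and `price`, the sample records, and the choice of the 1.5 x IQR rule as the threshold are mine, not something the question specifies.

```python
import statistics

def is_valid(house):
    """Business rule: size and price must both be positive."""
    return house["size_m2"] > 0 and house["price"] > 0

def triage(houses, k=1.5):
    """Label each record 'invalid', 'extreme' or 'ok'."""
    # The statistical threshold is computed on valid records only,
    # so impossible values cannot distort it.
    valid_prices = [h["price"] for h in houses if is_valid(h)]
    q1, _, q3 = statistics.quantiles(valid_prices, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    labels = []
    for h in houses:
        if not is_valid(h):
            labels.append("invalid")   # fails the business rule
        elif not low <= h["price"] <= high:
            labels.append("extreme")   # valid, but outside the IQR fences
        else:
            labels.append("ok")
    return labels

# Hypothetical records: (size in m2, price in thousands).
raw = [(70, 200), (75, 210), (80, 220), (85, 230), (90, 240), (95, 250),
       (100, 260), (105, 270), (90, 5000), (-80, 230), (85, -150)]
houses = [{"size_m2": s, "price": p} for s, p in raw]

labels = triage(houses)
print(labels[-3:])  # ['extreme', 'invalid', 'invalid']
```

Invalid records can be dropped without much thought. The "extreme" ones deserve a second look, since an unusually expensive house may still be a genuine observation.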

Noisy data does not cause a huge problem by itself. The extremely noisy data (i.e. extreme values / outliers) is what you should really be concerned about! Such points pull the hypothesis of your model toward themselves while it fits the data, so the results might be drastically shifted or incorrect.

Finally, you can look at PyOD, an open-source Python toolbox that implements many different outlier detection algorithms off the shelf. (You can choose more than one algorithm and create a voting pool to decide how extreme the observations are.)
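
The voting-pool idea can be sketched without any third-party package. The three detectors below (z-score, IQR fences, and a median-absolute-deviation score) are simple stand-ins that I picked so the example runs with the standard library alone; they are not PyOD models. With PyOD you would presumably feed in the 0/1 labels of each fitted detector instead.

```python
import statistics

def zscore_votes(values, k=3.0):
    """Vote 1 for values more than k standard deviations from the mean."""
    mean, std = statistics.mean(values), statistics.stdev(values)
    return [int(std > 0 and abs(v - mean) > k * std) for v in values]

def iqr_votes(values, k=1.5):
    """Vote 1 for values outside the k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [int(v < low or v > high) for v in values]

def mad_votes(values, k=3.5):
    """Vote 1 for values whose modified z-score (based on the MAD) exceeds k."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [0] * len(values)
    return [int(0.6745 * abs(v - med) / mad > k) for v in values]

def vote(values, detectors, min_votes):
    """Count votes per observation and flag those with at least min_votes."""
    ballots = [detect(values) for detect in detectors]
    totals = [sum(column) for column in zip(*ballots)]
    flagged = [v for v, t in zip(values, totals) if t >= min_votes]
    return totals, flagged

# Hypothetical data: 20 typical readings, one borderline value, one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [56, 500]

totals, flagged = vote(data, [zscore_votes, iqr_votes, mad_votes], min_votes=2)
print(totals[-2:])  # [1, 3]
print(flagged)      # [500]
```

Here the borderline value 56 is flagged by one detector only, so a majority vote keeps it, while 500 is flagged by all three and gets removed.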

Answer 2:

You can use a multivariate Gaussian distribution for outlier detection in Python. It is a good choice when your features are approximately jointly Gaussian.
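
A minimal sketch of that approach for two features, again standard library only. It fits a mean and covariance, then flags points whose squared Mahalanobis distance is too large, which is equivalent to flagging points whose Gaussian density falls below a cutoff. The synthetic data, the function names and the cutoff `alpha = 0.001` are assumptions made for the example.

```python
import math
import random

def fit_gaussian_2d(points):
    """Estimate the mean and the sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 matrix inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Synthetic data: 200 standard-normal points plus one planted outlier (index 200).
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
points.append((10.0, -10.0))

mean, cov = fit_gaussian_2d(points)

# For 2-D Gaussian data the squared distance is roughly chi-square distributed
# with 2 degrees of freedom, so P(d2 > t) = exp(-t / 2) and t = -2 * ln(alpha).
alpha = 0.001
threshold = -2 * math.log(alpha)

flagged = [i for i, p in enumerate(points)
           if mahalanobis_sq(p, mean, cov) > threshold]
print(200 in flagged)  # True
```

With more than two features you would want a linear algebra library for the covariance inverse rather than the hand-written 2x2 formula used here.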

Source: https://stackoverflow.com/questions/60972247/is-it-a-good-idea-to-exclude-noisy-data-from-the-dataset-to-train-the-model

Tags

machine-learning

dataset

data-science