[time-nuts] Discarding outliers in two dimensions

Kyle Hofmann kyle.hofmann at gmail.com
Fri Dec 11 03:42:33 UTC 2009


On Wed, Dec 9, 2009 at 5:53 AM, Hal Murray <hmurray at megapathdsl.net> wrote:
>
> Suppose I want to average a bunch of samples.  Sometimes it helps to discard
> the outliers.  I think that helps when there are two noise mechanisms, say
> the typical Gaussian plus sometimes some other noise added on.  If the other
> noise is rare but large, those occasional samples can have a big influence on
> the average.  So discarding those outliers gives better results, for some
> value of "better".
>
> I know how to do it in one dimension.  How do I do it in two dimensions?

I think the relevant property of the median is that it minimizes the
mean (i.e., the empirical expected value) of the absolute deviations.
That is, suppose we have n data points in one dimension. Call them
X_1, ..., X_n, and pick out one of them, which we denote by M. The
deviations from M are X_1 - M, ..., X_n - M, and their absolute
values are |X_1 - M|, ..., |X_n - M|. Average these. The median M
makes that average as small as possible.
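As a quick numerical check of this property (my own sketch, not from
the original post; the function name `mad` and the sample data are
invented for illustration):

```python
def mad(m, xs):
    """Mean absolute deviation of the samples xs from a candidate point m."""
    return sum(abs(x - m) for x in xs) / len(xs)

xs = [1, 2, 3, 100]       # Gaussian-ish data plus one large outlier
median, mean = 2.5, 26.5  # sample median and sample mean of xs

# The median gives a smaller mean absolute deviation than the mean,
# which has been dragged toward the outlier.
print(mad(median, xs))  # 25.0
print(mad(mean, xs))    # 36.75
```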

In higher dimensions you can do the same thing: take all your data
points X_1, ..., X_n; pick out one called M; compute the deviations
X_1 - M, ..., X_n - M; take their norms, i.e., the usual Euclidean
distances; and average those. Do this for every choice of M and keep
the one with the smallest average; that's your median (this is the
geometric median, restricted here to the data points). If more than
one candidate ties, take their mean.

Of course, this is a really slow algorithm, but I'd guess that the
output would be optimal.
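A brute-force sketch of that procedure (my own code, not from the
post, with invented names): try every data point as the candidate M,
keep whichever minimizes the mean Euclidean distance, and average any
ties.

```python
import math

def medoid_2d(points):
    """Brute-force 2-D median as described above: pick the data point
    minimizing the mean Euclidean distance to all points; if several
    candidates tie, return their mean.  O(n^2) distance evaluations."""
    def mean_dist(m):
        return sum(math.hypot(x - m[0], y - m[1]) for x, y in points) / len(points)

    costs = [mean_dist(p) for p in points]
    best = min(costs)
    winners = [p for p, c in zip(points, costs) if math.isclose(c, best)]
    # Average the tied candidates, per the rule above.
    return (sum(p[0] for p in winners) / len(winners),
            sum(p[1] for p in winners) / len(winners))

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]  # small cluster plus one outlier
print(medoid_2d(pts))  # (0.5, 0.5): (0,1) and (1,0) tie, so we average them
```

Note how the outlier at (10, 10) barely influences the result, unlike
the coordinate-wise mean, which it would drag toward itself.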

-- 
Kyle Hofmann <kyle.hofmann at gmail.com>
