Winsorizing and Transforming Data
When working with data, there is a very common problem that we need to deal with: non-normal distributions. For many statistical analyses, a normal distribution is a necessary assumption. Many researchers ignore the normality of the distribution and just hope that the analysis they are using is "robust" enough (i.e., not significantly influenced by non-normal data) that it doesn't matter. In many cases, this is fine. However, it is still good to look at your data to see if it is non-normal to an extent that needs to be addressed. Here, I want to detail how to do so.
Before we jump into how to handle non-normal data, let's look at an example frequency plot of a non-normal variable. The normal curve has been added to help illustrate just how non-normal the data really is:
Here, we see that the data is positively skewed (most of it is to the left, but some high values to the right are stretching the curve in the positive direction). In this case, the value to the right is pretty extreme, so we would want to question whether or not it's even a valid measurement. This sometimes happens with lab measurements, when instruments mis-measure something (which is the case here). However, we can also look at the other data from any participants who are outliers (i.e., far outside the rest of the data values) to see if their other measurements are also extreme.
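If you want a number to go along with the plot, one option is to compute the skewness of the variable. The sketch below is only an illustration: the `scores` values are made up, the function name is my own, and it uses the simple moment-based formula (third central moment divided by the second central moment raised to the power 1.5), which may differ slightly from what your statistics package reports.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical measurements: most values are low, a few are high.
scores = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 4.9, 7.5]
skew = sample_skewness(scores)
print(f"skewness = {skew:.2f}")  # a positive result indicates positive skew
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive value matches the right-hand tail described above.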
In this case, that extreme measurement was deemed to be a measurement error (the participant's other measurements were much more normal), so it was left out of the dataset, while some other higher values (which you can see by the right edge of the normal curve) were kept.
Once you decide which data points to cut out, if any (and you need good justification to do so!), the next step is to "correct" any additional outliers. "Correcting" means identifying values that fall outside a reasonably expected range, most likely because measurement error slightly elevated them, and adjusting them back toward that range. For example, if a participant's measurements are consistently near the outer edge of the normal curve, but one measurement goes beyond the edge, then it is reasonable to assume that the actual value is quite high and the measured value was just inflated a bit.
To correct for this inflation from measurement error, we can Winsorize the outliers. Winsorizing is a process that involves bringing the outliers down to a specified value, so that they sit closer to the rest of the distribution. The value to bring the outliers down to depends on the situation: some use a 90th percentile cutoff, some use a 95th percentile cutoff, others use a cutoff of 3 standard deviations above the mean, and so forth. Just keep in mind that you want good justification for the value you choose!
For this data, I chose to go with a 3 standard deviation cutoff because it involves "correcting" the data to a smaller extent. To get that value, I found the mean and standard deviation of the variable and calculated the value as Mean + 3 * SD. After doing that, I sorted the data descending and looked for how many values exceeded 3 standard deviations.
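Here is a minimal sketch of that calculation in Python. The `measurements` list is hypothetical, the function name is my own, and I'm assuming the sample standard deviation (the n - 1 version) computed over all of the values you decided to keep; adjust that if your field does it differently.

```python
import statistics

def upper_cutoff(values, n_sd=3):
    """Return mean + n_sd * sample standard deviation."""
    return statistics.fmean(values) + n_sd * statistics.stdev(values)

# Hypothetical data: 30 typical values plus one unusually high value.
measurements = [10.0, 11.0, 12.0, 13.0, 14.0] * 6 + [30.0]

cutoff = upper_cutoff(measurements)
# Sort descending and keep only the values above the cutoff.
flagged = [x for x in sorted(measurements, reverse=True) if x > cutoff]
print(f"cutoff = {cutoff:.2f}, values above it: {flagged}")
```

In this made-up example only the single high value ends up above the cutoff, which is the kind of short list you should expect.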
Keep in mind that you shouldn't need to correct many values! In a dataset with a couple hundred measurements, I found only three that needed to be corrected (plus the one that was excluded), and even then the corrections were fairly minor.
You can either bring all of the outliers down to the value you calculated or, as some prefer, bring the lowest outlier down to that value, then bring the next one down to it plus a little bit (e.g., +0.01), then the next one down to it plus a little more (e.g., +0.02), and so forth. That way, the values are corrected, but they are still in the same order from highest to lowest. How much to add depends on the scale of the variable you're correcting, and in most cases the small additions will make little or no difference to the statistical output.
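Both options can be sketched with one small function. This is an illustration rather than a standard routine: `winsorize_upper`, its `step` parameter, and the `raw` values are all hypothetical names and numbers of my own, and it only handles outliers at the high end, as in this post.

```python
def winsorize_upper(values, cutoff, step=0.0):
    """Return a new list with values above `cutoff` pulled down to it.

    With step > 0, the lowest outlier becomes `cutoff`, the next one
    `cutoff + step`, then `cutoff + 2 * step`, and so on, so the
    outliers keep their original order.
    """
    # Indices of the outliers, ordered from lowest to highest outlier.
    outlier_idx = sorted(
        (i for i, x in enumerate(values) if x > cutoff),
        key=lambda i: values[i],
    )
    result = list(values)  # copy, so the original data is left untouched
    for rank, i in enumerate(outlier_idx):
        result[i] = cutoff + rank * step
    return result

raw = [4.0, 5.5, 9.2, 6.1, 8.4, 12.7]  # hypothetical values
flat = winsorize_upper(raw, cutoff=8.0)               # all outliers -> 8.0
ranked = winsorize_upper(raw, cutoff=8.0, step=0.01)  # order preserved
print(flat)
print(ranked)
```

Pick a `step` that is small relative to the scale of your variable; if it is too large, a "corrected" value could end up above the original measurement.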
Alright! So you've Winsorized your data. Here is an example of how that can make your data more normal:
That's looking much better! But, as you can see, we still have a positive skew. Depending on the analysis, this skew may not affect the results, but it may still be good to help make the distribution more normal.
If further normalization is needed, then the next step is to transform the data. Data transformation is a little controversial among researchers; whereas Winsorizing only involves changing outliers, transforming involves changing all of the values. Subjectively, that worries a lot of people (as changing data values should!). However, statistically it is fine if done appropriately.
To transform data, we do the same computation with all of the values for a variable. Two common computations are the square root and the logarithm. For each, you simply take the square root (or log) of each value, and you've effectively transformed the data. Keep in mind that the log only works on positive values and the square root only on non-negative ones. Think of it as similar to keeping equations balanced; as long as you do the same computation to everything equally, it all works out mathematically.
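As a quick sketch of what that looks like in practice, the functions below apply each transformation to every value and return a new list. The function names and the `levels` values are hypothetical, and I'm assuming the natural log here; some fields use log base 10 instead.

```python
import math

def sqrt_transform(values):
    """Square-root transform; every value must be >= 0."""
    return [math.sqrt(x) for x in values]

def log_transform(values):
    """Natural-log transform; every value must be > 0."""
    return [math.log(x) for x in values]

levels = [1.0, 4.0, 9.0, 100.0]  # hypothetical positive measurements
levels_sqrt = sqrt_transform(levels)  # new variable, original kept as-is
levels_log = log_transform(levels)
print(levels_sqrt)
print(levels_log)
```

Both transformations shrink large values more than small ones, which is why they pull in a long right-hand tail, and both keep the values in the same order.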
Which computation you'll use depends on the type of data you're working with and the nature of the non-normal problems you have with the distribution. You can try different methods, or look to see what has become popular for the type of data you are working with.
After you successfully transform the data, your distribution should look more like this:
Sure, there are still values at the high end of the distribution (partly due to the Winsorizing), but that distribution is much more normal!
Now that you know how to Winsorize and transform data, I want to make a few important statements:
- Only Winsorize and/or transform the data if you have good justification to do so! Never do it just to fish for significant findings.
- Look into the literature to see if Winsorizing and/or transforming is common for the data you're working with. For some things (like cortisol measurements), it's common and a fairly standard procedure is used. Follow previous examples to make your methods consistent.
- Whenever you're editing data, like Winsorizing or transforming, always have the new values in a separate variable! Just like I recommend working with variables in a copy of the original database, you want to make sure you're never replacing the original data.
And that's it! Hopefully this post has been helpful.
Have questions about Winsorizing and/or transforming data? Unsure of whether or not you're justified to Winsorize or transform? Let me know your thoughts/concerns in the comments!