Q & A: Does skewness of a variable impact predictive data modelling? If so, how? How can I deal with it?
- IntelData Pty Ltd | Asset Management and Business
- Jan 15, 2016
- 4 min read
The answer to this question is a big yes, and this is an important consideration that a data analyst or a reliability engineer needs to take into account when preparing the dataset for the modelling phase. It matters because many algorithms make assumptions about the shape (distribution) of the data, most commonly that it follows a normal distribution.
For example, linear regression, k-nearest neighbours and k-means are sensitive to the skewness of the data. Models built without taking this issue into consideration are not truly representative of the patterns in the data.
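As a quick illustrative check (a minimal sketch, not part of the original example; the sample values are made up), the skewness of a variable can be measured before modelling:

```python
import numpy as np
from scipy.stats import skew

# Illustrative positively skewed sample, e.g. repair costs (values are made up).
rng = np.random.default_rng(42)
costs = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# A skewness well above 0 indicates a long right tail; near 0 is balanced.
print(f"skewness: {skew(costs):.2f}")
print(f"mean: {costs.mean():.1f}, median: {np.median(costs):.1f}")
```

A skew value close to 0 suggests the variable is balanced; large positive or negative values flag the problems described below.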
Suppose that you have built the following linear regression model that computes the square of the errors between the data points and the trend line (Figure 1).
The larger the values, the larger the errors can be. For example, as can be seen in the picture, the error for the data point at 500 is around 150 units, which is 100 times more than the errors on the left side. And the problem is even worse than it appears. The regression model computes the square of the error, so 150 units becomes 150² = 22,500 squared units, thousands of times greater than the squared errors at the left end of the plot. Therefore, to minimise the square of the errors, the regression model must try to keep the line close to the data point at the right extreme of the plot, giving this data point a disproportionate influence on the slope of the line. Of the 28 data points in the plot, the square of the error for the data point with an x-axis value equal to 500 is 21 percent of the total error of all the data points. The four data points with x-axis values of 250 or above contribute more than 60 percent of the error (for 14 percent of the data).
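The same effect can be reproduced on synthetic data (a minimal sketch; this is not the article's 28-point dataset, and the generated values are made up):

```python
import numpy as np

# Synthetic positively skewed predictor with noise that grows with x.
rng = np.random.default_rng(0)
x = rng.exponential(scale=100.0, size=28)
y = 0.3 * x + rng.normal(scale=0.1 * x + 5.0)

# Fit an ordinary least-squares line and compute each point's squared error.
slope, intercept = np.polyfit(x, y, deg=1)
sq_err = (y - (slope * x + intercept)) ** 2

# The few points in the right tail typically dominate the total squared error.
share = np.sort(sq_err)[::-1] / sq_err.sum()
print(f"share of squared error from the 4 largest contributors: {share[:4].sum():.0%}")
```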
Clustering methods such as k-means and Kohonen maps use Euclidean distance, which computes the square of the distance between data points, and therefore the tails of the distribution have the same disproportionate effect. Other algorithms, such as decision trees, are unaffected by skew.
Fixing Skewness
Positive Skew
To do this, the skewed variables are usually transformed by a function that has a disproportionate effect on the tails of the distribution. Ideally, for most modelling algorithms, the desired outcome of skew correction is a new version of the variable that is normally distributed.
However, even if the outcome is not a variable with a normal distribution, as long as the distribution is balanced – a skew value close to 0 – the algorithms behave in a less biased way.
For positive skew, the most common corrections are:
The log transform
The multiplicative inverse
The square root transform
These transforms work by reducing larger values more than they reduce (or even expand) the smaller values, or, in the case of the inverse, by actually increasing the smaller values. Table 1 shows the common transformations used to reduce positive skew.[1]
Of these, the log transform is perhaps the most often used transformation to correct for positive skew. A log transform of any positive base will pull in the tail of a positively skewed variable, but the natural log and the log base 10 are the ones usually included in software packages. The log base 10 pulls the tail in more than the natural log and is sometimes preferred for this reason. Another reason to prefer the log base 10 is that translating from the original units to log units is simpler: the log base 10 increases by 1 for every order of magnitude increase, whereas the natural log increases by approximately 2.3.
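The effect of these corrections can be compared by measuring skewness before and after each transform (a minimal illustrative sketch; the sample values are made up, not taken from the article):

```python
import numpy as np
from scipy.stats import skew

# Illustrative positively skewed variable (values are made up).
rng = np.random.default_rng(1)
x = rng.lognormal(mean=2.0, sigma=1.2, size=5000)

transforms = {
    "original":    x,
    "log10":       np.log10(x),   # requires strictly positive values
    "natural log": np.log(x),
    "inverse":     1.0 / x,       # reverses the order of the values
    "square root": np.sqrt(x),
}

for name, values in transforms.items():
    print(f"{name:12s} skew = {skew(values):7.2f}")
```

Which transform brings the skew closest to zero depends on the data; for a lognormal-like sample such as this one the log transforms do best, while the inverse leaves the distribution heavily skewed. This echoes the point in the conclusion that the choice of transformation depends on the distribution that results.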
Negative Skew
Negative skew is less common than positive skew but has the same problems with bias that positive skew has. One correction used often to transform negatively skewed variables is a power transform: square, cube, or raise the variable to a higher power.
If the variable has a large magnitude, raising that value to a higher power will create very large transformed values and even cause numeric overflow. It is therefore advisable to scale the variable first by its magnitude before raising it to a higher power.
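A minimal sketch of this idea, scaling by the largest magnitude before cubing (the sample values are made up):

```python
import numpy as np
from scipy.stats import skew

# Illustrative negatively skewed variable: bulk near 100, long tail toward 0
# (values are made up).
rng = np.random.default_rng(2)
x = 100.0 * rng.beta(5, 1, size=5000)

# Scale by the largest magnitude first so the power cannot overflow,
# then raise to a higher power to stretch the upper end of the distribution.
scaled = x / np.abs(x).max()
transformed = scaled ** 3

print(f"before: skew = {skew(x):.2f}")            # clearly negative
print(f"after:  skew = {skew(transformed):.2f}")  # closer to 0
```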
An alternative approach is to take advantage of the same log transform already described. However, because the log transform is undefined for negative values, you must first ensure the values are positive before applying the log. Moreover, the log transform pulls in the tail of a positively skewed distribution. Therefore, you must first flip the distribution before applying the log transform and then restore the distribution to its original negative (now transformed) values. The equation for transforming either a positively or negatively skewed variable is as follows:
y = sign(x) × log10( |x| + 1 )
Inside the log transform, an absolute value is applied first to make the values positive, and one (1) is then added to shift the distribution away from zero. After applying the log transform, the original sign of each value is restored. Note that if the original values are all positive, the absolute value and the sign restoration have no effect, so this formula can be used for either positively or negatively skewed data.
This transformation can also be used on yet another kind of distribution: one with large tails in both the positive and negative directions that peaks at 0. This occurs for monetary fields such as profit/loss variables.
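A minimal sketch of this signed log transform (the sample values and variable names are illustrative only):

```python
import numpy as np
from scipy.stats import skew

def signed_log10(x):
    # Apply |x| + 1, take log base 10, then restore the original sign.
    # Works unchanged for positive-only, negative-only, or mixed-sign data.
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log10(np.abs(x) + 1.0)

# Illustrative negatively skewed variable: losses recorded as negative values
# (values are made up).
rng = np.random.default_rng(3)
losses = -rng.lognormal(mean=3.0, sigma=1.0, size=5000)

print(f"before: skew = {skew(losses):.2f}")               # strongly negative
print(f"after:  skew = {skew(signed_log10(losses)):.2f}") # close to 0
```

Applied to a two-tailed profit/loss variable, the same function pulls in both tails at once, which is why the single formula covers all three cases.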
Conclusion and Summary
The transformation used by the predictive modeller or data miner will vary depending on personal preference and on the distribution that results after applying the transformation.
Table 2 shows the summary of the methods described earlier.
