What is meant by skewness

A distribution is said to be symmetrical when, for a single-peaked curve, the mean, median, and mode are equal (mean = median = mode).

Skewness tells us the direction of the outliers. In a positive skew, the tail of the distribution curve is longer on the right side. This means the outliers lie further out towards the right, while the values on the left stay closer to the mean.
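A minimal sketch of how the sign of the skew can be checked numerically. The two small datasets and the helper name `moment_skewness` are made up for illustration; the formula is the plain moment-based coefficient (third central moment divided by the variance to the power 1.5), without any bias correction.

```python
from statistics import mean, median

def moment_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data: one symmetric set, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

print("symmetric skew:", moment_skewness(symmetric))
print("right-skewed skew:", moment_skewness(right_skewed))
print("mean vs median:", mean(right_skewed), median(right_skewed))
```

In the right-skewed set, the single large value pulls the mean above the median, and the skewness comes out positive.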


Consider a classification problem on a skewed dataset: a patient has only a 1% chance of having a disease and a 99% chance of not having it, and your model's accuracy is also around 99%. That accuracy tells you very little, because a dumb algorithm that always prints 0 also reaches about 99% accuracy while misclassifying every one of the 1% of patients who are sick. It will never tell a person that they have the disease.
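The point can be checked with a few lines of arithmetic. This is only a sketch on made-up data: 1000 hypothetical patients, 10 of whom have the disease, scored against an "always print 0" classifier.

```python
# Hypothetical data: 1000 patients, 10 of whom (1%) have the disease.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)  # the "always print 0" classifier

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: fraction of actual positives that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print("accuracy:", accuracy)
print("recall:", recall)
```

Accuracy looks excellent, yet not a single sick patient is found, which is why accuracy alone is misleading here.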

When working with a skewed dataset, we use different metrics instead of plain accuracy.

Precision / Recall : confusion matrix


| Metric | Question it answers | Formula |
| --- | --- | --- |
| Precision | Of all the patients where we predicted y=1, what fraction actually has the rare disease? | True positives / predicted positives = TP / (TP + FP) |
| Recall | Of all the patients who actually have the rare disease, what fraction did we correctly detect as having it? | True positives / actual positives = TP / (TP + FN) |
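A small sketch of both formulas, computed from confusion-matrix counts. The labels are made up, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions dictate.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 3 actual positives, 2 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print("precision:", precision)  # TP=1, FP=1
print("recall:", recall)        # TP=1, FN=2
```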

Trade-off between Precision and Recall

High Precision	If the algorithm diagnoses a patient with the rare disease, the patient probably does have it, so the diagnosis is accurate.
High Recall	If a patient has the rare disease, the algorithm will probably identify it correctly.

In logistic regression we normally predict y=1 when the output is at least the 0.5 threshold. But what if the disease treatment is expensive? Then we may want to predict the disease only when there is at least a 70 percent chance. Raising the threshold from 0.5 to 0.7 gives higher precision but lower recall, because y=1 is predicted much more rarely.

If the threshold is lowered to around 0.3, we get lower precision but higher recall.


By choosing this threshold, we can trade off precision against recall.
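The effect of moving the threshold can be sketched on a handful of made-up scores. The probabilities, labels, and the helper `precision_recall_at` are all hypothetical and chosen only to make the trend visible.

```python
# Hypothetical predicted probabilities and true labels (1 = has the disease).
probs = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

def precision_recall_at(threshold):
    """Predict y=1 when the probability is >= threshold, then score it."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, h in zip(labels, preds) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(labels, preds) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(labels, preds) if y == 1 and h == 0)
    # Assumed convention: 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.7, 0.5, 0.3):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.7 to 0.3 raises recall and lowers precision. Real data will not always move this smoothly.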

→ If you want to choose the trade-off between precision and recall automatically, you can use the F1 score.

F1 score


Suppose we compare three algorithms: Algorithm 1 has a balanced trade-off, Algorithm 2 has higher precision, and Algorithm 3 has higher recall …

To pick which algorithm to use, it is better to combine precision and recall into a single score, so we can simply pick the algorithm with the highest score. There are multiple ways to do this:

By taking the average of precision and recall
- (P + R) / 2

The highest average can go to an algorithm with very high recall but very low precision, which is not one we would want to pick, so the simple average is not recommended.

By computing the F1 score
- F1 score = 2PR / (P + R) ⇒ the harmonic mean of P and R, which is high only when both precision and recall are reasonably high
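To see how the two scores rank algorithms differently, here is a short sketch. The precision and recall values for the three algorithms are hypothetical numbers picked for illustration, not figures from these notes.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical (precision, recall) pairs for three algorithms.
algorithms = {
    "Algorithm 1": (0.5, 0.4),    # balanced
    "Algorithm 2": (0.7, 0.1),    # higher precision
    "Algorithm 3": (0.02, 1.0),   # higher recall, very low precision
}

for name, (p, r) in algorithms.items():
    print(f"{name}: average={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")

best_by_average = max(algorithms, key=lambda n: sum(algorithms[n]) / 2)
best_by_f1 = max(algorithms, key=lambda n: f1_score(*algorithms[n]))
print("best by average:", best_by_average)
print("best by F1:", best_by_f1)
```

With these numbers the plain average favours the algorithm that predicts y=1 almost all the time, while F1 favours the balanced one, because the harmonic mean is pulled down by whichever of P or R is small.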