The data is said to have a symmetrical distribution when the mean, median, and mode are all equal (this holds for a symmetrical distribution with a single peak).
Skewness tells us the direction of the outliers. In a positive skew, the tail of the distribution curve is longer on the right side. This means the outliers lie further out towards the right, while the values on the left sit closer to the mean.
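
To make the positive-skew case concrete, here is a minimal sketch using only the standard library. The sample `data` and the helper `skewness` are made up for illustration; the helper uses the population form of skewness (mean of the cubed z-scores), which is one common definition among several.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: most values are small,
# and a single outlier (20) sits far out on the right.
data = [1, 2, 2, 3, 3, 3, 4, 4, 20]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    m = mean(values)
    std = (sum((x - m) ** 2 for x in values) / n) ** 0.5
    return sum(((x - m) / std) ** 3 for x in values) / n


print("mean:    ", mean(data))
print("median:  ", median(data))
print("mode:    ", mode(data))
print("skewness:", skewness(data))
```

In this sample the outlier pulls the mean above the median and mode, and the skewness comes out positive. A symmetrical sample such as `[1, 2, 3, 4, 5]` gives a skewness of zero.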
Skewed classes cause trouble in classification problems. Suppose only 1% of patients have a disease and 99% do not, and your model's accuracy is also around 99%. That number tells you very little: a dumb algorithm that always prints 0 would also be about 99% accurate, yet it misclassifies every patient in the 1% and will never tell a person that they have the disease.
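
The "always print 0" problem can be sketched in a few lines. The patient counts below are hypothetical, chosen only to match the 1% / 99% split described above.

```python
# Hypothetical dataset: 1000 patients, 10 of whom (1%) have the disease.
y_true = [1] * 10 + [0] * 990

# The "dumb" classifier: always predict 0 (no disease).
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Number of sick patients the classifier actually caught.
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2%}")
print(f"sick patients detected: {detected} of {sum(y_true)}")
```

Accuracy looks excellent, but not a single sick patient is detected, which is why accuracy alone is misleading on skewed data.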
When working with a skewed dataset, we use different metrics:

| Metric | Definition |
| --- | --- |
| Precision | Of all the patients where we predicted y = 1, what fraction actually has the rare disease? True positives / predicted positives = TP / (TP + FP) |
| Recall | Of all the patients who actually have the rare disease, what fraction did we correctly detect as having it? True positives / actual positives = TP / (TP + FN) |

| Result | What it means |
| --- | --- |
| High Precision | If the algorithm diagnoses a patient with the rare disease, the patient probably does have it, so the diagnosis is accurate. |
| High Recall | If a patient has the rare disease, the algorithm will probably identify it correctly. |

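
The two formulas can be computed directly from the labels and predictions. This is a minimal sketch: the function name `precision_recall` and the toy labels are illustrative, and returning 0.0 when a denominator is zero is an assumed convention, not something the notes specify.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the rare class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Assumed convention: report 0.0 when the denominator is zero
    # (for example, a classifier that never predicts 1).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: 4 sick patients, 6 healthy ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # 3 TP, 1 FN, 1 FP

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under this convention, a classifier that always predicts 0 scores 0.0 on both metrics, which exposes the problem that accuracy hides.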
In logistic regression we normally classify using a threshold of 0.5. But what if the disease treatment is expensive? Then we may want to predict y = 1 only when there is at least a 70 percent chance that the patient has the disease. Raising the threshold from 0.5 to 0.7 generally gives higher precision but lower recall, because positive predictions become rarer.
If the threshold is lowered to around 0.3, we get lower precision but higher recall.
By choosing this threshold we can trade off precision against recall.
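
The effect of moving the threshold can be sketched as follows. The predicted probabilities `probs` and labels `y_true` are invented for illustration, and `precision_recall` is a hypothetical helper implementing TP / (TP + FP) and TP / (TP + FN).

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall); 0.0 is an assumed fallback for empty denominators."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model outputs (probability of disease) and the true labels.
probs = [0.95, 0.85, 0.75, 0.65, 0.60, 0.45, 0.40, 0.35, 0.20, 0.10]
y_true = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict 1 only when the probability reaches the threshold.
    y_pred = [1 if p >= threshold else 0 for p in probs]
    results[threshold] = precision_recall(y_true, y_pred)
    precision, recall = results[threshold]
    print(f"threshold {threshold}: precision = {precision:.3f}, recall = {recall:.3f}")
```

On this toy data, precision rises and recall falls as the threshold goes from 0.3 to 0.7. On real data the trend is usually similar, though precision is not guaranteed to increase at every step.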
→ If you want to pick the trade-off between precision and recall automatically, you can use the F1 score.
Here, Algorithm 1 has a balanced trade-off, Algorithm 2 has higher precision, and Algorithm 3 has higher recall …
To pick which algorithm to use, it helps to combine precision and recall into a single score, so we can see which algorithm scores highest and pick that one. There are multiple ways to do it:
(P+R ) / 2
As we can see, the highest average goes to the algorithm with high recall but very low precision, and we can't go with that one (so the simple average is not recommended).
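
A short comparison shows why the simple average misleads and how the F1 score behaves differently. The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The precision and recall values below are illustrative assumptions, not numbers taken from these notes, and the names `algorithms`, `average`, and `f1` are hypothetical.

```python
# Illustrative (precision, recall) pairs; these values are assumed, not from the notes.
algorithms = {
    "Algorithm 1": (0.50, 0.40),  # balanced trade-off
    "Algorithm 2": (0.70, 0.10),  # higher precision, low recall
    "Algorithm 3": (0.02, 1.00),  # higher recall, very low precision
}


def average(p, r):
    """Simple arithmetic mean of precision and recall."""
    return (p + r) / 2


def f1(p, r):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


for name, (p, r) in algorithms.items():
    print(f"{name}: average = {average(p, r):.3f}, F1 = {f1(p, r):.3f}")

best_by_average = max(algorithms, key=lambda name: average(*algorithms[name]))
best_by_f1 = max(algorithms, key=lambda name: f1(*algorithms[name]))
print("best by average:", best_by_average)
print("best by F1:     ", best_by_f1)
```

With these numbers the simple average prefers the algorithm that predicts y = 1 almost all the time, while F1 prefers the balanced one, because the harmonic mean is dragged down whenever either precision or recall is very small.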