Statistics for Data Science – Normal Distribution Example

Understanding statistical concepts is crucial for data science, as they form the foundation for data analysis and decision-making. One of the most important and widely used concepts in statistics is the normal distribution. In this post, we will explore the normal distribution and work through an example to illustrate how it can be applied in data science.

What is the Normal Distribution?

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean. In simple terms, it describes how data points are spread around a central value, with most values clustering around the mean and fewer values appearing as you move away from the mean in either direction.

The shape of a normal distribution is often referred to as a “bell curve” because of its characteristic bell-shaped graph. The key parameters of a normal distribution are:

Mean (μ): The central value of the distribution.
Standard Deviation (σ): A measure of how spread out the values are around the mean.

In a perfectly normal distribution, approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. This is often referred to as the 68-95-99.7 rule.
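The rule can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `share_within` is made up for illustration and is not part of any library.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```

The exact values show that 68, 95, and 99.7 are rounded figures; two standard deviations cover slightly more than 95% of the distribution.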

Example of Normal Distribution

Let’s consider an example where we have a dataset representing the heights of adult women in a population. Suppose the heights are normally distributed with the following characteristics:

Mean height (μ): 160 cm
Standard deviation (σ): 10 cm

We want to answer the following questions using this information:

What is the probability that a randomly selected woman has a height between 150 cm and 170 cm?
What percentage of women are taller than 180 cm?

Step 1: Standardizing the Data

To calculate probabilities and percentiles in a normal distribution with a Z-table, we first need to standardize the data. This involves converting the raw values into Z-scores, which represent how many standard deviations a value lies above or below the mean.

The formula for calculating a Z-score is:

Z = (X - μ) / σ

Where:

X: The raw value we want to standardize.
μ: The mean of the distribution.
σ: The standard deviation of the distribution.
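As a small sketch, the formula translates directly into a function. The name `z_score` and the input check are illustrative choices, and the values passed in are the mean and standard deviation from the height example.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Height example: mean 160 cm, standard deviation 10 cm.
print(z_score(150, 160, 10))  # -1.0
print(z_score(170, 160, 10))  # 1.0
print(z_score(180, 160, 10))  # 2.0
```

A negative Z-score means the value is below the mean; a positive one means it is above.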

Step 2: Probability Between 150 cm and 170 cm

We will first standardize the values 150 cm and 170 cm to find their respective Z-scores:

For X = 150 cm: Z = (150 - 160) / 10 = -1
For X = 170 cm: Z = (170 - 160) / 10 = 1

Now, we look up the Z-scores in the Z-table (standard normal distribution table) or use statistical software to find the cumulative probability for these Z-scores:

For Z = -1, the cumulative probability is 0.1587.
For Z = 1, the cumulative probability is 0.8413.
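If no Z-table is at hand, the same cumulative probabilities can be computed from the error function, using the identity Φ(z) = 0.5 · (1 + erf(z / √2)). This is a minimal sketch using only Python's `math` module; the function name `standard_normal_cdf` is an illustrative choice.

```python
import math


def standard_normal_cdf(z):
    """Cumulative probability P(Z <= z) for a standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


print(round(standard_normal_cdf(-1.0), 4))  # 0.1587
print(round(standard_normal_cdf(1.0), 4))   # 0.8413
```

Rounded to four decimal places, the results match the table values quoted above.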

To find the probability that a woman’s height is between 150 cm and 170 cm, we subtract the cumulative probability for Z = -1 from the cumulative probability for Z = 1:

P(150 ≤ X ≤ 170) = P(Z ≤ 1) - P(Z ≤ -1) = 0.8413 - 0.1587 = 0.6826

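So, the probability that a randomly selected woman has a height between 150 cm and 170 cm is 0.6826, or 68.26%. This agrees with the 68-95-99.7 rule, because the interval spans exactly one standard deviation on either side of the mean.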
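Statistical software can skip the standardization step and work with the raw heights directly. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist` is available; the variable names are illustrative.

```python
from statistics import NormalDist

# Height distribution from the example: mean 160 cm, standard deviation 10 cm.
heights = NormalDist(mu=160, sigma=10)

# P(150 <= X <= 170) as a difference of two cumulative probabilities.
p_between = heights.cdf(170) - heights.cdf(150)
print(round(p_between, 4))  # 0.6827
```

The result differs slightly from the table-based 0.6826 because the table values are themselves rounded to four decimal places before being subtracted.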

Step 3: Percentage of Women Taller than 180 cm

Next, we calculate the Z-score for 180 cm:

Z = (180 - 160) / 10 = 2

Looking up the Z-score of 2 in the Z-table gives a cumulative probability of 0.9772. This is the proportion of women who are shorter than 180 cm. To find the proportion of women taller than 180 cm, we subtract this value from 1:

P(X > 180) = 1 - P(Z ≤ 2) = 1 - 0.9772 = 0.0228

So, approximately 2.28% of women in this population are taller than 180 cm.
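The upper-tail calculation can be reproduced the same way. Again, this is a sketch assuming Python 3.8 or later and the example's parameters; the variable names are illustrative.

```python
from statistics import NormalDist

# Height distribution from the example: mean 160 cm, standard deviation 10 cm.
heights = NormalDist(mu=160, sigma=10)

# P(X > 180) is the complement of the cumulative probability at 180 cm.
p_taller = 1 - heights.cdf(180)
print(round(p_taller, 4))  # 0.0228
```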

Conclusion

The normal distribution is an essential concept in statistics and is widely used in data science for modeling data that is approximately symmetric and bell-shaped. Symmetry alone is not enough: the data should also cluster around the mean in the way the 68-95-99.7 rule describes. By understanding these properties and how to calculate probabilities using Z-scores, data scientists can make informed decisions based on data analysis.

In this post, we worked through an example of calculating probabilities using the normal distribution and applied it to a population of adult women's heights. Understanding these concepts is crucial for analyzing data and making data-driven decisions in various fields, including machine learning, finance, and healthcare.