Foundations of Data Science – NumPy, Pandas, and Data Visualization

Understanding NumPy, Pandas, and key plotting techniques is essential for anyone diving into data science. This guide covers core concepts, common operations, and visualization tools with practical use cases to help you build a solid foundation.

NumPy Arrays and Functions

NumPy (Numerical Python) offers efficient arrays and mathematical functions for numerical computing.

import numpy as np

arr = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])

Key functions:

np.mean(arr)          # 2.0
np.std(arr)           # Population standard deviation (ddof=0), about 0.816
np.arange(5)          # [0 1 2 3 4]
np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ]
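Two details are easy to miss with these functions: aggregations accept an axis argument on 2D arrays, and np.std computes the population standard deviation unless ddof is set. A minimal sketch, reusing the arr and matrix values from above:

```python
import numpy as np

arr = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])

# axis=0 aggregates down the rows (one result per column)
col_means = np.mean(matrix, axis=0)   # [2. 3.]
# axis=1 aggregates across the columns (one result per row)
row_means = np.mean(matrix, axis=1)   # [1.5 3.5]

# np.std defaults to ddof=0 (population); ddof=1 gives the sample estimate
pop_std = np.std(arr)                 # about 0.816
sample_std = np.std(arr, ddof=1)      # 1.0
```

When the array holds a sample rather than the whole population, ddof=1 is usually the value you want.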

Accessing and Modifying NumPy Arrays

arr[1]        # Access element at index 1
arr[0:2]      # Slice: elements at indices 0 and 1 (a view, not a copy)
arr[1] = 10   # Modify element

Indexing and slicing let you select, update, or filter array data efficiently, without writing Python loops.
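Filtering is usually done with a boolean mask. The sketch below uses a small made-up array to show how a condition selects elements and how the same mask can drive an update:

```python
import numpy as np

arr = np.array([1, 10, 3, 7])

# Comparing an array to a scalar yields a boolean mask of the same shape
mask = arr > 5                 # [False  True False  True]

# Indexing with the mask keeps only the elements where it is True
filtered = arr[mask]           # [10  7]

# The same idea works for assignment; copy first to leave arr untouched
clipped = arr.copy()
clipped[clipped > 5] = 5       # [1 5 3 5]
```

Mask indexing returns a new array, so changing filtered does not affect arr.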

Saving and Loading NumPy Arrays

np.save('my_array.npy', arr)
loaded = np.load('my_array.npy')

Use np.savez to store several arrays in a single .npz archive:

np.savez('arrays.npz', a=arr, b=matrix)
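Loading an .npz archive works a little differently from loading a single .npy file: np.load returns an archive object, and each array is retrieved by the keyword name it was saved under. A small sketch, writing to a temporary directory so it leaves no files behind:

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    np.savez(path, a=arr, b=matrix)

    # The archive is read lazily; the with-block closes the file afterwards
    with np.load(path) as archive:
        names = sorted(archive.files)   # ['a', 'b']
        a_loaded = archive["a"]
        b_loaded = archive["b"]
```

If file size matters, np.savez_compressed takes the same arguments and writes a compressed archive.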

Pandas Series

Series are one-dimensional labeled arrays.

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s['b']     # 20
s['a'] = 15

A Series is useful when you need labeled data and fast lookup by index label.
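Labels also matter for arithmetic: pandas aligns two Series by index label, not by position. A brief sketch (the second Series, other, is made up for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Select several labels at once, or filter with a condition
subset = s[["a", "c"]]
large = s[s > 15]              # keeps labels 'b' and 'c'

# Arithmetic aligns on labels; fill_value=0 treats a missing label as 0
other = pd.Series([1, 2], index=["b", "c"])
total = s.add(other, fill_value=0)   # a: 10.0, b: 21.0, c: 32.0
```

Writing s + other instead would produce NaN for label 'a', because other has no value there.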

Pandas DataFrames

DataFrames are 2D labeled data structures, similar to SQL tables.

data = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
df = pd.DataFrame(data)
df['age']         # Access column
df.loc[0]         # Access row by index label
df['age'] += 1    # Modify column

Combining DataFrames:

df2 = pd.DataFrame({'name': ['Carol'], 'age': [22]})
combined = pd.concat([df, df2], ignore_index=True)
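The ignore_index=True argument is what keeps the row labels unique after stacking. A quick sketch comparing the two behaviors, using the same two frames:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
df2 = pd.DataFrame({"name": ["Carol"], "age": [22]})

# Without ignore_index, each frame keeps its own labels: 0, 1, 0
kept = pd.concat([df, df2])

# With ignore_index=True, the result gets a fresh 0..n-1 index
reset = pd.concat([df, df2], ignore_index=True)
```

Duplicate labels are legal in pandas, but kept.loc[0] would then return two rows, which is rarely what you want.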

Pandas Functions

df.describe()                               # Summary statistics for numeric columns
df.info()                                   # Column names, dtypes, and non-null counts
df.value_counts('name')                     # Row count for each unique value of 'name'
df.groupby('name').mean(numeric_only=True)  # Mean of each numeric column per group

These help analyze and summarize datasets quickly.
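Grouping is easier to see when a key repeats. The sketch below assumes a hypothetical team column (not part of the earlier df) and made-up values:

```python
import pandas as pd

# Hypothetical data: 'team' is added here only to have a repeating key
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "team": ["red", "blue", "red", "blue"],
    "age": [25, 30, 22, 40],
})

# How many rows fall into each team
team_counts = people["team"].value_counts()

# One aggregate per group: blue -> 35.0, red -> 23.5
mean_age = people.groupby("team")["age"].mean()

# Several aggregates at once, one column per statistic
summary = people.groupby("team")["age"].agg(["min", "max", "mean"])
```

Selecting the column before aggregating (["age"]) avoids errors from non-numeric columns such as name.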

Saving and Loading Datasets Using Pandas

df.to_csv('data.csv', index=False)
df = pd.read_csv('data.csv')

Pandas also reads and writes Excel (.xlsx), JSON, Parquet, and other formats.
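The other formats follow the same to_* / read_* pattern. As a sketch, here is a JSON round trip done in memory; Excel and Parquet work similarly but need an extra package installed (for example openpyxl or pyarrow):

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# orient="records" writes one JSON object per row
json_text = df.to_json(orient="records")

# Wrap the string in StringIO so read_json treats it as a file-like object
restored = pd.read_json(io.StringIO(json_text), orient="records")
```

Passing a file path instead of a buffer writes to or reads from disk in the same way.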

Data Loading and Overview

df.head()      # First 5 rows
df.tail()      # Last 5 rows
df.columns     # Column names
df.dtypes      # Data types

This is your first step when exploring a new dataset.
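A first look usually also includes the dataset's size and a check for missing values. A minimal sketch on a made-up frame with one missing age:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [25, np.nan, 22],
})

n_rows, n_cols = df.shape        # (3, 2)
missing = df.isna().sum()        # missing values per column
first_two = df.head(2)           # head() accepts a row count
```

Here missing reports 0 for name and 1 for age, which tells you where cleaning is needed before any analysis.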

Data Visualization with Seaborn and Matplotlib

Histogram

import seaborn as sns
sns.histplot(df['age'])

Use Case: Understand the distribution of a numeric variable (e.g., age distribution).

Box Plot

sns.boxplot(x='gender', y='income', data=df)

Use Case: Spot outliers and compare medians across categories.
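The plotting snippets in this section assume a DataFrame that already has the named columns. As a sketch, the example below builds a small made-up frame with gender and income columns so the box plot can be run end to end:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display

import pandas as pd
import seaborn as sns

# Hypothetical sample data; values are invented for illustration
people = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "income": [42, 48, 51, 120, 40, 45, 47, 52],  # in thousands
})

ax = sns.boxplot(x="gender", y="income", data=people)
ax.set_title("Income by gender")

# The line inside each box corresponds to the group median
medians = people.groupby("gender")["income"].median()   # F: 49.5, M: 46.0
```

In this data the value 120 lies far above the rest of its group, so it should show up as an individual outlier point rather than inside the whiskers.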

Bar Plot

sns.barplot(x='department', y='salary', data=df)

Use Case: Compare a numeric summary across categories. By default each bar shows the mean, with an error bar for its uncertainty (e.g., average salary by department).

Line Plot

sns.lineplot(x='date', y='price', data=df)

Use Case: Track changes over time, such as stock prices.

Scatter Plot

sns.scatterplot(x='height', y='weight', data=df)

Use Case: Reveal correlation or clustering between two numeric variables.

Joint Plot

sns.jointplot(x='height', y='weight', data=df)

Use Case: Show the scatter plot of two variables together with each variable's distribution along the margins.

Violin Plot

sns.violinplot(x='gender', y='score', data=df)

Use Case: Show the distribution of a numeric variable within each category using a kernel density estimate.

Strip Plot

sns.stripplot(x='class', y='score', data=df, jitter=True)

Use Case: Show all individual data points along a category.

Swarm Plot

sns.swarmplot(x='group', y='value', data=df)

Use Case: Like a strip plot, but points are shifted so they do not overlap. Best suited to small datasets.

Cat Plot

sns.catplot(x='team', y='points', kind='bar', data=df)

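Use Case: A single figure-level interface to the categorical plots above (chosen with kind), with optional faceting by subgroup through the row and col arguments.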

Pair Plot

sns.pairplot(df[['height', 'weight', 'age']])

Use Case: Visualize pairwise relationships across multiple variables.

Heatmaps

corr = df.corr(numeric_only=True)  # Skip non-numeric columns such as 'name'
sns.heatmap(corr, annot=True)

Use Case: Visualize a correlation matrix or any 2D data as colors.
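A runnable sketch of the correlation heatmap, assuming a small made-up frame that mixes one text column with three numeric ones. Fixing vmin and vmax to the correlation range is a design choice here, not a requirement:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display

import pandas as pd
import seaborn as sns

# Hypothetical data; values are invented for illustration
body = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "height": [160, 170, 180, 190],
    "weight": [55, 65, 75, 85],
    "age": [40, 30, 35, 25],
})

# numeric_only=True drops the text column before computing correlations
corr = body.corr(numeric_only=True)

# Pin the color scale to [-1, 1] so 0 sits at the middle of a diverging palette
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

With a pinned scale, heatmaps from different datasets stay visually comparable.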

Customizing Plots

import matplotlib.pyplot as plt

plt.title("Sample Title")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()      # Needs at least one plotted element with a label
plt.grid(True)

Plotly for Interactive Plots

import plotly.express as px

fig = px.scatter(df, x='height', y='weight', color='gender')
fig.show()

Use Case: When you want zooming, panning, and hover tooltips in dashboards or reports.
