Understanding NumPy, Pandas, and key plotting techniques is essential for anyone diving into data science. This guide covers core concepts, common operations, and visualization tools with practical use cases to help you build a solid foundation.
NumPy Arrays and Functions
NumPy (Numerical Python) offers efficient arrays and mathematical functions for numerical computing.
import numpy as np
arr = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])
Key functions:
np.mean(arr) # 2.0
np.std(arr) # Population standard deviation (ddof=0), about 0.816
np.arange(5) # [0 1 2 3 4]
np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1.]
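One detail worth seeing in code is that np.std divides by N by default, while many statistics texts divide by N - 1. The sketch below compares the two on the same array and also checks the spacing produced by np.linspace. The variable names are illustrative only.

```python
import numpy as np

arr = np.array([1, 2, 3])

# Default: population standard deviation (divides by N)
population_std = np.std(arr)

# ddof=1 switches to the sample standard deviation (divides by N - 1)
sample_std = np.std(arr, ddof=1)

# linspace includes both endpoints, so 5 points over [0, 1] are 0.25 apart
grid = np.linspace(0, 1, 5)
step = grid[1] - grid[0]
```

Pick ddof to match the convention your analysis expects; pandas uses ddof=1 by default for its own std method, so the two libraries can disagree on the same data.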
Accessing and Modifying NumPy Arrays
arr[1] # Access element at index 1
arr[0:2] # Slicing
arr[1] = 10 # Modify element
Useful for selecting, updating, or filtering array data efficiently.
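Filtering is mentioned above but not shown, so here is a minimal sketch using a boolean mask. It also illustrates a common surprise: basic slices are views on the original array, while boolean indexing returns a copy. The sample values are made up for the example.

```python
import numpy as np

arr = np.array([1, 10, 3, 7])

# Boolean mask: keep only the elements greater than 5
mask = arr > 5
filtered = arr[mask]

# A basic slice is a view, so writing to it changes the original array
view = arr[0:2]
view[0] = 99

# Boolean indexing returned a copy, so this write leaves arr untouched
filtered[0] = -1
```

If you need an independent slice, call .copy() on it before modifying.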
Saving and Loading NumPy Arrays
np.save('my_array.npy', arr)
loaded = np.load('my_array.npy')
Use .npz to save multiple arrays:
np.savez('arrays.npz', a=arr, b=matrix)
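Reading an .npz archive back works a little differently from a single .npy file: np.load returns a dictionary-like object keyed by the names passed to np.savez. The sketch below writes to a temporary directory (an assumption made so the example cleans up after itself) and reads both arrays back.

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    np.savez(path, a=arr, b=matrix)

    # The archive is opened lazily; each array is read when indexed by name
    with np.load(path) as archive:
        names = sorted(archive.files)
        restored_a = archive["a"]
        restored_b = archive["b"]
```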
Pandas Series
Series are one-dimensional labeled arrays.
import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s['b'] # 20
s['a'] = 15
Useful when you need labeled data and fast access by index.
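To make the labeled access concrete, here is a short sketch of three common Series operations: element-wise arithmetic, slicing by label, and filtering by condition. Note that label slices include the end label, unlike positional slices.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Arithmetic applies element-wise and keeps the labels
doubled = s * 2

# Label-based slicing includes both endpoints
subset = s['a':'b']

# Boolean filtering keeps the labels of the matching rows
large = s[s > 15]
```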
Pandas DataFrames
DataFrames are 2D labeled data structures, similar to SQL tables.
data = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
df = pd.DataFrame(data)
df['age'] # Access column
df.loc[0] # Access row by index label
df['age'] += 1 # Modify column
Combining DataFrames:
df2 = pd.DataFrame({'name': ['Carol'], 'age': [22]})
combined = pd.concat([df, df2], ignore_index=True)
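The effect of ignore_index is easier to see side by side. This sketch concatenates the same two frames with and without it, then shows how label-based access (loc) and position-based access (iloc) differ once labels repeat.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Carol'], 'age': [22]})

# Without ignore_index the original labels are kept: 0, 1, 0
kept = pd.concat([df, df2])

# With ignore_index=True the rows are relabeled 0, 1, 2
renumbered = pd.concat([df, df2], ignore_index=True)

# loc looks up by label, so a repeated label returns every matching row
rows_labeled_zero = kept.loc[0]

# iloc looks up by position, so it always returns exactly one row here
first_by_position = kept.iloc[0]
```

Repeated labels are legal in pandas but often unintended, which is why ignore_index=True is a common default when stacking rows.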
Pandas Functions
df.describe() # Summary statistics
df.info() # DataFrame structure
df['name'].value_counts() # Count occurrences of each unique name
df.groupby('name')['age'].mean() # Mean age per name (grouped aggregation)
These help analyze and summarize datasets quickly.
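With only one row per name, the two-row example above cannot show much grouping. The sketch below uses a small made-up table (the team and points columns are hypothetical) so that value_counts and groupby have repeated keys to work with.

```python
import pandas as pd

# Hypothetical data: several rows per team
sales = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B', 'B'],
    'points': [10, 20, 5, 15, 25],
})

# How many rows belong to each team
counts = sales['team'].value_counts()

# Several aggregations at once, one row per team
summary = sales.groupby('team')['points'].agg(['mean', 'max'])
```

Selecting the column before aggregating keeps the result limited to numeric data, which avoids errors when the frame also holds text columns.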
Saving and Loading Datasets Using Pandas
df.to_csv('data.csv', index=False)
df = pd.read_csv('data.csv')
Also supports Excel (.xlsx), JSON, Parquet, and more.
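A quick way to check that a save and load round trip preserves your data is to compare the frame before and after. This sketch uses an in-memory text buffer instead of a file on disk, which is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# Write CSV text into an in-memory buffer rather than a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back and compare with the original frame
restored = pd.read_csv(io.StringIO(csv_text))
round_trip_ok = restored.equals(df)
```

CSV stores everything as text, so columns such as dates come back as strings unless you ask read_csv to parse them; formats like Parquet keep column types.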
Data Loading and Overview
df.head() # First 5 rows
df.tail() # Last 5 rows
df.columns # Column names
df.dtypes # Data types
This is your first step when exploring a new dataset.
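Two further checks often go alongside the calls above: the overall shape and the number of missing values per column. A minimal sketch, using a made-up frame with one missing age:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, np.nan, 22],
})

# (rows, columns)
n_rows, n_cols = df.shape

# Count of missing values in each column
missing_per_column = df.isna().sum()
```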
Data Visualization with Seaborn and Matplotlib
The plotting snippets in this section assume df contains the columns they name (such as gender, income, or height); the small DataFrame built earlier has only name and age.
Histogram
import seaborn as sns
sns.histplot(df['age'])
Use Case: Understand the distribution of a numeric variable (e.g., age distribution).
Box Plot
sns.boxplot(x='gender', y='income', data=df)
Use Case: Spot outliers and compare medians across categories.
Bar Plot
sns.barplot(x='department', y='salary', data=df)
Use Case: Compare categorical variables (e.g., average salary by department).
Line Plot
sns.lineplot(x='date', y='price', data=df)
Use Case: Track changes over time, such as stock prices.
Scatter Plot
sns.scatterplot(x='height', y='weight', data=df)
Use Case: Reveal correlation or clustering between two numeric variables.
Joint Plot
sns.jointplot(x='height', y='weight', data=df)
Use Case: Show a scatter plot of two variables together with the marginal distribution of each.
Violin Plot
sns.violinplot(x='gender', y='score', data=df)
Use Case: Show the distribution of a numeric variable within each category, using a kernel density estimate.
Strip Plot
sns.stripplot(x='class', y='score', data=df, jitter=True)
Use Case: Show all individual data points along a category.
Swarm Plot
sns.swarmplot(x='group', y='value', data=df)
Use Case: Like strip plot, but avoids overlapping points.
Cat Plot
sns.catplot(x='team', y='points', kind='bar', data=df)
Use Case: One figure-level function for all categorical plot types (chosen with kind), with optional faceting across subgroups.
Pair Plot
sns.pairplot(df[['height', 'weight', 'age']])
Use Case: Visualize pairwise relationships across multiple variables.
Heatmaps
corr = df.corr(numeric_only=True) # Skip non-numeric columns such as name
sns.heatmap(corr, annot=True)
Use Case: Visualize a correlation matrix or any 2D data as colors.
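The correlation matrix is only defined for numeric columns, so it helps to select them explicitly before plotting. This sketch builds a small made-up table (all values are hypothetical) and computes the matrix that would be passed to sns.heatmap.

```python
import pandas as pd

# Hypothetical data: one text column and three numeric columns
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dan'],
    'height': [150, 160, 170, 180],
    'weight': [50, 60, 70, 80],
    'age': [40, 30, 35, 20],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = people.select_dtypes(include='number')
corr = numeric.corr()
```

Here height and weight rise in lockstep, so their correlation is 1, while age falls as height rises and therefore correlates negatively with both.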
Customizing Plots
import matplotlib.pyplot as plt
plt.title("Sample Title")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend() # Needs at least one plotted element with a label
plt.grid(True)
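The same customizations can be applied through an explicit Axes object, which scales better once a figure has several subplots. A minimal sketch with made-up data follows; the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # Assumption: headless run; not needed in a notebook

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))

# The label is what legend() displays
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x^2")

ax.set_title("Sample Title")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.legend()
ax.grid(True)
fig.tight_layout()
```

Seaborn's axes-level functions such as histplot and boxplot return an Axes too, so the same set_title and set_xlabel calls work on their results.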
Plotly for Interactive Plots
import plotly.express as px
fig = px.scatter(df, x='height', y='weight', color='gender')
fig.show()
Use Case: When you want zooming, panning, and hover tooltips in dashboards or reports.