Pandas: Machine Learning

Pandas is a Python library for working with tabular data: think spreadsheets, but in code.

This note explains Pandas the way a student would want it: by analogy and with small examples. Each concept gets a short explanation, a one-line analogy, and minimal runnable code.

Quick setup

Install Pandas if you don’t have it:

pip install pandas

Or in a Colab/Jupyter cell:

%pip install pandas --quiet

1. Core ideas

Series: a single column of data. Analogy: a single column in a spreadsheet or a named list.
DataFrame: a table with rows and columns. Analogy: a spreadsheet sheet or SQL table.
Index: labels for rows (like row numbers or named keys). Analogy: the row headers in a sheet.

These three are the building blocks you will use everywhere.

import pandas as pd

# Series
s = pd.Series([10, 20, 30], name='scores')

# DataFrame
df = pd.DataFrame({
	'name': ['Alice', 'Bob', 'Charlie'],
	'age': [25, 30, 22],
	'score': [85, 92, 78]
})

Analogy recap: Series = column, DataFrame = table, Index = row labels.
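To make the relationship between the three building blocks concrete, here is a minimal sketch. The row labels ('a', 'b', 'c' and the student names) are illustrative choices, not something the examples above require.

```python
import pandas as pd

# A Series with explicit row labels (its Index)
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='scores')

# A DataFrame whose Index holds names instead of 0, 1, 2
df = pd.DataFrame(
    {'age': [25, 30, 22], 'score': [85, 92, 78]},
    index=['Alice', 'Bob', 'Charlie'],
)

score_col = df['score']          # one column of a DataFrame is a Series
print(type(score_col).__name__)  # Series
print(list(df.index))            # ['Alice', 'Bob', 'Charlie']
print(s['b'])                    # 20 (lookup by label, not position)
```

The column Series shares the DataFrame's Index, which is why row labels stay attached when you pull a column out.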

2. Creating DataFrames

You can create a DataFrame from lists, dicts, NumPy arrays, or read from files.

# from dict (common)
df = pd.DataFrame({'A': [1,2], 'B': [3,4]})

# from list of dicts (like rows)
rows = [{'x':1,'y':2}, {'x':3,'y':4}]
df2 = pd.DataFrame(rows)

# from CSV (real-world)
# df = pd.read_csv('data.csv')

Small detail: when constructing from a dict, the keys become column names and all value lists must have the same length; otherwise Pandas raises a ValueError.
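The NumPy route and the length rule can be sketched as follows. The variable names (arr, df_np, mismatch_error) are made up for this illustration.

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array: each array row becomes a DataFrame row
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_np = pd.DataFrame(arr, columns=['A', 'B'])
print(df_np.shape)  # (3, 2)

# Columns of unequal length are rejected
try:
    pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)  # ValueError
```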

3. Reading and writing (I/O)

The most common formats are CSV, Excel, JSON, and SQL. In practice, read_csv and to_csv are the two functions you will reach for most often.

# read CSV
df = pd.read_csv('my_data.csv')

# write CSV
df.to_csv('out.csv', index=False)

Tip: index=False avoids saving the DataFrame index as an extra column.
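To see what index=False changes without touching the disk, this sketch writes the CSV to a string and reads it back through io.StringIO. The DataFrame contents are placeholder data.

```python
import io
import pandas as pd

df_io = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [85, 92]})

# With no path argument, to_csv returns the CSV text as a string
with_index = df_io.to_csv()
without_index = df_io.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,score  (unnamed index column first)
print(without_index.splitlines()[0])  # name,score

# Reading the text back gives the same columns and values
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip['score'].tolist())   # [85, 92]
```

If you do save the index by accident, reading the file back produces an extra column, typically named 'Unnamed: 0'.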

4. Inspecting data

df.head()        # first 5 rows
df.tail()        # last 5 rows
df.info()        # types and non-null counts
df.describe()    # summary stats for numeric columns
df.shape         # (rows, columns)
df.columns       # column names (an Index object; use list(df.columns) for a plain list)

Small detail: info() helps find missing values quickly (non-null counts).
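A small sketch of the inspection calls on a table with gaps. The data is invented for the example, and isna().sum() is shown next to info() as a more direct way to count missing values per column.

```python
import pandas as pd

df_check = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, None, 22],
    'score': [85, 92, None],
})

df_check.info()                          # prints dtypes and non-null counts per column

missing_per_col = df_check.isna().sum()  # number of missing values in each column
print(missing_per_col)

print(df_check.shape)                    # (3, 3)
print(list(df_check.columns))            # ['name', 'age', 'score']
```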

5. Selection and indexing

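df['col'] gives a Series. df.col also works, but only when the column name is a valid Python identifier and does not clash with a DataFrame attribute or method (for example, a column named count).
df[['a','b']] selects multiple columns (returns a DataFrame).
df.loc[row_label, col_label] is label-based selection.
df.iloc[row_idx, col_idx] is position-based selection (like arrays).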

names = df['name']            # Series
two_cols = df[['name','score']]  # DataFrame
row0 = df.loc[0]             # row with index label 0 (the first row under the default index)
cell = df.loc[0, 'score']    # single value by label
by_pos = df.iloc[0, 2]       # single value by position

Filtering (boolean indexing):

young = df[df['age'] < 30]   # rows where age < 30

Small detail: filtering with a boolean mask keeps the original index labels, so the result may have gaps (0, 2, ...). Use reset_index(drop=True) to renumber the rows from 0.
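The difference between labels and positions shows up right after filtering. This sketch rebuilds the three-student table from section 1 under the hypothetical name df_sel so that it runs on its own.

```python
import pandas as pd

df_sel = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 22],
    'score': [85, 92, 78],
})

young = df_sel[df_sel['age'] < 30]
print(list(young.index))          # [0, 2]  (original labels are kept)

# After filtering, label 2 and position 1 refer to the same row
print(young.loc[2, 'name'])       # Charlie
print(young.iloc[1, 0])           # Charlie

young_reset = young.reset_index(drop=True)
print(list(young_reset.index))    # [0, 1]

# Combine conditions with & (and) or | (or); each condition needs parentheses
young_high = df_sel[(df_sel['age'] < 30) & (df_sel['score'] >= 80)]
print(list(young_high['name']))   # ['Alice']
```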

6. Missing data

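Pandas represents missing numeric values as NaN; a None placed in a numeric column is converted to NaN, and isna() detects both. Common operations: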

df.dropna()               # drop rows with any missing value
df.dropna(axis=1)         # drop columns with any missing value
df.fillna(0)              # replace NaN with 0
df['col'].fillna(df['col'].mean())  # returns the column with NaN replaced by its mean

Small detail: inplace=True mutates the DataFrame (use sparingly). Preferred: reassign df = df.dropna().
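Here is a minimal sketch of those operations on a small table, showing that each call returns a new object and leaves the original untouched. The name df_na and the choice of median for age and mean for score are illustrative.

```python
import pandas as pd

df_na = pd.DataFrame({
    'age': [20, None, 22, 23],
    'score': [88, 92, None, 75],
})

dropped_rows = df_na.dropna()         # keeps only rows 0 and 3
dropped_cols = df_na.dropna(axis=1)   # both columns contain NaN, so none survive

# A dict passed to fillna gives each column its own fill value
filled = df_na.fillna({
    'age': df_na['age'].median(),     # median of 20, 22, 23 is 22
    'score': df_na['score'].mean(),   # mean of 88, 92, 75 is 85
})

print(list(dropped_rows.index))       # [0, 3]
print(dropped_cols.shape)             # (4, 0)
print(filled['age'].tolist())         # [20.0, 22.0, 22.0, 23.0]
print(int(df_na.isna().sum().sum()))  # 2  (the original still has its gaps)
```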

7. GroupBy and aggregation

GroupBy splits data by keys, applies functions, and combines results. Analogy: grouping grades by class and computing average per class.

data = pd.DataFrame({
	'team': ['A','A','B','B'],
	'points': [10, 15, 7, 12]
})

data.groupby('team')['points'].mean()

You can aggregate multiple functions:

data.groupby('team').agg({'points': ['mean','sum','count']})

Small detail: groupby returns a GroupBy object — apply aggregation to get a DataFrame/Series.
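The dict form of agg produces two-level column names such as ('points', 'mean'). One alternative, sketched here with made-up output column names, is named aggregation, which yields flat columns.

```python
import pandas as pd

data = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'points': [10, 15, 7, 12],
})

means = data.groupby('team')['points'].mean()
print(means['A'], means['B'])   # 12.5 9.5

# Named aggregation: new_column=(source_column, function)
summary = data.groupby('team').agg(
    mean_points=('points', 'mean'),
    total_points=('points', 'sum'),
    games=('points', 'count'),
).reset_index()                 # turn the 'team' index back into a column

print(list(summary.columns))    # ['team', 'mean_points', 'total_points', 'games']
```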

8. Merge / Join

Combine tables by keys (like SQL joins). Analogy: joining student info with exam scores by student id.

left = pd.DataFrame({'id':[1,2], 'name':['A','B']})
right = pd.DataFrame({'id':[1,2], 'score':[90,85]})
merged = left.merge(right, on='id', how='inner')

how can be 'inner' (the default), 'left', 'right', or 'outer'.

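Small detail: when both tables have a non-key column with the same name, Pandas appends '_x' and '_y' to tell them apart; pass suffixes=('_l','_r') to choose clearer names. Also check for duplicated values in the key column, since each matching pair of rows produces its own output row.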
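The join types and the suffixes argument are easier to compare side by side. In this sketch the tables are extended with non-matching ids and an overlapping score column purely for illustration.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C'], 'score': [50, 60, 70]})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [90, 85, 77]})

# 'score' exists in both tables, so suffixes decide the output column names
inner = left.merge(right, on='id', how='inner', suffixes=('_l', '_r'))
left_join = left.merge(right, on='id', how='left', suffixes=('_l', '_r'))
outer = left.merge(right, on='id', how='outer', suffixes=('_l', '_r'))

print(list(inner.columns))                  # ['id', 'name', 'score_l', 'score_r']
print(len(inner), len(left_join), len(outer))  # 2 3 4
print(int(left_join['score_r'].isna().sum()))  # 1  (id 3 has no match on the right)
```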

9. Apply, vectorized operations, and performance

Pandas is fastest when using vectorized operations (operate on whole columns) rather than pure Python loops.

Vectorized example:

df['double_score'] = df['score'] * 2

Use apply for custom logic that has no ready-made vectorized form. On a Series, it calls the function once per element:

def pass_fail(x):
	return 'pass' if x >= 80 else 'fail'

df['result'] = df['score'].apply(pass_fail)

Small detail: df.apply(func, axis=1) passes rows as Series — it’s slower than vectorized ops.
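Simple conditions like pass/fail often have a vectorized equivalent. As one possible approach (not the only one), this sketch uses numpy.where and checks that it matches the apply version; df_v is a stand-in table.

```python
import numpy as np
import pandas as pd

df_v = pd.DataFrame({'score': [85, 92, 78]})

# Element-by-element with apply
by_apply = df_v['score'].apply(lambda x: 'pass' if x >= 80 else 'fail')

# Whole column at once: np.where(condition, value_if_true, value_if_false)
by_where = np.where(df_v['score'] >= 80, 'pass', 'fail')

df_v['result'] = by_where
print(df_v['result'].tolist())              # ['pass', 'pass', 'fail']
print(list(by_apply) == list(by_where))     # True
```

On three rows the speed difference is invisible; it tends to matter once a table has many thousands of rows.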

10. A minimal example workflow

Load data: df = pd.read_csv('data.csv')
Inspect: df.info(), df.head()
Clean: handle missing values first, then fix types (df['col'] = df['col'].astype(int)); astype(int) raises an error if the column still contains NaN.
Analyze: groupby, agg, sort_values, create new columns.
Save results: df.to_csv('clean.csv', index=False)

Example:

# load
df = pd.DataFrame({
	'name': ['A','B','C','D'],
	'age': [20, None, 22, 23],
	'score': [88, 92, None, 75]
})

# clean
df['age'] = df['age'].fillna(df['age'].median())
df['score'] = df['score'].fillna(df['score'].mean())

# analyze
avg = df['score'].mean()
top = df.sort_values('score', ascending=False).head(2)

# save
df.to_csv('example_clean.csv', index=False)
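The "fix types" step from the workflow is not shown in the example above. A minimal sketch of it, using a standalone age column under the hypothetical name ages, shows why the order of cleaning matters.

```python
import pandas as pd

ages = pd.Series([20, None, 22, 23], name='age')   # stored as float because of the NaN

# Casting to int while NaN is still present fails
try:
    ages.astype(int)
    cast_error = None
except ValueError as exc:
    cast_error = exc

# Fill first, then cast
clean_ages = ages.fillna(ages.median()).astype(int)

print(cast_error is not None)   # True
print(clean_ages.tolist())      # [20, 22, 22, 23]
```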

11. Useful tiny tips (student-friendly)

Start with df.head() and df.info() to get a quick picture of the data.
Avoid inplace=True — reassign for clarity.
Use vectorized ops (column arithmetic) for speed.
When things go wrong, print df.dtypes and df.isna().sum() to debug types and missing data.
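A typical case the last tip catches is a numeric column that was read as text because of a stray placeholder. The sketch below assumes a made-up 'n/a' marker and uses pd.to_numeric with errors='coerce', which is one way to repair it, not the only one.

```python
import pandas as pd

raw = pd.DataFrame({'score': ['88', '92', 'n/a']})

print(raw.dtypes)               # 'score' is a text column, not a numeric one
print(raw.isna().sum())         # 0 missing: 'n/a' is just a string so far

# Convert to numbers; anything unparseable becomes NaN
cleaned = pd.to_numeric(raw['score'], errors='coerce')

print(int(cleaned.isna().sum()))   # 1
print(cleaned.iloc[0] + cleaned.iloc[1])
```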

12. Summary

Pandas turns messy tabular data into Python objects you can inspect, clean, analyze, and save. Learn Series, DataFrame, Index, selection, groupby, and merge first; together they cover most everyday tasks.

