Netflix Movie Recommendation System

A content-based filtering system built on the 8,807 titles in the Kaggle Netflix Movies & TV Shows dataset, using TF-IDF vectorization and cosine similarity.

Total Titles: 8,807 (Netflix dataset)
Movies: 6,131 (69.6% of catalog)
TV Shows: 2,676 (30.4% of catalog)
Year Range: 1925–2021 (96 years of content)

Project Pipeline

1. Data Collection: import 8,807 Netflix titles, each with 12 features, from CSV.
2. Data Cleaning: handle nulls, remove duplicates, and standardize genres.
3. EDA: bar charts, genre distribution, and rating-vs-year trends.
4. TF-IDF: vectorize descriptions and genres into a numerical matrix.
5. Cosine Similarity: build an n×n similarity matrix across all movies.
6. Recommend: return the top-N most similar movies for any title.
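The TF-IDF, Cosine Similarity, and Recommend steps can be sketched without any third-party dependencies. The example below is a minimal illustration, not the project's actual code: the four catalog entries are made up, the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `recommend`) are hypothetical, and the smoothed IDF formula is one common choice among several. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would typically replace the hand-written parts.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on any non-alphanumeric character."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def recommend(title, titles, vectors, top_n=5):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = titles.index(title)
    scored = [(cosine(vectors[idx], v), t)
              for i, (t, v) in enumerate(zip(titles, vectors)) if i != idx]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:top_n]]

# Made-up catalog: description text followed by genre tags (illustrative only).
catalog = {
    "Space Rescue": "An astronaut crew stranded in space fights to survive. Sci-Fi & Fantasy, Thrillers",
    "Orbit Down": "An astronaut must survive alone in a space station. Sci-Fi & Fantasy, Dramas",
    "Kitchen Wars": "Rival chefs compete in a comedy cooking contest. Comedies",
    "Last Laugh": "A retired comic returns to comedy. Comedies, Dramas",
}
titles = list(catalog)
vectors = tfidf_vectors(list(catalog.values()))
top_match = recommend("Space Rescue", titles, vectors, top_n=1)
print(top_match)
```

Because the vectors are L2-normalized up front, cosine similarity reduces to a sparse dot product, which is why the full n×n matrix is cheap to compute row by row.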

[Chart: Content Type Split. Movies vs TV Shows in the dataset.]
[Chart: Titles Added Per Year. Content added 2010–2021.]
[Chart: Top 8 Ratings. Content rating distribution.]
[Chart: Top 10 Genres. Most common genre tags.]

Exploratory Data Analysis

A closer look at the Netflix dataset, covering genre distribution, content trends, and rating patterns, before building the recommendation engine.

Genre Distribution

[Chart: All Genre Tags (Top 15). Each movie can have multiple genre tags.]

Key finding: "International Movies" is the single largest genre tag (2,752 titles), ahead of "Dramas" (2,427) and "Comedies" (1,674). This suggests that Netflix's catalog is heavily globalized, with Dramas and Comedies as the dominant genre tags after it. For our recommender, genre weighting matters: a very common tag such as "International Movies" carries less discriminative power on its own than a rarer one such as "Thrillers" or "Sci-Fi & Fantasy".
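The point about discriminative power can be made concrete with inverse document frequency. The sketch below first counts multi-valued genre tags on a few made-up `listed_in` strings, then computes a plain IDF, ln(N / df), from the three tag counts quoted above. It assumes each tag is treated as a single token; a word-level tokenizer would split "International Movies" into two terms and give different numbers.

```python
import math
from collections import Counter

# Made-up rows in the comma-separated style of the listed_in column.
sample_listed_in = [
    "Dramas, International Movies",
    "Comedies, Dramas",
    "International Movies, Thrillers",
]
# Each title contributes one count to every tag it carries.
tag_counter = Counter(tag.strip() for row in sample_listed_in for tag in row.split(","))

N_TITLES = 8807  # catalog size reported for the dataset
# Tag frequencies quoted in the key finding above.
tag_counts = {"International Movies": 2752, "Dramas": 2427, "Comedies": 1674}

def idf(doc_freq, n_docs=N_TITLES):
    """Plain inverse document frequency: ln(N / df)."""
    return math.log(n_docs / doc_freq)

idf_by_tag = {tag: idf(count) for tag, count in tag_counts.items()}
for tag, weight in sorted(idf_by_tag.items(), key=lambda kv: kv[1]):
    print(f"{tag:22s} df={tag_counts[tag]:5d}  idf={weight:.3f}")
```

The more titles share a tag, the lower its IDF, so a TF-IDF representation already down-weights the most common tags without any manual genre weighting.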

Content Trend (2010–2021)

[Chart: Netflix Content Growth. Number of titles released each year (2010–2021).]

Key finding: the number of titles by release year grew sharply from 2015 to 2018, peaking at 1,147 titles in 2018. Counts dipped in 2019 and 2020 (the 2020 decline may partly reflect COVID-19 production delays, although the 2019 dip predates the pandemic), then fell further in 2021 because the dataset captures only part of that year. For the recommender, titles released after 2015 dominate the corpus, so recommendations will tend to skew toward recent content simply because more recent titles are available as candidates.

Rating Analysis

[Chart: Rating Distribution. TV-MA and TV-14 dominate.]
[Chart: Movies vs TV: Rating Comparison. How ratings differ by content type.]

Key finding: TV-MA (adult/mature) is the most common rating at 3,207 titles (36.4%), followed by TV-14 at 2,160. This suggests that the catalog leans toward adult audiences. For a personalized recommender, rating should be a filter option: family users need PG/TV-PG content, while mature audiences can explore the full catalog. Our system includes this as a live filter.
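A rating filter of this kind can be applied as a post-processing step on an already ranked list. The sketch below is an assumption about how such a filter might look, not the project's implementation: `filter_by_rating` is a hypothetical helper, the titles are made up, and the contents of `FAMILY_RATINGS` are an illustrative guess at a family-safe set.

```python
# Assumed family-safe ratings; adjust to the audience in question.
FAMILY_RATINGS = {"G", "PG", "TV-Y", "TV-Y7", "TV-G", "TV-PG"}

def filter_by_rating(ranked_titles, rating_of, allowed):
    """Keep only titles whose rating is in `allowed`, preserving rank order."""
    return [t for t in ranked_titles if rating_of.get(t) in allowed]

# Made-up ranked recommendations and their ratings.
ranked = ["Example Thriller", "Example Cartoon", "Example Adventure"]
rating_of = {
    "Example Thriller": "TV-MA",
    "Example Cartoon": "TV-Y7",
    "Example Adventure": "PG",
}
family_picks = filter_by_rating(ranked, rating_of, FAMILY_RATINGS)
print(family_picks)
```

Filtering after ranking keeps the similarity computation independent of the audience. The trade-off is that a strict filter can return fewer than N results unless more candidates are ranked first.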

Data Quality Assessment

[Chart: Missing Values Per Column. Null counts across the 12 dataset features.]

Key finding: director has the most missing values (~2,634 rows), followed by country (~831 rows) and cast (~825 rows). For the TF-IDF vectorizer, we rely on description and listed_in (genres). Both have near-zero nulls, making them the ideal features for our content-based engine. Missing director and cast values are replaced with empty strings before vectorization.
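The null handling described here can be sketched with the standard library alone. The example below uses a two-row, made-up CSV that borrows the dataset's column names. It counts empty cells per column and builds the text passed to the vectorizer from description and listed_in, treating missing values as empty strings. `build_text` is a hypothetical helper, and the real project may well use pandas instead.

```python
import csv
import io
from collections import Counter

# Made-up sample using a subset of the dataset's column names.
SAMPLE_CSV = """title,director,cast,country,listed_in,description
Example One,,Actor A,,"Dramas, International Movies",A family drama.
Example Two,Director B,,United States,Comedies,A road-trip comedy.
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Count empty cells per column (missing values arrive as empty strings).
null_counts = Counter(
    col for row in rows for col, val in row.items() if not (val or "").strip()
)

def build_text(row, fields=("description", "listed_in")):
    """Join the chosen fields into one string, treating missing values as empty."""
    return " ".join((row.get(f) or "").strip() for f in fields).strip()

texts = [build_text(r) for r in rows]
print(dict(null_counts))
print(texts)
```

Replacing missing values with empty strings means a title with no director or cast simply contributes no tokens for those fields, rather than a literal "nan" token that would create spurious similarity between unrelated titles.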

Live Recommendation Engine