Netflix Movie Recommendation System

A content-based filtering system built on 8,807 Netflix titles using TF-IDF vectorization and cosine similarity. Developed using the Kaggle Netflix Movies & TV Shows dataset.

Total Titles: 8,807 (Netflix dataset)
Movies: 6,131 (69.6% of catalog)
TV Shows: 2,676 (30.4% of catalog)
Year Range: 1925–2021 (96 years of content)

Project Pipeline

1. Data Collection: Import 8,807 Netflix titles with 12 features each from CSV
2. Data Cleaning: Handle nulls, remove duplicates, standardize genres
3. EDA: Bar charts, genre distribution, rating vs year trends
4. TF-IDF: Vectorize descriptions + genres into numerical matrix
5. Cosine Similarity: Build n×n similarity matrix across all movies
6. Recommend: Return top-N most similar movies for any title
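
Steps 4 to 6 can be sketched end to end in a few lines. The sketch below is a minimal illustration using only the Python standard library, not the project's actual code: the toy catalog, the tokenizer, and the function names (tokenize, tfidf_vectors, similarity_matrix, recommend) are all hypothetical, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 is one common choice rather than a confirmed detail of this project.

```python
import math
import re
from collections import Counter

# Hypothetical toy catalog: title -> description text followed by genre tags.
CATALOG = {
    "Space Run": "astronauts fight to survive on a damaged space station sci-fi thrillers",
    "Orbit Down": "a damaged space station crew must survive sci-fi thrillers",
    "Love in Rome": "two strangers fall in love in rome romantic comedies",
    "Laugh Track": "a comedian struggles with fame comedies",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length sparse vector ({term: weight}) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in Counter(tokens).items()
        }
        # L2-normalise so that a plain dot product equals cosine similarity.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors (their cosine similarity)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def similarity_matrix(vectors):
    """Build the n x n matrix of pairwise cosine similarities."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


def recommend(title, titles, sim, top_n=3):
    """Return the top_n (title, score) pairs most similar to `title`, excluding itself."""
    row = sim[titles.index(title)]
    ranked = sorted(
        ((other, score) for other, score in zip(titles, row) if other != title),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


titles = list(CATALOG)
sim = similarity_matrix(tfidf_vectors([CATALOG[t] for t in titles]))
print(recommend("Space Run", titles, sim, top_n=2))
```

On the full 8,807-title corpus a dense n×n matrix holds roughly 77.6 million entries, so a sparse matrix library would likely be a better fit than nested Python lists.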

Content Type Split (chart): Movies vs TV Shows in the dataset
Titles Added Per Year (chart): Content added 2010–2021
Top 8 Ratings (chart): Content rating distribution
Top 10 Genres (chart): Most common genre tags

Exploratory Data Analysis

A closer look at the Netflix dataset: genre distribution, content trends, and rating patterns, examined before building the recommendation engine.

Genre Distribution

All Genre Tags, Top 15 (chart): Each movie can have multiple genre tags

Key finding: "International Movies" is the single largest genre tag (2,752 titles), ahead of "Dramas" (2,427), which indicates a heavily globalized catalog. "Comedies" (1,674) ranks third, so dramas and comedies are the most common conventional genres. For our recommender, genre weighting matters: a tag as widespread as "International Movies" carries less discriminative power than a narrower one such as "Thrillers" or "Sci-Fi & Fantasy".
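
The loss of discriminative power can be made concrete with inverse document frequency. The sketch below applies the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 (the default in scikit-learn's TfidfVectorizer) to the three tag counts quoted above. It treats each tag as a single term, which is a simplification, and the fourth entry is a made-up count for a rare tag, included only for contrast.

```python
import math

N_TITLES = 8807  # total titles in the dataset

# Tag counts from the finding above, plus one hypothetical rare tag for contrast.
TAG_COUNTS = {
    "International Movies": 2752,
    "Dramas": 2427,
    "Comedies": 1674,
    "(hypothetical rare tag)": 100,  # assumed value, not taken from the dataset
}


def smoothed_idf(doc_freq, n_docs=N_TITLES):
    """Smoothed inverse document frequency: rarer terms receive larger weights."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


idf = {tag: smoothed_idf(count) for tag, count in TAG_COUNTS.items()}
for tag, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{tag:28s} idf = {weight:.3f}")
```

The most widespread tag receives the smallest weight, so TF-IDF already down-weights it without any manual genre weighting.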

Content Trend (2010–2021)

Netflix Content Growth (chart): Number of titles released each year (2010–2021)

Key finding: The number of titles per release year grows sharply from 2015 to 2018, peaking at 1,147 titles in 2018. The count dips in 2019 and again in 2020, then drops further in 2021 because the dataset captures only part of that year. COVID-19 production delays may partly explain the 2020 decline, but they cannot account for the 2019 dip, which predates the pandemic. For the recommender, titles released after 2015 dominate the corpus, so a title's nearest neighbors are more likely to be recent content simply because there is more of it.

Rating Analysis

Rating Distribution (chart): TV-MA and TV-14 dominate
Movies vs TV, Rating Comparison (chart): How ratings differ by content type

Key finding: TV-MA (mature audiences) is the most common rating at 3,207 titles (36.4%), followed by TV-14 at 2,160 (24.5%). This suggests the catalog skews toward adult audiences. For a personalized recommender, rating should be a filter option: family users need PG/TV-PG content, while mature audiences can explore the full catalog. Our system includes this as a live filter.
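
A rating filter of this kind can be applied as a post-processing step on the ranked results. The following is a minimal sketch, not the system's actual filter: the function name filter_by_rating, the membership of FAMILY_RATINGS, and the sample titles and scores are all assumptions made for illustration.

```python
# Assumed set of family-friendly rating labels; the project's real set may differ.
FAMILY_RATINGS = {"G", "PG", "TV-G", "TV-PG", "TV-Y", "TV-Y7"}


def filter_by_rating(ranked, ratings, allowed, top_n=5):
    """Keep only recommendations whose rating is in `allowed`.

    ranked:  list of (title, score) pairs, already sorted by descending score
    ratings: dict mapping title -> rating label; titles with no known rating are dropped
    """
    kept = [(title, score) for title, score in ranked if ratings.get(title) in allowed]
    return kept[:top_n]


# Hypothetical ranked output from the similarity step, with made-up ratings.
ranked = [("Title A", 0.91), ("Title B", 0.84), ("Title C", 0.77), ("Title D", 0.60)]
ratings = {"Title A": "TV-MA", "Title B": "PG", "Title C": "TV-PG"}

family_picks = filter_by_rating(ranked, ratings, FAMILY_RATINGS)
print(family_picks)
```

Filtering after ranking keeps the similarity matrix independent of the viewer, at the cost of sometimes returning fewer than top_n results.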

Data Quality Assessment

Missing Values Per Column (chart): Null counts across the 12 dataset features

Key finding: Director has the most missing values (~2,634 rows), followed by country (~831 rows) and cast (~825 rows). The TF-IDF vectorizer relies on description and listed_in (genres); both have near-zero nulls, making them the ideal features for our content-based engine. Missing director and cast values are replaced with empty strings before vectorization.
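
The null handling described above can be sketched with the standard csv module. This is an illustrative sketch under assumptions: the two sample rows are invented, only a subset of the 12 columns is shown, and the helper names (null_counts, combined_text) are hypothetical rather than taken from the project.

```python
import csv
import io

# Invented two-row sample using a subset of the dataset's column names.
SAMPLE_CSV = """title,director,cast,listed_in,description
Space Run,,Ana Lee,Sci-Fi & Fantasy,Astronauts fight to survive.
Love in Rome,J. Doe,,Romantic Movies,Two strangers fall in love.
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Normalise missing or whitespace-only cells to empty strings.
for row in rows:
    for column in row:
        row[column] = (row[column] or "").strip()

# Count empty cells per column (the data behind a missing-values chart).
null_counts = {column: sum(1 for row in rows if not row[column]) for column in rows[0]}


def combined_text(row):
    """Join the two text features fed to the vectorizer, skipping empty ones."""
    return " ".join(part for part in (row["description"], row["listed_in"]) if part)


print(null_counts)
print(combined_text(rows[0]))
```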

Live Recommendation Engine

Select any movie from the dataset to get recommendations powered by TF-IDF + cosine similarity. The engine analyzes plot descriptions and genre tags to find the most similar titles.
