
Movies & Shows Data Analysis

A pandas-based analysis of a dataset of movies and TV shows, covering cast information, genres, release years, and IMDb ratings.

Python, Pandas, Jupyter Notebook

Project Overview

Business Context: Understanding the streaming content landscape to inform content acquisition or production decisions.

This project demonstrates fundamental data analysis skills including data cleaning and standardization, exploratory data analysis (EDA), data filtering and manipulation with pandas, custom function development for reusable analysis, and IMDb rating categorization and insights.

What This Demonstrates

Learning Challenge

  • Pandas library fundamentals (filtering, sorting, grouping)
  • Working with mixed content types and missing data
  • Understanding streaming industry dynamics

Problem-Solving Process

  1. Tool Mastery: Systematically learned pandas operations through this project
  2. Data Quality: Handled missing data and standardized inconsistent column names
  3. Content Analysis: Examined genres, release patterns, and content distribution
  4. Insight Generation: Identified trends that could guide content strategy

Professional Outcome

  • Created analysis that a content team could use for programming decisions
  • Demonstrated systematic approach to learning a new technical library
  • Built foundation skills transferable to any data manipulation task

Tools Utilized

  • VS Code with GitHub Copilot for development
  • Jupyter Notebook for interactive analysis
  • Git/GitHub for version control

Dataset

The analysis uses the movies_and_shows.csv dataset, which contains comprehensive information about actors, characters, roles, titles, content types, release years, genres, and IMDb ratings and votes.

Key Features

Data Cleaning

Standardized inconsistent column names, converted mixed-case headers to lowercase with underscores, and replaced special characters for consistency.
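
The cleaning step described above could be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper name standardize_columns and the sample headers are hypothetical, and the exact set of special characters in the real dataset is not stated here.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lowercase, underscore-separated column names.

    Hypothetical helper; the notebook may clean its headers differently.
    """
    cleaned = (
        df.columns
        .str.strip()                                  # drop leading/trailing whitespace
        .str.lower()                                  # mixed case -> lowercase
        .str.replace(r"[^0-9a-z]+", "_", regex=True)  # spaces/special characters -> "_"
        .str.strip("_")                               # no leading/trailing underscores
    )
    return df.set_axis(cleaned, axis=1)


# Made-up messy headers, for illustration only
raw = pd.DataFrame(columns=["  Name", "Release Year", "IMDb-Score "])
print(list(standardize_columns(raw).columns))
```

Returning a new DataFrame instead of renaming in place keeps the raw headers available for comparison while the cleaning rules are being tuned.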

Custom Functions

Developed reusable functions like get_actors_for_title() to retrieve cast lists and categorize_imdb_score() to classify content quality.

Rating Categorization

Implemented a tiered rating system based on IMDb score: Excellent (≥9.0), Good (7.0-8.9), Average (5.0-6.9), and Low (<5.0).

Technical Highlights

get_actors_for_title() Function

Returns a comma-separated list of all actors for a given movie or show:

get_actors_for_title("Taxi Driver")
# Returns: "Robert De Niro, Jodie Foster, Harvey Keitel, ..."
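
One way such a function might be written is sketched below. Several details are assumptions: the column names name, role, and title, the "ACTOR" role value, and the explicit df parameter (the call above passes only the title, so the notebook's version presumably reads a DataFrame already in scope). The sample rows are a small hand-made stand-in for the real dataset.

```python
import pandas as pd


def get_actors_for_title(df: pd.DataFrame, title: str) -> str:
    """Return a comma-separated string of the actors credited on `title`.

    Assumes columns "name", "role", and "title"; these names are guesses.
    """
    is_title = df["title"].str.casefold() == title.casefold()  # case-insensitive match
    is_actor = df["role"].str.upper() == "ACTOR"               # assumed role label
    names = df.loc[is_title & is_actor, "name"].dropna().drop_duplicates()
    return ", ".join(names)


# Tiny stand-in for the dataset; "Placeholder Actor" and "Another Title" are made up
sample = pd.DataFrame({
    "name": ["Robert De Niro", "Jodie Foster", "Martin Scorsese", "Placeholder Actor"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR"],
    "title": ["Taxi Driver", "Taxi Driver", "Taxi Driver", "Another Title"],
})
cast = get_actors_for_title(sample, "taxi driver")
print(cast)
```

A title with no matching rows yields an empty string in this sketch; whether the original returns an empty string or a message in that case is not stated.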

categorize_imdb_score() Function

Categorizes movies/shows into quality tiers based on IMDb scores:

  • Excellent: IMDb score ≥ 9.0
  • Good: IMDb score 7.0 - 8.9
  • Average: IMDb score 5.0 - 6.9
  • Low: IMDb score < 5.0
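
The tiers above map naturally onto a chain of threshold checks. The sketch below is one possible implementation, not necessarily the notebook's. The "Unrated" label for missing scores is an assumption added for illustration, since the tier list does not say how missing ratings are handled.

```python
import math
from typing import Optional

import pandas as pd


def categorize_imdb_score(score: Optional[float]) -> str:
    """Map an IMDb score to the quality tier listed above."""
    if score is None or math.isnan(score):
        return "Unrated"  # assumption: label for missing scores, not from the source
    if score >= 9.0:
        return "Excellent"
    if score >= 7.0:
        return "Good"
    if score >= 5.0:
        return "Average"
    return "Low"


# Made-up scores chosen to sit on either side of each threshold
scores = pd.Series([9.0, 8.9, 7.0, 6.9, 4.9, float("nan")])
tiers = scores.apply(categorize_imdb_score)
print(list(tiers))
```

Checking the thresholds from highest to lowest means each tier only needs a lower bound, so a score such as 8.95 still falls into exactly one category.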

Skills Demonstrated

  • Pandas DataFrame manipulation and filtering
  • Data cleaning and standardization techniques
  • Custom function development for code reusability
  • Exploratory data analysis (EDA)
  • String manipulation and text processing
  • Conditional logic and data categorization
  • Jupyter Notebook documentation

Technologies Used

Python 3.x, Pandas, Jupyter Notebook