🗄 Netflix Data API & Scraping

Unofficial 👉 Official Website: https://www.netflix.com | 👥 Contributors: steve

Netflix Catalog Data

You can get Netflix catalog data through Guidebox, a third-party service that offers an official API of its own.

Unofficial API

These unofficial Netflix API endpoints are useful for scraping public data, such as the movie and TV show titles available in the catalog. They are documented for unofficial use and experimentation, such as security research. Use at your own risk!

Authentication

You’ll need the HTTP cookie from a logged-in Netflix session on your iOS device. To capture it, launch an intercepting proxy, route your iOS device’s traffic through it, and then open Netflix; the proxy will show the cookie in the request headers.
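As a minimal sketch of what to do with the captured cookie, the snippet below attaches it to a request as a `Cookie` header using Python's standard library. The URL and cookie string are placeholders invented for illustration (they are not real Netflix values), and `build_request` is a hypothetical helper name. The request is only constructed here, not sent.

```python
import urllib.request

# Both values are placeholders for illustration, not real Netflix values.
PLACEHOLDER_URL = "https://example.com/catalog"      # hypothetical endpoint
PLACEHOLDER_COOKIE = "SessionId=abc123; Other=xyz"   # hypothetical cookie string


def build_request(url, cookie):
    """Return a Request that carries the captured cookie in its Cookie header."""
    return urllib.request.Request(url, headers={"Cookie": cookie})


request = build_request(PLACEHOLDER_URL, PLACEHOLDER_COOKIE)

# Inspect the header that would be sent with the request.
print(request.get_header("Cookie"))
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out because the real endpoint is not given here.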

Official API

Netflix used to offer a public API, which it shut down in 2014.

Data Visualization

Let’s walk through how to visualize the Netflix catalog in Python!

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

movies_df = pd.read_csv('~/Desktop/all_netflix_movies.csv')
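Since the rest of the walkthrough depends on the `results.release_year` column, it can be worth checking the loaded frame first. The sketch below uses a tiny in-memory CSV as a stand-in for the real file; the `results.title` column and all row values are made up for illustration.

```python
import io

import pandas as pd

# Toy stand-in for all_netflix_movies.csv. Only the results.release_year
# column name comes from the walkthrough; everything else is invented.
toy_csv = io.StringIO(
    "results.title,results.release_year\n"
    "Example A,2018\n"
    "Example B,1999\n"
    "Example C,\n"
)
toy_movies = pd.read_csv(toy_csv)

# Fail early if the column the later steps rely on is missing.
if 'results.release_year' not in toy_movies.columns:
    raise KeyError('results.release_year column not found')

# A blank year is read as NaN, which turns the whole column into floats.
missing_years = int(toy_movies['results.release_year'].isna().sum())
print(missing_years, toy_movies['results.release_year'].dtype)
```

Rows with a missing year do not break the later comparisons, but they will not land in the pre-2005 bucket either, so it helps to know how many there are.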

Now let’s see how the release year is distributed:

movies_df['results.release_year'] \
    .value_counts() \
    .sort_index(ascending=False) \
    .plot(kind='bar', figsize=(20,5))
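To see what the chain above passes to `.plot`, here is a small sketch on a made-up Series of years. `value_counts` tallies each year, and `sort_index(ascending=False)` orders the tallies from newest year to oldest.

```python
import pandas as pd

# Made-up release years, for illustration only.
toy_years = pd.Series([2019, 2018, 2019, 2001, 2018, 2019])

# Count occurrences per year, then order by year (the index), newest first.
counts = toy_years.value_counts().sort_index(ascending=False)
print(counts)
```

Each index entry becomes one bar and each count becomes that bar's height.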

This produces the following bar chart:

Netflix Videos by Release Year

It looks like most of the movies are relatively recent, which is nice. Let’s change this to a pie chart and focus primarily on the movies since 2005.

We’ll first add a display_year column to the data frame. It holds the release year for movies from 2005 or later, and a generic pre_2005 label that groups all the older movies together.

movies_df['display_year'] = np.where(
    movies_df['results.release_year'] < 2005,
    'pre_2005',
    movies_df['results.release_year']
)
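The bucketing can be checked on a few made-up years. This sketch is a slight variation on the code above: it casts the years to strings first, an assumption made here so that every label in the result has the same type.

```python
import numpy as np
import pandas as pd

# Made-up release years, for illustration only.
toy_years = pd.Series([1994, 2005, 2019, 2003])

# Years before 2005 collapse into one label; the rest keep their own year.
# Casting to str (an addition in this sketch) keeps all labels as strings.
display_year = np.where(toy_years < 2005, 'pre_2005', toy_years.astype(str))
print(list(display_year))
```

Note that 2005 itself keeps its own label, because the comparison is strictly less than 2005.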

Now we have a reasonable number of categories we can plot:

movies_df['display_year'] \
    .value_counts() \
    .sort_index(ascending=False) \
    .plot(kind='pie', figsize=(5,5))

And the plot looks a little like this:

Netflix Plots

Now let’s repeat this process for Netflix shows!

shows_df = pd.read_csv('~/Desktop/all_netflix_shows.csv')

shows_df['release_datetime'] = pd.to_datetime(
    shows_df['results.first_aired'],
    errors='coerce',
)

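# Filter first, then copy, so the new column is added to an independent frame
# rather than to a slice of shows_df.
shows_with_release_dates = shows_df[shows_df['release_datetime'].notnull()].copy()
shows_with_release_dates['release_year'] = shows_with_release_dates['release_datetime'].dt.year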
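As a quick illustration of the parsing step, the sketch below runs the same `pd.to_datetime(..., errors='coerce')` call on a made-up Series. Values that cannot be parsed become `NaT`, so they can be filtered out before the year is extracted.

```python
import pandas as pd

# Made-up first-aired values: two valid dates, one junk string, one missing.
toy_first_aired = pd.Series(['2015-04-10', 'unknown', None, '2019-11-01'])

# errors='coerce' turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(toy_first_aired, errors='coerce')

# Keep only the rows that parsed, then pull out the year.
known = parsed[parsed.notnull()]
years = known.dt.year
print(years.tolist())
```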

shows_with_release_dates now holds only the shows with a known release year. We can then plot them:

shows_with_release_dates['release_year'] \
    .value_counts() \
    .sort_index(ascending=False) \
    .plot(kind='bar', figsize=(20,5))

The distribution looks similar to the movies:

Netflix Show Distribution by Release Year

And we can generate the pie chart as well:

shows_with_release_dates['display_year'] = np.where(
    shows_with_release_dates['release_year'] < 2005,
    'pre_2005',
    shows_with_release_dates['release_year']
)

shows_with_release_dates['display_year'] \
    .value_counts() \
    .sort_index(ascending=False) \
    .plot(kind='pie', figsize=(5,5))

Netflix Shows by Release Year

Now let’s add some flair to our results. Let’s first combine the bar charts.

shows_with_release_dates['source'] = 'show'
movies_df['source'] = 'movie'

movies_df['release_year'] = movies_df['results.release_year']

combined = pd.concat([shows_with_release_dates, movies_df], sort=False)

# remove bad years
combined = combined[combined['release_year'] > 1000]

And now we can plot shows and movies by release year in a single chart!

combined \
    .groupby('source')['release_year'] \
    .value_counts() \
    .unstack() \
    .transpose() \
    .sort_index(ascending=False) \
    .plot(kind='bar', stacked=True, figsize=(20,5))

And we get:

Netflix titles by Release Year
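To make the reshaping behind the stacked chart concrete, here is a sketch of the same groupby, value_counts, unstack, and transpose chain on a made-up frame. The `fill_value=0` argument is an addition in this sketch, so that a year with no titles from one source shows 0 instead of NaN.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy_combined = pd.DataFrame({
    'source': ['movie', 'movie', 'show', 'movie', 'show'],
    'release_year': [2019, 2019, 2019, 2018, 2017],
})

table = (
    toy_combined
    .groupby('source')['release_year']
    .value_counts()            # counts per (source, release_year) pair
    .unstack(fill_value=0)     # years become columns, one row per source
    .transpose()               # flip: one row per year, one column per source
    .sort_index(ascending=False)
)
print(table)
```

Each row of `table` is one bar on the x-axis, and each column is one stacked segment, which is the layout `plot(kind='bar', stacked=True)` expects.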

Style our charts!

fig = plt.figure(tight_layout=True, figsize=(1, 1), dpi=200)
gs = fig.add_gridspec(2, 2)

ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])

year_data_source = combined \
    .groupby('source')['release_year'] \
    .value_counts() \
    .unstack() \
    .transpose() \
    .sort_index(ascending=False)

year_chart = year_data_source \
    .plot(
        ax=ax1,
        kind='bar',
        stacked=True,
        figsize=(20,5),
        color=['#eeeeee', '#e50914'],
        title='Netflix Shows & Movies by Release Year',
    )

year_chart.set_xlabel('Release Year')
year_chart.set_ylabel('Count')
year_chart.legend(['Movie', 'Show'])

# Show only every fifth x-axis label so the years stay readable.
for i, label in enumerate(year_chart.xaxis.get_ticklabels()):
    if i % 5 != 0:
        label.set_visible(False)

movie_pie = movies_df['display_year'] \
    .value_counts() \
    .sort_index(ascending=False) \
    .plot(ax=ax2, kind='pie', figsize=(5,5), radius=1)
movie_pie.set_ylabel('')
movie_pie.set_xlabel('Movies')

show_pie = shows_with_release_dates['display_year'] \
    .value_counts() \
    .sort_index(ascending=False) \
    .plot(ax=ax3, kind='pie', figsize=(5,5), radius=1)
show_pie.set_ylabel('')
show_pie.set_xlabel('Shows')

fig.savefig(os.path.expanduser('~/Desktop/netflix_catalog.png'))
Disclaimer: These URLs are not part of an official API endorsed by netflix.com and are documented here only for informational purposes. They were obtained through use of software or services made publicly available by netflix.com. Use of these URLs may breach the terms of service governing netflix.com.