Webscraping for Filmography Recommendations

This blog post was created for my PIC16B class at UCLA (Previously titled: Blog Post 2 - Webscraping for Filmography Recommendations).

In the age of Netflix, Amazon Prime, and Disney+, it seems as though the question on the tip of everyone’s tongue is “What should I binge next?” Wouldn’t it be lovely to find new movie or TV show recommendations based on your preferences? Never fear: in this post I will outline how we can achieve exactly that by using a webscraper to extract filmography recommendations.

Specifically, in this post I will be describing how my webscraper extracts the names of movies and TV shows that share actors with my favorite film, “Pride & Prejudice (2005)” (the IMDB page for this film is displayed below). I will be recommending new movies and TV shows based on the idea that if “Pride & Prejudice (2005)” shares actors with another movie or TV show (let this filmography be X), then it is likely that I would also enjoy filmography X.

pride-prejudice-imdb-banner.png

Part 1: The Webscraper

In simple terms, webscraping refers to the act of extracting data from a selected website. The information is then exported into a format that is useful to the user (such as .csv or .txt files). While it is true that data can be scraped from the web through manual means (e.g., one rudimentary method of scraping data is the simple copy-and-paste method), this can often be a slow and tedious process. Automating the webscraping process aggregates information at a much faster rate and also presents it in an easily interpretable manner.

The first step in the webscraping process is to ensure that we have successfully loaded scrapy, the package that will allow us to scrape data from the web. In my case, since I use anaconda, this requires me to activate the conda environment with the loaded scrapy package. The next step is to start a new project by typing the following lines of code into the command line. Note that I have enclosed the environment name and project name within square brackets because they are placeholders (beware: you should remove the square brackets when you actually type in the environment and project names of your choice).

conda activate [environment name]
scrapy startproject [project name]

In my case, my conda environment name is PIC16B and I have decided to name this project IMDB_scraper. So the code I type in will be as follows:

conda activate PIC16B
scrapy startproject IMDB_scraper

Now that we have created our project, we must remember to work in our newly created project directory! We can move into this directory by entering the following in the command line. Recall that my project name is “IMDB_scraper” (note that you will need to replace “IMDB_scraper” with your chosen project name).

cd IMDB_scraper

You might have noticed that your new project directory contains a number of files and/or folders. However, the ones we need to worry about are the settings.py file and spiders folder. Before writing our scraper we should add the following line of code to our settings.py file.

CLOSESPIDER_PAGECOUNT = 20

The reason for this is simple: while we are still writing, testing, and debugging our webscraper, we do not want to risk downloading too much data! This way, we limit the amount of data we download while ensuring our webscraper is working correctly. When we are sure that the webscraper is working, we can head back to our settings.py file and comment out the above line.

Now that we have dealt with the little details, we can finally navigate to our spiders folder and create our main .py file that will enable us to scrape data from the required IMDB pages. Within this .py file we will be writing our Spider class which inherits from scrapy.Spider and defines how a given website will be scraped. I’ve decided to call my main .py file imdb_spider.py and my Spider class IMDB_spider which has the following boilerplate code.

# Importing the scrapy package
import scrapy

class IMDB_spider(scrapy.Spider):
    
    name = "imdb_spider"
    
    # The following URL is for the Pride & Prejudice (2005)
    # IMDB page. Note that it may be replaced with any other
    # IMDB url
    start_urls = [
        "https://www.imdb.com/title/tt0414387/"
    ]

Within this IMDB_spider class, I’ve implemented three different parsing methods: parse which parses and then navigates from the main IMDB filmography page to the “Cast & crew” page, parse_full_credits which parses the “Full Cast & Crew” page and navigates to each respective actor’s page, and parse_actor_page which parses each individual cast member’s page and yields their respective films and TV shows in a dictionary. The three methods are described in greater detail below.

The parse(self, response) method parses the main IMDB page for “Pride & Prejudice (2005)” (link: https://www.imdb.com/title/tt0414387/) and navigates to the “Cast & crew” page. Note that the url for the “Cast & crew” page has the format <original_url>fullcredits/, hence we have the following:

def parse(self, response):
    """
    Parses the main "Pride & Prejudice (2005)" IMDB page and 
    navigates to the Cast & crew page
    """
    cast_crew_url = response.url + "fullcredits"
    # Navigating to the cast & crew page using scrapy.Request
    # Once navigated to the above url, the parse_full_credits method
    # will be implemented
    yield scrapy.Request(cast_crew_url, callback=self.parse_full_credits)

The parse_full_credits(self, response) method parses the Series Cast section (url: https://www.imdb.com/title/tt0414387/fullcredits) and navigates to the individual actor pages. By using the developer tools embedded in Chrome, we note that the cast member information is located within the following HTML element: <table class="cast_list">, as displayed in the following image. Hence, we can extract the hyperlink information using the handy .attrib attribute. However, note that each cast member’s row contains more than one link (e.g., one on the actor’s photo and one on the actor’s name); we only want a single, unique link to each actor’s page.

pride-prejudice-cast-crew.png

The cast_links variable in the parse_full_credits method below stores the url segments that follow https://www.imdb.com. Each cast member url has the format https://www.imdb.com/name/*, so we need only keep the url segments that begin with /name (deduplicating them with set along the way). We may then loop through the list of url segments, use the response.urljoin function to join the two url segments, and navigate to the respective actor pages, calling the parse_actor_page method on each.

def parse_full_credits(self, response):
    """
    Parses the Series Cast Page for "Pride & Prejudice (2005)" 
    and navigates to each actor's page
    """
    # Extracting all the links within the cast table
    cast_links = [actor.attrib["href"] for actor in response.css("table.cast_list a")]
    # Keeping one unique link per actor (the links that begin with /name)
    cast_imgs_links = [link for link in set(cast_links) if link.startswith("/name")]
    
    for cast_link in cast_imgs_links:
        url = response.urljoin(cast_link)
        yield scrapy.Request(url, callback=self.parse_actor_page)

The parse_actor_page(self, response) method parses through the actor page and yields a dictionary with key-value pairs in the following format:

{
    "actor_name" : actor_name,
    "movie_or_TV_name" : movie_or_TV_name
}

pride-prejudice-actor-name.png

After navigating to the actor page (e.g., Keira Knightley’s IMDB page, shown in the above image, has the following url: https://www.imdb.com/name/nm0461136/) we note that the actor name is stored within the span.itemprop element, making it possible for us to extract the actor name information. Also note that the titles in the actor’s filmography section are located within <b> tags inside <div class="filmo-category-section"> elements. Therefore, we may define the parse_actor_page method as follows:

def parse_actor_page(self, response):
    """
    Parses the actor page and stores the shared actor
    information within a dictionary
    """
    # extracting the actor name: the raw HTML has the form
    # <span ...>Actor Name</span>, so splitting on the angle brackets
    # isolates the name text
    actor_name = response.css("span.itemprop").get().split("<")[-2].split(">")[-1]
    # all filmography information
    all_films = response.css("div.filmo-category-section b").getall()
    
    # looping through all the filmography information and extracting the 
    # filmography names
    for film in all_films:
        movie_or_TV_name = film.split("<")[-3].split(">")[-1]
        
        # Yielding information in a key-value pair dictionary
        yield {
            "actor_name" : actor_name,
            "movie_or_TV_name" : movie_or_TV_name
        }
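
As an aside, the string splitting above is tied to the exact shape of the raw HTML. scrapy’s ::text pseudo-selector can pull out the same text more directly; here is a minimal sketch, assuming the same span.itemprop and div.filmo-category-section markup as above:

# Sketch: equivalent extraction using scrapy's ::text pseudo-selector
actor_name = response.css("span.itemprop::text").get()
movie_or_TV_names = response.css("div.filmo-category-section b a::text").getall()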

Now that we have successfully defined the Spider class, we are ready to test out our webscraper! To do so, we run the following command, which creates a .csv file with the actor names and their respective filmography.

scrapy crawl imdb_spider -o movies.csv

Once we have checked the .csv file and made sure that our webscraper works correctly, recall that we must navigate to our settings.py file and comment out the restriction we placed earlier, CLOSESPIDER_PAGECOUNT = 20. Now, we re-run the same command, but this time we store all the shared-actor information within a new .csv file (note that I’ve called my new .csv file results.csv).

scrapy crawl imdb_spider -o results.csv

The full documentation for my webscraper function can be found on the following github page: https://github.com/mansakrishna23/filmography-scraper

Part 2: Deducing Filmography Recommendations

Our work doesn’t end even though we have successfully scraped our data! Remember that we still need to parse through and analyze the data we have aggregated and make filmography recommendations! To do so, we may read in our .csv file using Pandas (don’t forget to import useful Python packages for your analysis)!

import pandas as pd
import numpy as np
films = pd.read_csv('IMDB_scraper/results.csv')

Remember that it is always good practice to check the data set you have imported!

# Checking the data set
films.head()
   actor_name        movie_or_TV_name
0  Sylvester Morand  Agora
1  Sylvester Morand  The Seeker: The Dark Is Rising
2  Sylvester Morand  Ancient Rome: The Rise and Fall of an Empire
3  Sylvester Morand  The Brief
4  Sylvester Morand  Pride &amp; Prejudice

Since we would like to determine the number of shared actors for each movie and TV show, we need to count the number of times each movie or TV show name is repeated within the films dataset. Why is it possible for us to determine the number of shared actors this way? Consider a film X. Suppose unique actors A and B both worked on film X; then film X has two shared actors, and film X must also appear twice within our films data set. Therefore, we may safely assume that the number of shared actors for each title corresponds to the frequency of that movie/TV show in the above films dataset!

# Storing the unique filmography in an array as strings
unique_films = films['movie_or_TV_name'].unique()
unique_films = np.array(unique_films).astype(str)
unique_films # checking the array
array(['Agora', 'The Seeker: The Dark Is Rising',
       'Ancient Rome: The Rise and Fall of an Empire', ...,
       'VM Show Vol. 2', '41 premis Sant Jordi de cinematografia',
       'Mike Leigh on Secrets and Lies'], dtype='<U86')
# initializing the array to zeros
freq = np.zeros(len(unique_films)) 
# all filmography
all_films = np.array(films['movie_or_TV_name']).astype(str)
# looping through the unique filmography
for n, film_name in enumerate(unique_films): 
    for film in all_films: 
        if film == film_name:
            # incrementing the number of shared
            # actors 
            freq[n] += 1
# we will now check the frequency of each movie/TV show
# which will give us the number of shared actors
freq
array([1., 1., 2., ..., 1., 1., 1.])
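
As an aside, pandas can compute these frequencies directly; here is a minimal sketch using value_counts on the same films data frame:

# value_counts tallies how often each title appears, i.e. the number
# of shared actors per movie/TV show (sorted from most to least frequent)
shared_counts = films['movie_or_TV_name'].value_counts()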

Now that we have successfully determined the number of shared actors per filmography, let us create a new dataframe using Pandas!

df = pd.DataFrame()
df['movie/TV show'] = unique_films
df['number of shared actors'] = freq
df
      movie/TV show                                  number of shared actors
0     Agora                                          1.0
1     The Seeker: The Dark Is Rising                 1.0
2     Ancient Rome: The Rise and Fall of an Empire   2.0
3     The Brief                                      2.0
4     Pride &amp; Prejudice                          37.0
...   ...                                            ...
2634  54th Golden Globe Awards                       1.0
2635  The Smith &amp; Jones Sketchbook               1.0
2636  VM Show Vol. 2                                 1.0
2637  41 premis Sant Jordi de cinematografia         1.0
2638  Mike Leigh on Secrets and Lies                 1.0

2639 rows × 2 columns

Notice that our fourth film is “Pride & Prejudice (2005)”. We can safely remove this from the dataset since we are recommending other movies and TV shows to “Pride & Prejudice (2005)” enthusiasts.
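
One simple way to drop this row (a minimal sketch, assuming the row still sits at index 4, as in the table above):

# Dropping the "Pride & Prejudice" row by its index label
# (assumption: it sits at index 4, as displayed above)
df = df.drop(index=4)
df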

      movie/TV show                                  number of shared actors
0     Agora                                          1.0
1     The Seeker: The Dark Is Rising                 1.0
2     Ancient Rome: The Rise and Fall of an Empire   2.0
3     The Brief                                      2.0
5     Fingersmith                                    1.0
...   ...                                            ...
2634  54th Golden Globe Awards                       1.0
2635  The Smith &amp; Jones Sketchbook               1.0
2636  VM Show Vol. 2                                 1.0
2637  41 premis Sant Jordi de cinematografia         1.0
2638  Mike Leigh on Secrets and Lies                 1.0

2638 rows × 2 columns

Suppose we only want the top ten movie/TV show recommendations! We would need to choose only those movies/TV shows with the highest number of shared actors. So, let’s sort the movies and TV shows accordingly! (Note that freq and unique_films still contain “Pride & Prejudice (2005)” itself, so it reappears at the very top of the sorted list and must be dropped once more after sorting, which is why the data frame below begins at index 1 and contains 2638 rows.)

# Let's sort the movies and TV shows according to the number of shared actors (greatest to lowest)
shared_actors = sorted(zip(freq, unique_films), reverse=True)

# Extracting the film names
filmography = np.array([])
counts = np.array([])
for x, y in shared_actors:
    filmography = np.append(filmography, y)
    counts = np.append(counts, x)
df_sorted = pd.DataFrame()
df_sorted['movie/TV show'] = filmography
df_sorted['number of shared actors'] = counts
df_sorted
      movie/TV show                                       number of shared actors
1     The Bill                                            13.0
2     Pride and Prejudice Revisited                       13.0
3     Pride &amp; Prejudice: The Life and Times of J...   13.0
4     Breakfast                                           13.0
5     Made in Hollywood                                   11.0
...   ...                                                 ...
2634  007 Legends                                         1.0
2635  'NYPD Blue': A Final Tribute                        1.0
2636  'Donnie Darko': Production Diary                    1.0
2637  'Calendar Girls': The Naked Truth                   1.0
2638  'Calendar Girls': Creating the Calendar             1.0

2638 rows × 2 columns
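
As an aside, pandas can produce the same ordering in a single step; here is a minimal sketch, assuming the deduplicated df from the previous step (with the “Pride & Prejudice” row already removed). The ordering matches df_sorted above, though the index labels will differ:

# Sorting by the shared-actor count, highest first (a hypothetical
# alternative to the zip/sorted approach above)
df_sorted_alt = df.sort_values('number of shared actors', ascending=False).reset_index(drop=True)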

Now, let’s recommend the first 10 movies in the above data frame!!

recommendations = df_sorted[0:10]
recommendations
    movie/TV show                                       number of shared actors
1   The Bill                                            13.0
2   Pride and Prejudice Revisited                       13.0
3   Pride &amp; Prejudice: The Life and Times of J...   13.0
4   Breakfast                                           13.0
5   Made in Hollywood                                   11.0
6   Entertainment Tonight                               11.0
7   Doctor Who                                          11.0
8   The Graham Norton Show                              10.0
9   Pride &amp; Prejudice: The Bennets                  10.0
10  Pride &amp; Prejudice: On Set Diaries               10.0

Congratulations! We’ve successfully extracted our top ten recommendations! But wait, notice that the scraper did not decode the ampersand symbol correctly (it appears as the HTML entity &amp;)! As always, we may need to modify the strings a little to improve the readability of our suggestions!

recommendations = recommendations.replace("Pride &amp; Prejudice: The Bennets", "Pride & Prejudice: The Bennets")
recommendations = recommendations.replace("Pride &amp; Prejudice: The Life and Times of Jane Austen", 
                                          "Pride & Prejudice: The Life and Times of Jane Austen")
recommendations = recommendations.replace("Pride &amp; Prejudice: On Set Diaries", "Pride & Prejudice: On Set Diaries")
recommendations
    movie/TV show                                        number of shared actors
1   The Bill                                             13.0
2   Pride and Prejudice Revisited                        13.0
3   Pride & Prejudice: The Life and Times of Jane ...    13.0
4   Breakfast                                            13.0
5   Made in Hollywood                                    11.0
6   Entertainment Tonight                                11.0
7   Doctor Who                                           11.0
8   The Graham Norton Show                               10.0
9   Pride & Prejudice: The Bennets                       10.0
10  Pride & Prejudice: On Set Diaries                    10.0
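
As an aside, rather than fixing each title by hand, Python’s built-in html module can decode every HTML entity in the column at once; here is a minimal sketch, assuming the same recommendations data frame:

import html

# html.unescape turns HTML entities such as "&amp;" back into "&"
recommendations['movie/TV show'] = recommendations['movie/TV show'].map(html.unescape)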

In the end, we have successfully extracted recommendations for “Pride & Prejudice (2005)” enthusiasts via webscraping!

As far as this particular project was concerned, undoubtedly the most tedious part was writing the webscraper and identifying the correct HTML/CSS code that would allow us to extract the appropriate data! For this, I found it particularly helpful to navigate through the websites in advance (manually) and explore the developer code beforehand. A lesson learned was to scrape only the required data; in other words, I had to identify the information I needed and plan how I was going to store it in advance. It would have been really inefficient if I were to simply scrape the entire site!

Written on October 24, 2021