Can Computers Have a Favorite Actor?
Using Linear Regression to predict Movie Domestic Gross
Have you ever watched a trailer and seen an actor or actress who made you want the movie even more? What about seeing your favorite director on the roster for a film? For my second project at Metis, I tried to quantify that feeling to predict the success of a movie.
Data for this project was scraped from BoxOfficeMojo. I focused on domestic film rankings and gathered links to every movie on BoxOfficeMojo’s yearly tables from 2000 to 2019. On an individual movie’s landing page, I used BeautifulSoup to scrape the basic details. To scrape the cast and crew information, I needed Selenium to click through the page and expand those sections. To filter down the data, I only scraped a page if the movie’s opening theater count was over 1,500 theaters.
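As a rough sketch of the link-gathering step — the `title` CSS class and table markup here are simplified stand-ins for BoxOfficeMojo’s actual markup, and the Selenium click-through is omitted:

```python
from bs4 import BeautifulSoup

def extract_movie_links(html):
    """Pull movie links out of a yearly-table page.

    The "title" class below is a simplified stand-in; the live
    site's markup is more complex.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("td.title a")]

sample = """
<table>
  <tr><td class="title"><a href="/release/rl1/">Movie A</a></td></tr>
  <tr><td class="title"><a href="/release/rl2/">Movie B</a></td></tr>
</table>
"""
print(extract_movie_links(sample))  # ['/release/rl1/', '/release/rl2/']
```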
I wanted to capture a person’s past experience using the data I had scraped, focusing in particular on members of the production team. Each production-team column had to be exploded to separate the names, then grouped by name and release date. From there, I calculated the cumulative maximum and cumulative average of either the budget or the domestic gross, depending on the category of the production team. For example, when looking at a movie’s producers, I am more interested in their past experience with certain budgets. A distribution company, on the other hand, does not actually spend any of the reported budget on advertising or marketing, so there I am more interested in the company’s past domestic gross.
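The explode-group-accumulate steps above can be sketched in pandas on a toy table (column names and values here are illustrative, not the project’s actual schema):

```python
import pandas as pd

# Toy stand-in for the scraped table; names and numbers are made up.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": pd.to_datetime(
        ["2001-01-01", "2003-06-01", "2005-01-01", "2008-01-01"]),
    "producers": [["Smith"], ["Smith", "Lee"], ["Lee"], ["Smith"]],
    "budget": [10, 50, 30, 80],
})

# One row per (movie, producer), ordered so "past" means earlier releases.
exploded = (df.explode("producers")
              .sort_values("release_date")
              .reset_index(drop=True))

# Shift by one so each movie only sees that producer's *prior* films,
# then take the running max and running mean of the budget.
grp = exploded.groupby("producers")["budget"]
exploded["prior_max_budget"] = grp.transform(lambda s: s.shift().cummax())
exploded["prior_avg_budget"] = grp.transform(
    lambda s: s.shift().expanding().mean())

print(exploded[["title", "producers",
                "prior_max_budget", "prior_avg_budget"]])
```

The `shift()` before the cumulative operation is what keeps a movie from “seeing” its own budget in its history features.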
A quick visualization surfaces some interesting interpretations of what the numbers have to say about big-ticket actors. One of the features I used in the model was the maximum domestic gross of the prior movies each actor had worked on.
Does the fact that an actor has worked on a high-grossing film in the past influence the gross of the movie they currently work on? If the answer were yes, we would see clusters along the diagonal of the graph above. Instead, success is not so directly correlated: most of the data clusters around actors whose newer films gross about half of what their best past films earned.
A similar pattern appears when looking at the budgets producers are experienced working with. What budget are producers comfortable using? Do they need a large budget to create a high-grossing film?
A few interpretations can be made from the graph above:
- Low-budget producers tend to continue to work on low-budget films
- Producers experienced with larger budgets work on films that go on to earn higher domestic grosses
- Large distribution companies tend to back films with larger budgets
These interpretations show that the features I engineered reflect intuition: a producer’s past experience does line up with how I would expect their future work to perform.
To train my model, I used linear regression with a time series split. Since a majority of my features relied on past data, I wanted to train the model to predict future quantities. The time series split also let me create a holdout group to evaluate my final model on.
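A minimal sketch of how scikit-learn’s `TimeSeriesSplit` keeps training data strictly before validation data, assuming rows are sorted by release date (the matrix here is a synthetic stand-in):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in feature matrix; in the real data, rows are sorted by release date.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every fold trains strictly on the past and validates on what follows,
    # so no feature built from history leaks future information.
    assert train_idx.max() < test_idx.min()
```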
Using statsmodels, I created an OLS model as my baseline. Looking at the p-values of the coefficients, I decided to filter out any feature with a p-value higher than 0.8. Interestingly, most of my genre dummy features fell into that category. There may have been some collinearity between directors or producers and the style of movie they make.
To filter down more coefficients, I passed a second-order polynomial expansion of the features through LassoCV. This increased both my R² and my root mean squared error (RMSE) on the train set. I opted for the lower RMSE, with the trade-off of a slightly lower R².
I also passed the original train set into RidgeCV to see if the balance of important features would change anything. It yielded a similar result to LassoCV.
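Both regularized variants can be sketched as scikit-learn pipelines; the data here is a synthetic stand-in for the engineered history features:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in: two informative features, two irrelevant ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.0, 0.0]) + rng.normal(size=300)

# Second-order polynomial expansion, scaled, then regularized regression.
lasso = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      StandardScaler(),
                      LassoCV(cv=5))
ridge = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      StandardScaler(),
                      RidgeCV())
lasso.fit(X, y)
ridge.fit(X, y)
```

Lasso’s L1 penalty zeroes out weak coefficients outright, which is why it helps filter features; Ridge’s L2 penalty only shrinks them toward zero.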
Checking the residuals, I found that the majority of my large residuals came from movies that broke the records for their respective categories. These included films like The Avengers, The Dark Knight, The Sixth Sense, and The Lord of the Rings: The Return of the King.
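The residual check can be sketched like this — the titles and numbers are made up; the point is sorting by absolute residual to surface the biggest misses:

```python
import pandas as pd

# Made-up predictions: "Movie B" plays the record-breaker the model misses.
results = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "actual": [100.0, 600.0, 120.0],
    "predicted": [110.0, 350.0, 115.0],
})
results["residual"] = results["actual"] - results["predicted"]

# Sort by absolute residual to surface the biggest misses first.
order = results["residual"].abs().sort_values(ascending=False).index
worst = results.loc[order]
print(worst["title"].iloc[0])  # Movie B
```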
RidgeCV and LassoCV had similar residual distributions, but both tended to under-predict more movies.
Overall I was content with the results that the linear model presented. While not perfect, the model was able to generalize the relationships between production team history and their respective gross. It was still a good experience to learn the intricacies of how linear regression works and how Lasso and Ridge provide regularization.
The model itself is better at interpreting and describing features that make successful movies but not great at predicting future success. As an example, the model knew that blockbusters would make more money than others but was not able to accurately predict their true domestic earnings.
Future steps for this project would be to scrape more accurate earnings data to find a better metric for describing experience of people in the film industry.
Linear regression would not be my first choice for capturing the intricacies behind a movie’s success. A lot of feature engineering went into capturing the effect of individual members of a production team. Other methods I may come back and apply include tree-based algorithms (Random Forest or XGBoost) to classify and predict successful movies.
Any code I used can be accessed through my GitHub repo linked here.