The Movie Database Analysis

Deniz Can Yıldırım
5 min read · Dec 19, 2020


In the evening, you plan to watch a movie or TV show with your friends, family, wife/husband, or just by yourself. How do you decide? If you are like me, you start by searching the internet for a good one. IMDb and TMDB are websites that give us a lot of insight into movies. They provide useful information such as the average user score, the number of user votes, popularity, and so on. Thanks to these, we have a chance to spend our time in a much more enjoyable way :)

In my first blog post, I would like to analyze a movie data-set by exploring some basic questions and then try to predict the popularity of movies. Here is the outline:

  1. Knowing Data / TMDB 5000 Movie Data-set
  2. How is the distribution of movies according to genres?
  3. What are the leading production companies?
  4. How are the average votes distributed among movies?
  5. Top 10 List of Movies
  6. Predicting Popularity of Movies
  7. Conclusion

1. Knowing Data / TMDB 5000 Movie Data-set

First of all, I obtained this data-set from Kaggle; it consists of two files (‘tmdb_5000_credits.csv’ and ‘tmdb_5000_movies.csv’). I use ‘tmdb_5000_movies.csv’, a public data-set with 4803 rows and 20 columns. Before answering the questions, let us look into the data-set and get some basic information about the movies.

Budget, Genres, Homepage, Id, Keywords, Original Language, Original Title, Overview, Popularity, Production Companies, Production Countries, Release Date, Revenue, Runtime, Spoken Languages, Status, Tagline, Title, Vote Average, Vote Count

Figure 01; TMDB Data-set summary

2. How is the distribution of movies according to genres?

In Figure 02, we see the distribution of movie genres. In this data-set, dramas have the highest share at ~20%. They are followed by comedy, thriller, and action movies. On the other hand, documentary, music, history, and animation movies have the lowest shares, below 2.5%.

Figure 02; Distribution of movies according to genres
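A distribution like this can be computed with a few lines of Python. The sketch below is only an illustration, not the exact code behind Figure 02: it assumes each cell of the genres column is a JSON string holding a list of objects with a "name" field, and it uses a tiny made-up sample instead of the real file. It also assumes shares are taken over all genre labels, so a movie with two genres counts twice.

```python
import json
from collections import Counter

# Hypothetical sample of the `genres` column: each cell is assumed to be a
# JSON string holding a list of {"id": ..., "name": ...} objects.
sample_genres = [
    '[{"id": 18, "name": "Drama"}, {"id": 53, "name": "Thriller"}]',
    '[{"id": 35, "name": "Comedy"}, {"id": 18, "name": "Drama"}]',
    '[{"id": 28, "name": "Action"}]',
    '[]',  # a movie with no genre listed
]

def genre_shares(cells):
    """Return {genre: share of all genre labels}, most frequent first."""
    counts = Counter(
        genre["name"] for cell in cells for genre in json.loads(cell)
    )
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}

shares = genre_shares(sample_genres)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On the real data-set, the same function could be applied to the genres column after reading the CSV file.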

3. What are the leading production companies?

Figure 03 shows another distribution, this time for the companies that produce the movies. The leading companies are “Warner Bros”, “Universal Pictures”, “Columbia Pictures” and “Paramount Pictures”. They are dominant in the sector, with an appearance share of 76.39%.

Figure 03; Leading production companies

4. How are the average votes distributed among movies?

What about the average user vote in this data-set? Looking at the histogram, we see that most values fall into the range [4, 7]. In contrast, very high and very low values are quite rare. The shape is roughly bell-like, resembling a normal distribution.

Figure 04; Average vote distribution
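To make the idea of the histogram concrete, here is a small sketch that bins vote averages into ten equal-width bins on a 0 to 10 scale. The values are made up for illustration, and the helper name vote_histogram is my own; the figure above was not necessarily produced this way.

```python
from collections import Counter

# Hypothetical vote_average values, assumed to lie on a 0-10 scale.
sample_votes = [6.1, 7.2, 5.8, 6.9, 4.3, 8.1, 6.5, 5.0, 0.0, 10.0]

def vote_histogram(votes, n_bins=10, low=0.0, high=10.0):
    """Count votes in n_bins equal-width bins; `high` falls into the last bin."""
    width = (high - low) / n_bins
    counts = Counter(min(int((v - low) / width), n_bins - 1) for v in votes)
    return [counts.get(i, 0) for i in range(n_bins)]

hist = vote_histogram(sample_votes)

# With the default arguments each bin is one point wide.
for i, count in enumerate(hist):
    print(f"[{i:>2}, {i + 1:>2}) {'#' * count}")
```

A text histogram like this is enough to see where the bulk of the values sits before drawing a proper plot.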

5. Top 10 List of Movies

I wonder which movies are the most popular in this data-set. TMDB explains that its popularity metric is based on:

  • Number of votes for the day
  • Number of views for the day
  • Number of users who marked it as a “favorite” for the day
  • Number of users who added it to their “watch-list” for the day
  • Release date
  • Number of total votes
  • Previous day's score

I sorted the movies by popularity and built a table showing the top 10. Let us look at Figure 05.

Figure 05; Top 10 Movies
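Sorting by popularity is a one-liner with pandas. The sketch below assumes the columns are named title and popularity and uses a tiny made-up frame; top_n is a hypothetical helper name, not part of any library.

```python
import pandas as pd

# Hypothetical rows; the column names `title` and `popularity` are assumed.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "popularity": [12.5, 87.0, 43.2, 5.1],
})

def top_n(df, n=10, by="popularity"):
    """Return the n rows with the largest `by` value, highest first."""
    return df.nlargest(n, by)[["title", by]].reset_index(drop=True)

top = top_n(movies, n=3)
print(top)
```

With the real data-set, calling top_n with the default n=10 would give a table similar to Figure 05.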

6. Predicting Popularity of Movies

Having seen the top 10 most popular movies, I try to predict popularity using Linear Regression (LR) and Support Vector Regression (SVR) models and compare the results. First, I select features using a heat-map of the data-set's correlation matrix (Figure 06).

Figure 06; Correlation matrix of data-set

Since the heat-map shows the strength of the relationship between variables, I select as features the variables whose correlation coefficient with ‘popularity’ is greater than 0.5. These variables are:

  • Budget — 0.51
  • Revenue — 0.64
  • Vote count — 0.7

After deciding on the features, I build the LR and SVR models and compare the results in terms of the R² score. Both models score better on the training data; however, their performance on the test data is rather low. Moreover, LR outperforms SVR in both cases. Figure 07 summarizes the results.

Figure 07; Prediction scores

The data-set is shuffled before the models are fitted, so each run produces different train and test scores.

What can be done to improve these results? Possible answers include using more features, more data, different optimization techniques, or more powerful models. In future work, I plan to learn and apply these techniques and go deeper.
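For readers who want to try this comparison, here is a minimal sketch with scikit-learn. Several things in it are my assumptions rather than the setup behind Figure 07: the data is synthetic (three random features standing in for budget, revenue and vote count), the split is 80/20, SVR uses an RBF kernel, and the features are scaled before SVR. Passing a fixed random_state to train_test_split is one way to make the shuffle, and therefore the scores, repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for (budget, revenue, vote_count) -> popularity.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = X @ np.array([5.0, 10.0, 20.0]) + rng.normal(0, 1.0, size=300)

def evaluate(X, y, seed):
    """Fit LR and SVR on a shuffled 80/20 split; return train/test R2 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=seed
    )
    models = {
        "LR": LinearRegression(),
        # Scaling before SVR is an added assumption, not the original setup.
        "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = {
            "train": r2_score(y_train, model.predict(X_train)),
            "test": r2_score(y_test, model.predict(X_test)),
        }
    return scores

scores = evaluate(X, y, seed=42)
for name, s in scores.items():
    print(f"{name}: train R2 = {s['train']:.3f}, test R2 = {s['test']:.3f}")
```

Calling evaluate with the same seed returns the same scores, while changing the seed changes the split and the numbers, which is the run-to-run variation described above.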

7. Conclusion

In this blog post, I studied the TMDB 5000 Movie Data-set from various angles. While doing the project, I tried to learn the basics of the data science process by analyzing the data and developing machine learning models. If you are interested in this subject, you can review the work here.
