Data Analysts
Make sure you’ve read and understood these pages:
· Basic Project Requirements (Read This)
· Week 2 – Keeping Data Separate from Analysis
Project Instructions:
This project is focused on wrangling and analyzing data using pivot tables and complex formulas.
This project uses data that was scraped from IMDB. The original dataset was downloaded in early 2017 from:
https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
Download the dataset for this project here: movie_metadata.csv
Load the file into Excel to get started. Remember, the submission must be in .xlsx format (Excel).
Part A Wrangle:
1. Scroll to the “Movie_IMDB_Link” column in the dataset. That column contains a URL string from which we want to extract a specific range of characters (letters and numbers) that looks like “tt0499549”. This value will be the unique identifier for the table (primary key), so each row can then be identified by this field.
Use a formula learned in class to extract the primary key from the “Movie_IMDB_Link” column into a new column (call that new column Primary Key and make it the first column of the dataset). Leave the “Movie_IMDB_Link” column untouched. Create a named range for the new column called “PrimaryKey”. Make sure you leave the formula that you created intact so that I can see your process.
10pts
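To illustrate the logic of step 1 only (the submission itself must use an Excel formula), here is a minimal Python sketch. The sample link and the function name extract_primary_key are assumptions made for the example; the real links in the dataset may look slightly different.

```python
import re

# Hypothetical sample value; links in the dataset may carry different trailing text.
SAMPLE_LINK = "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1"

def extract_primary_key(link: str) -> str:
    """Return the 'tt' + digits identifier that follows '/title/' in the URL."""
    match = re.search(r"/title/(tt\d+)", link)
    if match is None:
        raise ValueError(f"no title identifier found in {link!r}")
    return match.group(1)

print(extract_primary_key(SAMPLE_LINK))  # tt0499549
```

In Excel the same idea is to locate where the identifier starts in the text and pull out a fixed-length or delimiter-bounded piece of it.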
2. Format the data as a table. Create named ranges for these columns: Director, Country, Gross (Gross Revenue), Budget and Title Year.
5pts
3. Copy and paste the new primary key column and the genres column into a new worksheet (call it the genre worksheet). Use an Excel feature to give each genre its own column, and label the columns Genre1, Genre2, etc. Create a function that counts the number of movies that are described with at least 3 genres.
5pts
So if you have a single cell that has “Action, Comedy, Romance” it should now be:
Action | Comedy | Romance
Where | represents a new column.
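The split-and-count idea in step 3 can be sketched in Python as follows. This is an illustration only, not the Excel feature the step asks for; the sample rows are made up, and the comma delimiter is taken from the example above (the delimiter in the actual file may differ, so it is a parameter).

```python
def split_genres(cell: str, delimiter: str = ",") -> list[str]:
    """Split one genres cell into a list, one entry per future GenreN column."""
    return [part.strip() for part in cell.split(delimiter) if part.strip()]

# Hypothetical sample cells, not taken from the dataset.
rows = [
    "Action, Comedy, Romance",
    "Drama",
    "Action, Thriller",
    "Comedy, Drama, Family, Fantasy",
]

genre_columns = [split_genres(cell) for cell in rows]

# A movie qualifies when its third genre column would be filled in.
at_least_three = sum(1 for genres in genre_columns if len(genres) >= 3)

print(genre_columns[0])  # ['Action', 'Comedy', 'Romance']
print(at_least_three)    # 2
```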
4. Clean the movie_title column by removing the special character that appears in its cell values.
5pts
5. Format the budget column in U.S. Dollars. Apply conditional formatting to the budget column, choosing the style that best shows the differences between the amounts in each cell.
5pts
5a. Scroll through the dataset. What do you notice? Is formatting the column as dollars an appropriate choice? Explain why or why not.
5pts
Part B Analysis:
Q. Which countries produced the most movies?
6. Create another worksheet called “Countries”. Copy and paste a distinct listing of countries from the raw data. Use a formula learned in class to count the total number of movies made by each country, making sure to use the appropriate named range in your equation. Create another column called “Ranking by Count” and use a function learned in class to rank the countries by their respective count. Which countries were in the top 5 based on quantity of movies produced?
10pts
Note: Do not use pivot tables.
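As a rough picture of what the count and ranking columns in step 6 should contain, here is a small Python sketch with made-up country values. The helper competition_rank is a name invented for this example; it gives tied countries the same rank and skips the next rank, which is an assumption about the ranking style you want.

```python
from collections import Counter

# Hypothetical Country column, not taken from the dataset.
countries = ["USA", "UK", "USA", "France", "UK", "USA", "Germany"]

# Conditional count: how many rows carry each distinct country.
counts = Counter(countries)

def competition_rank(values: dict[str, int]) -> dict[str, int]:
    """Rank 1 = largest value; ties share a rank and the next rank is skipped."""
    return {
        key: 1 + sum(1 for other in values.values() if other > value)
        for key, value in values.items()
    }

ranks = competition_rank(counts)
for country in sorted(counts, key=lambda c: ranks[c]):
    print(country, counts[country], ranks[country])
```

With these sample values, USA ranks 1, UK ranks 2, and France and Germany tie at rank 3.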
Q. Which countries had the largest gross revenue and biggest movie budgets?
7. In the same Countries worksheet, use another formula learned in class to bring in the Total Gross Revenues and Total Budgets of all movies by country. Again, make sure to use the appropriate named range in your equation. Create two additional columns called “Ranking by Gross Revenue” and “Ranking by Total Budget” and use a function learned in class to rank the countries by each measure respectively. Do the countries in your top 5 differ between the total revenue ranking and the total budget ranking?
10pts
7a. Describe the issue that is present in number 7 and describe how you would go about solving it.
5pts
Note: Do not use pivot tables.
Q. How many movies was each actor in?
8. In a new worksheet called “Actors”, create an unduplicated listing of all actors (from columns actor_1_name, actor_2_name, actor_3_name) in one column. Sort the column from A-Z. Use a function learned in class to count the total number of movies each actor appears in (regardless of which of the three columns they appear in).
10pts
8a. Create a new column called “Flag” and write an equation that “Flags” the actor if they appeared in more than 25 films. Filter the table by this “flag”. Create another function learned in class to count the number of actors who appeared in 30 movies or more.
10pts
Note: Do not use pivot tables.
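The logic behind steps 8 and 8a can be sketched in Python like this. The actor names are placeholders, and the two thresholds are set to small stand-in values so the tiny sample produces visible results; the assignment itself uses 25 and 30.

```python
from collections import Counter

# Hypothetical rows of (actor_1_name, actor_2_name, actor_3_name); "" marks a blank cell.
movies = [
    ("Actor A", "Actor B", "Actor C"),
    ("Actor B", "Actor A", "Actor D"),
    ("Actor A", "Actor E", ""),
]

FLAG_THRESHOLD = 1   # stand-in for "more than 25 films"
COUNT_THRESHOLD = 2  # stand-in for "30 movies or more"

# Count appearances across all three columns, ignoring blanks.
appearances = Counter(name for row in movies for name in row if name)

# Unduplicated listing, sorted A-Z.
actors_sorted = sorted(appearances)

# Flag uses a strict comparison (more than); the second count is inclusive (or more).
flags = {name: appearances[name] > FLAG_THRESHOLD for name in actors_sorted}
frequent_actors = sum(1 for n in appearances.values() if n >= COUNT_THRESHOLD)

print(actors_sorted)
print(flags)
print(frequent_actors)
```

Note the difference between the two comparisons: the flag is strict, while the final count includes actors who sit exactly on the threshold.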
Q. How many movies did each director make by year within the US?
9. Use the pivot table feature in Excel to help you answer this problem.
In a new worksheet called “Directors”, create your pivot and filter it by: Year >= 2010 and Country = USA. Use director_name as your first column and add a column for each of the years from 2010 to 2015. Next, bring in the Primary Key field (that you created in Step 1) as the value to be counted for each director in each year. Create a Total column and Total row to sum the data accordingly.
10pts
Note: Do not use equations.
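To show what the pivot table in step 9 is computing behind the scenes, here is a small Python sketch. It is for understanding only; the worksheet must be built with the pivot table feature. The records, director names and identifiers below are invented for the example.

```python
from collections import defaultdict

# Hypothetical rows of (primary key, director_name, title_year, country).
records = [
    ("tt0000001", "Director X", 2010, "USA"),
    ("tt0000002", "Director X", 2010, "USA"),
    ("tt0000003", "Director Y", 2012, "USA"),
    ("tt0000004", "Director Y", 2009, "USA"),  # dropped by the year filter
    ("tt0000005", "Director Z", 2011, "UK"),   # dropped by the country filter
]

# Rows are directors, columns are years, values are counts of primary keys.
pivot = defaultdict(lambda: defaultdict(int))
for key, director, year, country in records:
    if year >= 2010 and country == "USA":
        pivot[director][year] += 1

# Total column (per director) and Total row (per year).
row_totals = {director: sum(years.values()) for director, years in pivot.items()}
column_totals = defaultdict(int)
for years in pivot.values():
    for year, count in years.items():
        column_totals[year] += count

grand_total = sum(row_totals.values())
print(dict(row_totals), dict(column_totals), grand_total)
```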
10. Create your own analysis with the data. Clearly describe the problem/question that you’re trying to address, show your work, and explain the answer that you arrived at. Make sure the analysis is complex. It should not be something as simple as “the total number of movies in the dataset” or even the “total number of movies by year”. Your analysis should be more interesting and complicated than that, and it should not be similar to one of the other problems in this project.
10pts