Check out the files here: https://github.com/remydrain/social-buzz
A social media company would like an analysis of their content categories, highlighting the top 5 categories with the largest aggregate popularity. Additional insights into the data would be welcome.
The data was provided in the form of three separate CSV files representing:
We were also provided a data model, which I’ll include in the spreadsheet workbook. Note that for my portion of the analysis, some data tables referenced in the data dictionary were excluded as they’re out of scope of what I’m tasked with.
Since the amount of data is relatively small, I was able to handle it in Google Sheets. I loaded each of the tables into their own sheet and took a first pass over the data to see what I was working with and what columns I would be using to join the data, then did some cleaning up on the second pass.
Here’s the Content table before any cleaning:
Content
I started by removing the URL and User ID columns, since they’re not necessary for my analysis, and then I added filters to make sure that the sheet isn’t missing any data, as well as to get a sense of what the ‘Type’ and ‘Category’ columns contain.
When I looked through the content categories, I found that there were some wrapped in quotation marks, and a couple categories had the first letter capitalized where the majority didn’t. I used Find and Replace to remove all quotation marks and to standardize case. I didn’t find any other issues with the data, so I moved onto the Reactions sheet.
Reactions
As with the previous sheet, I removed the User ID column, then filtered to get a sense of the data. There were just under 3,000 rows out of about 25,500 that had missing values in the Type column. After verifying the instructions provided in the case of missing data, I deleted those rows.
Since I was interested in exploring how engagement varied throughout the week and times of day, I went ahead and split the Datetime into two separate columns, then added another column showing the day of the week.
The ReactionTypes sheet did not require any cleaning.
ReactionTypes
After taking one more pass over the sheets to make sure they were sufficiently cleaned, I duplicated the Reactions sheet to act as the base for my Aggregate, since it contained the most data of the three. I used VLOOKUP to pull in the Type and Category columns from the Content sheet (matched on Content ID), and the Sentiment and Score columns from the ReactionTypes sheet (matched on Reaction Type). I scrolled through and randomly spot-checked several rows to make sure the data matched up correctly between the aggregate and the separate tables. Here’s a glimpse of the aggregated sheet:
Cleaned, aggregated, and ready for analysis
The primary objective was to get the top 5 most popular content categories, as measured by the total reaction score. In order to do this, I first needed to identify all the categories, which I did using the UNIQUE function pointing to the range of Content Categories. Then I used SUMIF to get the total score of each category, and sorted descending by score to get all the categories with their total scores, in order from most popular to least.
Categories, ranked from highest to lowest score
We can see from this that the top 5 content categories, in order, are:
To make that look a little more interesting, I put it into a bar chart with just the top 5.
Top 5 Content Categories
Now that I had the main objective done, I could explore the data further with my own questions, starting with what type of content had the most engagement. I measured engagement by the number of reactions — of any sentiment — each particular type of content received, divided by the number of such type posted (aka, the average number of reactions for each content type). I again pulled in the different content types into a blank sheet, then used COUNTIF to tally the number of engagements for each type, and COUNTUNIQUEIFS to tally the number of each content type.
Engagement Level for Each Content Type
To see when people were most active on the site, I created a pivot table showing the total number of interactions by day of the week and hour of the day, divided by the number of days in the year that weekday occurred. I then plotted them using a radar chart, as it seemed to capture the activity by time of day well.
Average Engagement
Finally, I wanted to see which types of reactions people used the most. I again used UNIQUE and COUNTIF to get the number of times each reaction was used, and found that they’re all used a relatively similar amount.
Reaction Types Used
To recap my findings:
You can see the charts all together in a Dashboard here: Social Buzz Analysis
I also redid my analysis of the original question using R and SQL in RStudio, which you can read about here: Social Buzz Analysis - RStudio