Workshop Analyzing Social Media Data - Youtube Data

Author

Tiago Ventura

Published

October 21, 2020

Introduction

This notebook walks through Python code developed by Megan Brown, Senior Engineer at the Center for Social Media and Politics at NYU. The tutorial uses the youtube-data-api library.

If you want to learn other strategies for analyzing YouTube data, in particular a very clever way to estimate the political leaning of YouTube videos, I strongly suggest checking out the article by Lai et al., “Estimating Ideology of Youtube videos”.

What kind of data can you get from the Youtube API?

YouTube has a very extensive API, and there is a lot of data you can get access to. See a comprehensive list here.

What is included in the package:

  • video metadata
  • channel metadata
  • playlist metadata
  • subscription metadata
  • featured channel metadata
  • comment metadata
  • search results

How to Install

The package is on PyPI, so you can install it with pip:

# run in the command line
# pip install youtube-data-api

How to get an API key

A quick guide: https://developers.google.com/youtube/v3/getting-started

  1. You need a Google Account to access the Google API Console, request an API key, and register your application. You can use your Gmail account for this if you have one.

  2. Create a project in the Google Developers Console and obtain authorization credentials so your application can submit API requests.

  3. After creating your project, make sure the YouTube Data API is one of the services that your application is registered to use.

    1. Go to the API Console and select the project that you just registered.

    2. Visit the Enabled APIs page. In the list of APIs, make sure the status is ON for the YouTube Data API v3. You do not need to enable OAuth 2.0 since there are no methods in the package that require it.
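
Once you have a key, it is a good idea to keep it out of the notebook itself and to fail early if it is missing. Below is a minimal sketch of that check using only the standard library. The helper name require_api_key is hypothetical (it is not part of youtube-data-api), and YT_KEY is simply the variable name used later in this notebook.

```python
import os


def require_api_key(env=None, var_name="YT_KEY"):
    """Return the API key stored under `var_name`, or raise a clear error.

    Hypothetical helper for this tutorial; not part of youtube-data-api.
    `env` defaults to the process environment and can be any mapping.
    """
    env = os.environ if env is None else env
    key = (env.get(var_name) or "").strip()
    if not key:
        raise RuntimeError(
            f"No API key found in the environment variable {var_name!r}."
        )
    return key
```

Called without arguments, the helper reads from os.environ, so it can be used right after load_dotenv() in place of a bare os.environ.get call.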

An overview of the YouTube API

Calling the libraries

# call some libraries
import os
import datetime
import pandas as pd
from dotenv import load_dotenv

# import the API client and the helper used later in this notebook
from youtube_api import YouTubeDataAPI
from youtube_api.youtube_api_utils import get_upload_playlist_id

# load the key from an environment variable
load_dotenv()  # reads the .env file in the current working directory
api_key = os.environ.get("YT_KEY")

# create a client
yt = YouTubeDataAPI(api_key)

Starting with a channel name and getting some basic metadata

Let’s start with the LastWeekTonight channel

https://www.youtube.com/user/LastWeekTonight

First we need to get the channel id

channel_id = yt.get_channel_id_from_user('LastWeekTonight')
print(channel_id)
UC3XTzVzaHQEd30rQbuvCtTQ

Channel metadata

# collect metadata
yt.get_channel_metadata(channel_id)
{'channel_id': 'UC3XTzVzaHQEd30rQbuvCtTQ',
 'title': 'LastWeekTonight',
 'account_creation_date': 1395178899.0,
 'keywords': None,
 'description': 'Breaking news on a weekly basis. Sundays at 11PM - only on HBO.\nSubscribe to the Last Week Tonight channel for the latest videos from John Oliver and the LWT team.',
 'view_count': '3472706969',
 'video_count': '400',
 'subscription_count': '9070000',
 'playlist_id_likes': '',
 'playlist_id_uploads': 'UU3XTzVzaHQEd30rQbuvCtTQ',
 'topic_ids': 'https://en.wikipedia.org/wiki/Politics|https://en.wikipedia.org/wiki/Society|https://en.wikipedia.org/wiki/Entertainment|https://en.wikipedia.org/wiki/Television_program',
 'country': None,
 'collection_date': datetime.datetime(2022, 10, 18, 23, 17, 20, 78616)}
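
Notice that account_creation_date appears to be a Unix timestamp (seconds since 1970) and that the counts come back as strings. Here is a minimal sketch of tidying such a record with the standard library. The helper name tidy_channel_record is hypothetical, and the sample dictionary just copies a few fields from the output above.

```python
from datetime import datetime, timezone


def tidy_channel_record(record):
    """Return a copy with the creation date as a UTC datetime and counts as ints.

    Hypothetical helper; assumes the field names shown in the output above.
    """
    tidy = dict(record)
    tidy["account_creation_date"] = datetime.fromtimestamp(
        record["account_creation_date"], tz=timezone.utc
    )
    for field in ("view_count", "video_count", "subscription_count"):
        tidy[field] = int(record[field])
    return tidy


# A few fields copied from the channel metadata printed above.
sample = {
    "channel_id": "UC3XTzVzaHQEd30rQbuvCtTQ",
    "account_creation_date": 1395178899.0,
    "view_count": "3472706969",
    "video_count": "400",
    "subscription_count": "9070000",
}

tidy = tidy_channel_record(sample)
print(tidy["account_creation_date"].date())  # 2014-03-18
print(tidy["view_count"] // tidy["video_count"])  # average views per video
```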

Subscriptions of the channel.

pd.DataFrame(yt.get_subscriptions(channel_id))
   subscription_title         subscription_channel_id   subscription_kind  subscription_publish_date  collection_date
0  trueblood                  UCPnlBOg4_NU9wdhRN-vzECQ  youtube#channel    1.395357e+09               2022-10-18 23:17:20.206669
1  GameofThrones              UCQzdMyuz0Lf4zo4uGcEujFw  youtube#channel    1.395357e+09               2022-10-18 23:17:20.206716
2  HBO                        UCVTQuK2CaWaTgSsoNkn5AiQ  youtube#channel    1.395357e+09               2022-10-18 23:17:20.206752
3  HBOBoxing                  UCWPQB43yGKEum3eW0P9N_nQ  youtube#channel    1.395357e+09               2022-10-18 23:17:20.206792
4  Cinemax                    UCYbinjMxWwjRpp4WqgDqEDA  youtube#channel    1.424812e+09               2022-10-18 23:17:20.206835
5  HBODocs                    UCbKo3HsaBOPhdRpgzqtRnqA  youtube#channel    1.395357e+09               2022-10-18 23:17:20.206870
6  HBOLatino                  UCeKum6mhlVAjUFIW15mVBPg  youtube#channel    1.395357e+09               2022-10-18 23:17:20.206904
7  OfficialAmySedaris         UCicerXLHzJaKYHm1IwvTn8A  youtube#channel    1.461561e+09               2022-10-18 23:17:20.206937
8  Real Time with Bill Maher  UCy6kyFxaMqGtpE3pQTflK8A  youtube#channel    1.418342e+09               2022-10-18 23:17:20.206971

List of videos of the channel

To get every video a channel has posted, you first need to convert the channel_id into the id of the channel's uploads playlist, using a helper function from the package's youtube_api_utils module. With the playlist id you can retrieve the video ids, and from those collect metadata, comments, and more.

from youtube_api.youtube_api_utils import get_upload_playlist_id

# convert the channel id into the id of its uploads playlist
playlist_id = get_upload_playlist_id(channel_id)
print(playlist_id)

# get the video ids in that playlist
videos = yt.get_videos_from_playlist_id(playlist_id)
df = pd.DataFrame(videos)
UU3XTzVzaHQEd30rQbuvCtTQ
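
Comparing the two ids printed so far, the uploads playlist id looks like the channel id with the leading UC replaced by UU. The sketch below reproduces that pattern offline. It is an assumption based on the output in this notebook, not a description of how get_upload_playlist_id is implemented, and the name uploads_playlist_from_channel is hypothetical.

```python
def uploads_playlist_from_channel(channel_id):
    """Swap the leading 'UC' of a channel id for 'UU'.

    Hypothetical helper that mirrors the pattern seen in this notebook's output.
    """
    if not channel_id.startswith("UC"):
        raise ValueError(f"expected a channel id starting with 'UC': {channel_id!r}")
    return "UU" + channel_id[2:]


print(uploads_playlist_from_channel("UC3XTzVzaHQEd30rQbuvCtTQ"))
# UU3XTzVzaHQEd30rQbuvCtTQ
```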

Collect video metadata

# ids of the first five videos, as a list
video_ids = df.video_id.tolist()[:5]

# grab metadata
video_meta = yt.get_video_metadata(video_ids)

# display the first two records
pd.DataFrame(video_meta[:2])
video_id channel_title channel_id video_publish_date video_title video_description video_category video_view_count video_comment_count video_like_count video_dislike_count video_thumbnail video_tags collection_date
0 Ns8NvPPHX5Y LastWeekTonight UC3XTzVzaHQEd30rQbuvCtTQ 1.666003e+09 Transgender Rights II: Last Week Tonight with ... John Oliver discusses the latest round of atta... 24 2227137 23236 95244 None https://i.ytimg.com/vi/Ns8NvPPHX5Y/hqdefault.jpg 2022-10-18 23:17:21.274582
1 kCOnGjvYKI0 LastWeekTonight UC3XTzVzaHQEd30rQbuvCtTQ 1.665398e+09 Crime Reporting: Last Week Tonight with John O... John Oliver discusses the outlets that cover c... 24 3232207 5943 89845 None https://i.ytimg.com/vi/kCOnGjvYKI0/hqdefault.jpg 2022-10-18 23:17:21.274612
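
When you move from five videos to a channel's full upload list, it can help to work through the ids in batches, so that a failure partway through does not cost you everything collected so far. A minimal sketch of such batching follows. The helper name chunked is hypothetical, and the batch size is a choice you make, not something the package requires.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements.

    Hypothetical helper; not part of youtube-data-api.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# The two video ids shown in the table above, split into batches of one.
demo_ids = ["Ns8NvPPHX5Y", "kCOnGjvYKI0"]
for batch in chunked(demo_ids, 1):
    print(batch)

# With a live client, the same loop could collect metadata batch by batch:
# all_meta = []
# for batch in chunked(df.video_id.tolist(), 50):
#     all_meta.extend(yt.get_video_metadata(batch))
```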

Collect Comments

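ids = df.video_id.tolist()[:5]

# loop over the videos and collect comments for each
list_comments = []
for video_id in ids:
    comments = yt.get_video_comments(video_id, max_results=10)
    list_comments.append(pd.DataFrame(comments))

# concatenate into a single data frame (kept separate from df, the video list)
comments_df = pd.concat(list_comments)
comments_df.head()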
video_id commenter_channel_url commenter_channel_id commenter_channel_display_name comment_id comment_like_count comment_publish_date text commenter_rating comment_parent_id collection_date reply_count
0 Ns8NvPPHX5Y http://www.youtube.com/channel/UCQcuYcWoTYbtlx... UCQcuYcWoTYbtlxRMI0bruXQ Brigid Pfenninger UgwbjPhusWygy8um6YF4AaABAg 0 1.666164e+09 I wish people would mind their own business. W... none None 2022-10-18 23:17:21.500352 0
1 Ns8NvPPHX5Y http://www.youtube.com/channel/UClGdRsKW0vxmmo... UClGdRsKW0vxmmoXPU8wpNgg E Bellyfish Ugwuodu_xhHB2_SacIJ4AaABAg 0 1.666164e+09 Trans rights are human rights. none None 2022-10-18 23:17:21.500415 0
2 Ns8NvPPHX5Y http://www.youtube.com/channel/UCl2MFhAwpOwCgj... UCl2MFhAwpOwCgjnPIN9zScQ fantasia243645 Ugy-aIoF-2KEWrxLHh54AaABAg 3 1.666163e+09 Why are people allowing children to consent to... none None 2022-10-18 23:17:21.500464 2
3 Ns8NvPPHX5Y http://www.youtube.com/channel/UCI3JogrrM3q8sZ... UCI3JogrrM3q8sZzcxE23H0w Democracy Lives UgxQ--0X5omlypnQBNl4AaABAg 1 1.666163e+09 The comparison to CRT panic is on point! Unfor... none None 2022-10-18 23:17:21.500511 0
4 Ns8NvPPHX5Y http://www.youtube.com/channel/UCwL5M4gdC00-a5... UCwL5M4gdC00-a5ZVFNZ0kQA Lizard King UgwjukKuviwU4pZRRHZ4AaABAg 0 1.666163e+09 TBH it sounds pretty hot to fall into some kin... none None 2022-10-18 23:17:21.500557 0
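
A quick per-video summary is often the first thing you want from comment data. The sketch below assumes each comment is a dictionary with the column names shown above, and it uses only the video_id, comment_like_count, and reply_count values from the five rows printed in the table. The helper name summarize_comments is hypothetical.

```python
from collections import defaultdict


def summarize_comments(records):
    """Count comments, likes, and comments with replies for each video.

    Hypothetical helper; assumes the field names shown in the table above.
    """
    summary = defaultdict(lambda: {"comments": 0, "likes": 0, "with_replies": 0})
    for rec in records:
        stats = summary[rec["video_id"]]
        stats["comments"] += 1
        stats["likes"] += int(rec["comment_like_count"])
        if int(rec["reply_count"]) > 0:
            stats["with_replies"] += 1
    return dict(summary)


# Like and reply counts copied from the five rows shown above.
rows = [
    {"video_id": "Ns8NvPPHX5Y", "comment_like_count": 0, "reply_count": 0},
    {"video_id": "Ns8NvPPHX5Y", "comment_like_count": 0, "reply_count": 0},
    {"video_id": "Ns8NvPPHX5Y", "comment_like_count": 3, "reply_count": 2},
    {"video_id": "Ns8NvPPHX5Y", "comment_like_count": 1, "reply_count": 0},
    {"video_id": "Ns8NvPPHX5Y", "comment_like_count": 0, "reply_count": 0},
]

print(summarize_comments(rows))
# {'Ns8NvPPHX5Y': {'comments': 5, 'likes': 4, 'with_replies': 1}}
```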

Want more?

If you want to learn more about YouTube, you should definitely check out these two papers from my CSMaP colleagues.