Workshop Analyzing Social Media Data | Telegram Data

Author

Tiago Ventura

Published

October 21, 2020

Introduction

This notebook walks through some code in Python and R to download and clean data from Telegram.

Telegram has become a very important messaging app, particularly in the Global South. What makes Telegram unique, particularly compared to WhatsApp, is the reluctance of the company to comply with content moderation policies determined by local governments.

For this reason, Telegram has become a platform marked by the circulation of extremely polarizing content, misinformation, rumors, and the organization of harmful groups.

To capture Telegram data, we will use the Python library telethon. This library provides access to the Telegram API, from which you can grab information from channels using your own account.

The code I present below is inspired by this Medium post.

Get your Telegram API credentials

To connect to Telegram, we need an api_id and an api_hash.

To get those, you need to log in to your Telegram core account and go to the API development tools area.

Here’s a short tutorial about how to get your API credentials.

Installing Telethon

The first step is to install the Python library.

#pip3 install telethon

API Keys

Now, we will load our keys.

# call some libraries
import os
import datetime
import pandas as pd
from dotenv import load_dotenv
import json

# get the keys
# load keys from environment variables
load_dotenv() # reads the .env file in the current working directory
telegram_id= os.environ.get("telegram_id")
telegram_hash= os.environ.get("telegram_hash")

# also need your cellphone and username from telegram
phone=os.environ.get("phone_number")
username= os.environ.get("username")
username
'venturatds'
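
If one of the four settings is missing from the .env file, os.environ.get silently returns None and the error only shows up later at login. A minimal sketch of an early check, assuming the same four variable names used above (the helper read_credentials is hypothetical, not part of telethon or python-dotenv):

```python
import os

# Names of the settings read above; adjust if your .env uses other names.
REQUIRED_KEYS = ("telegram_id", "telegram_hash", "phone_number", "username")


def read_credentials(env=os.environ):
    """Return the four settings as a dict, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing settings in .env: " + ", ".join(missing))
    creds = {key: env[key] for key in REQUIRED_KEYS}
    # The api_id is numeric, so convert it once here.
    creds["telegram_id"] = int(creds["telegram_id"])
    return creds
```

You would call read_credentials() right after load_dotenv() and stop early if it raises.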

Log in to Telegram

Now that everything is set up, we need to create a client and log in to our Telegram account.

# call packages
from telethon import TelegramClient
from telethon.errors import SessionPasswordNeededError
from telethon import sync  # lets us call client methods without await


# Create the client and connect
def telegram_start(username, api_id, api_hash, phone):
  # username is used as the name of the local session file
  client = TelegramClient(username, api_id, api_hash)
  # start() connects and, on first use, walks through the interactive login
  client.start(phone=phone)
  print("Client Created")
  # Ensure you're authorized (fallback in case the login above did not complete)
  if not client.is_user_authorized():
      client.send_code_request(phone)
      try:
          client.sign_in(phone, input('Enter the code: '))
      except SessionPasswordNeededError:
          # accounts with two-step verification also need the password
          client.sign_in(password=input('Password: '))
  return client

# Run the function
client = telegram_start(username, telegram_id, telegram_hash, phone)

Getting Channel Members

from telethon.tl.functions.channels import GetParticipantsRequest
from telethon.tl.types import ChannelParticipantsSearch
from telethon.tl.types import (PeerChannel)

# Let's get members of the Lula Channel on Telegram
input_channel = "https://t.me/UrnasEletronicaseEleicoesBrasil"

## Getting information from channel 
my_channel = client.get_entity(input_channel)

## get channel members
offset = 0
limit = 500
all_participants = []

while True:
    participants = client(GetParticipantsRequest(
        my_channel, ChannelParticipantsSearch(''), offset, limit,
        hash=0
    ))
    if not participants.users:
        break
    all_participants.extend(participants.users)
    offset += len(participants.users)



# Keep only the fields we need from each user
all_user_details = []
for participant in all_participants:
    all_user_details.append(
        {"id": participant.id, "first_name": participant.first_name, "last_name": participant.last_name,
         "user": participant.username, "phone": participant.phone, "is_bot": participant.bot})

# Check it out
df = pd.DataFrame(all_user_details)
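
The manual offset loop above can also be written with Telethon's higher-level iterator. A minimal sketch, assuming Telethon version 1 with telethon.sync imported as earlier, so that client.iter_participants can be used in a plain for loop; collect_participants is a hypothetical helper name:

```python
def collect_participants(client, channel):
    """Return one dictionary per channel member, using Telethon's own paging."""
    rows = []
    # iter_participants requests the pages for us, so no offset bookkeeping.
    for user in client.iter_participants(channel):
        rows.append({
            "id": user.id,
            "first_name": user.first_name,
            "last_name": user.last_name,
            "user": user.username,
            "phone": user.phone,
            "is_bot": user.bot,
        })
    return rows
```

With the objects defined above, pd.DataFrame(collect_participants(client, my_channel)) would give the same columns as the loop version.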

Getting Messages

A single request returns at most 100 messages, even if you ask for a larger limit. To get all the messages in the chat, you need to wrap the request in a loop.

from telethon.tl.functions.messages import GetHistoryRequest

offset_id = 0      # 0 starts from the most recent message
limit = 1000       # requested number of messages
all_messages = []

# capture data
history = client(GetHistoryRequest(
        peer=my_channel,
        offset_id=offset_id,
        offset_date=None,
        add_offset=0,
        limit=limit,
        max_id=0,
        min_id=0,
        hash=0
    ))
    
# get messages objects
messages = history.messages

# convert to a dictionary
for message in messages:
      all_messages.append(message.to_dict())

# save json
with open('data_telegram/message_data.json', 'w') as outfile:
    json.dump(all_messages, outfile, indent=4, sort_keys=True, default=str)
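
The loop mentioned above can be sketched as follows. The idea is to pass the id of the oldest message received so far as the next offset_id, and to stop when a request comes back empty. Both function names are hypothetical, and the page size of 100 is an assumption based on the limit described in this section:

```python
def fetch_all_messages(fetch_page, max_messages=None):
    """Call fetch_page(offset_id) repeatedly until it returns no messages.

    fetch_page must return a list of message dictionaries, newest first,
    each with an "id" key.
    """
    all_messages = []
    offset_id = 0  # 0 means "start from the newest message"
    while True:
        page = fetch_page(offset_id)
        if not page:
            break
        all_messages.extend(page)
        # The last item is the oldest one; the next page starts below it.
        offset_id = page[-1]["id"]
        if max_messages is not None and len(all_messages) >= max_messages:
            return all_messages[:max_messages]
    return all_messages


def telethon_page_fetcher(client, channel, page_size=100):
    """Build a fetch_page function around the GetHistoryRequest call used above."""
    from telethon.tl.functions.messages import GetHistoryRequest

    def fetch_page(offset_id):
        history = client(GetHistoryRequest(
            peer=channel, offset_id=offset_id, offset_date=None,
            add_offset=0, limit=page_size, max_id=0, min_id=0, hash=0))
        return [message.to_dict() for message in history.messages]

    return fetch_page
```

With the client and channel from the previous sections, all_messages = fetch_all_messages(telethon_page_fetcher(client, my_channel), max_messages=5000) would collect up to 5,000 messages. For large channels, consider pausing between requests, since Telegram rate-limits heavy use.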

Quick data cleaning

# convert to pandas
# Open the JSON file; json.load returns a list of dictionaries, one per message
with open('data_telegram/message_data.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)
df.keys()
Index(['_', 'date', 'edit_date', 'edit_hide', 'entities', 'forwards',
       'from_id', 'from_scheduled', 'fwd_from', 'grouped_id', 'id', 'legacy',
       'media', 'media_unread', 'mentioned', 'message', 'noforwards', 'out',
       'peer_id', 'pinned', 'post', 'post_author', 'reactions', 'replies',
       'reply_markup', 'reply_to', 'restriction_reason', 'silent',
       'ttl_period', 'via_bot_id', 'views', 'action'],
      dtype='object')
# expand the nested from_id dictionaries into their own columns
df = pd.concat([df, df["from_id"].apply(pd.Series)], axis=1)

# See
df.head()
_ date edit_date edit_hide entities forwards from_id from_scheduled fwd_from grouped_id ... reply_markup reply_to restriction_reason silent ttl_period via_bot_id views action _ user_id
0 Message 2022-10-19 01:36:09+00:00 None False [] NaN {'_': 'PeerUser', 'user_id': 400651691} False None NaN ... None None [] False 86400.0 NaN NaN NaN PeerUser 400651691
1 Message 2022-10-19 01:26:10+00:00 None False [{'_': 'MessageEntityPhone', 'length': 9, 'off... NaN {'_': 'PeerUser', 'user_id': 1370474841} False {'_': 'MessageFwdHeader', 'channel_post': None... NaN ... None None [] False 86400.0 NaN NaN NaN PeerUser 1370474841
2 Message 2022-10-19 01:02:19+00:00 None False [{'_': 'MessageEntityUrl', 'length': 28, 'offs... NaN {'_': 'PeerUser', 'user_id': 1502201089} False None NaN ... None None [] False 86400.0 NaN NaN NaN PeerUser 1502201089
3 Message 2022-10-19 00:56:46+00:00 2022-10-19 00:56:52+00:00 False [] NaN {'_': 'PeerUser', 'user_id': 1370474841} False None NaN ... None None [] False 86400.0 NaN NaN NaN PeerUser 1370474841
4 Message 2022-10-19 00:55:24+00:00 None False [{'_': 'MessageEntityMention', 'length': 20, '... 73.0 {'_': 'PeerUser', 'user_id': 1370474841} False {'_': 'MessageFwdHeader', 'channel_post': 3032... NaN ... None None [] False 86400.0 NaN 15007.0 NaN PeerUser 1370474841

5 rows × 34 columns
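
Most analyses need only a handful of these 34 columns. A minimal sketch that selects a few fields before building the data frame, and that tolerates messages whose from_id is empty; flatten_message is a hypothetical helper, and the choice of fields is only an illustration:

```python
def flatten_message(msg):
    """Keep a few fields from one message dictionary and unpack the sender."""
    # from_id can be None (for example on service messages), so fall back to {}.
    sender = msg.get("from_id") or {}
    return {
        "id": msg.get("id"),
        "date": msg.get("date"),
        "text": msg.get("message"),
        "sender_type": sender.get("_"),
        # Only user senders are unpacked here; other peer types give None.
        "sender_id": sender.get("user_id"),
        "views": msg.get("views"),
        "forwards": msg.get("forwards"),
    }
```

With the data list loaded above, pd.DataFrame([flatten_message(m) for m in data]) gives one tidy row per message.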

Conclusion

This is a very quick introduction. If you want to do this at scale, you first need to curate a list of channels you are interested in. Second, you need to host this code on a server so that you can make repeated calls over several days. Third, you will probably want to use Telethon's asynchronous interface, which is built on Python's asyncio, to make this code more efficient.
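
As a rough illustration of the third point, the sketch below collects several channels concurrently with asyncio while capping how many requests run at once. It does not call Telethon itself: fetch_one is a hypothetical coroutine you would write around your own Telethon calls, and the default cap of three concurrent channels is an arbitrary assumption.

```python
import asyncio


async def gather_channels(channels, fetch_one, max_concurrent=3):
    """Run fetch_one(channel) for every channel, a few at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(channel):
        # The semaphore keeps at most max_concurrent fetches in flight.
        async with semaphore:
            return channel, await fetch_one(channel)

    pairs = await asyncio.gather(*(guarded(c) for c in channels))
    return dict(pairs)
```

The result is a dictionary that maps each channel in your curated list to whatever fetch_one returned for it.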