#pip3 install telethon
Workshop Analyzing Social Media Data | Telegram Data
Introduction
This notebook walks through some code in Python and R to download and clean data from Telegram.
Telegram has become a very important social media messaging app, particularly in the Global South. What makes Telegram unique, particularly compared to WhatsApp, is the company's reluctance to comply with content moderation policies set by local governments.
For this reason, Telegram has become a platform marked by the circulation of extremely polarizing content, misinformation, and rumors, and by the organization of harmful groups.
To capture Telegram data, we will use the Python library Telethon. This library provides access to the Telegram API, from which you can grab information from channels using your own account.
The code I present below is inspired by this Medium post.
Get your Telegram API credentials
To connect to Telegram, we need an api_id and an api_hash.
To get those, you need to log in to your Telegram core account and go to the API development tools area.
Here’s a short tutorial about how to get your API credentials.
Installing Telethon
The first step is to install the Python library from the command line with pip3 install telethon.
API Keys
Now, we will load our keys
# call some libraries
import os
import datetime
import pandas as pd
from dotenv import load_dotenv
import json

# get the keys
# load keys from environmental var
# .env file in cwd
load_dotenv()
telegram_id = os.environ.get("telegram_id")
telegram_hash = os.environ.get("telegram_hash")

# also need your cellphone and username from telegram
phone = os.environ.get("phone_number")
username = os.environ.get("username")
username
'venturatds'
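Because os.environ.get silently returns None when a variable is missing, a typo in the .env file only shows up later as a confusing login error. Here is a minimal sketch of a check that fails early instead. The helper name read_credentials and the example values are my own placeholders, not part of Telethon or python-dotenv; the variable names are the ones used above.

```python
import os

# Variable names used in this notebook's .env file
REQUIRED_KEYS = ("telegram_id", "telegram_hash", "phone_number", "username")


def read_credentials(env=None):
    """Return the Telegram settings as a dict, failing early if any are missing.

    Hypothetical helper: `env` defaults to os.environ but can be any mapping,
    which makes the function easy to try out without a real .env file.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    creds = {key: env[key] for key in REQUIRED_KEYS}
    # The API id is numeric, so convert it once here
    creds["telegram_id"] = int(creds["telegram_id"])
    return creds


# Made-up values, only to show the shape of the result
example_env = {
    "telegram_id": "12345",
    "telegram_hash": "0123456789abcdef",
    "phone_number": "+10000000000",
    "username": "my_user",
}
creds = read_credentials(example_env)
print(creds["telegram_id"])
```

In the notebook you would call read_credentials() with no argument, right after load_dotenv().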
Log in to Telegram
Now that everything is set up, we need to create a client and log in to our Telegram account.
# call packages
from telethon import TelegramClient
from telethon.errors import SessionPasswordNeededError
from telethon import sync

# Create the client and connect
def telegram_start(username, api_id, api_hash):
    client = TelegramClient(username, api_id, api_hash)
    client.start()
    print("Client Created")
    # Ensure you're authorized
    if not client.is_user_authorized():
        client.send_code_request(phone)
        try:
            client.sign_in(phone, input('Enter the code: '))
        except SessionPasswordNeededError:
            client.sign_in(password=input('Password: '))
    return client

# Run the function
client = telegram_start(username, telegram_id, telegram_hash)
Getting Channel Members
from telethon.tl.functions.channels import GetParticipantsRequest
from telethon.tl.types import ChannelParticipantsSearch
from telethon.tl.types import (PeerChannel)

# Let's get members of the Lula Channel on Telegram
input_channel = "https://t.me/UrnasEletronicaseEleicoesBrasil"

## Getting information from channel
my_channel = client.get_entity(input_channel)

## get channel members
offset = 0
limit = 500
all_participants = []

while True:
    participants = client(GetParticipantsRequest(
        my_channel, ChannelParticipantsSearch(''), offset, limit,
        hash=0
    ))
    if not participants.users:
        break
    all_participants.extend(participants.users)
    offset += len(participants.users)

# Collect user details as a list of dictionaries
all_user_details = []
for participant in all_participants:
    all_user_details.append(
        {"id": participant.id, "first_name": participant.first_name, "last_name": participant.last_name,
         "user": participant.username, "phone": participant.phone, "is_bot": participant.bot})

# Check it out
df = pd.DataFrame(all_user_details)
df
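The while loop above is a plain offset-based pagination: ask for a page, stop when it comes back empty, otherwise move the offset forward by the number of items received. To see the pattern on its own, here is a small sketch that runs without a Telegram connection. The names fetch_all and fake_fetch_page are hypothetical, and the fake member list stands in for the real request.

```python
def fetch_all(fetch_page, limit=500):
    """Collect every item from an offset-paginated source.

    `fetch_page(offset, limit)` is any function that returns a list of items
    and an empty list once the offset is past the end.
    """
    offset = 0
    items = []
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break
        items.extend(page)
        # Advance by what was actually received, not by `limit`
        offset += len(page)
    return items


# Stand-in for the Telegram request: 1,234 fake member ids
fake_members = list(range(1234))


def fake_fetch_page(offset, limit):
    return fake_members[offset:offset + limit]


members = fetch_all(fake_fetch_page, limit=500)
print(len(members))
```

With Telethon, fetch_page would wrap the GetParticipantsRequest call and return participants.users. Telethon also offers a higher-level client.get_participants method that handles the paging for you, which may be simpler if you do not need control over each request.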
Getting Messages
This only gets you one batch of messages (Telegram returns about 100 per request, even when limit is set higher). You need to wrap the request in a loop to get all the messages in the chat.
from telethon.tl.functions.messages import (GetHistoryRequest)
from telethon.tl.types import (PeerChannel)
import json

offset_id = 0
limit = 1000
all_messages = []
total_messages = 0
total_count_limit = 0

# capture data
history = client(GetHistoryRequest(
    peer=my_channel,
    offset_id=offset_id,
    offset_date=None,
    add_offset=0,
    limit=limit,
    max_id=0,
    min_id=0,
    hash=0
))

# get messages objects
messages = history.messages

# convert to a dictionary
for message in messages:
    all_messages.append(message.to_dict())

# save json
with open('data_telegram/message_data.json', 'w') as outfile:
    json.dump(all_messages, outfile, indent=4, sort_keys=True, default=str)
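The loop mentioned above can be sketched as follows. The idea is to pass the id of the oldest message received so far as the next offset_id, so that each request continues further back in the history, and to stop when a batch comes back empty or when an optional cap (total_count_limit, where 0 means no cap) is reached. The function names and the fake history are my own placeholders so the example runs offline; in the notebook, fetch_batch would wrap the GetHistoryRequest call and convert each message with to_dict().

```python
def collect_messages(fetch_batch, total_count_limit=0):
    """Page backwards through a chat history.

    `fetch_batch(offset_id)` must return a list of message dicts, newest
    first, containing only messages older than `offset_id` (0 means start
    from the newest message).
    """
    offset_id = 0
    all_messages = []
    while True:
        batch = fetch_batch(offset_id)
        if not batch:
            break
        all_messages.extend(batch)
        # Continue from the oldest message seen so far
        offset_id = batch[-1]["id"]
        if total_count_limit and len(all_messages) >= total_count_limit:
            break
    return all_messages


# Stand-in for the Telegram request: 250 fake messages with ids 250..1
fake_history = [{"id": i, "message": f"text {i}"} for i in range(250, 0, -1)]


def fake_fetch_batch(offset_id, batch_size=100):
    older = [m for m in fake_history if offset_id == 0 or m["id"] < offset_id]
    return older[:batch_size]


messages = collect_messages(fake_fetch_batch)
print(len(messages))
```

Note that the cap is checked after each batch, so the result can overshoot total_count_limit by up to one batch. For large channels it is also worth pausing between requests to stay within Telegram's rate limits.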
Quick data cleaning
# convert to pandas
# Opening JSON file
f = open('data_telegram/message_data.json')

# returns JSON object as
# a dictionary
data = json.load(f)

df = pd.DataFrame(data)
df.keys()
Index(['_', 'date', 'edit_date', 'edit_hide', 'entities', 'forwards',
'from_id', 'from_scheduled', 'fwd_from', 'grouped_id', 'id', 'legacy',
'media', 'media_unread', 'mentioned', 'message', 'noforwards', 'out',
'peer_id', 'pinned', 'post', 'post_author', 'reactions', 'replies',
'reply_markup', 'reply_to', 'restriction_reason', 'silent',
'ttl_period', 'via_bot_id', 'views', 'action'],
dtype='object')
# open nested lists
df = pd.concat([df, df["from_id"].apply(pd.Series)], axis=1)

# See
df.head()
 | _ | date | edit_date | edit_hide | entities | forwards | from_id | from_scheduled | fwd_from | grouped_id | ... | reply_markup | reply_to | restriction_reason | silent | ttl_period | via_bot_id | views | action | _ | user_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Message | 2022-10-19 01:36:09+00:00 | None | False | [] | NaN | {'_': 'PeerUser', 'user_id': 400651691} | False | None | NaN | ... | None | None | [] | False | 86400.0 | NaN | NaN | NaN | PeerUser | 400651691 |
1 | Message | 2022-10-19 01:26:10+00:00 | None | False | [{'_': 'MessageEntityPhone', 'length': 9, 'off... | NaN | {'_': 'PeerUser', 'user_id': 1370474841} | False | {'_': 'MessageFwdHeader', 'channel_post': None... | NaN | ... | None | None | [] | False | 86400.0 | NaN | NaN | NaN | PeerUser | 1370474841 |
2 | Message | 2022-10-19 01:02:19+00:00 | None | False | [{'_': 'MessageEntityUrl', 'length': 28, 'offs... | NaN | {'_': 'PeerUser', 'user_id': 1502201089} | False | None | NaN | ... | None | None | [] | False | 86400.0 | NaN | NaN | NaN | PeerUser | 1502201089 |
3 | Message | 2022-10-19 00:56:46+00:00 | 2022-10-19 00:56:52+00:00 | False | [] | NaN | {'_': 'PeerUser', 'user_id': 1370474841} | False | None | NaN | ... | None | None | [] | False | 86400.0 | NaN | NaN | NaN | PeerUser | 1370474841 |
4 | Message | 2022-10-19 00:55:24+00:00 | None | False | [{'_': 'MessageEntityMention', 'length': 20, '... | 73.0 | {'_': 'PeerUser', 'user_id': 1370474841} | False | {'_': 'MessageFwdHeader', 'channel_post': 3032... | NaN | ... | None | None | [] | False | 86400.0 | NaN | 15007.0 | NaN | PeerUser | 1370474841 |
5 rows × 34 columns
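Notice that the expanded table ends up with two columns named _ (one from the message, one from from_id), which makes later selection ambiguous. One way around this is to flatten the nested dictionaries before building the data frame, prefixing each nested key with its parent. Here is a minimal sketch using only the standard library. The helper flatten_record is hypothetical, and the example row is a made-up record shaped like the output above.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so that deeper levels are prefixed as well
            flat.update(flatten_record(value, name, sep))
        else:
            flat[name] = value
    return flat


# Made-up record with the same shape as a saved message
row = {
    "id": 1,
    "message": "hello",
    "from_id": {"_": "PeerUser", "user_id": 400651691},
}
flat_row = flatten_record(row)
print(flat_row)
```

Applied to the loaded JSON, this would be pd.DataFrame([flatten_record(r) for r in data]). pandas also ships pd.json_normalize, which does a similar job directly on a list of dictionaries.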
Conclusion
This is a very quick introduction. If you want to do this at scale, you first need to curate a list of channels you are interested in. Second, you need to host this code on a server so that you can make repeated calls over several days. Third, you will probably want to use Telethon's asynchronous interface (the library is built on Python's asyncio) to make this code more efficient.
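To give a feel for the third point, here is a small asyncio sketch that processes a list of channels concurrently while capping how many requests run at once. Everything in it is a placeholder: fake_download stands in for an asynchronous Telegram request, the channel names are invented, and the cap of two concurrent tasks is an arbitrary choice, not a Telegram recommendation.

```python
import asyncio


async def fake_download(channel):
    """Placeholder for an async Telegram request; returns a fake count."""
    await asyncio.sleep(0.01)  # simulate network latency
    return channel, len(channel)


async def download_all(channels, max_concurrent=2):
    """Run one download per channel, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(channel):
        async with semaphore:
            return await fake_download(channel)

    # gather keeps results in the same order as the input list
    results = await asyncio.gather(*(limited(c) for c in channels))
    return dict(results)


channels = ["channel_a", "channel_bb", "channel_ccc"]
counts = asyncio.run(download_all(channels))
print(counts)
```

Inside a Jupyter notebook an event loop is already running, so you would write counts = await download_all(channels) instead of calling asyncio.run.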