select
country,
ccode,
year,
riots from conflict
Week 13: SQL
2023-12-03
Brief Discussion about your Presentations
Introduction to SQL
Motivation
Ways to use SQL
Overview of main functions
Discuss your projects
You should prepare a 10-12 minutes presentation. It should cover:
Motivation or problem statement (THE LEAST IMPORTANT)
Data collection
Methods
Results (THE MOST IMPORTANT)
Lessons learned and next steps
Assume you are starting a company collecting daily a high volume of data, you can:
Use a file system to store data as daily csvs, jsons, etc..
OR build a structured, secured and efficient system to store and access the data
While the first works at small scale (your computer and your DS final projects), if you have a large scale data collection, OR continous insertion of data, thinking og a secured system is often the way to go.
Structured Data Storage: Decide which variables to store.
Enable Relationships: Build ids for your variables
Define types: pre-specify which types each columns is
Validation Rules: define a set of rules to guarantee integrity of your data
Database A database is a structured collection of data designed to store, manage, and retrieve information efficiently.
Database Management System serves as an interface between the database and its users or the application programs (MySQL, SQLite, Hive, BigQuery (Google), Presto (Meta), Redshift (Amazon) are all different DBMS.)
As Data Scientists, we will often use Databases, instead of building them. For our purposes, two main advantages:
Memory efficient (relational component)
Relies on Hard Drive memory for storage, not RAM!!!
Databases are optimized to reduce the dependency on RAM when querying your data.
DBMS enforces rules to maintain data integrity and consistency (stop you of adding corrupt data)
SQL (pronounced S-Q-L or SEQUEL) is a language designed to query relational databases
While relatively uncommon in academia, many companies / governments expect data scientists to be able to write SQL queries
The result of an SQL query is always a table
It’s a nonprocedural language: define inputs and outputs; how the statement is executed is left to the optimizer
SQL is a language that works with many types of relational database systems. Nearly identical syntax but some small differences on the margins
Use an R or Python package that helps you connect with a specific type of database (sqlite3 library in Python)
Establish a connection between your local computer and the database
Write a SQL query
Execute the query
Pull the result and work with the result in that language
SELECT columns
FROM a table in a database
WHERE rows meet a condition
GROUP BY values of a column
ORDER BY values of a column when displaying results
LIMIT to only X number of rows in resulting table
Always required: SELECT and FROM. Rest are optional.
SELECT can be combined with functions such as SUM, COUNT, AVG… including when using GROUP BY
Relational databases are all about joins. SQL offers the same type of joins as we saw in pandas
inner join (default)
left join
outer join
Use join + on to define which column to join.
Data science I: Foundations