PPOL 5203 Data Science I: Foundations

Working with Databases

Tiago Ventura


Learning Goals

This notebook will cover the basics of working with databases.

To do so, we will be working with a database called country.sqlite which you can download here. This database was developed by Dr. Eric Dunford for a previous iteration of this class. It contains a set of country-level information over years, and spread across different tables.

Using SQL through Pandas.

As we saw in the slides, we can use SQL in Python and build a connection with database systems through pandas. Here are te steps:

  • Establish a connection between your local python environment and the database stored locally

  • Write a SQL query

  • Execute the query

  • Pull the result and work with the result in that language

Let's see this in practice

In [1]:
## import pandas and sqlite3
import pandas as pd
import sqlite3
In [2]:
# Establish a connection with the database
conn = sqlite3.connect("country.sqlite")
In [3]:
# write a query
query = "select * from econ"
In [5]:
# just text, wuery
query
Out[5]:
'select * from econ'
In [13]:
# Use the connection with the read_sql() method
pd.read_sql(query,conn)
Out[13]:
country ccode year ln_gdppc ln_pop
0 Algeria 615.0 1997.0 7.385237 17.212929
1 Algeria 615.0 1998.0 7.370534 17.227837
2 Algeria 615.0 1999.0 7.365816 17.241908
3 Algeria 615.0 2000.0 7.471373 17.255405
4 Algeria 615.0 2001.0 7.457527 17.268419
... ... ... ... ... ...
910 Zimbabwe 552.0 2012.0 6.875401 16.504094
911 Zimbabwe 552.0 2013.0 6.933801 16.527188
912 Zimbabwe 552.0 2014.0 6.938386 16.550636
913 Zimbabwe 552.0 2015.0 6.940625 16.574092
914 Zimbabwe 552.0 2016.0 6.936417 16.597453

915 rows × 5 columns

As noticed before, the output is always a table. From now on, I will show you some different SQL commands and its applications

listing all tables of your connection

In [6]:
pd.read_sql("select name from sqlite_master where type='table'",conn)
Out[6]:
name
0 regime
1 life
2 econ
3 conflict
4 continents

Head top of the data

In [7]:
# build query
query = """ 
         select  * 
         from life  
         limit 5 
         """

#query database
pd.read_sql(query,conn)
Out[7]:
country ccode year infant_mort life_exp
0 Algeria 615.0 1997.0 35.5 69.323
1 Algeria 615.0 1998.0 34.9 69.745
2 Algeria 615.0 1999.0 34.4 70.183
3 Algeria 615.0 2000.0 33.9 70.640
4 Algeria 615.0 2001.0 33.3 71.116

Select specific variables

In [8]:
# build query
query = """select  country, year, polity 
         from regime 
           limit 5"""

#query database
pd.read_sql(query,conn)
Out[8]:
country year polity
0 Algeria 1997.0 -3.0
1 Algeria 1998.0 -3.0
2 Algeria 1999.0 -3.0
3 Algeria 2000.0 -3.0
4 Algeria 2001.0 -3.0

Row-wise manipulations

arranging

In [11]:
# build query
query = """select country, ccode,  year,  riots 
         from conflict 
         order by riots"""

#query database
pd.read_sql(query,conn).head()
Out[11]:
country ccode year riots
0 Algeria 615 1997.0 0
1 Algeria 615 1997.0 0
2 Algeria 615 1997.0 0
3 Algeria 615 1997.0 0
4 Algeria 615 1997.0 0

Arranging (descending order)

In [12]:
# build query

query = """select  country,  ccode,  year,  riots 
            from conflict 
            order by riots desc"""

#query database
pd.read_sql(query,conn).head()
Out[12]:
country ccode year riots
0 Egypt 651 2013.0 124
1 Burundi 516 2015.0 120
2 Egypt 651 2014.0 120
3 Tunisia 616 2018.0 105
4 Kenya 501 2007.0 102

Filtering

In [13]:
query = """select  country,  ccode,  year,  riots 
        from conflict 
        where country = 'Nigeria' """

#query database
pd.read_sql(query,conn).head()
Out[13]:
country ccode year riots
0 Nigeria 475 1997.0 3
1 Nigeria 475 1997.0 3
2 Nigeria 475 1997.0 4
3 Nigeria 475 1997.0 1
4 Nigeria 475 1997.0 4

Filtering multiple conditions (and)

In [16]:
query=  """
select  country,  ccode, year,  riots 
            from conflict 
            where country = 'Nigeria' and year >= 2000 
"""
#query database
pd.read_sql(query,conn).head()
Out[16]:
country ccode year riots
0 Nigeria 475 2000.0 2
1 Nigeria 475 2000.0 1
2 Nigeria 475 2000.0 6
3 Nigeria 475 2000.0 2
4 Nigeria 475 2000.0 4

Filtering with OR

In [17]:
query = "select  country,  ccode,  year,  riots, protests \
            from conflict \
            where riots > 20 or protests > 20 "

#query database
pd.read_sql(query,conn).head()
Out[17]:
country ccode year riots protests
0 Algeria 615 2001.0 23 1
1 Algeria 615 2001.0 32 9
2 Algeria 615 2001.0 26 2
3 Algeria 615 2011.0 82 19
4 Algeria 615 2011.0 12 23

Column-Wise Operations

Renaming variables

In [18]:
query = "select \
        country as COUNTRY_NAME, \
        year as YEAR, \
        polity as POLITY \
        from regime \
        limit 20"

#query database
pd.read_sql(query,conn)
Out[18]:
COUNTRY_NAME YEAR POLITY
0 Algeria 1997.0 -3.0
1 Algeria 1998.0 -3.0
2 Algeria 1999.0 -3.0
3 Algeria 2000.0 -3.0
4 Algeria 2001.0 -3.0
5 Algeria 2002.0 -3.0
6 Algeria 2003.0 -3.0
7 Algeria 2004.0 2.0
8 Algeria 2005.0 2.0
9 Algeria 2006.0 2.0
10 Algeria 2007.0 2.0
11 Algeria 2008.0 2.0
12 Algeria 2009.0 2.0
13 Algeria 2010.0 2.0
14 Algeria 2011.0 2.0
15 Algeria 2012.0 2.0
16 Algeria 2013.0 2.0
17 Algeria 2014.0 2.0
18 Algeria 2015.0 2.0
19 Algeria 2016.0 2.0

Mutating

In [19]:
query = "select country, ccode, year, riots, protests, \
         riots + protests as displays \
         from conflict \
         order by displays desc"

#query database
pd.read_sql(query,conn).head()
Out[19]:
country ccode year riots protests displays
0 Egypt 651 2011.0 55 270 325
1 Algeria 615 2019.0 0 302 302
2 Tunisia 616 2019.0 19 257 276
3 Egypt 651 2013.0 100 175 275
4 Ethiopia 530 2015.0 6 253 259

Mutating with ifelse logic

In [20]:
query = "select  country, ccode, year, protests, \
        (case when protests > 0 then 'Protest' \
            else 'No Protest' end) as protest_onset \
        from conflict "

#query database
pd.read_sql(query,conn)
Out[20]:
country ccode year protests protest_onset
0 Algeria 615 1997.0 0 No Protest
1 Algeria 615 1997.0 0 No Protest
2 Algeria 615 1997.0 0 No Protest
3 Algeria 615 1997.0 0 No Protest
4 Algeria 615 1997.0 0 No Protest
... ... ... ... ... ...
8699 Zimbabwe 552 2019.0 1 Protest
8700 Zimbabwe 552 2019.0 5 Protest
8701 Zimbabwe 552 2019.0 10 Protest
8702 Zimbabwe 552 2019.0 6 Protest
8703 Zimbabwe 552 2019.0 4 Protest

8704 rows × 5 columns

Mutating the variable type

In [21]:
query = "select  country,  ccode, \
    cast(year as integer) as year, \
    cast(riots as float) as riots, \
    protests \
    from conflict \
    order by riots desc \
    limit 5"

#query database
pd.read_sql(query,conn)
Out[21]:
country ccode year riots protests
0 Egypt 651 2013 124.0 123
1 Burundi 516 2015 120.0 83
2 Egypt 651 2014 120.0 75
3 Tunisia 616 2018 105.0 53
4 Kenya 501 2007 102.0 17

Summarization and Group by

Distinct Values

In [23]:
query = "select  \
         distinct country \
         from econ "

#query database
pd.read_sql(query,conn).head()
Out[23]:
country
0 Algeria
1 Angola
2 Benin
3 Burkina Faso
4 Botswana

Counting N observations

In [24]:
query = "select \
        count(*) as n_obs \
        from conflict "

#query database
pd.read_sql(query,conn)
Out[24]:
n_obs
0 8704

Counting N distinct

In [25]:
query = "select \
count(distinct country) as n_countries \
from conflict "

#query database
pd.read_sql(query,conn)
Out[25]:
n_countries
0 49

Numerical Summaries

In [26]:
query = "select \
         round(avg(riots),2) as ave_riots, \
         sum(riots) as total_riots, \
         min(riots) as min_riots, \
         max(riots) as max_riots \
        from conflict "

#query database
pd.read_sql(query,conn)
Out[26]:
ave_riots total_riots min_riots max_riots
0 2.46 21372 0 124

GROUP BY + Summarization

In [28]:
query = "select \
    country,  \
    year, \
    sum(riots) as riots, \
    sum(protests) as protests \
from conflict \
group by country, year "

#query database
pd.read_sql(query,conn).head()
Out[28]:
country year riots protests
0 Algeria 1997.0 0 0
1 Algeria 1998.0 0 1
2 Algeria 1999.0 0 0
3 Algeria 2000.0 0 2
4 Algeria 2001.0 95 17

GROUP BY + Counting

In [30]:
query = "select country, count(*) as n \
        from conflict \
        group by country "

#query database
pd.read_sql(query,conn).head()
Out[30]:
country n
0 Algeria 272
1 Angola 193
2 Benin 88
3 Botswana 48
4 Burkina Faso 146

Subqueries

Subqueries allows us to combine several different operations across multiple tables

In [32]:
query = " WITH algeria_a as ( \
            SELECT \
                country,\
                year,\
                'bombings' as type,\
                explosions_remote_violence as n_events\
            FROM conflict\
            WHERE country =  'Algeria'),\
                algeria_b as (\
            SELECT \
                country,\
                year,\
                'vac' as type,\
                violence_against_civilians as n_events\
            FROM conflict\
            WHERE country =  'Algeria')\
        select * from algeria_a\
        UNION ALL\
        select * from algeria_b\
"
#query database
pd.read_sql(query,conn).head()
Out[32]:
country year type n_events
0 Algeria 1997.0 bombings 11
1 Algeria 1997.0 bombings 0
2 Algeria 1997.0 bombings 2
3 Algeria 1997.0 bombings 0
4 Algeria 1997.0 bombings 1

Joins

INNER Joining Two Table

In [34]:
query = """select 
    life.*,
    econ.ln_gdppc
from life
inner join econ on (life.ccode = econ.ccode and life.year=econ.year)
"""
#query database
pd.read_sql(query,conn)
Out[34]:
country ccode year infant_mort life_exp ln_gdppc
0 Algeria 615.0 1997.0 35.5 69.323 7.385237
1 Algeria 615.0 1998.0 34.9 69.745 7.370534
2 Algeria 615.0 1999.0 34.4 70.183 7.365816
3 Algeria 615.0 2000.0 33.9 70.640 7.471373
4 Algeria 615.0 2001.0 33.3 71.116 7.457527
... ... ... ... ... ... ...
910 Zimbabwe 552.0 2012.0 45.7 55.032 6.875401
911 Zimbabwe 552.0 2013.0 42.8 56.897 6.933801
912 Zimbabwe 552.0 2014.0 40.4 58.410 6.938386
913 Zimbabwe 552.0 2015.0 38.5 59.534 6.940625
914 Zimbabwe 552.0 2016.0 36.3 60.294 6.936417

915 rows × 6 columns

using aliases when joining

In [93]:
query = """select 
    a.*,
    b.ln_gdppc
from life a
inner join econ b on (a.ccode = b.ccode and a.year=b.year)
"""
#query database
pd.read_sql(query,conn)
Out[93]:
country ccode year infant_mort life_exp ln_gdppc
0 Algeria 615.0 1997.0 35.5 69.323 7.385237
1 Algeria 615.0 1998.0 34.9 69.745 7.370534
2 Algeria 615.0 1999.0 34.4 70.183 7.365816
3 Algeria 615.0 2000.0 33.9 70.640 7.471373
4 Algeria 615.0 2001.0 33.3 71.116 7.457527
... ... ... ... ... ... ...
910 Zimbabwe 552.0 2012.0 45.7 55.032 6.875401
911 Zimbabwe 552.0 2013.0 42.8 56.897 6.933801
912 Zimbabwe 552.0 2014.0 40.4 58.410 6.938386
913 Zimbabwe 552.0 2015.0 38.5 59.534 6.940625
914 Zimbabwe 552.0 2016.0 36.3 60.294 6.936417

915 rows × 6 columns

LEFT Joining Two Table

In [94]:
"""select 
    a.*,
    b.ln_gdppc
from life a
left join econ b on (a.ccode = b.ccode and a.year=b.year)
"""
#query database
pd.read_sql(query,conn)
Out[94]:
country ccode year infant_mort life_exp ln_gdppc
0 Algeria 615.0 1997.0 35.5 69.323 7.385237
1 Algeria 615.0 1998.0 34.9 69.745 7.370534
2 Algeria 615.0 1999.0 34.4 70.183 7.365816
3 Algeria 615.0 2000.0 33.9 70.640 7.471373
4 Algeria 615.0 2001.0 33.3 71.116 7.457527
... ... ... ... ... ... ...
910 Zimbabwe 552.0 2012.0 45.7 55.032 6.875401
911 Zimbabwe 552.0 2013.0 42.8 56.897 6.933801
912 Zimbabwe 552.0 2014.0 40.4 58.410 6.938386
913 Zimbabwe 552.0 2015.0 38.5 59.534 6.940625
914 Zimbabwe 552.0 2016.0 36.3 60.294 6.936417

915 rows × 6 columns

Joining on more than one table

In [96]:
query = """select 
    a.*,
    b.ln_gdppc,
    c.regime_type
from life a
left join econ b on (a.ccode = b.ccode and a.year=b.year)
left join regime c on (a.ccode = c.ccode and a.year=c.year)
"""
#query database
pd.read_sql(query,conn)
Out[96]:
country ccode year infant_mort life_exp ln_gdppc regime_type
0 Algeria 615.0 1997.0 35.5 69.323 7.385237 competitive authoritarian
1 Algeria 615.0 1998.0 34.9 69.745 7.370534 competitive authoritarian
2 Algeria 615.0 1999.0 34.4 70.183 7.365816 competitive authoritarian
3 Algeria 615.0 2000.0 33.9 70.640 7.471373 competitive authoritarian
4 Algeria 615.0 2001.0 33.3 71.116 7.457527 competitive authoritarian
... ... ... ... ... ... ... ...
910 Zimbabwe 552.0 2012.0 45.7 55.032 6.875401 competitive authoritarian
911 Zimbabwe 552.0 2013.0 42.8 56.897 6.933801 competitive authoritarian
912 Zimbabwe 552.0 2014.0 40.4 58.410 6.938386 competitive authoritarian
913 Zimbabwe 552.0 2015.0 38.5 59.534 6.940625 competitive authoritarian
914 Zimbabwe 552.0 2016.0 36.3 60.294 6.936417 competitive authoritarian

915 rows × 7 columns

In [1]:
!jupyter nbconvert _week_13_sql.ipynb --to html --template classic
[NbConvertApp] Converting notebook _week_13_sql.ipynb to html
[NbConvertApp] Writing 377641 bytes to _week_13_sql.html