5  Week 3: Find Actionable Insights, Quickly (Pre-Class)

This pre-class preparation should take about 45-60 minutes to complete.

5.1 Overview

Now that you can create visualizations and automated reports, it’s time to learn how to transform your data to find meaningful insights. This week focuses on data transformation - the process of taking raw data and reshaping it to answer specific questions. We’ll use the tidyverse’s powerful dplyr package, which makes complex data operations surprisingly intuitive.

5.1.1 Video Lecture

Watch this video lecture before our interactive session:

5.2 Learning Objectives

By completing this pre-class work, you will:

  1. Understand the core data transformation verbs in dplyr
  2. Learn to chain operations together using the pipe operator |>
  3. Begin thinking about data transformation patterns
  4. Practice with real Chinese development finance data
  5. Use AI tools to assist with data transformation tasks

5.3 Setup

Let’s get our workspace ready. First, create a new Quarto document for your notes:

# Create a new Quarto document
# File → New File → Quarto Document
# Save as "week_3_transformation_preclass.qmd" in your week_3/R folder

Load the packages we’ll need:

library(tidyverse)    # For data transformation tools
library(chinadevfin3) # For Chinese development finance data

5.4 A Mini Dataset for Learning

Before diving into data transformation, let’s create a small dataset we’ll use for learning. This contains the two largest loans for five countries:

mini_gcdf <- get_gcdf3_dataset() |> 
    filter(
        recommended_for_aggregates == "Yes",
        flow_type == "Loan",
        recipient %in% c(
            "Angola",
            "Zambia",
            "Venezuela",
            "Indonesia",
            "Pakistan"
        )
    ) |> 
    group_by(recipient) |> 
    slice_max(
        order_by = amount_constant_usd_2021,
        n = 2
    ) |> 
    select(
        recipient,
        recipient_region,
        sector_name,
        commitment_year,
        amount_constant_usd_2021 
    ) |> 
    ungroup()

# Also create a dataset for year-over-year analysis
angola_annual_flows <- get_gcdf3_dataset() |>
    filter(
        recommended_for_aggregates == "Yes",
        flow_type == "Loan",
        recipient == "Angola"
    ) |>
    group_by(commitment_year) |>
    summarize(
        total_amount = sum(amount_constant_usd_2021, na.rm = TRUE),
        .groups = "drop"
    )

# Look at our mini dataset
mini_gcdf
# A tibble: 10 × 5
   recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
   <chr>     <chr>            <chr>                 <int>                  <dbl>
 1 Angola    Africa           ENERGY                 2016            8147551108.
 2 Angola    Africa           OTHER SOCI…            2010            3481817339.
 3 Indonesia Asia             INDUSTRY, …            2009            2853819092.
 4 Indonesia Asia             TRANSPORT …            2017            2743140983.
 5 Pakistan  Asia             BANKING AN…            2020            4855813054.
 6 Pakistan  Asia             BANKING AN…            2021            4651162791.
 7 Venezuela America          OTHER MULT…            2010           14402361186.
 8 Venezuela America          OTHER MULT…            2010           13927269358.
 9 Zambia    Africa           ENERGY                 2017             881870586.
10 Zambia    Africa           ENERGY                 2017             881870586.
# ℹ abbreviated name: ¹​amount_constant_usd_2021

Don’t worry if the code that created this dataset looks complex - by the end of this pre-class material, you’ll understand every line! For now, just notice that:

  1. We have 10 rows (2 loans each from 5 countries)
  2. Countries are from 3 different regions (Africa, Asia, America)
  3. Each loan has a sector, year, and amount
  4. The amounts are in constant 2021 USD

This small dataset will help us learn the fundamentals before working with the full GCDF database.

5.5 The Five Core verbs of Data Transformation

Think of data transformation as having five fundamental operations, just like basic arithmetic has addition, subtraction, multiplication, and division. In dplyr, these operations are:

  1. filter(): Pick rows based on their values
  2. arrange(): Change the order of rows
  3. select(): Pick columns by their names
  4. mutate(): Create new columns from existing ones
  5. summarize(): Collapse multiple rows into a single summary

Let’s explore each one using examples from Chinese development finance data.

5.5.1 Verb 1. filter(): Subsetting Your Data

filter() helps you focus on specific parts of your data. Think of it like a sieve that keeps only the rows you want:

# Using mini dataset: African loans over $1 billion
mini_gcdf |>
  filter(
    recipient_region == "Africa",
    amount_constant_usd_2021 >= 1 * 1e9
  )
# A tibble: 2 × 5
  recipient recipient_region sector_name  commitment_year amount_constant_usd_…¹
  <chr>     <chr>            <chr>                  <int>                  <dbl>
1 Angola    Africa           ENERGY                  2016            8147551108.
2 Angola    Africa           OTHER SOCIA…            2010            3481817339.
# ℹ abbreviated name: ¹​amount_constant_usd_2021
# Real world example: Large ODA-like projects
get_gcdf3_dataset() |>
  filter(
    flow_class == "ODA-like",
    amount_constant_usd_2021 >= 100 * 1e6,
    recommended_for_aggregates == "Yes"
  )
# A tibble: 292 × 129
   country_name     iso3c country_or_regional aid_data_record_id
   <chr>            <chr> <chr>                            <int>
 1 Sri Lanka        LKA   country                          89483
 2 Tajikistan       TJK   country                          92674
 3 Uzbekistan       UZB   country                          92623
 4 Cambodia         KHM   country                          94984
 5 North Korea      PRK   country                          93442
 6 North Korea      PRK   country                          93443
 7 North Korea      PRK   country                          95607
 8 Congo - Kinshasa COD   country                          92575
 9 Ghana            GHA   country                          92600
10 Kazakhstan       KAZ   country                          92613
# ℹ 282 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
#   aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
#   recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
#   commitment_year <int>, implementation_start_year <int>,
#   completion_year <int>, title <chr>, description <chr>,
#   staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …

Common filtering operations you’ll use:

# Projects from recent years
get_gcdf3_dataset() |>
  filter(commitment_year >= 2018)
# A tibble: 6,528 × 129
   country_name iso3c country_or_regional aid_data_record_id
   <chr>        <chr> <chr>                            <int>
 1 Afghanistan  AFG   country                          94556
 2 Afghanistan  AFG   country                          94564
 3 Afghanistan  AFG   country                          94565
 4 Afghanistan  AFG   country                          94567
 5 Afghanistan  AFG   country                          94568
 6 Afghanistan  AFG   country                          94613
 7 Afghanistan  AFG   country                          94619
 8 Afghanistan  AFG   country                          95312
 9 Afghanistan  AFG   country                          95322
10 Afghanistan  AFG   country                          95323
# ℹ 6,518 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
#   aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
#   recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
#   commitment_year <int>, implementation_start_year <int>,
#   completion_year <int>, title <chr>, description <chr>,
#   staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
# Projects in specific countries
get_gcdf3_dataset() |>
  filter(recipient %in% c("Angola", "Ethiopia", "Kenya"))
# A tibble: 963 × 129
   country_name iso3c country_or_regional aid_data_record_id
   <chr>        <chr> <chr>                            <int>
 1 Angola       AGO   country                          92519
 2 Angola       AGO   country                          93104
 3 Angola       AGO   country                          93281
 4 Angola       AGO   country                          93286
 5 Angola       AGO   country                          95331
 6 Ethiopia     ETH   country                          91947
 7 Ethiopia     ETH   country                          92357
 8 Ethiopia     ETH   country                          95864
 9 Ethiopia     ETH   country                          95866
10 Ethiopia     ETH   country                          95869
# ℹ 953 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
#   aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
#   recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
#   commitment_year <int>, implementation_start_year <int>,
#   completion_year <int>, title <chr>, description <chr>,
#   staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
# Projects where we don't have an unknown (NA) commitment value
get_gcdf3_dataset() |>
  filter(
    !is.na(amount_constant_usd_2021),
    recommended_for_aggregates == "Yes"
  )
# A tibble: 10,671 × 129
   country_name     iso3c country_or_regional aid_data_record_id
   <chr>            <chr> <chr>                            <int>
 1 Afghanistan      AFG   country                          94556
 2 Afghanistan      AFG   country                          94564
 3 Afghanistan      AFG   country                          94565
 4 Afghanistan      AFG   country                          94567
 5 Afghanistan      AFG   country                          94568
 6 Afghanistan      AFG   country                          94619
 7 Afghanistan      AFG   country                          98007
 8 Afghanistan      AFG   country                          98008
 9 Africa, regional <NA>  regional                         92947
10 Algeria          DZA   country                          94808
# ℹ 10,661 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
#   aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
#   recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
#   commitment_year <int>, implementation_start_year <int>,
#   completion_year <int>, title <chr>, description <chr>,
#   staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
Logical Operators in filter()
  • ==: Exactly equals
  • !=: Does not equal
  • >, >=: Greater than, Greater than or equal to
  • <, <=: Less than, Less than or equal to
  • %in%: Is in a set of values
  • !is.na(): Is not missing
  • &: And (multiple conditions)
  • |: Or (either condition)

5.5.2 Verb 2. arrange(): Ordering Your Data

arrange() lets you sort your data. By default, it sorts in ascending order (smallest to largest):

# Using mini dataset: Sort by size (largest first)
mini_gcdf |>
  arrange(desc(amount_constant_usd_2021))
# A tibble: 10 × 5
   recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
   <chr>     <chr>            <chr>                 <int>                  <dbl>
 1 Venezuela America          OTHER MULT…            2010           14402361186.
 2 Venezuela America          OTHER MULT…            2010           13927269358.
 3 Angola    Africa           ENERGY                 2016            8147551108.
 4 Pakistan  Asia             BANKING AN…            2020            4855813054.
 5 Pakistan  Asia             BANKING AN…            2021            4651162791.
 6 Angola    Africa           OTHER SOCI…            2010            3481817339.
 7 Indonesia Asia             INDUSTRY, …            2009            2853819092.
 8 Indonesia Asia             TRANSPORT …            2017            2743140983.
 9 Zambia    Africa           ENERGY                 2017             881870586.
10 Zambia    Africa           ENERGY                 2017             881870586.
# ℹ abbreviated name: ¹​amount_constant_usd_2021
# Real world example: Sort projects by multiple columns
get_gcdf3_dataset() |>
  filter(recommended_for_aggregates == "Yes") |>
  arrange(
    recipient,  # First by country A-Z
    desc(commitment_year)  # Then by most recent year
  )
# A tibble: 17,957 × 129
   country_name iso3c country_or_regional aid_data_record_id
   <chr>        <chr> <chr>                            <int>
 1 Afghanistan  AFG   country                          94556
 2 Afghanistan  AFG   country                          94564
 3 Afghanistan  AFG   country                          94565
 4 Afghanistan  AFG   country                          94567
 5 Afghanistan  AFG   country                          94568
 6 Afghanistan  AFG   country                          94613
 7 Afghanistan  AFG   country                          94619
 8 Afghanistan  AFG   country                          95322
 9 Afghanistan  AFG   country                          95323
10 Afghanistan  AFG   country                          95324
# ℹ 17,947 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
#   aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
#   recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
#   commitment_year <int>, implementation_start_year <int>,
#   completion_year <int>, title <chr>, description <chr>,
#   staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
Note

Use desc() to sort in descending order. When sorting by multiple columns, each one is used as a tie-breaker for the previous ones.

5.5.3 Verb 3. select(): Choosing Columns

select() helps you focus on specific variables. It’s particularly useful when you have datasets with many columns:

# Using mini dataset: Select key columns
mini_gcdf |>
  select(
    recipient,
    commitment_year,
    amount_constant_usd_2021
  )
# A tibble: 10 × 3
   recipient commitment_year amount_constant_usd_2021
   <chr>               <int>                    <dbl>
 1 Angola               2016              8147551108.
 2 Angola               2010              3481817339.
 3 Indonesia            2009              2853819092.
 4 Indonesia            2017              2743140983.
 5 Pakistan             2020              4855813054.
 6 Pakistan             2021              4651162791.
 7 Venezuela            2010             14402361186.
 8 Venezuela            2010             13927269358.
 9 Zambia               2017               881870586.
10 Zambia               2017               881870586.
# Real world example: Select columns by pattern
get_gcdf3_dataset() |>
  select(
    starts_with("amount"),
    contains("year")
  )
# A tibble: 20,985 × 7
   amount_original_currency amount_estimated amount_constant_usd_2021
                      <dbl> <chr>                               <dbl>
 1                550000000 <NA>                             7111456.
 2                 12600000 Yes                             12600000 
 3                 14400000 Yes                             14400000 
 4                 13000000 <NA>                            13000000 
 5                  7500000 <NA>                             7500000 
 6                       NA <NA>                                  NA 
 7                  3600000 Yes                              3600000 
 8                 30000000 <NA>                             4651163.
 9                       NA <NA>                                  NA 
10                       NA <NA>                                  NA 
# ℹ 20,975 more rows
# ℹ 4 more variables: amount_nominal_usd <dbl>, commitment_year <int>,
#   implementation_start_year <int>, completion_year <int>
Helpful select() Helpers
  • starts_with(): Columns starting with a prefix
  • ends_with(): Columns ending with a suffix
  • contains(): Columns containing a string
  • matches(): Columns matching a regular expression
  • everything(): All remaining columns

5.5.4 Verb 4. mutate(): Creating New Variables

mutate() lets you create new columns based on existing ones. Let’s look at some examples:

# Using mini dataset: Calculate billions and shares
mini_gcdf |>
  group_by(recipient_region) |>
  mutate(
    amount_bn = amount_constant_usd_2021 / 1e9,
    share_of_region = amount_constant_usd_2021 / sum(amount_constant_usd_2021) * 100
  ) |>
  ungroup()
# A tibble: 10 × 7
   recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
   <chr>     <chr>            <chr>                 <int>                  <dbl>
 1 Angola    Africa           ENERGY                 2016            8147551108.
 2 Angola    Africa           OTHER SOCI…            2010            3481817339.
 3 Indonesia Asia             INDUSTRY, …            2009            2853819092.
 4 Indonesia Asia             TRANSPORT …            2017            2743140983.
 5 Pakistan  Asia             BANKING AN…            2020            4855813054.
 6 Pakistan  Asia             BANKING AN…            2021            4651162791.
 7 Venezuela America          OTHER MULT…            2010           14402361186.
 8 Venezuela America          OTHER MULT…            2010           13927269358.
 9 Zambia    Africa           ENERGY                 2017             881870586.
10 Zambia    Africa           ENERGY                 2017             881870586.
# ℹ abbreviated name: ¹​amount_constant_usd_2021
# ℹ 2 more variables: amount_bn <dbl>, share_of_region <dbl>
# Real world example: Year-over-year growth
angola_annual_flows |>
  mutate(
    prev_year_amount = lag(total_amount),
    yoy_growth = (total_amount - prev_year_amount) / prev_year_amount * 100
  )
# A tibble: 21 × 4
   commitment_year total_amount prev_year_amount yoy_growth
             <int>        <dbl>            <dbl>      <dbl>
 1            2001    15087227.              NA        NA  
 2            2002   489893161.        15087227.     3147. 
 3            2003    26796701.       489893161.      -94.5
 4            2004   989972579.        26796701.     3594. 
 5            2005  2857495673.       989972579.      189. 
 6            2006  2453696747.      2857495673.      -14.1
 7            2007  3895910151.      2453696747.       58.8
 8            2008   198727773.      3895910151.      -94.9
 9            2009  3388534669.       198727773.     1605. 
10            2010  4300740778.      3388534669.       26.9
# ℹ 11 more rows
What’s Happening Here?

In the regional shares example:

  1. Group by region so calculations happen within each region
  2. Convert amounts to billions (divide by 1e9)
  3. Calculate each project’s share of its regional total
  4. Remove grouping when done

In the growth example:

  1. lag(total_amount) gets previous year’s value
  2. Calculate percent change from previous year

5.5.5 Verb 5. summarize(): Creating Summaries

summarize() collapses groups into single rows. This is especially powerful when combined with group_by():

# Using mini dataset: Regional summaries
mini_gcdf |>
  group_by(recipient_region) |>
  summarize(
    total_amount_bn = sum(amount_constant_usd_2021) / 1e9,
    project_count = n(),
    avg_amount_bn = mean(amount_constant_usd_2021) / 1e9,
    .groups = "drop"
  )
# A tibble: 3 × 4
  recipient_region total_amount_bn project_count avg_amount_bn
  <chr>                      <dbl>         <int>         <dbl>
1 Africa                      13.4             4          3.35
2 America                     28.3             2         14.2 
3 Asia                        15.1             4          3.78
# Real world example: Annual lending by flow class
get_gcdf3_dataset() |>
  filter(recommended_for_aggregates == "Yes") |>
  group_by(commitment_year, flow_class) |>
  summarize(
    total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
    project_count = n(),
    .groups = "drop"
  )
# A tibble: 66 × 4
   commitment_year flow_class               total_amount_bn project_count
             <int> <chr>                              <dbl>         <int>
 1            2000 ODA-like                          1.44             138
 2            2000 OOF-like                          3.99              34
 3            2000 Vague (Official Finance)          0.0128             2
 4            2001 ODA-like                          3.50             175
 5            2001 OOF-like                          4.16              33
 6            2001 Vague (Official Finance)          0.204              6
 7            2002 ODA-like                          2.27             199
 8            2002 OOF-like                          4.89              38
 9            2002 Vague (Official Finance)          0.429              8
10            2003 ODA-like                          3.52             236
# ℹ 56 more rows

Common summary functions:

  • sum(): Total values
  • mean(): Average
  • median(): Middle value
  • sd(): Standard deviation
  • n(): Count rows
  • n_distinct(): Count unique values
Important

Always use na.rm = TRUE when working with financial data! Missing values are common and can break your calculations if not handled properly.

5.6 Understanding Groups: A Powerful Way to Organize Analysis

If you’ve used Excel, you’re probably familiar with pivot tables - they let you organize data by categories and calculate summaries for each group. The group_by() function in R serves a similar purpose but is even more powerful. Just like in Excel when you:

  • Create a pivot table to see total lending by region
  • Calculate what percent each project is of its country’s total
  • Find the largest project in each sector

In R, group_by() lets you do all this and more. Let’s explore how it works using our mini dataset.

5.6.1 Three Key Grouping Patterns

There are three main ways you’ll use grouping in your analysis:

  1. Summarize by Group: Calculate totals, averages, or counts for each group
  2. Calculate Within Groups: Create new columns based on group calculations
  3. Find Extremes Within Groups: Identify top/bottom values in each group

Let’s look at each pattern:

5.6.2 Pattern 1: Summarize by Group

First, let’s see what happens without grouping:

# Without grouping - one summary for everything
mini_gcdf |>
  summarize(
    total_amount_bn = sum(amount_constant_usd_2021) / 1e9,
    avg_amount_bn = mean(amount_constant_usd_2021) / 1e9
  )
# A tibble: 1 × 2
  total_amount_bn avg_amount_bn
            <dbl>         <dbl>
1            56.8          5.68
# With grouping - summaries for each region
mini_gcdf |>
  group_by(recipient_region) |>
  summarize(
    total_amount_bn = sum(amount_constant_usd_2021) / 1e9,
    avg_amount_bn = mean(amount_constant_usd_2021) / 1e9,
    .groups = "drop"
  )
# A tibble: 3 × 3
  recipient_region total_amount_bn avg_amount_bn
  <chr>                      <dbl>         <dbl>
1 Africa                      13.4          3.35
2 America                     28.3         14.2 
3 Asia                        15.1          3.78
What’s Happening Here?

When you group by recipient_region, R essentially:

  1. Splits the data into three pieces (Africa, America, Asia)
  2. Runs the calculations separately on each piece
  3. Combines the results back into one table

This is just like choosing “Region” as the row variable in a pivot table!

5.6.3 Pattern 2: Calculate Within Groups

Sometimes you want to compare values within their group, like calculating each loan’s share of its regional total:

# Calculate share of regional total
mini_gcdf |>
  group_by(recipient_region) |>
  mutate(
    region_total = sum(amount_constant_usd_2021),
    share_of_region = amount_constant_usd_2021 / region_total * 100
  ) |>
  select(recipient, recipient_region, amount_constant_usd_2021, share_of_region) |>  # Just show relevant columns
  ungroup()
# A tibble: 10 × 4
   recipient recipient_region amount_constant_usd_2021 share_of_region
   <chr>     <chr>                               <dbl>           <dbl>
 1 Angola    Africa                        8147551108.           60.8 
 2 Angola    Africa                        3481817339.           26.0 
 3 Indonesia Asia                          2853819092.           18.9 
 4 Indonesia Asia                          2743140983.           18.2 
 5 Pakistan  Asia                          4855813054.           32.1 
 6 Pakistan  Asia                          4651162791.           30.8 
 7 Venezuela America                      14402361186.           50.8 
 8 Venezuela America                      13927269358.           49.2 
 9 Zambia    Africa                         881870586.            6.58
10 Zambia    Africa                         881870586.            6.58
What’s Happening Here?

For each region:

  1. sum(amount_constant_usd_2021) adds up all loans in that region
  2. Each loan’s amount is divided by its region’s total
  3. The share will always be between 0 and 100% within each region

This is similar to Excel’s “Show Values As” → “% of Parent Row Total” in pivot tables!

5.6.4 Pattern 3: Find Extremes Within Groups

Often you want to find the largest or smallest values within each group:

# Largest loan in each region
mini_gcdf |>
  group_by(recipient_region) |>
  slice_max(order_by = amount_constant_usd_2021, n = 1) |>
  ungroup()
# A tibble: 3 × 5
  recipient recipient_region sector_name  commitment_year amount_constant_usd_…¹
  <chr>     <chr>            <chr>                  <int>                  <dbl>
1 Angola    Africa           ENERGY                  2016            8147551108.
2 Venezuela America          OTHER MULTI…            2010           14402361186.
3 Pakistan  Asia             BANKING AND…            2020            4855813054.
# ℹ abbreviated name: ¹​amount_constant_usd_2021
# Real world example: Top 3 loans by region
get_gcdf3_dataset() |>
  filter(
    recommended_for_aggregates == "Yes",
    !is.na(amount_constant_usd_2021)
  ) |>
  group_by(recipient_region) |>
  slice_max(order_by = amount_constant_usd_2021, n = 3) |>
  select(recipient_region, recipient, amount_constant_usd_2021, commitment_year) |>
  ungroup()
# A tibble: 22 × 4
   recipient_region recipient             amount_constant_usd_…¹ commitment_year
   <chr>            <chr>                                  <dbl>           <int>
 1 Africa           Angola                           8147551108.            2016
 2 Africa           Angola                           3481817339.            2010
 3 Africa           Democratic Republic …            3094756963.            2008
 4 America          Argentina                       21421774753.            2018
 5 America          Argentina                       21270602003.            2019
 6 America          Argentina                       21041856567.            2020
 7 Asia             Turkmenistan                     6008040193.            2009
 8 Asia             Kazakhstan                       5717197207.            2008
 9 Asia             Kazakhstan                       5717197207.            2008
10 Europe           Russia                          37225206222.            2013
# ℹ 12 more rows
# ℹ abbreviated name: ¹​amount_constant_usd_2021
What’s Happening Here?

For each region: 1. Sort loans by amount (largest to smallest) 2. Keep the top one (n = 1) or top three (n = 3) 3. Move on to the next region

This is like filtering a pivot table to show only the maximum value in each group!

5.6.5 The Importance of ungroup()

Notice how we often end with ungroup()? This is important! When you group data:

  1. The grouping stays active until you explicitly remove it
  2. This can affect later calculations in unexpected ways
  3. ungroup() removes the grouping when you’re done with it

Let’s see what can go wrong:

# THIS IS WRONG! (still grouped when calculating overall_share)
mini_gcdf |>
  group_by(recipient_region) |>
  mutate(
    # This gives regional share (correct)
    region_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021),
    # This gives same result because we're still grouped! (wrong)
    overall_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021)
  )
# A tibble: 10 × 7
# Groups:   recipient_region [3]
   recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
   <chr>     <chr>            <chr>                 <int>                  <dbl>
 1 Angola    Africa           ENERGY                 2016            8147551108.
 2 Angola    Africa           OTHER SOCI…            2010            3481817339.
 3 Indonesia Asia             INDUSTRY, …            2009            2853819092.
 4 Indonesia Asia             TRANSPORT …            2017            2743140983.
 5 Pakistan  Asia             BANKING AN…            2020            4855813054.
 6 Pakistan  Asia             BANKING AN…            2021            4651162791.
 7 Venezuela America          OTHER MULT…            2010           14402361186.
 8 Venezuela America          OTHER MULT…            2010           13927269358.
 9 Zambia    Africa           ENERGY                 2017             881870586.
10 Zambia    Africa           ENERGY                 2017             881870586.
# ℹ abbreviated name: ¹​amount_constant_usd_2021
# ℹ 2 more variables: region_share <dbl>, overall_share <dbl>
# THIS IS RIGHT! (ungroup before overall calculation)
mini_gcdf |>
  group_by(recipient_region) |>
  mutate(
    region_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021)
  ) |>
  ungroup() |>
  mutate(
    overall_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021)
  )
# A tibble: 10 × 7
   recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
   <chr>     <chr>            <chr>                 <int>                  <dbl>
 1 Angola    Africa           ENERGY                 2016            8147551108.
 2 Angola    Africa           OTHER SOCI…            2010            3481817339.
 3 Indonesia Asia             INDUSTRY, …            2009            2853819092.
 4 Indonesia Asia             TRANSPORT …            2017            2743140983.
 5 Pakistan  Asia             BANKING AN…            2020            4855813054.
 6 Pakistan  Asia             BANKING AN…            2021            4651162791.
 7 Venezuela America          OTHER MULT…            2010           14402361186.
 8 Venezuela America          OTHER MULT…            2010           13927269358.
 9 Zambia    Africa           ENERGY                 2017             881870586.
10 Zambia    Africa           ENERGY                 2017             881870586.
# ℹ abbreviated name: ¹​amount_constant_usd_2021
# ℹ 2 more variables: region_share <dbl>, overall_share <dbl>
When to ungroup()
  • After summarize(): Usually automatic (but watch for warnings)
  • After mutate(): If you’re done with group calculations
  • After slice_*(): Almost always
  • When in doubt: ungroup()! It never hurts.

5.6.6 Real World Example: Time Series Analysis

Let’s apply these patterns to analyze year-over-year changes in Angola’s loan commitments:

# Calculate year-over-year changes
angola_annual_flows |>
  mutate(
    prev_year_amount = lag(total_amount),
    yoy_change = (total_amount - prev_year_amount) / prev_year_amount * 100
  ) |>
  filter(!is.na(yoy_change))  # Remove first year (no previous year to compare)
# A tibble: 20 × 4
   commitment_year total_amount prev_year_amount yoy_change
             <int>        <dbl>            <dbl>      <dbl>
 1            2002   489893161.        15087227.    3147.  
 2            2003    26796701.       489893161.     -94.5 
 3            2004   989972579.        26796701.    3594.  
 4            2005  2857495673.       989972579.     189.  
 5            2006  2453696747.      2857495673.     -14.1 
 6            2007  3895910151.      2453696747.      58.8 
 7            2008   198727773.      3895910151.     -94.9 
 8            2009  3388534669.       198727773.    1605.  
 9            2010  4300740778.      3388534669.      26.9 
10            2011  2764770648.      4300740778.     -35.7 
11            2012  1567120325.      2764770648.     -43.3 
12            2013  5155360714.      1567120325.     229.  
13            2014  4980964328.      5155360714.      -3.38
14            2015  6206188709.      4980964328.      24.6 
15            2016 17920526346.      6206188709.     189.  
16            2017  2999075287.     17920526346.     -83.3 
17            2018  4340735198.      2999075287.      44.7 
18            2019   119578481.      4340735198.     -97.2 
19            2020    64776546.       119578481.     -45.8 
20            2021    79700000         64776546.      23.0 
What’s Happening Here?
  1. lag(total_amount) gets the previous year’s value
  2. Calculate percent change from previous year
  3. Remove the first year (which has no previous year)

This kind of analysis is common when looking at lending trends over time!

5.7 Common Transformation Patterns in Development Finance

Now that we understand both the basic operations and grouping, let’s look at some common patterns you’ll use when analyzing Chinese development finance data:

5.7.1 Pattern 1: Annual Flows By Region

This pattern helps understand how lending varies across regions and time:

get_gcdf3_dataset() |>
  filter(recommended_for_aggregates == "Yes") |>
  group_by(commitment_year, recipient_region) |>
  summarize(
    total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
    project_count = n(),
    avg_project_size_bn = mean(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
    .groups = "drop"
  ) |>
  arrange(recipient_region, commitment_year)
# A tibble: 147 × 5
   commitment_year recipient_region total_amount_bn project_count
             <int> <chr>                      <dbl>         <int>
 1            2000 Africa                      1.35            80
 2            2001 Africa                      3.09           110
 3            2002 Africa                      2.68           126
 4            2003 Africa                      4.24           144
 5            2004 Africa                      2.64           150
 6            2005 Africa                      7.10           215
 7            2006 Africa                      7.42           274
 8            2007 Africa                     15.9            351
 9            2008 Africa                     12.4            295
10            2009 Africa                     16.5            342
# ℹ 137 more rows
# ℹ 1 more variable: avg_project_size_bn <dbl>

5.7.2 Pattern 2: Portfolio Composition

Understanding the sectoral focus of lending:

get_gcdf3_dataset() |>
  filter(
    recommended_for_aggregates == "Yes",
    commitment_year >= 2018  # Focus on recent years
  ) |>
  group_by(sector_name) |>
  summarize(
    total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
    project_count = n(),
    avg_amount_bn = mean(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
    .groups = "drop"
  ) |>
  arrange(desc(total_amount_bn)) |>
  slice_head(n = 10)  # Top 10 sectors
# A tibble: 10 × 4
   sector_name                    total_amount_bn project_count avg_amount_bn
   <chr>                                    <dbl>         <int>         <dbl>
 1 BANKING AND FINANCIAL SERVICES          143.             163       0.897  
 2 INDUSTRY, MINING, CONSTRUCTION           56.5            205       0.304  
 3 TRANSPORT AND STORAGE                    54.7            315       0.215  
 4 ENERGY                                   43.4            210       0.241  
 5 GENERAL BUDGET SUPPORT                   24.0             43       0.572  
 6 COMMUNICATIONS                            7.78           124       0.101  
 7 UNALLOCATED/UNSPECIFIED                   7.06            70       0.116  
 8 BUSINESS AND OTHER SERVICES               6.80            78       0.101  
 9 OTHER MULTISECTOR                         5.59           101       0.119  
10 HEALTH                                    4.23          2314       0.00311

5.7.3 Pattern 3: Country Risk Analysis

Analyzing lending patterns for specific countries:

get_gcdf3_dataset() |>
  filter(
    recommended_for_aggregates == "Yes",
    recipient %in% c("Angola", "Kenya", "Ethiopia")
  ) |>
  group_by(recipient, flow_class) |>
  summarize(
    total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
    project_count = n(),
    .groups = "drop"
  ) |>
  arrange(recipient, desc(total_amount_bn))
# A tibble: 9 × 4
  recipient flow_class               total_amount_bn project_count
  <chr>     <chr>                              <dbl>         <int>
1 Angola    OOF-like                          54.8             202
2 Angola    Vague (Official Finance)          10.0              65
3 Angola    ODA-like                           0.310           101
4 Ethiopia  OOF-like                          18.0              64
5 Ethiopia  ODA-like                           3.09            187
6 Ethiopia  Vague (Official Finance)           0.276            10
7 Kenya     OOF-like                           9.56             42
8 Kenya     ODA-like                           2.80            152
9 Kenya     Vague (Official Finance)           0.884            14

5.8 Practice Exercises

Try these exercises to get comfortable with data transformation. Remember to use AI tools if you get stuck!

5.8.1 Exercise 1: Basic Filtering

Find all projects that are:

  • ODA-like or OOF-like
  • Committed between 2018-2021
  • Worth at least $100 million

5.8.2 Exercise 2: Regional Analysis

For each region, calculate:

  • Total lending volume
  • Number of projects
  • Average project size
  • Number of recipient countries

5.9 Resources for Learning More

5.9.1 Essential References

  1. R for Data Science - Data Transformation
    • Comprehensive guide to dplyr
    • Many practical examples
    • Free online!
  2. dplyr cheatsheet
    • Quick reference for common operations
    • Great to keep handy while working

5.9.2 Video Tutorials

  1. Animated versions of common dplyr functions
    • Clear, beginner-friendly overview
    • Shows live coding examples
    • Perfect for visual learners

5.10 Next Steps

In our class session, we’ll:

  1. Review any questions about these concepts
  2. Practice more complex transformations
  3. Work with real analysis questions
  4. Learn some advanced dplyr features

Remember: The goal isn’t to memorize every function, but to understand the basic patterns of data transformation. With these five core verbs and the pipe operator, you can handle most analysis tasks!