library(tidyverse) # For data transformation tools
library(chinadevfin3) # For Chinese development finance data
5 Week 3: Find Actionable Insights, Quickly (Pre-Class)
This pre-class preparation should take about 45-60 minutes to complete.
5.1 Overview
Now that you can create visualizations and automated reports, it’s time to learn how to transform your data to find meaningful insights. This week focuses on data transformation - the process of taking raw data and reshaping it to answer specific questions. We’ll use the tidyverse’s powerful dplyr
package, which makes complex data operations surprisingly intuitive.
5.1.1 Video Lecture
Watch this video lecture before our interactive session:
5.2 Learning Objectives
By completing this pre-class work, you will:
- Understand the core data transformation verbs in
dplyr
- Learn to chain operations together using the pipe operator
|>
- Begin thinking about data transformation patterns
- Practice with real Chinese development finance data
- Use AI tools to assist with data transformation tasks
5.3 Setup
Let’s get our workspace ready. First, create a new Quarto document for your notes:
# Create a new Quarto document
# File → New File → Quarto Document
# Save as "week_3_transformation_preclass.qmd" in your week_3/R folder
Load the packages we’ll need:
5.4 A Mini Dataset for Learning
Before diving into data transformation, let’s create a small dataset we’ll use for learning. This contains the two largest loans for five countries:
<- get_gcdf3_dataset() |>
mini_gcdf filter(
== "Yes",
recommended_for_aggregates == "Loan",
flow_type %in% c(
recipient "Angola",
"Zambia",
"Venezuela",
"Indonesia",
"Pakistan"
)|>
) group_by(recipient) |>
slice_max(
order_by = amount_constant_usd_2021,
n = 2
|>
) select(
recipient,
recipient_region,
sector_name,
commitment_year,
amount_constant_usd_2021 |>
) ungroup()
# Also create a dataset for year-over-year analysis
<- get_gcdf3_dataset() |>
angola_annual_flows filter(
== "Yes",
recommended_for_aggregates == "Loan",
flow_type == "Angola"
recipient |>
) group_by(commitment_year) |>
summarize(
total_amount = sum(amount_constant_usd_2021, na.rm = TRUE),
.groups = "drop"
)
# Look at our mini dataset
mini_gcdf
# A tibble: 10 × 5
recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
<chr> <chr> <chr> <int> <dbl>
1 Angola Africa ENERGY 2016 8147551108.
2 Angola Africa OTHER SOCI… 2010 3481817339.
3 Indonesia Asia INDUSTRY, … 2009 2853819092.
4 Indonesia Asia TRANSPORT … 2017 2743140983.
5 Pakistan Asia BANKING AN… 2020 4855813054.
6 Pakistan Asia BANKING AN… 2021 4651162791.
7 Venezuela America OTHER MULT… 2010 14402361186.
8 Venezuela America OTHER MULT… 2010 13927269358.
9 Zambia Africa ENERGY 2017 881870586.
10 Zambia Africa ENERGY 2017 881870586.
# ℹ abbreviated name: ¹amount_constant_usd_2021
Don’t worry if the code that created this dataset looks complex - by the end of this pre-class material, you’ll understand every line! For now, just notice that:
- We have 10 rows (2 loans each from 5 countries)
- Countries are from 3 different regions (Africa, Asia, America)
- Each loan has a sector, year, and amount
- The amounts are in constant 2021 USD
This small dataset will help us learn the fundamentals before working with the full GCDF database.
5.5 The Five Core verbs of Data Transformation
Think of data transformation as having five fundamental operations, just like basic arithmetic has addition, subtraction, multiplication, and division. In dplyr, these operations are:
- filter(): Pick rows based on their values
- arrange(): Change the order of rows
- select(): Pick columns by their names
- mutate(): Create new columns from existing ones
- summarize(): Collapse multiple rows into a single summary
Let’s explore each one using examples from Chinese development finance data.
5.5.1 Verb 1. filter()
: Subsetting Your Data
filter()
helps you focus on specific parts of your data. Think of it like a sieve that keeps only the rows you want:
# Using mini dataset: African loans over $1 billion
|>
mini_gcdf filter(
== "Africa",
recipient_region >= 1 * 1e9
amount_constant_usd_2021 )
# A tibble: 2 × 5
recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
<chr> <chr> <chr> <int> <dbl>
1 Angola Africa ENERGY 2016 8147551108.
2 Angola Africa OTHER SOCIA… 2010 3481817339.
# ℹ abbreviated name: ¹amount_constant_usd_2021
# Real world example: Large ODA-like projects
get_gcdf3_dataset() |>
filter(
== "ODA-like",
flow_class >= 100 * 1e6,
amount_constant_usd_2021 == "Yes"
recommended_for_aggregates )
# A tibble: 292 × 129
country_name iso3c country_or_regional aid_data_record_id
<chr> <chr> <chr> <int>
1 Sri Lanka LKA country 89483
2 Tajikistan TJK country 92674
3 Uzbekistan UZB country 92623
4 Cambodia KHM country 94984
5 North Korea PRK country 93442
6 North Korea PRK country 93443
7 North Korea PRK country 95607
8 Congo - Kinshasa COD country 92575
9 Ghana GHA country 92600
10 Kazakhstan KAZ country 92613
# ℹ 282 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
# aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
# recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
# commitment_year <int>, implementation_start_year <int>,
# completion_year <int>, title <chr>, description <chr>,
# staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
Common filtering operations you’ll use:
# Projects from recent years
get_gcdf3_dataset() |>
filter(commitment_year >= 2018)
# A tibble: 6,528 × 129
country_name iso3c country_or_regional aid_data_record_id
<chr> <chr> <chr> <int>
1 Afghanistan AFG country 94556
2 Afghanistan AFG country 94564
3 Afghanistan AFG country 94565
4 Afghanistan AFG country 94567
5 Afghanistan AFG country 94568
6 Afghanistan AFG country 94613
7 Afghanistan AFG country 94619
8 Afghanistan AFG country 95312
9 Afghanistan AFG country 95322
10 Afghanistan AFG country 95323
# ℹ 6,518 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
# aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
# recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
# commitment_year <int>, implementation_start_year <int>,
# completion_year <int>, title <chr>, description <chr>,
# staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
# Projects in specific countries
get_gcdf3_dataset() |>
filter(recipient %in% c("Angola", "Ethiopia", "Kenya"))
# A tibble: 963 × 129
country_name iso3c country_or_regional aid_data_record_id
<chr> <chr> <chr> <int>
1 Angola AGO country 92519
2 Angola AGO country 93104
3 Angola AGO country 93281
4 Angola AGO country 93286
5 Angola AGO country 95331
6 Ethiopia ETH country 91947
7 Ethiopia ETH country 92357
8 Ethiopia ETH country 95864
9 Ethiopia ETH country 95866
10 Ethiopia ETH country 95869
# ℹ 953 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
# aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
# recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
# commitment_year <int>, implementation_start_year <int>,
# completion_year <int>, title <chr>, description <chr>,
# staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
# Projects where we don't have an unknown (NA) commitment value
get_gcdf3_dataset() |>
filter(
!is.na(amount_constant_usd_2021),
== "Yes"
recommended_for_aggregates )
# A tibble: 10,671 × 129
country_name iso3c country_or_regional aid_data_record_id
<chr> <chr> <chr> <int>
1 Afghanistan AFG country 94556
2 Afghanistan AFG country 94564
3 Afghanistan AFG country 94565
4 Afghanistan AFG country 94567
5 Afghanistan AFG country 94568
6 Afghanistan AFG country 94619
7 Afghanistan AFG country 98007
8 Afghanistan AFG country 98008
9 Africa, regional <NA> regional 92947
10 Algeria DZA country 94808
# ℹ 10,661 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
# aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
# recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
# commitment_year <int>, implementation_start_year <int>,
# completion_year <int>, title <chr>, description <chr>,
# staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
==
: Exactly equals!=
: Does not equal>
,>=
: Greater than, Greater than or equal to<
,<=
: Less than, Less than or equal to%in%
: Is in a set of values!is.na()
: Is not missing&
: And (multiple conditions)|
: Or (either condition)
5.5.2 Verb 2. arrange()
: Ordering Your Data
arrange()
lets you sort your data. By default, it sorts in ascending order (smallest to largest):
# Using mini dataset: Sort by size (largest first)
|>
mini_gcdf arrange(desc(amount_constant_usd_2021))
# A tibble: 10 × 5
recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
<chr> <chr> <chr> <int> <dbl>
1 Venezuela America OTHER MULT… 2010 14402361186.
2 Venezuela America OTHER MULT… 2010 13927269358.
3 Angola Africa ENERGY 2016 8147551108.
4 Pakistan Asia BANKING AN… 2020 4855813054.
5 Pakistan Asia BANKING AN… 2021 4651162791.
6 Angola Africa OTHER SOCI… 2010 3481817339.
7 Indonesia Asia INDUSTRY, … 2009 2853819092.
8 Indonesia Asia TRANSPORT … 2017 2743140983.
9 Zambia Africa ENERGY 2017 881870586.
10 Zambia Africa ENERGY 2017 881870586.
# ℹ abbreviated name: ¹amount_constant_usd_2021
# Real world example: Sort projects by multiple columns
get_gcdf3_dataset() |>
filter(recommended_for_aggregates == "Yes") |>
arrange(
# First by country A-Z
recipient, desc(commitment_year) # Then by most recent year
)
# A tibble: 17,957 × 129
country_name iso3c country_or_regional aid_data_record_id
<chr> <chr> <chr> <int>
1 Afghanistan AFG country 94556
2 Afghanistan AFG country 94564
3 Afghanistan AFG country 94565
4 Afghanistan AFG country 94567
5 Afghanistan AFG country 94568
6 Afghanistan AFG country 94613
7 Afghanistan AFG country 94619
8 Afghanistan AFG country 95322
9 Afghanistan AFG country 95323
10 Afghanistan AFG country 95324
# ℹ 17,947 more rows
# ℹ 125 more variables: recommended_for_aggregates <chr>,
# aid_data_parent_id <chr>, umbrella <chr>, financier_country <chr>,
# recipient <chr>, recipient_iso_3 <chr>, recipient_region <chr>,
# commitment_year <int>, implementation_start_year <int>,
# completion_year <int>, title <chr>, description <chr>,
# staff_comments <chr>, status <chr>, intent <chr>, flow_type <chr>, …
Use desc()
to sort in descending order. When sorting by multiple columns, each one is used as a tie-breaker for the previous ones.
5.5.3 Verb 3. select()
: Choosing Columns
select()
helps you focus on specific variables. It’s particularly useful when you have datasets with many columns:
# Using mini dataset: Select key columns
|>
mini_gcdf select(
recipient,
commitment_year,
amount_constant_usd_2021 )
# A tibble: 10 × 3
recipient commitment_year amount_constant_usd_2021
<chr> <int> <dbl>
1 Angola 2016 8147551108.
2 Angola 2010 3481817339.
3 Indonesia 2009 2853819092.
4 Indonesia 2017 2743140983.
5 Pakistan 2020 4855813054.
6 Pakistan 2021 4651162791.
7 Venezuela 2010 14402361186.
8 Venezuela 2010 13927269358.
9 Zambia 2017 881870586.
10 Zambia 2017 881870586.
# Real world example: Select columns by pattern
get_gcdf3_dataset() |>
select(
starts_with("amount"),
contains("year")
)
# A tibble: 20,985 × 7
amount_original_currency amount_estimated amount_constant_usd_2021
<dbl> <chr> <dbl>
1 550000000 <NA> 7111456.
2 12600000 Yes 12600000
3 14400000 Yes 14400000
4 13000000 <NA> 13000000
5 7500000 <NA> 7500000
6 NA <NA> NA
7 3600000 Yes 3600000
8 30000000 <NA> 4651163.
9 NA <NA> NA
10 NA <NA> NA
# ℹ 20,975 more rows
# ℹ 4 more variables: amount_nominal_usd <dbl>, commitment_year <int>,
# implementation_start_year <int>, completion_year <int>
starts_with()
: Columns starting with a prefixends_with()
: Columns ending with a suffixcontains()
: Columns containing a stringmatches()
: Columns matching a regular expressioneverything()
: All remaining columns
5.5.4 Verb 4. mutate()
: Creating New Variables
mutate()
lets you create new columns based on existing ones. Let’s look at some examples:
# Using mini dataset: Calculate billions and shares
|>
mini_gcdf group_by(recipient_region) |>
mutate(
amount_bn = amount_constant_usd_2021 / 1e9,
share_of_region = amount_constant_usd_2021 / sum(amount_constant_usd_2021) * 100
|>
) ungroup()
# A tibble: 10 × 7
recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
<chr> <chr> <chr> <int> <dbl>
1 Angola Africa ENERGY 2016 8147551108.
2 Angola Africa OTHER SOCI… 2010 3481817339.
3 Indonesia Asia INDUSTRY, … 2009 2853819092.
4 Indonesia Asia TRANSPORT … 2017 2743140983.
5 Pakistan Asia BANKING AN… 2020 4855813054.
6 Pakistan Asia BANKING AN… 2021 4651162791.
7 Venezuela America OTHER MULT… 2010 14402361186.
8 Venezuela America OTHER MULT… 2010 13927269358.
9 Zambia Africa ENERGY 2017 881870586.
10 Zambia Africa ENERGY 2017 881870586.
# ℹ abbreviated name: ¹amount_constant_usd_2021
# ℹ 2 more variables: amount_bn <dbl>, share_of_region <dbl>
# Real world example: Year-over-year growth
|>
angola_annual_flows mutate(
prev_year_amount = lag(total_amount),
yoy_growth = (total_amount - prev_year_amount) / prev_year_amount * 100
)
# A tibble: 21 × 4
commitment_year total_amount prev_year_amount yoy_growth
<int> <dbl> <dbl> <dbl>
1 2001 15087227. NA NA
2 2002 489893161. 15087227. 3147.
3 2003 26796701. 489893161. -94.5
4 2004 989972579. 26796701. 3594.
5 2005 2857495673. 989972579. 189.
6 2006 2453696747. 2857495673. -14.1
7 2007 3895910151. 2453696747. 58.8
8 2008 198727773. 3895910151. -94.9
9 2009 3388534669. 198727773. 1605.
10 2010 4300740778. 3388534669. 26.9
# ℹ 11 more rows
In the regional shares example:
- Group by region so calculations happen within each region
- Convert amounts to billions (divide by 1e9)
- Calculate each project’s share of its regional total
- Remove grouping when done
In the growth example:
lag(total_amount)
gets previous year’s value- Calculate percent change from previous year
5.5.5 Verb 5. summarize()
: Creating Summaries
summarize()
collapses groups into single rows. This is especially powerful when combined with group_by()
:
# Using mini dataset: Regional summaries
|>
mini_gcdf group_by(recipient_region) |>
summarize(
total_amount_bn = sum(amount_constant_usd_2021) / 1e9,
project_count = n(),
avg_amount_bn = mean(amount_constant_usd_2021) / 1e9,
.groups = "drop"
)
# A tibble: 3 × 4
recipient_region total_amount_bn project_count avg_amount_bn
<chr> <dbl> <int> <dbl>
1 Africa 13.4 4 3.35
2 America 28.3 2 14.2
3 Asia 15.1 4 3.78
# Real world example: Annual lending by flow class
get_gcdf3_dataset() |>
filter(recommended_for_aggregates == "Yes") |>
group_by(commitment_year, flow_class) |>
summarize(
total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
project_count = n(),
.groups = "drop"
)
# A tibble: 66 × 4
commitment_year flow_class total_amount_bn project_count
<int> <chr> <dbl> <int>
1 2000 ODA-like 1.44 138
2 2000 OOF-like 3.99 34
3 2000 Vague (Official Finance) 0.0128 2
4 2001 ODA-like 3.50 175
5 2001 OOF-like 4.16 33
6 2001 Vague (Official Finance) 0.204 6
7 2002 ODA-like 2.27 199
8 2002 OOF-like 4.89 38
9 2002 Vague (Official Finance) 0.429 8
10 2003 ODA-like 3.52 236
# ℹ 56 more rows
Common summary functions:
sum()
: Total valuesmean()
: Averagemedian()
: Middle valuesd()
: Standard deviationn()
: Count rowsn_distinct()
: Count unique values
Always use na.rm = TRUE
when working with financial data! Missing values are common and can break your calculations if not handled properly.
5.6 Understanding Groups: A Powerful Way to Organize Analysis
If you’ve used Excel, you’re probably familiar with pivot tables - they let you organize data by categories and calculate summaries for each group. The group_by()
function in R serves a similar purpose but is even more powerful. Just like in Excel when you:
- Create a pivot table to see total lending by region
- Calculate what percent each project is of its country’s total
- Find the largest project in each sector
In R, group_by()
lets you do all this and more. Let’s explore how it works using our mini dataset.
5.6.1 Three Key Grouping Patterns
There are three main ways you’ll use grouping in your analysis:
- Summarize by Group: Calculate totals, averages, or counts for each group
- Calculate Within Groups: Create new columns based on group calculations
- Find Extremes Within Groups: Identify top/bottom values in each group
Let’s look at each pattern:
5.6.2 Pattern 1: Summarize by Group
First, let’s see what happens without grouping:
# Without grouping - one summary for everything
|>
mini_gcdf summarize(
total_amount_bn = sum(amount_constant_usd_2021) / 1e9,
avg_amount_bn = mean(amount_constant_usd_2021) / 1e9
)
# A tibble: 1 × 2
total_amount_bn avg_amount_bn
<dbl> <dbl>
1 56.8 5.68
# With grouping - summaries for each region
|>
mini_gcdf group_by(recipient_region) |>
summarize(
total_amount_bn = sum(amount_constant_usd_2021) / 1e9,
avg_amount_bn = mean(amount_constant_usd_2021) / 1e9,
.groups = "drop"
)
# A tibble: 3 × 3
recipient_region total_amount_bn avg_amount_bn
<chr> <dbl> <dbl>
1 Africa 13.4 3.35
2 America 28.3 14.2
3 Asia 15.1 3.78
When you group by recipient_region
, R essentially:
- Splits the data into three pieces (Africa, America, Asia)
- Runs the calculations separately on each piece
- Combines the results back into one table
This is just like choosing “Region” as the row variable in a pivot table!
5.6.3 Pattern 2: Calculate Within Groups
Sometimes you want to compare values within their group, like calculating each loan’s share of its regional total:
# Calculate share of regional total
|>
mini_gcdf group_by(recipient_region) |>
mutate(
region_total = sum(amount_constant_usd_2021),
share_of_region = amount_constant_usd_2021 / region_total * 100
|>
) select(recipient, recipient_region, amount_constant_usd_2021, share_of_region) |> # Just show relevant columns
ungroup()
# A tibble: 10 × 4
recipient recipient_region amount_constant_usd_2021 share_of_region
<chr> <chr> <dbl> <dbl>
1 Angola Africa 8147551108. 60.8
2 Angola Africa 3481817339. 26.0
3 Indonesia Asia 2853819092. 18.9
4 Indonesia Asia 2743140983. 18.2
5 Pakistan Asia 4855813054. 32.1
6 Pakistan Asia 4651162791. 30.8
7 Venezuela America 14402361186. 50.8
8 Venezuela America 13927269358. 49.2
9 Zambia Africa 881870586. 6.58
10 Zambia Africa 881870586. 6.58
For each region:
sum(amount_constant_usd_2021)
adds up all loans in that region- Each loan’s amount is divided by its region’s total
- The share will always be between 0 and 100% within each region
This is similar to Excel’s “Show Values As” → “% of Parent Row Total” in pivot tables!
5.6.4 Pattern 3: Find Extremes Within Groups
Often you want to find the largest or smallest values within each group:
# Largest loan in each region
|>
mini_gcdf group_by(recipient_region) |>
slice_max(order_by = amount_constant_usd_2021, n = 1) |>
ungroup()
# A tibble: 3 × 5
recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
<chr> <chr> <chr> <int> <dbl>
1 Angola Africa ENERGY 2016 8147551108.
2 Venezuela America OTHER MULTI… 2010 14402361186.
3 Pakistan Asia BANKING AND… 2020 4855813054.
# ℹ abbreviated name: ¹amount_constant_usd_2021
# Real world example: Top 3 loans by region
get_gcdf3_dataset() |>
filter(
== "Yes",
recommended_for_aggregates !is.na(amount_constant_usd_2021)
|>
) group_by(recipient_region) |>
slice_max(order_by = amount_constant_usd_2021, n = 3) |>
select(recipient_region, recipient, amount_constant_usd_2021, commitment_year) |>
ungroup()
# A tibble: 22 × 4
recipient_region recipient amount_constant_usd_…¹ commitment_year
<chr> <chr> <dbl> <int>
1 Africa Angola 8147551108. 2016
2 Africa Angola 3481817339. 2010
3 Africa Democratic Republic … 3094756963. 2008
4 America Argentina 21421774753. 2018
5 America Argentina 21270602003. 2019
6 America Argentina 21041856567. 2020
7 Asia Turkmenistan 6008040193. 2009
8 Asia Kazakhstan 5717197207. 2008
9 Asia Kazakhstan 5717197207. 2008
10 Europe Russia 37225206222. 2013
# ℹ 12 more rows
# ℹ abbreviated name: ¹amount_constant_usd_2021
For each region: 1. Sort loans by amount (largest to smallest) 2. Keep the top one (n = 1
) or top three (n = 3
) 3. Move on to the next region
This is like filtering a pivot table to show only the maximum value in each group!
5.6.5 The Importance of ungroup()
Notice how we often end with ungroup()
? This is important! When you group data:
- The grouping stays active until you explicitly remove it
- This can affect later calculations in unexpected ways
ungroup()
removes the grouping when you’re done with it
Let’s see what can go wrong:
# THIS IS WRONG! (still grouped when calculating overall_share)
|>
mini_gcdf group_by(recipient_region) |>
mutate(
# This gives regional share (correct)
region_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021),
# This gives same result because we're still grouped! (wrong)
overall_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021)
)
# A tibble: 10 × 7
# Groups: recipient_region [3]
recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
<chr> <chr> <chr> <int> <dbl>
1 Angola Africa ENERGY 2016 8147551108.
2 Angola Africa OTHER SOCI… 2010 3481817339.
3 Indonesia Asia INDUSTRY, … 2009 2853819092.
4 Indonesia Asia TRANSPORT … 2017 2743140983.
5 Pakistan Asia BANKING AN… 2020 4855813054.
6 Pakistan Asia BANKING AN… 2021 4651162791.
7 Venezuela America OTHER MULT… 2010 14402361186.
8 Venezuela America OTHER MULT… 2010 13927269358.
9 Zambia Africa ENERGY 2017 881870586.
10 Zambia Africa ENERGY 2017 881870586.
# ℹ abbreviated name: ¹amount_constant_usd_2021
# ℹ 2 more variables: region_share <dbl>, overall_share <dbl>
# THIS IS RIGHT! (ungroup before overall calculation)
|>
mini_gcdf group_by(recipient_region) |>
mutate(
region_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021)
|>
) ungroup() |>
mutate(
overall_share = amount_constant_usd_2021 / sum(amount_constant_usd_2021)
)
# A tibble: 10 × 7
recipient recipient_region sector_name commitment_year amount_constant_usd_…¹
<chr> <chr> <chr> <int> <dbl>
1 Angola Africa ENERGY 2016 8147551108.
2 Angola Africa OTHER SOCI… 2010 3481817339.
3 Indonesia Asia INDUSTRY, … 2009 2853819092.
4 Indonesia Asia TRANSPORT … 2017 2743140983.
5 Pakistan Asia BANKING AN… 2020 4855813054.
6 Pakistan Asia BANKING AN… 2021 4651162791.
7 Venezuela America OTHER MULT… 2010 14402361186.
8 Venezuela America OTHER MULT… 2010 13927269358.
9 Zambia Africa ENERGY 2017 881870586.
10 Zambia Africa ENERGY 2017 881870586.
# ℹ abbreviated name: ¹amount_constant_usd_2021
# ℹ 2 more variables: region_share <dbl>, overall_share <dbl>
- After
summarize()
: Usually automatic (but watch for warnings) - After
mutate()
: If you’re done with group calculations - After
slice_*()
: Almost always - When in doubt:
ungroup()
! It never hurts.
5.6.6 Real World Example: Time Series Analysis
Let’s apply these patterns to analyze year-over-year changes in Angola’s loan commitments:
# Calculate year-over-year changes
|>
angola_annual_flows mutate(
prev_year_amount = lag(total_amount),
yoy_change = (total_amount - prev_year_amount) / prev_year_amount * 100
|>
) filter(!is.na(yoy_change)) # Remove first year (no previous year to compare)
# A tibble: 20 × 4
commitment_year total_amount prev_year_amount yoy_change
<int> <dbl> <dbl> <dbl>
1 2002 489893161. 15087227. 3147.
2 2003 26796701. 489893161. -94.5
3 2004 989972579. 26796701. 3594.
4 2005 2857495673. 989972579. 189.
5 2006 2453696747. 2857495673. -14.1
6 2007 3895910151. 2453696747. 58.8
7 2008 198727773. 3895910151. -94.9
8 2009 3388534669. 198727773. 1605.
9 2010 4300740778. 3388534669. 26.9
10 2011 2764770648. 4300740778. -35.7
11 2012 1567120325. 2764770648. -43.3
12 2013 5155360714. 1567120325. 229.
13 2014 4980964328. 5155360714. -3.38
14 2015 6206188709. 4980964328. 24.6
15 2016 17920526346. 6206188709. 189.
16 2017 2999075287. 17920526346. -83.3
17 2018 4340735198. 2999075287. 44.7
18 2019 119578481. 4340735198. -97.2
19 2020 64776546. 119578481. -45.8
20 2021 79700000 64776546. 23.0
lag(total_amount)
gets the previous year’s value- Calculate percent change from previous year
- Remove the first year (which has no previous year)
This kind of analysis is common when looking at lending trends over time!
5.7 Common Transformation Patterns in Development Finance
Now that we understand both the basic operations and grouping, let’s look at some common patterns you’ll use when analyzing Chinese development finance data:
5.7.1 Pattern 1: Annual Flows By Region
This pattern helps understand how lending varies across regions and time:
get_gcdf3_dataset() |>
filter(recommended_for_aggregates == "Yes") |>
group_by(commitment_year, recipient_region) |>
summarize(
total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
project_count = n(),
avg_project_size_bn = mean(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
.groups = "drop"
|>
) arrange(recipient_region, commitment_year)
# A tibble: 147 × 5
commitment_year recipient_region total_amount_bn project_count
<int> <chr> <dbl> <int>
1 2000 Africa 1.35 80
2 2001 Africa 3.09 110
3 2002 Africa 2.68 126
4 2003 Africa 4.24 144
5 2004 Africa 2.64 150
6 2005 Africa 7.10 215
7 2006 Africa 7.42 274
8 2007 Africa 15.9 351
9 2008 Africa 12.4 295
10 2009 Africa 16.5 342
# ℹ 137 more rows
# ℹ 1 more variable: avg_project_size_bn <dbl>
5.7.2 Pattern 2: Portfolio Composition
Understanding the sectoral focus of lending:
get_gcdf3_dataset() |>
filter(
== "Yes",
recommended_for_aggregates >= 2018 # Focus on recent years
commitment_year |>
) group_by(sector_name) |>
summarize(
total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
project_count = n(),
avg_amount_bn = mean(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
.groups = "drop"
|>
) arrange(desc(total_amount_bn)) |>
slice_head(n = 10) # Top 10 sectors
# A tibble: 10 × 4
sector_name total_amount_bn project_count avg_amount_bn
<chr> <dbl> <int> <dbl>
1 BANKING AND FINANCIAL SERVICES 143. 163 0.897
2 INDUSTRY, MINING, CONSTRUCTION 56.5 205 0.304
3 TRANSPORT AND STORAGE 54.7 315 0.215
4 ENERGY 43.4 210 0.241
5 GENERAL BUDGET SUPPORT 24.0 43 0.572
6 COMMUNICATIONS 7.78 124 0.101
7 UNALLOCATED/UNSPECIFIED 7.06 70 0.116
8 BUSINESS AND OTHER SERVICES 6.80 78 0.101
9 OTHER MULTISECTOR 5.59 101 0.119
10 HEALTH 4.23 2314 0.00311
5.7.3 Pattern 3: Country Risk Analysis
Analyzing lending patterns for specific countries:
get_gcdf3_dataset() |>
filter(
== "Yes",
recommended_for_aggregates %in% c("Angola", "Kenya", "Ethiopia")
recipient |>
) group_by(recipient, flow_class) |>
summarize(
total_amount_bn = sum(amount_constant_usd_2021, na.rm = TRUE) / 1e9,
project_count = n(),
.groups = "drop"
|>
) arrange(recipient, desc(total_amount_bn))
# A tibble: 9 × 4
recipient flow_class total_amount_bn project_count
<chr> <chr> <dbl> <int>
1 Angola OOF-like 54.8 202
2 Angola Vague (Official Finance) 10.0 65
3 Angola ODA-like 0.310 101
4 Ethiopia OOF-like 18.0 64
5 Ethiopia ODA-like 3.09 187
6 Ethiopia Vague (Official Finance) 0.276 10
7 Kenya OOF-like 9.56 42
8 Kenya ODA-like 2.80 152
9 Kenya Vague (Official Finance) 0.884 14
5.8 Practice Exercises
Try these exercises to get comfortable with data transformation. Remember to use AI tools if you get stuck!
5.8.1 Exercise 1: Basic Filtering
Find all projects that are:
- ODA-like or OOF-like
- Committed between 2018-2021
- Worth at least $100 million
5.8.2 Exercise 2: Regional Analysis
For each region, calculate:
- Total lending volume
- Number of projects
- Average project size
- Number of recipient countries
5.8.3 Exercise 3: Sector Trends
Analyze how sector composition has changed:
- Compare 2013-2017 vs 2018-2021
- Look at both volume and project counts
- Focus on the top 5 sectors by volume
If you get stuck:
- Check the dplyr cheatsheet
- Ask AI tools for help
- Look at similar examples in this guide
- Post questions in our course Slack
5.9 Resources for Learning More
5.9.1 Essential References
- R for Data Science - Data Transformation
- Comprehensive guide to dplyr
- Many practical examples
- Free online!
- dplyr cheatsheet
- Quick reference for common operations
- Great to keep handy while working
5.9.2 Video Tutorials
- Animated versions of common dplyr functions
- Clear, beginner-friendly overview
- Shows live coding examples
- Perfect for visual learners
5.10 Next Steps
In our class session, we’ll:
- Review any questions about these concepts
- Practice more complex transformations
- Work with real analysis questions
- Learn some advanced dplyr features
Remember: The goal isn’t to memorize every function, but to understand the basic patterns of data transformation. With these five core verbs and the pipe operator, you can handle most analysis tasks!