7  Week 4: Import & Tidy Your Data (Pre-Class)

This pre-class preparation should take about 45-60 minutes to complete.

7.1 Overview

Getting data into R and preparing it for analysis is often your first challenge in any project. In this pre-class session, we’ll focus on importing data from spreadsheets and establishing good practices for data cleaning. You’ll learn reliable workflows that make your analysis more reproducible and easier to maintain.

7.1.1 Video Lecture

Watch this video lecture before our interactive session:

7.2 Learning Objectives

By completing this pre-class work, you will:

  1. Learn to import data from Excel and CSV files into R
  2. Set up organized project structures
  3. Establish reliable data cleaning workflows
  4. Standardize common variables like dates and country names
  5. Practice with real Chinese development finance data

7.3 The Power of Clean Data

7.3.1 Why Data Cleaning Matters

Data cleaning might not be anyone’s idea of fun, but here’s a secret: getting good at data cleaning is one of the highest-return skills you can develop as a researcher. Why? Because while lots of people can run sophisticated analyses on clean datasets, far fewer people can reliably turn messy real-world data into analysis-ready information.

At AidData, this matters even more. AidData’s mission isn’t just to analyze existing datasets - you create new ones that reveal previously hidden patterns in Chinese development finance. The ability to clean and standardize messy data is core to this mission.

7.3.2 The Tidyverse Advantage

The tidyverse provides an extraordinarily powerful toolkit for importing and cleaning data, with specialized packages designed specifically for common cleaning tasks:

7.3.2.1 Data Import Tools

These are just the basics. There’s a whole universe of tidyverse-style data import packages for all varieties of file formats and APIs (“There’s a package for that…”).

7.3.2.2 Data Cleaning Specialists

All of these packages follow consistent principles and are designed for humans to use. The R for Data Science (2e) book has excellent chapters on each:

7.3.3 The Compounding Value of Clean Data

Here’s why mastering data cleaning is worth your time:

  1. Unique Insights: When everyone works with the same clean datasets, it’s hard to find unique patterns. The ability to clean messy data gives you access to information others might miss.

  2. Reproducible Work: Good data cleaning isn’t just about getting the data right once - it’s about creating reproducible pipelines that can handle new data as it arrives.

  3. Time Investment: While cleaning data takes time upfront, having clean, reliable data saves countless hours of troubleshooting and redoing analyses later.

  4. Competitive Advantage: In research, the quality of your inputs often matters more than the sophistication of your analysis. Being good at data cleaning lets you work with sources others might avoid.

7.3.4 What We’ll Learn

In the following sections, you’ll learn to:

  1. Import data from various sources
  2. Create reliable cleaning pipelines
  3. Handle common data issues like:
    • Inconsistent text formats
    • Messy dates and times
    • Different monetary formats
    • Variant country names

While data cleaning may never be the most exciting part of research, by the end of this section, you’ll have the tools to make it a manageable, reliable part of your workflow.

7.4 Setup

Let’s get our workspace ready:

  1. Create a new Quarto document for your notes:
# File → New File → Quarto Document
# Save as "week_4_import_preclass.qmd" in your week_4/R folder
  1. Create folders for organizing data:
dir.create("data-raw", showWarnings = FALSE)  # For original data
dir.create("data", showWarnings = FALSE)      # For cleaned data
  1. Download the AidData Critical Minerals Dataset and save it to your data-raw folder.
Why Two Data Folders?
  • data-raw/: Store original data exactly as received
    • Never modify these files
    • Serves as your “source of truth”
    • Makes your work reproducible
  • data/: Store cleaned, analysis-ready data
    • Modified through documented R code
    • Ready for analysis
    • Can always recreate from raw data
  1. Load required packages:
library(tidyverse)    # Core data science tools
library(readxl)       # For reading Excel files
library(readr)        # For reading CSV files
library(janitor)      # For cleaning column names
library(countrycode)  # For standardizing country names
library(here)        # For relative filepaths

7.5 Core Concepts

7.5.1 Organizing Your Data Pipeline

Think of data cleaning as a multi-step recipe. Just like cooking, you want to:

  1. Get your ingredients (import raw data, save in data-raw)
  2. Prep them properly (clean data)
  3. Create something useful (analysis-ready data, save in data for further use)

Let’s see how this works with real data.

7.5.2 Reading Data from Spreadsheets

The two main functions for reading spreadsheet data are:

  • readxl::read_excel() for Excel files
  • readr::read_csv() for CSV files

Let’s try reading our Critical Minerals data:

# Read the Excel file
minerals_raw <- read_excel(
  here(
    "data-raw",
    "AidData_Chinese_Financing_for_Transition_Minerals_Dataset_Version_1.0.xlsx"
  ),
  sheet = "Financial_Contribution",
  na = c("", "NA", "N/A", "#N/A", "NULL"),  # Handle missing values
  guess_max = 20000  # Look at more rows when guessing types
)

# Quick look at what we imported
glimpse(minerals_raw)
Rows: 137
Columns: 128
$ `AidData Record ID`                                <dbl> 95747, 95748, 39557…
$ `AidData Parent ID`                                <chr> "228", "228", "695"…
$ `Loan Event ID`                                    <chr> "101", "201", "301"…
$ `Loan Event Description`                           <chr> "2010 $200 million …
$ `Financier Country`                                <chr> "China (People's Re…
$ Recipient                                          <chr> "Kazakhstan", "Kaza…
$ `Recipient ISO-3`                                  <chr> "KAZ", "KAZ", "KAZ"…
$ `Recipient Region`                                 <chr> "Asia", "Asia", "As…
$ `Commitment Year`                                  <dbl> 2010, 2012, 2011, 2…
$ `Implementation Start Year`                        <dbl> 2014, 2009, 2013, 2…
$ `Completion Year`                                  <dbl> NA, NA, 2015, 2015,…
$ `Mining Site`                                      <chr> "Abyz Copper Mine a…
$ `Mining Site ID`                                   <dbl> 1, 2, 3, 3, 3, 3, 4…
$ `Investors Ownership ID`                           <dbl> 1, 2, 3, 3, 3, 3, 4…
$ `Transition Minerals`                              <chr> "Copper", "Copper",…
$ Title                                              <chr> "CDB provides $200 …
$ Description                                        <chr> "In June 2009, Chin…
$ `Staff Comments`                                   <chr> "1. The loan’s prec…
$ Status                                             <chr> "Implementation", "…
$ Intent                                             <chr> "Commercial", "Mixe…
$ `Flow Type`                                        <chr> "Loan", "Loan", "Lo…
$ `OECD ODA Concessionality Threshold`               <dbl> 0.25, 0.25, 0.25, 0…
$ `Flow Class`                                       <chr> "OOF-like", "OOF-li…
$ `Sector Code`                                      <dbl> 320, 320, 320, 320,…
$ `Sector Name`                                      <chr> "INDUSTRY, MINING, …
$ Infrastructure                                     <chr> "Yes", "Yes", "Yes"…
$ `Funding Agencies`                                 <chr> "China Development …
$ `Funding Agencies Type`                            <chr> "State-owned Policy…
$ Cofinanced                                         <chr> "No", "No", "No", "…
$ `Cofinancing Agencies`                             <chr> NA, NA, NA, NA, NA,…
$ `Cofinancing Agencies Type`                        <chr> NA, NA, NA, NA, NA,…
$ `Direct Receiving Agencies`                        <chr> "KAZ Minerals Finan…
$ `Direct Receiving Agencies Type`                   <chr> "Other Joint Ventur…
$ `Indirect Receiving Agencies`                      <chr> NA, NA, NA, NA, "KA…
$ `Indirect Receiving Agencies Type`                 <chr> NA, NA, NA, NA, "Ot…
$ `On-Lending`                                       <chr> NA, NA, NA, NA, "Ye…
$ `Implementing Agencies`                            <chr> NA, NA, "ABB Group|…
$ `Implementing Agencies Type`                       <chr> NA, NA, "Other Priv…
$ `Operator/Owner`                                   <chr> "Kazakhmys Corporat…
$ `Operator/Owner Type`                              <chr> "Recipient Private …
$ Investors                                          <chr> "Kazakhmys PLC", "K…
$ `Investors Type`                                   <chr> "Other Private Sect…
$ `Investors Ownership Percentages`                  <chr> "100", "100", "100"…
$ `Controlling Shareholder`                          <chr> "Kazakhmys PLC", "K…
$ `Controlling Shareholder Type`                     <chr> "Other Private Sect…
$ `Guarantee Provided`                               <chr> "Yes", "Yes", "Yes"…
$ Guarantor                                          <chr> "KAZ Minerals PLC (…
$ `Guarantor Agency Type`                            <chr> "Other Private Sect…
$ `Insurance Provided`                               <chr> NA, NA, NA, NA, NA,…
$ `Insurance Provider`                               <chr> NA, NA, NA, NA, NA,…
$ `Insurance Provider Agency Type`                   <chr> NA, NA, NA, NA, NA,…
$ `Collateralized/Securitized`                       <chr> NA, "Yes", "Yes", "…
$ `Collateral Provider`                              <chr> NA, "KAZ Minerals F…
$ `Collateral Provider Agency Type`                  <chr> NA, "Other Joint Ve…
$ `Security Agent/Collateral Agent`                  <chr> NA, NA, NA, NA, NA,…
$ `Security Agent/Collateral Agent Type`             <chr> NA, NA, NA, NA, NA,…
$ Collateral                                         <chr> NA, "Borrower to pr…
$ `Amount (Original Currency)`                       <dbl> 200000000, 20000000…
$ `Original Currency`                                <chr> "USD", "USD", "CNY"…
$ `Amount Estimated`                                 <chr> NA, NA, NA, NA, NA,…
$ `Amount (Constant USD 2021)`                       <dbl> 278545387, 23482471…
$ `Amount (Nominal USD)`                             <dbl> 200000000, 20000000…
$ `Financial Distress`                               <chr> "Yes", "Yes", NA, N…
$ `Commitment Date (MM/DD/YYYY)`                     <dttm> 2010-01-12, 2012-0…
$ `Commitment Date Estimated`                        <chr> "No", "No", "No", "…
$ `Planned Implementation Start Date (MM/DD/YYYY)`   <dttm> NA, NA, NA, NA, NA…
$ `Actual Implementation Start Date (MM/DD/YYYY)`    <dttm> 2014-01-01, 2009-0…
$ `Actual Implementation Start Date Estimated`       <chr> "No", "No", NA, NA,…
$ `Deviation from Planned Implementation Start Date` <dbl> NA, NA, NA, NA, NA,…
$ `Planned Completion Date (MM/DD/YYYY)`             <dttm> NA, NA, NA, NA, NA…
$ `Actual Completion Date (MM/DD/YYYY)`              <dttm> NA, NA, 2015-12-01…
$ `Actual Completion Date Estimated`                 <chr> "No", "No", NA, NA,…
$ `Deviation from Planned Completion Date`           <dbl> NA, NA, NA, NA, NA,…
$ Maturity                                           <dbl> 13.50, 13.50, 15.00…
$ `Interest Rate`                                    <dbl> 5.199, 5.536, 4.540…
$ `Fixed/Variable Interest Rate`                     <chr> "Variable", "Variab…
$ `Variable Interest Reference Rate`                 <chr> "LIBOR", "LIBOR", "…
$ `Variable Interest Detail`                         <chr> NA, NA, NA, NA, "6-…
$ `Variable Interest Margin`                         <dbl> 4.8, 4.8, NA, 4.2, …
$ `Grace Period`                                     <dbl> 3, 3, 3, 3, NA, NA,…
$ `Management Fee`                                   <dbl> 1.5, 1.5, NA, NA, N…
$ `Commitment Fee`                                   <lgl> NA, NA, NA, NA, NA,…
$ `Insurance Fee (Percent)`                          <lgl> NA, NA, NA, NA, NA,…
$ `Insurance Fee (Nominal USD)`                      <lgl> NA, NA, NA, NA, NA,…
$ `Default Interest Rate`                            <lgl> NA, NA, NA, NA, NA,…
$ `First Loan Repayment Date`                        <dttm> 2013-01-11, 2015-0…
$ `Last Loan Repayment Date`                         <dttm> 2023-07-10, 2025-1…
$ `Grant Element (OECD Cash-Flow)`                   <dbl> 24.4630, 22.6562, 2…
$ `Grant Element (OECD Grant-Equiv)`                 <dbl> 4.4782, 2.3634, 29.…
$ `Grant Element (IMF)`                              <dbl> 0.0000, 0.0000, 2.7…
$ `Number of Lenders`                                <chr> "Bilateral Loan", "…
$ `Export Buyer's Credit`                            <chr> NA, NA, NA, NA, NA,…
$ `Supplier’s Credit/Export Seller’s Credit`         <chr> NA, NA, NA, NA, NA,…
$ `Interest-Free Loan`                               <chr> NA, NA, NA, NA, NA,…
$ Refinancing                                        <chr> NA, NA, NA, NA, NA,…
$ `Investment Project Loan`                          <chr> "Yes", "Yes", "Yes"…
$ `M&A`                                              <chr> NA, NA, NA, NA, NA,…
$ `Working Capital`                                  <chr> NA, NA, NA, NA, NA,…
$ EPCF                                               <lgl> NA, NA, NA, NA, NA,…
$ Lease                                              <lgl> NA, NA, NA, NA, NA,…
$ `FXSL/BOP`                                         <lgl> NA, NA, NA, NA, NA,…
$ `CC IRS`                                           <lgl> NA, NA, NA, NA, NA,…
$ RCF                                                <chr> NA, NA, NA, NA, NA,…
$ GCL                                                <chr> NA, NA, NA, NA, NA,…
$ PBC                                                <lgl> NA, NA, NA, NA, NA,…
$ `PxF/Commodity Prepayment`                         <chr> NA, NA, NA, NA, NA,…
$ `Inter-Bank Loan`                                  <chr> NA, NA, NA, NA, "Ye…
$ `Overseas Project Contracting Loan`                <lgl> NA, NA, NA, NA, NA,…
$ DPA                                                <chr> NA, NA, NA, NA, NA,…
$ `Project Finance`                                  <chr> NA, NA, "Yes", "Yes…
$ `Involving Multilateral`                           <lgl> NA, NA, NA, NA, NA,…
$ `Involving Non-Chinese Financier`                  <chr> "No", "No", "No", "…
$ `Short-Term`                                       <chr> NA, NA, NA, NA, NA,…
$ Rescue                                             <lgl> NA, NA, NA, NA, NA,…
$ `JV/SPV Host Government Ownership`                 <chr> "No Host Government…
$ `JV/SPV Chinese Government Ownership`              <chr> "No Chinese Governm…
$ `Level of Public Liability`                        <chr> "Central government…
$ `Total Source Count`                               <dbl> 9, 11, 17, 19, 10, …
$ `Official Source Count`                            <dbl> 2, 2, 11, 12, 6, 12…
$ `Source URLs`                                      <chr> "https://www.proact…
$ `Source Titles`                                    <chr> "Kazakhmys signs ne…
$ `Source Publishers`                                <chr> "Proactive Investor…
$ `Source Resource Types`                            <chr> "Media Report|Imple…
$ `Contact Name`                                     <chr> NA, NA, NA, NA, NA,…
$ `Contact Position`                                 <chr> NA, NA, NA, NA, NA,…
$ `ODA Eligible Recipient`                           <chr> "Yes", "Yes", "Yes"…
$ `OECD ODA Income Group`                            <chr> "Upper middle incom…
$ `Location Narrative`                               <chr> "The project develo…
Key Import Arguments
  1. na =: What values should be treated as missing
  2. guess_max =: How many rows to check when determining column types
  3. sheet =: Which Excel sheet to read
  4. range =: Specific cells to read

These help handle common import challenges like: - Different representations of missing data - Incorrect column type detection - Multiple sheets in one file

Read the Documentation!

While we’re covering key arguments, the readxl and readr packages have many more options. Taking time to read the documentation (read_excel(), read_csv()) will save you hours of troubleshooting later.

You don’t need to memorize all options - just know they exist and where to find them when needed.

7.6 A Systematic Approach to Data Cleaning

7.6.1 Step 1: Clean Column Names & Inspect Data

First, we always want consistent, programming-friendly column names. glimpse() your new clean names, and look for data that might need to be cleaned. Note any variables where the data type displayed does not match what it logically should be (e.g. is a numeric variable show with data type <chr>?)

# Clean the names
minerals_clean <- minerals_raw |>
  clean_names()

# See the difference
minerals_clean |> glimpse()
Rows: 137
Columns: 128
$ aid_data_record_id                               <dbl> 95747, 95748, 39557, …
$ aid_data_parent_id                               <chr> "228", "228", "695", …
$ loan_event_id                                    <chr> "101", "201", "301", …
$ loan_event_description                           <chr> "2010 $200 million lo…
$ financier_country                                <chr> "China (People's Repu…
$ recipient                                        <chr> "Kazakhstan", "Kazakh…
$ recipient_iso_3                                  <chr> "KAZ", "KAZ", "KAZ", …
$ recipient_region                                 <chr> "Asia", "Asia", "Asia…
$ commitment_year                                  <dbl> 2010, 2012, 2011, 201…
$ implementation_start_year                        <dbl> 2014, 2009, 2013, 201…
$ completion_year                                  <dbl> NA, NA, 2015, 2015, 2…
$ mining_site                                      <chr> "Abyz Copper Mine and…
$ mining_site_id                                   <dbl> 1, 2, 3, 3, 3, 3, 4, …
$ investors_ownership_id                           <dbl> 1, 2, 3, 3, 3, 3, 4, …
$ transition_minerals                              <chr> "Copper", "Copper", "…
$ title                                            <chr> "CDB provides $200 mi…
$ description                                      <chr> "In June 2009, China …
$ staff_comments                                   <chr> "1. The loan’s precis…
$ status                                           <chr> "Implementation", "Im…
$ intent                                           <chr> "Commercial", "Mixed"…
$ flow_type                                        <chr> "Loan", "Loan", "Loan…
$ oecd_oda_concessionality_threshold               <dbl> 0.25, 0.25, 0.25, 0.2…
$ flow_class                                       <chr> "OOF-like", "OOF-like…
$ sector_code                                      <dbl> 320, 320, 320, 320, 3…
$ sector_name                                      <chr> "INDUSTRY, MINING, CO…
$ infrastructure                                   <chr> "Yes", "Yes", "Yes", …
$ funding_agencies                                 <chr> "China Development Ba…
$ funding_agencies_type                            <chr> "State-owned Policy B…
$ cofinanced                                       <chr> "No", "No", "No", "No…
$ cofinancing_agencies                             <chr> NA, NA, NA, NA, NA, N…
$ cofinancing_agencies_type                        <chr> NA, NA, NA, NA, NA, N…
$ direct_receiving_agencies                        <chr> "KAZ Minerals Finance…
$ direct_receiving_agencies_type                   <chr> "Other Joint Venture/…
$ indirect_receiving_agencies                      <chr> NA, NA, NA, NA, "KAZ …
$ indirect_receiving_agencies_type                 <chr> NA, NA, NA, NA, "Othe…
$ on_lending                                       <chr> NA, NA, NA, NA, "Yes"…
$ implementing_agencies                            <chr> NA, NA, "ABB Group|Al…
$ implementing_agencies_type                       <chr> NA, NA, "Other Privat…
$ operator_owner                                   <chr> "Kazakhmys Corporatio…
$ operator_owner_type                              <chr> "Recipient Private Se…
$ investors                                        <chr> "Kazakhmys PLC", "Kaz…
$ investors_type                                   <chr> "Other Private Sector…
$ investors_ownership_percentages                  <chr> "100", "100", "100", …
$ controlling_shareholder                          <chr> "Kazakhmys PLC", "Kaz…
$ controlling_shareholder_type                     <chr> "Other Private Sector…
$ guarantee_provided                               <chr> "Yes", "Yes", "Yes", …
$ guarantor                                        <chr> "KAZ Minerals PLC (Fo…
$ guarantor_agency_type                            <chr> "Other Private Sector…
$ insurance_provided                               <chr> NA, NA, NA, NA, NA, N…
$ insurance_provider                               <chr> NA, NA, NA, NA, NA, N…
$ insurance_provider_agency_type                   <chr> NA, NA, NA, NA, NA, N…
$ collateralized_securitized                       <chr> NA, "Yes", "Yes", "Ye…
$ collateral_provider                              <chr> NA, "KAZ Minerals Fin…
$ collateral_provider_agency_type                  <chr> NA, "Other Joint Vent…
$ security_agent_collateral_agent                  <chr> NA, NA, NA, NA, NA, N…
$ security_agent_collateral_agent_type             <chr> NA, NA, NA, NA, NA, N…
$ collateral                                       <chr> NA, "Borrower to prov…
$ amount_original_currency                         <dbl> 200000000, 200000000,…
$ original_currency                                <chr> "USD", "USD", "CNY", …
$ amount_estimated                                 <chr> NA, NA, NA, NA, NA, N…
$ amount_constant_usd_2021                         <dbl> 278545387, 234824712,…
$ amount_nominal_usd                               <dbl> 200000000, 200000000,…
$ financial_distress                               <chr> "Yes", "Yes", NA, NA,…
$ commitment_date_mm_dd_yyyy                       <dttm> 2010-01-12, 2012-06-…
$ commitment_date_estimated                        <chr> "No", "No", "No", "No…
$ planned_implementation_start_date_mm_dd_yyyy     <dttm> NA, NA, NA, NA, NA, …
$ actual_implementation_start_date_mm_dd_yyyy      <dttm> 2014-01-01, 2009-01-…
$ actual_implementation_start_date_estimated       <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_implementation_start_date <dbl> NA, NA, NA, NA, NA, N…
$ planned_completion_date_mm_dd_yyyy               <dttm> NA, NA, NA, NA, NA, …
$ actual_completion_date_mm_dd_yyyy                <dttm> NA, NA, 2015-12-01, …
$ actual_completion_date_estimated                 <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_completion_date           <dbl> NA, NA, NA, NA, NA, N…
$ maturity                                         <dbl> 13.50, 13.50, 15.00, …
$ interest_rate                                    <dbl> 5.199, 5.536, 4.540, …
$ fixed_variable_interest_rate                     <chr> "Variable", "Variable…
$ variable_interest_reference_rate                 <chr> "LIBOR", "LIBOR", "LI…
$ variable_interest_detail                         <chr> NA, NA, NA, NA, "6-Mo…
$ variable_interest_margin                         <dbl> 4.8, 4.8, NA, 4.2, 4.…
$ grace_period                                     <dbl> 3, 3, 3, 3, NA, NA, N…
$ management_fee                                   <dbl> 1.5, 1.5, NA, NA, NA,…
$ commitment_fee                                   <lgl> NA, NA, NA, NA, NA, N…
$ insurance_fee_percent                            <lgl> NA, NA, NA, NA, NA, N…
$ insurance_fee_nominal_usd                        <lgl> NA, NA, NA, NA, NA, N…
$ default_interest_rate                            <lgl> NA, NA, NA, NA, NA, N…
$ first_loan_repayment_date                        <dttm> 2013-01-11, 2015-06-…
$ last_loan_repayment_date                         <dttm> 2023-07-10, 2025-12-…
$ grant_element_oecd_cash_flow                     <dbl> 24.4630, 22.6562, 29.…
$ grant_element_oecd_grant_equiv                   <dbl> 4.4782, 2.3634, 29.44…
$ grant_element_imf                                <dbl> 0.0000, 0.0000, 2.790…
$ number_of_lenders                                <chr> "Bilateral Loan", "Bi…
$ export_buyers_credit                             <chr> NA, NA, NA, NA, NA, N…
$ supplier_s_credit_export_seller_s_credit         <chr> NA, NA, NA, NA, NA, N…
$ interest_free_loan                               <chr> NA, NA, NA, NA, NA, N…
$ refinancing                                      <chr> NA, NA, NA, NA, NA, N…
$ investment_project_loan                          <chr> "Yes", "Yes", "Yes", …
$ m_a                                              <chr> NA, NA, NA, NA, NA, N…
$ working_capital                                  <chr> NA, NA, NA, NA, NA, N…
$ epcf                                             <lgl> NA, NA, NA, NA, NA, N…
$ lease                                            <lgl> NA, NA, NA, NA, NA, N…
$ fxsl_bop                                         <lgl> NA, NA, NA, NA, NA, N…
$ cc_irs                                           <lgl> NA, NA, NA, NA, NA, N…
$ rcf                                              <chr> NA, NA, NA, NA, NA, N…
$ gcl                                              <chr> NA, NA, NA, NA, NA, N…
$ pbc                                              <lgl> NA, NA, NA, NA, NA, N…
$ px_f_commodity_prepayment                        <chr> NA, NA, NA, NA, NA, N…
$ inter_bank_loan                                  <chr> NA, NA, NA, NA, "Yes"…
$ overseas_project_contracting_loan                <lgl> NA, NA, NA, NA, NA, N…
$ dpa                                              <chr> NA, NA, NA, NA, NA, N…
$ project_finance                                  <chr> NA, NA, "Yes", "Yes",…
$ involving_multilateral                           <lgl> NA, NA, NA, NA, NA, N…
$ involving_non_chinese_financier                  <chr> "No", "No", "No", "No…
$ short_term                                       <chr> NA, NA, NA, NA, NA, N…
$ rescue                                           <lgl> NA, NA, NA, NA, NA, N…
$ jv_spv_host_government_ownership                 <chr> "No Host Government O…
$ jv_spv_chinese_government_ownership              <chr> "No Chinese Governmen…
$ level_of_public_liability                        <chr> "Central government-g…
$ total_source_count                               <dbl> 9, 11, 17, 19, 10, 13…
$ official_source_count                            <dbl> 2, 2, 11, 12, 6, 12, …
$ source_ur_ls                                     <chr> "https://www.proactiv…
$ source_titles                                    <chr> "Kazakhmys signs new …
$ source_publishers                                <chr> "Proactive Investors|…
$ source_resource_types                            <chr> "Media Report|Impleme…
$ contact_name                                     <chr> NA, NA, NA, NA, NA, N…
$ contact_position                                 <chr> NA, NA, NA, NA, NA, N…
$ oda_eligible_recipient                           <chr> "Yes", "Yes", "Yes", …
$ oecd_oda_income_group                            <chr> "Upper middle income"…
$ location_narrative                               <chr> "The project develope…
Why Clean Column Names?

Raw data often comes with inconsistent column names:

  • Spaces (“Project Name”)
  • Special characters (“Amount ($)”)
  • Inconsistent capitalization (“projectName”, “Project_name”)
  • Numbers at start (“2021_amount”)

clean_names():

  • Converts to lowercase
  • Replaces spaces/special chars with underscores
  • Makes names programming-friendly
  • Creates consistent style

This matters because:

  • Reduces coding errors
  • Makes autocomplete work better
  • Prevents quoting/escaping headaches
  • Creates consistent style across projects

7.6.2 Step 2: Fix Dates

Always inspect and clean date columns early:

# First, find all date columns
minerals_clean |> 
  select(contains("date")) |> 
  glimpse()
Rows: 137
Columns: 12
$ commitment_date_mm_dd_yyyy                       <dttm> 2010-01-12, 2012-06-…
$ commitment_date_estimated                        <chr> "No", "No", "No", "No…
$ planned_implementation_start_date_mm_dd_yyyy     <dttm> NA, NA, NA, NA, NA, …
$ actual_implementation_start_date_mm_dd_yyyy      <dttm> 2014-01-01, 2009-01-…
$ actual_implementation_start_date_estimated       <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_implementation_start_date <dbl> NA, NA, NA, NA, NA, N…
$ planned_completion_date_mm_dd_yyyy               <dttm> NA, NA, NA, NA, NA, …
$ actual_completion_date_mm_dd_yyyy                <dttm> NA, NA, 2015-12-01, …
$ actual_completion_date_estimated                 <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_completion_date           <dbl> NA, NA, NA, NA, NA, N…
$ first_loan_repayment_date                        <dttm> 2013-01-11, 2015-06-…
$ last_loan_repayment_date                         <dttm> 2023-07-10, 2025-12-…
# Convert all dates to proper format
minerals_dates <- minerals_clean |>
  mutate(
    commitment_date_mm_dd_yyyy = ymd(commitment_date_mm_dd_yyyy),
    planned_implementation_start_date_mm_dd_yyyy = ymd(planned_implementation_start_date_mm_dd_yyyy),
    actual_implementation_start_date_mm_dd_yyyy = ymd(actual_implementation_start_date_mm_dd_yyyy),
    planned_completion_date_mm_dd_yyyy = ymd(planned_completion_date_mm_dd_yyyy),
    actual_completion_date_mm_dd_yyyy = ymd(actual_completion_date_mm_dd_yyyy),
    first_loan_repayment_date = ymd(first_loan_repayment_date),
    last_loan_repayment_date = ymd(last_loan_repayment_date)
  )

minerals_dates |> 
  select(contains("date")) |> 
  glimpse()
Rows: 137
Columns: 12
$ commitment_date_mm_dd_yyyy                       <date> 2010-01-12, 2012-06-…
$ commitment_date_estimated                        <chr> "No", "No", "No", "No…
$ planned_implementation_start_date_mm_dd_yyyy     <date> NA, NA, NA, NA, NA, …
$ actual_implementation_start_date_mm_dd_yyyy      <date> 2014-01-01, 2009-01-…
$ actual_implementation_start_date_estimated       <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_implementation_start_date <dbl> NA, NA, NA, NA, NA, N…
$ planned_completion_date_mm_dd_yyyy               <date> NA, NA, NA, NA, NA, …
$ actual_completion_date_mm_dd_yyyy                <date> NA, NA, 2015-12-01, …
$ actual_completion_date_estimated                 <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_completion_date           <dbl> NA, NA, NA, NA, NA, N…
$ first_loan_repayment_date                        <date> 2013-01-11, 2015-06-…
$ last_loan_repayment_date                         <date> 2023-07-10, 2025-12-…
Why Clean Dates?

Raw dates can appear in many formats:

  • “2021-01-15”
  • “1/15/21”
  • “15 Jan 2021”
  • “2021-Q1”

Proper date formatting enables:

  • Time-series analysis
  • Duration calculations
  • Correct sorting
  • Filtering by time periods

If you are creating a data cleaning pipeline that will be used on newer versions of the same dataset, make sure to coerce dates into the correct format even if they are parsed correctly by read_excel() (or other data import methods). Next time they might not be, and a date variable that is read in as a character string might mess up your subsequent analysis pipelines.

7.6.3 Step 3: Standardize Country Information

Create consistent country identifiers:

Why Standardize Country Names?

Countries often appear differently across datasets:

  • “Cote d’Ivoire” vs “Ivory Coast”
  • “Democratic Republic of the Congo” vs “DR Congo”
  • “People’s Republic of China” vs “China”

Standardization enables:

  • Joining across datasets
  • Consistent visualization labels
  • Regional aggregation
  • Integration with other global data
minerals_countries <- minerals_dates |>
  mutate(
    # Add ISO3C codes
    iso3c = countrycode(
      sourcevar = recipient,
      origin = "country.name",
      destination = "iso3c",
      origin_regex = TRUE,  # Helps match variations
      warn = TRUE          # Shows what doesn't match
    ),
    # Add standardized names
    country_name = countrycode(
      sourcevar = iso3c,
      origin = "iso3c",
      destination = "country.name"
    ),
    # Add World Bank regions
    wb_region = countrycode(
      sourcevar = iso3c,
      origin = "iso3c",
      destination = "region"
    )
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(...)`.
Caused by warning:
! Some values were not matched unambiguously: Africa, regional
minerals_countries |> 
  select(
    country_name,
    recipient,
    wb_region,
    recipient_region
  ) |> 
  unique()
# A tibble: 20 × 4
   country_name       recipient                       wb_region recipient_region
   <chr>              <chr>                           <chr>     <chr>           
 1 Kazakhstan         Kazakhstan                      Europe &… Asia            
 2 Peru               Peru                            Latin Am… America         
 3 Eritrea            Eritrea                         Sub-Saha… Africa          
 4 Serbia             Serbia                          Europe &… Europe          
 5 Kyrgyzstan         Kyrgyz Republic                 Europe &… Asia            
 6 Russia             Russia                          Europe &… Europe          
 7 Dominican Republic Dominican Republic              Latin Am… America         
 8 Chile              Chile                           Latin Am… America         
 9 Congo - Kinshasa   Democratic Republic of the Con… Sub-Saha… Africa          
10 Myanmar (Burma)    Myanmar                         East Asi… Asia            
11 Zambia             Zambia                          Sub-Saha… Africa          
12 Ecuador            Ecuador                         Latin Am… America         
13 South Africa       South Africa                    Sub-Saha… Africa          
14 Laos               Lao People's Democratic Republ… East Asi… Asia            
15 <NA>               Africa, regional                <NA>      Africa          
16 Papua New Guinea   Papua New Guinea                East Asi… Oceania         
17 Vietnam            Viet Nam                        East Asi… Asia            
18 Uganda             Uganda                          Sub-Saha… Africa          
19 Indonesia          Indonesia                       East Asi… Asia            
20 Iran               Iran                            Middle E… Middle East     
Handling Non-Country Entries

Some datasets include regional entries (like “Africa, regional”) that won’t match country codes. Options for handling these:

  1. Use warn = TRUE to see what doesn’t match
  2. Create custom matching rules for special cases (see ?countrycode)
  3. Document any manual corrections needed

How you handle these will depend on the context of your dataset and analysis. It’s worth remembering that the definition of regions and other aggregates often varies slightly by data source.

Country Code Best Practices
  1. Use ISO3C for Programming

    • Three-letter codes are unambiguous
    • Avoid ISO2C (e.g., Namibia’s “NA” can cause issues)
    • Perfect for plot labels where space is tight
  2. Use Standardized Names for Presentation

    • More readable than codes
    • Consistent across datasets
    • Good for reports and visualizations
  3. Keep Multiple Identifiers

    • Original names (match documentation)
    • ISO3C codes (for programming)
    • Standardized names (for presentation)
    • Regional groupings (for analysis)
  4. Check the countrycode::codelist

    # See all available code types
    ?countrycode::codelist
    
    # Common useful conversions:
    # - "continent" Continent as defined in the World Bank Development Indicators
    # - "currency" ISO 4217 currency name
    # - "region" Regions as defined in the World Bank Development Indicators
    # - "eu28" for EU membership

7.6.4 Step 4: Create Proper Categories

Convert text categories to meaningful factors.

You don’t need to do this for all text variables, but consider doing it for ones you are going to use often, and where order matters for tables + charts.

minerals_cats <- minerals_countries |>
  mutate(
    # Make status an ordered factor
    status = factor(
      status,
      levels = c(
        "Pipeline: Commitment",
        "Implementation",
        "Completion"
      )
    ),
    # Make income groups ordered
    oecd_oda_income_group = factor(
      oecd_oda_income_group,
      levels = c(
        "Low income",
        "Lower middle income",
        "Upper middle income"
      )
    )
  )

minerals_cats |> 
  count(
    status
  )
# A tibble: 3 × 2
  status                   n
  <fct>                <int>
1 Pipeline: Commitment    22
2 Implementation          26
3 Completion              89
Why Create Proper Factors?

Raw categorical data often needs structure:

  • Natural ordering (status phases)
  • Grouping levels (income categories)
  • Consistent labels

Proper factors enable:

  • Correct ordering in plots
  • Meaningful summaries
  • Efficient filtering
  • Clear presentation

7.7 Creating a Reusable Pipeline

Now let’s combine these steps into a reusable function:

process_minerals_data <- function(data) {
  data |>
    # Step 1: Clean column names
    clean_names() |>
    
    # Step 2: Fix dates
    mutate(
      commitment_date_mm_dd_yyyy = ymd(commitment_date_mm_dd_yyyy),
      planned_implementation_start_date_mm_dd_yyyy = ymd(planned_implementation_start_date_mm_dd_yyyy),
      actual_implementation_start_date_mm_dd_yyyy = ymd(actual_implementation_start_date_mm_dd_yyyy),
      planned_completion_date_mm_dd_yyyy = ymd(planned_completion_date_mm_dd_yyyy),
      actual_completion_date_mm_dd_yyyy = ymd(actual_completion_date_mm_dd_yyyy),
      first_loan_repayment_date = ymd(first_loan_repayment_date),
      last_loan_repayment_date = ymd(last_loan_repayment_date)
    ) |>
    
    # Step 3: Standardize country information
    mutate(
      iso3c = countrycode(
        sourcevar = recipient,
        origin = "country.name",
        destination = "iso3c",
        origin_regex = TRUE,
        warn = TRUE
      ),
      country_name = countrycode(
        sourcevar = iso3c,
        origin = "iso3c",
        destination = "country.name"
      ),
      wb_region = countrycode(
        sourcevar = iso3c,
        origin = "iso3c",
        destination = "region"
      )
    ) |>
    
    # Step 4: Create proper factors
    mutate(
      status = factor(
        status,
        levels = c(
          "Pipeline: Commitment",
          "Implementation",
          "Completion"
        ),
      ),
      # Make income groups ordered
      oecd_oda_income_group = factor(
        oecd_oda_income_group,
        levels = c(
          "Low income",
          "Lower middle income",
          "Upper middle income"
        )
      ) 
    ) |>
    
    # Step 5: Add derived variables
    mutate(
      amount_bn = amount_constant_usd_2021 / 1e9
    )
}

# Use the pipeline
# First import the data
minerals_raw <- read_excel(
  here::here(
    "data-raw", 
    "AidData_Chinese_Financing_for_Transition_Minerals_Dataset_Version_1.0.xlsx"
  ),
  sheet = "Financial_Contribution",
  na = c("", "NA", "N/A", "#N/A", "NULL"),
  guess_max = 20000
)

# Then process it
minerals_clean <- minerals_raw |> 
  process_minerals_data()
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(...)`.
Caused by warning:
! Some values were not matched unambiguously: Africa, regional
# Save the results
write_rds(
  minerals_clean, 
  here::here("data", "minerals_clean.rds")
)

write_csv(
  minerals_clean, 
  here::here("data", "minerals_clean.csv")
)

7.8 Data Cleaning Checklist

Before considering your data clean, verify:

Data Structure

Data Types

Standardization

Documentation

Quality Control

7.9 Effective AI Prompts for Data Cleaning

Here are some powerful prompts that will help you get the most out of AI tools when cleaning data:

7.9.1 Making Data Tidy

I have a dataset that looks like this:
[paste first few rows of your data using head() or glimpse()]

I think it might not be in tidy format because:
[describe what seems wrong, e.g., "multiple variables in one column" or "values spread across columns"]

Can you help me:
1. Identify which tidyverse principles it violates:
   - Is each variable a column?
   - Is each observation a row?
   - Is each value a single cell?

2. Suggest a tidyr pipeline to fix it?
3. Explain why each step in the pipeline helps?

Example:

I have this dataset:
Year  Q1_Sales  Q2_Sales  Q3_Sales  Q4_Sales
2021    100       120       95        150
2022    110       125       100       160

This seems untidy because sales values are spread across columns.
How can I reshape this to have columns: year, quarter, sales?

Tips for good tidy data prompts:

  1. Show sample data
  2. Explain what seems wrong
  3. Describe desired output
  4. Ask for explanation of steps

7.9.2 Understanding Data Structure

I have an Excel file with this glimpse() output:
[paste your glimpse() output]

I want to:
1. Clean the column names
2. Convert dates to proper format
3. Standardize country names
Can you help me write a tidyverse pipeline to do this?

7.9.3 Date Standardization

I have dates in these formats:
[paste unique(date_column)]

I need to:
1. Convert them to proper date objects
2. Handle missing/invalid dates
3. Create consistent format

Can you help me write a robust date cleaning function?

7.9.4 Debugging Data Issues

I'm trying to clean this data but getting this error:
[paste error message]

Here's my code:
[paste code]

Here's a sample of my data:
[paste glimpse(data)]

Can you help me:
1. Understand what's wrong
2. Fix the immediate issue
3. Prevent similar issues?

7.9.5 Creating Cleaning Functions

I need to clean multiple similar datasets with these characteristics:
[paste glimpse(data)]

Common issues include:
[list issues]

Can you help me write a robust cleaning function that:
1. Handles all these cases
2. Includes error checking
3. Documents the cleaning steps?
Getting the Most from AI
  1. Show Your Data
    • Use glimpse(), head(), or str()
    • Include sample values
    • Show error messages
  2. Be Specific
    • Explain your goal
    • Describe current issues
    • List any constraints
  3. Ask for Explanation
    • Request comments in code
    • Ask about trade-offs
    • Get help with error handling

7.10 Practice Exercises

7.10.1 Exercise 1: Creating a Data Processing Pipeline

Let’s practice by cleaning some financial data with common issues:

# Create our messy data
messy_data <- "
Country,Project Value,Date,Status
People's Republic of China,\"$12,000,000\",Sep 20 2021,ACTIVE
Democratic Republic of Congo,\"$8,500,000\",Sep 15 2021,Active
Vietnam,\"$15,250,000\",Sep 10 2021,COMPLETED
China,\"$9,750,000\",Sep 5 2021,active
Cote d'Ivoire,\"$11,250,000\",Sep 1 2021,Pipeline
"

# Save to data-raw using here()
write_file(
  messy_data,
  here::here("data-raw", "messy_finance.csv")
)

Your tasks:

  1. Create a function called process_finance_data() that:

    • Standardizes country names (note: China appears twice with different names)
    • Cleans monetary values using parse_number()
    • Converts dates to proper date format using lubridate
    • Creates proper status factors. Use str_to_title() to get values to consistent case first.
  2. Write a pipeline that:

    # Import data
    finance_raw <- read_csv(
      here::here("data-raw", "messy_finance.csv")
    )
    
    # Process it
    finance_clean <- finance_raw |>
      process_finance_data()
    
    # Save results
    write_rds(
      finance_clean,
      here::here("data", "finance_clean.rds")
    )
Useful Cleaning Functions

For Numbers:

# parse_number() removes currency symbols and commas
parse_number("$12,000,000")  # Returns 12000000
parse_number("$1,234.56")    # Returns 1234.56

For Dates: Lubridate provides functions matching common date formats:

  • mdy() (month-day-year): “Sep 20 2021” → 2021-09-20
  • ymd() (year-month-date): “2021-09-20” → 2021-09-20
  • dmy() (day-month-year): “20-09-2021” → 2021-09-20
# Examples
library(lubridate)
mdy("Sep 20 2021")    # Returns "2021-09-20"
[1] "2021-09-20"
ymd("2021-09-20")     # Returns "2021-09-20"
[1] "2021-09-20"
dmy("20-09-2021")     # Returns "2021-09-20"
[1] "2021-09-20"

The function name matches the order of the date components (m=month, d=day, y=year).

Making Text Case Consistent

The {stringr} package provides several functions for standardizing text case:

# Convert to title case (First Letter Of Each Word)
str_to_title("RURAL ELECTRIFICATION project")  # Returns "Rural Electrification Project"

# Convert to upper case (ALL CAPS)
str_to_upper("Rural Electrification Project")  # Returns "RURAL ELECTRIFICATION PROJECT"

# Convert to lower case (all lowercase)
str_to_lower("Rural Electrification Project")  # Returns "rural electrification project"

# Real world example: standardizing status values
status_values <- c("IN PROGRESS", "Completed", "not started", "In Progress")

status_clean <- status_values |>
  str_to_title()  # Returns: "In Progress", "Completed", "Not Started", "In Progress"

When to use each:

  • str_to_title(): Names, project titles, status values
  • str_to_upper(): Country codes, ID values
  • str_to_lower(): Before matching or comparing strings

7.10.2 Exercise 2: Working with Multiple Data Quality Issues

Let’s practice handling several common data quality issues:

# Create project data with various issues
projects <- "
Region,Project Title,Start Date,Budget (USD),Status
East Asia,Water Treatment Plant,Sep 15 2021,$12000000,In Progress
Eastern Asia,Solar Farm Phase 1,Sep 10 2021,$15250000,ACTIVE
Sub-Saharan Africa,Highway Extension,,\"$8,500,000\",planning
SSA,Rural Electrification,Sep 1 2021,$11250000,ACTIVE
"

# Save using here()
write_file(
  projects,
  here::here("data-raw", "projects.csv")
)

Your tasks:

  1. Create a data processing function that:
    • Standardizes region names (East Asia/Eastern Asia, Sub-Saharan Africa/SSA)
    • Handles the missing date
    • Cleans monetary values
    • Standardizes status values
    • Groups similar projects by region
  2. Add validation checks that:
    • Verify all required fields are present
    • Check date ranges
    • Validate budget ranges
    • Ensure status values are standardized
Common Data Quality Issues

When working with real data, watch for:

  1. Inconsistent Names
    • Different spellings
    • Abbreviations
    • Regional variations
  2. Missing Values
    • Empty cells
    • Placeholder values (“N/A”, “-”, etc.)
    • Impossible values
  3. Format Inconsistencies
    • Mixed date formats
    • Different currency notations
    • Varied text cases

Always document how you handle each type of issue!

7.11 Resources for Learning More

7.11.1 Essential References

  1. Data Import & Cleaning
  2. Working with Dates
  3. Number Formatting

7.12 Next Steps

In our class session, we’ll:

  1. Work with Complex Datasets
    • Handle multiple related files
    • Learn about different types of joins
    • Create robust cleaning pipelines
  2. Build Validation Systems
    • Create data quality checks
    • Validate transformations
    • Document cleaning decisions
  3. Practice with Real Data
    • Work with your own datasets
    • Solve common challenges
    • Create reusable solutions
  4. Learn Advanced Techniques
    • Handle special cases
    • Create custom cleaning functions
    • Build automated workflows

Remember: Good data cleaning is the foundation of reliable analysis. The time you invest in creating robust cleaning pipelines will save you hours of troubleshooting later!