library(tidyverse) # Core data science tools
library(readxl) # For reading Excel files
library(readr) # For reading CSV files
library(janitor) # For cleaning column names
library(countrycode) # For standardizing country names
library(here) # For relative filepaths
7 Week 4: Import & Tidy Your Data (Pre-Class)
This pre-class preparation should take about 45-60 minutes to complete.
7.1 Overview
Getting data into R and preparing it for analysis is often your first challenge in any project. In this pre-class session, we’ll focus on importing data from spreadsheets and establishing good practices for data cleaning. You’ll learn reliable workflows that make your analysis more reproducible and easier to maintain.
7.1.1 Video Lecture
Watch this video lecture before our interactive session:
7.2 Learning Objectives
By completing this pre-class work, you will:
- Learn to import data from Excel and CSV files into R
- Set up organized project structures
- Establish reliable data cleaning workflows
- Standardize common variables like dates and country names
- Practice with real Chinese development finance data
7.3 The Power of Clean Data
7.3.1 Why Data Cleaning Matters
Data cleaning might not be anyone’s idea of fun, but here’s a secret: getting good at data cleaning is one of the highest-return skills you can develop as a researcher. Why? Because while lots of people can run sophisticated analyses on clean datasets, far fewer people can reliably turn messy real-world data into analysis-ready information.
At AidData, this matters even more. AidData’s mission isn’t just to analyze existing datasets - you create new ones that reveal previously hidden patterns in Chinese development finance. The ability to clean and standardize messy data is core to this mission.
7.3.2 The Tidyverse Advantage
The tidyverse provides an extraordinarily powerful toolkit for importing and cleaning data, with specialized packages designed specifically for common cleaning tasks:
7.3.2.1 Data Import Tools
- {readr} (cheatsheet): CSV and flat files
- {readxl}: Excel files
- {haven}: SPSS, Stata, and SAS files
- {httr2}: Web APIs
These are just the basics. There’s a whole universe of tidyverse-style data import packages for all varieties of file formats and APIs (“There’s a package for that…”).
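To give a feel for how similar these interfaces are, here is a minimal sketch using hypothetical file names (the functions are real; the files are placeholders, and the haven call assumes that package is installed):

# Hypothetical file names, just to illustrate the parallel interfaces
survey <- readr::read_csv("data-raw/survey.csv")      # flat / CSV files
budget <- readxl::read_excel("data-raw/budget.xlsx")  # Excel workbooks
panel <- haven::read_dta("data-raw/panel.dta")        # Stata files (read_sav()/read_sas() for SPSS/SAS)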
7.3.2.2 Data Cleaning Specialists
- {stringr} (cheatsheet): Text cleaning and manipulation
- {lubridate} (cheatsheet): Dates and times
- {forcats} (cheatsheet): Factor handling
- {tidyr} (cheatsheet): Data structure tools
All of these packages follow consistent principles and are designed for humans to use. The R for Data Science (2e) book has excellent chapters on each.
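As a quick taste of these helpers, here is a small sketch (the example strings are made up; all three packages are attached when you load the tidyverse):

str_to_title("rural ELECTRIFICATION project") # stringr: "Rural Electrification Project"
mdy("Sep 20 2021")                            # lubridate: parses to the date 2021-09-20
fct_relevel(
  factor(c("High", "Low", "Medium")),
  "Low", "Medium", "High"
)                                             # forcats: put factor levels in a meaningful order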
7.3.3 The Compounding Value of Clean Data
Here’s why mastering data cleaning is worth your time:
Unique Insights: When everyone works with the same clean datasets, it’s hard to find unique patterns. The ability to clean messy data gives you access to information others might miss.
Reproducible Work: Good data cleaning isn’t just about getting the data right once - it’s about creating reproducible pipelines that can handle new data as it arrives.
Time Investment: While cleaning data takes time upfront, having clean, reliable data saves countless hours of troubleshooting and redoing analyses later.
Competitive Advantage: In research, the quality of your inputs often matters more than the sophistication of your analysis. Being good at data cleaning lets you work with sources others might avoid.
7.3.4 What We’ll Learn
In the following sections, you’ll learn to:
- Import data from various sources
- Create reliable cleaning pipelines
- Handle common data issues like:
- Inconsistent text formats
- Messy dates and times
- Different monetary formats
- Variant country names
While data cleaning may never be the most exciting part of research, by the end of this section, you’ll have the tools to make it a manageable, reliable part of your workflow.
7.4 Setup
Let’s get our workspace ready:
- Create a new Quarto document for your notes:
# File → New File → Quarto Document
# Save as "week_4_import_preclass.qmd" in your week_4/R folder
- Create folders for organizing data:
dir.create("data-raw", showWarnings = FALSE) # For original data
dir.create("data", showWarnings = FALSE) # For cleaned data
- Download the AidData Critical Minerals Dataset and save it to your data-raw folder.
data-raw/: Store original data exactly as received
- Never modify these files
- Serves as your “source of truth”
- Makes your work reproducible

data/: Store cleaned, analysis-ready data
- Modified through documented R code
- Ready for analysis
- Can always recreate from raw data
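Because both folders live inside your project, here() can build paths to them that work on any machine. A minimal sketch (the file names are hypothetical):

raw_path <- here("data-raw", "my_data.xlsx")    # original file, never modified
clean_path <- here("data", "my_data_clean.rds") # written by your cleaning script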
- Load required packages:
7.5 Core Concepts
7.5.1 Organizing Your Data Pipeline
Think of data cleaning as a multi-step recipe. Just like cooking, you want to:
- Get your ingredients (import raw data, save in data-raw)
- Prep them properly (clean data)
- Create something useful (analysis-ready data, save in data for further use)
Let’s see how this works with real data.
7.5.2 Reading Data from Spreadsheets
The two main functions for reading spreadsheet data are:
- readxl::read_excel() for Excel files
- readr::read_csv() for CSV files
Let’s try reading our Critical Minerals data:
# Read the Excel file
minerals_raw <- read_excel(
  here(
    "data-raw",
    "AidData_Chinese_Financing_for_Transition_Minerals_Dataset_Version_1.0.xlsx"
  ),
  sheet = "Financial_Contribution",
  na = c("", "NA", "N/A", "#N/A", "NULL"), # Handle missing values
  guess_max = 20000 # Look at more rows when guessing types
)
# Quick look at what we imported
glimpse(minerals_raw)
Rows: 137
Columns: 128
$ `AidData Record ID` <dbl> 95747, 95748, 39557…
$ `AidData Parent ID` <chr> "228", "228", "695"…
$ `Loan Event ID` <chr> "101", "201", "301"…
$ `Loan Event Description` <chr> "2010 $200 million …
$ `Financier Country` <chr> "China (People's Re…
$ Recipient <chr> "Kazakhstan", "Kaza…
$ `Recipient ISO-3` <chr> "KAZ", "KAZ", "KAZ"…
$ `Recipient Region` <chr> "Asia", "Asia", "As…
$ `Commitment Year` <dbl> 2010, 2012, 2011, 2…
$ `Implementation Start Year` <dbl> 2014, 2009, 2013, 2…
$ `Completion Year` <dbl> NA, NA, 2015, 2015,…
$ `Mining Site` <chr> "Abyz Copper Mine a…
$ `Mining Site ID` <dbl> 1, 2, 3, 3, 3, 3, 4…
$ `Investors Ownership ID` <dbl> 1, 2, 3, 3, 3, 3, 4…
$ `Transition Minerals` <chr> "Copper", "Copper",…
$ Title <chr> "CDB provides $200 …
$ Description <chr> "In June 2009, Chin…
$ `Staff Comments` <chr> "1. The loan’s prec…
$ Status <chr> "Implementation", "…
$ Intent <chr> "Commercial", "Mixe…
$ `Flow Type` <chr> "Loan", "Loan", "Lo…
$ `OECD ODA Concessionality Threshold` <dbl> 0.25, 0.25, 0.25, 0…
$ `Flow Class` <chr> "OOF-like", "OOF-li…
$ `Sector Code` <dbl> 320, 320, 320, 320,…
$ `Sector Name` <chr> "INDUSTRY, MINING, …
$ Infrastructure <chr> "Yes", "Yes", "Yes"…
$ `Funding Agencies` <chr> "China Development …
$ `Funding Agencies Type` <chr> "State-owned Policy…
$ Cofinanced <chr> "No", "No", "No", "…
$ `Cofinancing Agencies` <chr> NA, NA, NA, NA, NA,…
$ `Cofinancing Agencies Type` <chr> NA, NA, NA, NA, NA,…
$ `Direct Receiving Agencies` <chr> "KAZ Minerals Finan…
$ `Direct Receiving Agencies Type` <chr> "Other Joint Ventur…
$ `Indirect Receiving Agencies` <chr> NA, NA, NA, NA, "KA…
$ `Indirect Receiving Agencies Type` <chr> NA, NA, NA, NA, "Ot…
$ `On-Lending` <chr> NA, NA, NA, NA, "Ye…
$ `Implementing Agencies` <chr> NA, NA, "ABB Group|…
$ `Implementing Agencies Type` <chr> NA, NA, "Other Priv…
$ `Operator/Owner` <chr> "Kazakhmys Corporat…
$ `Operator/Owner Type` <chr> "Recipient Private …
$ Investors <chr> "Kazakhmys PLC", "K…
$ `Investors Type` <chr> "Other Private Sect…
$ `Investors Ownership Percentages` <chr> "100", "100", "100"…
$ `Controlling Shareholder` <chr> "Kazakhmys PLC", "K…
$ `Controlling Shareholder Type` <chr> "Other Private Sect…
$ `Guarantee Provided` <chr> "Yes", "Yes", "Yes"…
$ Guarantor <chr> "KAZ Minerals PLC (…
$ `Guarantor Agency Type` <chr> "Other Private Sect…
$ `Insurance Provided` <chr> NA, NA, NA, NA, NA,…
$ `Insurance Provider` <chr> NA, NA, NA, NA, NA,…
$ `Insurance Provider Agency Type` <chr> NA, NA, NA, NA, NA,…
$ `Collateralized/Securitized` <chr> NA, "Yes", "Yes", "…
$ `Collateral Provider` <chr> NA, "KAZ Minerals F…
$ `Collateral Provider Agency Type` <chr> NA, "Other Joint Ve…
$ `Security Agent/Collateral Agent` <chr> NA, NA, NA, NA, NA,…
$ `Security Agent/Collateral Agent Type` <chr> NA, NA, NA, NA, NA,…
$ Collateral <chr> NA, "Borrower to pr…
$ `Amount (Original Currency)` <dbl> 200000000, 20000000…
$ `Original Currency` <chr> "USD", "USD", "CNY"…
$ `Amount Estimated` <chr> NA, NA, NA, NA, NA,…
$ `Amount (Constant USD 2021)` <dbl> 278545387, 23482471…
$ `Amount (Nominal USD)` <dbl> 200000000, 20000000…
$ `Financial Distress` <chr> "Yes", "Yes", NA, N…
$ `Commitment Date (MM/DD/YYYY)` <dttm> 2010-01-12, 2012-0…
$ `Commitment Date Estimated` <chr> "No", "No", "No", "…
$ `Planned Implementation Start Date (MM/DD/YYYY)` <dttm> NA, NA, NA, NA, NA…
$ `Actual Implementation Start Date (MM/DD/YYYY)` <dttm> 2014-01-01, 2009-0…
$ `Actual Implementation Start Date Estimated` <chr> "No", "No", NA, NA,…
$ `Deviation from Planned Implementation Start Date` <dbl> NA, NA, NA, NA, NA,…
$ `Planned Completion Date (MM/DD/YYYY)` <dttm> NA, NA, NA, NA, NA…
$ `Actual Completion Date (MM/DD/YYYY)` <dttm> NA, NA, 2015-12-01…
$ `Actual Completion Date Estimated` <chr> "No", "No", NA, NA,…
$ `Deviation from Planned Completion Date` <dbl> NA, NA, NA, NA, NA,…
$ Maturity <dbl> 13.50, 13.50, 15.00…
$ `Interest Rate` <dbl> 5.199, 5.536, 4.540…
$ `Fixed/Variable Interest Rate` <chr> "Variable", "Variab…
$ `Variable Interest Reference Rate` <chr> "LIBOR", "LIBOR", "…
$ `Variable Interest Detail` <chr> NA, NA, NA, NA, "6-…
$ `Variable Interest Margin` <dbl> 4.8, 4.8, NA, 4.2, …
$ `Grace Period` <dbl> 3, 3, 3, 3, NA, NA,…
$ `Management Fee` <dbl> 1.5, 1.5, NA, NA, N…
$ `Commitment Fee` <lgl> NA, NA, NA, NA, NA,…
$ `Insurance Fee (Percent)` <lgl> NA, NA, NA, NA, NA,…
$ `Insurance Fee (Nominal USD)` <lgl> NA, NA, NA, NA, NA,…
$ `Default Interest Rate` <lgl> NA, NA, NA, NA, NA,…
$ `First Loan Repayment Date` <dttm> 2013-01-11, 2015-0…
$ `Last Loan Repayment Date` <dttm> 2023-07-10, 2025-1…
$ `Grant Element (OECD Cash-Flow)` <dbl> 24.4630, 22.6562, 2…
$ `Grant Element (OECD Grant-Equiv)` <dbl> 4.4782, 2.3634, 29.…
$ `Grant Element (IMF)` <dbl> 0.0000, 0.0000, 2.7…
$ `Number of Lenders` <chr> "Bilateral Loan", "…
$ `Export Buyer's Credit` <chr> NA, NA, NA, NA, NA,…
$ `Supplier’s Credit/Export Seller’s Credit` <chr> NA, NA, NA, NA, NA,…
$ `Interest-Free Loan` <chr> NA, NA, NA, NA, NA,…
$ Refinancing <chr> NA, NA, NA, NA, NA,…
$ `Investment Project Loan` <chr> "Yes", "Yes", "Yes"…
$ `M&A` <chr> NA, NA, NA, NA, NA,…
$ `Working Capital` <chr> NA, NA, NA, NA, NA,…
$ EPCF <lgl> NA, NA, NA, NA, NA,…
$ Lease <lgl> NA, NA, NA, NA, NA,…
$ `FXSL/BOP` <lgl> NA, NA, NA, NA, NA,…
$ `CC IRS` <lgl> NA, NA, NA, NA, NA,…
$ RCF <chr> NA, NA, NA, NA, NA,…
$ GCL <chr> NA, NA, NA, NA, NA,…
$ PBC <lgl> NA, NA, NA, NA, NA,…
$ `PxF/Commodity Prepayment` <chr> NA, NA, NA, NA, NA,…
$ `Inter-Bank Loan` <chr> NA, NA, NA, NA, "Ye…
$ `Overseas Project Contracting Loan` <lgl> NA, NA, NA, NA, NA,…
$ DPA <chr> NA, NA, NA, NA, NA,…
$ `Project Finance` <chr> NA, NA, "Yes", "Yes…
$ `Involving Multilateral` <lgl> NA, NA, NA, NA, NA,…
$ `Involving Non-Chinese Financier` <chr> "No", "No", "No", "…
$ `Short-Term` <chr> NA, NA, NA, NA, NA,…
$ Rescue <lgl> NA, NA, NA, NA, NA,…
$ `JV/SPV Host Government Ownership` <chr> "No Host Government…
$ `JV/SPV Chinese Government Ownership` <chr> "No Chinese Governm…
$ `Level of Public Liability` <chr> "Central government…
$ `Total Source Count` <dbl> 9, 11, 17, 19, 10, …
$ `Official Source Count` <dbl> 2, 2, 11, 12, 6, 12…
$ `Source URLs` <chr> "https://www.proact…
$ `Source Titles` <chr> "Kazakhmys signs ne…
$ `Source Publishers` <chr> "Proactive Investor…
$ `Source Resource Types` <chr> "Media Report|Imple…
$ `Contact Name` <chr> NA, NA, NA, NA, NA,…
$ `Contact Position` <chr> NA, NA, NA, NA, NA,…
$ `ODA Eligible Recipient` <chr> "Yes", "Yes", "Yes"…
$ `OECD ODA Income Group` <chr> "Upper middle incom…
$ `Location Narrative` <chr> "The project develo…
A few read_excel() arguments are worth knowing:
- na =: What values should be treated as missing
- guess_max =: How many rows to check when determining column types
- sheet =: Which Excel sheet to read
- range =: Specific cells to read

These help handle common import challenges like:
- Different representations of missing data
- Incorrect column type detection
- Multiple sheets in one file
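Here is a hedged sketch showing those arguments together on a hypothetical workbook (the file, sheet, and range values are placeholders, not part of the AidData file):

read_excel(
  here("data-raw", "some_workbook.xlsx"), # hypothetical file
  sheet = "Sheet2",                       # read a specific sheet
  range = "A1:F200",                      # read only these cells
  na = c("", "N/A", "-"),                 # treat these strings as missing
  guess_max = 10000                       # inspect more rows before guessing types
)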
While we’re covering key arguments, the readxl and readr packages have many more options. Taking time to read the documentation (read_excel(), read_csv()) will save you hours of troubleshooting later.
You don’t need to memorize all options - just know they exist and where to find them when needed.
7.6 A Systematic Approach to Data Cleaning
7.6.1 Step 1: Clean Column Names & Inspect Data
First, we always want consistent, programming-friendly column names. glimpse() your new clean names, and look for data that might need to be cleaned. Note any variables where the displayed data type does not match what it logically should be (e.g., is a numeric variable shown with data type <chr>?).
# Clean the names
minerals_clean <- minerals_raw |>
  clean_names()

# See the difference
minerals_clean |> glimpse()
Rows: 137
Columns: 128
$ aid_data_record_id <dbl> 95747, 95748, 39557, …
$ aid_data_parent_id <chr> "228", "228", "695", …
$ loan_event_id <chr> "101", "201", "301", …
$ loan_event_description <chr> "2010 $200 million lo…
$ financier_country <chr> "China (People's Repu…
$ recipient <chr> "Kazakhstan", "Kazakh…
$ recipient_iso_3 <chr> "KAZ", "KAZ", "KAZ", …
$ recipient_region <chr> "Asia", "Asia", "Asia…
$ commitment_year <dbl> 2010, 2012, 2011, 201…
$ implementation_start_year <dbl> 2014, 2009, 2013, 201…
$ completion_year <dbl> NA, NA, 2015, 2015, 2…
$ mining_site <chr> "Abyz Copper Mine and…
$ mining_site_id <dbl> 1, 2, 3, 3, 3, 3, 4, …
$ investors_ownership_id <dbl> 1, 2, 3, 3, 3, 3, 4, …
$ transition_minerals <chr> "Copper", "Copper", "…
$ title <chr> "CDB provides $200 mi…
$ description <chr> "In June 2009, China …
$ staff_comments <chr> "1. The loan’s precis…
$ status <chr> "Implementation", "Im…
$ intent <chr> "Commercial", "Mixed"…
$ flow_type <chr> "Loan", "Loan", "Loan…
$ oecd_oda_concessionality_threshold <dbl> 0.25, 0.25, 0.25, 0.2…
$ flow_class <chr> "OOF-like", "OOF-like…
$ sector_code <dbl> 320, 320, 320, 320, 3…
$ sector_name <chr> "INDUSTRY, MINING, CO…
$ infrastructure <chr> "Yes", "Yes", "Yes", …
$ funding_agencies <chr> "China Development Ba…
$ funding_agencies_type <chr> "State-owned Policy B…
$ cofinanced <chr> "No", "No", "No", "No…
$ cofinancing_agencies <chr> NA, NA, NA, NA, NA, N…
$ cofinancing_agencies_type <chr> NA, NA, NA, NA, NA, N…
$ direct_receiving_agencies <chr> "KAZ Minerals Finance…
$ direct_receiving_agencies_type <chr> "Other Joint Venture/…
$ indirect_receiving_agencies <chr> NA, NA, NA, NA, "KAZ …
$ indirect_receiving_agencies_type <chr> NA, NA, NA, NA, "Othe…
$ on_lending <chr> NA, NA, NA, NA, "Yes"…
$ implementing_agencies <chr> NA, NA, "ABB Group|Al…
$ implementing_agencies_type <chr> NA, NA, "Other Privat…
$ operator_owner <chr> "Kazakhmys Corporatio…
$ operator_owner_type <chr> "Recipient Private Se…
$ investors <chr> "Kazakhmys PLC", "Kaz…
$ investors_type <chr> "Other Private Sector…
$ investors_ownership_percentages <chr> "100", "100", "100", …
$ controlling_shareholder <chr> "Kazakhmys PLC", "Kaz…
$ controlling_shareholder_type <chr> "Other Private Sector…
$ guarantee_provided <chr> "Yes", "Yes", "Yes", …
$ guarantor <chr> "KAZ Minerals PLC (Fo…
$ guarantor_agency_type <chr> "Other Private Sector…
$ insurance_provided <chr> NA, NA, NA, NA, NA, N…
$ insurance_provider <chr> NA, NA, NA, NA, NA, N…
$ insurance_provider_agency_type <chr> NA, NA, NA, NA, NA, N…
$ collateralized_securitized <chr> NA, "Yes", "Yes", "Ye…
$ collateral_provider <chr> NA, "KAZ Minerals Fin…
$ collateral_provider_agency_type <chr> NA, "Other Joint Vent…
$ security_agent_collateral_agent <chr> NA, NA, NA, NA, NA, N…
$ security_agent_collateral_agent_type <chr> NA, NA, NA, NA, NA, N…
$ collateral <chr> NA, "Borrower to prov…
$ amount_original_currency <dbl> 200000000, 200000000,…
$ original_currency <chr> "USD", "USD", "CNY", …
$ amount_estimated <chr> NA, NA, NA, NA, NA, N…
$ amount_constant_usd_2021 <dbl> 278545387, 234824712,…
$ amount_nominal_usd <dbl> 200000000, 200000000,…
$ financial_distress <chr> "Yes", "Yes", NA, NA,…
$ commitment_date_mm_dd_yyyy <dttm> 2010-01-12, 2012-06-…
$ commitment_date_estimated <chr> "No", "No", "No", "No…
$ planned_implementation_start_date_mm_dd_yyyy <dttm> NA, NA, NA, NA, NA, …
$ actual_implementation_start_date_mm_dd_yyyy <dttm> 2014-01-01, 2009-01-…
$ actual_implementation_start_date_estimated <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_implementation_start_date <dbl> NA, NA, NA, NA, NA, N…
$ planned_completion_date_mm_dd_yyyy <dttm> NA, NA, NA, NA, NA, …
$ actual_completion_date_mm_dd_yyyy <dttm> NA, NA, 2015-12-01, …
$ actual_completion_date_estimated <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_completion_date <dbl> NA, NA, NA, NA, NA, N…
$ maturity <dbl> 13.50, 13.50, 15.00, …
$ interest_rate <dbl> 5.199, 5.536, 4.540, …
$ fixed_variable_interest_rate <chr> "Variable", "Variable…
$ variable_interest_reference_rate <chr> "LIBOR", "LIBOR", "LI…
$ variable_interest_detail <chr> NA, NA, NA, NA, "6-Mo…
$ variable_interest_margin <dbl> 4.8, 4.8, NA, 4.2, 4.…
$ grace_period <dbl> 3, 3, 3, 3, NA, NA, N…
$ management_fee <dbl> 1.5, 1.5, NA, NA, NA,…
$ commitment_fee <lgl> NA, NA, NA, NA, NA, N…
$ insurance_fee_percent <lgl> NA, NA, NA, NA, NA, N…
$ insurance_fee_nominal_usd <lgl> NA, NA, NA, NA, NA, N…
$ default_interest_rate <lgl> NA, NA, NA, NA, NA, N…
$ first_loan_repayment_date <dttm> 2013-01-11, 2015-06-…
$ last_loan_repayment_date <dttm> 2023-07-10, 2025-12-…
$ grant_element_oecd_cash_flow <dbl> 24.4630, 22.6562, 29.…
$ grant_element_oecd_grant_equiv <dbl> 4.4782, 2.3634, 29.44…
$ grant_element_imf <dbl> 0.0000, 0.0000, 2.790…
$ number_of_lenders <chr> "Bilateral Loan", "Bi…
$ export_buyers_credit <chr> NA, NA, NA, NA, NA, N…
$ supplier_s_credit_export_seller_s_credit <chr> NA, NA, NA, NA, NA, N…
$ interest_free_loan <chr> NA, NA, NA, NA, NA, N…
$ refinancing <chr> NA, NA, NA, NA, NA, N…
$ investment_project_loan <chr> "Yes", "Yes", "Yes", …
$ m_a <chr> NA, NA, NA, NA, NA, N…
$ working_capital <chr> NA, NA, NA, NA, NA, N…
$ epcf <lgl> NA, NA, NA, NA, NA, N…
$ lease <lgl> NA, NA, NA, NA, NA, N…
$ fxsl_bop <lgl> NA, NA, NA, NA, NA, N…
$ cc_irs <lgl> NA, NA, NA, NA, NA, N…
$ rcf <chr> NA, NA, NA, NA, NA, N…
$ gcl <chr> NA, NA, NA, NA, NA, N…
$ pbc <lgl> NA, NA, NA, NA, NA, N…
$ px_f_commodity_prepayment <chr> NA, NA, NA, NA, NA, N…
$ inter_bank_loan <chr> NA, NA, NA, NA, "Yes"…
$ overseas_project_contracting_loan <lgl> NA, NA, NA, NA, NA, N…
$ dpa <chr> NA, NA, NA, NA, NA, N…
$ project_finance <chr> NA, NA, "Yes", "Yes",…
$ involving_multilateral <lgl> NA, NA, NA, NA, NA, N…
$ involving_non_chinese_financier <chr> "No", "No", "No", "No…
$ short_term <chr> NA, NA, NA, NA, NA, N…
$ rescue <lgl> NA, NA, NA, NA, NA, N…
$ jv_spv_host_government_ownership <chr> "No Host Government O…
$ jv_spv_chinese_government_ownership <chr> "No Chinese Governmen…
$ level_of_public_liability <chr> "Central government-g…
$ total_source_count <dbl> 9, 11, 17, 19, 10, 13…
$ official_source_count <dbl> 2, 2, 11, 12, 6, 12, …
$ source_ur_ls <chr> "https://www.proactiv…
$ source_titles <chr> "Kazakhmys signs new …
$ source_publishers <chr> "Proactive Investors|…
$ source_resource_types <chr> "Media Report|Impleme…
$ contact_name <chr> NA, NA, NA, NA, NA, N…
$ contact_position <chr> NA, NA, NA, NA, NA, N…
$ oda_eligible_recipient <chr> "Yes", "Yes", "Yes", …
$ oecd_oda_income_group <chr> "Upper middle income"…
$ location_narrative <chr> "The project develope…
Raw data often comes with inconsistent column names:
- Spaces (“Project Name”)
- Special characters (“Amount ($)”)
- Inconsistent capitalization (“projectName”, “Project_name”)
- Numbers at start (“2021_amount”)
clean_names():
- Converts to lowercase
- Replaces spaces/special chars with underscores
- Makes names programming-friendly
- Creates consistent style
This matters because:
- Reduces coding errors
- Makes autocomplete work better
- Prevents quoting/escaping headaches
- Creates consistent style across projects
7.6.2 Step 2: Fix Dates
Always inspect and clean date columns early:
# First, find all date columns
minerals_clean |>
  select(contains("date")) |>
  glimpse()
Rows: 137
Columns: 12
$ commitment_date_mm_dd_yyyy <dttm> 2010-01-12, 2012-06-…
$ commitment_date_estimated <chr> "No", "No", "No", "No…
$ planned_implementation_start_date_mm_dd_yyyy <dttm> NA, NA, NA, NA, NA, …
$ actual_implementation_start_date_mm_dd_yyyy <dttm> 2014-01-01, 2009-01-…
$ actual_implementation_start_date_estimated <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_implementation_start_date <dbl> NA, NA, NA, NA, NA, N…
$ planned_completion_date_mm_dd_yyyy <dttm> NA, NA, NA, NA, NA, …
$ actual_completion_date_mm_dd_yyyy <dttm> NA, NA, 2015-12-01, …
$ actual_completion_date_estimated <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_completion_date <dbl> NA, NA, NA, NA, NA, N…
$ first_loan_repayment_date <dttm> 2013-01-11, 2015-06-…
$ last_loan_repayment_date <dttm> 2023-07-10, 2025-12-…
# Convert all dates to proper format
minerals_dates <- minerals_clean |>
  mutate(
    commitment_date_mm_dd_yyyy = ymd(commitment_date_mm_dd_yyyy),
    planned_implementation_start_date_mm_dd_yyyy = ymd(planned_implementation_start_date_mm_dd_yyyy),
    actual_implementation_start_date_mm_dd_yyyy = ymd(actual_implementation_start_date_mm_dd_yyyy),
    planned_completion_date_mm_dd_yyyy = ymd(planned_completion_date_mm_dd_yyyy),
    actual_completion_date_mm_dd_yyyy = ymd(actual_completion_date_mm_dd_yyyy),
    first_loan_repayment_date = ymd(first_loan_repayment_date),
    last_loan_repayment_date = ymd(last_loan_repayment_date)
  )

minerals_dates |>
  select(contains("date")) |>
  glimpse()
Rows: 137
Columns: 12
$ commitment_date_mm_dd_yyyy <date> 2010-01-12, 2012-06-…
$ commitment_date_estimated <chr> "No", "No", "No", "No…
$ planned_implementation_start_date_mm_dd_yyyy <date> NA, NA, NA, NA, NA, …
$ actual_implementation_start_date_mm_dd_yyyy <date> 2014-01-01, 2009-01-…
$ actual_implementation_start_date_estimated <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_implementation_start_date <dbl> NA, NA, NA, NA, NA, N…
$ planned_completion_date_mm_dd_yyyy <date> NA, NA, NA, NA, NA, …
$ actual_completion_date_mm_dd_yyyy <date> NA, NA, 2015-12-01, …
$ actual_completion_date_estimated <chr> "No", "No", NA, NA, N…
$ deviation_from_planned_completion_date <dbl> NA, NA, NA, NA, NA, N…
$ first_loan_repayment_date <date> 2013-01-11, 2015-06-…
$ last_loan_repayment_date <date> 2023-07-10, 2025-12-…
Raw dates can appear in many formats:
- “2021-01-15”
- “1/15/21”
- “15 Jan 2021”
- “2021-Q1”
Proper date formatting enables:
- Time-series analysis
- Duration calculations
- Correct sorting
- Filtering by time periods
If you are creating a data cleaning pipeline that will be used on newer versions of the same dataset, make sure to coerce dates into the correct format even if they are parsed correctly by read_excel() (or other data import methods). Next time they might not be, and a date variable that is read in as a character string might mess up your subsequent analysis pipelines.
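If you ever receive a column that mixes several of the formats listed above, lubridate’s parse_date_time() can try multiple orders in turn. A minimal sketch with made-up values:

messy_dates <- c("2021-01-15", "1/15/21", "15 Jan 2021")
parse_date_time(messy_dates, orders = c("ymd", "mdy", "dmy")) # tries each format until one fits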
7.6.3 Step 3: Standardize Country Information
Create consistent country identifiers:
Countries often appear differently across datasets:
- “Cote d’Ivoire” vs “Ivory Coast”
- “Democratic Republic of the Congo” vs “DR Congo”
- “People’s Republic of China” vs “China”
Standardization enables:
- Joining across datasets
- Consistent visualization labels
- Regional aggregation
- Integration with other global data
minerals_countries <- minerals_dates |>
  mutate(
    # Add ISO3C codes
    iso3c = countrycode(
      sourcevar = recipient,
      origin = "country.name",
      destination = "iso3c",
      origin_regex = TRUE, # Helps match variations
      warn = TRUE # Shows what doesn't match
    ),
    # Add standardized names
    country_name = countrycode(
      sourcevar = iso3c,
      origin = "iso3c",
      destination = "country.name"
    ),
    # Add World Bank regions
    wb_region = countrycode(
      sourcevar = iso3c,
      origin = "iso3c",
      destination = "region"
    )
  )
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(...)`.
Caused by warning:
! Some values were not matched unambiguously: Africa, regional
minerals_countries |>
  select(
    country_name,
    recipient,
    wb_region,
    recipient_region
  ) |>
  unique()
# A tibble: 20 × 4
country_name recipient wb_region recipient_region
<chr> <chr> <chr> <chr>
1 Kazakhstan Kazakhstan Europe &… Asia
2 Peru Peru Latin Am… America
3 Eritrea Eritrea Sub-Saha… Africa
4 Serbia Serbia Europe &… Europe
5 Kyrgyzstan Kyrgyz Republic Europe &… Asia
6 Russia Russia Europe &… Europe
7 Dominican Republic Dominican Republic Latin Am… America
8 Chile Chile Latin Am… America
9 Congo - Kinshasa Democratic Republic of the Con… Sub-Saha… Africa
10 Myanmar (Burma) Myanmar East Asi… Asia
11 Zambia Zambia Sub-Saha… Africa
12 Ecuador Ecuador Latin Am… America
13 South Africa South Africa Sub-Saha… Africa
14 Laos Lao People's Democratic Republ… East Asi… Asia
15 <NA> Africa, regional <NA> Africa
16 Papua New Guinea Papua New Guinea East Asi… Oceania
17 Vietnam Viet Nam East Asi… Asia
18 Uganda Uganda Sub-Saha… Africa
19 Indonesia Indonesia East Asi… Asia
20 Iran Iran Middle E… Middle East
Some datasets include regional entries (like “Africa, regional”) that won’t match country codes. Options for handling these:
- Use warn = TRUE to see what doesn’t match
- Create custom matching rules for special cases (see ?countrycode and the sketch below)
- Document any manual corrections needed

How you handle these will depend on the context of your dataset and analysis. It’s worth remembering that the definition of regions and other aggregates often varies slightly by data source.
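One possible approach to the “Africa, regional” rows is countrycode’s custom_match argument, which lets you map specific source values yourself. A minimal sketch (the “AFR” label is our own placeholder, not an ISO code):

iso3c_custom <- countrycode(
  minerals_dates$recipient,
  origin = "country.name",
  destination = "iso3c",
  custom_match = c("Africa, regional" = "AFR") # placeholder label for regional entries
)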
Use ISO3C for Programming
- Three-letter codes are unambiguous
- Avoid ISO2C (e.g., Namibia’s “NA” can cause issues)
- Perfect for plot labels where space is tight
Use Standardized Names for Presentation
- More readable than codes
- Consistent across datasets
- Good for reports and visualizations
Keep Multiple Identifiers
- Original names (match documentation)
- ISO3C codes (for programming)
- Standardized names (for presentation)
- Regional groupings (for analysis)
Check the countrycode::codelist help page for all available code types:

# See all available code types
?countrycode::codelist

# Common useful conversions:
# - "continent": Continent as defined in the World Bank Development Indicators
# - "currency": ISO 4217 currency name
# - "region": Regions as defined in the World Bank Development Indicators
# - "eu28": EU membership
7.6.4 Step 4: Create Proper Categories
Convert text categories to meaningful factors.
You don’t need to do this for all text variables, but consider doing it for ones you are going to use often, and where order matters for tables + charts.
minerals_cats <- minerals_countries |>
  mutate(
    # Make status an ordered factor
    status = factor(
      status,
      levels = c(
        "Pipeline: Commitment",
        "Implementation",
        "Completion"
      )
    ),
    # Make income groups ordered
    oecd_oda_income_group = factor(
      oecd_oda_income_group,
      levels = c(
        "Low income",
        "Lower middle income",
        "Upper middle income"
      )
    )
  )

minerals_cats |>
  count(
    status
  )
# A tibble: 3 × 2
status n
<fct> <int>
1 Pipeline: Commitment 22
2 Implementation 26
3 Completion 89
Raw categorical data often needs structure:
- Natural ordering (status phases)
- Grouping levels (income categories)
- Consistent labels
Proper factors enable:
- Correct ordering in plots
- Meaningful summaries
- Efficient filtering
- Clear presentation
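A quick sketch of why the ordering matters: because status now has explicit levels, the bars in a simple plot follow the project lifecycle rather than alphabetical order.

minerals_cats |>
  count(status) |>
  ggplot(aes(x = status, y = n)) +
  geom_col()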
7.7 Creating a Reusable Pipeline
Now let’s combine these steps into a reusable function:
process_minerals_data <- function(data) {
  data |>
    # Step 1: Clean column names
    clean_names() |>
    # Step 2: Fix dates
    mutate(
      commitment_date_mm_dd_yyyy = ymd(commitment_date_mm_dd_yyyy),
      planned_implementation_start_date_mm_dd_yyyy = ymd(planned_implementation_start_date_mm_dd_yyyy),
      actual_implementation_start_date_mm_dd_yyyy = ymd(actual_implementation_start_date_mm_dd_yyyy),
      planned_completion_date_mm_dd_yyyy = ymd(planned_completion_date_mm_dd_yyyy),
      actual_completion_date_mm_dd_yyyy = ymd(actual_completion_date_mm_dd_yyyy),
      first_loan_repayment_date = ymd(first_loan_repayment_date),
      last_loan_repayment_date = ymd(last_loan_repayment_date)
    ) |>
    # Step 3: Standardize country information
    mutate(
      iso3c = countrycode(
        sourcevar = recipient,
        origin = "country.name",
        destination = "iso3c",
        origin_regex = TRUE,
        warn = TRUE
      ),
      country_name = countrycode(
        sourcevar = iso3c,
        origin = "iso3c",
        destination = "country.name"
      ),
      wb_region = countrycode(
        sourcevar = iso3c,
        origin = "iso3c",
        destination = "region"
      )
    ) |>
    # Step 4: Create proper factors
    mutate(
      status = factor(
        status,
        levels = c(
          "Pipeline: Commitment",
          "Implementation",
          "Completion"
        )
      ),
      # Make income groups ordered
      oecd_oda_income_group = factor(
        oecd_oda_income_group,
        levels = c(
          "Low income",
          "Lower middle income",
          "Upper middle income"
        )
      )
    ) |>
    # Step 5: Add derived variables
    mutate(
      amount_bn = amount_constant_usd_2021 / 1e9
    )
}

# Use the pipeline
# First import the data
minerals_raw <- read_excel(
  here::here(
    "data-raw",
    "AidData_Chinese_Financing_for_Transition_Minerals_Dataset_Version_1.0.xlsx"
  ),
  sheet = "Financial_Contribution",
  na = c("", "NA", "N/A", "#N/A", "NULL"),
  guess_max = 20000
)

# Then process it
minerals_clean <- minerals_raw |>
  process_minerals_data()
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(...)`.
Caused by warning:
! Some values were not matched unambiguously: Africa, regional
# Save the results
write_rds(
  minerals_clean,
  here::here("data", "minerals_clean.rds")
)

write_csv(
  minerals_clean,
  here::here("data", "minerals_clean.csv")
)
7.8 Data Cleaning Checklist
Before considering your data clean, verify:
- Data Structure
- Data Types
- Standardization
- Documentation
- Quality Control
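If you want to make some of these checks programmatic, a minimal sketch using stopifnot() might look like this (the specific conditions are examples; adapt them to your dataset):

stopifnot(
  nrow(minerals_clean) > 0,                                        # data actually loaded
  inherits(minerals_clean$commitment_date_mm_dd_yyyy, "Date"),     # dates were converted
  all(minerals_clean$amount_constant_usd_2021 >= 0, na.rm = TRUE)  # no negative amounts
)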
7.9 Effective AI Prompts for Data Cleaning
Here are some powerful prompts that will help you get the most out of AI tools when cleaning data:
7.9.1 Making Data Tidy
I have a dataset that looks like this:
[paste first few rows of your data using head() or glimpse()]
I think it might not be in tidy format because:
[describe what seems wrong, e.g., "multiple variables in one column" or "values spread across columns"]

Can you help me:
1. Identify which tidyverse principles it violates:
- Is each variable a column?
- Is each observation a row?
- Is each value a single cell?
2. Suggest a tidyr pipeline to fix it?
3. Explain why each step in the pipeline helps?
Example:
I have this dataset:
Year Q1_Sales Q2_Sales Q3_Sales Q4_Sales
2021 100 120 95 150
2022 110 125 100 160
This seems untidy because sales values are spread across columns. How can I reshape this to have columns: year, quarter, sales?
Tips for good tidy data prompts:
- Show sample data
- Explain what seems wrong
- Describe desired output
- Ask for explanation of steps
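For the quarterly sales example above, the kind of tidyr pipeline you would hope to get back looks roughly like this sketch (column names taken from the example):

sales_wide <- tribble(
  ~Year, ~Q1_Sales, ~Q2_Sales, ~Q3_Sales, ~Q4_Sales,
   2021,       100,       120,        95,       150,
   2022,       110,       125,       100,       160
)

sales_long <- sales_wide |>
  pivot_longer(
    cols = ends_with("_Sales"),
    names_to = "quarter",
    names_pattern = "(Q[1-4])_Sales", # keep just "Q1"..."Q4"
    values_to = "sales"
  ) |>
  rename(year = Year)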
7.9.2 Understanding Data Structure
I have an Excel file with this glimpse() output:
[paste your glimpse() output]

I want to:
1. Clean the column names
2. Convert dates to proper format
3. Standardize country names
Can you help me write a tidyverse pipeline to do this?
7.9.3 Date Standardization
I have dates in these formats:
[paste unique(date_column)]
I need to:
1. Convert them to proper date objects
2. Handle missing/invalid dates
3. Create consistent format
Can you help me write a robust date cleaning function?
7.9.4 Debugging Data Issues
I'm trying to clean this data but getting this error:
[paste error message]

Here's my code:
[paste code]

Here's a sample of my data:
[paste glimpse(data)]

Can you help me:
1. Understand what's wrong
2. Fix the immediate issue
3. Prevent similar issues?
7.9.5 Creating Cleaning Functions
I need to clean multiple similar datasets with these characteristics:
[paste glimpse(data)]
Common issues include:
[list issues]

Can you help me write a robust cleaning function that:
1. Handles all these cases
2. Includes error checking
3. Documents the cleaning steps?
- Show Your Data
  - Use glimpse(), head(), or str()
  - Include sample values
  - Show error messages
- Be Specific
  - Explain your goal
  - Describe current issues
  - List any constraints
- Ask for Explanation
  - Request comments in code
  - Ask about trade-offs
  - Get help with error handling
7.10 Practice Exercises
7.10.1 Exercise 1: Creating a Data Processing Pipeline
Let’s practice by cleaning some financial data with common issues:
# Create our messy data
messy_data <- "
Country,Project Value,Date,Status
People's Republic of China,\"$12,000,000\",Sep 20 2021,ACTIVE
Democratic Republic of Congo,\"$8,500,000\",Sep 15 2021,Active
Vietnam,\"$15,250,000\",Sep 10 2021,COMPLETED
China,\"$9,750,000\",Sep 5 2021,active
Cote d'Ivoire,\"$11,250,000\",Sep 1 2021,Pipeline
"

# Save to data-raw using here()
write_file(
  messy_data,
  here::here("data-raw", "messy_finance.csv")
)
Your tasks:

1. Create a function called process_finance_data() that:
   - Standardizes country names (note: China appears twice with different names)
   - Cleans monetary values using parse_number()
   - Converts dates to proper date format using lubridate
   - Creates proper status factors. Use str_to_title() to get values to consistent case first.

2. Write a pipeline that imports, processes, and saves the data:

# Import data
finance_raw <- read_csv(
  here::here("data-raw", "messy_finance.csv")
)

# Process it
finance_clean <- finance_raw |>
  process_finance_data()

# Save results
write_rds(
  finance_clean,
  here::here("data", "finance_clean.rds")
)
For Numbers:
# parse_number() removes currency symbols and commas
parse_number("$12,000,000") # Returns 12000000
parse_number("$1,234.56") # Returns 1234.56
For Dates: Lubridate provides functions matching common date formats:
- mdy() (month-day-year): "Sep 20 2021" → 2021-09-20
- ymd() (year-month-day): "2021-09-20" → 2021-09-20
- dmy() (day-month-year): "20-09-2021" → 2021-09-20
# Examples
library(lubridate)
mdy("Sep 20 2021") # Returns "2021-09-20"
[1] "2021-09-20"
ymd("2021-09-20") # Returns "2021-09-20"
[1] "2021-09-20"
dmy("20-09-2021") # Returns "2021-09-20"
[1] "2021-09-20"
The function name matches the order of the date components (m=month, d=day, y=year).
The {stringr} package provides several functions for standardizing text case:
# Convert to title case (First Letter Of Each Word)
str_to_title("RURAL ELECTRIFICATION project") # Returns "Rural Electrification Project"
# Convert to upper case (ALL CAPS)
str_to_upper("Rural Electrification Project") # Returns "RURAL ELECTRIFICATION PROJECT"
# Convert to lower case (all lowercase)
str_to_lower("Rural Electrification Project") # Returns "rural electrification project"
# Real world example: standardizing status values
status_values <- c("IN PROGRESS", "Completed", "not started", "In Progress")

status_clean <- status_values |>
  str_to_title() # Returns: "In Progress", "Completed", "Not Started", "In Progress"
When to use each:
- str_to_title(): Names, project titles, status values
- str_to_upper(): Country codes, ID values
- str_to_lower(): Before matching or comparing strings
7.10.2 Exercise 2: Working with Multiple Data Quality Issues
Let’s practice handling several common data quality issues:
# Create project data with various issues
projects <- "
Region,Project Title,Start Date,Budget (USD),Status
East Asia,Water Treatment Plant,Sep 15 2021,$12000000,In Progress
Eastern Asia,Solar Farm Phase 1,Sep 10 2021,$15250000,ACTIVE
Sub-Saharan Africa,Highway Extension,,\"$8,500,000\",planning
SSA,Rural Electrification,Sep 1 2021,$11250000,ACTIVE
"

# Save using here()
write_file(
  projects,
  here::here("data-raw", "projects.csv")
)
Your tasks:
- Create a data processing function that:
- Standardizes region names (East Asia/Eastern Asia, Sub-Saharan Africa/SSA)
- Handles the missing date
- Cleans monetary values
- Standardizes status values
- Groups similar projects by region
- Add validation checks that:
- Verify all required fields are present
- Check date ranges
- Validate budget ranges
- Ensure status values are standardized
When working with real data, watch for:
- Inconsistent Names
- Different spellings
- Abbreviations
- Regional variations
- Missing Values
- Empty cells
- Placeholder values (“N/A”, “-”, etc.)
- Impossible values
- Format Inconsistencies
- Mixed date formats
- Different currency notations
- Varied text cases
Always document how you handle each type of issue!
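For the placeholder-value issue in particular, one simple pattern (sketched here with a hypothetical projects_raw data frame) is to convert common placeholder strings to real NAs before any other cleaning:

projects_raw |>
  mutate(across(
    where(is.character),
    \(x) na_if(na_if(x, "N/A"), "-") # turn "N/A" and "-" into real missing values
  ))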
7.11 Resources for Learning More
7.11.1 Essential References
- Data Import & Cleaning
- Working with Dates
- Number Formatting
7.12 Next Steps
In our class session, we’ll:
- Work with Complex Datasets
- Handle multiple related files
- Learn about different types of joins
- Create robust cleaning pipelines
- Build Validation Systems
- Create data quality checks
- Validate transformations
- Document cleaning decisions
- Practice with Real Data
- Work with your own datasets
- Solve common challenges
- Create reusable solutions
- Learn Advanced Techniques
- Handle special cases
- Create custom cleaning functions
- Build automated workflows
Remember: Good data cleaning is the foundation of reliable analysis. The time you invest in creating robust cleaning pipelines will save you hours of troubleshooting later!