27. Chapter 27: Data Wrangling

Author: Meghan R. Hutch

27.1. What does it mean to wrangle our data?

Simply put, data wrangling is the act of preparing data for analysis. Many of the datasets you work with during class may already be fairly or completely clean, meaning the data was previously prepared so that you can download it and begin analyzing right away. Most real-world data, however, is messy because of the way it was collected. For instance, there might be missing values, measurements might be recorded in different units, and text labels might contain typos or inconsistent use of uppercase and lowercase letters. Thus, it is critical to check for, and resolve, any inconsistencies in our data prior to analysis.

This task is often not trivial and requires careful investigation and consultation with domain experts: those who can help clarify how the data was collected and what the variables mean. This process also helps ensure that downstream analyses will not be hindered by data inaccuracies, so we can feel confident about the conclusions we draw as they relate to the question or problem we are trying to solve.
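To make the kinds of inconsistencies described above concrete, here is a minimal sketch in plain Python. The records, field names, and label mapping are all invented for illustration (they are not taken from any real dataset), and a real project would more likely use a data analysis library. The sketch shows three common fixes: standardizing text labels, converting measurements to a single unit, and marking missing values explicitly.

```python
# Hypothetical patient records, invented for illustration only.
raw_records = [
    {"id": 1, "sex": "Female", "weight": "70", "unit": "kg"},
    {"id": 2, "sex": "female ", "weight": "154", "unit": "lb"},
    {"id": 3, "sex": "M", "weight": "", "unit": "kg"},
    {"id": 4, "sex": "MALE", "weight": "80.5", "unit": "KG"},
]

LB_PER_KG = 2.20462  # pounds in one kilogram

# Map each label variant (after trimming and lowercasing) to one canonical form.
SEX_LABELS = {"f": "F", "female": "F", "m": "M", "male": "M"}


def clean_record(record):
    """Return a new record with a standardized label and weight in kilograms."""
    # Trim whitespace and ignore case before looking up the canonical label.
    label = record["sex"].strip().lower()
    sex = SEX_LABELS.get(label)  # None if the label is not recognized

    raw_weight = record["weight"].strip()
    unit = record["unit"].strip().lower()

    if raw_weight == "":
        weight_kg = None  # keep missing values explicit instead of guessing
    elif unit == "kg":
        weight_kg = round(float(raw_weight), 1)
    elif unit == "lb":
        weight_kg = round(float(raw_weight) / LB_PER_KG, 1)
    else:
        weight_kg = None  # unknown unit: flag for follow-up rather than assume

    return {"id": record["id"], "sex": sex, "weight_kg": weight_kg}


cleaned = [clean_record(r) for r in raw_records]

for row in cleaned:
    print(row)
```

Note that the sketch leaves missing or unrecognized values as `None` rather than filling them in. Deciding how to handle such values is exactly the kind of question that benefits from consulting a domain expert.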

27.2. Case Study on Data Wrangling

In this chapter, we will begin answering these questions by working with a new real-world dataset called MIMIC-IV [^*]. The Medical Information Mart for Intensive Care (MIMIC)-IV is a large dataset curated to help support studies on intensive care unit (ICU) patients.

As we will see, medical data is a great case study for the importance of data wrangling. Medical records often contain heterogeneous types of data, and many medical concepts and measurements are recorded in unstandardized ways.

[^*] Johnson, A., Bulgarelli, L., Pollard, T., Gow, B., Moody, B., Horng, S., Celi, L. A., & Mark, R. (2024). MIMIC-IV (version 3.1). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/kpb9-mt58