27.3. How to begin Wrangling: Get to know your Data!#

Whenever you begin working with a new dataset, it’s essential that you get to know your data! To get started, there are a few questions to consider:

  • Where did the data come from and how was it collected?

  • What data types are my variables (numeric, string, date, etc)?

  • Does Python recognize these variables as their correct data types?

  • Are there outliers in our data?

  • How should we handle missing data?

The answers to these questions are just a few of the factors that will influence the approaches you take to wrangle your data.

Note

Data wrangling may also be referred to as data cleaning. By wrangling our data, we help ensure that it will be clean for later analysis.

Case Study of Data Wrangling#

In this chapter, we will consider these questions as we get to know the MIMIC-IV Dataset [^*] and demonstrate the steps needed to ensure that it can be used for downstream analyses.

Where did the data come from and how was it collected?#

The MIMIC-IV dataset contains data from >220,000 patients who were admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts between 2008 and 2022. The dataset contains the information that was entered into a patient’s medical record during the time they were admitted to the hospital.

For this case study, we will rely on MIMIC’s demo dataset [^**], which has been made publicly available and contains data from a random subset of 100 hospitalized patients.

Note

A note on the data used for this case study

Due to the sensitive nature of medical data, the full MIMIC-IV dataset requires special approval to work with. The demo dataset has been specially prepared for educational purposes. Even though this is an abbreviated dataset, the amount of medical data is so voluminous that we will only be using a subset of the demo dataset for the purposes of this case study.

What does this data look like?#

The MIMIC-IV dataset contains many separate data tables (structured as csv files).

To begin, we will import just one of the csv files, which concerns the demographics of the patients included in the study. We will use pd.read_csv() to import the data as a DataFrame.

import pandas as pd

patients = pd.read_csv('../../data/mimic_patients.csv')

Determine Data Types#

Now that we’ve loaded in our data, we can start to explore what type of data we have.

The use of the method .head() can be a good first choice to get a glimpse of the first few rows of the DataFrame.

We see that we have three columns: subject_id (unique patient identifier), gender, and anchor_age (age of patient).

patients.head()
subject_id gender anchor_age
0 10014729 F 21
1 10003400 F 72
2 10002428 F 80
3 10032725 F 38
4 10027445 F 48

Note

Side-Note: Data Dictionaries and Codebooks

The creators of MIMIC have curated excellent documentation detailing the process of data collection, dataset construction, and detailed information about what each column (i.e. variable) of the dataset represents.

As mentioned, at its core, the MIMIC dataset is built from the compilation of numerous tables (csv files), each containing different types of data.

When working with a new dataset, it’s good practice to determine whether there is a data dictionary or codebook which might tell us what each variable represents and what data types they are. For example, we can find information about the patients table in the documentation’s patients table page.

One useful fact we can learn about the patient’s data is that if a patient’s anchor_age is >89, the value becomes capped at 91. This is done to preserve patient confidentiality.

Since .head() only showed us the first 5 rows of data, let’s run some code to evaluate how many rows we have.

len(patients)
100

We should also check that subject_id is in fact unique - if it is, the number of unique values for subject_id should equal the number of rows in our dataframe.

patients['subject_id'].nunique()
100
# we could also automate this check
patients['subject_id'].nunique() == len(patients)
True

Next, we might want to check how Python interpreted these data types.

patients.dtypes
subject_id     int64
gender        object
anchor_age     int64
dtype: object

Should subject_id really be considered an integer? We are informed by the dataset curators that this variable is a unique subject (or patient) identifier.

Ask yourself: would adding two subject_id values make sense?

If this is a unique identifier, we can think of this variable almost like a person’s name. In this case, subject_id is a categorical variable.

Let’s convert this column to a string to make sure we don’t accidentally include it as a variable in subsequent numerical computations.

# convert `subject_id` to a string
patients['subject_id'] = patients['subject_id'].astype('str')

# check that the data has been converted to type str successfully
patients.dtypes
subject_id    object
gender        object
anchor_age     int64
dtype: object

From this step, we also see that gender is of type object and anchor_age is of type integer. These seem reasonable!

Examine the values and distributions of each variable#

To understand our data even more, we can inspect the actual values of our variables. For example, what categorical labels is the gender variable composed of?

patients['gender'].value_counts()
gender
M    57
F    43
Name: count, dtype: int64

For numerical variables, we can examine the distribution of values by calculating summary statistics.

patients['anchor_age'].describe()
count    100.00000
mean      61.75000
std       16.16979
min       21.00000
25%       51.75000
50%       63.00000
75%       72.00000
max       91.00000
Name: anchor_age, dtype: float64

Patients in our data have a mean age of 61.8. The youngest and oldest patients are 21 and 91 years old, respectively. Recall from the documentation that all ages > 89 years old are capped at 91, so this is a nice sanity check that our maximum age is as expected.
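If we want an explicit check, a quick sketch like the following confirms that every anchor_age above 89 in our subset is indeed recorded as 91:

# sanity check (sketch): ages above 89 should all have been capped at 91
patients.loc[patients['anchor_age'] > 89, 'anchor_age'].unique()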

Better yet, for a quick glance of the distribution, we can also use a histogram:

patients['anchor_age'].hist()
<Axes: >
[Figure: histogram of anchor_age values]

Checking the distribution of numerical values is important as it can help us quickly see if there is an unexpected observation in our data. For instance, perhaps we were told our dataset should only include adults, but we saw some ages that were < 10 years old. Or maybe someone is said to be 150 years old! These would necessitate further exploration to determine whether there are any data entry issues. Perhaps a busy doctor mistyped the age or maybe age was confused with a different variable.

Incorporating more data!#

As previously mentioned, MIMIC contains multiple tables worth of data. Next, let’s look at a second file: vitals.csv which contains vital sign measurements for patients.

vitals = pd.read_csv('../../data/mimic_vitals.csv')

Let’s repeat the previously described steps to help us get to know our new vitals DataFrame.

vitals.head()
subject_id chartdate result_name result_value
0 10011398 2146-12-01 Height (Inches) 63
1 10011398 2147-01-22 Weight (Lbs) 127
2 10011398 2146-12-01 Weight (Lbs) 135
3 10011398 2147-07-24 Weight (Lbs) 136
4 10011398 2147-03-26 Weight (Lbs) 136

Once again, we see that the subject_id column is present, in addition to three new columns: chartdate, result_name, and result_value.

Let’s see if the number of unique subject_ids match the number of rows:

vitals['subject_id'].nunique() == len(vitals)
False
len(vitals)
2964

In this case, it appears that there are many more observations in the vitals DataFrame (note: this was also apparent just from looking at the first 5 rows: the same subject_id appears in each row!).

To get a better idea of what is going on here, we can examine the specific types of categories in the result_name column. When we do, we learn that this variable annotates vital sign measurements including weight, height, body mass index (BMI), and blood pressure.
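One way to do this is with .value_counts() - a quick sketch using the vitals DataFrame we just loaded:

# count how often each vital sign category appears
vitals['result_name'].value_counts()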

Since this dataset represents hospitalized patients, it is reasonable that each patient may have each vital measured several times throughout their hospital stay.

We can confirm this by using .groupby() on the subject_id and result_name columns to see how many unique results (result_value) each patient had.

vitals.groupby(['subject_id', 'result_name'])[['result_value']].count()
result_value
subject_id result_name
10000032 BMI (kg/m2) 8
Blood Pressure 6
Height (Inches) 2
Weight (Lbs) 25
10001217 BMI (kg/m2) 1
... ... ...
10039997 Weight (Lbs) 5
10040025 BMI (kg/m2) 10
Blood Pressure 1
Height (Inches) 1
Weight (Lbs) 10

268 rows × 1 columns

As we did before, let’s also check the data types:

vitals.dtypes
subject_id       int64
chartdate       object
result_name     object
result_value    object
dtype: object

First, we can convert subject_id to a string again.

vitals['subject_id'] = vitals['subject_id'].astype('str')

Working with Dates#

Next, we see that chartdate is an object. We can convert dates to a special datetime data type in Python. This specification will allow us to perform operations between dates (e.g., finding the number of days between two blood pressure measurements).

vitals['chartdate'] = pd.to_datetime(vitals['chartdate'])
print(vitals.dtypes)
vitals.head()
subject_id              object
chartdate       datetime64[ns]
result_name             object
result_value            object
dtype: object
subject_id chartdate result_name result_value
0 10011398 2146-12-01 Height (Inches) 63
1 10011398 2147-01-22 Weight (Lbs) 127
2 10011398 2146-12-01 Weight (Lbs) 135
3 10011398 2147-07-24 Weight (Lbs) 136
4 10011398 2147-03-26 Weight (Lbs) 136

Note

Important Note about Dates

chartdate is the date the vital sign was recorded. You’ll notice that these years are in the future. Date shifting was performed to add an additional measure of privacy (as you may recall from Pillar 2: Data Privacy). To protect patients’ sensitive data, dates were shifted forward by a random amount for each patient, but consistently within a patient (for example, if two blood pressure measurements were taken on back-to-back days, the shifted dates might look like 2067-01-01 and 2067-01-02). Knowing whether a dataset’s dates are accurate or have been shifted is crucial if you’re trying to draw conclusions that depend on the actual date. For example, let’s say a patient was admitted with an unknown but severe respiratory infection in 1990, but had their dates shifted to 2020 - it would be completely erroneous to conclude they may have had COVID-19 (considering COVID-19 did not even exist in the 1990s!).

Now that chartdate has been converted to a datetime object, we can perform date-specific types of operations.

For instance, for each patient, let’s order rows in chronological order.

vitals_sorted = vitals.sort_values(by=['subject_id', 'chartdate'], ascending=[True, True])
vitals_sorted.head(10)
subject_id chartdate result_name result_value
1204 10000032 2180-04-27 Weight (Lbs) 94
1241 10000032 2180-04-27 Blood Pressure 110/65
1203 10000032 2180-05-07 Height (Inches) 60
1207 10000032 2180-05-07 BMI (kg/m2) 18.0
1218 10000032 2180-05-07 Weight (Lbs) 92.15
1219 10000032 2180-05-07 Weight (Lbs) 92.15
1220 10000032 2180-05-07 Weight (Lbs) 92.15
1221 10000032 2180-05-07 Weight (Lbs) 92.15
1222 10000032 2180-05-07 Weight (Lbs) 92.15
1223 10000032 2180-05-07 Weight (Lbs) 92.15

We can also calculate the difference between a patient’s earliest measurement and most recent measurement. We can do this by first aggregating the min and max dates using the .agg() method and then calculating the difference in the number of days between the max and min dates using the pandas .dt.days attribute.

date_range = vitals_sorted.groupby('subject_id')['chartdate'].agg(['min', 'max'])
date_range['day_diff'] = (date_range['max'] - date_range['min']).dt.days
date_range.sort_values('day_diff', ascending = True)
min max day_diff
subject_id
10009049 2174-07-20 2174-07-20 0
10009628 2153-10-21 2153-10-21 0
10007818 2146-06-10 2146-06-10 0
10010471 2155-05-08 2155-05-08 0
10008454 2110-12-29 2110-12-29 0
... ... ... ...
10020306 2128-11-28 2135-08-20 2456
10004457 2141-10-30 2149-01-19 2638
10019003 2148-08-25 2155-11-23 2646
10015860 2186-10-14 2194-03-29 2723
10037928 2175-03-16 2183-12-25 3206

79 rows × 3 columns

Just from this sample we can see that some patients have years between measurements. This is reasonable since patients in MIMIC can be hospitalized more than once.

Other ways we could work with the datetime objects are by separating out the year, month, and day.

For example, if we wanted to create unique columns for year, month, and day, we could do the following:

vitals_sorted['year'] = vitals_sorted['chartdate'].dt.year
vitals_sorted['month'] = vitals_sorted['chartdate'].dt.month
vitals_sorted['day'] = vitals_sorted['chartdate'].dt.day

vitals_sorted
subject_id chartdate result_name result_value year month day
1204 10000032 2180-04-27 Weight (Lbs) 94 2180 4 27
1241 10000032 2180-04-27 Blood Pressure 110/65 2180 4 27
1203 10000032 2180-05-07 Height (Inches) 60 2180 5 7
1207 10000032 2180-05-07 BMI (kg/m2) 18.0 2180 5 7
1218 10000032 2180-05-07 Weight (Lbs) 92.15 2180 5 7
... ... ... ... ... ... ... ...
2689 10040025 2147-12-29 Weight (Lbs) 212 2147 12 29
2696 10040025 2147-12-29 BMI (kg/m2) 34.2 2147 12 29
2692 10040025 2147-12-30 BMI (kg/m2) 30.3 2147 12 30
2704 10040025 2147-12-30 Weight (Lbs) 187.61 2147 12 30
2703 10040025 2148-01-19 Blood Pressure 112/60 2148 1 19

2964 rows × 7 columns

Data Wrangling Strings to Numbers#

You may have noticed that result_value was not recognized as a numerical variable. Perhaps there are non-numeric values in this column? We can check this using the following expression to see which observations contain non-numeric characters:

# return the rows where there are non-numeric characters in `result_value`
vitals[vitals['result_value'].astype(str).str.contains(r'[^0-9]')][['result_name', 'result_value']].drop_duplicates()
result_name result_value
8 BMI (kg/m2) 23.6
9 BMI (kg/m2) 23.9
10 BMI (kg/m2) 25.3
12 BMI (kg/m2) 25.8
13 Height (Inches) 61.5
... ... ...
2957 Blood Pressure 152/94
2958 Blood Pressure 153/94
2961 Blood Pressure 111/72
2962 Blood Pressure 135/76
2963 Weight (Lbs) 252.65

1282 rows × 2 columns

You’ll notice that when result_name == Blood Pressure, the results contain a ‘/’ to separate systolic from diastolic blood pressure.

If we were interested in investigating the blood pressure of these patients, we’d need to be mindful that Python will not allow us to simply average blood pressure values without further data wrangling to separate the numbers from the non-numeric character.

# create blood pressure dataframe
bp_measurements = vitals.copy()

bp_measurements = bp_measurements[bp_measurements['result_name'] == 'Blood Pressure']
bp_measurements.head()
subject_id chartdate result_name result_value
14 10011398 2146-12-01 Blood Pressure 110/70
15 10011398 2147-03-26 Blood Pressure 112/80
16 10011398 2146-06-02 Blood Pressure 128/84
17 10011398 2147-07-24 Blood Pressure 132/84
18 10011398 2146-05-28 Blood Pressure 138/90

In the code below, we can use str.split() to help us separate systolic (number before /) and diastolic (number after /) blood pressure. Once we remove the /, we can also convert our new columns to integers.

bp_measurements[['systolic', 'diastolic']] = bp_measurements['result_value'].str.split('/', expand=True).astype(int)
print(bp_measurements.dtypes)
bp_measurements.head()
subject_id              object
chartdate       datetime64[ns]
result_name             object
result_value            object
systolic                 int64
diastolic                int64
dtype: object
subject_id chartdate result_name result_value systolic diastolic
14 10011398 2146-12-01 Blood Pressure 110/70 110 70
15 10011398 2147-03-26 Blood Pressure 112/80 112 80
16 10011398 2146-06-02 Blood Pressure 128/84 128 84
17 10011398 2147-07-24 Blood Pressure 132/84 132 84
18 10011398 2146-05-28 Blood Pressure 138/90 138 90

Examine the distribution of continuous variables#

Let’s explore the distribution of our now-cleaned systolic and diastolic variables.

bp_measurements['systolic'].hist()
<Axes: >
[Figure: histogram of systolic blood pressure values]
bp_measurements['diastolic'].hist()
<Axes: >
[Figure: histogram of diastolic blood pressure values]

From these distributions, there do not appear to be any extreme outliers. However, if our primary analysis were to focus on blood pressure, we would likely want to consult a medical domain expert to ensure these distributions seem reasonable. Importantly, we would want to consider whether the systolic and diastolic blood pressure values we see at either end of the distributions are biologically plausible.
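As a starting point for that conversation, a sketch like the following flags readings that fall outside rough bounds. The thresholds below are placeholders chosen purely for illustration, not clinical cutoffs:

# hypothetical plausibility bounds, for illustration only - real cutoffs should come from a clinician
implausible = bp_measurements[
    (bp_measurements['systolic'] < 60) | (bp_measurements['systolic'] > 250) |
    (bp_measurements['diastolic'] < 30) | (bp_measurements['diastolic'] > 150)
]
len(implausible)  # number of readings that merit a closer look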

Note

We refer to domain experts as those who have specialized knowledge in a field (i.e. domain) of interest. For example, domain experts might include medical doctors, physicists, CEOs, lawyers, etc. Often, domain experts might not know how to code or analyze data. However, domain experts can help clarify how the data was collected and what each variable means in the context of the particular domain. This can help ensure robust data wrangling.

Filtering Observations of Interest#

Perhaps we are only interested in subjects who have height and weight measurements. We can create a new copy of the vitals dataframe and filter it to keep those observations of interest.

Note: from result_name, we see that height and weight are recorded in inches and pounds (lbs), respectively.

height = vitals.copy()
height = height[height['result_name'] == 'Height (Inches)']

weight = vitals.copy()
weight = weight[weight['result_name'] == 'Weight (Lbs)']

Evaluate height#

height.head()
subject_id chartdate result_name result_value
0 10011398 2146-12-01 Height (Inches) 63
13 10011398 2147-01-22 Height (Inches) 61.5
20 10014729 2125-02-22 Height (Inches) 65
21 10014729 2125-04-05 Height (Inches) 66
22 10014729 2126-01-24 Height (Inches) 66

First, let’s make sure that Python recognizes height as a continuous variable. If it’s not, we will convert it to type numeric.

# check variable data types
height.dtypes
subject_id              object
chartdate       datetime64[ns]
result_name             object
result_value            object
dtype: object
height['result_value'] = pd.to_numeric(height['result_value'])

Identify the presence of outliers#

As mentioned in an earlier chapter, outliers refer to observations that appear abnormal. Outliers may arise due to data-entry errors or, in this case, mistakes made during the actual vital sign measurements. It’s important to identify and investigate outliers so as to decide how we should handle them (if at all).

Visualizing the distribution of a continuous variable can help identify whether there are potential outliers. We can also use the pandas .describe() method to quickly calculate summary statistics.

height['result_value'].hist()
<Axes: >
[Figure: histogram of height values]
# use describe() to compute summary statistics of the distribution
height['result_value'].describe()
count    378.000000
mean      67.105952
std        5.117295
min        5.000000
25%       63.500000
50%       67.000000
75%       71.000000
max       73.000000
Name: result_value, dtype: float64

Potential red flag: Is there really a patient who might be 5 inches tall? This is clearly a case we should investigate more.

First, we can sort the height values from lowest to highest in order to retrieve the subject_id of the patient who is said to have a height of 5 inches.

Sorting will also help us see whether there are any other patients with unusually low heights.

height.sort_values(['result_value']).head()
subject_id chartdate result_name result_value
1844 10012853 2175-04-05 Height (Inches) 5.0
2883 10002428 2154-10-18 Height (Inches) 58.0
2885 10002428 2155-08-12 Height (Inches) 59.0
2884 10002428 2158-09-13 Height (Inches) 59.0
1203 10000032 2180-05-07 Height (Inches) 60.0

Let’s see if this patient has other recorded height measurements.

height[height['subject_id']=='10012853']
subject_id chartdate result_name result_value
1844 10012853 2175-04-05 Height (Inches) 5.0
1845 10012853 2178-11-04 Height (Inches) 64.0
1846 10012853 2176-06-07 Height (Inches) 64.0
1847 10012853 2177-01-05 Height (Inches) 64.0

Interesting: it appears this patient had an initial observation of 5 inches, but was then consistently measured at 64 inches (5 feet 4 inches) in the following years. It seems even more unlikely that a person would grow that dramatically in such a short timespan. But let’s investigate the variation in measurements for other patients who have more than one height measurement.

For each unique patient with multiple observations, we will calculate the standard deviation of each patient’s height measurements.

Note: ddof=0 indicates that the std for patients with one measurement will be displayed as 0 rather than NaN.

import numpy as np

height['std'] = height.groupby('subject_id')['result_value'].transform('std', ddof=0)
height.sort_values('std', ascending = False)[['subject_id', 'std']].drop_duplicates().head(10)
subject_id std
1845 10012853 25.547749
1095 10005909 2.724312
461 10019917 1.000000
2720 10022880 0.942809
300 10001725 0.862706
0 10011398 0.750000
1265 10037928 0.745356
2729 10004457 0.731247
2884 10002428 0.707107
524 10005348 0.596212

It is obvious that our ‘concerning’ patient has a much higher standard deviation of their height measurements compared to other patients.

Let’s take a look at the two patients with the second and third highest standard deviations for comparison.

height[height['subject_id']=='10005909']
subject_id chartdate result_name result_value std
1094 10005909 2144-10-29 Height (Inches) 65.0 2.724312
1095 10005909 2144-12-10 Height (Inches) 70.0 2.724312
1096 10005909 2145-01-03 Height (Inches) 70.0 2.724312
1102 10005909 2144-11-15 Height (Inches) 72.5 2.724312
height[height['subject_id']=='10019917']
subject_id chartdate result_name result_value std
461 10019917 2182-02-14 Height (Inches) 69.0 1.0
462 10019917 2182-01-09 Height (Inches) 71.0 1.0

It seems clear that the 5-inch observation is an outlier. Perhaps whoever recorded the data mistakenly entered the value in feet instead of inches?

The best way to handle these types of inconsistencies will largely depend on the exact variable, the research question of interest, and domain knowledge. In this case, it may be safe to take the median (64 inches) for the concerning observation, since this was the height consistently recorded over the following years. More generally, here are a couple of considerations for handling apparent outliers:

  • Seek consultation from mentors and domain experts!

  • Consider the age of the patient - height may be expected to change more dramatically over time for some patients than for others. Depending on the patient’s age, it may be fair to take the next measurement that occurred closest in time to the date of the suspect measurement (one possible approach is sketched below).
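Below is a minimal sketch of that second idea: locating the measurement taken closest in time to the suspect 5-inch reading. This is just one option among many, not a recommendation, and it does not modify the data.

# sketch: find the same patient's measurement taken closest in time to the suspect reading
suspect_idx = 1844   # row index of the 5.0-inch observation (visible in the sorted output above)
suspect = height.loc[suspect_idx]
same_patient = height[(height['subject_id'] == suspect['subject_id']) & (height.index != suspect_idx)]
closest = same_patient.loc[(same_patient['chartdate'] - suspect['chartdate']).abs().idxmin()]
closest['result_value']   # the value we might substitute for the outlier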

Handling multiple observations#

Despite the considerations mentioned above, let’s demonstrate how to group by subject_id and take the median height for patients with multiple measurements.

# create a new variable with the median height value for each patient
height['result_value_median'] = height.groupby('subject_id')['result_value'].transform('median')

This next step ensures that we will now only have one unique row per patient.

# assuming we don't need to link by date
height_clean = height[['subject_id', 'result_name', 'result_value_median']].drop_duplicates()
# should have one observation per person
len(height_clean)
61

Let’s review the distribution again - now it looks a bit more normally distributed.

height_clean['result_value_median'].hist()
<Axes: >
[Figure: histogram of median height values]

Unit Conversion Issues#

Another important consideration when wrangling our data involves unit conversions. Errors involving unit conversions may show up as outliers.

Let’s next examine weight measurements.

### pre-process the data
weight['result_value'] = pd.to_numeric(weight['result_value'])

# calculate summary statistics
weight['result_value'].describe()
count    941.000000
mean     173.202763
std       44.009522
min       88.000000
25%      147.300000
50%      169.310000
75%      185.000000
max      296.000000
Name: result_value, dtype: float64
weight['result_value'].hist()
<Axes: >
[Figure: histogram of weight values]

Investigate the variance in measurements between patients with multiple observations#

As we did with height, let’s calculate the standard deviation of weight measurements for patients with multiple weights recorded.

# ddof=0 indicates that the std for patients with one measurement will be displayed as 0 rather than NaN
weight['std'] = weight.groupby('subject_id')['result_value'].transform('std', ddof=0)
weight.sort_values('std', ascending = False)[['subject_id', 'std']].drop_duplicates().head(5)
subject_id std
177 10019385 58.385000
640 10021487 22.758934
1976 10019003 19.732249
1159 10039708 19.442869
2707 10040025 16.104849

Let’s examine the first patient, whose weight measurements have a std of 58.4.

When we look at this patient’s measurements, we may notice that they appear to have lost >110 pounds in < 1 month. Is this likely?

weight[weight['subject_id']=='10019385'].sort_values('chartdate')
subject_id chartdate result_name result_value std
179 10019385 2180-02-15 Weight (Lbs) 214.07 58.385
177 10019385 2180-03-04 Weight (Lbs) 97.30 58.385

A unit conversion error may be a possibility here.

If we do the math, 1 pound ~ 0.45 kg.

If 214.07 is the weight in pounds, the equivalent weight in kilograms is 214.07*0.45 = ~96 kg, which is very close to the value of 97.30 recorded one month later. This suggests the second measurement may have been entered in kilograms even though it was labeled as pounds.
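A small sketch makes this check concrete (using the more precise conversion factor of 0.4536 kg per pound):

# quick arithmetic check of the possible unit mix-up
lbs_to_kg = 0.4536
print(214.07 * lbs_to_kg)   # ~97.1 kg, strikingly close to the later recorded value of 97.30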

Let’s examine the second patient we observed as having a high standard deviation in their weight measurements.

If we look at the measurements sorted chronologically, it looks like this second patient steadily lost weight over the course of about six months before the weight began to increase again. This is plausible if the patient was sick and hospitalized for a long time and then began to recover and regain some of their weight.

weight[weight['subject_id']=='10021487'].sort_values('chartdate')
subject_id chartdate result_name result_value std
644 10021487 2117-01-14 Weight (Lbs) 264.0 22.758934
643 10021487 2117-01-19 Weight (Lbs) 254.0 22.758934
641 10021487 2117-02-10 Weight (Lbs) 227.0 22.758934
638 10021487 2117-05-05 Weight (Lbs) 204.0 22.758934
635 10021487 2117-06-16 Weight (Lbs) 193.0 22.758934
656 10021487 2117-07-07 Weight (Lbs) 196.6 22.758934
636 10021487 2117-08-18 Weight (Lbs) 199.0 22.758934
637 10021487 2117-08-18 Weight (Lbs) 199.0 22.758934
639 10021487 2117-09-09 Weight (Lbs) 208.0 22.758934
657 10021487 2117-12-03 Weight (Lbs) 210.1 22.758934
640 10021487 2118-01-24 Weight (Lbs) 212.0 22.758934
642 10021487 2118-08-01 Weight (Lbs) 241.0 22.758934

While it may seem tedious, it is good practice to get in the habit of really getting to know your data. When specific observations jump out as suspicious, you can perform case studies focused on particular patients to get a better sense of their data. In the case of the patient with rapid weight loss and regain, it may be helpful to reference additional data tables to get a more holistic view of this patient’s time in the hospital.
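As a simple illustration using only the tables we have already loaded, a sketch like the following merges the demographics back onto the weight measurements so that this patient’s age and gender are visible alongside their weights:

# sketch: join demographics onto the weight measurements for a more holistic view of one patient
weight_with_demo = weight.merge(patients, on='subject_id', how='left')
weight_with_demo[weight_with_demo['subject_id'] == '10021487'][
    ['subject_id', 'chartdate', 'result_value', 'gender', 'anchor_age']
].sort_values('chartdate')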

Non-standardized labels#

Next, we will use a third MIMIC table, labs.csv, to demonstrate measurements that have slightly different label names.

In this modified labs data, we have different types of hemoglobin and heparin laboratory (blood test) measurements (as noted in the label column).

labs = pd.read_csv('../../data/mimic_labs.csv')
labs.head()
subject_id label value valueuom
0 10000032 Hemoglobin 11.9 g/dL
1 10000032 MCHC 34.5 %
2 10001217 Hemoglobin 11.8 g/dL
3 10001217 MCHC 34.4 %
4 10001217 MCHC 32.4 %

Let’s take a look at the number of unique ways these labs may be represented. If we are not domain experts, we might not know whether there are differences in these blood tests.

For instance, if we are interested in examining hemoglobin levels, we could search the labs DataFrame for the string hemoglobin. When we do so, we’ll notice that there are many different variations of labs with “hemoglobin” in the name. Are all these the same? Sometimes the units (valueuom) can help us determine this. Other times, we may need to compare distributions or consult domain experts as to which are the proper variables to include.

labs[labs['label'].str.contains('hemoglobin', case = False)][['label', 'valueuom']].value_counts()
label              valueuom
Hemoglobin         g/dL        159
% Hemoglobin A1c   %            49
Carboxyhemoglobin  %             7
Methemoglobin      %             6
Hemoglobin C       %             1
Name: count, dtype: int64

Let’s compare the distributions of the measurements labeled Hemoglobin (measured in g/dL) and % Hemoglobin A1c. Although the units differ, the distributions seem fairly similar. This may warrant review by a clinical domain expert to confirm whether these lab measurements can be combined.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

labs[labs['label'] == 'Hemoglobin'].hist(ax=axes[0])
axes[0].set_title('Hemoglobin (g/dL)')
axes[0].set_ylabel('Frequency of Measurement')
axes[0].set_xlabel('Measurement value (g/dL)')

labs[labs['label'] == '% Hemoglobin A1c'].hist(ax=axes[1])
axes[1].set_title('% Hemoglobin A1c')
axes[1].set_ylabel('Frequency of Measurement')
axes[1].set_xlabel('Measurement value (%)')

plt.tight_layout()
plt.show()
[Figure: side-by-side histograms of Hemoglobin (g/dL) and % Hemoglobin A1c measurements]

Missing Data#

Data wrangling is also necessary in the case of missing data.

There are many reasons why data may be missing, and a full review of these is beyond the scope of this textbook. However, common reasons for missing data include:

  • Data entry problems

  • A measurement was not recorded, whether due to oversight or because there was a reason it wasn’t needed

Some common ways to handle missing data:

  • Remove those measurements

  • Impute the missing value using the mean or median

  • Forward/backward filling

Rule of thumb: Talk to your domain experts/collaborators!

Let’s examine a simple, one patient case, for how we might work with missing blood pressure measurements. The following patient has multiple blood pressure measurements recorded.

bp_test = bp_measurements[bp_measurements['subject_id']=='10035631'][['chartdate', 'systolic']].sort_values('chartdate')

Let’s focus on a few time points when the patient had multiple measurements taken sequentially except for one day (2116-02-23).

bp_test = bp_test[(bp_test['chartdate'] >= '2116-02-16') & (bp_test['chartdate'] <='2116-02-24')]
bp_test
chartdate systolic
1826 2116-02-16 125
1760 2116-02-17 106
1777 2116-02-18 110
1796 2116-02-19 115
1748 2116-02-20 103
1785 2116-02-21 111
1799 2116-02-22 116
1753 2116-02-24 104

How might we handle missing values in this case?#

One common way to handle missing values is to impute the missing value. Imputation refers to the process of filling in missing values. A simple method of imputation might include using the mean or the median value of a series of measurements in place of the missing value.

In this case, let’s demonstrate how we might use median imputation to replace the missing value.

First, we can see that there is not a row for 2116-02-23. Let’s use pd.date_range() to create a complete series of dates. We can then re-index the original data so that we have a series of sequential dates.

full_range = pd.date_range(
    start='2116-02-16',
    end='2116-02-24',
    freq='D'
)
full_range
DatetimeIndex(['2116-02-16', '2116-02-17', '2116-02-18', '2116-02-19',
               '2116-02-20', '2116-02-21', '2116-02-22', '2116-02-23',
               '2116-02-24'],
              dtype='datetime64[ns]', freq='D')
# Reindex 
bp_test = (
    bp_test
    .set_index('chartdate')
    .reindex(full_range)
)

bp_test
systolic
2116-02-16 125.0
2116-02-17 106.0
2116-02-18 110.0
2116-02-19 115.0
2116-02-20 103.0
2116-02-21 111.0
2116-02-22 116.0
2116-02-23 NaN
2116-02-24 104.0

After re-indexing, we have a complete set of dates. You will notice that there is a NaN next to the date that was originally missing a measurement. Next, we can use fillna() to replace the missing value by the median of the entire series.

# Impute the missing value with the median
bp_test_median = bp_test.fillna(bp_test.median())

# reset index to add `chartdate` as its own column
bp_test_median = bp_test_median.reset_index().rename(columns={'index': 'chartdate'})

bp_test_median
chartdate systolic
0 2116-02-16 125.0
1 2116-02-17 106.0
2 2116-02-18 110.0
3 2116-02-19 115.0
4 2116-02-20 103.0
5 2116-02-21 111.0
6 2116-02-22 116.0
7 2116-02-23 110.5
8 2116-02-24 104.0

We could also impute by forward filling, which is an imputation technique referring to when the last observation is carried forward to replace a missing value.

# Impute the missing value by carrying the last observation forward (forward filling)
bp_test_forward = bp_test.ffill()

# reset index to add `chartdate` as its own column
bp_test_forward = bp_test_forward.reset_index().rename(columns={'index': 'chartdate'})

bp_test_forward
chartdate systolic
0 2116-02-16 125.0
1 2116-02-17 106.0
2 2116-02-18 110.0
3 2116-02-19 115.0
4 2116-02-20 103.0
5 2116-02-21 111.0
6 2116-02-22 116.0
7 2116-02-23 116.0
8 2116-02-24 104.0

Warning

It’s always important to think critically about the best ways to impute any missing data. We demonstrated two simple methods (median/mean imputation and forward filling), but there are entire books written on different imputation techniques. As mentioned throughout this chapter, discussions with your collaborators and domain experts are important for identifying the best imputation method.

If imputation is used, it’s important to be careful that the methods we use do not introduce bias. For example, mean imputation may not be preferred if the data is highly skewed.

Another imputation method, called backward filling, refers to when a missing value is replaced by carrying the next observation backwards. As you can imagine, this can be problematic when we are using time series data to predict some future outcome. We may be making our task much easier owing to data leakage, as you learned in the chapter on feature engineering.
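For comparison, here is a sketch of backward filling applied to the same small example. Note that the value recorded on 2116-02-24 would be copied back to 2116-02-23, which is exactly the kind of peeking ahead that can leak future information:

# Impute the missing value by carrying the next observation backwards (backward filling)
bp_test_backward = bp_test.bfill()

# reset index to add `chartdate` as its own column
bp_test_backward = bp_test_backward.reset_index().rename(columns={'index': 'chartdate'})

bp_test_backward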

Conclusions#

This chapter aimed to provide you with a real-world dataset that benefited from data wrangling. As you can see, the process of data wrangling is highly dependent on your dataset. In many cases, there is also not one correct way to go about these processes. What’s important is to deeply consider what our data represents, how it’s structured, and the downstream implications for our analysis if we do or do not prepare our data appropriately. Data wrangling is also a great opportunity to work deeply with your collaborators (domain experts!) to ensure each step of the process is grounded in practical utility and is reflective of how the data was generated, collected, and how it’s meant to be analyzed.

[^*] Johnson, A., Bulgarelli, L., Pollard, T., Gow, B., Moody, B., Horng, S., Celi, L. A., & Mark, R. (2024). MIMIC-IV (version 3.1). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/kpb9-mt58

[^**] Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV Clinical Database Demo (version 2.2). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/dp1f-ex47