27.3. How to begin Wrangling: Get to know your Data!#

Whenever you begin working with a new dataset, it’s essential that you get to know your data! To get started, there are a few questions to consider:

  • Where did the data come from and how was it collected?

  • What data types are my variables (numeric, string, date, etc)?

  • Does Python recognize these variables as their correct data types?

  • Are there outliers in our data?

  • How should we handle missing data?

The answers to these questions are just a few of the factors that will influence the approaches you take to wrangle your data.

Note

Data wrangling may also be referred to as data cleaning. By wrangling our data, we help ensure that it will be clean for later analysis.

Case Study of Data Wrangling#

In this chapter, we will consider these questions as we get to know the MIMIC-IV Dataset [^*] and demonstrate the steps needed to ensure that it can be used for downstream analyses.

Where did the data come from and how was it collected?#

The MIMIC-IV dataset contains data from >220,000 patients who were admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts between 2008 and 2022. The dataset contains the information that was entered into a patient’s medical record during the time they were admitted to the hospital.

For this case study, we will rely on MIMIC’s demo dataset [^**], which has been made publicly available and contains data from a random subset of 100 hospitalized patients.

Note

A note on the data used for this case study

Due to the sensitive nature of medical data, the full MIMIC-IV dataset requires special approval to work with. The demo dataset has been specially prepared for educational purposes. Even though this is an abbreviated dataset, the amount of medical data is so voluminous that we will only be using a subset of the demo dataset for the purposes of this case study.

What does this data look like?#

The MIMIC-IV dataset contains many separate data tables (structured as csv files).

To begin, we will import just one of the csv files, which concerns the demographics of the patients included in the study. We will use pd.read_csv() to import the data as a DataFrame.

import pandas as pd

patients = pd.read_csv('../../data/mimic_patients.csv')

Determine Data Types#

Now that we’ve loaded in our data, we can start to explore what type of data we have.

The use of the method .head() can be a good first choice to get a glimpse of the first few rows of the DataFrame.

We see that we have three columns: subject_id (unique patient identifier), gender, and anchor_age (age of patient).

patients.head()
subject_id gender anchor_age
0 10014729 F 21
1 10003400 F 72
2 10002428 F 80
3 10032725 F 38
4 10027445 F 48

Note

Side-Note: Data Dictionaries and Codebooks

The creators of MIMIC have curated excellent documentation detailing the process of data collection, dataset construction, and detailed information about what each column (i.e. variable) of the dataset represents.

As mentioned, at its core, the MIMIC dataset is built from the compilation of numerous tables (csv files), each containing different types of data.

When working with a new dataset, it’s good practice to determine whether there is a data dictionary or codebook which might tell us what each variable represents and what data types they are. For example, we can find information about the patients table in the documentation’s patients table page.

One useful fact we can learn about the patient’s data is that if a patient’s anchor_age is >89, the value becomes capped at 91. This is done to preserve patient confidentiality.

Since .head() only showed us the first 5 rows of data, let’s run some code to evaluate how many rows we have.

len(patients)
100

We should also check that subject_id is in fact unique - if it is, the number of unique values for subject_id should equal the number of rows in our dataframe.

patients['subject_id'].nunique()
100
# we could also automate this check
patients['subject_id'].nunique() == len(patients)
True

Next, we might want to check how Python interpreted these data types.

patients.dtypes
subject_id     int64
gender        object
anchor_age     int64
dtype: object

Should subject_id really be considered an integer? We are informed by the dataset curators that this variable is a unique subject (or patient) identifier.

Ask yourself: would adding two subject_id values make sense?

If this is a unique identifier, we can think of this variable almost like a person’s name. In this case, subject_id is a categorical variable.

Let’s convert this column to a string to make sure we don’t accidentally include it as a variable in subsequent numerical computations.

# convert `subject_id` to a string
patients['subject_id'] = patients['subject_id'].astype('str')

# check that the data has been converted to type str successfully
patients.dtypes
subject_id    object
gender        object
anchor_age     int64
dtype: object

From this step, we also see that gender is of type object and anchor_age is of type integer. These seem reasonable!

Examine the values and distributions of each variable#

To understand our data even more, we can inspect the actual values of our variables. For example, what categorical labels is the gender variable composed of?

patients['gender'].value_counts()
gender
M    57
F    43
Name: count, dtype: int64

For numerical variables, we can examine the distribution of values by calculating summary statistics.

patients['anchor_age'].describe()
count    100.00000
mean      61.75000
std       16.16979
min       21.00000
25%       51.75000
50%       63.00000
75%       72.00000
max       91.00000
Name: anchor_age, dtype: float64

Patients in our data have a mean age of 61.8. The youngest and oldest patients are 21 and 91 years old, respectively. Recall from the documentation that all ages > 89 years old are capped at 91, so this is a nice sanity check that our maximum age is as expected.
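If we want an explicit check, a quick sketch like the following confirms that every anchor_age above 89 in our subset is indeed recorded as 91:

# sanity check (sketch): ages above 89 should all have been capped at 91
patients.loc[patients['anchor_age'] > 89, 'anchor_age'].unique()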

Better yet, for a quick glance of the distribution, we can also use a histogram:

patients['anchor_age'].hist()
<Axes: >
[Figure: histogram of anchor_age values]

Checking the distribution of numerical values is important as it can help us quickly see if there is an unexpected observation in our data. For instance, perhaps we were told our dataset should only include adults, but we saw some ages that were < 10 years old. Or maybe someone is said to be 150 years old! These would necessitate further exploration to determine whether there are any data entry issues. Perhaps a busy doctor mistyped the age or maybe age was confused with a different variable.

Incorporating more data!#

As previously mentioned, MIMIC contains multiple tables worth of data. Next, let’s look at a second file: vitals.csv which contains vital sign measurements for patients.

vitals = pd.read_csv('../../data/mimic_vitals.csv')

Let’s repeat the previously described steps to help us get to know our new vitals DataFrame.

vitals.head()
subject_id chartdate result_name result_value
0 10011398 2146-12-01 Height (Inches) 63
1 10011398 2147-01-22 Weight (Lbs) 127
2 10011398 2146-12-01 Weight (Lbs) 135
3 10011398 2147-07-24 Weight (Lbs) 136
4 10011398 2147-03-26 Weight (Lbs) 136

Once again, we see that the subject_id column is present, in addition to three new columns: chartdate, result_name, and result_value.

Let’s see if the number of unique subject_ids match the number of rows:

vitals['subject_id'].nunique() == len(vitals)
False
len(vitals)
2964

In this case, it appears that there are many more observations in the vitals DataFrame (note: this was also apparent just from looking at the first 5 rows: the same subject_id appears in each row!).

To get a better idea of what is going on here, we can examine the specific types of categories in the result_name column. When we do, we learn that this variable annotates vital sign measurements including weight, height, body mass index (BMI), and blood pressure.
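One way to do this is with .value_counts() - a quick sketch using the vitals DataFrame we just loaded:

# count how often each vital sign category appears
vitals['result_name'].value_counts()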

Since this dataset represents hospitalized patients, it is reasonable that each patient may have each vital measured several times throughout their hospital stay.

We can confirm this by using .groupby() on the subject_id and result_name columns to see how many unique results (result_value) each patient had.

vitals.groupby(['subject_id', 'result_name'])[['result_value']].count()
result_value
subject_id result_name
10000032 BMI (kg/m2) 8
Blood Pressure 6
Height (Inches) 2
Weight (Lbs) 25
10001217 BMI (kg/m2) 1
... ... ...
10039997 Weight (Lbs) 5
10040025 BMI (kg/m2) 10
Blood Pressure 1
Height (Inches) 1
Weight (Lbs) 10

268 rows × 1 columns

As we did before, let’s also check the data types:

vitals.dtypes
subject_id       int64
chartdate       object
result_name     object
result_value    object
dtype: object

First, we can convert subject_id to a string again.

vitals['subject_id'] = vitals['subject_id'].astype('str')

Working with Dates#

Next, we see that chartdate is an object. We can convert dates to a special datetime data type in Python. This specification will allow us to perform operations between dates (e.g., finding the number of days between two blood pressure measurements).

vitals['chartdate'] = pd.to_datetime(vitals['chartdate'])
print(vitals.dtypes)
vitals.head()
subject_id              object
chartdate       datetime64[ns]
result_name             object
result_value            object
dtype: object
subject_id chartdate result_name result_value
0 10011398 2146-12-01 Height (Inches) 63
1 10011398 2147-01-22 Weight (Lbs) 127
2 10011398 2146-12-01 Weight (Lbs) 135
3 10011398 2147-07-24 Weight (Lbs) 136
4 10011398 2147-03-26 Weight (Lbs) 136

Note

Important Note about Dates

chartdate is the date the vital sign was recorded. You’ll notice that these years are in the future. Date shifting was performed to add an additional measure of privacy (as you may recall from Pillar 2: Data Privacy). To protect patients’ sensitive data, dates were shifted forward by a random amount for each patient, but consistently within a patient (for example, if two blood pressure measurements were taken on back-to-back days, the shifted dates might look like 2067-01-01 and 2067-01-02). Knowing whether a dataset’s dates are accurate or have been shifted is crucial if you’re trying to draw conclusions that depend on the actual date. For example, let’s say a patient was admitted with an unknown but severe respiratory infection in 1990, but had their dates shifted to 2020 - it would be completely erroneous to conclude they may have had COVID-19 (considering COVID-19 did not even exist in the 1990s!).

Now that chartdate has been converted to a datetime object, we can perform date-specific types of operations.

For instance, for each patient, let’s order rows in chronological order.

vitals_sorted = vitals.sort_values(by=['subject_id', 'chartdate'], ascending=[True, True])
vitals_sorted.head(10)
subject_id chartdate result_name result_value
1204 10000032 2180-04-27 Weight (Lbs) 94
1241 10000032 2180-04-27 Blood Pressure 110/65
1203 10000032 2180-05-07 Height (Inches) 60
1207 10000032 2180-05-07 BMI (kg/m2) 18.0
1218 10000032 2180-05-07 Weight (Lbs) 92.15
1219 10000032 2180-05-07 Weight (Lbs) 92.15
1220 10000032 2180-05-07 Weight (Lbs) 92.15
1221 10000032 2180-05-07 Weight (Lbs) 92.15
1222 10000032 2180-05-07 Weight (Lbs) 92.15
1223 10000032 2180-05-07 Weight (Lbs) 92.15

We can also calculate the difference between a patient’s earliest measurement and most recent measurement. We can do this by first aggregating the min and max dates using the .agg() method and then calculating the difference in the number of days between the max and min dates using the pandas .dt.days attribute.

date_range = vitals_sorted.groupby('subject_id')['chartdate'].agg(['min', 'max'])
date_range['day_diff'] = (date_range['max'] - date_range['min']).dt.days
date_range.sort_values('day_diff', ascending = True)
min max day_diff
subject_id
10009049 2174-07-20 2174-07-20 0
10009628 2153-10-21 2153-10-21 0
10007818 2146-06-10 2146-06-10 0
10010471 2155-05-08 2155-05-08 0
10008454 2110-12-29 2110-12-29 0
... ... ... ...
10020306 2128-11-28 2135-08-20 2456
10004457 2141-10-30 2149-01-19 2638
10019003 2148-08-25 2155-11-23 2646
10015860 2186-10-14 2194-03-29 2723
10037928 2175-03-16 2183-12-25 3206

79 rows × 3 columns

Just from this sample we can see that some patients have years between measurements. This is reasonable since patients in MIMIC can be hospitalized more than once.

Other ways we could work with the datetime objects are by separating out the year, month, and day.

For example, if we wanted to create unique columns for year, month, and day, we could do the following:

vitals_sorted['year'] = vitals_sorted['chartdate'].dt.year
vitals_sorted['month'] = vitals_sorted['chartdate'].dt.month
vitals_sorted['day'] = vitals_sorted['chartdate'].dt.day

vitals_sorted
subject_id chartdate result_name result_value year month day
1204 10000032 2180-04-27 Weight (Lbs) 94 2180 4 27
1241 10000032 2180-04-27 Blood Pressure 110/65 2180 4 27
1203 10000032 2180-05-07 Height (Inches) 60 2180 5 7
1207 10000032 2180-05-07 BMI (kg/m2) 18.0 2180 5 7
1218 10000032 2180-05-07 Weight (Lbs) 92.15 2180 5 7
... ... ... ... ... ... ... ...
2689 10040025 2147-12-29 Weight (Lbs) 212 2147 12 29
2696 10040025 2147-12-29 BMI (kg/m2) 34.2 2147 12 29
2692 10040025 2147-12-30 BMI (kg/m2) 30.3 2147 12 30
2704 10040025 2147-12-30 Weight (Lbs) 187.61 2147 12 30
2703 10040025 2148-01-19 Blood Pressure 112/60 2148 1 19

2964 rows × 7 columns

Data Wrangling Strings to Numbers#

You may have noticed that result_value was not recognized as a numerical variable. Perhaps there are non-numeric values in this column? We can check this using the following expression to see which observations contain non-numeric characters:

# return the rows where there are non-numeric characters in `result_value`
vitals[vitals['result_value'].astype(str).str.contains(r'[^0-9]')][['result_name', 'result_value']].drop_duplicates()
result_name result_value
8 BMI (kg/m2) 23.6
9 BMI (kg/m2) 23.9
10 BMI (kg/m2) 25.3
12 BMI (kg/m2) 25.8
13 Height (Inches) 61.5
... ... ...
2957 Blood Pressure 152/94
2958 Blood Pressure 153/94
2961 Blood Pressure 111/72
2962 Blood Pressure 135/76
2963 Weight (Lbs) 252.65

1282 rows × 2 columns

You’ll notice that when result_name == Blood Pressure, the results contain a ‘/’ to separate systolic from diastolic blood pressure.

If we were interested in investigating the blood pressure of these patients, we’d need to be mindful that Python will not allow us to simply average blood pressure values without further data wrangling to separate the numbers from the non-numeric character.

# create blood pressure dataframe
bp_measurements = vitals.copy()

bp_measurements = bp_measurements[bp_measurements['result_name'] == 'Blood Pressure']
bp_measurements.head()
subject_id chartdate result_name result_value
14 10011398 2146-12-01 Blood Pressure 110/70
15 10011398 2147-03-26 Blood Pressure 112/80
16 10011398 2146-06-02 Blood Pressure 128/84
17 10011398 2147-07-24 Blood Pressure 132/84
18 10011398 2146-05-28 Blood Pressure 138/90

In the code below, we can use str.split() to help us separate systolic (number before /) and diastolic (number after /) blood pressure. Once we remove the /, we can also convert our new columns to integers.

bp_measurements[['systolic', 'diastolic']] = bp_measurements['result_value'].str.split('/', expand=True).astype(int)
print(bp_measurements.dtypes)
bp_measurements.head()
subject_id              object
chartdate       datetime64[ns]
result_name             object
result_value            object
systolic                 int64
diastolic                int64
dtype: object
subject_id chartdate result_name result_value systolic diastolic
14 10011398 2146-12-01 Blood Pressure 110/70 110 70
15 10011398 2147-03-26 Blood Pressure 112/80 112 80
16 10011398 2146-06-02 Blood Pressure 128/84 128 84
17 10011398 2147-07-24 Blood Pressure 132/84 132 84
18 10011398 2146-05-28 Blood Pressure 138/90 138 90

Examine the distribution of continuous variables#

Let’s explore the distribution of our now-cleaned systolic and diastolic variables.

bp_measurements['systolic'].hist()
<Axes: >
[Figure: histogram of systolic blood pressure values]
bp_measurements['diastolic'].hist()
<Axes: >
[Figure: histogram of diastolic blood pressure values]

From these distributions, there do not appear to be any extreme outliers. However, if our primary analysis were to focus on blood pressure, we would likely want to consult a medical domain expert to ensure these distributions seem reasonable. Importantly, we would want to consider whether the systolic and diastolic blood pressure values we see at either end of the distributions are biologically plausible.
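As a starting point for that conversation, a sketch like the following flags readings that fall outside rough bounds. The thresholds below are placeholders chosen purely for illustration, not clinical cutoffs:

# hypothetical plausibility bounds, for illustration only - real cutoffs should come from a clinician
implausible = bp_measurements[
    (bp_measurements['systolic'] < 60) | (bp_measurements['systolic'] > 250) |
    (bp_measurements['diastolic'] < 30) | (bp_measurements['diastolic'] > 150)
]
len(implausible)  # number of readings that merit a closer look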

Note

We refer to domain experts as those who have specialized knowledge in a field (i.e. domain) of interest. For example, domain experts might include medical doctors, physicists, CEOs, lawyers, etc. Often, domain experts might not know how to code or analyze data. However, domain experts can help clarify how the data was collected and what each variable means in the context of the particular domain. This can help ensure robust data wrangling.

Filtering Observations of Interest#

Perhaps we are only interested in subjects who have height and weight measurements. We can create a new copy of the vitals dataframe and filter it to keep those observations of interest.

Note: from result_name, we see that height and weight are recorded in inches and pounds (lbs), respectively.

height = vitals.copy()
height = height[height['result_name'] == 'Height (Inches)']

weight = vitals.copy()
weight = weight[weight['result_name'] == 'Weight (Lbs)']

Evaluate height#

height.head()
subject_id chartdate result_name result_value
0 10011398 2146-12-01 Height (Inches) 63
13 10011398 2147-01-22 Height (Inches) 61.5
20 10014729 2125-02-22 Height (Inches) 65
21 10014729 2125-04-05 Height (Inches) 66
22 10014729 2126-01-24 Height (Inches) 66

First, let’s make sure that Python recognizes height as a continuous variable. If it’s not, we will convert it to type numeric.

# check variable data types
height.dtypes
subject_id              object
chartdate       datetime64[ns]
result_name             object
result_value            object
dtype: object
height['result_value'] = pd.to_numeric(height['result_value'])

Identify the presence of outliers#

As mentioned in an earlier chapter, outliers refer to observations that appear abnormal. Outliers may arise due to data-entry errors or, in this case, mistakes made during the actual vital sign measurements. It’s important to identify and investigate outliers so as to decide how we should handle them (if at all).

Visualizing the distribution of a continuous variable can help identify whether there are potential outliers. We can also use the pandas .describe() method to quickly calculate summary statistics.

height['result_value'].hist()
<Axes: >
[Figure: histogram of height values]
# use describe() to compute summary statistics of the distribution
height['result_value'].describe()
count    378.000000
mean      67.105952
std        5.117295
min        5.000000
25%       63.500000
50%       67.000000
75%       71.000000
max       73.000000
Name: result_value, dtype: float64

Potential red flag: Is there really a patient who might be 5 inches tall? This is clearly a case we should investigate more.

First, we can sort the height values from lowest to highest in order to retrieve the subject_id of the patient who is said to have a height of 5 inches.

Sorting will also help us see whether there are any other patients with unusually low heights.

height.sort_values(['result_value']).head()
subject_id chartdate result_name result_value
1844 10012853 2175-04-05 Height (Inches) 5.0
2883 10002428 2154-10-18 Height (Inches) 58.0
2885 10002428 2155-08-12 Height (Inches) 59.0
2884 10002428 2158-09-13 Height (Inches) 59.0
1203 10000032 2180-05-07 Height (Inches) 60.0

Let’s see if this patient has other recorded height measurements.

height[height['subject_id']=='10012853']
subject_id chartdate result_name result_value
1844 10012853 2175-04-05 Height (Inches) 5.0
1845 10012853 2178-11-04 Height (Inches) 64.0
1846 10012853 2176-06-07 Height (Inches) 64.0
1847 10012853 2177-01-05 Height (Inches) 64.0

Interesting: it appears this patient had an initial observation of 5 inches, but was then consistently measured at 64 inches (5 feet 4 inches) in the following years. It seems even more unlikely that a person would grow that dramatically in such a short timespan. But let’s investigate the variation in measurements for other patients who have more than one height measurement.

For each unique patient with multiple observations, we will calculate the standard deviation of each patient’s height measurements.

Note: ddof=0 indicates that the std for patients with one measurement will be displayed as 0 rather than NaN.

import numpy as np

height['std'] = height.groupby('subject_id')['result_value'].transform('std', ddof=0)
height.sort_values('std', ascending = False)[['subject_id', 'std']].drop_duplicates().head(10)
subject_id std
1845 10012853 25.547749
1095 10005909 2.724312
461 10019917 1.000000
2720 10022880 0.942809
300 10001725 0.862706
0 10011398 0.750000
1265 10037928 0.745356
2729 10004457 0.731247
2884 10002428 0.707107
524 10005348 0.596212

It is obvious that our ‘concerning’ patient has a much higher standard deviation of their height measurements compared to other patients.

Let’s take a look at the two patients with the second and third highest standard deviations for comparison.

height[height['subject_id']=='10005909']
subject_id chartdate result_name result_value std
1094 10005909 2144-10-29 Height (Inches) 65.0 2.724312
1095 10005909 2144-12-10 Height (Inches) 70.0 2.724312
1096 10005909 2145-01-03 Height (Inches) 70.0 2.724312
1102 10005909 2144-11-15 Height (Inches) 72.5 2.724312
height[height['subject_id']=='10019917']
subject_id chartdate result_name result_value std
461 10019917 2182-02-14 Height (Inches) 69.0 1.0
462 10019917 2182-01-09 Height (Inches) 71.0 1.0

It seems clear that the 5-inch observation is an outlier. Perhaps whoever recorded the data mistakenly entered the value in feet instead of inches?

The best way to handle these types of inconsistencies will largely depend on the exact variable, the research question of interest, and domain knowledge. In this case, it may be safe to take the median (64 inches) for the concerning observation, since this was the height consistently recorded over the following years. More generally, here are a couple of considerations for handling apparent outliers:

  • Seek consultation from mentors and domain experts!

  • Consider the age of the patient - height may be expected to change more dramatically over time for some patients than for others. Depending on the patient’s age, it may be fair to take the next measurement that occurred closest in time to the date of the suspect measurement (one possible approach is sketched below).
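Below is a minimal sketch of that second idea: locating the measurement taken closest in time to the suspect 5-inch reading. This is just one option among many, not a recommendation, and it does not modify the data.

# sketch: find the same patient's measurement taken closest in time to the suspect reading
suspect_idx = 1844   # row index of the 5.0-inch observation (visible in the sorted output above)
suspect = height.loc[suspect_idx]
same_patient = height[(height['subject_id'] == suspect['subject_id']) & (height.index != suspect_idx)]
closest = same_patient.loc[(same_patient['chartdate'] - suspect['chartdate']).abs().idxmin()]
closest['result_value']   # the value we might substitute for the outlier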

Handling multiple observations#

Despite the considerations mentioned above, let’s demonstrate how to group by subject_id and take the median height for patients with multiple measurements.

# create a new variable with the median height value for each patient
height['result_value_median'] = height.groupby('subject_id')['result_value'].transform('median')

This next step ensures that we will now only have one unique row per patient.

# assuming we don't need to link by date
height_clean = height[['subject_id', 'result_name', 'result_value_median']].drop_duplicates()
# should have one observation per person
len(height_clean)
61

Let’s review the distribution again - now it looks a bit more normally distributed.

height_clean['result_value_median'].hist()
<Axes: >
[Figure: histogram of median height values]

Unit Conversion Issues#

Another important consideration when wrangling our data involves unit conversions. Errors involving unit conversions may show up as outliers.

Let’s next examine weight measurements.

### pre-process the data
weight['result_value'] = pd.to_numeric(weight['result_value'])

# calculate summary statistics
weight['result_value'].describe()
count    941.000000
mean     173.202763
std       44.009522
min       88.000000
25%      147.300000
50%      169.310000
75%      185.000000
max      296.000000
Name: result_value, dtype: float64
weight['result_value'].hist()
<Axes: >
[Figure: histogram of weight values]

Investigate the variance in measurements between patients with multiple observations#

As we did with height, let’s calculate the standard deviation of weight measurements for patients with multiple weights recorded.

# ddof=0 indicates that the std for patients with one measurement will be displayed as 0 rather than NaN
weight['std'] = weight.groupby('subject_id')['result_value'].transform('std', ddof=0)
weight.sort_values('std', ascending = False)[['subject_id', 'std']].drop_duplicates().head(5)
subject_id std
177 10019385 58.385000
640 10021487 22.758934
1976 10019003 19.732249
1159 10039708 19.442869
2707 10040025 16.104849

Let’s examine the first patient, whose weight measurements have a std of 58.4.

When we look at this patient’s measurements, we may notice that they appear to have lost >110 pounds in < 1 month. Is this likely?

weight[weight['subject_id']=='10019385'].sort_values('chartdate')
subject_id chartdate result_name result_value std
179 10019385 2180-02-15 Weight (Lbs) 214.07 58.385
177 10019385 2180-03-04 Weight (Lbs) 97.30 58.385

A unit conversion error may be a possibility here.

If we do the math, 1 pound ~ 0.45 kg.

If 214.07 is the weight in pounds, the equivalent weight in kilograms is 214.07*0.45 = ~96 kg, which is very close to the value of 97.30 recorded one month later. This suggests the second measurement may have been entered in kilograms even though it was labeled as pounds.
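A small sketch makes this check concrete (using the more precise conversion factor of 0.4536 kg per pound):

# quick arithmetic check of the possible unit mix-up
lbs_to_kg = 0.4536
print(214.07 * lbs_to_kg)   # ~97.1 kg, strikingly close to the later recorded value of 97.30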

Let’s examine the second patient we observed as having a high standard deviation in their weight measurements.

If we look at the measurements sorted chronologically, it looks like this second patient steadily lost weight over the course of about six months before the weight began to increase again. This is plausible if the patient was sick and hospitalized for a long time and then began to recover and regain some of their weight.

weight[weight['subject_id']=='10021487'].sort_values('chartdate')
subject_id chartdate result_name result_value std
644 10021487 2117-01-14 Weight (Lbs) 264.0 22.758934
643 10021487 2117-01-19 Weight (Lbs) 254.0 22.758934
641 10021487 2117-02-10 Weight (Lbs) 227.0 22.758934
638 10021487 2117-05-05 Weight (Lbs) 204.0 22.758934
635 10021487 2117-06-16 Weight (Lbs) 193.0 22.758934
656 10021487 2117-07-07 Weight (Lbs) 196.6 22.758934
636 10021487 2117-08-18 Weight (Lbs) 199.0 22.758934
637 10021487 2117-08-18 Weight (Lbs) 199.0 22.758934
639 10021487 2117-09-09 Weight (Lbs) 208.0 22.758934
657 10021487 2117-12-03 Weight (Lbs) 210.1 22.758934
640 10021487 2118-01-24 Weight (Lbs) 212.0 22.758934
642 10021487 2118-08-01 Weight (Lbs) 241.0 22.758934

While it may seem tedious, it is good practice to get in the habit of really getting to know your data. When specific observations jump out as suspicious, you can perform case studies focused on particular patients to get a better sense of their data. In the case of the patient with rapid weight loss and regain, it may be helpful to reference additional data tables to get a more holistic view of this patient’s time in the hospital.
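As a simple illustration using only the tables we have already loaded, a sketch like the following merges the demographics back onto the weight measurements so that this patient’s age and gender are visible alongside their weights:

# sketch: join demographics onto the weight measurements for a more holistic view of one patient
weight_with_demo = weight.merge(patients, on='subject_id', how='left')
weight_with_demo[weight_with_demo['subject_id'] == '10021487'][
    ['subject_id', 'chartdate', 'result_value', 'gender', 'anchor_age']
].sort_values('chartdate')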

Non-standardized labels#

Next, we will use a third MIMIC table, labs.csv, to demonstrate measurements that have slightly different label names.

In this modified labs data, we have different types of hemoglobin and heparin laboratory (blood test) measurements (as noted in the label column).

labs = pd.read_csv('../../data/mimic_labs.csv')
labs.head()
subject_id label value valueuom
0 10000032 Hemoglobin 11.9 g/dL
1 10000032 MCHC 34.5 %
2 10001217 Hemoglobin 11.8 g/dL
3 10001217 MCHC 34.4 %
4 10001217 MCHC 32.4 %

Let’s take a look at the number of unique ways these labs may be represented. If we are not domain experts, we might not know whether there are differences in these blood tests.

For instance, if we are interested in examining hemoglobin levels, we could search the labs DataFrame for the string hemoglobin. When we do so, we’ll notice that there are many different variations of labs with “hemoglobin” in the name. Are all these the same? Sometimes the units (valueuom) can help us determine this. Other times, we may need to compare distributions or consult domain experts as to which are the proper variables to include.

labs[labs['label'].str.contains('hemoglobin', case = False)][['label', 'valueuom']].value_counts()
label              valueuom
Hemoglobin         g/dL        159
% Hemoglobin A1c   %            49
Carboxyhemoglobin  %             7
Methemoglobin      %             6
Hemoglobin C       %             1
Name: count, dtype: int64

Let’s compare the distributions of the measurements labeled Hemoglobin (measured in g/dL) and % Hemoglobin A1c. Although the units differ, the distributions seem fairly similar. This may warrant review by a clinical domain expert to confirm whether these lab measurements can be combined.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

labs[labs['label'] == 'Hemoglobin'].hist(ax=axes[0])
axes[0].set_title('Hemoglobin (g/dL)')
axes[0].set_ylabel('Frequency of Measurement')
axes[0].set_xlabel('Measurement value (g/dL)')

labs[labs['label'] == '% Hemoglobin A1c'].hist(ax=axes[1])
axes[1].set_title('% Hemoglobin A1c')
axes[1].set_ylabel('Frequency of Measurement')
axes[1].set_xlabel('Measurement value (%)')

plt.tight_layout()
plt.show()
[Figure: side-by-side histograms of Hemoglobin (g/dL) and % Hemoglobin A1c measurements]

Missing Data#

Data wrangling is also necessary in the case of missing data.

There are many reasons why data may be missing, and a full review of these is beyond the scope of this textbook. However, common reasons for missing data include:

  • Data entry problems

  • A measurement was not recorded, whether due to oversight or because there was a reason it wasn’t needed

Some common ways to handle missing data:

  • Remove those measurements

  • Impute the missing value using the mean or median

  • Forward/backward filling

Rule of thumb: Talk to your domain experts/collaborators!

Let’s examine a simple, one patient case, for how we might work with missing blood pressure measurements. The following patient has multiple blood pressure measurements recorded.

bp_test = bp_measurements[bp_measurements['subject_id']=='10035631'][['chartdate', 'systolic']].sort_values('chartdate')

Let’s focus on a few time points when the patient had multiple measurements taken sequentially except for one day (2116-02-23).

bp_test = bp_test[(bp_test['chartdate'] >= '2116-02-16') & (bp_test['chartdate'] <='2116-02-24')]
bp_test
chartdate systolic
1826 2116-02-16 125
1760 2116-02-17 106
1777 2116-02-18 110
1796 2116-02-19 115
1748 2116-02-20 103
1785 2116-02-21 111
1799 2116-02-22 116
1753 2116-02-24 104

How might we handle missing values in this case?#

One common way to handle missing values is to impute the missing value. Imputation refers to the process of filling in missing values. A simple method of imputation might include using the mean or the median value of a series of measurements in place of the missing value.

In this case, let’s demonstrate how we might use median imputation to replace the missing value.

First, we can see that there is not a row for 2116-02-23. Let’s use pd.date_range() to create a complete series of dates. We can then re-index the original data so that we have a series of sequential dates.

full_range = pd.date_range(
    start='2116-02-16',
    end='2116-02-24',
    freq='D'
)
full_range
DatetimeIndex(['2116-02-16', '2116-02-17', '2116-02-18', '2116-02-19',
               '2116-02-20', '2116-02-21', '2116-02-22', '2116-02-23',
               '2116-02-24'],
              dtype='datetime64[ns]', freq='D')
# Reindex 
bp_test = (
    bp_test
    .set_index('chartdate')
    .reindex(full_range)
)

bp_test
systolic
2116-02-16 125.0
2116-02-17 106.0
2116-02-18 110.0
2116-02-19 115.0
2116-02-20 103.0
2116-02-21 111.0
2116-02-22 116.0
2116-02-23 NaN
2116-02-24 104.0

After re-indexing, we have a complete set of dates. You will notice that there is a NaN next to the date that was originally missing a measurement. Next, we can use fillna() to replace the missing value by the median of the entire series.

# Impute the missing value with the median
bp_test_median = bp_test.fillna(bp_test.median())

# reset index to add `chartdate` as its own column
bp_test_median = bp_test_median.reset_index().rename(columns={'index': 'chartdate'})

bp_test_median
chartdate systolic
0 2116-02-16 125.0
1 2116-02-17 106.0
2 2116-02-18 110.0
3 2116-02-19 115.0
4 2116-02-20 103.0
5 2116-02-21 111.0
6 2116-02-22 116.0
7 2116-02-23 110.5
8 2116-02-24 104.0

We could also impute by forward filling, which is an imputation technique referring to when the last observation is carried forward to replace a missing value.

# Impute the missing value by carrying the last observation forward (forward filling)
bp_test_forward = bp_test.ffill()

# reset index to add `chartdate` as its own column
bp_test_forward = bp_test_forward.reset_index().rename(columns={'index': 'chartdate'})

bp_test_forward
chartdate systolic
0 2116-02-16 125.0
1 2116-02-17 106.0
2 2116-02-18 110.0
3 2116-02-19 115.0
4 2116-02-20 103.0
5 2116-02-21 111.0
6 2116-02-22 116.0
7 2116-02-23 116.0
8 2116-02-24 104.0

Warning

It’s always important to think critically about the best ways to impute any missing data. We demonstrated two simple methods (median/mean imputation and forward filling), but there are entire books written on different imputation techniques. As mentioned throughout this chapter, discussions with your collaborators and domain experts are important for identifying the best imputation method.

If imputation is used, it’s important to be careful that the methods we use do not introduce bias. For example, mean imputation may not be preferred if the data is highly skewed.

Another imputation method, called backward filling, refers to when a missing value is replaced by carrying the next observation backwards. As you can imagine, this can be problematic when we are using time series data to predict some future outcome. We may be making our task much easier owing to data leakage, as you learned in the chapter on feature engineering.
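For comparison, here is a sketch of backward filling applied to the same small example. Note that the value recorded on 2116-02-24 would be copied back to 2116-02-23, which is exactly the kind of peeking ahead that can leak future information:

# Impute the missing value by carrying the next observation backwards (backward filling)
bp_test_backward = bp_test.bfill()

# reset index to add `chartdate` as its own column
bp_test_backward = bp_test_backward.reset_index().rename(columns={'index': 'chartdate'})

bp_test_backward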

Conclusions#

This chapter aimed to provide you with a real-world dataset that benefited from data wrangling. As you can see, the process of data wrangling is highly dependent on your dataset. In many cases, there is also not one correct way to go about these processes. What’s important is to deeply consider what our data represents, how it’s structured, and the downstream implications for our analysis if we do or do not prepare our data appropriately. Data wrangling is also a great opportunity to work deeply with your collaborators (domain experts!) to ensure each step of the process is grounded in practical utility and is reflective of how the data was generated, collected, and how it’s meant to be analyzed.

[^*] Johnson, A., Bulgarelli, L., Pollard, T., Gow, B., Moody, B., Horng, S., Celi, L. A., & Mark, R. (2024). MIMIC-IV (version 3.1). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/kpb9-mt58

[^**] Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV Clinical Database Demo (version 2.2). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/dp1f-ex47