Episode 21: Data Cleaning, Part 1

In this episode, co-hosts Jennifer Miller and Ron Landis discuss the importance of data cleaning and management. They identify five aspects of data cleaning that are critical to check before the analytic phase. In some cases, data management is embedded in the data encoding and storage process (i.e., rules are in place to ensure that a data field can hold only one type of data, such as a date). They then discuss how to check for data accuracy and what to do with missing data.

In this podcast episode, we discuss the following data cleaning questions:  

  • What is data cleaning?  

  • Why is data cleaning important prior to the analytic phase?  

  • What are the five steps of data cleaning?  

  • How do you check for data accuracy in a data set?  

  • What does it mean to have missing data?  

  • What are some of the ways that you can evaluate your missing data?  

  • What do you do with missing data?  

Link to Measurement Podcast Episode

4 Key Takeaways on Data Cleaning

  • Data management is imperative to the data analytic process. Without a strong focus on data management, the analyses and their subsequent interpretation and use may be misleading or incorrect. While this topic may seem boring or perhaps intuitive, it is necessary to have a plan for data cleaning.  

  • There are five broad aspects of data cleaning. Which ones apply depends on the data and the focal question, but in general some or all of these steps should be considered before conducting analyses. As noted above and in the episode, some of these steps may be handled automatically by platform and storage restrictions. The five steps are checking for data accuracy, missing data, linearity and normality, outliers, and multicollinearity.  

  • Data accuracy refers to whether the recorded values are correct and conform to the fields in which they are stored (for example, a date field contains only valid dates).  

  • A missing data analysis checks for missing values. Depending on the type of data, there are various procedures for handling these missing values. 
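
The last two takeaways can be illustrated with a small sketch. This is not the hosts' procedure, only one minimal way to check data accuracy and count missing values. The records, field names, date format, and plausible age range below are all hypothetical and chosen for illustration.

```python
from datetime import datetime

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "visit_date": "2021-03-04", "age": 34, "score": 4},
    {"id": 2, "visit_date": "2021-13-40", "age": 29, "score": None},
    {"id": 3, "visit_date": "2021-05-17", "age": 290, "score": 5},
    {"id": 4, "visit_date": None, "age": 41, "score": None},
]


def is_valid_date(value):
    """Return True if value is a YYYY-MM-DD string naming a real calendar date."""
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def accuracy_problems(rows):
    """List (id, field, value) for present values that break a field rule."""
    problems = []
    for row in rows:
        date = row["visit_date"]
        if date is not None and not is_valid_date(date):
            problems.append((row["id"], "visit_date", date))
        age = row["age"]
        # Assumed plausible range for age; adjust to the data at hand.
        if age is not None and not (0 <= age <= 120):
            problems.append((row["id"], "age", age))
    return problems


def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


problems = accuracy_problems(records)
missing = missing_counts(records)
print(problems)  # values that are present but violate a field rule
print(missing)   # number of missing values in each field
```

Keeping the two checks separate mirrors the distinction drawn in the takeaways: an inaccurate value is present but wrong for its field, while a missing value is absent and calls for its own handling decision.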

Related Links  


Episode 20: Applying Multiple Regression to Test for Moderation