Great things can happen with good data. And good data is everywhere. You just need to know how to look.
However, the effect of a well thought out study, with a good data collection methodology along with properly cleaned data is often underestimated. Not collecting the required data, not using clean data for analysis and lack of understanding of properties of the collected data can result in endless hours of rework or wrong results.
Every analysis software has some form of data cleaning tools embedded in them. Also, there are many free data cleaning software like OpenRefine, Trifacta Wrangler etc. that can be used as per the requirements.
The data cleaning process need not begin once we have all the data in hand. It needs to begin even before we start collecting the data. This can ensure that the data we have meets our purpose and gives us the right information for analysis.
Before Collecting Data
To get the appropriate data and derive meaningful insights we must ask ourselves certain important questions so we know the best route to take. For example, collecting a sample gives you an idea about the kind of information you actually get and whether it is the same as what you hoped to get. If there is any inaccuracy or logical errors you can choose another method or improve your methods.
- What is our intention behind this project?
- What information do you need at the end of the analysis?
- What are all the data points do you need to get this information?
- What method of data collection would ensure collection of accurate data?
Sample collecting is useful in more ways than one. It helps you validate all the questions you asked before data collection and the method of collection. Once you collect the sample, we need to check it for any inconsistencies.
- Verify if the sample gives you the information you need.
- How clean is this data?
- What is the accuracy level?
- Are there any logical incompatibilities?
- What is the level of unavailable data?
- Is the method of data collection effective? Do you need to alter the method for getting better data or use an entirely new method instead?
Once we complete the data collection, we need to clean it up. First, we need to establish how data needs to be classified. Then we need to begin the cleaning process with simple formatting issues to tackling the structural and logical issues and finally creating a codebook so that it makes it easier for other team members and the creators to revisit the dataset effectively.
Classifying the Data
The collected raw data must be classified separately into sheets or workbooks (as per convenience) according to the stages of manipulations it has gone through. This way the original data is preserved and in case we spot any issues in the analysis, we can trace back our steps to find the cause.
- Codebook – Description of the data
- Raw Data – Originally collected virgin data
- Clean Data – Cleaned data
- Analysis – Data on which analysis has been done
Cleaning the Entries
Starting with the file name we first need to do the dirty job of sifting through the data points to make sure the data points are valid and do not violate the logical rules of the dataset.
- Name of files
- Name of sheets
- Name of column names/other data structures
- Blank spaces
- Number of decimal points
- The Formats – Make sure the data is stored in its proper formats.
- Logical Errors – Make sure the columns and rows or other data structures that you use follow the logical rules of the data.
- Veracity – Check for any ambiguities in any data points.
Changing the Structure
Very often we need to change the structure of the data in order to make sense of it better.
- Normalize the tables so that data is consistent.
- Distribute data across inter-connected worksheets/tables so that there are no duplicate rows.
- However do not decentralize them so much that there are too many internal connections that analysis and getting the larger picture becomes tedious.
- The nature of the topic might require for data organization in particular structure or to be sorted in a certain order.
- Add new columns/data structures that are derivates of existing ones if required.
Once the data is sifted and restructured, we need to format them increase its readability. There are several ways of getting this done according to the personal taste or organizational policies. Generally, it is better to keep it simple and clean. Some examples are given below:
- Table Headings
- Table Borders
- Conditional Formatting
Creating a Codebook
A codebook describes the data. It shows the content, the structure and makes the data self-explanatory. It makes it easier for the creators and others to understand the data better. Even though a codebook seems a little tedious to create, it will help the analysts look at the data points in every data structure minutely and thus understand it better.
1. Study Details
- Study Question
- Method of Data Collection
2. Sampling Information
- Sample Selection Method
- Sample Size
- Type of Data – Structured, Unstructured, Semi-structured
3. Data description
- Total Number of Columns
- Total Number of Rows
- Total Number of Records
- Sorting Method (If any)
4. column description (For every column)
- Type – Qualitative or Quantitative and Discrete or Continuous
- Measuring Unit
- Scale – Nominal, Ordinal, Interval, Ratio
- Number of Observations
- Number of Unavailable Data
There are endless ways to tackle the dirty data problems and each analyst with practice develop their own strategy. Even though the data cleaning process and writing codebooks seem tedious and uninteresting, it gives the analyst a good deal of understanding the data. Cleaned data makes it easy for others to understand and use it.
So, what are your methods of data cleaning? Tell us at email@example.com.
E-mail us at firstname.lastname@example.org to inspire our readers with your story – be it your success story or a lesson learned, share what you learned or send some love to a friend. We would love to hear from you!