Data Cleansing Techniques Dec 14, 2023

Data Cleansing

What is Data Cleansing?

Data cleansing is the process of identifying, rectifying, and eliminating errors, inconsistencies, and inaccuracies within datasets. It involves refining raw data, removing duplicates, rectifying formatting issues, and resolving any discrepancies to ensure the information is accurate, consistent, and reliable for analysis and decision-making.

By ensuring data accuracy and reliability, data cleansing enables businesses to make informed decisions based on trustworthy information rather than flawed or misleading data.

What kind of Data Issues need Data Cleansing in Excel?

  1. Incomplete Data: Missing values or blank spaces within datasets can render information unusable or inaccurate. For instance, incomplete customer addresses or a lack of essential information like social security numbers in employee records can impede various processes.
  2. Duplicate Data: Instances where the same records appear more than once in a dataset can lead to misinterpretations and errors. This duplication can cause issues in inventory management, financial calculations, or customer analytics.
  3. Invalid Data: Data not following standard formats or containing incorrect attributes can lead to inaccuracies. For example, names with numeric characters or dates in the wrong format can be considered invalid.
  4. Conflicting Data: When multiple versions of the same information exist without clearly identifying the most current or accurate version, it can create confusion. For instance, having different addresses for a company without specifying the current one can lead to delivery or communication problems.
  5. Inconsistent Data: Data that doesn’t align regarding formats, units, or standards can pose significant challenges. For instance, varying date formats (mm/dd/yyyy vs. dd/mm/yyyy) or using different units of measurement for the same attribute can lead to confusion and errors in analysis.
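
To make a few of these categories concrete, here is a minimal Python sketch that flags incomplete, duplicate, and invalid rows. It sits outside Excel and is only an illustration: the sample records, the field names, and the `find_issues` helper are all hypothetical, and the three checks are deliberately simple.

```python
import re

# Hypothetical sample records; an empty string stands for a blank cell.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "joined": "2023-01-15"},
    {"name": "Ann Lee", "email": "ann@example.com", "joined": "2023-01-15"},
    {"name": "B0b Ray", "email": "bob@example.com", "joined": "2023-02-01"},
    {"name": "Cy Moss", "email": "", "joined": "2023-03-09"},
]


def find_issues(rows):
    """Return (row_index, issue) pairs for three simple checks."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        # Incomplete: any field that is None or blank after stripping spaces.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            issues.append((i, "incomplete"))
        # Duplicate: the exact same values already appeared in an earlier row.
        key = tuple(row.values())
        if key in seen:
            issues.append((i, "duplicate"))
        seen.add(key)
        # Invalid: a name that contains a numeric character.
        if re.search(r"\d", row.get("name") or ""):
            issues.append((i, "invalid name"))
    return issues


issues = find_issues(records)
print(issues)
```

Conflicting and inconsistent data usually need a business rule (which address is current, which unit is standard), so they are left out of this sketch.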

In this Guide, we talk about:

  1. How to Remove Duplicate Data in Excel?
  2. Data Parsing (Text to Columns):
  3. How to fix Text Formatting Errors in Excel by using Find and Replace?
  4. How to Sort Missing or Incomplete Data in Excel?
  5. How to Remove Unnecessary Characters in Excel Data?
  6. How to Format Dates in Excel?
  7. How to Delete All the Formatting in your Excel Data?

How to Remove Duplicate Data in Excel?

Dealing with duplicate data can be a common challenge when handling datasets. But before you begin addressing duplicates, it’s essential to create a copy of your dataset. Deleting data is permanent, so having a backup is a good safety measure.

In Excel, ‘Remove Duplicates’ is a handy tool found in the Data tab. When you select your dataset and click ‘Remove Duplicates’, Excel deletes duplicate rows and keeps only the first occurrence. However, it’s crucial to analyse your dataset before going ahead with this process.

Consider this scenario:

Imagine a dataset of membership contact details where two rows have the same email address but different names.

There could be various reasons behind this:

1. Family members sharing an email address, resulting in the duplication of the email but with distinct names for each person.

2. A single individual entering different name variations, like “Bob” and “Robert,” or possibly a simple typo, such as “Robert” and “Roebrt.”
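
A quick way to surface cases like these before deleting anything is to list every email address that appears with more than one name. The Python sketch below is one possible approach outside Excel; the sample rows and the `emails_with_multiple_names` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical membership rows.
rows = [
    {"name": "Robert Hill", "email": "hill@example.com"},
    {"name": "Bob Hill", "email": "hill@example.com"},
    {"name": "Dana Fox", "email": "dana@example.com"},
]


def emails_with_multiple_names(rows):
    """Map each email that has more than one distinct name to those names."""
    names_by_email = defaultdict(set)
    for row in rows:
        # Normalise the email so that case and stray spaces do not hide a match.
        email = row["email"].strip().lower()
        names_by_email[email].add(row["name"].strip())
    return {
        email: sorted(names)
        for email, names in names_by_email.items()
        if len(names) > 1
    }


conflicts = emails_with_multiple_names(rows)
print(conflicts)
```

Each email in the result is a row group to review by hand: it may be a shared family address, a nickname, or a typo.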

Once you have decided which columns truly identify a duplicate, remove the duplicates as follows:

  • Select the data range containing duplicates.
  • Go to the “Data” tab on the Excel ribbon.
  • Click on “Remove Duplicates” in the Data Tools group.
  • Select the columns to check for duplicates and ensure the checkbox for headers is selected if applicable.
  • Click “OK” to remove duplicates based on the selected columns, keeping only unique entries.
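
The effect of choosing different columns can be sketched in Python. This is an illustration of the keep-the-first-occurrence idea, not Excel's own implementation: the `remove_duplicates` helper and the sample data are hypothetical, and values are compared exactly as stored, which may differ from how Excel compares text.

```python
def remove_duplicates(rows, columns):
    """Keep the first row for each distinct combination of the chosen columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical membership rows.
members = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "Raj Rao", "email": "raj@example.com"},
    {"name": "Anna Lee", "email": "ann@example.com"},
]

by_both = remove_duplicates(members, ["name", "email"])
by_email = remove_duplicates(members, ["email"])
print(len(by_both), len(by_email))
```

Checking both columns keeps "Anna Lee" as a separate record, while checking only the email column drops her, which is why the column selection deserves a moment of thought.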

Data Parsing (Text to Columns):

Using Excel’s “Text to Columns” feature can be a lifesaver when data elements in a single cell need to be separated. For instance, if you have an address column containing street, district, state, and nation information all crammed into one cell with commas separating each element, you can split these details into individual columns for better organisation.

Similarly, if you have car manufacturer and model information in a single column separated by a space, you can use “Text to Columns” to divide them into separate columns.

Let’s walk through using ‘Text to Columns’ to separate car manufacturer and car model name in an example:

  • Highlight the column containing the car manufacturer and model name data that’s currently combined (e.g., “Toyota Corolla”).
  • Go to the ‘Data’ tab on the Excel ribbon.
  • Find and click ‘Text to Columns’ in the ‘Data Tools’ group.
  • In the ‘Convert Text to Columns Wizard’, select ‘Delimited’ and click ‘Next’.
  • Choose the delimiter used in your data. In this case, it’s likely a space that separates the manufacturer and model name. Check the ‘Space’ checkbox and preview the separation in the Data preview window.
  • Next, you can choose the format for each column. For example, you might want the manufacturer to be in one column and the model name in another. You can set the format for each column by selecting the column and choosing ‘Text’, ‘Date’, or ‘General’ as required. Then click ‘Finish’ to apply the split.
  • Now, your original column containing “Toyota Corolla” should be separated into two columns – one for the manufacturer (e.g., “Toyota”) and another for the model name (e.g., “Corolla”).
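
The same delimiter-based split can be sketched in Python. The `text_to_columns` helper and the sample values are hypothetical. The sketch splits on every occurrence of the delimiter, so a manufacturer name that itself contains a space ends up spread over two columns, which is worth checking for in your own data.

```python
def text_to_columns(values, delimiter=" "):
    """Split each value on the delimiter and pad rows to the same width."""
    split_rows = [value.split(delimiter) for value in values]
    width = max((len(row) for row in split_rows), default=0)
    return [row + [""] * (width - len(row)) for row in split_rows]


# Hypothetical combined manufacturer and model values.
cars = ["Toyota Corolla", "Honda Civic", "Land Rover Defender"]
columns = text_to_columns(cars)
for row in columns:
    print(row)
```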

How to fix Text Formatting Errors in Excel by using Find and Replace?

The ‘Find and Replace’ functionality in Excel is handy for data cleansing and organising. It’s a powerful tool that allows you to quickly find specific text strings and replace them with new text or remove them entirely. Let’s dive into how it can help in cleansing employee data:

For instance, let’s say you have an employee dataset where job titles are listed with a label followed by a colon and space, like “Job Title: Manager”, and you want to remove the “Job Title: ” prefix from each entry:

  • Press Ctrl + H on your keyboard. Alternatively, go to the ‘Home’ tab in the Excel ribbon, click ‘Find & Select’ in the ‘Editing’ group, and then select ‘Replace’.
  • In the ‘Find what’ box, type “Job Title: ” (without quotes, but including the space after the colon).
  • Leave the ‘Replace with’ box empty to replace the found text with nothing.
  • Click ‘Find Next’ to preview the first instance. If it’s the correct text to replace, click ‘Replace’. If you’re sure and want to replace all instances, click ‘Replace All’.

This process effectively removes the unwanted label “Job Title: ” from all entries in the dataset. However, ensure that you review and confirm replacements, especially when using ‘Replace All’, to avoid unintended changes in your data.
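
For comparison, the same clean-up can be sketched in Python with a plain string replacement. The `strip_label` helper and the sample titles are hypothetical, and note one assumption: Python's `str.replace` is case-sensitive, so "job title: " in lower case would not be matched here.

```python
def strip_label(values, label="Job Title: "):
    """Remove every occurrence of the label text from each value."""
    return [value.replace(label, "") for value in values]


# Hypothetical job title column; the last entry has no label to remove.
titles = ["Job Title: Manager", "Job Title: Analyst", "Director"]
cleaned = strip_label(titles)
print(cleaned)
```

Entries without the label pass through unchanged, which mirrors what ‘Replace All’ does to cells that contain no match.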

Apart from removing specific strings, ‘Find and Replace’ can perform various tasks like removing all zeros, updating references in formulas, altering formatting, and more. It’s a versatile tool that streamlines data cleansing and organisation tasks in Excel.

How to Sort Missing or Incomplete Data in Excel?

Missing or incomplete information in your dataset can lead to inaccurate insights; however, Excel has some great tools to help fix this. One is the “IF” function, which lets you create rules that flag or fill in missing values using other data you have.

Syntax: =IF(logical_test, value_if_true, value_if_false)

  • logical_test: Condition to check if the cell is empty or contains missing data.
  • value_if_true: Value to be displayed if the condition is met (i.e., if the cell is empty).
  • value_if_false: Value to be displayed if the condition is not met.

Example: =IF(A2="", "Data Missing", A2)

This formula checks if cell A2 is empty. If it is, it displays “Data Missing”; otherwise, it displays the existing value in cell A2.
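
The logic of that formula can be sketched in Python as a small function applied to each cell. The `flag_missing` name and the sample column are hypothetical, and treating `None` as a blank cell is an assumption of this sketch.

```python
def flag_missing(value, placeholder="Data Missing"):
    """Return the placeholder for a blank cell, otherwise the value itself."""
    return placeholder if value is None or value == "" else value


# Hypothetical column with two blank cells.
column = ["Alice", "", None, "Dinesh"]
flagged = [flag_missing(cell) for cell in column]
print(flagged)
```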

Another helpful tool is the “VLOOKUP” function. It searches for a value in the first column of a table and returns the value from another column in the same row. This is great for filling in gaps when the missing information already exists in another table. These tools make managing missing data in Excel much easier!
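
As a rough illustration of lookup-based gap filling, the Python sketch below uses a dictionary in place of the lookup table. It behaves like an exact-match lookup only; the employee IDs, the `fill_department` helper, and the "Not Found" default are all hypothetical.

```python
# Hypothetical lookup table: employee ID -> department.
departments = {"E001": "Finance", "E002": "Sales"}

# Hypothetical employee rows, two of them with a blank department.
employees = [
    {"id": "E001", "department": ""},
    {"id": "E002", "department": "Sales"},
    {"id": "E003", "department": ""},
]


def fill_department(rows, lookup, default="Not Found"):
    """Fill blank departments from the lookup table, matching on the ID."""
    filled = []
    for row in rows:
        # Keep an existing value; otherwise look the ID up, falling back to the default.
        department = row["department"] or lookup.get(row["id"], default)
        filled.append({**row, "department": department})
    return filled


filled = fill_department(employees, departments)
print([row["department"] for row in filled])
```

The explicit default plays the role of handling a failed lookup, so a missing ID is visible in the result instead of silently staying blank.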

How to Remove Unnecessary Characters in Excel Data?

Removing extra spaces and unwanted characters is crucial for clean and accurate data in Excel. The “TRIM” function removes leading and trailing spaces and reduces repeated spaces between words to a single space, all of which might otherwise affect your analysis. Simply use it on your data to ensure precision in formulas and functions.

Additionally, the “Substitute” function is your go-to for removing particular characters or strings. This helps refine your data, ensuring only the essential information remains.

Keep in mind unnecessary spaces and characters can compromise the accuracy of your calculations. They might seem small, but they can lead to significant errors in your analyses and results. Here’s how you can use the TRIM function:

  1. Select an empty range next to your data that is the same size as the range that needs cleansing. For instance, if your text is in A1:A10, select B1:B10, starting from B1.
  2. In the formula bar, type =TRIM(A1) and hit Ctrl + Enter instead of just Enter.

Using Ctrl + Enter enters the formula in all selected cells simultaneously and adjusts the reference for each row, so B1 contains =TRIM(A1), B2 contains =TRIM(A2), and so on down to B10. To replace the original text, copy the new column and paste it over the original range using ‘Paste Special’ and then ‘Values’.
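
To see what these two functions do to a piece of text, here is a small Python sketch of their behaviour. It is an approximation written for illustration: `excel_like_trim` and `substitute` are hypothetical helper names, and the sketch only handles the ordinary space character.

```python
def excel_like_trim(text):
    """Drop leading and trailing spaces and collapse repeated spaces to one."""
    # Split on the space character only, discard empty pieces, rejoin with single spaces.
    return " ".join(piece for piece in text.split(" ") if piece)


def substitute(text, old, new=""):
    """Replace every occurrence of old with new (an empty new removes it)."""
    return text.replace(old, new)


trimmed = excel_like_trim("  Toyota   Corolla ")
stripped = substitute("AB-123-X", "-")
print(repr(trimmed), repr(stripped))
```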

How to Format Dates in Excel?

Formatting dates in Excel can be a bit tricky, especially when dealing with varied formats.

  1. Begin by selecting the cells containing the date information.
  2. Right-click the selection and choose “Format Cells” (or press Ctrl + 1). This opens a dialog where you can choose the desired date format under the “Date” category.
  3. If dates are mixed with other data or separated by a delimiter, like commas, you can utilise the “Text to Columns” feature. This helps split date elements into separate columns for easier manipulation.
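
When dates arrive as text, the key question is which format the source used. The Python sketch below converts text dates to one consistent format, assuming you already know the source format; the `to_iso` helper is hypothetical. It also shows why guessing is risky: the same text gives two different dates under the two conventions.

```python
from datetime import datetime


def to_iso(date_text, source_format):
    """Parse a text date in a known format and return it as yyyy-mm-dd."""
    return datetime.strptime(date_text.strip(), source_format).date().isoformat()


# The same text read as month/day/year and as day/month/year.
as_us = to_iso("03/04/2023", "%m/%d/%Y")
as_uk = to_iso("03/04/2023", "%d/%m/%Y")
print(as_us, as_uk)
```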

Remember, Excel stores dates as serial numbers, starting from January 1, 1900, which is serial number 1. This enables calculations involving dates, like finding date differences or adding days. However, be cautious with dates preceding January 1, 1900: Excel does not recognise them as dates and treats them as text, so date calculations will not work on them.
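
The serial-number idea can be sketched in Python as follows. This assumes the 1900 date system and whole-number serials, and the `serial_to_date` helper is hypothetical. One known quirk is handled explicitly: Excel counts a 29 February 1900 that never existed (serial 60), so the offset differs before and after that point.

```python
from datetime import date, timedelta


def serial_to_date(serial):
    """Convert a whole-number serial in the 1900 date system to a date."""
    if serial < 1:
        raise ValueError("serial numbers start at 1 (January 1, 1900)")
    if serial == 60:
        raise ValueError("serial 60 is Excel's non-existent February 29, 1900")
    # Serials after 60 are shifted by one day because of the phantom leap day.
    epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)


first_day = serial_to_date(1)
post_date = serial_to_date(45274)
days_between = (serial_to_date(45274) - serial_to_date(44927)).days
print(first_day, post_date, days_between)
```

Because the dates are just numbers, the difference between two dates is a simple subtraction, which is the same reason date arithmetic works in Excel.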

You can create custom formats using the “Custom” option in the “Format Cells” menu for dates combined with time data, such as timestamps. This allows tailoring the display to suit your analysis needs, effectively presenting date and time information.

How to Delete All the Formatting in your Excel Data?

Another useful Excel data cleansing technique involves managing and standardising formatting or, at times, removing any applied formatting. Formatting refers to the visual appearance of cells, including aspects like cell colours and text alignments.

  1. Select the Range: Begin by selecting the range of cells where you want to remove formatting.
  2. Clear Formats: Go to the “Home” tab on the toolbar. In the “Editing” group, click on “Clear” and then select “Clear Formats”. This will remove any applied formatting, such as cell colours, font styles, or alignments.

This method helps in reverting cells back to their default formatting, removing any visual changes like colours or alignments that were previously applied. For instance, if you have a table of car manufacturers and models with cells formatted with colours and alignments, using the “Clear Formats” option will reset these cells to the default formatting settings.

Data Cleansing with SBA

Educating your employees on spotting duplicates and using the right apps for data entry simplifies the process. Implementing these strategies ensures more organised and refined data, saving time in detecting and rectifying duplicate or outdated data. Excel offers numerous methods to clean up your database beyond what we’ve covered here.

For growing businesses without an in-house analyst, data management can escalate to the point where it is time-consuming, difficult to manage, and treated as an afterthought. It also takes up resources that could be better utilised elsewhere. Team SBA can take your data and prepare it to give you and your business the insights to help it grow.

The SBA Away-Team of experts is your on-tap resource, supercharging your firm or business. Speak with us today and learn how we can streamline your data collection and handling process.
