https://www.myonlinetraininghub.com/excel-find-and-remove-duplicates
Highlight Duplicates with Conditional Formatting
Conditional Formatting can quickly highlight duplicates in a column. Simply select the column of cells containing the suspected duplicates > Home tab > Conditional Formatting > Highlight Cells Rules > Duplicate Values:
Tip: You can change the format by clicking the drop down for ‘Values with’ (see image above).
Once the formatting is applied, you can use filters (Data tab > Filter) based on the cell fill color or font color to display or hide the duplicate values:
Pros: Great for visually highlighting duplicates in a column while retaining them in the dataset. You can use filters to hide duplicates or focus on them.
Cons: Duplicates remain in the dataset. That may be exactly what you want, but if you just want to get rid of them, keep reading.
This method also doesn’t highlight the whole row, and it only identifies duplicates in a single column.
Highlight Duplicate Rows with a Conditional Formatting Formula
Let’s say you want to highlight entire rows where the combination of values in two or more columns is duplicated. For example, rows 9 and 11 have the same Date and ID:
For this we need to apply the conditional formatting using a formula:
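As a minimal sketch, assuming the dates are in A7:A14 and the IDs in B7:B14 (these ranges are assumptions; adjust them to your own data): select A7:B14, then Home tab > Conditional Formatting > New Rule > ‘Use a formula to determine which cells to format’, and enter a formula along these lines:

```excel
=COUNTIFS($A$7:$A$14,$A7,$B$7:$B$14,$B7)>1
```

The absolute ranges ($A$7:$A$14 and $B$7:$B$14) stay fixed, while the mixed references ($A7 and $B7) lock the column but let the row move, so every cell in a row is tested against that row’s Date and ID.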
Pros: Highlights the whole row and takes into consideration more than one column. Filters can be used to hide duplicates from view.
Cons: Formula can be difficult to remember. Duplicates remain in the dataset.
Identify Duplicates with a Formula
You can add a column to your data table to tag rows containing duplicates. The formula below is looking for duplicate rows, i.e. where both the Date and ID values are duplicated:
The formula in cell C7 uses COUNTIFS to count the rows where both the Date and ID match the current row. If the count is greater than 1, ‘Duplicate’ is returned; otherwise the cell is left blank.
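A formula matching that description, sketched on the assumption that the dates are in A7:A14 and the IDs in B7:B14 (the ranges are illustrative, not taken from the original workbook), might look like this:

```excel
=IF(COUNTIFS($A$7:$A$14,A7,$B$7:$B$14,B7)>1,"Duplicate","")
```

Copy it down column C so that each row is checked against the full list.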
Tip: If you only want to check a single column, let’s say the ID column, then you could use the COUNTIF formula like so:
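For instance, assuming the IDs are in B7:B14 (again an illustrative range):

```excel
=IF(COUNTIF($B$7:$B$14,B7)>1,"Duplicate","")
```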
Pros: Column containing duplicate tag can be used in PivotTables or other formulas to ignore or focus on duplicate rows.
Cons: Formula can be difficult to remember. Requires an extra column in your dataset. Could be cumbersome in large files.
Remove Duplicate Values
We’ve looked at highlighting or tagging cells or rows containing duplicates, but sometimes you want to remove duplicates so you have a unique list of values. There are a few ways to tackle this.
Let’s say we want to remove duplicate rows from the table below, i.e. we want to retain row 7 with ‘Produce’ and ‘Richard’, but remove one of the duplicate rows (9 or 11) containing ‘Produce’ and ‘Rachel’:
We can use the Remove Duplicates tool on the Data tab of the ribbon:
By selecting both the Department and Name columns I’m telling Excel that I want it to find duplicates where the values in both columns are the same. Note that I also have the ‘My data has headers’ box checked so it ignores my headers.
And I’m left with a list of unique rows:
Pros: Quick and easy to use.
Cons: Removal of duplicates is permanent. If your data gets updated then you need to run the Remove Duplicates process again.
Power Query Remove Duplicates
Power Query (available as a free add-in for Excel 2010 and 2013, and built into Excel 2016 onwards) also has a Remove Duplicates tool.
Format your data in an Excel Table then load the data into Power Query:
Excel 2010 & 2013: Power Query tab > From Table:
Excel 2016: Data tab > Get & Transform group > From Table:
This will load the data into Power Query and open the Power Query Editor window. In the Power Query Editor simply select the columns you want it to find duplicates for (hold Ctrl to select multiple columns, or Shift to select contiguous columns) > Home tab > Remove Rows > Remove Duplicates:
Pros: The great thing about using Power Query is if your source data gets updated you can Refresh the query and it will remove duplicates again, with just the click of a button. Original data remains intact, plus you have a new view of the data that excludes the duplicates.
Cons: Requires a few more steps than the previous example. Retaining original data may make the file unnecessarily large. If so, the original data can be stored in a separate file.
Remove Duplicates with Advanced Filter
Advanced Filter can extract a list of unique items from a column or columns. First select the data, then Data tab > Advanced:
In the Advanced Filter dialog box (image above), choose ‘Copy to another location’ (4 & 5) and check the box for ‘Unique records only’. And voilà, we now have two lists: the original, and the list excluding duplicates in columns E & F:
Pros: Reasonably easy to use. Also has an option to just filter the list to hide duplicates. Can handle multiple columns of data.
Cons: No link is maintained between the original data and the filtered data. If the original data gets updated then the Advanced Filter must be run again.
Identify Duplicates with PivotTables
A PivotTable is an excellent way to quickly identify if you have any duplicates in a column.
Place the field you want to check for duplicates in both the Rows and Values areas; in my case it’s the Name field. The PivotTable gives you a list of unique names and the count of each:
Tips: sort the PivotTable Count column in descending order to bring the duplicates to the top; right-click a cell in the values area > Sort > Sort Largest to Smallest:
Or filter the Count column to only show records greater than 1:
Pros: Quick and easy to do and great for large datasets because you can sort the count in descending order to bring any duplicates to the top, or filter to only show duplicates. The PivotTable also provides the count of an item so you can see how many times it is duplicated.
Cons: Doesn’t remove duplicates, only identifies them.
So, there you have 7 ways to identify or remove duplicates. Depending on my needs, I like to use Power Query to remove duplicates, Conditional Formatting to visually indicate duplicate records, and PivotTables to identify whether large datasets contain duplicates.