Python: 4 ways to remove duplicates from an object list

Literal

Basic solution with for-loop

The basic way to remove duplicates is a for loop. Prepare an empty list and add each item only if the same value has not been added to the list yet.

from typing import List, Union

duplicated_int_list = [1, 2, 3, 4, 2, 3, 4, 5]


def remove_with_for(data_list: List[int]):
    result = []
    for item in data_list:
        if item not in result:
            result.append(item)
    return result


print(remove_with_for(duplicated_int_list))
# [1, 2, 3, 4, 5]

This is simple and readable. Non-Python developers can also understand what it does.
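
One caveat: `item not in result` scans the result list on every iteration, so the running time grows quadratically with the list length. Below is a minimal sketch of a variant that tracks seen values in a set, assuming the items are hashable. The name `remove_with_seen_set` is made up for this example.

```python
from typing import Hashable, Iterable, List


def remove_with_seen_set(data_list: Iterable[Hashable]) -> List[Hashable]:
    seen = set()  # values that were already added to result
    result = []
    for item in data_list:
        if item not in seen:  # set lookup instead of a list scan
            seen.add(item)
            result.append(item)
    return result


print(remove_with_seen_set([1, 2, 3, 4, 2, 3, 4, 5]))
# [1, 2, 3, 4, 5]
```

The order of the first occurrences is kept, the same as in the plain for-loop version.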

Remove duplicates with set

Many languages have a set type, so this code is easy to understand for many developers.

def remove_with_set(data_list: List[int]):
    return list(set(data_list))

duplicated_int_list = [1, 2, 3, 4, 2, 3, 4, 5]
print(remove_with_set(duplicated_int_list))
# [1, 2, 3, 4, 5]

The result is correct for this input. Note that set does not guarantee that the original order is kept.
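
A small sketch with made-up string data shows how to check the result of the set approach without relying on its order:

```python
words = ["pear", "apple", "pear", "fig", "apple"]

unique_words = list(set(words))

# The duplicates are gone, but the order of unique_words is not guaranteed,
# so compare the contents instead of the position of each item.
print(len(unique_words))     # 3
print(sorted(unique_words))  # ['apple', 'fig', 'pear']
```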

Remove duplicates with dict.fromkeys

I don’t like this approach because the intention is not clear, but it works.

def remove_with_from_keys(data_list: List[int]):
    keys = dict.fromkeys(data_list)
    # {1: None, 2: None, 3: None, 4: None, 5: None}
    print(keys)
    return list(keys)

duplicated_int_list = [1, 2, 3, 4, 2, 3, 4, 5]
print(remove_with_from_keys(duplicated_int_list))
# [1, 2, 3, 4, 5]
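
Since Python 3.7, dict keeps insertion order, so this approach also keeps the first occurrence of each value in its original position. A short sketch with made-up string data:

```python
letters = ["b", "a", "b", "c", "a"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_letters = list(dict.fromkeys(letters))

print(unique_letters)
# ['b', 'a', 'c']
```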

Remove duplicates by using the index function

This technique is often used in other languages like JavaScript. For each item, search the list for the first index of the same value and compare it with the current index. If the two indexes are the same, it is the first occurrence. Otherwise, it’s a duplicate.

def remove_with_comparing_index(data_list: List[int]):
    result = []
    for index, item in enumerate(data_list):
        if index == data_list.index(item):
            result.append(item)
    return result

duplicated_int_list = [1, 2, 3, 4, 2, 3, 4, 5]
print(remove_with_comparing_index(duplicated_int_list))
# [1, 2, 3, 4, 5]
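
To make the comparison visible, the following sketch prints the current index, the value, and the index of the first occurrence for a shorter, made-up list. An item is kept only when both indexes match.

```python
data = [1, 2, 3, 2, 1]

# (current index, item, index of the first occurrence)
rows = [(index, item, data.index(item)) for index, item in enumerate(data)]

for index, item, first_index in rows:
    print(index, item, first_index, index == first_index)
# 0 1 0 True
# 1 2 1 True
# 2 3 2 True
# 3 2 1 False
# 4 1 0 False
```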

This post explains it in more detail with a table.

Object

Removing duplicates from an object list is harder than from a literal list. I created the following simple classes for the examples.

class Point:
    def __init__(self, x, y) -> None:
        self.x = x
        self.y = y

class PointEx(Point):
    def __eq__(self, __o: object) -> bool:
        return self.x == getattr(__o, "x", None) and self.y == getattr(__o, "y", None)

    def __hash__(self) -> int:
        return hash(("x", self.x, "y", self.y))

def show_points(points: Union[List[Point], List[PointEx]]):
    for point in points:
        print(f"({point.x}, {point.y})")

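By default, comparing two different objects with == returns False even if their attributes are equal, because the comparison is based on identity. PointEx fixes this by defining __eq__ and __hash__.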

duplicated_points = [
    Point(1, 1),
    Point(2, 2),
    Point(3, 3),
    Point(4, 4),
    Point(2, 2),
    Point(6, 3),
    Point(4, 4),
    Point(5, 5),
]
print(Point(1, 1) == Point(1, 1))  # False

duplicated_pointExes = [
    PointEx(1, 1),
    PointEx(2, 2),
    PointEx(3, 3),
    PointEx(4, 4),
    PointEx(2, 2),
    PointEx(6, 3),
    PointEx(4, 4),
    PointEx(5, 5),
]
print(PointEx(4, 4).__eq__(PointEx(4, 4))) # True
print(PointEx(4, 4) == PointEx(4, 4))      # True
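
As an alternative to writing __eq__ and __hash__ by hand, the standard dataclasses module can generate both. This is only a sketch and not the class used in the rest of this post. PointData is a hypothetical name, and frozen=True is used so that the generated class is hashable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # eq=True (the default) plus frozen=True generates __hash__
class PointData:
    x: int
    y: int


print(PointData(4, 4) == PointData(4, 4))  # True
print(len({PointData(4, 4), PointData(4, 4), PointData(2, 2)}))  # 2
```

The trade-off is that a frozen dataclass is immutable, so x and y can't be reassigned after creation.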

Remove duplicates from an object list with for loop

The first solution for an object list is also a for loop. If you don’t want to define __eq__ and __hash__ in the class, this is the first choice.

def equal_point(a: Point, b: Point):
    return a.x == b.x and a.y == b.y

def remove_object_for(points: List[Point]):
    result: List[Point] = []
    for current in points:
        found = False
        for stored in result:
            if equal_point(current, stored):
                found = True
        if found is False:
            result.append(current)
    return result


show_points(remove_object_for(duplicated_points))
# (1, 1)
# (2, 2)
# (3, 3)
# (4, 4)
# (6, 3)
# (5, 5)

The logic is basically the same as the literal version, but the code is more verbose.

If the __eq__ method is defined in the class, two objects are directly comparable.

def remove_object_for2(points: List[PointEx]):
    result: List[PointEx] = []
    for current in points:
        found = False
        for stored in result:
            if current == stored:
                found = True
        if found is False:
            result.append(current)
    return result

But this alone doesn’t improve the code. If __eq__ is correctly implemented, the `in` operator uses it, so the function can be written in the same way as the literal version.

def remove_object_for3(points: List[PointEx]):
    result: List[PointEx] = []
    for current in points:
        if current not in result:
            result.append(current)
    return result

Of course, the result is the same.
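
For the plain Point class without __eq__, the inner loop and the found flag can also be replaced by the built-in any(). Below is a minimal sketch, with Point and equal_point repeated so that the block runs on its own. remove_object_any is a made-up name.

```python
from typing import List


class Point:
    def __init__(self, x, y) -> None:
        self.x = x
        self.y = y


def equal_point(a: Point, b: Point) -> bool:
    return a.x == b.x and a.y == b.y


def remove_object_any(points: List[Point]) -> List[Point]:
    result: List[Point] = []
    for current in points:
        # any() stops at the first stored point that matches
        if not any(equal_point(current, stored) for stored in result):
            result.append(current)
    return result


unique = remove_object_any([Point(1, 1), Point(2, 2), Point(1, 1), Point(6, 3)])
print([(p.x, p.y) for p in unique])
# [(1, 1), (2, 2), (6, 3)]
```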

Remove duplicates from an object list with set

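set can’t be used with the plain Point class. To use set, both __eq__ and __hash__ need to be implemented. Otherwise, every instance is treated as a distinct value and nothing is removed.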

print(Point(4, 4).__eq__(Point(4, 4))) # NotImplemented
show_points(list(set(duplicated_points)))
# (2, 2)
# (4, 4)
# (1, 1)
# (2, 2)
# (5, 5)
# (3, 3)
# (4, 4)
# (6, 3)


print(PointEx(4, 4).__eq__(PointEx(4, 4))) # True
show_points(list(set(duplicated_pointExes)))
# (6, 3)
# (3, 3)
# (1, 1)
# (4, 4)
# (5, 5)
# (2, 2)

If the class has a lot of attributes, __eq__ and __hash__ become long, but once they are implemented in the class, the business logic becomes more readable.
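
The output above also shows that set does not keep the original order. If the order matters and the class is hashable, dict.fromkeys can be used instead. This is a sketch under that assumption. It uses a compact copy of PointEx, slightly simplified from the class above, so that the block runs on its own.

```python
class PointEx:
    def __init__(self, x, y) -> None:
        self.x = x
        self.y = y

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, PointEx):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self) -> int:
        return hash((self.x, self.y))


points = [PointEx(1, 1), PointEx(2, 2), PointEx(1, 1), PointEx(6, 3), PointEx(2, 2)]

# dict keys are unique (decided by __eq__ and __hash__) and keep insertion order
unique_points = list(dict.fromkeys(points))
print([(p.x, p.y) for p in unique_points])
# [(1, 1), (2, 2), (6, 3)]
```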

Remove duplicates from an object list with the index function

list.index relies on ==, so for a class without __eq__ it’s necessary to define a function that returns the index of the searched value. Once it’s implemented, the same logic as the literal version can be applied.

def remove_object_enumerate(points: List[Point]):
    def index_where(data_list: List[Point], search: Point):
        for index, item in enumerate(data_list):
            if equal_point(item, search) is True:
                return index
        return -1

    result: List[Point] = []
    for index, current in enumerate(points):
        if index == index_where(points, current):
            result.append(current)

    return result


show_points(remove_object_enumerate(duplicated_points))
# (1, 1)
# (2, 2)
# (3, 3)
# (4, 4)
# (6, 3)
# (5, 5)

Partial comparison

If __eq__ and __hash__ are implemented, the same rule is applied to every comparison. However, sometimes two objects need to be compared by only some of their properties. How can we implement it if we want to change the condition per call?

Let’s define a new method in the class.

class PointEx(Point):
    # ... __eq__ and __hash__ functions here

    def equal_with(self, __o: object, *keys: str) -> bool:
        if isinstance(__o, PointEx) is False:
            return False

        if len(keys) == 0:
            return True

        for key in keys:
            if getattr(self, key) != getattr(__o, key):
                return False
        return True

This method allows us to use arbitrary properties for the comparison. Let’s compare PointEx objects by the y property only.

duplicated_pointExes = [
    PointEx(1, 1),
    PointEx(2, 2),
    PointEx(3, 3),
    PointEx(4, 4),
    PointEx(2, 2),
    PointEx(6, 3),
    PointEx(4, 4),
    PointEx(5, 5),
]

def remove_object_for_partial(points: List[PointEx]):
    result: List[PointEx] = []
    for current in points:
        found = False
        for stored in result:
            if current.equal_with(stored, "y"):
                found = True
        if found is False:
            result.append(current)
    return result


show_points(remove_object_for_partial(duplicated_pointExes))
# (1, 1)
# (2, 2)
# (3, 3)
# (4, 4)
# (5, 5)

When passing both x and y, (6, 3) is no longer treated as a duplicate of (3, 3).

def remove_object_for_partial(points: List[PointEx]):
    result: List[PointEx] = []
    for current in points:
        found = False
        for stored in result:
            if current.equal_with(stored, "x", "y"):
                found = True
        if found is False:
            result.append(current)
    return result


show_points(remove_object_for_partial(duplicated_pointExes))
# (1, 1)
# (2, 2)
# (3, 3)
# (4, 4)
# (6, 3)   <-- difference
# (5, 5)

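When no keys are passed, equal_with returns True for any two PointEx objects, so every item after the first one is treated as a duplicate and only one item remains.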

def remove_object_for_partial(points: List[PointEx]):
    result: List[PointEx] = []
    for current in points:
        found = False
        for stored in result:
            if current.equal_with(stored):
                found = True
        if found is False:
            result.append(current)
    return result


show_points(remove_object_for_partial(duplicated_pointExes))
# (1, 1)
