Java set contains complexity

Содержание

What is the time complexity performance of HashSet.contains() in Java?
java-hashset-arraylist-contains-performance
Performance of contains() in a HashSet vs ArrayList
1. Introduction
2. HashSet.contains()
3. ArrayList.contains()
4. Benchmark Testing
5. Conclusion

What is the time complexity performance of HashSet.contains() in Java?

The worst-case performance of contains will be O(log n) for Java 8, and O(n) for Java 7, but average case closer to O(1). This is because the hashset is backed by a hashmap, and thus has the same efficiency as hashmap lookup (ie, HashMap.get(. )). The actual mapping in a hashmap is constant time (O(1)), but the need to handle collisions brings the cost to log n. That is, multiple elements that hash to the same array index must be stored in a secondary data structure (aka bucket), and it is this bucket that determines the worst-case performance. In Java, hashmap collision-handling is implemented using a self-balanced tree.

Self-balanced trees guarantee O(log n) for all operations, hence, insertion and lookup in hashmap (and hashset) has a total cost of O(1) + O(log n) = O(log n). The use of a self-balanced tree for collision-handling was introduced in Java 8 as an improvement over chaining (used until java 7), which uses a linked-list, and has a worst case of O(n) for lookup and insertion (as it needs to traverse the list). Notice that chaining would have constant time for insertion (as opposed to lookup), since elements can be added to a linked list in O(1), but the set property (no duplicates) is imposed on the linked-list in the case of hashmap, and it thus need to traverse the linked-list also in the case of insertion to ensure that the element does not already exist in the list/bucket, and we end up with O(n) for both insertion and lookup.

This class implements the Set interface, backed by a hash table (actually a HashMap instance). https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html

Buckets containing a large number of colliding keys will store their entries in a balanced tree instead of a linked list after certain threshold is reached. (https://www.nagarro.com/en/blog/post/24/performance-improvement-for-hashmap-in-java-8)

It runs in O(1) expected time, as any hash table (assuming the hash function is decent). It is backed by a HashMap where the key is the Object.

Two objects might have the same hash code, but the HashSet wouldn’t think they are identical, unless the equals method for these objects says they are the same (i.e. returns true).

The contains method calls (indirectly) getEntry of HashMap , where key is the Object for which you wish to know if it’s in the HashSet .

As you can see below, two objects can be stored in the HashMap / HashSet even if their key is mapped to the same value by the hash function. The method iterates over all keys that have the same hash value, and performs equals on each one to find the matching key.

final Entry getEntry(Object key) < int hash = (key == null) ? 0 : hash(key.hashCode()); for (Entrye = table[indexFor(hash, table.length)]; e != null; e = e.next) < Object k; if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k)))) return e; >return null; >

Источник

java-hashset-arraylist-contains-performance

Performance of contains() in a HashSet vs ArrayList

1. Introduction

In this quick guide, we’re going to discuss the performance of the contains() method available in java.util.HashSet and java.util.ArrayList. They are both collections for storing and manipulating objects.

HashSet is a collection for storing unique elements. To learn more about the HashSet, check out this link.

ArrayList is a popular implementation of the java.util.List interface.

We have an extended article about the ArrayList available here.

2. HashSet.contains()

Internally, the HashSet implementation is based on a HashMap instance. The contains() method calls HashMap.containsKey(object).

Here, it’s checking whether the object is in the internal map or not. The internal map stores data inside of the Nodes, known as buckets. Each bucket corresponds to a hash code generated with hashCode() method. So contains() is actually using hashCode() method to find the object’s location.

Читайте также: Php get all files in dir

Now let’s determine the lookup time complexity. Before moving ahead, make sure you are familiar with Big-O notation.

On average, the contains() of HashSet runs in O(1) time. Getting the object’s bucket location is a constant time operation. Taking into account possible collisions, the lookup time may rise to log(n) because the internal bucket structure is a TreeMap.

This is an improvement from Java 7 which used a LinkedList for the internal bucket structure. In general, hash code collisions are rare. So we can consider the elements lookup complexity as O(1).

3. *ArrayList.c**ontains()*

Internally, ArrayList uses the indexOf(object) method to check if the object is in the list. The indexOf(object) method iterates the entire array and compares each element with the equals(object) method.

Getting back to complexity analysis, the ArrayList.contains() method requires O(n) time. So the time we spend to find a specific object here depends on the number of items we have in the array.

4. Benchmark Testing

Now, let’s warm up the JVM with the performance benchmark test. We’ll use the JMH (Java Microbenchmark Harness) OpenJDK product. To learn more about setup and execution, check out our useful guide.

To start, let’s create a simple CollectionsBenchmark class:

@BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(TimeUnit.NANOSECONDS) @Warmup(iterations = 5) public class CollectionsBenchmark < @State(Scope.Thread) public static class MyState < private SetemployeeSet = new HashSet<>(); private List employeeList = new ArrayList<>(); private long iterations = 1000; private Employee employee = new Employee(100L, "Harry"); @Setup(Level.Trial) public void setUp() < for (long i = 0; i < iterations; i++) < employeeSet.add(new Employee(i, "John")); employeeList.add(new Employee(i, "John")); >employeeList.add(employee); employeeSet.add(employee); > > >

Here, we create and initialize HashSet and an ArrayList of Employee objects:

Читайте также: Css user select all

We add the employee = new Employee(100L, “Harry”) instance as the last elements to both collections. So we test the employee object’s lookup time for the worst possible case.

@OutputTimeUnit(TimeUnit.NANOSECONDS) indicates that we want the results in nanoseconds. The number of default @Warmup iterations are 5 in our case. The @BenchmarkMode is set to Mode.AverageTime, which means we’re interested in calculating an average running time. For the first execution, we put iterations = 1000 items in our collections.

After, we add our benchmark methods to the CollectionsBenchmark class:

@Benchmark public boolean testArrayList(MyState state)

Here we check whether the employeeList contains employee object.

Likewise, we have the familiar test for employeeSet:

@Benchmark public boolean testHashSet(MyState state)

Finally, we can run the test:

public static void main(String[] args) throws Exception

Benchmark Mode Cnt Score Error Units CollectionsBenchmark.testArrayList avgt 20 4035.646 ± 598.541 ns/op CollectionsBenchmark.testHashSet avgt 20 9.456 ± 0.729 ns/op

We can clearly see that the testArrayList method has 4035.646 ns average lookup score, while the testHashSet performs faster with 9.456 ns on average.

Now, let’s increase the elements count in our test and run it for iterations = 10.000 items:

Benchmark Mode Cnt Score Error Units CollectionsBenchmark.testArrayList avgt 20 57499.620 ± 11388.645 ns/op CollectionsBenchmark.testHashSet avgt 20 11.802 ± 1.164 ns/op

Here also, the contains() in HashSet has a huge performance advantage over the ArrayList.

5. Conclusion

This quick write-up explains the performance of the contains() method of the HashSet and ArrayList collections. With the help of the JMH benchmarking, we’ve presented the performance of contains() for each type of collection.

As a conclusion, we can learn, that the contains() method works faster in HashSet compared to an ArrayList.

As usual, the complete code for this article is over on GitHub project.