What is it?
Database index is very similar to the index given at the end of a book. A DB table index is essentially a copy of certain column(s) of the table along with the pointers to the actual memory location of the record that are stored in a specific way for easy retrieval.
So, why do we need an index?
For just the same reason why we would use an index in a book! Yes! It is for faster retrieval of data from the table. It makes better sense only for those tables which hold millions of records and doing a full-table scan for retrieval of data would be an over-kill!
The index however comes with its own cost – of course, the extra memory space. But it’s not just that… We’ll discover more on the way!
How is the index stored and how does it work?
This, by itself is a research topic. But as we saw – the primary purpose of index is to make the retrieval faster – which means, the index should be stored in such a way that accessing the elements in it should be sub-linear or of constant time (if possible). Else, the whole point of keeping an index would go waste.
The most commonly used structure is B-tree which has log(n) lookup time and stores the elements in sorted way. We can also have HashMap implementation which has a constant time lookup – O(1) – however, it has no ordering!
If you are wondering “So what? Why do I need to bother about ordering now?“, then think of a scenario where you want to select all those students who were born before 1990 or students who have scored > 60 where ordering does matter. In that case, indexing the correct column using a B-Tree would be of more help because HashMap has no way of telling you anything about order!
An appropriate data structure should be chosen based on the requirements of data. There is also a data structure called R-Tree when you have requirements related to spatial ordering – like you want to find the petrol bunk within 2 km radius of your current location! (Wikipedia would give you other types of indexes available)
And… what are the costs involved?
As mentioned earlier, space is definitely a big cost involved. You will incur more space when there are billions of records. And, you should also be updating your index whenever your table is updated. Having a b-tree index means sorting the tree every time an insert / delete / update happens.
Is it worth it?
Index may not be worth it if you are trying to use it for those tables which hardly get any lookup queries! You need index only when there are more searching operations! Else, Index is a mere waste of space and processing time.
And, furthermore, indexing column should be wisely chosen! Mostly the primary keys are chosen for index by default in the DBMS. Because, the indexing column should be as unique as possible! Else, it would again end up as a linear search and wouldn’t help much in optimizing the performance. Having a cardinality of 2 would just divide the data into 2 sets whereas a cardinality of 1000 would return 1000 records.
An important point to note here is – your index is not always guaranteed to be used! You can’t force the DBMS to use it in any query. The query optimizer makes the decision for every query based on how much performance improvement is possible with the given index. And the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.