A linked list is a suitable and efficient data structure when elements must be inserted and removed frequently.
What is a data structure that allows efficient retrieval?
In computer science, a search data structure is any data structure that allows the efficient retrieval of specific items from a set of items, such as a specific record from a database. The simplest, most general, and least efficient search structure is merely an unordered sequential list of all the items.
- Locating the desired item in such a list, by the linear search method, inevitably requires a number of operations proportional to the number n of items, in the worst case as well as in the average case.
- Useful search data structures allow faster retrieval; however, they are limited to queries of some specific kind.
Moreover, since the cost of building such structures is at least proportional to n, they only pay off if several queries are to be performed on the same database (or on a database that changes little between queries). Static search structures are designed for answering many queries on a fixed database; dynamic structures also allow insertion, deletion, or modification of items between successive queries.
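To make the contrast concrete, here is a minimal sketch (values made up for illustration) of an unordered list searched linearly versus a simple static search structure: a sorted copy of the same items queried with binary search via Python's bisect module.

```python
import bisect

# Unordered sequential list: linear search costs O(n) comparisons in the worst case.
records = [42, 7, 19, 3, 88, 25]

def linear_search(items, target):
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

# Simple static search structure: sort once (O(n log n)), then answer
# many membership queries in O(log n) each with binary search.
sorted_records = sorted(records)

def binary_search(sorted_items, target):
    i = bisect.bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

print(linear_search(records, 88))         # 4 (index in the original list)
print(binary_search(sorted_records, 88))  # 5 (index in the sorted copy)
```

The one-time sorting cost only pays off when several queries are run against the same data, which is exactly the trade-off described above.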
What are tree data structures good for?
We have all seen trees since childhood. A tree has roots, a stem, branches, and leaves, and it was observed long ago that each leaf can be traced back to the root via a unique path. Hence the tree structure is used to represent hierarchical relationships, e.g., a family tree or an organization chart.
What is a decision tree and how is it constructed?
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
- What decision trees are in general.
- Types of decision trees.
- Algorithms used to build decision trees.
- The step-by-step process of building a decision tree.
Fig.1 - Decision tree based on a yes/no question. The picture above is a simple decision tree: if a person is non-vegetarian, then he/she (most probably) eats chicken; otherwise, he/she doesn't eat chicken. The decision tree, in general, asks a question and classifies the person based on the answer.

Fig.2 - Decision tree based on numeric data. If a person is driving above 80 km/h, we consider it over-speeding; otherwise not.

Fig.3 - Decision tree on ranked data. Here is one more simple decision tree, based on ranked data, where rank 1 means the speed is too high and rank 2 corresponds to a much lower speed. If a person is speeding above rank 1, then he/she is highly over-speeding.

Fig.4 - A more complex decision tree. It combines numeric data with yes/no data. For the most part, decision trees are simple to work with: you start at the top and work your way down until you reach a point where you can't go any further.
- That’s how a sample is classified.
- The very top of the tree is called the root node or just the root.
- The nodes in between are called internal nodes.
- Internal nodes have arrows pointing to them and arrows pointing away from them.
- The end nodes are called the leaf nodes, or just leaves.
- Leaf nodes have arrows pointing to them but no arrows pointing away from them.
In the above diagrams, root nodes are represented by rectangles, internal nodes by circles, and leaf nodes by inverted triangles. There are several algorithms to build a decision tree.
- CART: Classification and Regression Trees
- ID3: Iterative Dichotomiser 3
- C4.5
- CHAID: Chi-squared Automatic Interaction Detection
We will discuss only the CART and ID3 algorithms, as they are the most widely used. CART is a DT algorithm that produces binary classification or regression trees, depending on whether the dependent (target) variable is categorical or numeric, respectively.

Fig.5 - Sample dataset. Now let's discuss how to build a decision tree from a raw table of data. In the example above, we will build a decision tree that uses chest pain, good blood circulation, and the status of blocked arteries to predict whether a person has heart disease.

Fig.6 - Chest pain as the root node. There are two leaf nodes, one for each outcome of chest pain. Each leaf contains the number of patients having and not having heart disease for the corresponding value of chest pain. Now we do the same thing for good blood circulation and blocked arteries.

Fig.7 - Good blood circulation as the root node. Fig.8 - Blocked arteries as the root node. We can see that none of the three features separates the patients with heart disease from the patients without heart disease perfectly. Note that the total number of patients with heart disease differs across the three cases.
This is done to simulate the missing values present in real-world datasets. Because none of the leaf nodes is 100% 'yes heart disease' or 100% 'no heart disease', they are all considered impure. To decide which separation is best, we need a method to measure and compare impurity. The metric used in the CART algorithm to measure impurity is the Gini impurity score.
Calculating Gini impurity is very easy. Let's start by calculating the Gini impurity for chest pain.

Fig.9 - Chest pain separation. For the left leaf:
Gini impurity = 1 − (probability of 'yes')² − (probability of 'no')² = 1 − (105/(105+39))² − (39/(105+39))² ≈ 0.395.
Similarly, for the right leaf:
Gini impurity = 1 − (34/(34+125))² − (125/(34+125))² ≈ 0.336.
Now that we have measured the Gini impurity for both leaf nodes, we can calculate the total Gini impurity of using chest pain to separate patients with and without heart disease: it is the weighted average of the two leaf impurities, where each leaf is weighted by the fraction of patients it contains.

Fig.10 - Good blood circulation at the root node. Next we need to figure out how well 'chest pain' and 'blocked arteries' separate the 164 patients in the left node (37 with heart disease and 127 without). Just like before, we separate these patients with 'chest pain' and calculate the Gini impurity value.

Fig.11 - Chest pain separation. The Gini impurity is 0.3. Then we do the same thing for 'blocked arteries'.

Fig.12 - Blocked arteries separation. The Gini impurity is 0.29. Since 'blocked arteries' has the lowest Gini impurity, we use it at the left node in Fig.10 to further separate the patients.

Fig.13 - Blocked arteries separation. All we have left is 'chest pain', so we will see how well it separates the 49 patients in the left node (24 with heart disease and 25 without).

Fig.14 - Chest pain separation in the left node. We can see that chest pain does a good job of separating these patients.

Fig.15 - Final chest pain separation. These are the final leaf nodes of the left side of this branch of the tree. Now let's see what happens when we try to separate the node containing 13/102 patients using 'chest pain'. Note that almost 90% of the people in this node do not have heart disease.

Fig.16 - Chest pain separation on the right node. The Gini impurity of this separation is 0.29, but the Gini impurity of the parent node before using chest pain to separate the patients is
Gini impurity = 1 − (13/(13+102))² − (102/(13+102))² ≈ 0.2.
The impurity is lower if we don't separate patients using 'chest pain', so this node is kept as a leaf.

Fig.17 - Left side completed. At this point, we have worked out the entire left side of the tree. The same steps are followed to work out the right side of the tree:
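To make the calculation concrete, here is a minimal Python sketch that reproduces the two leaf impurities above and the weighted total; the leaf counts are the ones from Fig.9, and the weighting step follows the standard CART convention.

```python
def gini(yes, no):
    """Gini impurity of a leaf with `yes` and `no` patient counts."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

# Leaf counts for the chest-pain split (Fig.9).
left, right = (105, 39), (34, 125)
g_left, g_right = gini(*left), gini(*right)      # ~0.395 and ~0.336

# Total Gini impurity of the split: weighted average of the leaf impurities,
# each leaf weighted by the fraction of patients it contains.
n_left, n_right = sum(left), sum(right)
g_total = (n_left * g_left + n_right * g_right) / (n_left + n_right)
print(round(g_left, 3), round(g_right, 3), round(g_total, 3))  # 0.395 0.336 0.364
```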
1. Calculate the Gini impurity scores.
2. If the node itself has the lowest score, there is no point in separating the patients any further and it becomes a leaf node.
3. If separating the data results in an improvement, pick the separation with the lowest impurity value.
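Below is a sketch of that split-selection rule. The chest-pain counts come from Fig.9 and the left-hand counts for 'good blood circulation' from the text; the right-hand counts for 'good blood circulation' and the node totals are hypothetical stand-ins, since the figures with the real numbers are not reproduced here.

```python
def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(split):
    """Weighted Gini impurity of a binary split given as ((yes, no), (yes, no))."""
    (ly, ln), (ry, rn) = split
    n_left, n_right = ly + ln, ry + rn
    return (n_left * gini(ly, ln) + n_right * gini(ry, rn)) / (n_left + n_right)

def choose_split(node_counts, candidates):
    """Return the name of the best split, or None if the node's own impurity is
    already the lowest, in which case the node becomes a leaf."""
    best_name, best_score = None, gini(*node_counts)
    for name, split in candidates.items():
        score = weighted_gini(split)
        if score < best_score:
            best_name, best_score = name, score
    return best_name

candidates = {
    "chest pain": ((105, 39), (34, 125)),              # counts from Fig.9
    "good blood circulation": ((37, 127), (100, 33)),  # left counts from the text; right counts hypothetical
}
node = (139, 164)  # totals taken from the chest-pain figure; totals differ slightly across features (missing values)
print(choose_split(node, candidates))  # 'good blood circulation' wins here (lowest weighted impurity)
```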
Fig.18 - Complete decision tree. The process of building a decision tree with the ID3 algorithm is almost the same as with the CART algorithm, except for the metric used to measure purity/impurity. The metric used in the ID3 algorithm is called entropy. Entropy measures the uncertainty of a class in a subset of examples. Assume an item x belongs to a subset S with two classes, positive and negative. Entropy is defined as the number of bits needed to say whether x is positive or negative: Entropy(S) = −p(+) log2 p(+) − p(−) log2 p(−). For two classes, entropy is always a number between 0 and 1.

Fig-19. Entropy vs. p(+). The plot shows the relation between entropy and p(+), the probability of the positive class. Entropy reaches its maximum value of 1 when p(+) = 0.5, i.e., when an item is equally likely to be positive or negative.

Fig-20. Building an ID3 tree. Consider the part of the problem discussed above for the CART algorithm: we need to decide whether to use chest pain or blocked arteries to separate the left node containing 164 patients (37 with heart disease and 127 without). The entropy before splitting is −(37/164) log2(37/164) − (127/164) log2(127/164) ≈ 0.77. Let's see how well chest pain separates the patients.

Fig.21 - Chest pain separation. The entropy for the left node and the right node is calculated the same way, and the total gain in entropy (information gain) after splitting on chest pain is the parent entropy minus the weighted average of the child entropies. This implies that if, in the current situation, we were to pick chest pain for splitting the patients, we would gain 0.098 bits of certainty about whether a patient has heart disease. Doing the same for blocked arteries, the gain obtained is 0.117.
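Here is a small sketch of the entropy and information-gain calculation. The parent counts (37/127) are from the text; the chest-pain leaf counts are hypothetical, since Fig.21 is not reproduced here, so the printed gain will not match the article's 0.098 exactly.

```python
import math

def entropy(yes, no):
    """Entropy (in bits) of a node with `yes` and `no` counts."""
    total = yes + no
    h = 0.0
    for count in (yes, no):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, left, right):
    """Parent entropy minus the weighted average entropy of the two children."""
    n_left, n_right = sum(left), sum(right)
    weighted = (n_left * entropy(*left) + n_right * entropy(*right)) / (n_left + n_right)
    return entropy(*parent) - weighted

parent = (37, 127)                 # the 164-patient node from the example
left, right = (25, 50), (12, 77)   # hypothetical chest-pain leaf counts
print(round(entropy(*parent), 3))                       # ~0.77 bits of uncertainty before splitting
print(round(information_gain(parent, left, right), 3))  # gain for these hypothetical counts
```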
Since splitting with blocked arteries gives us more certainty, it would be picked. We can repeat the same procedure for all the nodes to build a DT based on the ID3 algorithm. Note: The decision of whether to split a node into 2 or to declare it as a leaf node can be made by imposing a minimum threshold on the gain value required.
If the acquired gain is above the threshold value, we split the node; otherwise we leave it as a leaf node. The following are the takeaways from this article:
- The general concept behind decision trees.
- The basic types of decision trees.
- Different algorithms to build a decision tree.
- Building a decision tree using the CART algorithm.
- Building a decision tree using the ID3 algorithm.
1. Refer to this playlist on YouTube for more details on building decision trees using the CART algorithm.
2. Refer to this playlist on YouTube for more details on building decision trees using the ID3 algorithm.
What is level of tree in data structure?
Level. In a tree data structure, the root node is said to be at level 0, the root node's children are at level 1, the children of a level-1 node are at level 2, and so on.
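As a minimal illustration of this definition, the sketch below computes the level of every node with a breadth-first traversal over a small made-up tree.

```python
from collections import deque

# A small made-up tree described as {node: [children]}; "A" is the root.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}

def levels(tree, root):
    """Return {node: level}, with the root at level 0 and children one level deeper."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in tree[node]:
            level[child] = level[node] + 1
            queue.append(child)
    return level

print(levels(tree, "A"))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2}
```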
Which data structure is memory efficient?
Concept – A Bloom filter is a space-efficient approximate data structure. It can be used if not even a well-loaded hash table fits in memory and we need constant read access. Due to its approximate nature, however, one must be aware that false positives are possible, i.e.
- A membership request might return true although the element was never inserted into the set.
- The idea behind a Bloom filter is similar to a hash table.
- The first difference is that instead of reserving space for an array of integers, we allocate an array of m bits.
- Secondly, we utilize not one but k independent hash functions.
Each hash function takes a (potential) element of the set and produces a position in the bit array. Initially all bits are set to 0. In order to insert a new value into the set, we compute its hash value using each of the k hash functions. We then set all bits at the corresponding positions to 1. A membership query also computes all hash values and checks the bits at every position.
If all bits are set to 1, we return true with a certain probability of being correct. If we see at least one 0, we can be sure that the element is not a member of the set. The probability of false positives depends on the number of hash functions k, the size of the bit array m, and the number of elements n already inserted into the Bloom filter.
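Here is a minimal Python sketch of the mechanics just described. It simulates the k hash functions by salting Python's built-in hash(), which is a simplification of truly independent hash functions, and the parameters m and k are illustrative.

```python
class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # number of bits
        self.k = k            # number of hash functions
        self.bits = [0] * m   # all bits start at 0

    def _positions(self, item):
        # Simulate k independent hash functions by salting Python's hash();
        # a real implementation would use proper independent hashes.
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, k=5)
bf.add("https://example.com")
print("https://example.com" in bf)   # True
print("https://example.org" in bf)   # False (with high probability)
```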
Assuming that all bits are set independently, the false positive rate FP can be approximated by FP ≈ (1 − e^(−kn/m))^k. For a Bloom filter with 1 million bits, 5 hash functions, and 100k elements already inserted, we get FP ≈ 1%. If you want to play around with the variables yourself, check out this awesome Bloom filter calculator by Thomas Hurst. A big disadvantage, besides the fact that there can be false positives, is that deletions are not supported.
We cannot simply set all positions of the element to be deleted to 0, as we do not know if there have been hash collisions while inserting other elements. Well, we cannot even be sure if the element we are trying to delete is inside the Bloom filter because it might be a false positive. So what are Bloom filters used for in practice? One use case is in web crawling.
In order to avoid duplicate work, a crawler needs to determine whether it has already visited a site before following a link. Bloom filters are a great fit, as storing all visited websites in memory is not really possible. A false positive merely means that we skip a website even though it has not been visited before.
- If the output of the crawler is used as input to a search engine, we do not really mind: highly ranked websites will most likely have more than one incoming link, so we have a chance of seeing them again.
- Another use case is in databases.
- Bloom filters are used in combination with LSM trees, which we already know from the previous blog post.
When performing a read operation we potentially have to inspect every level of the log. We can use a Bloom filter to efficiently check if the key we are looking for is present in each of the blocks. If the Bloom filter tells us that the key is not present we can be sure and do not have to touch that block, which is very useful for higher levels which are commonly stored on disk.
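A rough sketch of that read path follows; plain Python sets stand in for the per-block Bloom filters (a real filter may report false positives, a set never does), dicts stand in for the blocks, and the keys and values are made up.

```python
def lsm_get(key, levels):
    """levels: (bloom, block) pairs ordered from newest to oldest.
    A filter miss lets us skip the (possibly on-disk) block entirely."""
    for bloom, block in levels:
        if key not in bloom:     # definitely not here: skip without touching the block
            continue
        if key in block:         # possibly here: do the actual lookup
            return block[key]
    return None

levels = [
    ({"user:42"}, {"user:42": "Alice"}),   # newest, in-memory level
    ({"user:7"},  {"user:7": "Bob"}),      # older level, typically on disk
]
print(lsm_get("user:7", levels))  # 'Bob'; the first block is skipped thanks to its filter
```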
Which data structure is most efficient for an in-memory text file?
- Gap Buffer Data Structure
- Example: Consider an initial gap size of 10. Initially the array and the gap are the same size. As we insert elements into the array, they are placed in the gap, the only difference being that the gap shrinks with each insertion.
- This is the basic case of inserting characters at the front.
A gap buffer is a data structure used to store and edit text efficiently while it is being edited. It is similar to an array, but a gap is introduced into the array to handle multiple changes at the cursor. The gap can be thought of as a stretch of empty cells inside the array.
Now, whenever we need to insert a character at a certain position, we move the gap to that position using left() and right() and then insert the character there.
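Here is a minimal gap-buffer sketch along these lines, with left(), right(), and insert() as described above; the class layout, the None markers for empty gap cells, and the growth step are my own illustrative choices.

```python
class GapBuffer:
    def __init__(self, gap_size=10):
        self.buffer = [None] * gap_size  # None marks empty gap cells
        self.gap_start = 0
        self.gap_end = gap_size          # the gap is buffer[gap_start:gap_end]

    def left(self):
        """Move the gap (cursor) one position to the left."""
        if self.gap_start > 0:
            self.gap_start -= 1
            self.gap_end -= 1
            self.buffer[self.gap_end] = self.buffer[self.gap_start]
            self.buffer[self.gap_start] = None

    def right(self):
        """Move the gap (cursor) one position to the right."""
        if self.gap_end < len(self.buffer):
            self.buffer[self.gap_start] = self.buffer[self.gap_end]
            self.buffer[self.gap_end] = None
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, ch):
        """Insert a character at the cursor (the start of the gap)."""
        if self.gap_start == self.gap_end:             # gap exhausted: grow it
            self.buffer[self.gap_end:self.gap_end] = [None] * 10
            self.gap_end += 10
        self.buffer[self.gap_start] = ch
        self.gap_start += 1

    def text(self):
        return "".join(c for c in self.buffer if c is not None)

buf = GapBuffer()
for ch in "HELLO":
    buf.insert(ch)
buf.left(); buf.left()   # move the cursor two characters back
buf.insert("X")          # insert at that position
print(buf.text())        # HELXLO
```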
Which algorithm is more efficient and why?
The most efficient algorithm is one that takes the least amount of execution time and memory usage possible while still yielding a correct answer.
Which is the fastest algorithm in data structure?
Which is the best sorting algorithm? The time complexity of Quicksort is O(n log n) in the best and average cases and O(n^2) in the worst case. But since it has the upper hand in the average case for most inputs, Quicksort is generally considered the "fastest" sorting algorithm.
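For illustration, here is a minimal (not in-place) quicksort sketch; choosing the pivot at random is one common way to make the O(n^2) worst case unlikely in practice.

```python
import random

def quicksort(items):
    """Average case O(n log n); worst case O(n^2) (e.g., with consistently bad pivots)."""
    if len(items) <= 1:
        return items
    pivot = random.choice(items)            # random pivot makes the worst case unlikely
    smaller = [x for x in items if x < pivot]
    equal   = [x for x in items if x == pivot]
    larger  = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```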
Which data structure is most efficient for recursion?
Highlights:

- The meaning of the term "recursion".
- Recursion has two components: a base case and a recurrence relation.
- The stack data structure is used to implement recursion.
Recursion: Recursion is a problem-solving technique in which a function calls itself on smaller and smaller inputs until it reaches a base case, i.e., the smallest input with a trivial solution, and the solution is then built back up from that point.
Recursion has two parts: the base condition and the recurrence relation. Let's understand them using the factorial of a number. Recurrence: the recurrence is the relationship between the same function on inputs of different sizes, i.e., we compute the solution for a larger input from the solution for a smaller input.
For example, say we need to calculate the factorial of a number N, and we write a helper function fact(N) that returns it. The factorial of N can then be expressed as fact(N) = N * fact(N - 1). The function fact(N) calls itself with a smaller input; this equation is the recurrence relation. Recursion unwinds once it reaches the base case, and the pending function calls are stored on the stack in memory: with each call the stack keeps filling until the base case, fact(1) = 1, is reached. After that, each pending call is evaluated in last-in, first-out order, as the sketch below illustrates.
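The sketch below shows the fact(N) recursion from the example, together with an explicit-stack version that mimics how the pending calls pile up and are then resolved last in, first out.

```python
def fact(n):
    """Recurrence: fact(n) = n * fact(n - 1); base case: fact(1) = 1."""
    if n <= 1:            # base case
        return 1
    return n * fact(n - 1)

def fact_with_explicit_stack(n):
    """Same computation, but managing the 'call stack' ourselves:
    pending multiplications are pushed, then resolved last in, first out."""
    stack = []
    while n > 1:
        stack.append(n)   # pending function calls pile up until the base case
        n -= 1
    result = 1            # fact(1) = 1, the base case
    while stack:
        result *= stack.pop()
    return result

print(fact(5), fact_with_explicit_stack(5))  # 120 120
```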
Which is the most efficient complexity?
Constant complexity – O(1). The most efficient algorithm, in theory, runs in constant time and consumes a constant amount of memory regardless of input size, e.g., accessing an array element by its index.