Skip to main content

Search algorithms

By Afshine Amidi and Shervine Amidi

Linear search is a basic search method often used when the relative ordering of elements is not known, e.g. in unsorted arrays. It has a complexity of O(n)\mathcal{O}(n) time and O(1)\mathcal{O}(1) space:

  • Scan step: Check whether the constraint of the problem is verified on the first element of the array.

    Repeat the process on the remaining elements in a sequential way.

  • Final step: Stop the search when the desired condition is satisfied.

Two-pointer technique

The two-pointer technique provides an efficient way to scan arrays by leveraging the assumptions of the problem, e.g. the fact that an array is sorted.

To illustrate this concept, suppose A=[a0,...,an1]A=[a_0, ..., a_{n-1}] a sorted array. Given SS, the goal is to find two elements ai,aja_i, a_j such that ai+aj=Sa_i+a_j=S.

The two-pointer technique finds an answer in O(n)\mathcal{O}(n) time:

  • Initialization: Initialize pointers ll and rr that are set to be the first and last element of the array respectively.

  • Compute step: Compute Spointersal+arS_{\textrm{pointers}}\triangleq a_{l}+a_{r}.

    • Case Spointers<SS_{\textrm{pointers}} < S: The computed sum needs to increase, so we move the ll pointer to the right.

    • Case Spointers>SS_{\textrm{pointers}} > S: The computed sum needs to decrease, so we move the rr pointer to the left.

  • Final step: Repeat the compute step until one of the following situations happens:

    • Case l=rl=r: There is no solution to the problem.

    • Case Spointers=SS_{\textrm{pointers}}=S: A solution is found with i=li=l and j=rj=r.

Trapping rain water

A classic problem that uses the two-pointer technique is the trapping rain water problem. Given a histogram H=[h0,...,hn1]H=[h_0, ..., h_{n-1}] that represents the height hi0h_i\geqslant0 of each building ii, the goal is to find the total volume VV of water that can be trapped between the buildings.

Tricks

  • The maximum volume of water viv_i trapped at index ii is determined by the relative height between hih_i and the tallest buildings on the left hlMh_l^{M} and on the right hrMh_r^{M}:

    vi=min(hlM,hrM)hi\boxed{v_i=\min(h_l^M, h_r^M)-h_i}
  • Use the two-pointer technique to efficiently update hlMh_l^M and hrMh_r^M at each step.

The solution is found in O(n)\mathcal{O}(n) time and O(1)\mathcal{O}(1) space:

  • Initialization:

    • Pointers: Set the left and right pointers to l=0l=0 and r=n1r=n-1.

    • Maximum heights: Set the maximum height of the left buildings hlMh_l^M and the right buildings hrMh_r^M to hlh_l and hrh_r.

    • Total amount of water: Set the total amount of water VV to 0.

  • Compute step: Update the values of the maximum heights of left and right buildings with the current values:

    hlM=max(hlM,hl)andhrM=max(hrM,hr)\boxed{h_l^M=\max(h_l^M, h_l)}\quad\textrm{and}\quad\boxed{h_r^M=\max(h_r^M, h_r)}
    • If hlM<hrMh_l^M < h_r^M, then the left side is the limiting side, so:

      • Add the available volume vl=hlMhlv_l = h_l^M - h_l at index ll to the total amount of water VV:

        VV+vl\boxed{V\longleftarrow V+v_l}
      • Move the left pointer to the right:

        ll+1\boxed{l\longleftarrow l+1}

    • Conversely, if hlMhrMh_l^M \geqslant h_r^M, then:

      • Add the available volume vr=hrMhrv_r = h_r^M - h_{r} at index rr to the total amount of water VV:

        VV+vr\boxed{V\longleftarrow V+v_r}
      • Move the right pointer to the left:

        rr1\boxed{r\longleftarrow r-1}

  • Final step: The algorithm finishes when the two pointers meet, in which case the amount of water is given by VV.

Binary search aims at finding a given target value tt in a sorted array A=[a0,...,an1]A=[a_0, ..., a_{n-1}].

This algorithm has a complexity of O(log(n))\mathcal{O}(\log(n)) time and O(1)\mathcal{O}(1) space:

  • Initialization: Choose the lower bound ll and the upper bound rr corresponding to the desired search space. Here, l=0l=0 and r=n1r=n-1.

  • Compute step: Compute the middle mm of the two bounds:

    ml+rl2\boxed{m\triangleq l+\left\lfloor\frac{r-l}{2}\right\rfloor}

    The above formula is written in a way that avoids integer overflow in case ll and rr are too large.

    We have the following cases:

    • Case t<amt<a_m: Shrink the search space to the left side by updating rr to r=m1r=m-1.

    • Case t>amt>a_m: Shrink the search space to the right side by updating ll to l=m+1l=m+1.

    • Case t=amt=a_m: We found a solution.

  • Final step: Repeat the compute step until one of the following situations happens:

    • Case t=amt=a_m: A solution is found.

    • Case r<lr<l: There is no solution since the pointers intersect after fully covering the search space.

Median of sorted arrays

Suppose we want to compute the median of two sorted arrays A,BA, B of respective lengths n1,n2n_1, n_2, with n1n2n_1\leqslant n_2.

A naive approach would combine the two sorted arrays in O(n1+n2)\mathcal{O}(n_1+n_2) time, and then select the median of the resulting array. However, a O(log(n1))\mathcal{O}(\log(n_1)) approach exists that uses binary search in a clever way.

Tricks

  • If we categorize all elements of both arrays into:

    • A left partition that has the n1+n22\approx\frac{n_1+n_2}{2} smallest elements.

    • A right partition that has the n1+n22\approx\frac{n_1+n_2}{2} biggest elements.

      Then we can deduce the median by looking at the maximum of the left partition and the minimum of the right partition.

  • If we know how many elements from AA are in the left partition, then we can deduce how many elements from BB are in there too.

  • By noting Aleft,Aright,Bleft,BrightA_\textrm{left}, A_\textrm{right}, B_\textrm{left}, B_\textrm{right} the left and right partitions from AA and BB, we know the partitioning is done correctly when the following conditions are satisfied:

    (C1max(Aleft)min(Bright)and(C2max(Bleft)min(Aright)\textrm{(\textbf{C1}) }\boxed{\max(A_\textrm{left})\leqslant\min(B_\textrm{right})}\quad\textrm{and}\quad\textrm{(\textbf{C2}) }\boxed{\max(B_\textrm{left})\leqslant\min(A_\textrm{right})}

The binary search solution is based on the position of the partitioning within the array AA. The solution is found in O(log(n1))\mathcal{O}(\log(n_1)) time:

  • Initialization: We note i[ ⁣[0,n1] ⁣]i\in[\![0, n_1]\!] and j[ ⁣[0,n2] ⁣]j\in[\![0, n_2]\!] the lengths of the left partitions within array AA and BB respectively. We fix the total size of the left partitions to be n1+n2+12\left\lfloor\frac{n_1+n_2+1}{2}\right\rfloor.

    For a given ii, the following formula determines jj:

    j=n1+n2+12i\boxed{j=\left\lfloor\frac{n_1+n_2+1}{2}\right\rfloor-i}
  • Binary search: We perform a binary search on ii between 00 and n1n_1.

    For each candidate, we have the following cases:
    • (C1) is not satisfied: We restrict the search space to the left side of ii.

    • (C2) is not satisfied: We restrict the search space to the right side of ii.

    • Both (C1) and (C2) are satisfied: The partitioning is done correctly.

  • Final result: Once the correct partitioning is found, we deduce the value of the median. We note:

    maxleftmax(Aleft,Bleft)andminrightmin(Aright,Bright)\boxed{\textrm{max}_{\textrm{left}}\triangleq\max(A_\textrm{left}, B_\textrm{left})}\quad\textrm{and}\quad\boxed{\textrm{min}_\textrm{right}\triangleq\min(A_\textrm{right}, B_\textrm{right})}

We have:

CaseResultIllustration
n1+n2n_1+n_2 evenmedian=maxleft+minright2\displaystyle\textrm{median}=\frac{\textrm{max}_\textrm{left}+\textrm{min}_\textrm{right}}{2}
n1+n2n_1+n_2 oddmedian=maxleft\displaystyle\textrm{median}=\textrm{max}_\textrm{left}

String pattern matching

Given a string s=s0...sn1s=s_0...s_{n-1} of length nn and a pattern p=p0...pm1p=p_0...p_{m-1} of length mm, the string matching problem is about finding whether pattern pp appears in string ss.

We note si:i+m1s_{i:i+m-1} the substring of ss of length mm that starts from sis_i and finishes at si+m1s_{i+m-1}.

Brute-force approach

The brute-force approach consists of checking whether pattern pp matches any possible substring of ss of size mm in O(n×m)\mathcal{O}(n\times m) time and O(1)\mathcal{O}(1) space.

Starting from i=0i=0, compare the substring that starts from index ii with pattern pp.

We have the following possible situations:

  • One of the characters is not matching: The pattern does not match the associated substring.

  • All characters are matching: The pattern is found in the string.

Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt (KMP) algorithm is a string pattern matching algorithm that has a complexity of O(n+m)\mathcal{O}(n+m) time and O(m)\mathcal{O}(m) space.

Intuition Avoid starting the pattern search from scratch after a failed match.

We do so by checking whether the pattern matched so far has a prefix that is also a suffix:

  • Step 1: Build the prefix-suffix array.

  • Step 2: Find the pattern in the original string using the prefix-suffix array.

Prefix-suffix array construction We want to build an array Q=[q0,...,qm1]Q=[q_0, ..., q_{m-1}] where each value qiq_i has two interpretations:

  • It is the length of the largest prefix that is also a suffix in the substring p0...pip_0...p_i.

  • It is also the index in the pattern that is right after the prefix.

This part has a complexity of O(m)\mathcal{O}(m) time and O(m)\mathcal{O}(m) space:

  • Initialization:

    • By convention, we set q1q_{-1} to 00.

    • The element q0q_0 is set to 0 since the substring p0p_0 has a single character.

    • The pointers ii and jj find the largest prefix that is also a suffix to deduce the value of qiq_i. They are initially set to i=1i=1 and j=0j=0.

  • Compute step: Determine the length of the largest prefix that is also a suffix for substring p0...pip_0...p_i:

    • Case pipjp_i\neq p_j and j=0j=0: The substring does not have a prefix that is also a suffix, so we set qiq_i to 0:

      qi0\boxed{q_i\longleftarrow 0}

      We increment ii by 1.

    • Case pi=pjp_i=p_j: The last letter pip_i of the substring matches with pjp_j. As a result, the length of the prefix is the length of p0...pjp_0...p_j:

      qij+1\boxed{q_i\longleftarrow j+1}

      We increment both ii and jj by 1 respectively.

    • Case pipjp_i\neq p_j and j>0j>0: The substring may have a smaller prefix that is also a suffix:

      jqj1\boxed{j\longleftarrow q_{j-1}}

      We repeat this process until we fall in one of the two previous cases.

Linear-time string pattern search We want to efficiently search pattern pp in string ss using the newly-constructed prefix-suffix array QQ.

This part has a complexity of O(n)\mathcal{O}(n) time and O(1)\mathcal{O}(1) space:

  • Initialization: The pointers ii and jj check whether there is a match between each character of the string and the pattern respectively. They are initially set to 0.

  • Compute step: We check whether characters sis_{i} and pjp_{j} match.

    • Case si=pjs_{i}=p_{j}: The characters match. We increment ii and jj by 1.

    • Case sipjs_{i}\neq p_{j}: The characters do not match. We determine where we should start our search again by using the information contained in the prefix-suffix array.

      • qj1=0q_{j-1}=0: We need to restart the pattern match from the beginning since there is no prefix/suffix to leverage.

      • qj1>0q_{j-1}>0: We can continue our search after the next largest matched prefix of the pattern.

        We set jj to qj1q_{j-1}.

  • Final step: The algorithm stops when one of the following situations happens:

    • Case j=mj=m: The pattern has been matched.

    • Case i=ni=n and j<mj<m: The whole string has been searched without finding a match.

Rabin-Karp algorithm

The Rabin-Karp algorithm is a string pattern matching algorithm that uses a hashing trick. It has an expected time complexity of O(n+m)\mathcal{O}(n+m), but depending on the quality of the hash function, the complexity may increase up to O(n×m)\mathcal{O}(n\times m). The space complexity is O(1)\mathcal{O}(1).

Intuition Compute the hash value of the pattern along with those of the substrings of the same length. If the hash values are equal, confirm whether the underlying strings indeed match.

The trick is to choose a hash function hh that can deduce h(si+1:i+m)h(s_{i+1:i+m}) from the hash value of the previous substring h(si:i+m1)h(s_{i:i+m-1}) in O(1)\mathcal{O}(1) time via a known function ff:

h(si+1:i+m)=f(h(si:i+m1),h(si),h(si+m))\boxed{h(s_{i+1:i+m})=f(h(s_{i:i+m-1}), h(s_i), h(s_{i+m}))}

Algorithm

  • Initialization: Given a hash function hh, we compute the hash value h(p)h(p) of the pattern pp.

  • Compute step: Starting from index i=0i=0, we use the hashing trick to compute the hash value of substring si:i+m1s_{i:i+m-1} that starts from ii and that has the same length as pp in O(1)\mathcal{O}(1) time.

    • Case h(p)h(si:i+m1)h(p)\neq h(s_{i:i+m-1}): pp and si:i+m1s_{i:i+m-1} definitely do not match.

    • Case h(p)=h(si:i+m1)h(p)=h(s_{i:i+m-1}): There may be a match between pp and si:i+m1s_{i:i+m-1}, so we compare pp and si:i+m1s_{i:i+m-1}.

      • Sub-case p=si:i+m1p=s_{i:i+m-1}: The pattern has been matched.

      • Sub-case psi:i+m1p\neq s_{i:i+m-1}: There is a collision between the hash values of pp and si:i+m1s_{i:i+m-1}. Since there is no match, we continue the search.


Want more content like this?

Subscribe here to be notified of new Super Study Guide releases!