Palindromes II

Palindromes I visualized a naive algorithm that finds the longest palindromic substring with worst-case complexity of $$O(n^2)$$. The below compares the same naive algorithm with a linear-time algorithm, tracing the algorithm path as a given string is scanned from left to right.

The two input string examples are relatively difficult ones, in the sense that repeated characters and overlapping palindromic substrings cause the naive algorithm to approach its worst-case complexity of $$O(n^2)$$.

Each circle, or node, represents a character visited by the algorithm. The visualization shows that the naive algorithm on the left performs significant duplicate work by visiting the same nodes repeatedly, producing a relatively dense graph. In contrast, the linear-time algorithm on the right visits fewer nodes and terminates in fewer iterations, producing a sparse graph. (For illustrative purposes, the naive algorithm is not optimized to terminate as early as possible.)
• Green point = presence of palindrome (single characters included).
• Red point = no palindrome.
• Gray points = space between characters.
The longest palindromic substring is identified by two green points on the x axis with the greatest distance between them. In the first case of input string "xxxxxxxaxxx", there are two solutions of equal length: "xxxxxxx" and "xxxaxxx". In the second case of input string "babcbabcbac", the solution is "abcbabcba".

The specifics of the linear-time algorithm are discussed further below.

Left: quadratic algorithm. Right: linear algorithm.
>> Restart animation.
Left: quadratic algorithm. Right: linear algorithm.
>> Restart animation.

Built with D3.

Linear-Time Algorithm

Wikipedia documents Glenn Manacher's 1975 paper as the origins of a linear time algorithm, followed by developments in the 1990s to extend Manacher's idea to find the longest palindromic substring.

I found the formal papers difficult to parse, but a Google search returns some more accessible results:
The above visualization shows that in comparison to the given naive algorithm, the linear-time algorithm is clearly superior. Moreover, it's not hard to believe that the running time of the latter is indeed linear (perhaps if one were feeling generous). But the graphs don't really convey the logic behind the linear-time algorithm -- in fact, the output can seem rather arbitrary. Rather than rehash the full details and proof of the algorithm, the following section attempts to explain just its crux, which eluded me for some time.

> Link to Github code.

If there are no palindromic substrings, or if there is only one palindromic substring, the naive algorithm described in Palindromes I terminates in $$O(n)$$ time. This follows directly from the fact that constant time is required for each step (and if a palindromic substring is present, an additional maximum of $$n$$). The challenge arises when there are multiple, overlapping palindromic substrings, as in both examples above. The problem is especially clear in the case of input string "xxxxxxxaxxx". How can duplicate work be avoided?

The key to saving effort lies in the symmetrical nature of palindromes: by definition, all palindromic strings are reflections of themselves around the center. We proceed by taking advantage of this fact.

Let's say we have a situation like the graph below, in which:
• The input string "babcbabcbaccba" is being scanned from left to right.
• At the current center of place 11, the longest palindrome is 9 characters long. The three lines labeled $$L$$, $$C$$, and $$R$$ denote the left, center, and right of the palindrome.
• We have maintained an array called $$palLen$$, which tracks the length of palindromes centered at each character seen so far. For example, $$palLen == 9$$ and $$palLen == 7$$.
• We define $$pRight$$ and $$pLeft$$ as the pair of "place" indices that are mirrored around the current center. For example, if $$pRight$$ is 12, its pair $$pLeft$$ is 10.
• We define $$distToEdge$$ as $$R$$ - $$pRight$$. For example, at place 13, $$distToEdge$$ = 20 - 13 = 7.

Mouseover $$palLen$$ from center to left to follow the algorithm's progress (as explained below). The pink line represents the palindrome centered at $$pLeft$$, were it to be fully reflected at $$pRight$$.

>>Reset graph.

Rather than naively moving the center to place 12, we attempt to fill in $$palLen$$ to the right of the current center (the "?" characters). For each position $$pRight$$ and its $$distToEdge$$, we proceed by examining the value of its pair $$palLen[pLeft]$$. Three cases are possible. Recall that, to a large extent, palindromes centered at $$pLeft$$ and $$pRight$$ should mirror each other.
• Case 1: $$palLen[pLeft]$$ < $$distToEdge$$

The palidrome centered at $$pLeft$$ is reflected at $$pRight$$, since the length of the palindrome does not exceed the right boundary $$R$$.

Therefore: $$palLen[pRight]$$ = $$palLen[pLeft]$$. Do not move center.

• Case 2: $$palLen[pLeft]$$ > $$distToEdge$$

The palidrome centered at $$pLeft$$ cannot be entirely reflected at $$pRight$$, since the length of the palindrome would exceed $$R$$. However, we know by definition that the palindrome at $$pLeft$$ is reflected at $$pRight$$ up to a certain point -- precisely, $$R$$. In addition, we know that the palindrome at $$pRight$$ does not exceed $$R$$. If it did, the palindrome centered at $$C$$ would match beyond $$L$$ and $$R$$, and would thus be longer.

Therefore: $$palLen[pRight]$$ = $$distToEdge$$. Do not move center.

• Case 3: $$palLen[pLeft]$$ == $$distToEdge$$

The palidrome centered at $$pLeft$$ is reflected at $$pRight$$, since the length of the palindrome does not exceed the right boundary $$R$$. Unlike case 2, however, it is possible that the palindrome centered at $$pRight$$ could extend further -- we don't know, so we need to expand beyond $$R$$ using $$pRight$$ as the new center.

Therefore: $$palLen[pRight]$$ = $$distToEdge$$. Move center to $$pRight$$ and attempt to expand the palindrome beyond $$R$$.

Given the above, the rest of the algorithm is mostly a matter of correct bookkeeping as the string is processed until $$R$$ reaches the end.