CSCI 311 Homework 7

CSCI 311 -- Data Structures

Homework 6

25 Points Total

Posted 08 November 2019
Due 15 November 2019

The Longest Common Subsequence (LCS) Problem

Finding the length of the LCS

Using top-down (memoized) dynamic programming, write an efficient Java program to solve the longest common subsequence length (LCS length) problem defined below. Also write a main() function to read inputs, call your LCS function, and print the results. (NOTE that the book solves this with a bottom-up program.)

A subsequence of a given sequence is the given sequence with some elements (possibly none) left out. Formally, given a sequence X = (x₁, x₂, ..., x_m), a sequence Z = (z₁, z₂, ..., z_k), is a subsequence of X if there is a strictly increasing sequence (i₁, i₂, ..., i_k) of indices of X such that x_{i_j} = z_j. For example, a sequence Z = (B, C, D, B) is a subsequence of X = (A, B, C, B, D, A, B), with corresponding index sequence (2, 3, 5, 7).

Given two sequences X and Y, a sequence Z is a common subsequence of X and Y if Z is a subsequence of both X and Y. A longest common subsequence (LCS) of X and Y is a subsequence of X and Y of the maximum length. The longest-common-subsequence-length problem is defined as follows.

Problem 1 (Longest Common Subseqence Length) [20 points]
Input: two sequences X = (x₁, x₂, ..., x_m) and Y = (y₁, y₂, ..., y_n)
Output: the length of a longest common subsequence of X and Y

For example, let X = (A, B, C, B, D, A, B) and Y = (B, D, C, A, B, A). The sequence (B, C, A) is a common subsequence of X and Y. The sequence (B, C, B, A) is an LCS of X and Y. The sequence (B, D, A, B) is another LCS. The length of an LCS of X and Y is 4.

Sample input.


  ABCBDAB
  BDCABA

Prompt and read the input strings from standard input. I will test your programs with inputs with no spaces between the characters.

Sample output.


  The length of a longest common subsequence of 

  ABCBDAB

  and

  BDCABA

  is 4.

Note. You may assume that input sequences are always character sequences. As the above example shows, there are many instances where two given sequences have more than one LCS. However, by the definition, the length of an LCS is clearly unique.

Target running time. For given sequences X of length m and Y of length n, your program must run in O(mn) time.

Hints. First, we define a notation for a prefix of a sequence. For an arbitrary sequence Z = (z₁, z₂, ..., z_k) and an index i <= k, let Z_i denote the sequence (z₁, ..., z_i) consisting of the first i elements of Z.

If the last elements of X and Y match (that is, x_m = y_n), then an LCS of X and Y is simply an LCS of X_m-1 and Y_n-1 with the last matching element attached at the end. Otherwise, the last elements of X and Y will not both be in the solution together, and an LCS of X and Y is the longer of the following two sequences: an LCS of X and Y_n-1 and an LCS of X_m-1 and Y.

A solution for the method described above involves establishing a recursive procedure. We will use dynamic programming with a two-dimensional array c, where each entry c[i][j] will store the length of an LCS of the subsequences X_i and Y_j. First, we note the base case: If either i = 0 or j = 0, one of the sequences has length 0, and the LCS of X_i and Y_j thus has length 0. Otherwise, we have the following recursive formulas.

c[i][j] = c[i-1][j-1] + 1 if x_i = y_j.
c[i][j] = max(c[i][j-1], c[i-1][j]) if x_i != y_j.

Based on the above formulas, use dynamic programming to compute the solutions top down.

Be careful! The LCS algorithm we've looked at numbers the characters from 1 to the length of the sequence. This is one off from Java string indexing! Think through how you will make your indexing scheme consistent for both your known-values array and your strings. Be sure to think about the extreme cases of the complete strings and one or both arguments being the empty string.

Constructing the LCS

Modify the program you wrote for the longest-commmon-subsequence-length problem to solve the following problem. Your solution must be clearly documented so that someone reading your code can understand how it works without too much difficulty.

Problem 2 (Longest Common Subseqence) [5 points]
Input: two sequences X = (x₁, x₂, ..., x_m) and Y = (y₁, y₂, ..., y_n)
Output: a longest common subsequence of X and Y

HINT: For this problem, it suffices to find one LCS, even in the case when more than one such sequence exist (such as the example given above). Build the table as you did for part 1. Look at entry c[m][n]. Given the value and the values in nearby entries along with the associated input values, can you figure out what values went into computing c[m][n]? Can you use that information to determine when to add a character to the LCS? Note that if you work backwards, you will produce the LCS in reverse order; what do you do to get it printed into forward order?

Target timing. Your program must run within the same timing bound as before.

What to hand in.

Email me your source code for the LCS problem.