Detect differences between two strings

I have 2 strings

string a = "foo bar";
string b = "bar foo";


and I want to detect the changes from a to b. What characters do I have to change, to get from a to b?

I think there must be a iteration over each character and detect if it was added, removed or remained equal. So this is my exprected result

'f' Remove
'o' Remove
'o' Remove
' ' Remove
'b' Equal
'a' Equal
'r' Equal
' ' Add
'f' Add
'o' Add
'o' Add


class and enum for the result:

public enum Operation { Add,Equal,Remove };
public class Difference
{
public Operation op { get; set; }
public char c { get; set; }
}


Here is my solution but the "Remove" case is not clear to me how the code has to look like

public static List<Difference> CalculateDifferences(string left, string right)
{
int count = 0;
List<Difference> result = new List<Difference>();
foreach (char ch in left)
{
int index = right.IndexOf(ch, count);
if (index == count)
{
count++;
result.Add(new Difference() { c = ch, op = Operation.Equal });
}
else if (index > count)
{
string add = right.Substring(count, index - count);
result.AddRange(add.Select(x => new Difference() { c = x, op = Operation.Add }));
count += add.Length;
}
else
{
//Remove?
}
}
return result;
}


How does the code have to look like for removed characters?



Update - added a few more examples

example 1:

string a = "foobar";
string b = "fooar";


expected result:

'f' Equal
'o' Equal
'o' Equal
'b' Remove
'a' Equal
'r' Equal


example 2:

string a = "asdfghjk";
string b = "wsedrftr";


expected result:

'a' Remove
'w' Add
's' Equal
'e' Add
'd' Equal
'r' Add
'f' Equal
'g' Remove
'h' Remove
'j' Remove
'k' Remove
't' Add
'r' Add




Update:

Here is a comparison between Dmitry's and ingen's answer: https://dotnetfiddle.net/MJQDAO

You are looking for (minimum) edit distance / (minimum) edit sequence. You can find the theory of the process here:

https://web.stanford.edu/class/cs124/lec/med.pdf

Let's implement (simplest) Levenstein Distance / Sequence algorithm (for details see https://en.wikipedia.org/wiki/Levenshtein_distance). Let's start from helper classes (I've changed a bit your implementation of them):

public enum EditOperationKind : byte {
None, // Nothing to do
Add, // Add new character
Edit, // Edit character into character (including char into itself)
Remove, // Delete existing character
};

public struct EditOperation {
public EditOperation(char valueFrom, char valueTo, EditOperationKind operation) {
ValueFrom = valueFrom;
ValueTo = valueTo;

Operation = valueFrom == valueTo ? EditOperationKind.None : operation;
}

public char ValueFrom { get; }
public char ValueTo {get ;}
public EditOperationKind Operation { get; }

public override string ToString() {
switch (Operation) {
case EditOperationKind.None:
return $"'{ValueTo}' Equal";
case EditOperationKind.Add:
return $"'{ValueTo}' Add";
case EditOperationKind.Remove:
return $"'{ValueFrom}' Remove";
case EditOperationKind.Edit:
return $"'{ValueFrom}' to '{ValueTo}' Edit";
default:
return "???";
}
}
}


As far as I can see from the examples provided we don't have any edit operation, but add + remove; that's why I've put editCost = 2 when insertCost = 1, int removeCost = 1 (in case of tie: insert + remove vs. edit we put insert + remove).
Now we are ready to implement Levenstein algorithm:

public static EditOperation[] EditSequence(
string source, string target,
int insertCost = 1, int removeCost = 1, int editCost = 2) {

if (null == source)
throw new ArgumentNullException("source");
else if (null == target)
throw new ArgumentNullException("target");

// Forward: building score matrix

// Best operation (among insert, update, delete) to perform
EditOperationKind[][] M = Enumerable
.Range(0, source.Length + 1)
.Select(line => new EditOperationKind[target.Length + 1])
.ToArray();

// Minimum cost so far
int[][] D = Enumerable
.Range(0, source.Length + 1)
.Select(line => new int[target.Length + 1])
.ToArray();

// Edge: all removes
for (int i = 1; i <= source.Length; ++i) {
M[i][0] = EditOperationKind.Remove;
D[i][0] = removeCost * i;
}

// Edge: all inserts
for (int i = 1; i <= target.Length; ++i) {
M[0][i] = EditOperationKind.Add;
D[0][i] = insertCost * i;
}

// Having fit N - 1, K - 1 characters let's fit N, K
for (int i = 1; i <= source.Length; ++i)
for (int j = 1; j <= target.Length; ++j) {
// here we choose the operation with the least cost
int insert = D[i][j - 1] + insertCost;
int delete = D[i - 1][j] + removeCost;
int edit = D[i - 1][j - 1] + (source[i - 1] == target[j - 1] ? 0 : editCost);

int min = Math.Min(Math.Min(insert, delete), edit);

if (min == insert)
M[i][j] = EditOperationKind.Add;
else if (min == delete)
M[i][j] = EditOperationKind.Remove;
else if (min == edit)
M[i][j] = EditOperationKind.Edit;

D[i][j] = min;
}

// Backward: knowing scores (D) and actions (M) let's building edit sequence
List<EditOperation> result =
new List<EditOperation>(source.Length + target.Length);

for (int x = target.Length, y = source.Length; (x > 0) || (y > 0);) {
EditOperationKind op = M[y][x];

if (op == EditOperationKind.Add) {
x -= 1;
result.Add(new EditOperation('\0', target[x], op));
}
else if (op == EditOperationKind.Remove) {
y -= 1;
result.Add(new EditOperation(source[y], '\0', op));
}
else if (op == EditOperationKind.Edit) {
x -= 1;
y -= 1;
result.Add(new EditOperation(source[y], target[x], op));
}
else // Start of the matching (EditOperationKind.None)
break;
}

result.Reverse();

return result.ToArray();
}


Demo:

var sequence = EditSequence("asdfghjk", "wsedrftr");

Console.Write(string.Join(Environment.NewLine, sequence));


Outcome:

'a' Remove
'w' Add
's' Equal
'e' Add
'd' Equal
'r' Add
'f' Equal
'g' Remove
'h' Remove
'j' Remove
'k' Remove
't' Add
'r' Add


I'll go out on a limb here and provide an algorithm that's not the most efficient, but is easy to reason about.

Let's cover some ground first:

1) Order matters

string before = "bar foo"
string after = "foo bar"


Even though "bar" and "foo" occur in both strings, "bar" will need to be removed and added again later. This also tells us it's the after string that gives us the order of chars we're interested in, we want "foo" first.

2) Order over count

Another way to look at it, is that some chars may never get their turn.

string before = "abracadabra"
string after = "bar bar"


Only the bold chars of "bar bar", get their say in "abracadabra". Even though we've got two b's in both strings, only the first one counts. By the time we get to the second b in "bar bar" the second b in "abracadabra" has already been passed, when we were looking for the first occurrence of 'r'.

3) Barriers

Barriers are the chars that exist in both strings, taking order and count into consideration. This already suggests a set might not be the most appropriate data structure, as we would lose count.

For an input

string before = "pinata"
string after = "accidental"


We get (pseudocode)

var barriers = { 'a', 't', 'a' }


"pinata"

"accidental"

Let's follow the execution flow:


'a' is the first barrier, it's also the first char of after so everything prepending the first 'a' in before can be removed. "pinata" -> "ata"
the second barrier is 't', it's not at the next position in our after string, so we can insert everything in between. "ata" -> "accidenta"
the third barrier 'a' is already at the next position, so we can move to the next barrier without doing any real work.
there are no more barriers, but our string length is still less than that of after, so there will be some post processing. "accidenta" -> "accidental"


Note 'i' and 'n' don't get to play, again, order over count.



Implementation

We've established that order and count matter, a Queue comes to mind.

static public List<Difference> CalculateDifferences(string before, string after)
{
List<Difference> result = new List<Difference>();
Queue<char> barriers = new Queue<char>();

#region Preprocessing
int index = 0;
for (int i = 0; i < after.Length; i++)
{
// Look for the first match starting at index
int match = before.IndexOf(after[i], index);
if (match != -1)
{
barriers.Enqueue(after[i]);
index = match + 1;
}
}
#endregion

#region Queue Processing
index = 0;
while (barriers.Any())
{
char barrier = barriers.Dequeue();
// Get the offset to the barrier in both strings,
// ignoring the part that's already been handled
int offsetBefore = before.IndexOf(barrier, index) - index;
int offsetAfter = after.IndexOf(barrier, index) - index;
// Remove prefix from 'before' string
if (offsetBefore > 0)
{
RemoveChars(before.Substring(index, offsetBefore), result);
before = before.Substring(offsetBefore);
}
// Insert prefix from 'after' string
if (offsetAfter > 0)
{
string substring = after.Substring(index, offsetAfter);
AddChars(substring, result);
before = before.Insert(index, substring);
index += substring.Length;
}
// Jump over the barrier
KeepChar(barrier, result);
index++;
}
#endregion

#region Post Queue processing
if (index < before.Length)
{
RemoveChars(before.Substring(index), result);
}
if (index < after.Length)
{
AddChars(after.Substring(index), result);
}
#endregion

return result;
}

static private void KeepChar(char barrier, List<Difference> result)
{
result.Add(new Difference()
{
c = barrier,
op = Operation.Equal
});
}

static private void AddChars(string substring, List<Difference> result)
{
result.AddRange(substring.Select(x => new Difference()
{
c = x,
op = Operation.Add
}));
}

static private void RemoveChars(string substring, List<Difference> result)
{
result.AddRange(substring.Select(x => new Difference()
{
c = x,
op = Operation.Remove
}));
}


I tested with 3 examples above, and it returns the expected result properly and perfectly.

int flag = 0;
int flag_2 = 0;

string a = "asdfghjk";
string b = "wsedrftr";

char[] array_a = a.ToCharArray();
char[] array_b = b.ToCharArray();

for (int i = 0,j = 0, n= 0; i < array_b.Count(); i++)
{
//Execute 1 time until reach first equal character
if(i == 0 && a.Contains(array_b[0]))
{
while (array_a[n] != array_b[0])
{
Console.WriteLine(String.Concat(array_a[n], " : Remove"));
n++;
}
Console.WriteLine(String.Concat(array_a[n], " : Equal"));
n++;
}
else if(i == 0 && !a.Contains(array_b[0]))
{
Console.WriteLine(String.Concat(array_a[n], " : Remove"));
n++;
Console.WriteLine(String.Concat(array_b[0], " : Add"));
}


else
{
if(n < array_a.Count())
{
if (array_a[n] == array_b[i])
{
Console.WriteLine(String.Concat(array_a[n], " : Equal"));
n++;
}
else
{
flag = 0;
for (int z = n; z < array_a.Count(); z++)
{
if (array_a[z] == array_b[i])
{
flag = 1;
break;
}
}

if (flag == 0)
{
flag_2 = 0;
for (int aa = i; aa < array_b.Count(); aa++)
{
for(int bb = n; bb < array_a.Count(); bb++)
{
if (array_b[aa] == array_a[bb])
{
flag_2 = 1;
break;
}
}
}

if(flag_2 == 1)
{
Console.WriteLine(String.Concat(array_b[i], " : Add"));
}
else
{
for (int z = n; z < array_a.Count(); z++)
{
Console.WriteLine(String.Concat(array_a[z], " : Remove"));
n++;
}
Console.WriteLine(String.Concat(array_b[i], " : Add"));
}

}
else
{
Console.WriteLine(String.Concat(array_a[n], " : Remove"));
i--;
n++;
}

}
}
else
{
Console.WriteLine(String.Concat(array_b[i], " : Add"));
}

}

}//end for


MessageBox.Show("Done");


//OUTPUT CONSOLE:
/*
a : Remove
w : Add
s : Equal
e : Add
d : Equal
r : Add
f : Equal
g : Remove
h : Remove
j : Remove
k : Remove
t : Add
r : Add
*/


Here might be another solution, full code and commented.
However the result of your first original example is inverted :

class Program
{
enum CharState
{
Add,
Equal,
Remove
}

struct CharResult
{
public char c;
public CharState state;
}

static void Main(string[] args)
{
string a = "asdfghjk";
string b = "wsedrftr";
while (true)
{
Console.WriteLine("Enter string a (enter to quit) :");
a = Console.ReadLine();
if (a == string.Empty)
break;
Console.WriteLine("Enter string b :");
b = Console.ReadLine();

List<CharResult> result = calculate(a, b);
DisplayResults(result);
}
Console.WriteLine("Press a key to exit");
Console.ReadLine();
}

static List<CharResult> calculate(string a, string b)
{
List<CharResult> res = new List<CharResult>();
int i = 0, j = 0;

char[] array_a = a.ToCharArray();
char[] array_b = b.ToCharArray();

while (i < array_a.Length && j < array_b.Length)
{
//For the current char in a, we check for the equal in b
int index = b.IndexOf(array_a[i], j);
if (index < 0) //not found, this char should be removed
{
res.Add(new CharResult() { c = array_a[i], state = CharState.Remove });
i++;
}
else
{
//we add all the chars between B's current index and the index
while (j < index)
{
res.Add(new CharResult() { c = array_b[j], state = CharState.Add });
j++;
}
//then we say the current is the same
res.Add(new CharResult() { c = array_a[i], state = CharState.Equal });
i++;
j++;
}
}

while (i < array_a.Length)
{
//b is now empty, we remove the remains
res.Add(new CharResult() { c = array_a[i], state = CharState.Remove });
i++;
}
while (j < array_b.Length)
{
//a has been treated, we add the remains
res.Add(new CharResult() { c = array_b[j], state = CharState.Add });
j++;
}

return res;
}

static void DisplayResults(List<CharResult> results)
{
foreach (CharResult r in results)
{
Console.WriteLine($"'{r.c}' - {r.state}");
}
}
}


If you want to have a precise comparison between two strings, you must read and understand Levenshtein Distance. by using this algorithm you can precisely calculate rate of similarity between two string and also you can backtrack the algorithm to get the chain of changing on the second string. this algorithm is a important metric for Natural Language Processing also.

there are some other benefits and it's need time to learn.

in this link there is a C# version of Levenshtein Distance :

https://www.dotnetperls.com/levenshtein

Comments

Popular posts from this blog

Meaning of `{}` for return expression

Get current scroll position of ScrollView in React Native

React Native - Image Cache