Optimal matching

Translations of this material:

into Russian: Оптимальное соответсвие. 19% translated in draft.
Submitted for translation by asja_m 15.09.2015

Text

Optimal matching

From Wikipedia, the free encyclopedia

Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.

Contents [hide]

1 Algorithm

2 Criticism

3 Optimal matching in causal modelling

4 Software

5 References and notes

Algorithm[edit]

Let S = (s_1, s_2, s_3, \ldots s_T) be a sequence of states s_i belonging to a finite set of possible states. Let us denote {\mathbf S} the sequence space, i.e. the set of all possible sequences of states.

Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators a_i: {\mathbf S} \rightarrow {\mathbf S}. In the most simple approach, a set composed of only three basic operations to transform sequences is used:

one state s is inserted in the sequence a^{\rm Ins}_{s'} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_2, s_3, \ldots, s', \ldots s_T)

one state is deleted from the sequence a^{\rm Del}_{s_2} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_3, \ldots s_T) and

a state s_1 is replaced (substituted) by state s'_1, a^{\rm Sub}_{s_1,s'_1} (s_1, s_2, s_3, \ldots s_T) = (s'_1, s_2, s_3, \ldots s_T).

Imagine now that a cost c(a_i) \in {\mathbf R}^+_0 is associated to each operator. Given two sequences S_1 and S_2, the idea is to measure the cost of obtaining S_2 from S_1 using operators from the algebra. Let A={a_1, a_2, \ldots a_n} be a sequence of operators such that the application of all the operators of this sequence A to the first sequence S_1 gives the second sequence S_2: S_2 = a_1 \circ a_2 \circ \ldots \circ a_{n} (S_1) where a_1 \circ a_2 denotes the compound operator. To this set we associate the cost c(A) = \sum_{i=1}^n c(a_i), that represents the total cost of the transformation. One should consider at this point that there might exist different such sequences A that transform S_1 into S_2; a reasonable choice is to select the cheapest of such sequences. We thus call distance

d(S_1,S_2)= \min_A \left \{ c(A)~{\rm such~that}~S_2 = A (S_1) \right \}

that is, the cost of the least expensive set of transformations that turn S_1 into S_2. Notice that d(S_1,S_2) is by definition nonnegative since it is the sum of positive costs, and trivially d(S_1,S_2)=0 if and only if S_1=S_2, that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal c(a^{\rm Ins}) = c(a^{\rm Del}); the term indel cost usually refers to the common cost of insertion and deletion.

Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.

Criticism[edit]

Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2]), the main problem in the application of optimal matching is to appropriately define the costs c(a_i).

Optimal matching in causal modelling[edit]

Optimal matching is also a term used in statistical modelling of causal effects. In this context it refers to matching "cases" with "controls", and is completely separate from the sequence-analytic sense.

Software[edit]

TDA is a powerful program, offering access to some of the latest developments in transition data analysis.

STATA has implemented a package to run optimal matching analysis.

TraMineR is an open source R-package for analysing and visualizing states and events sequences, including optimal matching analysis.

References and notes[edit]