% Information-Based Similarity 1.0.0
\documentclass[12pt]{article}
\usepackage{html}              %for use with latex2html
\usepackage{wrapfig}
\usepackage[dvips]{graphicx, color} % also needed for scaling, rotation, etc. in latex2html
\begin{document}

\centerline{\LARGE Information-Based Similarity Index}
\bigskip\bigskip

\centerline{Albert C.-C. Yang, Ary L. Goldberger, C.-K. Peng}

\begin{htmlonly}
This document is also available in \htmladdnormallink{PDF}{ibsi.pdf}
format.  The most recent version of the software described within is
freely available
\htmladdnormallink{here}{../src/}.
\end{htmlonly}
\begin{latexonly}
\vspace{2cm}
This document can also be read on-line at
http://physionet.org/physio\-tools/ibs/doc/.
The most recent version of the software described within is freely
available from http://physionet.org/physiotools/ibs/src/.
\end{latexonly}

\section*{Background}

Physiologic systems generate complex fluctuations in their output
signals that reflect the underlying dynamics. Finding and analyzing
the hidden dynamical structures of these signals is therefore of both
basic and clinical interest. One approach to discovering such hidden
information is to analyze the repetitive appearance of certain basic
patterns that are embedded in the complete signals. We developed a
novel information-based similarity index to detect and quantify such
temporal structures in the human heart rate time series using tools
from statistical linguistics [1].

\section*{Methods}

Human cardiac dynamics are driven by the complex nonlinear
interactions of two competing forces: sympathetic stimulation
increases and parasympathetic stimulation decreases heart rate. For
this type of intrinsically noisy system, it may be useful to simplify
the dynamics via mapping the output to binary sequences, where the
increase and decrease of the inter-beat intervals are denoted by 1 and
0, respectively. The resulting binary sequence retains important
features of the dynamics generated by the underlying control system,
but is tractable enough to be analyzed as a symbolic sequence.

Consider an inter-beat interval time series, $\{ x_1, x_2, \cdots
,x_N\}$, where $x_i$ is the $i$-th inter-beat interval. We classify each
pair of successive inter-beat intervals into one of two states,
representing either a decrease or an increase in $x$. These two states
are mapped to the symbols 0 and 1, respectively:
%
\begin{equation}
\label{E1}
I_n = \left\{ 
      \begin{array}{ll} 
        0, & {\rm if} \; x_{n}\le x_{n-1} \\
        1, & {\rm if} \; x_{n} > x_{n-1}.
      \end{array}
      \right.
\end{equation}
%
\begin{figure}[h]
\centerline{\includegraphics[width=4in]{fig1a.ps}}
\caption{\label{fig1} Schematic illustration of the mapping
procedure for 8-bit words from a heartbeat time series. }
\end{figure}

We map $m + 1$ successive intervals to a binary sequence of length $m$,
called an $m$-bit ``word.'' Each $m$-bit word, $w_k$, therefore, represents a
unique pattern of fluctuations in a given time series. By shifting one
data point at a time, the algorithm produces a collection of $m$-bit
words over the whole time series. Therefore, it is plausible that the
occurrence of these $m$-bit words reflects the underlying dynamics of
the original time series. Different types of dynamics thus produce
different distributions of these $m$-bit words.

In studies of natural languages, it has been observed that
authors have characteristic preferences for the words they use with higher
frequency. To apply this concept to symbolic sequences mapped from the
inter-beat interval time series, we count the occurrences of different
words, and then sort them in descending order by frequency of occurrence.

\begin{figure}[h]
\centerline{\includegraphics[width=4in]{fig1c.ps}}
\caption{ The linear regime (for rank $\le 50$).}
\end{figure}

The resulting rank-frequency distribution, therefore, represents the
statistical hierarchy of symbolic words of the original time
series. For example, the first rank word corresponds to one type of
fluctuation which is the most frequent pattern in the time series. In
contrast, the last rank word defines the most unlikely pattern in the
time series.

To define a measurement of similarity between two signals, we plot the
rank number of each $m$-bit word in the first time series against that
of the second time series.

\begin{figure}[h]
\center{\includegraphics*[width=3in]{fig2.ps}}
\caption{\label{fig2} Rank order comparison of two cardiac inter-beat
interval time series from the same subject. For each word, its rank in the
first time series is plotted against its rank in the second time
series. The dashed diagonal line indicates the case where the
rank-order of words for both time series is identical.}
\end{figure}

If two time series are similar in their rank order of the words, the
scattered points will be located near the diagonal line. Therefore,
the average deviation of these scattered points away from the diagonal
line is a measure of the ``distance'' between these two time
series. Greater distance indicates less similarity and vice versa. In
addition, we incorporate the likelihood of each word in the following
definition of a weighted distance, $D_m$, between two symbolic
sequences, $S_1$ and $S_2$.
%
\begin{equation}
D_m(S_1, S_2) ={1 \over{2^m-1}}
{\sum\limits_{k=1}^{2^m} |R_1(w_k)-R_2(w_k)|\,F(w_k)}
\label{E.dist}
\end{equation}
where
\begin{equation}
F(w_k)={1 \over Z} [-p_1(w_k)\,\log p_1(w_k) - \,p_2(w_k)\,\log p_2(w_k)].
\label{E2}
\end{equation}
%

Here $p_1(w_k)$ and $R_1(w_k)$ represent probability and rank of a
specific word, $w_k$, in time series $S_1$. Similarly, $p_2(w_k)$ and
$R_2(w_k)$ stand for probability and rank of the same $m$-bit word in
time series $S_2$. The absolute rank difference of each word is
weighted by $F(w_k)$, the normalized Shannon entropy of that word in
the two series. Finally, the sum is divided by $2^m-1$ so that the
distance lies in the range $[0, 1]$. The
normalization factor $Z$ in Eq.~\ref{E2} is given by 
$Z=\sum_k [-p_1(w_k)\,\log p_1(w_k) - \,p_2(w_k)\,\log p_2(w_k)]$.


\section*{Example}

Here we provide a concise example to demonstrate how to calculate an
information-based similarity index between two time series. The
following figure shows a sample heartbeat interval time series from a
healthy subject (panel a), exhibiting complex variability. In
contrast, a time series from a subject with congestive heart failure
(CHF, panel b) shows markedly less variability. Both sample time
series contain 1000 inter-beat
intervals.  (See \htmladdnormallink{RR Intervals, Heart Rate, and HRV
Howto}{/tutorials/hrv} for information on how to obtain additional
inter-beat interval time series in this format.)

\begin{figure}[h]
\centerline{\includegraphics*[width=5in]{fig4.ps}}
\caption{\label{fig4} Representative inter-beat interval time series for (a) a
healthy subject, and (b) a subject with congestive heart failure
(CHF).}
\end{figure}

We first map each signal to a binary sequence according to the sign of
the increments between consecutive inter-beat intervals. Setting $m =
8$, there are $2^8 = 256$ possible 8-bit words. We
count the occurrences of each 8-bit word, and then sort them by
descending frequency. The resulting rank-frequency distribution
represents the statistical hierarchy of repetitive patterns of a given
time series. For example, the top-ranked 8-bit words correspond to the
most frequently occurring patterns in a given heartbeat time
series. In contrast, the last ranked word defines the rarest patterns.

 
\begin{table}[htb]
\begin{tabular}{crrllll}

8-bit words $w_k$ &
$R_1(w_k)$ &
$R_2(w_k)$ &
$p_1(w_k)$ &
$p_2(w_k)$ &
$H_1(w_k)$ &
$H_2(w_k)$ \\ \hline

00000000 &
 25 &
 117 &
 0.014423 &
 0.001603 &
 0.061138 &
 0.010314 \\ \hline
 
00000001 &
 8 &
 132 &
 0.002404 &
 0.004006 &
 0.014497 &
 0.022115 \\ \hline
 
00000010 &
 93 &
 80 &
 0.023237 &
 0 &
 0.087418 &
 0 \\ \hline
 
00000011 &
 3 &
 179 &
 0.004808 &
 0.007212 &
 0.025661 &
 0.035568 \\ \hline
 
00000100 &
 44 &
 35 &
 0.002404 &
 0.003205 &
 0.014497 &
 0.018407 \\ \hline
 
00000101 &
 91 &
 92 &
 0.004808 &
 0.000801 &
 0.025661 &
 0.005713 \\ \hline
 
00000110 &
 47 &
 140 &
 0.029647 &
 0 &
 0.104311 &
 0 \\ \hline
 
00000111 &
 1 &
 176 &
 0.003205 &
 0.009615 &
 0.018407 &
 0.044658 \\ \hline
 
00001000 &
 66 &
 23 &
 0.003205 &
 0.004006 &
 0.018407 &
 0.022115 \\ \hline
 
00001001 &
 64 &
 81 &
 0 &
 0.00641 &
 0 &
 0.032371  \\ \hline
 
$\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ 
& $\cdots$ \\ \hline
 
\end{tabular}

\caption{Excerpt of the word statistics for the two sample time
series, where $H(w_k)= -p(w_k)\, \log\,p(w_k)$ is the Shannon entropy
of word $w_k$.}
\end{table}

The rank order difference between two time series can be visualized by
plotting the rank number of each 8-bit word in the first time series
against that of the second time series. The dashed diagonal line
indicates the case where the rank order of words for both time series
is identical.

As demonstrated by the above rank order comparison map, the ``distance''
(or dissimilarity) between any two time series can be quantified by
measuring the scatter of these points from the diagonal line in the
rank order comparison plot. Applying Eq.~\ref{E.dist} to the
rank-frequency lists obtained from the sample time series yields an
information-based similarity index of 0.412725.  Using the
example data files provided with the {\tt ibs} software, this result
may be reproduced by running the command

\begin{verbatim}
    ibs 8 healthy.txt chf.txt
\end{verbatim}


\section*{Applications}

\subsection*{Phylogenetic Tree of Human Heart Beats}

We applied this distance measurement to RR interval time series, each
at least 2 hours in length, from 40 ostensibly healthy subjects,
divided into young (10 females and 10 males, average age 25.9 years)
and elderly (10 females and 10 males, average age 74.5 years)
subgroups; from a group of 43 subjects with severe congestive heart
failure (CHF) (15 females and 28 males, average age 55.5 years); and
from a group of 9 subjects with atrial fibrillation (AF). We measured
the average distance
between subjects across different groups. We defined the inter-group
distance of groups A and B as the average distance between all pairs
of subjects where one subject is from group A and the other subject is
from group B. We calculated the inter-group distances among all groups
of our time series as well as a group of 100 artificial time series of
uncorrelated noise (white noise group).

Constructing a phylogenetic tree is a useful way to present our
results, since the algorithm arranges the different groups on a
branching tree so as to best fit the pairwise distance measurements.
Here we show the resulting rooted tree for the case $m = 8$.

\begin{figure}[h]
\centerline{\includegraphics[width=4in]{fig3.ps}}
\caption{\label{tree} A rooted phylogenetic tree generated according
to the distances between different groups. White noise indicates
simulated uncorrelated random time series. }
\end{figure}

We note that the structure of the tree is consistent with the
underlying physiology: the further down the branch the more complex
the dynamics are. The groups are arranged in the following order (from
bottom to top as shown in the above figure): 1) Time series from the
healthy young group represent dynamical fluctuations of a highly
complex integrative control system. 2) The healthy elderly group
represents deviation from the ``optimal'' youthful state, possibly due
to decoupling (or drop-out) of components in the integrative control
system. 3) Severe damage to the control system is represented by the
CHF group. These individuals have profound abnormalities in cardiac
function associated with pathologic alterations in both the
sympathetic and parasympathetic control mechanisms that regulate
beat-to-beat variability. 4) The AF group is an example of a
pathologic state in which there appears to be very limited external
input on the heartbeat control system. 5) The artificial white noise
group represents the extreme case in that only noise and no signal is
present. This example demonstrates that the physiologic complexity of
human heart beat dynamics can be robustly described by our information
categorization method.

\subsection*{Written texts and genetic sequences}

The generic concept underlying the information categorization method
makes it applicable to a wide range of problems. Recently, as a
further proof of principle application, we applied this approach to
address a long-standing authorship debate related to Shakespeare's
plays [2]. This work was featured in the Boston Globe (Aug. 5, 2003)
and was the basis for the award-winning entry in the Calvin \& Rose
G. Hoffman Marlowe Memorial Trust 2003 Prize. In addition to being a
new approach to forensic text analysis, this method has potential
applications in genetic sequence analysis [3].


\section*{Information-Based Similarity Software}

The software may be obtained from
\begin{htmlonly}
\htmladdnormallink{here}{../src},
\end{htmlonly}
\begin{latexonly}
{\tt http://www.\-physionet.\-org/\-physio\-tools/\-ibs/},
\end{latexonly}
where you will find \htmladdnormallink{{\tt ibs.c}}{../src/ibs.c}
(the source for the software), a
\htmladdnormallink{{\tt Makefile}}{../src/Makefile},
two data files (\htmladdnormallink{{\tt healthy.txt}}{../src/healthy.txt}
and \htmladdnormallink{{\tt chf.txt}}{../src/chf.txt}) and a file named
\htmladdnormallink{{\tt ibs.expected}}{../src/ibs.expected}. Download all of
these files.

If you have a {\tt make} utility, you can use it to compile and test
the software, simply by typing ``{\tt make check}'' (look in {\tt  Makefile}
to see what this command does).  Otherwise, compile {\tt ibs.c} and
link it with the C standard math library (needed for
the {\tt abs} and {\tt log} functions only).  For example, if you
use the GNU C compiler (recommended), you can do this by:

\begin{verbatim}
     gcc -o ibs -O ibs.c -lm
\end{verbatim}

\noindent
Test the program by running the command:

\begin{verbatim}
     ibs 8 healthy.txt chf.txt
\end{verbatim}

\noindent
If the current directory is not in your PATH, you may need to type
the location of {\tt ibs}, as in

\begin{verbatim}
     ./ibs 8 healthy.txt chf.txt
\end{verbatim}

\noindent
The output should match the contents of {\tt ibs.expected}.
For brief instructions about how to run the program, type its name at
a command prompt:

\begin{verbatim}
    ibs
\end{verbatim}

\noindent
which should produce a message similar to:

\begin{verbatim}
    usage: ibs M SERIES1 SERIES2
      where M is the word length (an integer greater than 1), and
      SERIES1 and SERIES2 are one-column text files containing the
      data of the two series that are to be compared.  The output
      is the information-based similarity index of the input series
      evaluated for M-tuples (words of length M).
 
      For additional information, see
             http://physionet.org/physiotools/ibs/.
\end{verbatim}

\noindent
This program reads two text files of numbers, which are interpreted as values
of two time series.  Within each series, pairs of consecutive values are
compared to derive a binary series, whose values are either 1 (if the
second value of the pair was greater than the first) or 0 (otherwise).  A
user-specified parameter, $m$, determines the length of ``words'' ($m$-tuples)
to be analyzed by this program.

Within each binary series, all $m$-tuples of consecutive values are
treated as ``words''; the program counts the occurrences of each of the $2^m$
possible ``words'' and then derives the word rank order frequency (WROF) list for
the series.  Finally, it calculates the information-based similarity index of
the two WROF lists, and outputs this number.  Depending on the input series and
on the choice of $m$, the value of the index ranges from 0 (identical rank
ordering, by Eq.~\ref{E.dist}) to 1 (maximally dissimilar).

\section*{References}

1. Yang AC, Hseu SS, Yien HW, Goldberger AL, Peng CK: Linguistic
analysis of the human heartbeat using frequency and rank order
statistics. Phys Rev Lett 2003, 90: 
\htmladdnormallink{108103}{http://link.aps.org/abstract/PRL/v90/e108103}.

2. Yang AC, Peng C-K, Yien H-W, Goldberger AL: Information
categorization approach to literary authorship disputes. Physica A
2003, 329: 473--483.

3. Yang AC, Goldberger AL, Peng CK: Genomic classification using a new
information-based similarity index: application to the SARS
coronavirus.  J Comput Biol 2005, 12(8): 1103--1116.

\end{document}