History
Sequence analysis methods were first imported into the social sciences from the information and biological sciences.
Domain-specific theoretical foundation
Sociology
The analysis of sequence patterns has foundations in sociological theories that emerged in the middle of the 20th century. Structural theorists argued that society is a system characterized by regular patterns, and that even seemingly trivial social phenomena are ordered in highly predictable ways. This idea serves as an implicit motivation behind social sequence analysts' use of optimal matching, clustering, and related methods to identify common "classes" of sequences at all levels of social organization, a form of pattern search. This focus on regularized patterns of social action has become an increasingly influential framework for understanding microsocial interaction and contact sequences, or "microsequences."
Demography and historical demography
In demography and historical demography, the rapid adoption of the life course perspective and its methods from the 1980s onward was part of a substantive paradigm shift that embedded demographic processes more firmly in the dynamics of the social sciences. A first phase focused on the occurrence and timing of demographic events, studied separately from each other with a hypothetico-deductive approach. From the early 2000s, the need to consider the structure of life courses and to do justice to their complexity led to a growing use of sequence analysis in pursuit of a holistic approach. At the inter-individual level, pairwise dissimilarities and clustering appeared as the appropriate tools for revealing the heterogeneity in human development. For example, the meta-narratives contrasting individualized Western societies with collectivist societies in the South (especially in Asia) were challenged by comparative studies revealing the diversity of pathways to legitimate reproduction. At the intra-individual level, sequence analysis integrates the basic life course principle that individuals interpret and make decisions about their lives according to their past experiences and their perception of contingencies. Interest in this perspective was also fostered by the changes in individuals' life courses for cohorts born between the beginning and the end of the 20th century. These changes have been described as de-standardization, de-synchronization, and de-institutionalization. Among the drivers of these dynamics, the transition to adulthood is key: for more recent birth cohorts, this crucial phase of the life course involved a larger number of events and changes in the lengths of the state spells experienced.
For example, many postponed leaving the parental home and the transition to parenthood; in some contexts cohabitation replaced marriage as a long-lasting living arrangement, and the birth of the first child occurs more frequently while parents cohabit rather than within wedlock. Such complexity needed to be measured in order to compare quantitative indicators across birth cohorts; this line of questioning has also been extended to populations from low- and middle-income countries. Demography's long-standing ambition to develop a 'family demography' has found in sequence analysis a powerful tool for addressing research questions at the crossroads with other disciplines: for example, multichannel techniques offer valuable opportunities to study the compatibility between working and family lives. Similarly, more recent combinations of sequence analysis and event history analysis have been developed and can be applied, for instance, to understand the link between demographic transitions and health.
Political sciences
The analysis of temporal processes in the domain of political sciences concerns how institutions, that is, systems and organizations (regimes, governments, parties, courts, etc.), crystallize political interactions, formalize legal constraints and impose a degree of stability or inertia. Special importance is given, first, to the role of contexts, which confer meaning on trends and events, while shared contexts offer shared meanings; second, to changes over time in power relationships and, subsequently, asymmetries, hierarchies, contention, or conflict; and, finally, to historical events that are able to shape trajectories, such as elections, accidents, inaugural speeches, treaties, revolutions, or ceasefires. Empirically, the unit of analysis of political sequences can be individuals, organizations, movements, or institutional processes. Depending on the unit of analysis, the sample may be limited to a few cases (e.g., regions in a country when considering the turnover of local political parties over time) or include a few hundred (e.g., individuals' voting patterns). Three broad kinds of political sequences may be distinguished. The first and most common is ''careers'', that is, formal, mostly hierarchical positions along which individuals progress in institutional environments, such as parliaments, cabinets, administrations, parties, unions or business organizations. We may call ''trajectories'' those political sequences that develop in more informal and fluid contexts, such as activists evolving across various causes and social movements (Fillieule, O. and Blanchard, P. (2013). Fighting Together. Assessing Continuity and Change in Social Movement Organizations Through the Study of Constituencies' Heterogeneity. In A Political Sociology of Transnational Europe, chapter 4. ECPR Press, Colchester), or voters navigating a political and ideological landscape across successive polls.
Finally, ''processes'' relate to non-individual entities, such as public policies developing through successive policy stages across distinct arenas; sequences of symbolic or concrete interactions between national and international actors in diplomatic and military contexts; and the development of organizations or institutions, such as pathways of countries towards democracy (Wilson 2014).
Concepts
A sequence ''s'' is an ordered list of elements (''s''<sub>1</sub>, ''s''<sub>2</sub>, ..., ''s''<sub>''l''</sub>) taken from a finite alphabet ''A''. For a set ''S'' of sequences, three sizes matter: the number ''n'' of sequences, the size ''a'' = |''A''| of the alphabet, and the length ''l'' of the sequences (which may differ from one sequence to another). In the social sciences, ''n'' is generally between a few hundred and a few thousand, the alphabet size remains limited (most often less than 20), and sequence length rarely exceeds 100. We may distinguish between ''state sequences'' and ''event sequences'': states last, whereas events occur at a single time point, do not last, and contribute, possibly together with other events, to state changes. For instance, the joint occurrence of the two events 'leaving home' and 'starting a union' provokes a state change from 'living at home with parents' to 'living with a partner'. When a state sequence is represented as the list of states observed at successive time points, the position of each element in the sequence conveys this time information and the distance between positions reflects duration. An alternative, more compact representation of a sequence is the list of successive spells stamped with their duration, where a ''spell'' (also called ''episode'') is a substring of consecutive positions in the same state. For example, in ''aabbbc'', ''bbb'' is a spell of length 3 in state ''b'', and the whole sequence can be represented as (''a'',2)-(''b'',3)-(''c'',1).
Methods
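As a bridge from the concepts above to the methods that follow, the two representations of a state sequence (one state per position versus a list of spells with durations) can be converted into each other mechanically. The following is a minimal Python sketch under the assumption that states are single characters; the function names `to_spells` and `from_spells` are hypothetical and not taken from any sequence analysis package.

```python
from itertools import groupby


def to_spells(states):
    """Collapse a position-by-position state sequence into (state, duration) spells."""
    # groupby yields one group per run of identical consecutive states
    return [(state, sum(1 for _ in run)) for state, run in groupby(states)]


def from_spells(spells):
    """Expand (state, duration) spells back into one state per position."""
    return [state for state, duration in spells for _ in range(duration)]


sequence = list("aabbbc")          # hypothetical toy sequence
spells = to_spells(sequence)
print("-".join(f"({state},{duration})" for state, duration in spells))
# prints (a,2)-(b,3)-(c,1)
```

The spell form is shorter whenever states persist over several positions, while the positional form keeps sequences aligned on a common time axis.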
Conventional sequence analysis (SA) consists essentially in building a typology of the observed trajectories. Abbott and Tsay (2000) describe this typical SA as a three-step program: (1) coding individual narratives as sequences of states; (2) measuring pairwise dissimilarities between sequences; and (3) clustering the sequences from the pairwise dissimilarities. However, SA is much more than this and also encompasses, among others, the description and visual rendering of sets of sequences, ANOVA-like analysis and regression trees for sequences, the identification of representative sequences, the study of the relationship between linked sequences (e.g. dyadic, linked-lives, or various life dimensions such as occupation, family, health), and sequence networks.
Describing and rendering state sequences
Given an alignment rule, a set of sequences can be represented in tabular form, with sequences in rows and columns corresponding to the positions in the sequences.
Sequences of cross-sectional distributions
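Reading the tabular form column by column gives, for each position, the cross-sectional distribution of states across the set of sequences. The following is a minimal Python sketch under the assumption that all sequences are already aligned and of equal length; the function name `state_distributions` and the toy data are hypothetical.

```python
from collections import Counter


def state_distributions(sequences, alphabet):
    """Return, for each position, the share of sequences observed in each state."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("this sketch assumes aligned sequences of equal length")
    distributions = []
    for position in range(length):
        # Count the states found in one column of the sequence table
        counts = Counter(seq[position] for seq in sequences)
        distributions.append(
            {state: counts[state] / len(sequences) for state in alphabet}
        )
    return distributions


toy = ["aabbbc", "aaabbc", "abbbcc", "aabbcc"]   # hypothetical toy data
dists = state_distributions(toy, alphabet="abc")
for position, dist in enumerate(dists, start=1):
    print(position, dist)
```

Stacking these per-position shares along the time axis is the information a chronogram displays.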
Characteristics of individual sequences
Alternatively, we can look at the rows. The ''index plot'', where each sequence is represented as a horizontal stacked bar or line, is the basic plot for rendering individual sequences.
Other overall descriptive measures
* Mean time in the different states (overall state distribution) and their standard errors
* Transition probabilities between states
Visualization
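The two overall measures just listed, mean time in each state and transition probabilities between states, also underlie several of the plots discussed in this section. The following is a minimal Python sketch; the function names and the toy data are hypothetical, standard errors are left out, and transition probabilities are estimated simply as relative frequencies over all adjacent pairs of positions, which is an assumption of this sketch rather than the only possible estimator.

```python
from collections import Counter, defaultdict


def mean_time_in_states(sequences, alphabet):
    """Average number of positions spent in each state per sequence."""
    totals = Counter(state for seq in sequences for state in seq)
    return {state: totals[state] / len(sequences) for state in alphabet}


def transition_rates(sequences, alphabet):
    """Estimate P(next state | current state) from adjacent pairs of positions."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    rates = {}
    for origin in alphabet:
        total = sum(counts[origin].values())
        # A state never observed as an origin gets a row of zeros
        rates[origin] = {
            dest: (counts[origin][dest] / total if total else 0.0)
            for dest in alphabet
        }
    return rates


toy = ["aabbbc", "aaabbc", "abbbcc", "aabbcc"]   # hypothetical toy data
mean_time = mean_time_in_states(toy, "abc")
rates = transition_rates(toy, "abc")
print(mean_time)
print(rates)
```

Note that the diagonal of the resulting matrix captures persistence in a state, so rates computed this way depend on the time granularity chosen when coding the sequences.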
State sequences lend themselves to graphical rendering, and such plots prove useful for interpretation. As noted above, the two basic plots are the index plot, which renders individual sequences, and the chronogram, which renders the evolution of the cross-sectional state distribution along the timeframe. Chronograms (also known as status proportion plots or state distribution plots) completely overlook the diversity of the sequences, while index plots are often too scattered to be readable. Relative frequency plots and plots of representative sequences attempt to increase the readability of index plots without falling into the oversimplification of a chronogram. In addition, there are many plots that focus on specific characteristics of the sequences. Below is a list of plots that have been proposed in the literature for rendering large sets of sequences. For each plot, we give examples of software (details in section Software) that produce it.
* Index plot: renders the set of individual sequences (SADI, SQ, TraMineR)
* Chronogram (status proportion plot, state distribution plot): renders the sequence of cross-sectional distributions (SADI, SQ, TraMineR)
* Plot of multichannel sequences grouped by channels (seqHMM) or by individuals
* Plot of time series of cross-sectional indicators (entropy, modal state, ...) (SQ, TraMineR)
* Frequency plot (SQ, TraMineR)
* Relative frequency plot (TraMineRextras)
* Representative sequences (TraMineR)
* Mean time in the different states and their standard errors (TraMineR)
* State survival plot (TraMineRextras)
* Transition patterns (SADI)
* Transition plot (SQ; Gmisc) and plot of transition probabilities (seqHMM)
* Parallel coordinate plot (TraMineR, SQ)
* Probabilistic suffix trees (PST)
* Sequence networks (see social network analysis)
Pairwise dissimilarities
Pairwise dissimilarities between sequences serve to compare sequences, and many advanced SA methods are based on them. The most popular dissimilarity measure is ''optimal matching'' (OM), i.e. the minimal cost of transforming one sequence into the other by means of indel (insert or delete) and substitution operations, where the costs of these elementary operations may depend on the states involved. SA is so intimately linked with OM that it is sometimes called optimal matching analysis (OMA). There are roughly three categories of dissimilarity measures:
* Optimal matching and other edit distances
** Examples: OM, OMloc (localized OM), OMslen (spell-length sensitive OM), OMspell (OM of spell sequences), OMstran (OM of sequences of transitions), TWED (time-warp edit distance), HAM (Hamming and generalized Hamming), DHD (Dynamic Hamming).
** Strategies for setting the substitution and indel costs
*** Constant costs (all substitution costs identical and a single indel cost)
*** Theory-based costs
*** Feature-based costs
*** Data-driven costs: based on transition probabilities or state frequencies
* Measures based on the count of common attributes
** Examples: LCS (derived from the length of the longest common subsequence), LCP (from the length of the longest common prefix), NMS (number of matching subsequences), and NMSMST and SVRspell, two variants of NMS.
* Distances between within-sequence state distributions
** Examples: CHI2 and EUCLID, defined as the average of, respectively, the Chi-squared and Euclidean distances between state distributions in successive sliding windows.
Dissimilarity-based analysis
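All dissimilarity-based techniques start from a matrix of pairwise distances. As an illustration, the following minimal Python sketch computes optimal matching with constant costs by dynamic programming, builds the pairwise matrix, and picks a medoid (the sequence with the smallest total distance to all others). The function names, the default costs (indel 1, substitution 2) and the toy data are assumptions made for this example; state-dependent substitution costs are deliberately left out.

```python
def om_distance(x, y, indel=1.0, sub=2.0):
    """Minimal cost of turning x into y with constant indel and substitution costs."""
    # previous[j] holds the cost of aligning the first i-1 elements of x
    # with the first j elements of y
    previous = [j * indel for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        current = [i * indel]
        for j in range(1, len(y) + 1):
            match_cost = 0.0 if x[i - 1] == y[j - 1] else sub
            current.append(min(
                previous[j] + indel,           # delete x[i-1]
                current[j - 1] + indel,        # insert y[j-1]
                previous[j - 1] + match_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def distance_matrix(sequences, **costs):
    """Symmetric matrix of pairwise OM distances."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = om_distance(sequences[i], sequences[j], **costs)
    return matrix


def medoid_index(matrix):
    """Index of the sequence with the smallest sum of distances to the others."""
    return min(range(len(matrix)), key=lambda i: sum(matrix[i]))


toy = ["aabbbc", "aaabbc", "abbbcc", "aabbcc"]   # hypothetical toy data
matrix = distance_matrix(toy)
print(matrix)
print("medoid:", toy[medoid_index(matrix)])
```

With a substitution cost equal to twice the indel cost, a substitution is never cheaper than a deletion followed by an insertion, so the distance reduces to one based on the longest common subsequence; lowering the substitution cost gives more weight to the timing of states. The resulting matrix could then be passed to any clustering or multidimensional scaling routine that accepts precomputed dissimilarities.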
Pairwise dissimilarities between sequences give access to a series of techniques for discovering holistic structuring characteristics of the sequence data. In particular, dissimilarities between sequences can serve as input to cluster algorithms and multidimensional scaling, but they also make it possible to identify medoids or other representative sequences, define neighborhoods, measure the discrepancy of a set of sequences, carry out ANOVA-like analyses, and grow regression trees.
Other methods of analysis
Although dissimilarity-based methods play a central role in social SA, essentially because of their ability to preserve the holistic perspective, several other approaches also prove useful for analyzing sequence data.
* Non dissimilarity-based clustering
** Latent class analysis (LCA)
** Markov model mixture and hidden Markov model mixture
** Mixtures of exponential-distance models
* Sequence networks
** Representing a single sequence as a network
** Meta network of sequences
** Sequence network measures
** Life history graph
* Probabilistic approaches
** Markovian and other transition distribution models. See also Markov model.
** Probabilistic Suffix Tree (PST), also known as variable-order Markov model or variable-length Markov model
* Event sequences
** Event structure models
** Rendering of event sequences (parallel coordinate plots, ...)
** Frequent subsequences
** Discriminant subsequences
** Dissimilarity-based analysis of event sequences
Advances: the third wave of sequence analysis
Some recent advances can be conceived as the ''third wave of SA''. This wave is largely characterized by the effort to bring together the stochastic and the algorithmic modeling cultures by jointly applying SA with more established methods such as analysis of variance, event history analysis, network analysis, or causal analysis and statistical modeling in general. Some examples are given below; see also "Other methods of analysis".
* Effect of past trajectories on the hazard of an event: Sequence History Analysis (SHA)
* Effect of time-varying covariates on trajectories: Competing Trajectories Analysis (CTA) and Sequence Analysis Multistate Model (SAMM)
* Validation of cluster typologies
* Discrepancy analysis to bring time back to qualitative comparative analysis (QCA)
Open issues and limitations
Although SA has seen a steady inflow of methodological contributions that address the issues raised two decades ago, some pressing open issues remain. Among the most challenging, we can mention:
* Sequences of different lengths, truncated sequences, and missing values
* Validation of cluster results
* Sequence length versus importance of recency: for example, when analyzing 40-year-long biographic sequences from age 1 to 40, one can only consider individuals born at least 40 years earlier, and therefore the behavior of younger birth cohorts is disregarded.
Up-to-date information on advances, methodological discussions, and recent relevant publications can be obtained from the Sequence Analysis Association.
Fields of application
These techniques have proved valuable in a variety of contexts. In life-course research, for example, studies have shown that retirement plans are affected not just by the last year or two of one's working life, but also by how one's work and family careers unfolded over a period of several decades. People who followed an "orderly" career path (characterized by consistent employment and gradual ladder-climbing within a single organization) retired earlier than others, including people who had intermittent careers, those who entered the labor force late, and those who enjoyed regular employment but made numerous lateral moves across organizations throughout their careers. In the field of economic sociology, research has shown that firm performance depends not just on a firm's current or recent social network connectedness, but also on the durability or stability of its connections to other firms. Firms that have more "durably cohesive" ownership network structures attract more foreign investment than firms with less stable or poorly connected structures. Research has also used data on everyday work activity sequences to identify classes of work schedules, finding that the timing of work during the day significantly affects workers' abilities to maintain connections with the broader community, such as through community events. More recently, social sequence analysis has been proposed as a meaningful approach to studying trajectories in the domain of creative enterprise, allowing comparison among the idiosyncrasies of unique creative careers. While other methods for constructing and analyzing whole sequence structure have been developed during the past three decades, including event structure analysis, OM and other sequence comparison methods form the backbone of research on whole sequence structures.
Some examples of application include:
* Sociology
** Labor market entry sequences
** De-standardization of the life course
** Residential trajectories
** Time use
** Actual and idealized relationship scripts
** Basic types of figures in ritual dances
** Pathways of alcohol consumption
* Demography and historical demography
** Transition to adulthood
** Partnership biographies
** Family formation life course
** Childbirth histories
* Political sciences
** Pathways towards democratization
** Pathways of legislative processes
** Bargaining between actors during national crises
* Psychology
** Sequences of adolescents' social interactions
* Medical research
** Care trajectory in chronic disease
* Survey methodology
** Response in survey collection
* Geography
** Mobility studies
** Land use
Software
Two main statistical computing environments offer tools to conduct sequence analysis in the form of user-written packages: Stata and R.
Institutional development
The first international conference dedicated to social-scientific research that uses sequence analysis methods – the Lausanne Conference on Sequence Analysis, o
See also
References
External links