GenMatch {Matching} | R Documentation |

This function finds optimal balance using multivariate matching where
a genetic search algorithm determines the weight each covariate is
given. Balance is determined by examining cumulative probability
distribution functions of a variety of standardized statistics. By
default, these statistics include t-tests and Kolmogorov-Smirnov
tests. A variety of descriptive statistics based on empirical-QQ
(eQQ) plots can also be used or any user provided measure of balance.
The statistics are not used to conduct formal hypothesis tests,
because no measure of balance is a monotonic function of bias and
because balance should be maximized without limit. The object
returned by `GenMatch` can be supplied to the `Match` function (via
the `Weight.matrix` option) to obtain causal estimates. `GenMatch`
uses `genoud` to perform the genetic search. Using the `cluster`
option, one may use multiple computers, CPUs or cores to perform
parallel computations.

    GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1, weights=NULL,
             pop.size=100, max.generations=100,
             wait.generations=4, hard.generation.limit=FALSE,
             starting.values=rep(1,ncol(X)),
             fit.func="pvals",
             MemoryMatrix=TRUE,
             exact=NULL, caliper=NULL, replace=TRUE, ties=TRUE,
             CommonSupport=FALSE, nboots=0, ks=TRUE, verbose=FALSE,
             distance.tolerance=1e-05,
             tolerance=sqrt(.Machine$double.eps),
             min.weight=0, max.weight=1000,
             Domains=NULL, print.level=2,
             project.path=NULL,
             paired=TRUE, loss=1,
             data.type.integer=FALSE,
             restrict=NULL,
             cluster=FALSE, balance=TRUE, ...)

`Tr` |
A vector indicating the observations which are in the treatment regime and those which are not. This can either be a logical vector or a real vector where 0 denotes control and 1 denotes treatment. |

`X` |
A matrix containing the variables we wish to match on. This matrix may contain the actual observed covariates or the propensity score or a combination of both. |

`BalanceMatrix` |
A matrix containing the variables we wish
to achieve balance on. This is by default equal to `X` , but it can
in principle be a matrix which contains more or fewer variables than
`X` or variables which are transformed in various ways. See
the examples. |

`estimand` |
A character string for the estimand. The default estimand is "ATT", the sample average treatment effect for the treated. "ATE" is the sample average treatment effect, and "ATC" is the sample average treatment effect for the controls. |

`M` |
A scalar for the number of matches which should be
found. The default is one-to-one matching. Also see the `ties`
option. |

`weights` |
A vector of the same length as `Tr` which
provides observation-specific weights. |

`pop.size` |
Population Size. This is the number of individuals
`genoud` uses to solve the optimization problem.
The theorems proving that genetic algorithms find good solutions are
asymptotic in population size. Therefore, it is important that this value not
be small. See `genoud` for more details. |

`max.generations` |
Maximum Generations. This is the maximum
number of generations that `genoud` will run when
optimizing. This is a soft limit. The maximum generation
limit will be binding only if `hard.generation.limit` has been
set equal to TRUE. Otherwise, `wait.generations` controls
when optimization stops. See `genoud` for more
details. |

`wait.generations` |
If there is no improvement in the objective
function in this number of generations, optimization will stop. The
other options controlling termination are `max.generations` and
`hard.generation.limit` . |

`hard.generation.limit` |
This logical variable determines if the
`max.generations` variable is a binding constraint. If
`hard.generation.limit` is FALSE, then
the algorithm may exceed the `max.generations`
count if the objective function has improved within a given number of
generations (determined by `wait.generations` ). |
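
The interaction between `max.generations`, `wait.generations` and `hard.generation.limit` can be summarized in a few lines of R. This is only a sketch of the rule as described above; `should_stop` is a hypothetical helper, not a function in Matching or rgenoud, and it ignores any other convergence checks `genoud` may apply.

```r
# Hypothetical helper (not part of Matching or rgenoud): a simplified
# reading of the termination rule described in the text above.
should_stop <- function(generation, since_improvement,
                        max.generations = 100, wait.generations = 4,
                        hard.generation.limit = FALSE) {
  # No improvement for wait.generations generations: stop.
  if (since_improvement >= wait.generations) return(TRUE)
  # The generation cap binds only when the hard limit is requested.
  if (hard.generation.limit && generation >= max.generations) return(TRUE)
  FALSE
}

# Past the soft limit, but the fit improved one generation ago.
soft <- should_stop(generation = 120, since_improvement = 1)
# Same situation with a binding limit.
hard <- should_stop(generation = 120, since_improvement = 1,
                    hard.generation.limit = TRUE)
# Early in the run, but stalled for wait.generations generations.
stalled <- should_stop(generation = 30, since_improvement = 4)
```

Under these assumptions only the first call returns `FALSE`, which is why a run with the default soft limit can continue past `max.generations`.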

`starting.values` |
This vector's length is equal to the number of variables in `X` . This
vector contains the starting weights each of the variables is
given. The `starting.values` vector is a way for the user
to insert one individual into the starting population.
`genoud` will randomly create the other individuals. These values
correspond to the diagonal of the `Weight.matrix` as described
in detail in the `Match` function. |

`fit.func` |
The balance metric `GenMatch` should optimize.
The user may choose from the following or provide a function:

- `pvals`: maximize the p-values from (paired) t-tests and Kolmogorov-Smirnov tests conducted for each column in `BalanceMatrix`. Lexical optimization is conducted; see the `loss` option for details.
- `qqmean.mean`: calculate the mean standardized difference in the eQQ plot for each variable. Minimize the mean of these differences across variables.
- `qqmean.max`: calculate the mean standardized difference in the eQQ plot for each variable. Minimize the maximum of these differences across variables. Lexical optimization is conducted.
- `qqmedian.mean`: calculate the median standardized difference in the eQQ plot for each variable. Minimize the median of these differences across variables.
- `qqmedian.max`: calculate the median standardized difference in the eQQ plot for each variable. Minimize the maximum of these differences across variables. Lexical optimization is conducted.
- `qqmax.mean`: calculate the maximum standardized difference in the eQQ plot for each variable. Minimize the mean of these differences across variables.
- `qqmax.max`: calculate the maximum standardized difference in the eQQ plot for each variable. Minimize the maximum of these differences across variables. Lexical optimization is conducted.

Users may provide their own `fit.func`. The name of the user-provided
function should not be backquoted or quoted. This function needs
to return a fit value that will be minimized, by lexical
optimization if more than one fit value is returned. The function
should expect two arguments: first, the `matches` object
returned by `GenMatch` (see below); second, a matrix which contains
the variables to be balanced, i.e., the `BalanceMatrix` the user
provided to `GenMatch`. For an example see
http://sekhon.berkeley.edu/matching/R/my_fitfunc.R. |
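
As an illustration of the two-argument interface described for a user-supplied `fit.func`, the following sketch computes a standardized mean difference for each column of `BalanceMatrix` and returns the values with the largest imbalance first. The name `my_fitfunc`, the choice of statistic and the ordering are assumptions made for this example; this is not the example hosted at the URL above.

```r
# Hypothetical user-supplied fit function (an illustration only).
# `matches`: three-column matrix (treated row, control row, pair weight).
# `BalanceMatrix`: the matrix of variables to be balanced.
my_fitfunc <- function(matches, BalanceMatrix) {
  treated <- matches[, 1]
  control <- matches[, 2]
  w       <- matches[, 3]
  # Matched-pair differences, one row per pair, one column per variable.
  diffs <- BalanceMatrix[treated, , drop = FALSE] -
           BalanceMatrix[control, , drop = FALSE]
  # Weighted mean difference for each variable.
  mean_diff <- colSums(diffs * w) / sum(w)
  # Scale by each variable's standard deviation in the full sample.
  sds <- apply(BalanceMatrix, 2, sd)
  # Values to be minimized, worst imbalance first.
  sort(abs(mean_diff) / sds, decreasing = TRUE)
}

# Toy inputs standing in for what GenMatch would pass to the function.
B <- cbind(a = c(1, 2, 3, 4), b = c(10, 10, 20, 20))
m <- cbind(c(1, 2), c(3, 4), c(1, 1))
fit <- my_fitfunc(m, B)

# With the Matching package loaded, the function would be supplied unquoted:
# GenMatch(Tr = treat, X = X, BalanceMatrix = BalanceMat, fit.func = my_fitfunc)
```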

`MemoryMatrix` |
This variable controls if `genoud` sets up a memory matrix. Such a
matrix ensures that `genoud` will request the fitness evaluation
of a given set of parameters only once. The variable may be
TRUE or FALSE. If it is FALSE, `genoud`
will be aggressive in
conserving memory. The most significant negative implication of
this variable being set to FALSE is that `genoud` will no
longer maintain a memory
matrix of all evaluated individuals. Therefore, `genoud` may request
evaluations which it has previously requested. When
the number of variables in `X` is large, the memory matrix
consumes a large amount of RAM. `genoud`'s memory matrix will require significantly less
memory if the user sets `hard.generation.limit` equal
to TRUE. Doing this is a good way of conserving
memory while still making use of the memory matrix structure. |

`exact` |
A logical scalar or vector for whether exact matching
should be done. If a logical scalar is
provided, that logical value is applied to all covariates in
`X` . If a logical vector is provided, a logical value should
be provided for each covariate in `X` . Using a logical vector
allows the user to specify exact matching for some but not other
variables. When exact matches are not found, observations are
dropped. `distance.tolerance` determines what is considered to
be an exact match. The `exact` option takes precedence over the
`caliper` option. Obviously, if `exact` matching is done
using all of the covariates, one should not be using
`GenMatch` unless the `distance.tolerance` has been set
unusually high. |

`caliper` |
A scalar or vector denoting the caliper(s) which
should be used when matching. A caliper is the distance which is
acceptable for any match. Observations which are outside of the
caliper are dropped. If a scalar caliper is provided, this caliper is
used for all covariates in `X` . If a vector of calipers is
provided, a caliper value should be provided for each covariate in
`X` . The caliper is interpreted to be in standardized units. For
example, `caliper=.25` means that all matches not equal to or
within .25 standard deviations of each covariate in `X` are
dropped. The `ecaliper` object which is returned by
`GenMatch` shows the enforced caliper on the scale of the
`X` variables. Note that dropping observations generally changes
the quantity being estimated. |
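
To make "standardized units" concrete, the sketch below converts a caliper of .25 into the scale of each covariate by multiplying by that covariate's standard deviation. The toy data, the names `approx_ecaliper` and `within_caliper`, and the use of the plain sample standard deviation are assumptions; the `ecaliper` object returned by `GenMatch` remains the authoritative value.

```r
# Toy covariates; the column names are illustrative only.
set.seed(1)
X <- cbind(age = rnorm(50, mean = 30, sd = 8),
           educ = rnorm(50, mean = 12, sd = 2))

caliper <- 0.25

# Assumed conversion: caliper times each column's sample standard deviation.
approx_ecaliper <- caliper * apply(X, 2, sd)

# Would observations i and j be an acceptable pair on every covariate?
within_caliper <- function(i, j) {
  all(abs(X[i, ] - X[j, ]) <= approx_ecaliper)
}

within_caliper(1, 2)
```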

`replace` |
A logical flag for whether matching should be done with
replacement. Note that if `FALSE` , the order of matches
generally matters. Matches will be found in the same order as the
data are sorted. Thus, the match(es) for the first observation will
be found first, the match(es) for the second observation will be found second, etc.
Matching without replacement will generally increase bias.
Ties are randomly broken when `replace==FALSE` —see the
`ties` option for details. |

`ties` |
A logical flag for whether ties should be handled deterministically. By
default `ties==TRUE` . If, for example, one treated observation
matches more than one control observation, the matched dataset will
include the multiple matched control observations and the matched data
will be weighted to reflect the multiple matches. The sum of the
weighted observations will still equal the original number of
observations. If `ties==FALSE` , ties will be randomly broken.
If the dataset is large and there are many ties, setting
`ties=FALSE` often results in a large speedup. Whether two
potential matches are close enough to be considered tied is
controlled by the `distance.tolerance`
option. |

`CommonSupport` |
This logical flag implements the usual procedure
by which observations outside of the common support of a variable
(usually the propensity score) across treatment and control groups are
discarded. The `caliper` option is to
be preferred to this option because `CommonSupport` , consistent
with the literature, only drops outliers and leaves
inliers while the caliper option drops both.
If `CommonSupport==TRUE` , common support will be enforced on
the first variable in the `X` matrix. Note that dropping
observations generally changes the quantity being estimated. Use of
this option renders it impossible to use the returned
object `matches` to reconstruct the matched dataset.
Seriously, don't use this option; use the `caliper` option instead. |

`nboots` |
The number of bootstrap samples to be run for the
`ks` test. By default this option is set to zero so no
bootstraps are done. See `ks.boot` for additional
details. |

`ks` |
A logical flag for whether the univariate bootstrap
Kolmogorov-Smirnov (KS) test should be calculated. If the `ks` option
is set to `TRUE`, the univariate KS test is calculated for all
non-dichotomous variables. The bootstrap KS test is consistent even
for non-continuous variables. By default, the bootstrap KS test is
not used. To change this see the `nboots` option. If a given
variable is dichotomous, a t-test is used even if the KS test is requested. See
`ks.boot` for additional details. |

`verbose` |
A logical flag for whether details of each
fitness evaluation should be printed. Verbose is set to FALSE if
the `cluster` option is used. |

`distance.tolerance` |
This is a scalar which is used to determine
if distances between two observations are different from zero. Values
less than `distance.tolerance` are deemed to be equal to zero.
This option can be used to perform a type of optimal matching. |

`tolerance` |
This is a scalar which is used to determine numerical tolerances. This option is used by numerical routines such as those used to determine if a matrix is singular. |

`min.weight` |
This is the minimum weight any variable may be given. |

`max.weight` |
This is the maximum weight any variable may be given. |

`Domains` |
This is an `ncol(X)` × 2 matrix.
The first column is the lower bound, and the second column is the
upper bound for each variable over which `genoud` will
search for weights. If the user does not provide this matrix, the
bounds for each variable will be determined by the `min.weight`
and `max.weight` options. |
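
A `Domains` matrix of the shape described above can be built with `cbind`. This sketch starts from bounds equal to the default `min.weight` and `max.weight` and then narrows the search range for the first covariate; the toy `X` and the particular bounds are assumptions chosen for illustration.

```r
# Toy covariate matrix with four columns (illustrative only).
set.seed(2)
X <- matrix(rnorm(40), ncol = 4)

# One row per column of X: lower bound first, upper bound second.
Domains <- cbind(rep(0, ncol(X)), rep(1000, ncol(X)))

# Narrow the search range for the weight on the first covariate.
Domains[1, ] <- c(1, 10)

# With the Matching package loaded, the matrix would be passed as:
# GenMatch(Tr = Tr, X = X, Domains = Domains)
```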

`print.level` |
This option controls the level of printing. There
are four possible levels: 0 (minimal printing), 1 (normal), 2
(detailed), and 3 (debug). If level 2 is selected, `GenMatch` will
print details about the population at each generation, including the
best individual found so far. If debug
level printing is requested, details of the `genoud`
population are printed in the "genoud.pro" file which is located in
the temporary `R` directory returned by the `tempdir`
function. See the `project.path` option for more details.
Because `GenMatch` runs may take a long time, it is important for the
user to receive feedback. Hence, print level 2 has been set as the
default. |

`project.path` |
This is the path of the
`genoud` project file. By default no file is
produced unless `print.level=3` . In that case,
`genoud` places its output in a file called
"genoud.pro" located in the temporary directory provided by
`tempdir` . If a file path is provided to the
`project.path` option, a file will be created regardless of the
`print.level` . The behavior of the project file, however, will
depend on the `print.level` chosen. If the `print.level`
variable is set to 1, then the project file is rewritten after each
generation. Therefore, only the currently fully completed generation
is included in the file. If the `print.level` variable is set to
2 or higher, then each new generation is simply appended to the
project file. No project file is generated for
`print.level=0` . |

`paired` |
A flag for whether the paired `t.test` should be
used when determining balance. |

`loss` |
The loss function to be optimized. The default value, `1` ,
implies "lexical" optimization: all of the balance statistics will
be sorted from the most discrepant to the least and weights will be
picked which minimize the maximum discrepancy. If multiple sets of
weights result in the same maximum discrepancy, then the second
largest discrepancy is examined to choose the best weights. The
process continues iteratively until ties are broken. If the value of `2` is used, then only the maximum discrepancy
is examined. This was the default behavior prior to version 1.0. The
user may also pass in any function she desires. Note that the
option 1 corresponds to the `sort` function and option 2
to the `min` function. Any user specified function
should expect a vector of balance statistics ("p-values") and it
should return either a vector of values (in which case "lexical"
optimization will be done) or a scalar value (which will be
maximized). Some possible alternative functions are
`mean` or `median` . |
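
The difference between `loss=1` and `loss=2` can be seen by comparing two candidate weight sets by hand. The sketch below is a simplified illustration of lexical comparison on sorted p-values; `lexically_better` is a hypothetical helper, not part of the package, and it assumes both vectors have the same length.

```r
# Hypothetical helper: TRUE if p-value vector `a` beats `b` lexically.
lexically_better <- function(a, b) {
  a <- sort(a)  # smallest p-value (largest discrepancy) first
  b <- sort(b)
  differs <- which(a != b)
  if (length(differs) == 0) return(FALSE)  # identical after sorting
  # The first position where the sorted vectors differ decides.
  a[differs[1]] > b[differs[1]]
}

# Balance p-values produced by two hypothetical sets of weights.
p1 <- c(0.40, 0.10, 0.90)
p2 <- c(0.10, 0.35, 0.95)

# loss = 2 looks only at the minimum, so the two sets tie.
min(p1) == min(p2)

# loss = 1 breaks the tie on the second-smallest p-value.
lexically_better(p1, p2)
```

Here the minima are equal, so the second-smallest values (0.40 versus 0.35) decide in favor of `p1`.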

`data.type.integer` |
By default, floating-point weights are considered. If this option is
set to `TRUE` , search will be done over integer weights. Note
that before version 4.1, the default was to use integer weights. |

`restrict` |
A matrix which restricts the possible matches. This
matrix has one row for each restriction and three
columns. The first two columns contain the two observation numbers
which are to be restricted (for example 4 and 20), and the third
column is the restriction imposed on the observation-pair.
Negative numbers in the third column imply that the two observations
cannot be matched under any circumstances, and non-negative numbers are
passed on as the distance between the two observations for the
matching algorithm. The most commonly used non-negative restriction is
`0`, which implies that the two observations will always
be matched. Exclusion restrictions are even more common. For example, if we want to exclude the observation pair 4 and 20 and the pair 6 and 55 from being matched, the restrict matrix would be: `restrict=rbind(c(4,20,-1),c(6,55,-1))` |
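
When there are many restrictions, it can be easier to build the matrix from lists of pairs. The sketch below combines the two exclusions from the example above with one pair given a distance of `0`; the pair 7 and 31 and the object names `forbidden` and `forced` are hypothetical.

```r
# Pairs that must never be matched (observation numbers from the text).
forbidden <- rbind(c(4, 20),
                   c(6, 55))

# Hypothetical pair that should be treated as zero distance apart.
forced <- rbind(c(7, 31))

# Third column: -1 excludes the pair, 0 is passed on as the distance.
restrict <- rbind(cbind(forbidden, -1),
                  cbind(forced, 0))
restrict
```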

`cluster` |
This can either be an object of the 'cluster' class
returned by one of the `makeCluster` commands in
the snow package or a vector of machine names so that `GenMatch` can
set up the cluster automatically. If it is the latter, the vector should
look like: `c("localhost","musil","musil","deckard")`. This vector would create a cluster with four nodes: one on the localhost, another on "deckard", and two on the machine named "musil". Two nodes on a given machine make sense if the machine has two or more chips/cores. `GenMatch` will set up a SOCK cluster by a call to
`makeSOCKcluster` . This will require the user
to type in her password for each node as the cluster is by default
created via `ssh` . One can add on usernames to the machine
name if it differs from the current shell: "username@musil". Other
cluster types, such as PVM and MPI,
which do not require passwords, can be created by directly calling
`makeCluster` , and then passing the returned
cluster object to `GenMatch`. For an example of how to manually set up
a cluster with a direct call to `makeCluster` see
http://sekhon.berkeley.edu/matching/R/cluster_manual.R.
For an example of how to get around a firewall by ssh tunneling see:
http://sekhon.berkeley.edu/matching/R/cluster_manual_tunnel.R. |

`balance` |
This logical flag controls if load balancing is done
across the cluster. Load balancing can result in better cluster
utilization; however, increased communication can reduce
performance. This option is best used if each individual call to
`Match` takes at least several minutes to
calculate or if the
nodes in the cluster vary significantly in their performance. If
`cluster==FALSE`, this option has no effect. |

`...` |
Other options which are passed on to `genoud` . |

`value` |
The fit values at the solution. By default, this is a
vector of p-values sorted from the smallest to the largest. There
will generally be twice as many p-values as there are variables in
`BalanceMatrix` , unless there are dichotomous variables in this
matrix. There is one p-value for each covariate in
`BalanceMatrix` which is the result of a paired t-test and
another p-value for each non-dichotomous variable in
`BalanceMatrix` which is the result of a Kolmogorov-Smirnov
test. Recall that these p-values cannot be interpreted as hypothesis
tests. They are simply measures of balance. |

`par` |
A vector of the weights given to each variable in `X` . |

`Weight.matrix` |
A matrix whose diagonal corresponds to the weight
given to each variable in `X` . This object corresponds to the
`Weight.matrix` in the `Match` function. |

`matches` |
A matrix where the first column contains the row
numbers of the treated observations in the matched dataset. The second
column contains the row numbers of the control observations. And the
third column contains the weight that each matched pair is given.
These columns correspond respectively to the `index.treated` ,
`index.control` and `weights` objects which are returned by
`Match` . |
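
The three columns of `matches` are enough to line up each treated observation with its matched control. The sketch below uses a hand-written matrix in place of real `GenMatch` output, and the data frame `dat` and its `age` column are toy assumptions.

```r
# Toy data: rows 1-2 are treated, rows 3-5 are controls.
dat <- data.frame(treat = c(1, 1, 0, 0, 0),
                  age   = c(30, 45, 31, 44, 46))

# Hand-written stand-in for the `matches` object: treated row, control
# row, weight. Treated row 2 is tied between two controls, so each of
# its pairs carries half the weight.
matches <- cbind(c(1, 2, 2), c(3, 4, 5), c(1, 0.5, 0.5))

# Line up the matched pairs side by side.
matched <- data.frame(
  treated_row = matches[, 1],
  control_row = matches[, 2],
  weight      = matches[, 3],
  age_treated = dat$age[matches[, 1]],
  age_control = dat$age[matches[, 2]]
)

# Weighted mean difference in age across matched pairs.
age_gap <- with(matched,
                sum(weight * (age_treated - age_control)) / sum(weight))
```

The weights sum to the number of treated observations, which is consistent with the description of the `ties` option above.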

`ecaliper` |
The size of the enforced caliper on the scale of the
`X` variables. This object has the same length as the number of
covariates in `X` . |

Jasjeet S. Sekhon, UC Berkeley, sekhon@berkeley.edu, http://sekhon.berkeley.edu/.

Sekhon, Jasjeet S. 2011. "Multivariate and Propensity Score
Matching Software with Automated Balance Optimization."
*Journal of Statistical Software* 42(7): 1-52.
http://www.jstatsoft.org/v42/i07/

Diamond, Alexis and Jasjeet S. Sekhon. 2005. "Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies." Working Paper. http://sekhon.berkeley.edu/papers/GenMatch.pdf

Sekhon, Jasjeet Singh and Walter R. Mebane, Jr. 1998. "Genetic
Optimization Using Derivatives: Theory and Application to Nonlinear
Models." *Political Analysis*, 7: 187-210.
http://sekhon.berkeley.edu/genoud/genoud.pdf

Sekhon, Jasjeet Singh and Richard D. Grieve. 2011. "A Matching Method
For Improving Covariate Balance in Cost-Effectiveness Analyses."
*Health Economics*. forthcoming.

Also see `Match`, `summary.Match`, `MatchBalance`, `genoud`,
`balanceUV`, `qqstats`, `ks.boot`, `GerberGreenImai`, `lalonde`

    library(Matching)

    data(lalonde)
    attach(lalonde)

    # The covariates we want to match on
    X <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74)

    # The covariates we want to obtain balance on
    BalanceMat <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74,
                        I(re74*re75))

    #
    # Let's call GenMatch() to find the optimal weight to give each
    # covariate in 'X' so that we achieve balance on the covariates in
    # 'BalanceMat'. This is only an example, so we want GenMatch to be quick:
    # the population size has been set to only 16 via the 'pop.size'
    # option. This is *WAY* too small for actual problems.
    # For details see http://sekhon.berkeley.edu/papers/MatchingJSS.pdf.
    #
    genout <- GenMatch(Tr=treat, X=X, BalanceMatrix=BalanceMat, estimand="ATE", M=1,
                       pop.size=16, max.generations=10, wait.generations=1)

    # The outcome variable
    Y <- re78/1000

    #
    # Now that GenMatch() has found the optimal weights, let's estimate
    # our causal effect of interest using those weights
    #
    mout <- Match(Y=Y, Tr=treat, X=X, estimand="ATE", Weight.matrix=genout)
    summary(mout)

    #
    # Let's determine if balance has actually been obtained on the variables of interest
    #
    mb <- MatchBalance(treat ~ age + educ + black + hisp + married + nodegr + u74 + u75 +
                       re75 + re74 + I(re74*re75),
                       match.out=mout, nboots=500)

    # For more examples see: http://sekhon.berkeley.edu/matching/R.