- psmpy: Propensity Score Matching in Python — and why it’s needed
- PsmPy
- Citing this work:
- Installation
- Data Prep
- Import psmpy class and functions
- Initialize PsmPy Class
- Predict Scores
- Matching algorithm — version 1
- Matching algorithm — version 2
- Graphical Outputs
- Plot the propensity score or propensity logits
- Plot the effect sizes
- Extra Attributes
- Matched IDs
- Matched Dataframe
- Effect sizes per variable
- Cohen D Function
- Conclusion
psmpy: Propensity Score Matching in Python — and why it’s needed
We often attempt to answer questions using randomized control trials (RCTs), which we use to perform A/B testing. We try to control the kind and number of variables that might affect the outcome, to ascertain whether some intervention (or lack thereof) has a potential causal link to that outcome. In the 3rd question posed above you might want to control for socioeconomic status (SES), neighborhood or parental education. In the case of medications, we may want to consider age, sex, race, underlying/pre-existing health conditions, etc. Enter RCTs…
RCTs are prospective (planned and outcomes occur in the future) and patients are usually matched (sharing similar features — SES, age, etc.) 1:1 for all the variables we want to control for, where 1 receives the intervention and 1 does not.
- Expensive $$$
- Don’t scale (as we increase number of variables we want to control for)
- Patient/participant drop out
- May ultimately not yield meaningful results
Propensity score matching (PSM) is a statistical technique used with retrospective data that attempts to perform the task that would normally occur in an RCT. The propensity score itself is the probability of treatment assignment conditional on observed baseline covariates.
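In symbols, the definition can be written as follows. The notation is an assumption made here for illustration and is not taken from the package: $Z$ is the binary treatment indicator (1 = received the intervention) and $X$ is the vector of observed baseline covariates.

```latex
e(x) = P(Z = 1 \mid X = x)
```

Two participants with similar values of $e(x)$ had a similar chance of receiving the intervention given their covariates, which is why matching on this single number stands in for matching on every covariate at once.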
- A large group of patient/participant data that have already been collected (historical data remember?) — age, sex, SES, weight. These are covariates. Often our covariates are potential confounders that could bias our results. The intervention = taking the drug (yes/no) and follow up on our outcome = heart attack (yes/no). Imagine electronic health record data!
- Try to find someone who is a ‘match’…
PsmPy
Matching techniques for epidemiological observational studies, carried out in Python. Propensity score matching is a statistical matching technique used with observational data to assess whether there is a potential causal link between a treatment or intervention and one or more outcomes of interest. It does so by accounting for a set of covariates between the two states of a binary treatment (as in a randomized control trial, where a participant either received the intervention or did not) and controlling for potential confounding by those covariates in outcome measures, such as death or length of stay, between the treatment and control groups. Applying this technique to observational data gives insight into the effects, or lack thereof, of an intervention.
Citing this work:
A. Kline and Y. Luo, PsmPy: A Package for Retrospective Cohort Matching in Python, 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2022, pp. 1354-1357, doi: 10.1109/EMBC48229.2022.9871333.
- Integration with Jupyter Notebooks
- Additional plotting functionality to assess balance before and after
- A more modular, user-specified matching process
- Ability to define 1:1 or 1:many matching
Installation
Install the package through pip by running pip install psmpy.
Data Prep
Import psmpy class and functions
Initialize PsmPy Class
Initialize the PsmPy class:
- PsmPy: the class. It uses all covariates in the dataset unless they are explicitly listed in the exclude argument.
- df: the dataframe being passed to the class.
- exclude: optional. A list of strings naming the covariates (columns) to ignore during model fitting. It is not necessary to pass the unique index column here; that is handled within the code once the index column has been specified.
- indx: required. References the column holding a unique ID number for each case in the dataset.
Predict Scores
Calculate logistic propensity scores/logits:
There is often significant class imbalance in the data. The software detects this automatically when the majority group has more records than the minority group. We account for it by setting balance=True when calling psm.logistic_ps(). This tells PsmPy to sample from the majority group when fitting the logistic regression model so that the groups are of equal size. The process is repeated until all entries of the major class have been regressed against the minor class in equal partitions. The call calculates both the logistic propensity score and the logit for each entry.
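The partitioning idea and the score-to-logit conversion can be sketched in plain Python. This is only an illustration under assumptions: balanced_partitions and logit are hypothetical helper names, and the text does not show how PsmPy actually samples the majority group, so the shuffle-and-chunk scheme below is one plausible reading, not the package's implementation.

```python
import math
import random


def balanced_partitions(majority_ids, minority_ids, seed=0):
    """Yield (majority_chunk, minority_ids) pairs of roughly equal size.

    Hypothetical helper: shuffles the majority group and cuts it into
    chunks no larger than the minority group, so that every majority
    record is paired with the full minority group exactly once.
    """
    rng = random.Random(seed)
    shuffled = list(majority_ids)
    rng.shuffle(shuffled)
    size = len(minority_ids)
    for start in range(0, len(shuffled), size):
        yield shuffled[start:start + size], list(minority_ids)


def logit(p):
    """Convert a propensity score p in (0, 1) to its logit, ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


controls = list(range(100, 110))   # 10 majority (control) IDs
treated = [1, 2, 3, 4]             # 4 minority (treated) IDs
chunks = [chunk for chunk, _ in balanced_partitions(controls, treated)]
print([len(c) for c in chunks])    # [4, 4, 2]
print(logit(0.5))                  # 0.0
```

In a real workflow a model would be fitted on each chunk together with the minority group; here only the bookkeeping is shown.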
Review values in dataframe:
Matching algorithm — version 1
- matcher: propensity_logit (default), generated in the previous step; the alternative is propensity_score. Specifies the value on which matching proceeds.
- replacement: False (default). Determines whether matching happens with or without replacement; when replacement is False, matching is 1:1.
- caliper: None (default). The user can specify a caliper size relative to the standard deviation of the control sample, restricting the neighbors eligible to match to those within a certain distance.
- drop_unmatched: True (default). Indexes that have no match because of the caliper size are removed from ‘matched_df’, ‘matched_ids’ and subsequent effect size calculations.
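To make the roles of replacement, caliper and drop_unmatched concrete, here is a minimal greedy nearest-neighbour sketch in plain Python. It is an assumption-laden illustration, not PsmPy's code: match_one_to_one is a hypothetical function, records are plain dicts of ID to score, and treated records are processed in insertion order.

```python
import statistics


def match_one_to_one(treated, controls, caliper=None, replacement=False):
    """Greedy nearest-neighbour matching on a single score (e.g. the logit).

    treated, controls: dicts mapping record ID -> score.
    caliper: maximum allowed distance, expressed as a multiple of the
             standard deviation of the control scores (None = no limit).
    Returns a dict mapping treated ID -> matched control ID, or None
    when no control is available or none lies within the caliper.
    """
    limit = None
    if caliper is not None:
        limit = caliper * statistics.stdev(list(controls.values()))
    available = dict(controls)
    matches = {}
    for t_id, t_score in treated.items():
        if not available:
            matches[t_id] = None
            continue
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if limit is not None and abs(available[c_id] - t_score) > limit:
            matches[t_id] = None          # unmatched because of the caliper
            continue
        matches[t_id] = c_id
        if not replacement:
            del available[c_id]           # each control is used at most once
    return matches


treated = {"t1": 0.10, "t2": 0.12, "t3": 2.50}
controls = {"c1": 0.11, "c2": 0.15, "c3": 0.90, "c4": -0.40}

matches = match_one_to_one(treated, controls, caliper=0.5)
# Mimic drop_unmatched=True by discarding treated records with no match.
matched_only = {t: c for t, c in matches.items() if c is not None}
print(matched_only)   # {'t1': 'c1', 't2': 'c2'}

reused = match_one_to_one(treated, controls, replacement=True)
print(reused)         # {'t1': 'c1', 't2': 'c1', 't3': 'c3'}
```

With the caliper set, t3 has no control close enough and is dropped; with replacement allowed, c1 is matched twice, so the matched control IDs are no longer unique.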
Matching algorithm — version 2
Perform KNN matching 1:many
- matcher: propensity_logit (default), generated in the previous step; the alternative is propensity_score. Specifies the value on which matching proceeds.
- how_many: 1 (default). Performs 1:n matching, where n is specified by the user; each record in the minor class is matched to n records in the major class.
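The 1:n idea can be sketched the same way. The function below is hypothetical and assumes that a control, once matched, is not reused; the text does not say whether PsmPy's 1:many matching reuses controls, so treat that as a design choice of this sketch.

```python
def match_one_to_many(treated, controls, how_many=1):
    """Match each treated record to its `how_many` nearest unused controls.

    treated, controls: dicts mapping record ID -> score.
    Returns a dict mapping treated ID -> list of control IDs, nearest first.
    """
    available = dict(controls)
    matches = {}
    for t_id, t_score in treated.items():
        ranked = sorted(available, key=lambda c: abs(available[c] - t_score))
        nearest = ranked[:how_many]
        matches[t_id] = nearest
        for c_id in nearest:
            del available[c_id]   # assumption: controls are not reused
    return matches


treated = {"t1": 0.10, "t2": 0.80}
controls = {"c1": 0.05, "c2": 0.12, "c3": 0.70, "c4": 0.95, "c5": 0.40}

pairs = match_one_to_many(treated, controls, how_many=2)
print(pairs)   # {'t1': ['c2', 'c1'], 't2': ['c3', 'c4']}
```

If the pool of controls runs out, later treated records simply receive shorter lists, which is worth checking before computing effect sizes.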
Graphical Outputs
Plot the propensity score or propensity logits
Plot the distribution of the propensity scores (or logits) for the two groups side by side. Note that here the names are coded as ‘treatment’ and ‘control’ under the assumption that the majority class you are sampling from is the control group. If this is not the case you will need to flip the order of these.
- title: ‘Side by side matched controls’ (default), creates the plot title
- Ylabel: ‘Number of patients’ (default), string, label for the y-axis
- Xlabel: ‘Propensity logit’ (default), string, label for the x-axis
- names: [‘treatment’, ‘control’] (default), list of strings for the legend
- colors: [‘#E69F00’, ‘#56B4E9’] (default), plotting colors
- save: False (default), saves the generated figure to the current working directory if True
Plot the effect sizes
- title: title of the plot
- before_color: color (hex) for the effect size before matching
- after_color: color (hex) for the effect size after matching
- save: False (default), saves the generated figure to the current working directory if True
Extra Attributes
Other attributes available to user:
Matched IDs
Note: not all matches will be unique if replacement=True.
Matched Dataframe
Effect sizes per variable
Note: The thresholds for small, medium and large effect sizes were characterized by Cohen in: J. Cohen, "A Power Primer", Psychological Bulletin, vol. 112, no. 1, pp. 155-159, 1992.
Relative Size | Effect Size |
---|---|
small | ≤ 0.2 |
medium | ≤ 0.5 |
large | ≤ 0.8 |
Cohen D Function
A function to calculate effect size (Cohen D) can be imported alone should the user have a need for it. A floating point number is returned. This floating point number represents the effect size of a variable on a binary outcome.
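For reference, the usual pooled-standard-deviation form of Cohen's d can be written with the standard library alone. This is a sketch under assumptions: cohen_d is a hypothetical name, it takes two plain lists of values rather than a dataframe, and the package's own function may differ in signature and in details such as taking the absolute value.

```python
import math
import statistics


def cohen_d(group_a, group_b):
    """Cohen's d for two samples, using the pooled standard deviation.

    Each group needs at least two values so that a sample variance exists.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Values of one covariate in the treated and control groups (made-up numbers).
treated_age = [2, 4, 6, 8]
control_age = [1, 3, 5, 7]

d = cohen_d(treated_age, control_age)
print(round(d, 3))   # 0.387
```

The sign of the result depends on the order of the two groups, so the magnitude is what gets compared against the small, medium and large thresholds.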
Conclusion
This package offers a user-friendly propensity score matching protocol for the Python environment. In it we have tried to provide automatic figure generation, contextualization of the results, and flexibility in the matching and modeling protocol, so that it can serve a wide user base.
Propensity Score Matching tutorial in Python
RyanPiao/Tutorial-Propensity-Score-Matching
Conclusion: Males receive 26% higher wages than females with similar backgrounds.
In this tutorial, I will demonstrate how Propensity Score Matching is implemented in Python.
More on Propensity Score Matching
Demonstration and Results