In this blog post, I’ll introduce the concept of Directed Acyclic Graphs (DAGs) and I’ll show how you can use simulation studies to demonstrate that unbiased effects can be estimated by blocking the backdoor paths identified in DAGs.
A DAG is a diagram that shows which variables we believe cause which others in a situation of interest. Here, we have a DAG showing the relationships among social factors, smoking, exercise, and heart disease. We believe that social factors might cause smoking (for example, people with lower incomes might be more likely to smoke than people with higher incomes). We also believe that social factors might cause heart disease (for example, people who have more money may be able to access more preventative care for heart disease). Finally, we believe that exercise influences the risk of heart disease (more exercise lowers the risk).
using Graphs, GraphPlot
using Compose, Cairo, Fontconfig
g = SimpleDiGraph(4)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 2, 3)
add_edge!(g, 4, 3)
nodelabel = ["Social\nFactors", "Smoking", "Heart\nDisease", "Exercise"]
locs_x = [0, 2, 3, 2]
locs_y = [0, -1, 1, 0]
nodestrokelw = [0, 0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("smoking.png", 16cm, 16cm),g)
g
[Figure: DAG with Social Factors → Smoking, Social Factors → Heart Disease, Smoking → Heart Disease, and Exercise → Heart Disease]
Note that in this DAG, there are no arrows between social factors and exercise (we assume that people of high and low socioeconomic status get the same amount of exercise) and no arrows between smoking and exercise (we do not believe that smokers are more or less likely than non-smokers to exercise). Also, note that each edge in a DAG points in only one direction: this is the "Directed" in DAG. The "Acyclic" in DAG means that there can be no cycles, so there is no path from any component back to itself.
Also, note that this DAG is just a representation of how we believe the world to be. By creating a DAG, we do not claim that we have the exact correct understanding of which things cause which other things. However, by using a DAG, we clearly present our assumptions, and make it easy for others to see where our assumptions may have been inaccurate.
When two arrows point into a component of a graph, that component is called a collider. In the DAG below, L is a collider.
If a collider is conditioned upon (designated by drawing a circle around it), the path between X and Y is open. If a collider is not conditioned upon, the path is closed.
using Graphs, GraphPlot
using Compose, Cairo, Fontconfig
g = SimpleDiGraph(3)
add_edge!(g, 1, 2)
add_edge!(g, 3, 2)
nodelabel = ["X", "L", "Y"]
locs_x = [0, 2, 3]
locs_y = [0, -1, 1]
nodestrokelw = [0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("collider_not_conditioned.png", 16cm, 16cm),g)
g
[Figure: collider DAG with X → L ← Y; L is not conditioned upon]
using Graphs, GraphPlot
using Compose, Cairo, Fontconfig
g = SimpleDiGraph(3)
add_edge!(g, 1, 2)
add_edge!(g, 3, 2)
nodelabel = ["X", "L", "Y"]
locs_x = [0, 2, 3]
locs_y = [0, -1, 1]
nodestrokelw = [0, 1, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("collider_conditioned.png", 16cm, 16cm),g)
g
[Figure: collider DAG with X → L ← Y; L is conditioned upon (circled)]
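This rule can be checked with a quick simulation (a sketch with made-up data, separate from the examples later in this post): X and Y are independent causes of a collider L, so they are uncorrelated overall, but restricting to one level of L induces a correlation.

```julia
using Distributions, Statistics

n = 100_000
x = rand(Normal(0, 1), n)  # X, an independent cause of L
y = rand(Normal(0, 1), n)  # Y, an independent cause of L
l = (x .+ y) .> 0          # the collider L depends on both X and Y

cor(x, y)        # close to zero: the path X -> L <- Y is closed
cor(x[l], y[l])  # clearly negative: conditioning on L opens the path
```

Here, conditioning is done by restricting to observations where l is true; including l as a covariate in a regression of y on x would open the path in the same way.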
A component on a path is a non-collider if at least one arrow along the path points out of it to another component.
For non-colliders, a path is closed if the component is conditioned upon, and open if it is not.
In the DAG below, L is a non-collider, and there exists an open path between X and Y.
using Graphs, GraphPlot
using Compose, Cairo, Fontconfig
g = SimpleDiGraph(3)
add_edge!(g, 2, 1)
add_edge!(g, 2, 3)
nodelabel = ["X", "L", "Y"]
locs_x = [0, 2, 3]
locs_y = [0, -1, 1]
nodestrokelw = [0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("non_collider.png", 16cm, 16cm),g)
g
[Figure: non-collider DAG with X ← L → Y]
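As a numeric sketch of this rule (made-up data; L is a common cause of X and Y, and X has no direct effect on Y):

```julia
using Distributions, DataFrames, GLM

n = 100_000
l = rand(Normal(0, 1), n)       # the non-collider L
x = l .+ rand(Normal(0, 1), n)  # L -> X
y = l .+ rand(Normal(0, 1), n)  # L -> Y; X has no direct effect on Y
df = DataFrame(x = x, y = y, l = l)

coef(lm(@formula(y ~ x), df))[2]      # about 0.5: the open path through L creates a spurious association
coef(lm(@formula(y ~ x + l), df))[2]  # about 0: conditioning on L closes the path
```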
One way in which estimates of causal effects may be biased is if there is an open backdoor path between the independent variable and the dependent variable. A path is a backdoor path if it has an arrow coming into (rather than out of) the independent variable.
We can condition upon a component of our graph in several ways. In this post, I will condition on a component either by including it as a covariate in a regression model or by restricting our analytical population based on its value. I will signify that a component is conditioned upon by drawing a circle around it.
A path between two variables is blocked if it contains a non-collider that has been conditioned upon, or if it contains a collider that has not been conditioned upon. (To be precise, for the path to be blocked, none of the collider's descendants may be conditioned upon either.)
In the DAG below, the backdoor path between X and Y is open because it contains only non-colliders, none of which have been conditioned upon. If we were to estimate the effect of X on Y without including any other covariate, the estimate would be biased.
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 2, 3)
add_edge!(g, 1, 4)
add_edge!(g, 5, 2)
nodelabel = ["A", "C", "Y", "X", "B"]
locs_x = [0, 2, 3, 1, 2]
locs_y = [0, -1, 1, -2, 0]
nodestrokelw = [0, 0, 0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("open_path.png", 16cm, 16cm),g)
g
[Figure: DAG with edges A → C, C → Y, A → X, and B → C; the backdoor path X ← A → C → Y is open]
Here, the backdoor path is closed because we have conditioned on C.
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 2, 3)
add_edge!(g, 1, 4)
add_edge!(g, 5, 2)
nodelabel = ["A", "C", "Y", "X", "B"]
locs_x = [0, 2, 3, 1, 2]
locs_y = [0, -1, 1, -2, 0]
nodestrokelw = [0, 1, 0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("closed_path.png", 16cm, 16cm),g)
g
[Figure: DAG with edges A → C, C → Y, A → X, and B → C; C is conditioned upon (circled), closing the backdoor path]
Julia is a programming language designed for high-performance scientific computing, and it is well suited to statistics and data science. I won't attempt to explain everything about using Julia in this post, but there are a few commands that you should be familiar with:
using DataFrames, Distributions, GLM
x = rand(10) #creates 10 random numbers between 0 and 1
y = x .+ 5 # adds 5 to each of the random numbers that have been created. So, these numbers will be between 5 and 6.
# The . in .+ means that we are adding 5 to each element of x
# if we were adding 2 vectors together or 2 scalars together we would just use +
z = rand(Normal(0, 1), 10) #samples 10 numbers from a normal distribution with mean 0 and standard deviation 1
df = DataFrame(z = z, y = y) #create a dataframe with columns for z and y
lm(@formula(y ~ z), df) # fit a linear model with y as the dependent variable and z as the independent variable
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + z
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 5.52486 0.107426 51.43 <1e-10 5.27714 5.77259
z 0.0261243 0.0831303 0.31 0.7614 -0.165575 0.217823
─────────────────────────────────────────────────────────────────────────
Now, we can start simulating some data. We’ll start by considering a situation which can be represented using this DAG:
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 4, 2)
add_edge!(g, 5, 3)
nodelabel = ["Age", "Cane\nUse", "Heart\nDisease", "Injury", "Blood\nPressure"]
locs_x = [0, 2, 3, 1, 2]
locs_y = [0, -1, 1, -2, 2]
nodestrokelw = [0, 0, 0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("heart_disease_1.png", 16cm, 16cm),g)
g
[Figure: DAG with Age → Cane Use, Age → Heart Disease, Injury → Cane Use, and Blood Pressure → Heart Disease]
First, let’s generate plausible values for age, injury, and blood pressure. These variables don’t have any arrows going into them, so we assume that they are generated by a purely random process. Note that this simulation is deliberately simplistic and not representative of real-life data.
using Distributions
n = 1000
blood_pressure = rand(Normal(80, 10), n) # reasonable range of numbers for blood pressure
injury = rand(Bernoulli(0.1), n) # 10% of our population has a leg injury
age = rand(n) * 100 #We'll assume that our population is uniformly distributed between 0 and 100
1000-element Vector{Float64}:
95.5475455002515
39.09850072823843
72.61833493194055
⋮
18.05422979833491
28.693898067449975
Now, we’ll generate heart disease and cane use, which will not be independent.
For cane use, we’ll assume that having an injury is like adding 40 years to your age. And, we’ll say that anyone with an injury-adjusted “age” over 80 will use a cane.
cane_use_score = age .+ injury .* 40
cane_use = cane_use_score .> 80
1000-element BitVector:
1
0
0
⋮
0
0
heart_disease_score = age .+ blood_pressure
heart_disease = heart_disease_score .> 120
1000-element BitVector:
1
1
1
⋮
0
0
using DataFrames
df = DataFrame(blood_pressure = blood_pressure, age = age, injury=injury, cane_use = cane_use, heart_disease = heart_disease)
If we only include cane use in this model, we leave an open backdoor path. If we fit a logistic regression model that leaves the backdoor path open, we will have a biased estimate.
The true effect of cane use on heart disease is zero, because cane use does not appear in our generating process for heart disease. Thus, if the estimated coefficient for cane use is clearly non-zero, we know that the estimate is biased.
using GLM
unadjusted_model = glm(@formula(heart_disease ~ cane_use), df, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
heart_disease ~ 1 + cane_use
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) -0.190157 0.073409 -2.59 0.0096 -0.334036 -0.0462781
cane_use 4.08605 0.457633 8.93 <1e-18 3.18911 4.983
──────────────────────────────────────────────────────────────────────────
Now, let’s close the backdoor path in this model by adjusting for age.
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 4, 2)
add_edge!(g, 5, 3)
nodelabel = ["Age", "Cane\nUse", "Heart\nDisease", "Injury", "Blood\nPressure"]
locs_x = [0, 2, 3, 1, 2]
locs_y = [0, -1, 1, -2, 2]
nodestrokelw = [1, 0, 0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("heart_disease_blocked.png", 16cm, 16cm),g)
g
[Figure: DAG with Age → Cane Use, Age → Heart Disease, Injury → Cane Use, and Blood Pressure → Heart Disease; Age is conditioned upon (circled)]
adjusted_model = glm(@formula(heart_disease ~ cane_use + age), df, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
heart_disease ~ 1 + cane_use + age
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) -7.9067 0.628814 -12.57 <1e-35 -9.13915 -6.67425
cane_use 0.170928 0.60365 0.28 0.7771 -1.01221 1.35406
age 0.195755 0.0151852 12.89 <1e-37 0.165993 0.225518
──────────────────────────────────────────────────────────────────────────
After adjusting for the confounder, cane use is no longer associated with heart disease.
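To make this comparison concrete, we can pull the cane-use coefficients and confidence intervals out of the two fitted models with `coef` and `confint` (a sketch; it assumes the `unadjusted_model` and `adjusted_model` objects fit above are still in scope):

```julia
using GLM

# Row 2 of each coefficient table corresponds to cane_use (row 1 is the intercept)
coef(unadjusted_model)[2]  # far from zero: biased by the open backdoor path
coef(adjusted_model)[2]    # near zero once conditioning on age blocks the path

# 95% confidence intervals; the adjusted interval should cover the true effect of zero
confint(unadjusted_model)[2, :]
confint(adjusted_model)[2, :]
```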
Let’s take a look at selection bias.
Here, we’ll do a study to see if there is a relationship between age and height. But, because of funding limitations, we can only recruit for our study at an amusement park. We decide to set up our station at the exit doors of a roller coaster. However, the roller coaster requires that individuals be at least 9 years old and at least 150 cm tall to ride.
First, let’s generate some data, assuming that a child grows 5 cm each year.
using Graphs, GraphPlot
g = SimpleDiGraph(3)
add_edge!(g, 1, 2)
add_edge!(g, 3, 2)
add_edge!(g, 1, 3)
nodelabel = ["Age", "Study\nEntry", "Height"]
nodestrokelw = [0, 1, 0]
locs_x = [-10, 10, 0]
locs_y = [1, 1, 1]
gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15,
linetype = "curve"
)
[Figure: DAG with Age → Study Entry, Height → Study Entry, and Age → Height; Study Entry is conditioned upon (circled)]
using DataFrames, Distributions, Statistics
n = 10000000
height = rand(Normal(130, 10), n) # baseline heights, before adding the age effect
age = rand(n) * 18 #We'll assume that only young people like going to theme parks
df = DataFrame(age = age, height = height)
df.height = df.height + df.age .* 5 .- mean(df.age) .* 5 # each year of age adds 5 cm, centered at the mean age
selected_df = df[(df.age .> 9) .& (df.height .> 150), :] # only riders old enough and tall enough enter the study
Now, let’s verify that the true relationship could have been recovered if no selection had occurred. This is what our sample would have looked like if we had taken a random sample of children and asked them their ages and heights.
using Graphs, GraphPlot
g = SimpleDiGraph(3)
#add_edge!(g, 1, 2)
#add_edge!(g, 3, 2)
add_edge!(g, 1, 3)
nodelabel = ["Age", "Study\nEntry", "Height"]
nodestrokelw = [0, 0, 0]
locs_x = [-10, 10, 0]
locs_y = [1, 1, 1]
gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15,
linetype = "curve"
)
[Figure: DAG with only Age → Height; no arrows point into Study Entry]
using GLM
model_no_selection = lm(@formula(height ~ age), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
height ~ 1 + age
Coefficients:
────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept) 85.0038 0.00632593 13437.36 <1e-99 84.9914 85.0161
age 4.99935 0.000608695 8213.22 <1e-99 4.99816 5.00054
────────────────────────────────────────────────────────────────────────────
And, now, we’ll show that if we use data that are collected with selection effects, our estimate for how much someone grows per year of age is biased.
using GLM
model_selection = lm(@formula(height ~ age), selected_df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
height ~ 1 + age
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 120.835 0.0384283 3144.42 <1e-99 120.76 120.91
age 2.88792 0.0025198 1146.09 <1e-99 2.88298 2.89286
───────────────────────────────────────────────────────────────────────────
One kind of bias that can be seen in causal inference is M-bias. In M-bias, an open backdoor path can exist despite there being no “common cause” of the independent and dependent variables.
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 1, 5)
add_edge!(g, 4, 2)
add_edge!(g, 4, 3)
#add_edge!(g, 5, 3)
nodelabel = ["Age", "Classical\nMusic", "Person's\nIncome", "Parents'\nIncome", "Partner's\nIncome"]
locs_x = [0, 2, 5, 1, 2]
locs_y = [0, -1, -1, -2, 2]
nodestrokelw = [0, 0, 0, 0, 0]
p = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .16)
draw(PNG("m_bias_1.png", 16cm, 16cm),p)
p
[Figure: M-bias DAG with Age → Classical Music, Age → Partner's Income, Parents' Income → Classical Music, and Parents' Income → Person's Income]
n = 1000
age = rand(n) .* 50 .+ 20 #We'll look at people with ages 20 to 70.
parents_income = rand(Normal(50000, 20000), n) #income is definitely not normally distributed, but we'll use it in this example
# We'll assume everyone has some chance of being really interested in classical music
musical_interest = rand(Bernoulli(0.3), n) .* 2
# We'll assume that parents with higher income will always have their kids listen to more classical music
# (1 additional minute of music per 100 dollars of income) and that people who are older listen to more
# classical music (10 additional minutes per year of age)
minutes_per_year = musical_interest .* 30 + parents_income ./ 100 + age .* 10
# We'll assume that partners are about the same age as each other, and that income
# increases by about 1,000 dollars per year of age, on top of an average baseline of 20,000 per year.
partner_income = rand(Normal(20000, 10000), n) + age .* 1000
# We'll assume that a person's income is related to their parents' income, and that people vary
# from their parents' income by about $10,000 on average.
personal_income = parents_income + rand(Normal(0, 10000), n)
df = DataFrame(age = age, parents_income = parents_income,
minutes_per_year = minutes_per_year, partner_income = partner_income,
personal_income = personal_income)
using GLM
model = lm(@formula(personal_income ~ partner_income), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
personal_income ~ 1 + partner_income
Coefficients:
──────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 54596.2 2711.84 20.13 <1e-75 49274.7 59917.8
partner_income -0.0711116 0.0401647 -1.77 0.0769 -0.149928 0.00770525
──────────────────────────────────────────────────────────────────────────────────────────
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 1, 5)
add_edge!(g, 4, 2)
add_edge!(g, 4, 3)
#add_edge!(g, 5, 3)
nodelabel = ["Age", "Classical\nMusic", "Person's\nIncome", "Parents'\nIncome", "Partner's\nIncome"]
locs_x = [0, 2, 5, 1, 2]
locs_y = [0, -1, -1, -2, 2]
nodestrokelw = [0, 1, 0, 0, 0]
p = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .16)
draw(PNG("m_bias_open_path.png", 16cm, 16cm),p)
p
[Figure: M-bias DAG with Classical Music conditioned upon (circled)]
using GLM
model_conditioned = lm(@formula(personal_income ~ partner_income + minutes_per_year), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
personal_income ~ 1 + partner_income + minutes_per_year
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 7992.43 1924.05 4.15 <1e-04 4216.78 11768.1
partner_income -0.581704 0.0264102 -22.03 <1e-87 -0.633529 -0.529878
minutes_per_year 82.5458 1.90231 43.39 <1e-99 78.8128 86.2788
────────────────────────────────────────────────────────────────────────────────────────
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 1, 5)
add_edge!(g, 4, 2)
add_edge!(g, 4, 3)
#add_edge!(g, 5, 3)
nodelabel = ["Age", "Classical\nMusic", "Person's\nIncome", "Parents'\nIncome", "Partner's\nIncome"]
locs_x = [0, 2, 5, 1, 2]
locs_y = [0, -1, -1, -2, 2]
nodestrokelw = [0, 1, 0, 1, 0]
p = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .16)
draw(PNG("m_bias_2_conditions.png", 16cm, 16cm),p)
p
[Figure: M-bias DAG with Classical Music and Parents' Income both conditioned upon (circled)]
using GLM
model_conditioned_backdoor_blocked = lm(@formula(personal_income ~ partner_income + parents_income + minutes_per_year), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
personal_income ~ 1 + partner_income + parents_income + minutes_per_year
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 2709.85 1566.75 1.73 0.0840 -364.652 5784.35
partner_income -0.0433236 0.0314577 -1.38 0.1688 -0.105055 0.0184074
parents_income 0.964769 0.0415196 23.24 <1e-95 0.883293 1.04625
minutes_per_year 1.82847 3.79682 0.48 0.6302 -5.62222 9.27916
────────────────────────────────────────────────────────────────────────────────────────
Now, what would happen if there were an effect of partner income on personal income? (For example, a high-earning partner might help you get a job at their high-paying company.)
personal_income = parents_income + rand(Normal(0, 10000), n) + partner_income .- mean(partner_income)
df = DataFrame(age = age, parents_income = parents_income,
minutes_per_year = minutes_per_year, partner_income = partner_income,
personal_income = personal_income)
We still recover the correct effect (in this case, 1) as long as we do not condition on the minutes of classical music per year.
using GLM
model = lm(@formula(personal_income ~ partner_income), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
personal_income ~ 1 + partner_income
Coefficients:
──────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────────────
(Intercept) -11324.3 2806.52 -4.03 <1e-04 -16831.6 -5816.91
partner_income 0.942698 0.041567 22.68 <1e-91 0.86113 1.02427
──────────────────────────────────────────────────────────────────────────────────────
using GLM
model_conditioned = lm(@formula(personal_income ~ partner_income + minutes_per_year), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
personal_income ~ 1 + partner_income + minutes_per_year
Coefficients:
─────────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) -60272.4 1934.07 -31.16 <1e-99 -64067.7 -56477.1
partner_income 0.406422 0.0265477 15.31 <1e-46 0.354326 0.458517
minutes_per_year 86.6982 1.91222 45.34 <1e-99 82.9458 90.4506
─────────────────────────────────────────────────────────────────────────────────────────────
using GLM
model_conditioned_backdoor_blocked = lm(@formula(personal_income ~ partner_income + parents_income + minutes_per_year), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
personal_income ~ 1 + partner_income + parents_income + minutes_per_year
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────────
(Intercept) -65722.8 1551.88 -42.35 <1e-99 -68768.1 -62677.5
partner_income 0.961906 0.0311592 30.87 <1e-99 0.90076 1.02305
parents_income 0.99542 0.0411256 24.20 <1e-99 0.914717 1.07612
minutes_per_year 3.41651 3.76079 0.91 0.3639 -3.96347 10.7965
────────────────────────────────────────────────────────────────────────────────────────────
For more information, see here
using Graphs, GraphPlot
g = SimpleDiGraph(5)
add_edge!(g, 1, 2)
add_edge!(g, 2, 3)
add_edge!(g, 1, 4)
add_edge!(g, 2, 5)
nodelabel = ["A", "C", "Y", "X", "B"]
locs_x = [0, 2, 3, 1, 2]
locs_y = [0, -1, 1, -2, 2]
nodestrokelw = [0, 1, 0, 0, 0]
g = gplot(g, locs_x, locs_y, nodelabel=nodelabel,
nodestrokec = "black", nodestrokelw = nodestrokelw, NODESIZE = .15)
draw(PNG("closed_backdoor_example.png", 16cm, 16cm),g)
g
[Figure: DAG with edges A → C, C → Y, A → X, and C → B; C is conditioned upon (circled)]