04 - Categories & Curves

Homework - solutions

Book question 4M4

A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors.

Solution

h_ij ~ Normal(mu_ij, sigma)
mu_ij = alpha + beta * (y_j - y_bar)
alpha ~ Normal(100, 10)
beta ~ Normal(0, 10)
sigma ~ Exponential(1)

Where

h is height
y is year
y_bar is average year
index i for students
index j for years

Book question 4M5

Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How?

Solution

The beta should be positive only: we could use lower bound or log-normal distribution.

df <- tibble(dist = c(dist_lognormal(1, 0.5)))

ggplot(df, aes(y = 1)) +
  stat_dist_halfeye(aes(dist = dist))

Book question 4M6

Now suppose I tell you that the variance among heights for students of the same age is never more than 64 cm. How does this lead you to revise your priors?

Solution

We could use uniform distribution (sigma ~ Uniform(0, 8)) or an upper bound.

Book question 4H1

The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% intervals for each of these individuals. That is, fill in the table below, using model-based predictions.

Individual	weight	expected height	89% interval
1	46.95
2	43.72
3	64.78
4	32.59
5	54.63

Solution

data(Howell1, package = "rethinking")
d <- as_tibble(Howell1)
d2 <- d %>% filter(age >= 18)

m <-
  brm(
    data = d2,
    family = gaussian,
    height ~ 1 + weight,
    prior = c(
      prior(normal(100, 30), class = Intercept),
      prior(normal(0, 10), class = b, lb = 0),
      prior(exponential(10), class = sigma)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    save_pars = save_pars(all = T),
    sample_prior = T,
    seed = 4,
    file = file.path(models_folder, "04_hw4h1")
  )
nd <- tibble(weight = c(46.95, 43.72, 64.78, 32.59, 54.63))
p89 <- c((1-0.89)/2, ((1-0.89)/2) + 0.89)
# fitted(m, newdata = nd, probs = p89)
predict(m, newdata = nd, probs = p89)

     Estimate Est.Error     Q5.5    Q94.5
[1,] 156.4527  4.809213 148.7962 164.1524
[2,] 153.4157  4.658831 145.9057 160.9373
[3,] 172.4229  4.890977 164.5585 180.4124
[4,] 143.3478  4.826957 135.6490 151.0508
[5,] 163.3167  4.725462 155.8310 171.0050

Book question 4H2

Select out all the rows in the Howell1 data with ages below 18 years of age. If you do it right, you should end up with a new data frame with 192 rows in it.

Fit a linear regression to these data, using quap. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?
Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Superimpose the MAP regression line and 89% interval for the mean. Also superimpose the 89% interval for predicted heights.
What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don’t have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.

Solution

data(Howell1, package = "rethinking")
d <- as_tibble(Howell1)
d1 <- d %>% filter(age < 18) %>% mutate(weight_c = weight - mean(weight))

m1 <- 
  brm(
    data = d1,
    family = gaussian,
    height ~ 1 + weight_c,
    prior = c(
      prior(normal(100, 30), class = Intercept),
      prior(normal(0, 10), class = b, lb = 0),
      prior(exponential(10), class = sigma)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    save_pars = save_pars(all = T),
    sample_prior = T,
    seed = 4,
    file = file.path(models_folder, "04_hw4h2")
  )
summary(m1)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: height ~ 1 + weight_c 
   Data: d1 (Number of observations: 192) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   108.32      0.53   107.33   109.37 1.00     3942     3080
weight_c      2.72      0.06     2.60     2.83 1.00     4275     2605

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     7.25      0.31     6.66     7.87 1.00     4150     2801

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

nd <- tibble(weight = seq(5, 45), weight_c = weight - mean(d1$weight))
f <- fitted(m1, newdata = nd) %>% bind_cols(nd)
p <- predict(m1, newdata = nd) %>% bind_cols(nd)

d1 %>% 
  ggplot(aes(x = weight, y = height)) + 
  geom_ribbon(
    aes(x = weight, y = Estimate, ymin = Q2.5, ymax = Q97.5), 
    data = p, fill = "grey"
  ) + 
  geom_ribbon(
    aes(x = weight, y = Estimate, ymin = Q2.5, ymax = Q97.5), 
    data = f, fill = "blue"
  ) + 
  geom_point()

On average, we expect an increase by 2.7 cm for each 1 kg.
See above.
The relationship between weight and height is not linear. Note the model underestimates the height for central weight values and overestimates for low/high values. Maybe a quadratic or logarithmic model would be better here.