---
title: "04 - Categories & Curves"
subtitle: "Homework - solutions"
format:
  html: 
    embed-resources: true
    code-fold: false
    toc: true
    toc-depth: 4
---

```{r}
#| echo: false
#| output: false
library(tidyverse)
library(patchwork)
library(GGally)
library(brms)
library(tidybayes)
library(distributional)

theme_set(theme_minimal())

# storage for models
models_folder <- "models"
```

## Book question 4M4

A sample of students is measured for height each year for 3 years. After the third year, you want to fit a linear regression predicting height using year as a predictor. Write down the mathematical model definition for this regression, using any variable names and priors you choose. Be prepared to defend your choice of priors.

### Solution

- h_ij ~ Normal(mu_ij, sigma)
- mu_ij = alpha + beta * (y_j - y_bar)
- alpha ~ Normal(100, 10)
- beta ~ Normal(0, 10)
- sigma ~ Exponential(1)

Where

- h is height
- y is year
- y_bar is average year
- index i for students
- index j for years

## Book question 4M5

Now suppose I remind you that every student got taller each year. Does this information lead you to change your choice of priors? How?

### Solution

The beta should be positive only: 
we could use lower bound or 
[log-normal distribution](https://en.wikipedia.org/wiki/Log-normal_distribution).

```{r}
df <- tibble(dist = c(dist_lognormal(1, 0.5)))

ggplot(df, aes(y = 1)) +
  stat_dist_halfeye(aes(dist = dist))
```

## Book question 4M6

Now suppose I tell you that the variance among heights for students of the same age is never more than 64 cm. How does this lead you to revise your priors?

### Solution

We could use uniform distribution (sigma ~ Uniform(0, 8)) or an upper bound.

## Book question 4H1

The weights listed below were recorded in the !Kung census, but heights were not recorded for these individuals. Provide predicted heights and 89% intervals for each of these individuals. That is, fill in the table below, using model-based predictions.

| Individual  | weight  | expected height  | 89% interval  |
|:-:|:-:|:-:|---|
| 1   | 46.95   |   |   |
| 2  | 43.72  |   |   |
| 3  | 64.78  |   |   |
| 4  | 32.59  |   |   |
| 5  | 54.63  |   |   |

### Solution

```{r}
data(Howell1, package = "rethinking")
d <- as_tibble(Howell1)
d2 <- d %>% filter(age >= 18)

m <-
  brm(
    data = d2,
    family = gaussian,
    height ~ 1 + weight,
    prior = c(
      prior(normal(100, 30), class = Intercept),
      prior(normal(0, 10), class = b, lb = 0),
      prior(exponential(10), class = sigma)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    save_pars = save_pars(all = T),
    sample_prior = T,
    seed = 4,
    file = file.path(models_folder, "04_hw4h1")
  )
nd <- tibble(weight = c(46.95, 43.72, 64.78, 32.59, 54.63))
p89 <- c((1-0.89)/2, ((1-0.89)/2) + 0.89)
# fitted(m, newdata = nd, probs = p89)
predict(m, newdata = nd, probs = p89)
```


## Book question 4H2

Select out all the rows in the Howell1 data with ages below 18 years of age. If you do it right, you should end up with a new data frame with 192 rows in it.

(a) Fit a linear regression to these data, using quap. Present and interpret the estimates. For every 10 units of increase in weight, how much taller does the model predict a child gets?
(b) Plot the raw data, with height on the vertical axis and weight on the horizontal axis. Superimpose the MAP regression line and 89% interval for the mean. Also superimpose the 89% interval for predicted heights.
(c) What aspects of the model fit concern you? Describe the kinds of assumptions you would change, if any, to improve the model. You don’t have to write any new code. Just explain what the model appears to be doing a bad job of, and what you hypothesize would be a better model.

### Solution

```{r}
data(Howell1, package = "rethinking")
d <- as_tibble(Howell1)
d1 <- d %>% filter(age < 18) %>% mutate(weight_c = weight - mean(weight))

m1 <- 
  brm(
    data = d1,
    family = gaussian,
    height ~ 1 + weight_c,
    prior = c(
      prior(normal(100, 30), class = Intercept),
      prior(normal(0, 10), class = b, lb = 0),
      prior(exponential(10), class = sigma)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    save_pars = save_pars(all = T),
    sample_prior = T,
    seed = 4,
    file = file.path(models_folder, "04_hw4h2")
  )
summary(m1)

nd <- tibble(weight = seq(5, 45), weight_c = weight - mean(d1$weight))
f <- fitted(m1, newdata = nd) %>% bind_cols(nd)
p <- predict(m1, newdata = nd) %>% bind_cols(nd)

d1 %>% 
  ggplot(aes(x = weight, y = height)) + 
  geom_ribbon(
    aes(x = weight, y = Estimate, ymin = Q2.5, ymax = Q97.5), 
    data = p, fill = "grey"
  ) + 
  geom_ribbon(
    aes(x = weight, y = Estimate, ymin = Q2.5, ymax = Q97.5), 
    data = f, fill = "blue"
  ) + 
  geom_point()
```

(a) On average, we expect an increase by 2.7 cm for each 1 kg.
(b) See above.
(c) The relationship between weight and height is not linear. Note the model underestimates the height for central weight values and overestimates for low/high values.
Maybe a quadratic or logarithmic model would be better here.