Skip to content

M3 not working with torch

3 messages · Simon Urbanek, Gilberto Camara

#
Dear R-SIG-MAC

I bought a new MacBook Air with the M3 chip, which has 8 CPUs, 10 GPUs, and 16GB of integrated memory. My R `torch` apps are crashing. I have assembled an MWE that works on other Mac architectures, including MacBook Air M1 and MacMini. The OS is the same (Sonoma 14.5). The MWE follows:

```{r}
# ==== MWE

# Download the training samples
rds_file <- "https://raw.githubusercontent.com/e-sensing/sitsdata/master/inst/extdata/torch/train_samples.rds?raw=true"
dest_file <- paste0(tempdir(),"/train_samples.rds")
download.file(rds_file,
              destfile = dest_file,
              method = "curl")
train_samples <- readRDS(dest_file)

# Sample labels
labels <- c("Cerrado", "Forest", "Pasture", "Soy_Corn")

# Create numeric labels vector
code_labels <- seq_along(labels)
names(code_labels) <- labels

# Split the data into training and validation data sets
# Create partitions for different splits of the input data
frac <- 0.2
train_samples <- dplyr::group_by(train_samples, .data[["label"]])
test_samples <- train_samples |>
    dplyr::slice_sample(prop = frac) |>
    dplyr::ungroup()
    
# Remove the lines used for validation
sel <- !train_samples[["sample_id"]] %in% test_samples[["sample_id"]]
train_samples <- train_samples[sel, ]

# Shuffle the data
train_samples <- train_samples[sample(nrow(train_samples), nrow(train_samples)), ]
test_samples <- test_samples[sample(nrow(test_samples), nrow(test_samples)), ]

# Organize data for model training
train_x <- as.matrix(train_samples[, -2:0])
train_y <- unname(code_labels[train_samples[["label"]]])

# Create the test data
test_x <- as.matrix(test_samples[, -2:0])
test_y <- unname(code_labels[test_samples[["label"]]])

# Set torch seed
torch::torch_manual_seed(sample.int(10^5, 1))

# Avoid a global variable for 'self'
self <- NULL

# function to create a simple sequential NN module
.torch_linear_relu_dropout <- torch::nn_module(
    classname = "torch_linear_batch_norm_relu_dropout",
    initialize = function(input_dim,
                          output_dim,
                          dropout_rate) {
        self$block <- torch::nn_sequential(
            torch::nn_linear(input_dim, output_dim),
            torch::nn_relu(),
            torch::nn_dropout(dropout_rate)
        )
    },
    forward = function(x) {
        self$block(x)
    }
)

# Define the MLP architecture
mlp_model <- torch::nn_module(
    initialize = function(num_pred, layers, dropout_rates, y_dim) {
        tensors <- list()
        # input layer
        tensors[[1]] <- .torch_linear_relu_dropout(
            input_dim = num_pred,
            output_dim = 512,
            dropout_rate = 0.40
        )
        # output layer
        tensors[[length(tensors) + 1]] <-
            torch::nn_linear(layers[length(layers)], y_dim)
        # add softmax tensor
        tensors[[length(tensors) + 1]] <- torch::nn_softmax(dim = 2)
        # create a sequential module that calls the layers in the same
        # order.
        self$model <- torch::nn_sequential(!!!tensors)
    },
    forward = function(x) {
        self$model(x)
    }
)
# Train the model using luz

torch_model <- luz::setup(
    module = mlp_model,
    loss = torch::nn_cross_entropy_loss(),
    metrics = list(luz::luz_metric_accuracy()),
    optimizer = torch::optim_adamw,
)
torch_model <- luz::set_hparams(
    torch_model,
    num_pred = ncol(train_x),
    layers = 512,
    dropout_rates = 0.3,
    y_dim = length(code_labels)
)
torch_model <- luz::set_opt_hparams(
    torch_model,
    lr = 0.001,
    eps = 1e-08,
    weight_decay = 1.0e-06
)
torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
  verbose = TRUE
)

```
The error occurs in the `luz::fit` function. Inside RStudio, the code gets stuck, and then RStudio asks to restart R. When running R from the terminal, the output is:
```{r}
 *** caught bus error ***
address 0x16daa0000, cause 'invalid alignment'

 *** caught segfault ***
address 0x9, cause 'invalid permissions'
zsh: segmentation fault  R
```
The `sessionInfo()` output is as follows:

```{r}

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         zeallot_0.1.0    
 [5] rlang_1.1.3       processx_3.8.4    generics_0.1.3    torch_0.12.0.9000
 [9] coro_1.0.4        glue_1.7.0        bit_4.0.5         prettyunits_1.2.0
[13] luz_0.4.0         ps_1.7.6          hms_1.1.3         fansi_1.0.6      
[17] tibble_3.2.1      progress_1.2.3    lifecycle_1.0.4   compiler_4.4.0   
[21] dplyr_1.1.4       fs_1.6.4          Rcpp_1.0.12       pkgconfig_2.0.3  
[25] rstudioapi_0.16.0 R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
[29] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    tools_4.4.0      
[33] bit64_4.0.5     
```

Any clues will be most appreciated. 

Thanks
Gilberto
============================
Prof Dr Gilberto Camara
Senior Researcher
National Institute for Space Research (INPE), Brazil
https://gilbertocamara.org/
=============================
#
Gilberto,

since luz is a contributed package, you should probably start first by asking the authors. Given that the torch ecosystem is quite complex and has several layers that need to work together, even if you talk to them, you probably need to add details such as exact versions used (including the torch and metal layers) and how you installed the pieces (I know you helpfully supplied sesisonInfo() but I suspect that info such as exact torch run-time is pertinent as well). Next step would be to trace the error - check the system crash reporter or run R -d lldb to find out the exact library the crash happens in which may give you more clues. I don't have any M3 machines so I can't check myself, unfortunately.

Cheers,
Simon
#
Dear Simon

Many thanks for your response and suggestions. I have raised an issue in the R torch repository: 

https://github.com/mlverse/torch/issues/1167

All the best
Gilberto
============================
Prof Dr Gilberto Camara
Senior Researcher
National Institute for Space Research (INPE), Brazil
https://gilbertocamara.org/
=============================