
correlation of two tests treating items of both tests as random effects?

3 messages · Jonathan Baron, Jake Westfall, David Duffy

#
I thought I had a solution to this problem, but I don't. The problem
is very simple to state. It is to find whether one test correlates
with another, when each test has several items sampled from a larger
population of potential items.

I give two psychological tests to a group of subjects. Each test can
be seen as a sample of items from a population. Vocabulary tests and
arithmetic problems are examples.[1] Usually researchers just get a
total score on each test and look at the Pearson correlation. And
usually this is fine because the correlation is high enough that its
existence is not in doubt, and the magnitude of the correlation is of
primary interest.

But sometimes some theoretical question hinges on whether the tests
correlate at all. They could correlate spuriously because of the
particular sample of items used in each test. So one way to handle
this is to think of items as random effects.

It is easy to do this with lmer() when ONE of the two tests is treated
as a random effect. Each observation is the subject's score on one
item of that test (test 1), and the summary score of the other test
(test 2) is the predictor. The model has crossed random effects for
subjects and test 1 items. The number of rows in the data frame is
(number of subjects) times (number of items in test 1).
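In lmer() syntax that model might look like the following sketch. The data and all variable names (score1, total2, item1) are illustrative, not from the actual study:

```r
library(lme4)

set.seed(1)
n_subj <- 40; n_item1 <- 10

# Toy long-format data: one row per subject x test-1 item.
d <- expand.grid(subject = factor(1:n_subj), item1 = factor(1:n_item1))
ability  <- rnorm(n_subj)                     # latent subject ability
total2   <- ability + rnorm(n_subj, sd = 0.5) # summary score on test 2
item_eff <- rnorm(n_item1, sd = 0.3)          # test-1 item effects
d$total2 <- total2[d$subject]
d$score1 <- ability[d$subject] + item_eff[d$item1] + rnorm(nrow(d))

# Crossed random effects for subjects and test-1 items:
fit <- lmer(score1 ~ total2 + (1 | subject) + (1 | item1), data = d)
```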

I thought it might be possible to extend this idea by making each row
consist of a subject's score on one item of test 1 and her score on
one item of test 2. The total number of rows would be (number of
subjects) times (number of items in test 1) times (number of items in
test 2). And I would include crossed random effects for subjects, test
1 items, and test 2 items. But then what? Do I just predict one test
from the other, as before? (The direction may matter, but that is the
least of my worries.)
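For concreteness, the fully crossed data frame described above could be assembled as follows (sizes and names are made up):

```r
# One row per (subject, test-1 item, test-2 item) triple.
n_subj <- 30; n_item1 <- 8; n_item2 <- 12
d <- expand.grid(subject = factor(1:n_subj),
                 item1   = factor(1:n_item1),
                 item2   = factor(1:n_item2))
nrow(d)   # 30 * 8 * 12 = 2880 rows
# Each row would carry the subject's score on that test-1 item and on
# that test-2 item, with (1 | subject) + (1 | item1) + (1 | item2) as
# the crossed random effects; what the fixed part should be is exactly
# the open question here.
```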

I'm stuck. And this may be a blind alley.

Jon

Note:

[1] Not all psychological tests are like this. Some are designed to
represent a balance between different items so that only the test as a
whole, not each item, measures the trait of interest correctly.
#
Hi Jon,

It's an interesting problem. I just put up a GitHub gist where I write a function to simulate data with the structure you describe, then fit what I think are the appropriate models (estimating OR ignoring the random item variance) with lmer() and do the model comparisons:

https://gist.github.com/jake-westfall/3b9b4aee0c980a279acb

I ran the simulation 1000 times with the true correlation at 0.5, and 1000 times with it at 0. The true values are recovered quite well, so the approach seems reasonable. However, at least for the parameter values I tried, adding random item variance made essentially no difference to the estimates/tests of the subject-level correlation in test scores. There's a little difference, but it's hardly worth mentioning.

Basically I set up the data frame so that, if there are m items on each test, then each subject has 2*m rows in the data frame, m for each test. Then the model consists of two dummy variables indicating the test (no intercept/constant term), and these dummies vary randomly across subjects and items. You'll see in the model syntax that it's a bit hackish, but it seems to work.

Two other things to note about the approach I used: (a) the items from both tests are counted as a single random factor, although their variances are allowed to be different for each test; (b) the residual variance is constrained to be equal for observations from both tests, which is just an lmer() thing. In my sim I set parameter values as if the two tests are two different IQ tests, so it's fine. But this might be problematic for your actual data if the two tests are really different. You may need to scale items/observations before fitting the model or something.
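A sketch of a stacked model of this shape, on simulated data (the actual gist syntax may differ, and all names here are illustrative; the double-bar gives the two item variances separate but uncorrelated estimates):

```r
library(lme4)

set.seed(2)
n_subj <- 50; m <- 8                          # m items per test
u1 <- rnorm(n_subj)                           # subject effect, test 1
u2 <- 0.5 * u1 + sqrt(0.75) * rnorm(n_subj)   # corr(u1, u2) = 0.5

# Stacked frame: 2*m rows per subject; items of both tests in one factor.
d <- expand.grid(subject = factor(1:n_subj), item = factor(1:(2 * m)))
d$t1 <- as.numeric(as.integer(d$item) <= m)   # dummy: test-1 row
d$t2 <- 1 - d$t1                              # dummy: test-2 row
item_eff <- rnorm(2 * m, sd = 0.3)
d$y <- d$t1 * u1[d$subject] + d$t2 * u2[d$subject] +
       item_eff[d$item] + rnorm(nrow(d))

# No intercept; dummies vary over subjects (correlated) and items
# (separate, uncorrelated variances via the double-bar):
fit <- lmer(y ~ 0 + t1 + t2 +
              (0 + t1 + t2 | subject) +
              (0 + t1 + t2 || item), data = d)
VarCorr(fit)   # the subject block holds the correlation of interest
```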

Happy to hear comments from anyone else who read this far.

Jake

#
On Fri, 6 Nov 2015, Jonathan Baron wrote:

Isn't this just the usual question about how to run a multivariate mixed 
model in lme4?  So you stack both tests with an appropriate indicator 
variable, and read off the correlations from the RE variances/covariances?
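Reading the correlation off the random-effect (co)variances comes down to cov2cor() on the subject block; a toy 2x2 matrix with made-up values shows the arithmetic:

```r
# Illustrative subject-level covariance matrix (made-up values):
vc <- matrix(c(1.00, 0.45,
               0.45, 0.81),
             2, 2, dimnames = list(c("t1", "t2"), c("t1", "t2")))
cov2cor(vc)[1, 2]   # 0.45 / sqrt(1.00 * 0.81) = 0.5
# With a fitted stacked model: cov2cor(VarCorr(fit)[["subject"]])[1, 2]
```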

| David Duffy (MBBS PhD)
| email: David.Duffy at qimrberghofer.edu.au  ph: INT+61+7+3362-0217 fax: -0101
| Genetic Epidemiology, QIMR Berghofer Institute of Medical Research
| 300 Herston Rd, Brisbane, Queensland 4006, Australia  GPG 4D0B994A