The empirical checkerboard copula with known margins is a copula model that uses prior information on a multidimensional margin to condition a checkerboard construction on the rest of the copula. This vignette describes the model in the first section, and then discusses the current implementation in the second section.

Empirical checkerboard copula with known margins

Preliminary notations

First of all, for a given dimension $d$ of the data at hand and a given checkerboard parameter $m$, consider the set of multi-indexes $\mathcal{I} = \{\mathbf{i} = (i_1,\dots,i_d) \in \{1,\dots,m\}^d\}$, which indexes the boxes:

$$B_{\mathbf{i}} = \left]\frac{\mathbf{i}-1}{m},\frac{\mathbf{i}}{m}\right]$$

partitioning the space $\mathbb{I}^d = [0,1]^d$. Denote the set of those boxes by $\mathcal{B}_\mathcal{I} = \left\{B_{\mathbf{i}}, \mathbf{i} \in \mathcal{I}\right\}$. Furthermore, choose $p$ dimensions that will be assigned to a known copula by setting $J \subset \{1,\dots,d\}$, $|J| = p$, and define the corresponding projections of the boxes:

$$B^J_{\mathbf{i}} = \left\{ \mathbf{x} \in [0,1]^p, x_j \in \left]\frac{i_j-1}{m},\frac{i_j}{m}\right] \forall j \in J \right\}$$

$$B^{-J}_{\mathbf{i}} = \left\{ \mathbf{x} \in [0,1]^{d-p}, x_j \in \left]\frac{i_j-1}{m},\frac{i_j}{m}\right] \forall j \notin J \right\}$$

such that $B_{\mathbf{i}} = B^J_{\mathbf{i}} \times B^{-J}_{\mathbf{i}}$ for all $\mathbf{i} \in \mathcal{I}$. The product is understood here with the dimensions of $\mathbb{I}^d = [0,1]^d$ taken in the right order, so that the dimensions of $B^J_{\mathbf{i}}$ end up in the positions indexed by $J$. One can think of the dimensions as re-ordered such that $J = \{1,\dots,p\}$.
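To make the notation concrete, here is a minimal base R sketch of how the multi-index of the box containing a point, and its two projections, could be computed. The helper `box_index` and the toy values of `u`, `m` and `J` are illustrative assumptions and not part of the package.

```r
# Since B_i = ](i-1)/m, i/m], a coordinate u falls in the box with index ceiling(m * u).
box_index <- function(u, m) {
  idx <- ceiling(u * m)
  idx[idx < 1] <- 1   # a coordinate exactly equal to 0 is assigned to the first box
  idx
}

# Two toy points in dimension d = 3 (hypothetical values).
u <- rbind(c(0.10, 0.55, 0.99),
           c(0.25, 0.50, 0.26))
m <- 4
J <- c(1, 2)   # hypothetical choice of the known margins, so p = 2

idx      <- box_index(u, m)          # one multi-index i per row
idx_J    <- idx[, J, drop = FALSE]   # multi-index of the projected box B_i^J
idx_notJ <- idx[, -J, drop = FALSE]  # multi-index of the projected box B_i^{-J}
```

Each row of `idx` identifies one box $B_{\mathbf{i}}$, and splitting its columns according to $J$ gives the two factors of the product $B^J_{\mathbf{i}} \times B^{-J}_{\mathbf{i}}$.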

Let now $\lambda$ be the dimension-unspecified Lebesgue measure on any power of $\mathbb{R}$, that is:

$$\forall d \in \mathbb{N}, \forall \mathbf{x},\mathbf{y} \in \mathbb{R}^d, \lambda(\left[\mathbf{x},\mathbf{y}\right]) = \prod\limits_{i=1}^{d} (y_i - x_i)$$

Let furthermore $\mu^J$ be a copula measure of dimension $p$, corresponding to the known multivariate margin associated with the marginals in $J$. Let also $\mu$ and $\hat{\mu}$ be dimension-unspecific versions of, respectively, the true copula measure of the sample at hand and the classical Deheuvels empirical copula, that is:

  • For $n$ i.i.d. observations of the copula of dimension $d$, let $\forall i \in \{1,\dots,d\}, \, R_i^1,\dots,R_i^n$ be the marginal ranks for the variable $i$, normalized to lie in $[0,1]$.
  • $\forall \mathbf{x} \in \mathbb{I}^d$, let $\hat{\mu}([0,\mathbf{x}]) = \frac{1}{n} \sum\limits_{k=1}^n \mathbb{1}_{R_1^k\le x_1,\dots,R_d^k\le x_d}$
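The empirical copula above can be sketched in a few lines of base R. The function name `emp_copula` and the toy data are assumptions made for illustration. The ranks are divided by $n+1$, which is one common normalization.

```r
# Deheuvels-type empirical copula evaluated at a point x of [0,1]^d.
# `pseudo_obs` is an n x d matrix of normalized ranks.
emp_copula <- function(x, pseudo_obs) {
  below <- sweep(pseudo_obs, 2, x, FUN = "<=")  # compare column j with x_j
  mean(rowSums(below) == ncol(pseudo_obs))      # share of points with all coordinates <= x
}

set.seed(42)
raw <- matrix(rnorm(60), ncol = 3)                   # toy data: n = 20, d = 3
pseudo_obs <- apply(raw, 2, rank) / (nrow(raw) + 1)  # normalized marginal ranks

emp_copula(c(1, 1, 1), pseudo_obs)    # every observation is counted
emp_copula(c(0.5, 1, 1), pseudo_obs)  # only the constraint on the first margin is active
```

The measure of a box $\hat{\mu}(B_{\mathbf{i}})$ is then simply the proportion of pseudo-observations falling into that box.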

We are now ready to define the empirical checkerboard copula with known margins.

Definition, estimation and simulation procedures

The empirical checkerboard copula with known margins is the copula that corresponds to the following simulation procedure.

  • Simulate a sample from the known copula $\mu^J$, of dimension $p$, through any available method (this depends on the known copula model). Let $B_{\mathbf{i}}^J$ be the (projected) box containing this sample.
  • Sample one box $B_{\mathbf{i}}$ among all boxes with projection $B_{\mathbf{i}}^J$, with probability weights:
    • $\frac{\hat{\mu}(B_{\mathbf{i}})}{\hat{\mu}(B_{\mathbf{i}}^J)}$ if the projected box $B_{\mathbf{i}}^J$ contains one or more (empirical) data points, that is $\hat{\mu}(B_{\mathbf{i}}^J) \neq 0$
    • $\frac{\lambda(B_{\mathbf{i}})}{\lambda(B_{\mathbf{i}}^J)}$ otherwise.
  • Simulate the remaining coordinates uniformly from $B_{\mathbf{i}}^{-J}$.
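The three steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the function `r_cbkm_sketch`, its arguments, and the sampler `r_known` (any function returning an `n x p` matrix drawn from the known copula) are hypothetical names.

```r
r_cbkm_sketch <- function(n_sim, pseudo_obs, m, J, r_known) {
  d    <- ncol(pseudo_obs)
  notJ <- setdiff(seq_len(d), J)
  box_of <- function(u) pmax(ceiling(u * m), 1)   # box multi-indexes, row by row

  # Boxes of the observed data, with a text key for their projection on J.
  data_idx   <- box_of(pseudo_obs)
  data_key_J <- apply(data_idx[, J, drop = FALSE], 1, paste, collapse = "-")

  out <- matrix(NA_real_, n_sim, d)
  out[, J] <- r_known(n_sim)                      # step 1: simulate the known part
  sim_key_J <- apply(box_of(out[, J, drop = FALSE]), 1, paste, collapse = "-")

  for (k in seq_len(n_sim)) {
    hits <- which(data_key_J == sim_key_J[k])     # data points in the projected box
    if (length(hits) > 0) {
      # step 2, non-empty case: picking one of these data points uniformly
      # selects the box B_i with probability mu_hat(B_i) / mu_hat(B_i^J).
      pick   <- hits[sample.int(length(hits), 1)]
      i_notJ <- data_idx[pick, notJ]
    } else {
      # step 2, empty case: Lebesgue weights, i.e. all boxes equally likely.
      i_notJ <- sample.int(m, length(notJ), replace = TRUE)
    }
    # step 3: uniform draw inside the projected box B_i^{-J}.
    out[k, notJ] <- (i_notJ - runif(length(notJ))) / m
  }
  out
}

# Toy usage: d = 3, independence assumed as the known copula on margins 1 and 2.
set.seed(1)
raw <- matrix(rnorm(150), ncol = 3)
pseudo_obs <- apply(raw, 2, rank) / (nrow(raw) + 1)
r_indep <- function(n) matrix(runif(2 * n), ncol = 2)
simu <- r_cbkm_sketch(1000, pseudo_obs, m = 5, J = c(1, 2), r_known = r_indep)
```

Drawing a data point uniformly inside the projected box is an equivalent way of applying the empirical weights, which avoids tabulating all $m^d$ boxes.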

This algorithm first simulates the known part of the model (the dimensions in $J$) and then, conditionally, the checkerboard part, ensuring that the known copula is respected. The downside of this behavior is that the checkerboard part may have points outside the standard checkerboard boxes, making this part of the copula less sparse than a true checkerboard. It does, however, become sparser as soon as the data fit the known margins. On the other hand, this algorithm allows for a lot of flexibility, mainly on the following points:

  • The "grid" given by $\mathcal{B}_\mathcal{I}$ can be chosen more arbitrarily than $m^d$ boxes of the same volume, as long as it is a partition of $\mathbb{I}^d$.
  • The known copula is not restricted at all and can be chosen among all $p$-dimensional copulas.
  • The estimation of the checkerboard part can be turned into a more flexible patchwork construction, by replacing the independence copula inside the boxes with another one. See Durante2013, Durante2015, Durante2015a.

We are now going to properly define the measure associated with this simulation procedure, that is, the empirical checkerboard copula with known margins. Let $\nu$ be this measure and let $\mathbf{U}$ be a random vector drawn from it. Then $\forall \mathbf{x} \in \mathbb{I}^d$, following the above procedure, we have:

$$\nu([0,\mathbf{x}]) = \sum\limits_{{\mathbf{i}} \in \mathcal{I}} \mathbb{P}(\mathbf{U}^{-J} \in B_{\mathbf{i}}^{-J} \cap [0,\mathbf{x}^{-J}] | \mathbf{U}^{J} \in B_{\mathbf{i}}^{J} \cap [0,\mathbf{x}^{J}]) \mathbb{P}(\mathbf{U}^{J} \in B_{\mathbf{i}}^{J} \cap [0,\mathbf{x}^{J}])$$

The unconditional term is easy to handle, since it is the measure associated with the known copula: $\mathbb{P}(\mathbf{U}^{J} \in B_{\mathbf{i}}^{J} \cap [0,\mathbf{x}^{J}]) = \mu^J(B_{\mathbf{i}}^{J} \cap [0,\mathbf{x}^{J}])$. The conditional term can be treated according to the algorithm: it is $\frac{\lambda(B_{\mathbf{i}}^{-J} \cap [0,\mathbf{x}^{-J}])}{\lambda(B_{\mathbf{i}}^{-J})}$ inside a box chosen with a probability weight that depends on whether $\hat{\mu}(B_{\mathbf{i}}^J) \neq 0$. We finally get the following definition:

The empirical checkerboard copula with parameter $m$ and with set of known margins $J$ following the measure $\mu^J$ is the copula corresponding to the measure $\nu$ given by:

$$\nu([0,\mathbf{x}]) = \sum\limits_{{\mathbf{i}} \in \mathcal{I}} \mu^J(B_{\mathbf{i}}^{J} \cap [0,\mathbf{x}^{J}]) \frac{\lambda(B_{\mathbf{i}}^{-J} \cap [0,\mathbf{x}^{-J}])}{\lambda(B_{\mathbf{i}}^{-J})} \left[ \frac{\hat{\mu}(B_{\mathbf{i}})}{\hat{\mu}(B_{\mathbf{i}}^J)} \mathbb{1}_{\hat{\mu}(B_{\mathbf{i}}^J) \neq 0} + \frac{\lambda(B_{\mathbf{i}})}{\lambda(B_{\mathbf{i}}^J)} \mathbb{1}_{\hat{\mu}(B_{\mathbf{i}}^J) = 0} \right]$$

The next section discusses the current implementation of this copula.
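As a sanity check on this definition, one can verify that the margin of $\nu$ on $J$ is indeed $\mu^J$. The short derivation below is a sketch that reads $\hat{\mu}(B_{\mathbf{i}}^J)$ as the empirical mass of $B_{\mathbf{i}}^J \times [0,1]^{d-p}$, and splits each multi-index $\mathbf{i}$ into its parts $\mathbf{i}^J$ and $\mathbf{i}^{-J}$ (a notation introduced here for convenience). Taking $\mathbf{x}^{-J} = \mathbf{1}$ makes the ratio of Lebesgue measures equal to 1, so that:

```latex
\begin{aligned}
\nu([0,(\mathbf{x}^{J},\mathbf{1})])
  &= \sum_{\mathbf{i}^{J}} \mu^J(B_{\mathbf{i}}^{J} \cap [0,\mathbf{x}^{J}])
     \underbrace{\sum_{\mathbf{i}^{-J}} \left[
       \frac{\hat{\mu}(B_{\mathbf{i}})}{\hat{\mu}(B_{\mathbf{i}}^J)} \mathbb{1}_{\hat{\mu}(B_{\mathbf{i}}^J) \neq 0}
       + \frac{\lambda(B_{\mathbf{i}})}{\lambda(B_{\mathbf{i}}^J)} \mathbb{1}_{\hat{\mu}(B_{\mathbf{i}}^J) = 0}
     \right]}_{=\,1} \\
  &= \sum_{\mathbf{i}^{J}} \mu^J(B_{\mathbf{i}}^{J} \cap [0,\mathbf{x}^{J}])
   = \mu^J([0,\mathbf{x}^{J}]).
\end{aligned}
```

The inner sum equals 1 in both cases: if $\hat{\mu}(B_{\mathbf{i}}^J) \neq 0$, the masses $\hat{\mu}(B_{\mathbf{i}})$ of the boxes sharing the projection $B_{\mathbf{i}}^J$ add up to $\hat{\mu}(B_{\mathbf{i}}^J)$. Otherwise each of the $m^{d-p}$ boxes receives the weight $\lambda(B_{\mathbf{i}})/\lambda(B_{\mathbf{i}}^J) = m^{-(d-p)}$. The last equality holds because the projected boxes partition $\mathbb{I}^p$.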

Current implementation

The package implements the empirical checkerboard copula with known margins through the cbkmCopula class. The constructor of the class takes several arguments :

  • x, the data, either raw or already given as pseudo-observations (see pseudo).
  • m=nrow(x), the checkerboard parameter.
  • pseudo=FALSE, whether the data are already given in pseudo-observation form.
  • margins_numbers=NULL, the indexes of the margins associated with the known copula, denoted $J$ above.
  • known_cop=NULL, the known copula to be applied to those margins: a copula object of the right dimension.

For example, let's take the LifeCycleSavings data:

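library(cort) # assumption: the package providing cbCopula and cbkmCopula
set.seed(1)
data("LifeCycleSavings")
# Pseudo-observations: marginal ranks rescaled to (0,1)
dataset <- (apply(LifeCycleSavings,2,rank,ties.method="max")/(nrow(LifeCycleSavings)+1))
pairs(dataset,col="2",lower.panel=NULL)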
[Figure: pairs plot of the original pseudo-observations]

Let's now estimate a checkerboard copula on margins 2 and 3 with a fine parameter $m=25$, and consider it to be known.

  known_margins <- c(2,3)
  known_copula <- cbCopula(x = dataset[,known_margins],m = 25,pseudo = TRUE)

Then we can construct the ECBC with this known margin :

  (cop <- cbkmCopula(x = dataset,m = 5,pseudo = TRUE,margins_numbers = known_margins,known_cop = known_copula))
#> This is a cbkmCopula , with : 
#>    dim = 5 
#>    n = 50 
#>    m = 5 5 5 5 5 
#> The variables names are :  sr pop15 pop75 dpi ddpi 
#> The variables  2 3  have a known copula  given by :
#>  This is a cbCopula , with : 
#>     dim = 2 
#>     n = 50 
#>     m = 25 25 
#>  The variables names are :  pop15 pop75

We can then simulate from it :

  simu <- rCopula(1000,cop)
  pairs(rbind(simu,dataset),col=c(rep("black",nrow(simu)),rep("red",nrow(dataset))),gap=0,lower.panel = NULL,cex=0.5)
[Figure: pairs plot of the original data (red) and simulated data from a good model (black)]

You can see that the known margins were respected, which is the whole point of this model. Let's now see an example with a clearly misspecified copula for the known margins:

  bad_known_copula <- cbCopula(x = dataset[,known_margins],m = 2,pseudo = TRUE)
  cop <- cbkmCopula(x = dataset,m = 5,pseudo = TRUE,margins_numbers = known_margins,known_cop = bad_known_copula)

  simu <- rCopula(1000,cop)
  pairs(rbind(simu,dataset),col=c(rep("black",nrow(simu)),rep("red",nrow(dataset))),gap=0,lower.panel = NULL,cex=0.5)
[Figure: pairs plot of the original data (red) and simulated data from a wrong model (black)]

This example shows how sensitive the estimation of the other parts of the model is to the quality of the prior estimate. The conditioning degraded the dependence inside the known multidimensional margin, but also the dependencies involving one of the variables from those known margins. The checkerboard construction for the remaining part of the copula, however, was not harmed.

What happens now if the two parts are clearly independent?


  true_copula1 <- known_copula
  true_copula2 <- bad_known_copula

  dataset <- cbind(rCopula(1000,true_copula1),rCopula(1000,true_copula2))
  colnames(dataset) <- c("u","v","w","x")
  pairs(dataset,lower.panel=NULL,cex=0.5)
[Figure: pairs plot of the original data with independence]

  cop <- cbkmCopula(x = dataset,m = 5,pseudo = TRUE,margins_numbers = c(1,2),known_cop = true_copula1)
  simu <- rCopula(500,cop)
  pairs(rbind(simu,dataset),col=c(rep("black",nrow(simu)),rep("red",nrow(dataset))),gap=0,lower.panel = NULL,cex=0.5)
[Figure: pairs plot of the original data (red) and simulated data (black), independence case]

This fit is quite good.
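To complement the visual comparison with a number, one possibility is to compare the proportion of points falling in each box of a grid for the two samples. The helpers `box_frequencies` and `max_box_gap` below are hypothetical base R functions written for this illustration, not functions of the package, and the two uniform samples only stand in for real data.

```r
# Relative frequencies of a sample over the regular grid with m boxes per dimension.
box_frequencies <- function(u, m) {
  idx <- pmax(ceiling(u * m), 1)
  # One factor per dimension, with all m levels so that empty boxes are kept.
  factors <- lapply(seq_len(ncol(u)),
                    function(j) factor(idx[, j], levels = seq_len(m)))
  table(factors) / nrow(u)
}

# Largest absolute difference in box mass between two samples.
max_box_gap <- function(u1, u2, m) {
  max(abs(box_frequencies(u1, m) - box_frequencies(u2, m)))
}

set.seed(2)
a <- matrix(runif(2000), ncol = 2)   # stand-in for the original pseudo-observations
b <- matrix(runif(2000), ncol = 2)   # stand-in for simulated points
max_box_gap(a, a, m = 5)             # identical samples give a gap of 0
max_box_gap(a, b, m = 5)             # small when both samples follow the same copula
```

With the objects of the last example, a call such as `max_box_gap(simu[, 1:2], dataset[, 1:2], m = 5)` would compare the simulated and original points on the known margins. A small gap would support the visual impression.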