
## Article Contents

- 1. INTRODUCTION
- 2. DATA, MODEL AND LIKELIHOOD
- 3. SIEVE ESTIMATION AND INFERENCE
- 4. A SIMULATION STUDY
- 5. AN APPLICATION
- 6. CONCLUDING REMARKS
- ACKNOWLEDGEMENT
- SUPPLEMENTARY MATERIAL

## Case-cohort studies with interval-censored failure time data


Q. Zhou, H. Zhou, J. Cai, Case-cohort studies with interval-censored failure time data, Biometrika , Volume 104, Issue 1, March 2017, Pages 17–29, https://doi.org/10.1093/biomet/asw067


The case-cohort design has been widely used as a means of cost reduction in collecting or measuring expensive covariates in large cohort studies. The existing literature on the case-cohort design is mainly focused on right-censored data. In practice, however, the failure time is often subject to interval-censoring: it is known to fall only within some random time interval. In this paper, we consider the case-cohort study design for interval-censored failure time and develop a sieve semiparametric likelihood method for analysing data from this design under the proportional hazards model. We construct the likelihood function using inverse probability weighting and build the sieves with Bernstein polynomials. The consistency and asymptotic normality of the resulting regression parameter estimator are established, and a weighted bootstrap procedure is considered for variance estimation. Simulations show that the proposed method works well in practical situations, and an application to real data is provided.


Online ISSN 1464-3510. Copyright © 2024 Biometrika Trust.


## Statistics > Methodology

Title: Improving estimation efficiency of case-cohort study with interval-censored failure time data

Abstract: The case-cohort design is a commonly used cost-effective sampling strategy for large cohort studies, where some covariates are expensive to measure or obtain. In this paper, we consider regression analysis under a case-cohort study with interval-censored failure time data, where the failure time is only known to fall within an interval instead of being exactly observed. A common approach to analyzing data from a case-cohort study is inverse probability weighting, where only subjects in the case-cohort sample are used in estimation, weighted by the probability of inclusion into the case-cohort sample. This approach, though consistent, is generally inefficient, as it does not incorporate information outside the case-cohort sample. To improve efficiency, we first develop a sieve maximum weighted likelihood estimator under the Cox model based on the case-cohort sample, and then propose a procedure to update this estimator by using information in the full cohort. We show that the updated estimator is consistent, asymptotically normal, and more efficient than the original estimator. The proposed method can flexibly incorporate auxiliary variables to further improve estimation efficiency. We employ a weighted bootstrap procedure for variance estimation. Simulation results indicate that the proposed method works well in practical situations. A real study on diabetes is provided for illustration.



Biometrika, 3 Feb 2017, 104(1): 17–29. https://doi.org/10.1093/biomet/asw067. PMID: 28943643; PMCID: PMC5608290.


## 1. Introduction

In epidemiologic cohort studies, the outcomes of interest are often times to failure events, such as cancer, heart disease and HIV infection, which are relatively rare even after a long period of follow-up; the study cohorts are therefore usually made very large so as to yield reliable information about the effect of exposure variables on these rare failure times. In many cases, the exposure variables of interest are difficult or expensive to collect or measure, and with limited funds it can be prohibitive to obtain these variables for all subjects in a large cohort. Prentice (1986) proposed the case-cohort design, in which the expensive exposure variables are obtained only for a random sample of the study cohort, called the subcohort, as well as for subjects who have experienced the failure event during the follow-up period. Extensive research has been done on this design. Under the proportional hazards model, Prentice (1986) and Self & Prentice (1988) proposed pseudolikelihood approaches; Chen & Lo (1999) and Chen (2001) developed estimating equation methods; Marti & Chavance (2011) and Keogh & White (2013) proposed multiple imputation approaches; Scheike & Martinussen (2004) and Zeng & Lin (2014) considered maximum likelihood estimation; and Kang & Cai (2009) and Kim et al. (2013) developed weighted estimating equation approaches for case-cohort studies with multiple outcomes. Other related cost-effective sampling schemes include outcome-dependent sampling designs (Zhou et al., 2002; Ding et al., 2014). All of these designs and methods are primarily focused on right-censored data, where the failure time of interest is either exactly observed or right-censored. In practice, however, the occurrences of some failure events, such as HIV infection and diabetes, are not accompanied by any symptoms, and their detection relies on laboratory tests or physician diagnosis; the exact times to these failure events are not available.

In this paper, we consider the case-cohort study design for interval-censored failure time data, which arise when the failure time of interest is observed or known only to belong to a random time interval (Sun, 2006). Areas that often produce such data include epidemiologic studies, biomedical follow-up studies, demographic studies and the social sciences, where the study subjects are examined for the occurrence of the failure event only at discrete visits instead of being continuously monitored. One example is the Atherosclerosis Risk in Communities study, a longitudinal epidemiologic cohort study in which the participants were scheduled to be examined for health status every three years on average. In this study, the occurrence of a disease such as diabetes was known only to lie between two consecutive examinations, so only interval-censored data on time to the disease were available. Interval-censoring is a general type of censoring that includes left- and right-censoring as special cases. If a participant had developed the disease by the first follow-up examination U, we would have a left-censored observation denoted by (0, U]; if a participant had not yet developed the disease by the last follow-up examination V, we would obtain a right-censored observation denoted by (V, +∞); otherwise, the observation would be a finite time interval with both endpoints in (0, +∞). Here we consider the interval-censored case-cohort design, in which the expensive exposure variables are obtained only for a subcohort that is a simple random sample of the study cohort and for subjects who are known to have experienced the failure event, i.e., whose observed interval has a finite right endpoint.
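To make the censoring taxonomy concrete, here is a small Python sketch (a hypothetical helper, not code from the paper) that classifies a failure time against the examinations a subject actually attended:

```python
import math

def interval_censor(exam_times, event_time):
    """Classify a failure time against a subject's attended exam times.

    Returns (U, V, d1, d2): d1 = I(T <= U) flags a left-censored
    observation (0, U]; d2 = I(U < T <= V) a finite interval (U, V];
    d1 = d2 = 0 a right-censored observation (V, +inf).
    """
    exams = sorted(exam_times)
    if event_time <= exams[0]:
        # failure by the first attended exam: left-censored
        return exams[0], math.inf, 1, 0
    if event_time > exams[-1]:
        # no failure by the last attended exam: right-censored
        return 0.0, exams[-1], 0, 0
    # otherwise T is bracketed by two consecutive attended exams
    for lo, hi in zip(exams, exams[1:]):
        if lo < event_time <= hi:
            return lo, hi, 0, 1
```

Under the case-cohort design described above, covariates would then be collected for every subject with a finite right endpoint and for a random subcohort.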

To the best of our knowledge, there is to date no method in the literature that deals with the general interval-censored case-cohort design described above, although several papers discuss related issues. Gilbert et al. (2005) considered the case-cohort design for an HIV vaccine trial, treating the midpoint of the finite observed interval as the exact HIV infection time and then applying the method of Self & Prentice (1988), developed for right-censored case-cohort data, for the analysis. Li et al. (2008) presented a special interval-censored case-cohort design by assuming that the inspection time intervals are fixed and the same for all study subjects and that the number of time intervals does not change with the sample size. Li & Nan (2011) considered fitting the relative risk regression model to case-cohort sampled current status data, a special case of interval-censored data that arises when each study subject is examined only once for the occurrence of the failure event, so that the failure time is either left- or right-censored at the only examination. In this paper, we consider the case-cohort study design for general interval-censored failure time data and develop a novel semiparametric method for fitting the proportional hazards model to data arising from this design.

Many authors have studied regression analysis of interval-censored data, obtained by simple random sampling, under the proportional hazards model. Among others, Finkelstein (1986) considered maximum likelihood estimation under a discrete hazard assumption; Huang (1996) and Zeng et al. (2016) studied fully semiparametric maximum likelihood estimation for current status data and mixed-case interval-censored data, respectively; Satten (1996) proposed a marginal likelihood approach that avoids estimating the baseline hazard function but remains computationally intensive; Satten et al. (1998) developed a rank-based procedure using imputed failure times, where a parametric baseline hazard is assumed; Pan (2000) suggested a multiple imputation approach that is semiparametric but lacks theoretical justification; Lin et al. (2015) and Wang et al. (2016) represented the cumulative baseline hazard function as a monotone spline and then developed methods from Bayesian and frequentist perspectives, respectively, via two-stage Poisson data augmentations; Zhang et al. (2010) proposed a spline-based sieve semiparametric maximum likelihood method and proved that the resulting regression parameter estimator is asymptotically normal and efficient. Zhang et al. (2010) also motivated the sieve method, discussed the choice of basis functions, and provided a theoretical framework with rigorous proofs based on empirical process theory. Besides having attractive asymptotic properties under various scenarios (e.g. Huang & Rossini, 1997; Shen, 1998; Xue et al., 2004), the sieve method is easy to implement and computationally fast, as it usually involves far fewer parameters than a fully semiparametric method. In this paper, we focus on fitting the proportional hazards model to interval-censored data from the case-cohort design. We employ inverse probability weighting to construct the likelihood function and then, following the idea of Zhang et al. (2010), we develop a Bernstein-polynomial-based sieve likelihood estimation method. We also present a weighted bootstrap procedure for variance estimation.

## 2. Data, model and likelihood

Suppose that there are n independent subjects in a cohort study. Let T_i denote the failure time of subject i and Z_i a p-dimensional vector of covariates that may affect T_i. Suppose that the failure time is subject to interval-censoring and the full cohort data are denoted by

O_i = {U_i, V_i, Δ_1i = I(T_i ≤ U_i), Δ_2i = I(U_i < T_i ≤ V_i), Z_i},  i = 1,…, n,

where U_i and V_i are two random examination times, and (Δ_1i, 1 − Δ_1i − Δ_2i) indicate left- and right-censored observations, respectively.

Under our interval-censored case-cohort design, the covariates are obtained only for subjects in the subcohort and for those who are known to have experienced the failure event, i.e., Δ_1i = 1 or Δ_2i = 1. Let ξ_i indicate, by the value 1 versus 0, whether the covariate Z_i is obtained, i = 1,…, n. Then the observed data under our interval-censored case-cohort design can be represented by

O_i^ξ = {U_i, V_i, Δ_1i, Δ_2i, ξ_i Z_i, ξ_i},  i = 1,…, n.

Since the covariates under our design can be considered missing at random, we employ inverse probability weighting to construct the likelihood function. In particular, suppose that the failure time follows the proportional hazards model, under which the conditional cumulative hazard function of T_i given Z_i has the form

Λ(t | Z_i) = Λ(t) exp(β′Z_i),  (1)

where β is a p-dimensional regression parameter and Λ(t) is an unspecified cumulative baseline hazard function. Assume that T_i is conditionally independent of the examination times (U_i, V_i) given Z_i and that the joint distribution of (U_i, V_i, Z_i) does not involve the parameters (β, Λ). Then the inverse probability weighted log-likelihood function has the form

l_n^w(β, Λ) = Σ_{i=1}^n w_i (Δ_1i log[1 − exp{−Λ(U_i) e^{β′Z_i}}] + Δ_2i log[exp{−Λ(U_i) e^{β′Z_i}} − exp{−Λ(V_i) e^{β′Z_i}}] − (1 − Δ_1i − Δ_2i) Λ(V_i) e^{β′Z_i}),  (2)

where the weight w_i is

w_i = ξ_i / π_q(Δ_1i, Δ_2i) = ξ_i / {Δ_1i + Δ_2i + (1 − Δ_1i − Δ_2i) q},

with q the probability that a subject is selected into the subcohort.
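To make the weighting concrete, here is a minimal Python sketch (the paper's own implementation is in Matlab; this illustration assumes a scalar covariate for simplicity): the inclusion probability π_q(Δ_1, Δ_2) = Δ_1 + Δ_2 + (1 − Δ_1 − Δ_2)q from the Appendix gives the weight, and the weighted log-likelihood sums the usual proportional-hazards interval-censoring terms:

```python
import math

def weight(d1, d2, xi, q):
    """w_i = xi_i / pi_q(d1, d2), with inclusion probability
    pi_q = d1 + d2 + (1 - d1 - d2) * q."""
    return xi / (d1 + d2 + (1 - d1 - d2) * q)

def weighted_loglik(data, beta, Lam, q):
    """Inverse probability weighted log-likelihood under the
    proportional hazards model; `data` holds tuples
    (U, V, d1, d2, z, xi) and `Lam` is a candidate cumulative
    baseline hazard function (scalar covariate z for simplicity)."""
    total = 0.0
    for U, V, d1, d2, z, xi in data:
        w = weight(d1, d2, xi, q)
        if w == 0.0:
            continue  # covariate not observed: no contribution
        eta = math.exp(beta * z)
        if d1:    # left-censored: log pr(T <= U)
            total += w * math.log(1.0 - math.exp(-Lam(U) * eta))
        elif d2:  # interval-censored: log pr(U < T <= V)
            total += w * math.log(math.exp(-Lam(U) * eta)
                                  - math.exp(-Lam(V) * eta))
        else:     # right-censored: log pr(T > V) = -Lam(V) * eta
            total += w * (-Lam(V) * eta)
    return total
```

Maximizing this function over β and a sieve for Λ is the subject of the next section.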

## 3. Sieve estimation and inference

Now we consider the estimation of θ = ( β , Λ). Let

To estimate θ, it is natural to maximize the weighted log-likelihood (2). However, this is not easy, as l_n^w involves both the finite-dimensional regression parameter β and the infinite-dimensional nuisance parameter Λ. Since only the values of Λ at the examination times {U_i, V_i : i = 1,…, n} matter in the log-likelihood l_n^w, one may follow the conventional approach of taking the nonparametric maximum likelihood estimator of Λ to be a right-continuous nondecreasing step function with jumps only at the examination times and then maximizing l_n^w with respect to β and the jump sizes (Huang, 1996). However, such a fully semiparametric estimation method could involve a large number of parameters, p + 2n if there are no ties among {U_i, V_i : i = 1,…, n}. To ease the computational burden, following the idea of Zhang et al. (2010), we propose a sieve estimation approach via Bernstein polynomials. In particular, we define the sieve space as
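A minimal Python sketch of a Bernstein-polynomial sieve for the cumulative baseline hazard follows (the exact sieve space is defined in the paper; this sketch assumes the usual cumulative-sum-of-exponentials reparameterization to enforce monotonicity):

```python
import math

def bernstein_cumhaz(phi, lower, upper):
    """Nondecreasing Bernstein-polynomial approximation to the
    cumulative baseline hazard on [lower, upper].

    Monotonicity is enforced by the reparameterization
    coef_k = sum_{j <= k} exp(phi_j), so any real vector `phi`
    yields a valid nondecreasing Lambda."""
    m = len(phi) - 1                       # polynomial degree
    coef, run = [], 0.0
    for p in phi:
        run += math.exp(p)
        coef.append(run)                   # nondecreasing coefficients

    def Lam(t):
        x = (t - lower) / (upper - lower)  # map t into [0, 1]
        x = min(max(x, 0.0), 1.0)
        return sum(c * math.comb(m, k) * x**k * (1 - x)**(m - k)
                   for k, c in enumerate(coef))
    return Lam
```

With degree m, the sieve replaces the infinite-dimensional Λ by m + 1 free parameters, so the optimization involves only p + m + 1 unknowns.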

Assume that Conditions (C1)–(C5) given in the Appendix hold. If ν > 1/(2r), we have

in distribution, where

More guidelines and discussion on the choice of m will be given below. The Matlab code that implements the proposed inference procedure is available in the Supplementary Material .
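The weighted bootstrap can be sketched as below, assuming independent Exp(1) multipliers (mean 1, variance 1) in the spirit of Ma & Kosorok (2005); the `estimate` routine here is a hypothetical stand-in for re-maximizing the weighted log-likelihood under perturbed weights:

```python
import random
import statistics

def weighted_bootstrap_se(estimate, data, n_boot=100, rng=random):
    """Weighted bootstrap standard error for a regression estimator.

    Each replicate draws an independent Exp(1) multiplier per subject,
    perturbs the subjects' weights by it, and re-runs the estimator;
    the sample standard deviation of the replicates estimates the
    standard error."""
    reps = []
    for _ in range(n_boot):
        omegas = [rng.expovariate(1.0) for _ in data]
        reps.append(estimate(data, omegas))
    return statistics.stdev(reps)
```

In each replicate the subject-level weights w_i are multiplied by ω_i before re-maximization, so no resampling of subjects is needed.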

## 4. A simulation study

In this section, we perform a simulation study to evaluate the finite-sample performance of the proposed method. We assumed that the covariate Z had the standard normal distribution and that, given Z, the failure time T followed the proportional hazards model (1) with cumulative baseline hazard function Λ(t) = 0.2t². We considered β = 0 or log 2.

To generate interval-censored data {U_i, V_i, Δ_1i = I(T_i ≤ U_i), Δ_2i = I(U_i < T_i ≤ V_i) : i = 1,…, n}, we mimicked biomedical follow-up studies. In particular, we assumed that each study subject was scheduled to be examined at k different follow-up time points within the interval [0, τ] in addition to the baseline exam at time 0. More specifically, to mimic the Atherosclerosis Risk in Communities study, we chose k equally spaced time points over the interval [0, τ], denoted by e_1,…, e_k. For each subject, the k scheduled follow-up time points were generated as e_j plus a uniform random variable on [−τ/{3(k + 1)}, τ/{3(k + 1)}], j = 1,…, k. At each of these time points, a subject could miss the scheduled examination with probability ζ, independently of the examination results at other time points. For subject i, if the failure event had already occurred at the first follow-up examination, we defined U_i to be the first follow-up examination time, V_i = τ and (Δ_1i, Δ_2i) = (1, 0); if the failure event had not yet occurred at the last follow-up examination, we defined V_i to be the last follow-up examination time, U_i = 0 and (Δ_1i, Δ_2i) = (0, 0); otherwise, we defined U_i and V_i to be the two consecutive follow-up examination times bracketing T_i and set (Δ_1i, Δ_2i) = (0, 1). We used k = 8 and ζ = 0.2, and determined the length of study τ according to the desired proportion of events, i.e., subjects with Δ_1 = 1 or Δ_2 = 1; we considered event proportions of 0.05 and 0.15.
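The generation scheme above can be sketched in Python as follows (assumed details not fixed by the text: the scheduled times are taken as e_j = jτ/(k + 1), and T is drawn by inverting Λ(t) = 0.2t² against a unit exponential; this is not the paper's simulation code):

```python
import math
import random

def simulate_subject(beta, tau, k=8, zeta=0.2, rng=random):
    """Generate one subject's interval-censored observation
    (U, V, d1, d2, z) under the simulation scheme of Section 4."""
    z = rng.gauss(0.0, 1.0)
    # invert Lambda(t) * exp(beta * z) = E with E ~ Exp(1) and
    # Lambda(t) = 0.2 t^2, giving t = sqrt(5 E exp(-beta z))
    t = math.sqrt(5.0 * rng.expovariate(1.0) * math.exp(-beta * z))
    # k scheduled exams: equally spaced points plus uniform jitter,
    # each attended independently with probability 1 - zeta
    jitter = tau / (3.0 * (k + 1))
    exams = [j * tau / (k + 1) + rng.uniform(-jitter, jitter)
             for j in range(1, k + 1)]
    exams = [e for e in exams if rng.random() > zeta]
    if not exams:
        return 0.0, tau, 0, 0, z         # every visit missed
    if t <= exams[0]:
        return exams[0], tau, 1, 0, z    # left-censored
    if t > exams[-1]:
        return 0.0, exams[-1], 0, 0, z   # right-censored
    for lo, hi in zip(exams, exams[1:]):
        if lo < t <= hi:
            return lo, hi, 0, 1, z       # interval-censored
```

Repeating this for n subjects and applying Bernoulli sampling with probability q to the non-cases yields one simulated case-cohort dataset.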

Table 1. Simulation results for comparing different methods

Table 1 shows that the proposed estimator is virtually unbiased. The variance estimates based on the weighted bootstrap procedure are close to the corresponding empirical variances and yield reasonable coverage. In addition, under all situations considered, the proposed estimator is more efficient than the estimators based on the subcohort only or on a simple random sample of the same size as the case-cohort sample. In particular, when the cohort size is 500 or 1000 and the proportion of events is 0.05, the subcohort-based and simple-random-sample-based estimators yield larger biases and inflated variances, while the proposed estimator still performs well. We also conducted simulations with Λ(t) = 0.1t, k = 6, ζ = 0.3, q = 0.25 and m = 4 or 5, as well as other methods for generating interval-censored data, and obtained similar results. In particular, the results seem to be fairly robust to the choice of m.

## 5. An application

In this section, we illustrate the proposed method using data from the Atherosclerosis Risk in Communities study, a longitudinal epidemiologic observational study of men and women aged 45–64 at baseline, recruited from four US field centers: Forsyth County, NC (Center-F), Jackson, MS (Center-J), the Minneapolis suburbs, MN (Center-M) and Washington County, MD (Center-W). Forsyth County, the Minneapolis suburbs and Washington County include white participants, and Forsyth County and Jackson include African American participants. The study began in 1987, when the participants underwent an extensive baseline examination that collected medical, social and demographic data. The participants were scheduled to be re-examined on average every three years, with the first exam occurring in 1987–89, the second in 1990–92, the third in 1993–95 and the fourth in 1996–98. Some participants missed scheduled re-visits and thus had fewer than four follow-up examinations. For each participant, the occurrence of a disease such as diabetes could be detected only between two consecutive examinations, and therefore only interval-censored failure time data were available. We illustrate the proposed method by investigating the effect of high-density lipoprotein cholesterol level on the risk of diabetes, after adjusting for confounding variables and other risk factors, in white women younger than 55 years, based on data from an interval-censored case-cohort sample constructed as follows. The cohort of interest consists of 2799 white women younger than 55 years, of whom 202 were observed to have developed diabetes during the study. We selected a simple random sample of the cohort by Bernoulli sampling with selection probability q = 0.1. The resulting subcohort had 272 subjects and the final case-cohort sample had 451 subjects. We considered the proportional hazards model
Λ(t | Z) = Λ(t) exp(β′Z),

where the vector of covariates Z included high-density lipoprotein cholesterol level, total cholesterol level, body mass index, age, smoking status, and indicators for the field centers, with Center-M as the reference. We fitted this model using the proposed method and present the results in Table 2. For comparison, Table 2 also gives the analysis results based on the subcohort only. Regarding the degree of the Bernstein polynomials, we chose m = 3 for both analyses according to the AIC criterion described in Section 3. Table 2 shows that the proposed method based on the case-cohort sample yielded smaller standard errors and more significant results than the method based on the subcohort only. In particular, the results suggest that higher high-density lipoprotein cholesterol, lower total cholesterol and lower body mass index are significantly associated with a lower risk of diabetes in white women younger than 55 years.

Table 2. Analysis results for the diabetes data from the Atherosclerosis Risk in Communities study

## 6. Concluding remarks

There are some practical considerations for the implementation of the proposed design and method. First, under our design, the subcohort is a simple random sample of the cohort selected by independent Bernoulli sampling. When the subcohort is instead selected by sampling without replacement, our method should still work, though more complicated arguments would be needed to develop the asymptotic results (Saegusa & Wellner, 2013). Moreover, when some covariates are available for all cohort members, a stratified case-cohort design based on those covariates could be considered to improve study efficiency, and adapting our method to such a design should be straightforward. Second, regarding the degree of the Bernstein polynomials m, there does not seem to be a single true value. According to the simulation studies, the results seem to be fairly robust to the choice of m. In practice, we suggest considering several values, such as m = 3 to 8, and basing the selection on the AIC criterion. Although similar strategies are commonly used in the literature (e.g. Wang et al., 2016), further study of AIC and other model selection criteria in this setting would be valuable. Third, assessing the goodness-of-fit of the proportional hazards model is an important practical issue. Ren & He (2011) and Wang et al. (2006) considered this problem for univariate and correlated interval-censored data, respectively, obtained by simple random sampling. Extensions of these methods to the case-cohort design warrant future research. Lastly, as suggested by the Associate Editor, a missing data problem may arise when the covariates are not obtainable for some subjects in the case-cohort sample. Accommodating such situations would be practically useful and merits further investigation. Another interesting future research direction, suggested by a referee, is to consider cost-effective sampling designs for more general types of censored or truncated data (e.g. Turnbull, 1976; Huber et al., 2009).
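The suggested AIC-based choice of m can be sketched as follows; the `fit(m)` interface, returning the maximized weighted log-likelihood and the number of free parameters for degree m, is a hypothetical stand-in for the actual fitting routine:

```python
def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return 2.0 * n_params - 2.0 * loglik

def choose_degree(fit, degrees=range(3, 9)):
    """Pick the Bernstein-polynomial degree m minimizing AIC over the
    candidate degrees m = 3,...,8 suggested in the text."""
    best = None
    for m in degrees:
        loglik, n_params = fit(m)
        score = aic(loglik, n_params)
        if best is None or score < best[0]:
            best = (score, m)
    return best[1]
```

For the sieve of Section 3, the parameter count for degree m would be p + m + 1 (regression coefficients plus Bernstein coefficients).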


## Acknowledgments

The authors thank the Editor, Associate Editor and two referees for their valuable comments which have led to significant improvement of the paper. This work was partially supported by grants from the National Institutes of Health. The Atherosclerosis Risk in Communities study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts. The authors thank the staff and participants of the Atherosclerosis Risk in Communities study for their important contributions.

## Proofs of Theorems 1 and 2

In this appendix, we provide the proofs of Theorems 1 and 2. Denote the observation on a single subject under our interval-censored case-cohort design by O^ξ = {U, V, Δ_1 = I(T ≤ U), Δ_2 = I(U < T ≤ V), ξZ, ξ}, where U and V are two random examination times, (Δ_1, 1 − Δ_1 − Δ_2) indicate left- and right-censored observations, respectively, and ξ indicates whether the covariate Z is observed, with Pr(ξ = 1) ≡ π_q(Δ_1, Δ_2) = Δ_1 + Δ_2 + (1 − Δ_1 − Δ_2)q. Before proving the theorems, we first describe the regularity conditions needed:

There exists η > 0 such that P(V − U ≥ η) = 1. The union of the supports of U and V is contained in the interval [σ, τ], where 0 < σ < τ < +∞.

The distribution of Z has a bounded support and is not concentrated on any proper subspace of R^p. Also, E{var(Z | U)} and E{var(Z | V)} are positive definite.

The conditional density g(u, v | z) of (U, V) given Z has bounded partial derivatives with respect to u and v, and the bounds of these partial derivatives do not depend on (u, v, z).

0 < q ≤ π_q(Δ_1, Δ_2) ≤ 1, where q is a known constant.

Note that Conditions (C1)–(C4) are commonly used in studies of interval-censored data (Huang & Rossini, 1997; Zhang et al., 2010) and are usually satisfied in practice. In what follows, we prove Theorems 1 and 2 under these conditions by employing empirical process theory and nonparametric techniques. For the proofs, define Pf = ∫ f(y) dP(y), the expectation of f(Y) taken under the distribution P, and P_n f = n^{−1} Σ_{i=1}^n f(Y_i), the expectation of f(Y) under the empirical measure P_n.

## Proof of Theorem 1

Furthermore, by Lemma 2 given in the Supplementary Material , we have

Define δ_ε = inf_{θ ∈ K_ε} {PM(θ_0, O^ξ) − PM(θ, O^ξ)}. Then under Condition (C2), using the same arguments as in Zhang et al. (2010, p. 352), we can prove that δ_ε > 0. It follows from (A.2) and (A.3) that

where J_{[ ]}(η, F_η, L_2(P)) = ∫_0^η [1 + log N_{[ ]}{ε, F_η, L_2(P)}]^{1/2} dε ≤ K̃ N^{1/2} η. This yields ϕ_n(η) = N^{1/2} η + N/n^{1/2}. It is easy to see that ϕ_n(η)/η is decreasing in η, and r_n^2 ϕ_n(1/r_n) = r_n N^{1/2} + r_n^2 N/n^{1/2} ≤ K̃ n^{1/2}, where r_n = N^{−1/2} n^{1/2} = n^{(1−ν)/2}.

## Proof of Theorem 2

where l*(β, λ; O) and I(β), the efficient score and information for β based on O = {U, V, Δ_1, Δ_2, Z}, are defined as in Zhang et al. (2010, p. 344), with our parameters (β, Λ) corresponding to their (θ, exp(ϕ)). Note that

Thus, we have

This completes the proof of Theorem 2.

## Supplementary material

Supplementary material available at Biometrika online includes the two lemmas used in the proof of Theorem 1 and their proofs, and the Matlab code for the proposed inference procedure.

- Chen K. Generalized case–cohort sampling. J R Statist Soc B. 2001; 63 :791–809. [ Google Scholar ]
- Chen K, Lo SH. Case-cohort and case-control analysis with Cox’s model. Biometrika. 1999; 86 :755–764. [ Google Scholar ]
- Ding J, Zhou H, Liu Y, Cai J, Longnecker MP. Estimating effect of environmental contaminants on women’s subfecundity for the MoBa study data with an outcome-dependent sampling scheme. Biostatistics. 2014; 15 :636–650. [ Europe PMC free article ] [ Abstract ] [ Google Scholar ]
- Feller W. An Introduction to Probability Theory and Its Applications, Volume II. John Wiley; 1971. [ Google Scholar ]
- Finkelstein DM. A proportional hazards model for interval-censored failure time data. Biometrics. 1986; 42 :845–854. [ Abstract ] [ Google Scholar ]
- Gilbert PB, Peterson ML, Follmann D, Hudgens MG, Francis DP, Gurwith M, Heyward WL, Jobes DV, Popovic V, Self SG, et al. Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial. J Infect Dis. 2005; 191 :666–677. [ Abstract ] [ Google Scholar ]
- Huang J. Efficient estimation for the proportional hazards model with interval censoring. Ann Statist. 1996; 24 :540–568. [ Google Scholar ]
- Huang J, Rossini A. Sieve estimation for the proportional-odds failure-time regression model with interval censoring. J Am Statist Assoc. 1997; 92 :960–967. [ Google Scholar ]
- Huang J, Wellner JA. Interval censored survival data: a review of recent progress. In: Proceedings of the First Seattle Symposium in Biostatistics. Springer; 1997.
- Huber C, Solev V, Vonta F. Interval censored and truncated data: rate of convergence of NPMLE of the density. J Statist Plann Inference. 2009;139:1734–1749.
- Kang S, Cai J. Marginal hazards model for case-cohort studies with multiple disease outcomes. Biometrika. 2009;96:887–901.
- Keogh RH, White IR. Using full-cohort data in nested case–control and case–cohort studies by multiple imputation. Statist Med. 2013;32:4021–4043.
- Kim S, Cai J, Lu W. More efficient estimators for case-cohort studies. Biometrika. 2013;100:695–708.
- Li Z, Gilbert P, Nan B. Weighted likelihood method for grouped survival data in case–cohort studies with application to HIV vaccine trials. Biometrics. 2008;64:1247–1255.
- Li Z, Nan B. Relative risk regression for current status data in case-cohort studies. Canad J Statist. 2011;39:557–577.
- Lin X, Cai B, Wang L, Zhang Z. A Bayesian proportional hazards model for general interval-censored data. Lifetime Data Anal. 2015;21:470–490.
- Lorentz GG. Bernstein Polynomials. New York: Chelsea Publishing Co; 1986.
- Ma S, Kosorok MR. Robust semiparametric M-estimation and the weighted bootstrap. J Multivar Anal. 2005;96:190–217.
- Marti H, Chavance M. Multiple imputation analysis of case–cohort studies. Statist Med. 2011;30:1595–1607.
- Pan W. A multiple imputation approach to Cox regression with interval-censored data. Biometrics. 2000;56:199–203.
- Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
- Ren JJ, He B. Estimation and goodness-of-fit for the Cox model with various types of censored data. J Statist Plann Inference. 2011;141:961–971.
- Saegusa T, Wellner JA. Weighted likelihood estimation under two-phase sampling. Ann Statist. 2013;41:269–295.
- Satten GA. Rank-based inference in the proportional hazards model for interval censored data. Biometrika. 1996;83:355–370.
- Satten GA, Datta S, Williamson JM. Inference based on imputed failure times for the proportional hazards model with interval-censored data. J Am Statist Assoc. 1998;93:318–327.
- Scheike TH, Martinussen T. Maximum likelihood estimation for Cox's regression model under case–cohort sampling. Scand J Statist. 2004;31:283–293.
- Self SG, Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Statist. 1988;16:64–81.
- Shen X. On methods of sieves and penalization. Ann Statist. 1997;25:2555–2591.
- Shen X. Proportional odds regression and sieve maximum likelihood estimation. Biometrika. 1998;85:165–177.
- Shen X, Wong WH. Convergence rate of sieve estimates. Ann Statist. 1994;22:580–615.
- Sun J. The Statistical Analysis of Interval-Censored Failure Time Data. Springer; 2006.
- Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Statist Soc B. 1976;38:290–295.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.
- Wang J, Ghosh SK. Shape restricted nonparametric regression with Bernstein polynomials. Comput Statist Data Anal. 2012;56:2729–2741.
- Wang L, McMahan CS, Hudgens MG, Qureshi ZP. A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data. Biometrics. 2016;72:222–231.
- Wang L, Sun L, Sun J. A goodness-of-fit test for the marginal Cox model for correlated interval-censored failure time data. Biom J. 2006;48:1020–1028.
- Xue H, Lam K, Li G. Sieve maximum likelihood estimator for semiparametric regression models with current status data. J Am Statist Assoc. 2004;99:346–356.
- Zeng D, Lin DY. Efficient estimation of semiparametric transformation models for two-phase cohort studies. J Am Statist Assoc. 2014;109:371–383.
- Zeng D, Mao L, Lin D. Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika. 2016;103:253–271.
- Zhang Y, Hua L, Huang J. A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scand J Statist. 2010;37:338–354.
- Zhou H, Weaver M, Qin J, Longnecker M, Wang M. A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421.


## BioStudies: supplemental material and supporting data

- http://www.ebi.ac.uk/biostudies/studies/S-EPMC5608290?xr=true


## Funding

Funders who supported this work:

- NCI NIH HHS, Grant ID: P01 CA142538
- NIEHS NIH HHS, Grant ID: R01 ES021900

J Appl Stat, 48(5), 2021

## Regression analysis of case-cohort studies in the presence of dependent interval censoring

## Mingyue Du

a Center for Applied Statistical Research and College of Mathematics, Jilin University, Changchun, People's Republic of China

## Qingning Zhou

b Department of Mathematics and Statistics, The University of North Carolina at Charlotte, Charlotte, NC, USA

## Shishun Zhao

a Center for Applied Statistical Research and College of Mathematics, Jilin University, Changchun, People's Republic of China

## Jianguo Sun

c Department of Statistics, University of Missouri, Columbia, MO, USA

The case-cohort design is widely used as a means of reducing the cost of large cohort studies, especially when the disease rate is low and covariate measurements are expensive, and has been discussed by many authors. In this paper, we discuss regression analysis of case-cohort studies that produce interval-censored failure time data with dependent censoring, a situation for which no established approach seems to exist. For inference, a sieve inverse probability weighting estimation procedure is developed, with Bernstein polynomials used to approximate the unknown baseline cumulative hazard functions. The proposed estimators are shown to be consistent, and the asymptotic normality of the resulting regression parameter estimators is established. A simulation study conducted to assess the finite-sample properties of the proposed approach indicates that it works well in practical situations. The proposed method is applied to an HIV/AIDS case-cohort study that motivated this investigation.

## 1. Introduction

The case-cohort design is widely used as a means of reducing the cost of large cohort studies, especially when the disease rate is low and covariate measurements may be expensive (Prentice [ 27 ]; Scheike and Martinussen [ 30 ]; Self and Prentice [ 32 ]). In this design, instead of collecting covariate information on all study subjects, one collects it only on the subjects whose failures are observed and on a subsample of the remaining subjects. Among others, one area where the design is often used is epidemiological cohort studies in which the outcomes of interest are times to failure events such as AIDS, cancer, heart disease and HIV infection. In such studies, in addition to the incomplete nature of the covariate information, another feature is that the observations are usually interval-censored rather than right-censored owing to the periodic follow-up nature of the study (Sun [ 34 ]).

By interval-censored data, we mean that the failure time of interest is known or observed only to belong to an interval rather than being observed exactly. It is easy to see that interval-censored data include right-censored data as a special case. Furthermore, one may also face informative censoring, meaning that the failure time of interest and the censoring mechanism are correlated (Huang and Wolfe [ 13 ]; Wang et al. [ 37 ]). An example of informatively interval-censored data may arise in a periodic follow-up study of a disease in which study subjects do not follow the pre-specified visit schedules and instead pay clinical visits according to their disease status or how they feel with respect to their treatments. Among others, Huang and Wolfe [ 13 ] and Sun [ 33 ] discussed this issue and pointed out that an analysis that ignores informative censoring may yield biased or misleading results or conclusions. More discussion on informatively interval-censored data can be found in Sun [ 34 ].

One real study that motivated this investigation is the HVTN 505 Trial, which assessed the efficacy of a DNA prime-recombinant adenovirus type 5 boost (DNA/rAd5) vaccine to prevent human immunodeficiency virus type 1 (HIV-1) infection (Fong et al. [ 8 ]; Hammer et al. [ 10 ]; Janes et al. [ 14 ]). HIV-1 infection is deadly as it causes AIDS, for which there is no cure, and thus it is important and essential to develop a safe and effective vaccine for the prevention of the infection. The original study consisted of 2504 men and transgender women who have sex with men, who were examined periodically, thus yielding only interval-censored data on the time to HIV-1 infection. For each subject, information on four demographic covariates, age, race, BMI and behavioural risk, was collected; in addition, for a subgroup of HIV infection cases and non-cases, a number of T cell response biomarkers and antibody response biomarkers were also measured. One goal of the study is to identify the covariates or biomarkers relevant to HIV-1 infection.

Many authors have discussed the analysis of case-cohort studies, but most of the existing methods are for right-censored failure time data. For example, some of the early work was given by Prentice [ 27 ] and Self and Prentice [ 32 ], who proposed pseudolikelihood approaches based on modifications of the commonly used partial likelihood method under the proportional hazards model. Following them, Chen and Lo [ 3 ] proposed an estimating equation approach that yields more efficient estimators than the pseudolikelihood estimator of Prentice [ 27 ], and Chen [ 2 ] developed an estimating equation approach that applies to a class of cohort sampling designs, including the case-cohort design, with the key estimating function constructed by a sample reuse method via local averaging. Marti and Chavance [ 25 ] and Keogh and White [ 18 ] proposed multiple imputation methods; the latter extended the former by considering more complex imputation models that include time and interaction or nonlinear terms. In addition, Kang and Cai [ 17 ] and Kim et al. [ 19 ] developed weighted estimating equation approaches for case-cohort studies with multiple disease outcomes, where the latter method improved the efficiency of the former by utilizing more information in constructing the weights.

Interval-censored failure time data naturally occur in many areas, especially in studies with periodic follow-up, and a great deal of literature has been developed for their analysis (Chen et al. [ 5 ]; Finkelstein [ 7 ]; Sun [ 34 ]; Zhou et al. [ 40 ]). In particular, Sun [ 34 ] and Bogaerts et al. [ 1 ] provided comprehensive reviews of the existing literature on interval-censored data. Although there exist some methods for either informatively interval-censored data or interval-censored data arising from case-cohort studies, there does not seem to be an established procedure for informatively interval-censored data arising from case-cohort studies. For the analysis of informatively interval-censored data, two types of approaches are commonly used: the frailty model approach and the copula model approach. For example, Zhang et al. (2005, 2007) and Wang et al. [ 36 , 38 ] gave frailty model estimation procedures, while Ma et al. [ 23 , 24 ] and Zhao et al. (2015) proposed copula model methods. For the analysis of interval-censored data arising from case-cohort studies, Gilbert et al. [ 9 ] presented a midpoint imputation procedure, and Li and Nan [ 20 ] considered a special case of interval-censored data, current status data, where the failure time of interest is either left- or right-censored (Jewell and van der Laan [ 15 ]). Also, Zhou et al. [ 41 ] proposed a likelihood-based approach. However, all three methods above assume that the interval censoring mechanism is non-informative, or independent of the failure time of interest. As discussed above, informative censoring is a serious and difficult issue, and the use of methods that do not take it into account can yield biased or misleading results and conclusions (Huang and Wolfe [ 13 ]; Ma et al. [ 23 ]). In the following, we develop a frailty model approach, a generalization of the method proposed in Zhou et al. [ 41 ], for the analysis of case-cohort studies yielding interval-censored data with informative censoring.

The remainder of the paper is organized as follows. We begin in Section 2 by introducing some notation and the models to be used throughout the paper; in particular, we present joint frailty models for the failure time of interest and the underlying censoring mechanism. To estimate the regression parameters, a sieve inverse probability weighting estimation procedure is presented in Section 3, in which Bernstein polynomials are employed to approximate the unknown functions. Furthermore, we establish the consistency and asymptotic normality of the resulting estimators of the regression parameters and provide a weighted bootstrap procedure for variance estimation. Section 4 presents some results obtained from an extensive simulation study conducted to assess the finite-sample properties of the proposed methodology; they suggest that the method works well in practical situations. In Section 5 , we apply the proposed method to the HIV/AIDS study described above, and Section 6 gives some discussion and concluding remarks.

## 2. Notation and models

Consider a failure time study consisting of n independent subjects. For subject i , let T i denote the failure time of interest, and suppose that there exists a p -dimensional vector of covariates Z i that may affect T i , i = 1 , … , n . Also suppose that for subject i there exist two examination times U i and V i with U i ≤ V i , and that one observes only Δ 1 i = I ( T i ≤ U i ) and Δ 2 i = I ( U i < T i ≤ V i ) , indicating whether the failure time T i is left-censored or interval-censored, respectively. Note that U i and V i are random variables assumed to be observed, and together with Δ 1 i and Δ 2 i they give the observed interval-censored data on the T i 's (Sun [ 34 ]; Zhou et al. [ 41 ]).

For case-cohort studies, as mentioned above, the covariate information is available only for the subjects who have experienced the failure event of interest, that is, those with Δ 1 i = 1 or Δ 2 i = 1 , or who belong to the subcohort, a random sample of the entire cohort. Define ξ i = 1 if the covariate Z i is available or observed and ξ i = 0 otherwise, i = 1 , … , n . For the selection of the subcohort, following Zhou et al. [ 40 ] and others, we will consider independent Bernoulli sampling with selection probability q ∈ ( 0 , 1 ) . Then under the assumptions above, the probability that the covariate Z i is observed is given by

p i = Δ 1 i + Δ 2 i + ( 1 − Δ 1 i − Δ 2 i ) q ,

i = 1 , … , n , and the observed data have the form

O = { O i = ( U i , V i , Δ 1 i , Δ 2 i , ξ i , ξ i Z i ) , i = 1 , … , n } .

In contrast, if all covariates were observed, the full cohort data would be

X = { X i = ( U i , V i , Δ 1 i , Δ 2 i , Z i ) , i = 1 , … , n } .

To describe the covariate effects and the dependent interval censoring, define W i = V i − U i , i = 1 , … , n . Following Ma et al. [ 23 ], we will focus on the situation where the dependent censoring can be characterized by the correlation between the T i 's and the W i 's. As mentioned in Ma et al. [ 23 ], one example where this may be the case is follow-up studies in which some study subjects may tend to pay more or fewer clinical visits than scheduled. More comments on this will be given below. For the covariate effects, we assume that there exists a latent variable b i with mean one and known distribution but unknown variance η such that, given Z i and b i , the hazard functions of T i and W i have the forms

λ T ( t ∣ Z i , b i ) = b i λ t ( t ) exp ( β t ′ Z i ) , (2.1)

λ W ( t ∣ Z i , b i ) = b i λ w ( t ) exp ( β w ′ Z i ) , (2.2)

respectively. In the above, λ t ( t ) and λ w ( t ) are unknown baseline hazard functions, and β t and β w are p × 1 vectors of unknown regression parameters. It will also be assumed that, given Z i and b i , W i is independent of U i , and that T i and W i are independent given Z i and b i . In other words, the correlation between T i and W i is induced by the shared frailty b i and is measured by the parameter η . More comments on this are given below.
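As a concrete illustration of the frailty structure above, the following sketch generates failure times from model (2.1) by inverting the conditional survival function, using one of the simulation settings of Section 4 (baseline hazard 0.2 t, a binary covariate and a gamma frailty); the variable names and parameter values are illustrative, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta_t, eta = 5000, 0.5, 0.8      # illustrative values from the simulation study

z = rng.binomial(1, 0.5, n)          # binary covariate Z
b = rng.gamma(1.0 / eta, eta, n)     # gamma frailty: mean 1, variance eta

# Under model (2.1) with baseline hazard 0.2 t, Lambda_t(t) = 0.1 t^2 and
# S(t | z, b) = exp(-b * 0.1 * t**2 * exp(beta_t * z)).
# Setting S(T) = U with U ~ Uniform(0, 1) and solving for T:
u = rng.uniform(size=n)
t = np.sqrt(-np.log(u) / (0.1 * b * np.exp(beta_t * z)))
```

The same inversion with λ w and β w in place of λ t and β t , and the same frailty draw b, produces the dependent gap times W of model (2.2).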

Define Δ i = ( Δ 1 i , Δ 2 i ) and θ = ( β t , β w , Λ t , Λ w , η ) , where Λ t ( t ) = ∫ 0 t λ t ( u ) d u and Λ w ( t ) = ∫ 0 t λ w ( u ) d u . Assume that b i is independent of ( U i , Z i ) and that the joint distribution of ( U i , Z i ) does not involve the parameters of interest. To motivate the proposed estimation procedure, note that conditional on ( W i , U i , Z i , b i ) , the likelihood of the observation from subject i has the form

{ 1 − exp ( − b i Λ t ( U i ) e^{ β t ′ Z i } ) }^{ Δ 1 i } { exp ( − b i Λ t ( U i ) e^{ β t ′ Z i } ) − exp ( − b i Λ t ( V i ) e^{ β t ′ Z i } ) }^{ Δ 2 i } { exp ( − b i Λ t ( V i ) e^{ β t ′ Z i } ) }^{ 1 − Δ 1 i − Δ 2 i } .

Also note that conditional on ( Z i , b i ) , the likelihood of the observation on W i is given by

where Ψ i = I ( W i < ∞ ) . This motivates the following inverse probability weighted log-likelihood function

for estimation of θ , where f ( b i ; η ) denotes the density function of the b i 's and

If f is the gamma distribution, the function l O ξ ( θ ) has a closed form as

In the next section, for estimation of θ , we will discuss the maximization of the inverse probability weighted log-likelihood function l O ξ ( θ ) .
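The inverse probability weights ξ i / p i can be computed directly from the censoring indicators and the subcohort selection: subjects with an observed failure event are always included (weight one), while the others are included with probability q and, when included, weighted by 1 / q. A minimal sketch with simulated indicators (the event probabilities and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 5000, 0.1                     # cohort size and subcohort selection probability

# hypothetical censoring indicators (event probabilities are illustrative)
delta1 = rng.binomial(1, 0.05, n)                 # left-censored: T <= U
delta2 = (1 - delta1) * rng.binomial(1, 0.05, n)  # interval-censored: U < T <= V
case = delta1 + delta2                            # observed failure events

# observation probability: 1 for cases, q otherwise
p = case + (1 - case) * q
xi = np.where(case == 1, 1, rng.binomial(1, q, n))  # cases always have Z observed
weights = xi / p                                    # inverse probability weights xi_i / p_i
```

Since E( ξ i ∣ data ) = p i , the weights average to one over the cohort, which is what makes the weighted log-likelihood unbiased for its full-cohort counterpart.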

## 3. Sieve inverse probability weighting estimation

Define the parameter space of θ as

Θ = B × M 1 × M 2 ,

where B = { ( β t , β w , η ) ∈ R 2 p × R + : ∥ β t ∥ + ∥ β w ∥ + ∥ η ∥ ≤ M } with M a positive constant, and M j denotes the collection of all bounded, continuous, nondecreasing, nonnegative functions over the interval [ σ j , τ j ] , j = 1, 2. In practice, [ σ 1 , τ 1 ] is usually taken to be the range of the U i 's and V i 's, and [ σ 2 , τ 2 ] the range of the W i 's. More comments on this are given below. The maximization of the inverse probability weighted log-likelihood function l O ξ ( θ ) is not straightforward, since l O ξ ( θ ) involves the unknown functions Λ t ( t ) and Λ w ( t ) . To deal with this, following Ma et al. [ 24 ], Zhou et al. [ 40 ] and others, we propose to first approximate the two functions by Bernstein polynomials.

More specifically, define the sieve space

Θ n = B × M n 1 × M n 2 , where M n j = { Σ k = 0 m ϕ k j B k ( t , m , σ j , τ j ) : 0 ≤ ϕ 0 j ≤ ⋯ ≤ ϕ m j ≤ M } , j = 1 , 2 .

In the above,

B k ( t , m , σ j , τ j ) = { m ! / ( k ! ( m − k ) ! ) } ( ( t − σ j ) / ( τ j − σ j ) )^{ k } ( 1 − ( t − σ j ) / ( τ j − σ j ) )^{ m − k } ,

k = 0 , … , m , are the Bernstein basis polynomials of degree m = o ( n ν ) for some ν ∈ ( 0 , 1 ) . Note that some restrictions on the parameters are needed above, since Λ t ( t ) and Λ w ( w ) are nonnegative and nondecreasing functions. However, these restrictions can easily be removed by reparameterization. For example, one can replace the parameters { ϕ 0 j , … , ϕ m j } by the cumulative sums of { exp ( ϕ 0 j ∗ ) , … , exp ( ϕ m j ∗ ) } , j = 1, 2.
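The cumulative-sum reparameterization described above can be sketched as follows: taking the Bernstein coefficients to be cumulative sums of exponentials makes the fitted cumulative hazard automatically nonnegative and nondecreasing, so the optimization over the ϕ ∗ 's is unconstrained. Function and variable names below are ours, not the paper's.

```python
import math
import numpy as np

def bernstein_cumhaz(t, phi_star, lo, hi):
    # Nondecreasing Bernstein-polynomial approximation on [lo, hi]: the
    # coefficients are cumulative sums of exp(phi_star), which enforces
    # 0 < phi_0 <= phi_1 <= ... <= phi_m and hence a nondecreasing curve.
    m = len(phi_star) - 1
    phi = np.cumsum(np.exp(phi_star))
    x = (np.asarray(t, dtype=float) - lo) / (hi - lo)
    basis = np.array([math.comb(m, k) * x**k * (1.0 - x)**(m - k)
                      for k in range(m + 1)])
    return phi @ basis

ts = np.linspace(0.0, 2.0, 50)
vals = bernstein_cumhaz(ts, np.array([-1.0, 0.5, 0.0, -0.3]), lo=0.0, hi=2.0)
```

Monotonicity holds because a Bernstein polynomial with nondecreasing coefficients is itself nondecreasing on the interval.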

Let θ ^ n = ( β ^ t n , β ^ w n , η ^ n , Λ ^ t n , Λ ^ w n ) denote the estimator of θ given by the value of θ that maximizes the inverse probability weighted log-likelihood function l O ξ ( θ ) over the sieve space Θ n . Also let θ 0 = ( β t 0 , β w 0 , η 0 , Λ t 0 , Λ w 0 ) denote the true value of θ , ϑ ^ n = ( β ^ t n , β ^ w n , η ^ n ) and ϑ 0 = ( β t 0 , β w 0 , η 0 ) , and for any θ 1 = ( β t 1 , β w 1 , η 1 , Λ t 1 , Λ w 1 ) and θ 2 = ( β t 2 , β w 2 , η 2 , Λ t 2 , Λ w 2 ) in the parameter space Θ , define the distance

d ( θ 1 , θ 2 ) = { ∥ β t 1 − β t 2 ∥ 2 + ∥ β w 1 − β w 2 ∥ 2 + ( η 1 − η 2 ) 2 + ∥ Λ t 1 − Λ t 2 ∥ 2 2 + ∥ Λ w 1 − Λ w 2 ∥ 2 2 } 1 / 2 .

Here ‖ v ‖ denotes the Euclidean norm for a vector v , ‖ Λ t 1 − Λ t 2 ‖ 2 2 = ∫ [ ( Λ t 1 ( u ) − Λ t 2 ( u ) ) 2 + ψ ( Λ t 1 ( u + w ) − Λ t 2 ( u + w ) ) 2 ] d G ( u , w ) , and ‖ Λ w 1 − Λ w 2 ‖ 2 2 = ∫ ψ [ Λ w 1 ( w ) − Λ w 2 ( w ) ] 2 d G ( u , w ) with G ( u , w ) denoting the joint distribution function of U and W . The following two theorems establish the asymptotic properties of θ ^ n .

Theorem 1. Suppose that the regularity conditions (C1)–(C4) given in the Appendix hold. Then as n → ∞ , we have that d ( θ ^ n , θ 0 ) → 0 almost surely and d ( θ ^ n , θ 0 ) = O p ( n − min { ( 1 − ν ) / 2 , ν r / 2 } ) , where ν ∈ ( 0 , 1 ) is defined in m = o ( n ν ) and r in the regularity condition (C3).

Theorem 2. Suppose that the regularity conditions (C1)–(C5) given in the Appendix hold. Then as n → ∞ and if ν > 1 / ( 2 r ) , we have that

n 1 / 2 ( ϑ ^ n − ϑ 0 ) → N ( 0 , Σ )

in distribution, where

with v ⊗ 2 = v v ′ for a vector v and I ( ϑ ) and l ∗ ( ϑ , O ) , given in the Appendix, denoting the information matrix and efficient score for ϑ = ( β t , β w , η ) based on the complete data.

The proofs of the results above are sketched in the Appendix. For the determination of the proposed estimator θ ^ n , different methods can be used; in the numerical studies below, the Matlab function fmincon is used. One also needs to choose or specify the degree m of the Bernstein polynomials, which controls the smoothness of the approximation. For this, one common approach is to perform a grid search over different values of m and choose the one that minimizes

AIC ( m ) = − 2 l O ξ ( θ ^ n ) + 2 { 2 p + 1 + 2 ( m + 1 ) } ,

based on the AIC criterion, where 2 p + 1 + 2 ( m + 1 ) is the total number of free parameters. Note that instead of this, one may employ other criteria such as the BIC, and the numerical results indicate that they give similar performance. Also note that in the approximation of Λ t and Λ w we used the same degree m ; in practice, different degrees could be used.
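The grid-search mechanics can be sketched as follows, with an ordinary polynomial least-squares fit standing in for the sieve fit (the AIC bookkeeping is the point here; the fitting step, data and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 200)
y = 0.1 * x**2 + rng.normal(0.0, 0.01, x.size)    # noisy target curve

def aic_for_degree(m):
    # stand-in for the sieve fit at degree m: ordinary polynomial least squares
    coef = np.polyfit(x, y, m)
    resid = y - np.polyval(coef, x)
    sigma2 = np.mean(resid**2)
    loglik = -0.5 * x.size * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return -2.0 * loglik + 2.0 * (m + 1)          # AIC = -2 log-lik + 2 * #parameters

degrees = range(1, 8)
best_m = min(degrees, key=aic_for_degree)         # degree minimizing the AIC
```

In the paper's setting the inner fit would instead maximize the weighted log-likelihood over the sieve for each candidate m.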

For inference about ϑ 0 = ( β t 0 , β w 0 , η 0 ) , one needs to estimate the covariance matrix of ϑ ^ n = ( β ^ t n , β ^ w n , η ^ n ) . A natural way would be to derive a consistent estimator of Σ . However, as one can see from the Appendix, Σ involves the information matrix I ( ϑ 0 ) and the efficient score l ∗ ( ϑ 0 , O ) , neither of which has a closed form. Thus it would be difficult to derive a consistent estimator directly, and instead we propose to employ the weighted bootstrap procedure discussed in Ma and Kosorok [ 22 ], which is easy to implement and seems to work well in the numerical studies described below. Specifically, let { u 1 , … , u n } denote n independent realizations of a bounded positive random variable u satisfying E ( u ) = 1 and v a r ( u ) = ϵ 0 < ∞ , and define the new weights p i ′ = u i p i , i = 1 , … , n . Also let ϑ ^ n ′ denote the estimator of ϑ proposed above with the p i 's replaced by the p i ′ 's. If we repeat this B times, the covariance matrix of ϑ ^ n can be estimated by the sample covariance matrix of the ϑ ^ n ′ 's. Following Ma and Kosorok [ 22 ], it can be shown that this weighted bootstrap variance estimator is consistent.
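The weighted bootstrap mechanics can be illustrated with a toy weighted estimator, a weighted mean standing in for ϑ ^ n ; in the spirit of Ma and Kosorok's scheme, each subject's contribution is perturbed by an independent positive weight u i with mean one (here exponential draws), and the standard error is read off the spread of the perturbed estimates. All names and data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, B = 400, 300
x = rng.normal(1.0, 2.0, n)          # toy data
p = np.full(n, 1.0)                  # stand-in selection probabilities p_i

def estimate(w):
    # stand-in for maximizing the weighted log-likelihood: a weighted mean
    return np.average(x, weights=w)

theta_hat = estimate(1.0 / p)        # point estimate with weights 1 / p_i
reps = []
for _ in range(B):
    u = rng.exponential(1.0, n)      # positive weights with E(u) = 1, var(u) = 1
    reps.append(estimate(u / p))     # each subject's contribution perturbed by u_i
se = np.std(reps, ddof=1)            # weighted bootstrap standard error
```

For a vector-valued ϑ ^ n , one would store the B perturbed estimates and take their sample covariance matrix instead of a single standard error.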

## 4. A simulation study

In this section, we report some results obtained from a simulation study conducted to evaluate the finite-sample performance of the inverse probability weighted estimation procedure proposed in the previous sections. In the study, the covariate Z was assumed to follow the Bernoulli distribution with success probability 0.5, and to generate the subcohort, as mentioned above, we considered independent Bernoulli sampling with selection probability 0.1. For the proportion of observed failure events, or the event rate, we studied several cases, including p e = 0.05 , 0.1 and 0.2. To generate interval-censored data, we first generated the U i 's from the uniform distribution over ( 0 , a ) , with a a positive constant, and the latent variables b i . The T i 's and W i 's were then generated from models (2.1) and (2.2) with λ t ( t ) = 0.2 t , 0.1 t or 4 t / 9 and λ w ( t ) = 12 t , and the V i 's were defined as V i = U i + W i for all i . The results given below are based on the full cohort size n = 1000 or 2000 with 1000 replications.
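The construction of the interval-censoring indicators in this set-up can be sketched as follows; placeholder distributions stand in for draws from models (2.1) and (2.2), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, a = 1000, 1.0

# placeholder draws standing in for T and W generated from models (2.1)-(2.2)
t = rng.weibull(2.0, n)              # failure times T
u = rng.uniform(0.0, a, n)           # first examination time U ~ Uniform(0, a)
w = rng.exponential(0.2, n)          # gap W = V - U
v = u + w                            # second examination time V = U + W

delta1 = (t <= u).astype(int)        # left-censored: T <= U
delta2 = ((u < t) & (t <= v)).astype(int)   # interval-censored: U < T <= V
# right-censored when delta1 == delta2 == 0, i.e. T > V
```

The constant a and the gap distribution jointly control the event rate p e , which is how the settings p e = 0.05, 0.1 and 0.2 above would be tuned.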

Table 1 presents the results obtained for the proposed estimators β ^ t n , β ^ w n and η ^ n with n = 1000, the true values of the parameters being β t 0 = β w 0 = 0 , 0.2 or 0.5 and η 0 = 0.8 , and the b i 's following the gamma distribution. The results include the estimated bias (Bias), given by the average of the proposed estimates minus the true value, the sample standard error (SSE), the average of the estimated standard errors (ESE) and the 95% empirical coverage probability (CP). Here we took the degree of the Bernstein polynomials to be m = 3 and the weighted bootstrap sample size to be B = 100 for variance estimation, with the random sample { u 1 , … , u n } generated repeatedly from the exponential distribution. Table 2 gives the estimation results obtained under the same set-up except with n = 2000. One can see from the two tables that the proposed estimator appears to be unbiased and that the weighted bootstrap variance estimation procedure works well. They also indicate that the normal approximation to the distribution of the proposed estimator appears reasonable. In addition, as expected, the estimation results improved when the percentage of observed failure events or the full cohort size increased. We also considered other set-ups, including different values of m and B, and obtained similar results.

In the proposed estimation procedure, it has been assumed that the distribution of the latent variables b i is known up to the variance parameter. Hence, in practice, one question of interest is the robustness of the estimation procedure with respect to this distribution. To investigate this, we repeated the simulation study that gave the results in Table 1 with p e = 0.1 , except that we generated the b i 's from the log-normal distribution instead of the gamma distribution while still assuming in the estimation that they followed the gamma distribution. Table 3 presents the results obtained for the proposed estimators β ^ t n and β ^ w n , including the Bias, SSE, ESE and 95% empirical CP. As before, they suggest that the proposed methodology works well; that is, the estimators β ^ t n and β ^ w n appear to be robust with respect to the distribution of the latent variables.

For the problem discussed here, instead of the inverse probability weighting method proposed above, there exist two commonly used naive approaches that estimate the regression parameters by regular likelihood methods. One bases the estimation only on the selected subcohort, and the other bases it on a simple random sample of the same size as the case-cohort sample. Let β ^ t s u b and β ^ t s r s denote the estimators of β t given by these two naive methods, respectively; here we focus only on the estimation of β t . Table 4 gives the estimation results given by the proposed method and the two naive approaches under a set-up similar to that for Table 2 with p e = 0.1 . For comparison, we also considered the approach of Zhou et al. [ 40 ], which treats the observation process as independent of the failure time of interest, that is, it ignores the correlation between the failure time and the observation process; the resulting estimator of β t is denoted by β ^ t i n in the table. One can see from Table 4 that the proposed estimator clearly performed better than the two naive estimators and that one would obtain biased results by ignoring the correlation between the failure time of interest and the observation process.

As pointed out by a reviewer and motivated by the real data discussed below, we also repeated the study that gave the results in Table 1 with p e = 0.05 , in which we generated the subcohort in the same way as before but only from non-case subjects instead of all subjects. In other words, the goal here is to assess the performance of the proposed approach for case–control studies. The estimation results, presented in Table 5 , are similar to those given in Table 1 . That is, the proposed estimation approach seems to perform well for, and can be applied to, case–control studies too.

## 5. An application

In this section, we apply the methodology proposed in the previous sections to the HVTN 505 Trial discussed above, a randomized, multi-site clinical trial of men and transgender women who have sex with men, conducted to assess the efficacy of the DNA/rAd5 vaccine against HIV-1 infection (Fong et al. [ 8 ]; Hammer et al. [ 10 ]; Janes et al. [ 14 ]). As mentioned above, the original study consists of subjects randomly assigned to receive either the DNA/rAd5 vaccine or placebo; in the following, we focus only on the 1253 subjects in the vaccine group. For each subject, four demographic covariates were observed: age, race, BMI and behavioural risk. In addition, to assess their relationship with HIV infection, a number of T cell response biomarkers and antibody response biomarkers were measured for a cohort of 150 subjects consisting of all 25 HIV infection cases and 125 other randomly selected subjects among the vaccine recipients. The failure time of interest is the time to HIV-1 infection, for which only interval-censored data are available.

In all previous analyses, the authors simplified the observed data into right-censored data and did not consider the possibility of informative censoring (Fong et al. [ 8 ]; Hammer et al. [ 10 ]; Janes et al. [ 14 ]). They identified the T cell response biomarker Env CD8+ polyfunctionality score and the antibody response biomarker IgG.Cconenv03140CF.avi as possibly having significant effects on the HIV infection time. For simplicity, below we refer to these two biomarkers as Env CD8 Score and IgG, respectively. For the analysis below, following Fong et al. [ 8 ] and Janes et al. [ 14 ], we focus on the cohort of 150 vaccine recipients, which can be treated as a case–control design with the full cohort being all subjects in the vaccine group, and investigate the relationship between the HIV infection time and the four demographic covariates plus the two biomarkers.

Table 6 presents the estimation results given by applying the proposed methodology to the HVTN 505 Trial, including the estimated covariate effects β ^ t n and β ^ w n , the estimated standard errors (ESE) and the p -values for testing each covariate effect being zero. For the degree of the Bernstein polynomials, we tried several values, m = 2, 3, 4, 5, 6 and 7; the results reported here were obtained with m = 3, which gave the smallest AIC defined above, and B = 500. Table 6 suggests that among the six covariates considered, the two demographic covariates race and behavioural risk appear to be correlated with the HIV infection time, and the two biomarkers also appear to have significant prognostic effects on the development of HIV infection. In contrast, age and BMI did not seem to have any effect on HIV infection. In addition, race and behavioural risk appear to have significant effects on the observation process as well.
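The AIC-based selection of the Bernstein degree can be outlined as follows. This is an illustrative sketch only: the log-likelihood values are placeholders, not the values from the HVTN 505 analysis, and the parameter count 2(m + 1) + 2p + 1 (two sieve expansions plus the 2p regression parameters and η) is assumed for definiteness.

```python
# Hypothetical sketch of choosing the Bernstein polynomial degree m by AIC:
# fit the sieve model for each candidate degree and keep the smallest AIC.

def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * loglik."""
    return 2 * n_params - 2 * loglik

def select_degree(fits, p=6):
    """fits maps a degree m to the maximized log-likelihood; each model
    uses 2(m+1) Bernstein coefficients plus 2p + 1 parametric parameters."""
    scores = {m: aic(ll, 2 * (m + 1) + 2 * p + 1) for m, ll in fits.items()}
    return min(scores, key=scores.get), scores

# Illustrative (made-up) maximized log-likelihoods for m = 2, ..., 7
fits = {2: -412.3, 3: -405.1, 4: -404.8, 5: -404.6, 6: -404.5, 7: -404.5}
best_m, scores = select_degree(fits)
```

With these placeholder values the penalty outweighs the small likelihood gains beyond m = 3, mirroring the kind of comparison described in the text.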

For comparison, we also applied to the data the method of Zhou et al. [ 40 ], which assumes that the HIV infection time and the observation process are independent, and included in the table the resulting estimated covariate effects, denoted by β ^ t i n , along with the estimated standard errors and p -values. One difference between the results given by the two methods concerns the estimated effect of the behavioural risk factor, which did not seem to have any effect on the development of HIV infection under the method of Zhou et al. [ 40 ]. One explanation is that the method of Zhou et al. [ 40 ] ignores the existence of informative censoring.

## 6. Discussion and concluding remarks

This paper discussed the analysis of case-cohort studies that yield informatively interval-censored failure time data arising from the proportional hazards model. As discussed above, a great deal of literature exists on the analysis of case-cohort studies that give right-censored data. In practice, however, the observed information on the failure time is more naturally given in the form of interval-censored data, especially in longitudinal or periodic follow-up studies. One major difference between the two data types is that interval-censored data have a much more complex structure than right-censored data, which makes their analysis considerably more difficult. Although a large literature has also been established for the analysis of either interval-censored data or case-cohort studies, no method is available for the informative censoring situation discussed above. As pointed out before and seen in Section 5, informative censoring often occurs naturally, and an analysis that ignores it can yield biased or misleading results and conclusions.

As discussed in Sections 4 and 5, a type of study similar to the case-cohort study is the case–control study, and the key difference between the two lies in the generation of the subcohort. Under the case-cohort design, the subcohort is sampled from all study subjects, while the case–control design samples the subcohort only from subjects who do not experience the failure event of interest during follow-up. The data structures under the two designs are clearly different, but the simulation study suggested that the proposed estimation approach also seems valid for the case–control design. A possible explanation is that, given the low event rate, the resulting data may carry similar information about the model and the regression parameters of interest.
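The sampling contrast between the two designs can be sketched as follows. The cohort size and case count echo Section 5, but the labels are simulated and the function names are hypothetical illustrations, not study data or the authors' code.

```python
import random

def case_cohort_subcohort(cohort, size, seed=1):
    """Case-cohort design: sample the subcohort from ALL study subjects."""
    rng = random.Random(seed)
    return rng.sample(cohort, size)

def case_control_subcohort(cohort, is_case, size, seed=1):
    """Case-control design: sample only from subjects without the event."""
    rng = random.Random(seed)
    controls = [i for i in cohort if not is_case[i]]
    return rng.sample(controls, size)

cohort = list(range(1253))                 # e.g. the vaccine group size
is_case = {i: i < 25 for i in cohort}      # 25 events, as in Section 5
sub_cc = case_cohort_subcohort(cohort, 125)
sub_ctrl = case_control_subcohort(cohort, is_case, 125)
# Every sampled control is event-free; a case-cohort subcohort may contain cases.
assert all(not is_case[i] for i in sub_ctrl)
```

With only 25 events in 1253 subjects, the two sampling schemes draw from nearly the same pool, which is one way to read the explanation above.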

In practice, interval-censored data may arise in different forms (Sun [ 34 ]). For example, instead of the form discussed here, one may have case K or mixed interval-censored data (Wang et al. [ 37 ]). One can still apply the proposed estimation procedure to these situations by expressing the data in the format described here. However, the derivation of the asymptotic properties may differ, and one may need additional assumptions similar to those described in Huang [ 11 ] and Wang et al. [ 37 ]. In the previous sections, the focus has been on informative censoring that can be characterized by models (2.1) and (2.2), or through latent variables. More specifically, it has been assumed that the magnitude of the informative censoring can be measured by the parameter η . As with most frailty model approaches, a natural question is whether one can test η = 0 . Unfortunately, no established procedure for this seems to exist in the literature. Another related question is the possibility of performing goodness-of-fit tests on models (2.1) and (2.2). If η = 0 , one may apply the test procedures given in Ren and He [ 28 ] and McKeague and Utikal [ 26 ], respectively, to test them separately; however, it would not be straightforward to generalize either of them to the situation discussed here.

As mentioned above, another commonly used method for dealing with informative censoring is the copula model approach, which directly models the joint distribution of the failure time of interest and the censoring variables (Sun [ 34 ]). For example, Cui et al. [ 6 ] and Ma et al. [ 24 ] developed two such methods for regression analysis of current status data with informative censoring, a special case of interval-censored data in which each subject is observed only once. Among others, Ma et al. [ 23 ] proposed a copula model approach for regression analysis of general interval-censored data. An advantage of the copula model approach is that it allows one to model the marginal distributions and the association parameter separately, but it has the limitation that the underlying copula function must be assumed known.
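To make the copula formulation concrete, a generic version of the approach can be written as follows; the Clayton form is shown purely as an example of an assumed-known copula, not as the model used by any of the cited authors.

```latex
% Generic copula formulation (illustrative): the joint survival function of
% the failure time T and a censoring variable C is expressed as
S(t, c) = \operatorname{pr}(T > t,\, C > c) = C_\alpha\{S_T(t), S_C(c)\},
% where S_T and S_C are the marginal survival functions and C_\alpha is an
% assumed-known copula indexed by an association parameter \alpha, e.g. the
% Clayton copula
C_\alpha(u, v) = \bigl(u^{-\alpha} + v^{-\alpha} - 1\bigr)^{-1/\alpha},
\qquad \alpha > 0 .
```

The separation of the marginals S_T, S_C from the association parameter α is exactly the modelling advantage, and the fixed functional form of C_α is exactly the limitation, noted in the paragraph above.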

Although the proportional hazards model is one of the most commonly used models for regression analysis of failure time data, sometimes a different model may be preferred or may fit the data or describe the problem of interest better (Kalbfleisch and Prentice [ 16 ]). For example, the additive hazards model is usually preferred when the excess risk is of interest, and one may consider the linear transformation model when model flexibility is more important. Some literature exists on these and other models for regression analysis of general interval-censored data and for the analysis of case-cohort studies that yield right-censored data. However, no established estimation procedure seems to exist for the problem discussed here under other models, and it would be useful to generalize the proposed method to the additive hazards or linear transformation model.

## Acknowledgments

The authors wish to thank the Editor-in-Chief, the Associate Editor and three reviewers for their many critical and constructive comments and suggestions that greatly improved the paper. The authors also thank Dr Peter Gilbert for providing the HIV example data.

## Appendix. Proofs of the asymptotic properties of θ ^ n

In this appendix, we sketch the proofs of the asymptotic properties of the proposed estimator θ ^ n . Let τ denote the length of the study. Then a single observation can be written as

To establish the asymptotic properties, we need the following regularity conditions, which are commonly used in studies of interval-censored data and are usually satisfied in practice (Huang and Rossini [ 12 ]; Zhang et al. [ 39 ]; Ma et al. [ 23 ]; Zhou et al. [ 40 ]).

- (C1) The distribution of the covariate Z has a bounded support in R p and is not concentrated on any proper subspace of R p .
- (C2) The true parameters ( β t 0 , β w 0 , η 0 ) lie in the interior of a compact set B in R 2 p × R + .
- (C3) The first derivatives of Λ t 0 ( ⋅ ) and Λ w 0 ( ⋅ ) , denoted by Λ t 0 ( 1 ) ( ⋅ ) and Λ w 0 ( 1 ) ( ⋅ ) , are Hölder continuous with exponent γ ∈ ( 0 , 1 ] . That is, there exists a constant K > 0 such that | Λ t 0 ( 1 ) ( t 1 ) − Λ t 0 ( 1 ) ( t 2 ) | ≤ K | t 1 − t 2 | γ for all t 1 , t 2 ∈ [ σ , τ ] , where 0 < σ < τ < ∞ . Let r = 1 + γ .
- (C4) There exists a constant K >0 such that P l ( θ , O ξ ) − P l ( θ 0 , O ξ ) ≤ − K d ( θ , θ 0 ) 2 for every θ in a neighbourhood of θ 0 , where l ( θ , O ξ ) is the weighted log-likelihood function based on a single observation O ξ .
- (C5) The matrix E ( { l ∗ ( ϑ 0 , O ) } ⊗ 2 ) is finite and positive definite, where v ⊗ 2 = v v ′ for a vector v , and l ∗ ( ϑ , O ) is the efficient score for ϑ = ( β t , β w , η ) based on the complete observation O = { U , Ψ W , Ψ , Δ 1 , Δ 2 , Z } and will be given in the proof of Theorem 2.

For the proofs, we mainly employ empirical process theory and some nonparametric techniques. Let P f = ∫ f ( y ) d P denote the expectation of f ( Y ) under the probability measure P , and P n f = n − 1 ∑ i = 1 n f ( Y i ) the expectation of f ( Y ) under the empirical measure P n . Consider the class L n = { l ( θ , O ξ ) : θ ∈ Θ n } , where l ( θ , O ξ ) is the weighted log-likelihood function based on a single observation O ξ . For any ϵ > 0 , define the covering number N ( ϵ , L n , L 1 ( P n ) ) as the smallest positive integer κ for which there exists { θ ( 1 ) , … , θ ( κ ) } such that

for all θ ∈ Θ n , where { O 1 ξ , … , O n ξ } denote the observed data and θ ( j ) = ( β t ( j ) , β w ( j ) , η ( j ) , Λ t ( j ) , Λ w ( j ) ) ∈ Θ n for j = 1 , … , κ . If no such κ exists, define N ( ϵ , L n , L 1 ( P n ) ) = ∞ . For the proofs, we also need the following two lemmas, whose proofs are similar to those of Lemmas 1 and 2 in Zhou et al. [ 40 ] and are thus omitted.

Lemma A.1. Assume that the regularity conditions (C1)–(C3) given above hold. Then the covering number of the class L n = { l ( θ , O ξ ) : θ ∈ Θ n } satisfies

for a constant K , where m = o ( n ν ) with ν ∈ ( 0 , 1 ) is the degree of Bernstein polynomials, and M n = O ( n a ) with a >0 controls the size of the sieve space Θ n .

Lemma A.2. Assume that the regularity conditions (C1)–(C3) given above hold. Then we have that

almost surely.

## Proof of Theorem 3.1

We first prove the strong consistency of θ ^ n . Let l ( θ , O ξ ) denote the weighted log-likelihood function based on a given single observation O ξ and consider the class of functions L n = { l ( θ , O ξ ) : θ ∈ Θ n } . By Lemma A.1, the covering number of L n satisfies

Furthermore, by Lemma A.2, we have

Note that E ( p | O ) = 1 , so that P l ( θ , O ξ ) = P { p l ( θ , O ) } = P l ( θ , O ) and θ 0 maximizes P l ( θ , O ξ ) . Let M ( θ , O ξ ) = − l ( θ , O ξ ) , and define K ϵ = { θ : d ( θ , θ 0 ) ≥ ϵ , θ ∈ Θ n } for ϵ > 0 and

If θ ^ n ∈ K ϵ , then we have

Define δ ϵ = inf θ ∈ K ϵ { P M ( θ , O ξ ) − P M ( θ 0 , O ξ ) } . Under Condition (C4), we have δ ϵ > 0 . It follows from A2 and A3 that

with ζ n = ζ 1 n + ζ 2 n , and hence ζ n ≥ δ ϵ . This gives { θ ^ n ∈ K ϵ } ⊆ { ζ n ≥ δ ϵ } , and by A1 and the strong law of large numbers, we have both ζ 1 n → 0 and ζ 2 n → 0 almost surely. Therefore, ∪ k = 1 ∞ ∩ n = k ∞ { θ ^ n ∈ K ϵ } ⊆ ∪ k = 1 ∞ ∩ n = k ∞ { ζ n ≥ δ ϵ } , which proves that d ( θ ^ n , θ 0 ) → 0 almost surely.

Now we establish the convergence rate of θ ^ n by using Theorem 3.4.1 of van der Vaart and Wellner [ 35 ]. Below we use K ~ to denote a universal positive constant that may differ from place to place. First note from Theorem 1.6.2 of Lorentz [ 21 ] that there exist Bernstein polynomials Λ t n 0 and Λ w n 0 such that ‖ Λ t n 0 − Λ t 0 ‖ ∞ = O ( m − r / 2 ) and ‖ Λ w n 0 − Λ w 0 ‖ ∞ = O ( m − r / 2 ) . Define θ n 0 = ( β t 0 , β w 0 , η 0 , Λ t n 0 , Λ w n 0 ) . Then d ( θ n 0 , θ 0 ) = O ( n − r ν / 2 ) . For any ρ > 0 , define the class of functions F ρ = { l ( θ , O ξ ) − l ( θ n 0 , O ξ ) : θ ∈ Θ n , ρ / 2 < d ( θ , θ n 0 ) ≤ ρ } for a given single observation O ξ . One can easily show that P ( l ( θ 0 , O ξ ) − l ( θ n 0 , O ξ ) ) ≤ K ~ d ( θ 0 , θ n 0 ) ≤ K ~ n − r ν / 2 . From Condition (C4), for large n , we have

for any l ( θ , O ξ ) − l ( θ n 0 , O ξ ) ∈ F ρ .

Following the calculations in Shen and Wong [ 32 ](p. 597), we can establish that for 0 < ε < ρ , log N [ ] ( ε , F ρ , L 2 ( P ) ) ≤ K ~ N log ( ρ / ε ) with N = 2 ( m + 1 ) . Moreover, some algebraic manipulations yield that P ( l ( θ , O ξ ) − l ( θ n 0 , O ξ ) ) 2 ≤ K ~ ρ 2 for any l ( θ , O ξ ) − l ( θ n 0 , O ξ ) ∈ F ρ . Under Conditions (C1)–(C3), it is easy to see that F ρ is uniformly bounded. Therefore, by Lemma 3.4.2 of van der Vaart and Wellner [ 35 ], we obtain

where J [ ] { ρ , F ρ , L 2 ( P ) } = ∫ 0 ρ [ 1 + log N [ ] { ε , F ρ , L 2 ( P ) } ] 1 / 2 d ε ≤ K ~ N 1 / 2 ρ . This yields ϕ n ( ρ ) = N 1 / 2 ρ + N / n 1 / 2 . It is easy to see that ϕ n ( ρ ) / ρ is decreasing in ρ , and r n 2 ϕ n ( 1 / r n ) = r n N 1 / 2 + r n 2 N / n 1 / 2 ≤ K ~ n 1 / 2 , where r n = N − 1 / 2 n 1 / 2 = n ( 1 − ν ) / 2 .

Finally note that P n { l ( θ ^ n , O ξ ) − l ( θ n 0 , O ξ ) } ≥ 0 and d ( θ ^ n , θ n 0 ) ≤ d ( θ ^ n , θ 0 ) + d ( θ 0 , θ n 0 ) → 0 in probability. Thus by applying Theorem 3.4.1 of van der Vaart and Wellner [ 35 ], we have n ( 1 − ν ) / 2 d ( θ ^ n , θ n 0 ) = O p ( 1 ) . This together with d ( θ n 0 , θ 0 ) = O ( n − r ν / 2 ) yields that d ( θ ^ n , θ 0 ) = O p ( n − ( 1 − ν ) / 2 + n − r ν / 2 ) and the proof is completed.
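The two terms in this rate can be balanced by an appropriate choice of ν, a standard observation for sieve estimators; the following derivation is a clarifying aside, not part of the original proof.

```latex
% Balancing the two terms in
%   d(\hat\theta_n, \theta_0) = O_p\bigl(n^{-(1-\nu)/2} + n^{-r\nu/2}\bigr)
% amounts to equating the exponents:
\frac{1-\nu}{2} = \frac{r\nu}{2}
\;\Longrightarrow\;
\nu = \frac{1}{1+r},
% which gives the overall convergence rate
d(\hat\theta_n, \theta_0) = O_p\!\bigl(n^{-r/(2+2r)}\bigr).
```

A larger degree (larger ν) reduces the approximation bias n^{−rν/2} but inflates the estimation error n^{−(1−ν)/2}; the balanced choice trades the two off.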

## Proof of Theorem 3.2

We now prove the asymptotic normality of ϑ ^ n = ( β ^ t n , β ^ w n , η ^ n ) . First we establish the asymptotic normality of the estimator based on the complete observation O = { U , Ψ W , Ψ , Δ 1 , Δ 2 , Z } . With a slight abuse of notation, we still denote the complete-data estimator by ϑ ^ n .

Let V denote the linear span of Θ − θ 0 and define the Fisher inner product for v , v ~ ∈ V as < v , v ~ >= P { l ˙ ( θ 0 , O ) [ v ] l ˙ ( θ 0 , O ) [ v ~ ] } and the Fisher norm for v ∈ V as ‖ v ‖ 2 =< v , v > , where

denotes the first order directional derivative of l ( θ , O ) at the direction v ∈ V (evaluated at θ 0 ). Also let V ¯ be the closed linear span of V under the Fisher norm. Then ( V ¯ , ‖ ⋅ ‖ ) is a Hilbert space. Furthermore, for a ( 2 p + 1 ) -dimensional vector b = ( b 1 ′ , b 2 ′ , b 3 ) ′ with ‖ b ‖ ≤ 1 and any v ∈ V , define a smooth functional of θ as h ( θ ) = b 1 ′ β 1 + b 2 ′ β 2 + b 3 η and

whenever the right-hand-side limit is well defined. Then by the Riesz representation theorem, there exists v ∗ ∈ V ¯ such that h ˙ ( θ 0 ) [ v ] = < v , v ∗ > for all v ∈ V ¯ and ‖ v ∗ ‖ = ‖ h ˙ ( θ 0 ) ‖ . Also note that h ( θ ) − h ( θ 0 ) = h ˙ ( θ 0 ) [ θ − θ 0 ] . It thus follows from the Cramér-Wold device that to prove the asymptotic normality of ϑ ^ n , i.e. n 1 / 2 ( ϑ ^ n − ϑ 0 ) → N ( 0 , I − 1 ( ϑ 0 ) ) in distribution, it suffices to show that

since b ′ ( ϑ ^ n − ϑ 0 ) = h ( θ ^ n ) − h ( θ 0 ) = h ˙ ( θ 0 ) [ θ ^ n − θ 0 ] =< θ ^ n − θ 0 , v ∗ > . In fact, A4 holds since one can show that n 1 / 2 < θ ^ n − θ 0 , v ∗ > → d N ( 0 , ‖ v ∗ ‖ 2 ) and ‖ v ∗ ‖ 2 = b ′ I − 1 ( ϑ 0 ) b .

We first prove that n 1 / 2 < θ ^ n − θ 0 , v ∗ > → d N ( 0 , ‖ v ∗ ‖ 2 ) . Let δ n = n − min { ( 1 − ν ) / 2 , r ν / 2 } denote the rate of convergence obtained in Theorem 3.1, and for any θ ∈ Θ such that d ( θ , θ 0 ) ≤ δ n , define the first order directional derivative of l ( θ , O ) at the direction v ∈ V as

and the second-order directional derivative at the directions v , v ~ ∈ V as

Note that by Condition (C3) and Theorem 1.6.2 of Lorentz [ 21 ], there exists Π n v ∗ ∈ Θ n − θ 0 such that ‖ Π n v ∗ − v ∗ ‖ = O ( n − ν r / 2 ) . Furthermore, under the assumption ν > 1 / 2 r , we have δ n ‖ Π n v ∗ − v ∗ ‖ = o ( n − 1 / 2 ) . Define r [ θ − θ 0 , O ] = l ( θ , O ) − l ( θ 0 , O ) − l ˙ ( θ 0 , O ) [ θ − θ 0 ] and let ε n be any positive sequence satisfying ε n = o ( n − 1 / 2 ) . Then by the definition of θ ^ n , we have

We will investigate the asymptotic behaviour of I 1 , I 2 and I 3 . For I 1 , it follows from Conditions (C1)–(C3), Chebyshev inequality and ‖ Π n v ∗ − v ∗ ‖ = o ( 1 ) that I 1 = ε n × o p ( n − 1 / 2 ) . For I 2 , by the mean value theorem, we obtain that

where θ ~ lies between θ ^ n and θ ^ n ± ε n Π n v ∗ . By Theorem 2.8.3 of van der Vaart and Wellner [ 35 ], we know that { l ˙ ( θ , O ) [ Π n v ∗ ] : ‖ θ − θ 0 ‖ ≤ δ n } is a Donsker class. Therefore, by Theorem 2.11.23 of van der Vaart and Wellner [ 35 ], we have I 2 = ε n × o p ( n − 1 / 2 ) . For I 3 , note that

where θ ~ lies between θ 0 and θ , and the last equality follows from a Taylor expansion and Conditions (C1)–(C3). Therefore,

where the last equality holds due to the fact that δ n ‖ Π n v ∗ − v ∗ ‖ = o ( n − 1 / 2 ) , the Cauchy-Schwarz inequality, and ‖ Π n v ∗ ‖ 2 → ‖ v ∗ ‖ 2 . Combining the above facts with P l ˙ ( θ 0 , O ) [ v ∗ ] = 0 , we can establish that

Therefore, we obtain ∓ n 1 / 2 ( P n − P ) { l ˙ ( θ 0 , O ) [ v ∗ ] } ± n 1 / 2 < θ ^ n − θ 0 , v ∗ > + o p ( 1 ) ≥ 0 and then n 1 / 2 < θ ^ n − θ 0 , v ∗ >= n 1 / 2 ( P n − P ) { l ˙ ( θ 0 , O ) [ v ∗ ] } + o p ( 1 ) → d N ( 0 , ‖ v ∗ ‖ 2 ) by the central limit theorem and ‖ v ∗ ‖ 2 = ‖ l ˙ ( θ 0 , O ) [ v ∗ ] ‖ 2 .

Next we will prove that ‖ v ∗ ‖ 2 = b ′ I − 1 ( ϑ 0 ) b . For each component ϑ q , q = 1 , 2 , … , ( 2 p + 1 ) , we denote by ψ q ∗ = ( b 1 q ∗ , b 2 q ∗ ) the value of ψ q = ( b 1 q , b 2 q ) minimizing

where l ϑ is the score function for ϑ , l b j is the score operator for Λ j , j = 1 , 2 , and e q is a ( 2 p + 1 ) -dimensional vector of zeros whose q -th element equals 1.

Define the q -th element of l ∗ ( ϑ , O ) as l ϑ ⋅ e q − l b 1 [ b 1 q ∗ ] − l b 2 [ b 2 q ∗ ] , q = 1 , … , ( 2 p + 1 ) , and I ( ϑ ) as E ( { l ∗ ( ϑ , O ) } ⊗ 2 ) . By Condition (C5), the matrix I ( ϑ 0 ) is positive definite. Furthermore, following calculations similar to those in Chen et al. [ 4 ] (Sec. 3.2), we obtain

Thus, we have shown that n 1 / 2 ( ϑ ^ n − ϑ 0 ) → N ( 0 , I − 1 ( ϑ 0 ) ) in distribution for the estimator ϑ ^ n based on the complete data.

Now consider the estimator ϑ ^ n based only on the case-cohort data. Note that the weight p = ξ / π q ( Δ 1 , Δ 2 ) is bounded and does not depend on θ , and E { p | O } = 1 . By Theorem 3.2 of Saegusa and Wellner [ 29 ], we have

where I ( ϑ ) and l ∗ ( ϑ , O ) , defined above, are the information and efficient score for ϑ based on the complete data. Note that

Thus, we have

## Funding Statement

The work was partially supported by the U.S. National Science Foundation grant DMS-1916170, the National Natural Science Foundation of China grant 11671168, the Science and Technology Developing Plan of Jilin Province of China grant 20170101061JC, and the U.S. National Institute of Allergy and Infectious Diseases grant 1 R56 AI140953-01.

## Disclosure statement

No potential conflict of interest was reported by the author(s).

## Case-Cohort Studies with Interval-Censored Failure Time Data

Description.

Provides a sieve semiparametric likelihood approach under the proportional hazards model for analyzing data from a case-cohort design with failure times subject to interval-censoring. The likelihood function is constructed using inverse probability weighting, and the sieves are built with Bernstein polynomials. A weighted bootstrap procedure is implemented for variance estimation.

The implementation uses stats::optim() to minimize the negative log-likelihood. The method is hard-coded as "BFGS". Users can modify the 'control' input of optim() by passing named inputs through the ellipsis (...). If a call to optim() returns convergence = 1, i.e., optim() reached its internal maximum number of iterations before convergence was attained, the software automatically repeats the call to optim() with input variable par set to the last parameter values. This procedure is repeated at most maxit times.
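The restart-on-iteration-limit logic can be sketched as follows. A toy gradient-descent routine stands in for stats::optim(method = "BFGS"), and all names here are hypothetical illustrations rather than the package's internals; the convergence code 1 mimics optim()'s "iteration limit reached" status.

```python
# Sketch of the warm-restart loop described above (illustrative only).

def toy_optim(fn, grad, par, maxiter=50, tol=1e-8, step=0.1):
    """Stand-in optimizer; returns convergence = 0 on success, 1 at limit."""
    for _ in range(maxiter):
        g = grad(par)
        if max(abs(x) for x in g) < tol:
            return {"par": par, "convergence": 0}   # converged
        par = [p - step * x for p, x in zip(par, g)]
    return {"par": par, "convergence": 1}           # hit iteration limit

def fit_with_restarts(fn, grad, par0, max_restarts=10, **kw):
    """Repeat the call from the last parameter values, at most max_restarts times."""
    res = toy_optim(fn, grad, par0, **kw)
    for _ in range(max_restarts):
        if res["convergence"] != 1:
            break
        res = toy_optim(fn, grad, res["par"], **kw)  # warm restart
    return res

# Toy objective with minimum at (1, -2); deliberately small maxiter so the
# first call stops at its iteration limit and restarts are exercised.
fn = lambda b: (b[0] - 1) ** 2 + (b[1] + 2) ** 2
grad = lambda b: [2 * (b[0] - 1), 2 * (b[1] + 2)]
res = fit_with_restarts(fn, grad, [10.0, 10.0], maxiter=20)
```

The design point is that a hit iteration limit is treated as "not yet converged" rather than failure, so progress made so far is reused instead of discarded.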

Input parameters U, V, del1, and del2 are defined as follows. Suppose there are K follow-up examinations at times TE = (T1, T2, ..., TK), and the failure time is denoted as TF. For left-censored data, the failure occurs prior to the first follow-up examination (TF < T1); therefore, define U = T1, V = tau, and (del1,del2)=(1,0). For right-censored data, the failure has not yet occurred at the last follow-up examination (TF > TK); therefore, define U = 0, V = TK, and (del1,del2)=(0,0). For interval-censored data, the failure occurs between two follow-up examinations, e.g. T2 < TF < T3; therefore, define U and V to be the two consecutive follow-up examination times bracketing the failure time TF and (del1,del2)=(0,1).
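Under the conventions just stated, the encoding can be sketched as a small helper function; the function is hypothetical, written only to illustrate the mapping described above.

```python
# Illustrative encoding of (U, V, del1, del2) from examination times TE,
# failure time TF, and study length tau, per the conventions above.

def encode(TE, TF, tau):
    """Map examination times TE and failure time TF to (U, V, del1, del2)."""
    if TF < TE[0]:                  # left-censored: before the first visit
        return TE[0], tau, 1, 0
    if TF > TE[-1]:                 # right-censored: after the last visit
        return 0, TE[-1], 0, 0
    for a, b in zip(TE, TE[1:]):    # interval-censored: between two visits
        if a < TF <= b:
            return a, b, 0, 1
    raise ValueError("TF coincides with an examination time")

TE = [1.0, 2.0, 3.0, 4.0]
assert encode(TE, 0.5, tau=5.0) == (1.0, 5.0, 1, 0)   # left-censored
assert encode(TE, 4.7, tau=5.0) == (0, 4.0, 0, 0)     # right-censored
assert encode(TE, 2.6, tau=5.0) == (2.0, 3.0, 0, 1)   # interval-censored
```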

an object of class CaseCohort (inheriting from ICODS) containing

Zhou, Q., Zhou, H., and Cai, J. (2017). Case-cohort studies with interval-censored failure time data. Biometrika, 104(1): 17–29. <doi:10.1093/biomet/asw067>



Emerging Topics in Modeling Interval-Censored Survival Data, pp. 221–234.

## Case-Cohort Studies with Time-Dependent Covariates and Interval-Censored Outcome

- Xiaoming Gao
- Michael G. Hudgens
- Fei Zou
- First Online: 15 July 2022


Part of the ICSA Book Series in Statistics (ICSABSS).

In large cohort studies of rare diseases, the case-cohort design is widely used to assess associations between covariates and survival time (e.g., time until disease onset). In many settings, the event of interest is not observed at an exact time point but is only known to occur between two study visits. In this chapter, we consider fitting parametric survival models to data from case-cohort studies with interval-censored outcomes and both fixed and time-dependent covariates. Simulation results demonstrate that the proposed estimator is approximately unbiased and that the standard errors are well estimated by the sandwich estimators. The methods are applied to an observational study that examined the association between hormonal contraceptive use and risk of HIV acquisition.

- Case-cohort
- Interval censoring
- Survival analysis



## Acknowledgements

This work was supported by NIH grants R01HD077888, R37AI029168, 1UM1AI126619 and P30 AI050410. The authors thank Dr. Raina Fichorova and her laboratory at Brigham and Women’s Hospital and Harvard Medical School for generating and providing the biomarker data from the HC-HIV study. The authors also appreciate Dr. Charles Morrison for his insightful comments. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

## Author information

Authors and affiliations.

Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA

Xiaoming Gao, Michael G. Hudgens & Fei Zou


## Corresponding author

Correspondence to Michael G. Hudgens .

## Editor information

Editors and affiliations.

Department of Statistics, University of Missouri, Columbia, MO, USA

Jianguo Sun

College of Health Solutions, Arizona State University, Goodyear, AZ, USA

Ding-Geng Chen


## Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

## About this chapter


Gao, X., Hudgens, M.G., Zou, F. (2022). Case-Cohort Studies with Time-Dependent Covariates and Interval-Censored Outcome. In: Sun, J., Chen, DG. (eds) Emerging Topics in Modeling Interval-Censored Survival Data. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-031-12366-5_11

Published: 15 July 2022

Publisher: Springer, Cham

Print ISBN: 978-3-031-12365-8

Online ISBN: 978-3-031-12366-5

eBook Packages: Mathematics and Statistics (R0)



1 ∣. INTRODUCTION. Epidemiological and biomedical studies often encounter interval-censored failure time data, where time to the occurrence of an event or a disease is observed only to fall within some interval rather than known exactly. 1 One area that commonly produces interval-censored data is the AIDS clinical trials. In this case, investigators may be interested in times to AIDS for HIV ...

In this paper, we consider the case-cohort study design for interval-censored failure time and develop a sieve semiparametric likelihood approach for analyzing data from this design under the proportional hazards model. We construct the likelihood function using inverse probability weighting and build the sieves with Bernstein polynomials.

Among others, one general type of interval-censored data is case interval-censored data, meaning that for each study subject, there exists a sequence of observation times and one only observes if the failure event of interest occurs between two consecutive observation times ( ).

In this paper, we consider regression analysis under a case-cohort study with interval-censored failure time data, where the failure time is only known to fall within an interval instead of being exactly observed.

Among others, one general type of interval-censored data is case K interval-censored data, meaning that for each study subject, there exists a sequence of observation times and one only observes if the failure event of interest occurs between two consecutive observation times (Wang et al., 2016b, 2018). Several methods have been developed for ...

The case-cohort design has been widely used as a means of cost reduction in assembling or measuring expensive covariates in large cohort studies. The existing literature on the ca

Interval-censored data are a general type of time-to-event or failure time data where the failure time of interest is known or observed only to lie in an interval instead of being observed exactly.

Many authors have investigated the analysis of case-cohort studies but most of the existing methods are for right-censored failure time data [2,3,4, 14,15,16, 20, 23,24,25]. Several methods were also developed for the analysis of interval-censored case-cohort data but they apply only to some special types of interval-censored data [9, 18 ...

In this paper, we consider the case-cohort study design for interval-censored failure time and develop a sieve semiparametric likelihood method for analysing data from this design under the proportional hazards model. We construct the likelihood function using inverse probability weighting and build the sieves with Bernstein polynomials.

As discussed by Dr. Finkelstein in Chap. 1, interval-censored failure time data are a general type of failure time or time-to-event data that often occur in many areas, including demographical studies, epidemiological studies, medical or public health research and social science.

For the analysis of the interval-censored data arising from case-cohort studies, Gilbert et al. presented a midpoint imputation procedure and Li and Nan considered a special case of interval-censored data, current status data, where the failure time of interest is either left- or right-censored (Jewell and van der Laan ).

In this work, we formulate the case-cohort design with multiple interval-censored disease outcomes and also generalize it to nonrare diseases where only a portion of diseased subjects are sampled. We develop a marginal sieve weighted likelihood approach, which assumes that the failure times marginally follow the proportional hazards model.

Case-Cohort Studies with Interval-Censored Failure Time Data ... Provides a sieve semiparametric likelihood approach under the proportional hazards model for analyzing data from a case-cohort design with failure times subject to interval-censoring. ... Case-cohort studies with interval-censored failure time data. Biometrika, 104(1): 17-29 ...

Estimation of complier causal treatment effects under the case-cohort studies with interval-censored failure time data Yuqing Ma , Peijie Wang & Jianguo Sun Pages 3285-3307 | Received 07 Nov 2022, Accepted 25 May 2023, Published online: 07 Jun 2023 Cite this article https://doi.org/10.1080/00949655.2023.2220462 Full Article Figures & data

The existing literature on the case-cohort design is mainly focused on right-censored data. In practice, however, the failure time is often subject to interval-censoring; it is known only to fall within some random time interval. In this paper, we consider the case-cohort study design for interval-censored failure time and develop a sieve ...

## 2.1 Full Cohort

We begin by describing the model for the full cohort. For the \(i\)th participant with \(k_i + 1\) visits, let \(x_i\) be the time of the event of interest and \(\tau_{i0}, \dots, \tau_{ik_i}\) be the sequence of visit time points, where \(\tau_{i0} = 0\) corresponds to the baseline visit. For interval-censored observations, let \(t_{L_i}\) be the last observation time before the event and \(t ...
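As a minimal sketch of the interval-censoring structure described above, the snippet below derives the observed censoring interval from a participant's sorted visit times and the (unobserved) event time. The function name `censoring_interval` and the tuple representation are our own illustrative choices, not notation from the chapter; right-censoring is marked by a `None` upper bound.

```python
import bisect

def censoring_interval(visit_times, event_time):
    """Given sorted visit times (tau_i0 = 0, ..., tau_ik) and the true event
    time x_i, return the observed interval (t_L, t_R] bracketing the event.
    The upper bound is None when the event falls after the last visit,
    i.e. the observation is right-censored."""
    if event_time > visit_times[-1]:
        return visit_times[-1], None  # right-censored at the last visit
    # index of the first visit at or after the event time
    j = bisect.bisect_left(visit_times, event_time)
    return visit_times[j - 1], visit_times[j]

# Example: visits at months 0, 3, 6, 12; the event occurs at month 4.5,
# so it is only known to lie in the interval (3, 6].
print(censoring_interval([0, 3, 6, 12], 4.5))  # (3, 6)
```

In a full-cohort analysis each participant would contribute such a pair \((t_{L_i}, t_{R_i}]\) to the likelihood, rather than an exact failure time.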