Study population
This is an analytical, applied, data-oriented epidemiological study. The data refer to Iranian households' medical expenditures and were obtained from a national project entitled "The Households Income and Expenditure Survey (HIES)" in 2021. The data come from the Statistical Center of Iran (SCI; https://www.amar.org.ir)52. This information was provided to the researchers of this project in raw form and as a random sample. The final sample size, after applying the inclusion and exclusion criteria, was 8993 household heads in Iranian provinces.
To minimize potential biases that may result from self-reporting, the HIES used standardized procedures and strict quality control protocols implemented by SCI to ensure the reliability and validity of the data collected. The nationally representative sampling design of the survey and the subsequent data processing further reduced errors and inconsistencies prior to analysis.
The household data were kept confidential throughout the study. The study was conducted after approval from the Research Ethics Committee (REC) of Ahvaz Jundishapur University of Medical Sciences (AJUMS) with project number U-02034 and ethics code IR.AJUMS.REC.1402.064.
Predictor variables
Based on the information available, the 19 variables that the researchers believe may affect medical costs (MC) were grouped into two categories: information related to the household head (5 variables) and information related to the household (14 variables).
Information about the household head includes age (young adults; middle-aged adults; older adults), gender (male; female), education level (illiterate and elementary; below diploma; diploma and associate degree; bachelor's; MSc or PhD), marital status (married; widowed or divorced; single), and employment status (employed; unemployed; income without a job; other). Household variables include the number of family members (1; 2; 3; 4 or more), residential area (urban; rural), number of employed members (none; one; two or more), number of students in the family (none; one; two or more), number of educated persons in the family (none; one; two or more), type of home ownership (owned; mortgage, rent and other), subsidies (no; yes), internet access (no; yes), car ownership (no; yes), bicycle ownership (no; yes), annual family income, and annual household expenditure on food, clothing and housing (each below or above the national average). These variables were extracted and used in the analysis.
Outcome variable
In this study, the total medical costs of each household over one year (such as costs for dental and eye care, medication, addiction treatment, surgery, etc.) were taken as the outcome. The cost variable is either zero or a positive number. Assuming that the costs of healthcare services are correlated within each province and that there are differences and heterogeneity among these cities, the provincial cities in the center of Iran were treated as clusters, and random effects were included to account for this heterogeneity.
Semi-continuous data
Data on healthcare costs are often characterized as semi-continuous and exhibit a non-normal distribution with an unbalanced ratio of zeros to positive values34. In health economics and health services research, such data are widely used and pose a challenge for analysis because of their distinctive characteristics. Data of this type, known as semi-continuous data with zero inflation, arise in a wide range of areas, such as research on healthcare costs, medical care services, health assessments34,37, average daily alcohol consumption37,53, annual car insurance claims, and the relative abundance of the microbiome37,46.
In semi-continuous data with zero inflation, the presence of a substantial proportion of zeros alongside positively skewed values requires special treatment in statistical analysis. Failure to account for this feature when fitting regression models can lead to biased estimates, incorrect conclusions and ultimately misleading results. Accounting for the atypical distribution of semi-continuous data with zeros is therefore essential for the accuracy and validity of research results in many fields of study.
Two-part models for semi-continuous data
Semi-continuous data usually require two-part mixture models to capture both the discrete and the continuous aspects of the data. For independent observations, the standard form of the two-part model is:
$$f\left({y}_{i}\right)={\left(1-{\pi }_{i}\right)}^{{\mathbb{I}}_{({y}_{i}=0)}}\times {\left[{\pi }_{i}\,g\left({y}_{i}|{y}_{i}>0;{\mu }_{i},\sigma ,\kappa \right)\right]}^{{\mathbb{I}}_{({y}_{i}>0)}},\quad {y}_{i}\ge 0,\ i=1,\dots ,n$$
(1)
where \({\pi }_{i}=\Pr({Y}_{i}>0)\), \({\mathbb{I}}_{(\cdot )}\) is an indicator function, and \(g({y}_{i}|{y}_{i}>0)\) is a density that depends on a location parameter \({\mu }_{i}\), a positive scale parameter \(\sigma\), and a parameter \(\kappa \in \mathfrak{R}\) that determines the shape or skewness of the distribution. Commonly preferred densities include the gamma (G)54 or generalized gamma (GG)48,55,56, lognormal (LN)57,58, Weibull (W)59, and log-skew-normal (LSN)57,60, which are discussed further below.
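As an illustration of Eq. (1), the following minimal Python sketch evaluates the two-part log-density when the positive part is assumed to be lognormal (so that \(\kappa\) plays no role). The function names and the use of a common \(\pi\), \(\mu\) and \(\sigma\) for all observations are simplifying assumptions made for this example only; they are not taken from the authors' code.

```python
import math


def lognormal_logpdf(y, mu, sigma):
    """Log-density of a lognormal with log-scale location mu and scale sigma (y > 0)."""
    z = (math.log(y) - mu) / sigma
    return -math.log(y * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def two_part_logdensity(y, pi, mu, sigma):
    """Log of Eq. (1): (1 - pi) when y == 0, otherwise pi * g(y | y > 0)."""
    if y < 0:
        raise ValueError("y must be non-negative")
    if y == 0:
        return math.log(1.0 - pi)
    return math.log(pi) + lognormal_logpdf(y, mu, sigma)


def two_part_loglik(ys, pi, mu, sigma):
    """Log-likelihood of independent observations sharing pi, mu and sigma."""
    return sum(two_part_logdensity(y, pi, mu, sigma) for y in ys)
```

Because the density factorizes into a zero part and a positive part, the log-likelihood is a sum of a Bernoulli contribution and a lognormal contribution, which is why the two parts can be handled separately.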
In Eq. (1), covariates enter through two separate linear predictors, one for \({\pi }_{i}\) and one for \({\mu }_{i}\). An example is the conditional two-part (CTP) model61, in which a logit link is used for the binary part and a positive continuous distribution is used for \(g({y}_{i}|{y}_{i}>0)\). The model is structured as follows:
$$\text{Part I}: \text{logit}\left({\pi }_{i}\right)=\text{logit}\left[\text{Pr}\left({Y}_{i}>0\right)\right]={\boldsymbol{Z}}_{i}^{\prime}\boldsymbol{\alpha }={\alpha }_{0}+{z}_{1i}{\alpha }_{1}+\dots +{z}_{qi}{\alpha }_{q}$$
$$\text{Part II}: {\mu }_{i}=E\left[\text{ln}\left({Y}_{i}|{Y}_{i}>0\right)\right]={\boldsymbol{X}}_{i}^{\prime}\boldsymbol{\beta }={\beta }_{0}+{x}_{1i}{\beta }_{1}+\dots +{x}_{pi}{\beta }_{p},\quad i=1,\dots ,n$$
(2)
where \({\boldsymbol{Z}}_{i}^{\prime}\) is a \(1\times q\) covariate vector and \(\boldsymbol{\alpha }\) a \(q\times 1\) vector of regression coefficients in the binary part. Similarly, \({\boldsymbol{X}}_{i}^{\prime}\) is a \(1\times p\) covariate vector and \(\boldsymbol{\beta }\) a \(p\times 1\) vector of regression coefficients in the continuous part. Apart from the intercept, the components of \(\boldsymbol{\alpha }\) give the change in the log-odds of a positive response per unit change in the corresponding covariate, while the components of \(\boldsymbol{\beta }\) give the change per unit in the conditional mean of the logged positive values, \({\mu }_{i}=E\left[\text{ln}\left({Y}_{i}|{Y}_{i}>0\right)\right]\). The conditional interpretation of \(\boldsymbol{\beta }\) means that it assesses the effects of covariates among those who exhibit a positive response, rather than in the overall population.
In statistical modeling, researchers often focus on the effects of certain factors on the transformed marginal means. However, there are cases where it is necessary to examine the effect on the untransformed marginal mean, denoted \(E({Y}_{i})\), in order to draw conclusions about the overall population, which includes both users and non-users of health services44,45,46,50. To allow such inferences, Smith and colleagues introduced a marginalized two-part (MTP) model that directly parameterizes the effects of covariates on the marginal mean51. The MTP model is specified as:
$$\text{Part I}: \text{logit}\left({\pi }_{i}\right)={\boldsymbol{Z}}_{i}^{\prime}\boldsymbol{\alpha }$$
$$\text{Part II}: E\left({Y}_{i}\right)={\nu }_{i}=\text{exp}\left({\boldsymbol{X}}_{i}^{\prime}\boldsymbol{\beta }\right),\quad i=1,\dots ,n$$
(3)
In this context, \(\boldsymbol{\alpha }\) has the same meaning as in the CTP model and represents a vector of log-odds ratios. The model allows covariate effects on the overall marginal mean, and their standard errors, to be estimated through linear combinations of the parameters in the second part. Specifically, \(\text{exp}({\beta }_{k})\) is the multiplicative effect on the overall mean when the kth covariate increases by one unit. With this parameterization, the marginal means predicted by the model and their standard errors can be obtained directly by evaluating \(\text{exp}({\boldsymbol{X}}_{i}^{\prime}\boldsymbol{\beta })\) at the desired covariate values.
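The interpretation of \(\text{exp}({\beta }_{k})\) can be checked numerically. The sketch below is illustrative only: the coefficient values are hypothetical, and the expression for the lognormal location is not quoted from the source but follows from requiring \({\pi }_{i}\,\text{exp}({\mu }_{i}+{\sigma }^{2}/2)={\nu }_{i}\) when the positive part is lognormal.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def mtp_marginal_mean(x, beta):
    """nu_i = exp(x' beta), Part II of the MTP model."""
    return math.exp(sum(xk * bk for xk, bk in zip(x, beta)))


def implied_lognormal_location(x, beta, z, alpha, sigma):
    """mu_i for which pi_i * exp(mu_i + sigma^2 / 2) equals nu_i (lognormal positive part)."""
    pi = expit(sum(zk * ak for zk, ak in zip(z, alpha)))
    return math.log(mtp_marginal_mean(x, beta)) - math.log(pi) - 0.5 * sigma ** 2


# Hypothetical coefficients: intercept plus one binary covariate in each part.
beta = [2.0, 0.3]
alpha = [0.5, -0.2]

# Ratio of marginal means when the covariate moves from 0 to 1.
ratio = mtp_marginal_mean([1.0, 1.0], beta) / mtp_marginal_mean([1.0, 0.0], beta)
```

Under these assumptions `ratio` equals \(\text{exp}(0.3)\) whatever the value of \(\boldsymbol{\alpha }\), which is the sense in which the second-part coefficients describe the whole population rather than only households with positive costs.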
In this class of models, different distributions can be used to analyze semi-continuous data. Based on AIC and BIC, we used the Vuong test (V) to determine whether the zero and positive components of the cost variable are generated by different processes62. This non-nested hypothesis test produces a Z statistic: a value greater than 1.96 supports the alternative hypothesis that the first model fits the data better, while a value less than −1.96 indicates that the second model provides a better fit. In our analysis, we evaluated our independent two-equation model against a Tobit model that accounts for interdependence. The calculated test statistic was 69,729.1, and since V is greater than 1.96, we find evidence supporting the hypothesis that the two processes are independent. Although Tobit or Heckman models could account for interdependence, their use is not justified here. The principle of parsimony favors the simpler independent two-equation model, which captures the data well and provides more interpretable coefficients without the added complexity of interdependence. In particular, the MTP model is flexible in that it accommodates a range of distributions and variance structures34,44,50,51,63. For the marginalized two-part (MTP) model, we selected the lognormal and gamma distributions based on their theoretical suitability for modeling right-skewed, non-negative medical cost data, their frequent use in related studies, and their applicability to the multilevel structure of the data considered in the remaining part of the models.
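For reference, the unadjusted Vuong statistic can be computed from the pointwise log-likelihood contributions of two fitted models. The sketch below is a generic illustration and not the authors' implementation: the function names are hypothetical, and no correction for the number of parameters is applied.

```python
import math
import statistics


def vuong_statistic(loglik_model1, loglik_model2):
    """Unadjusted Vuong Z statistic from per-observation log-likelihoods."""
    if len(loglik_model1) != len(loglik_model2):
        raise ValueError("both models must be evaluated on the same observations")
    diffs = [a - b for a, b in zip(loglik_model1, loglik_model2)]
    sd = statistics.pstdev(diffs)  # standard deviation with divisor n
    if sd == 0.0:
        raise ValueError("log-likelihood differences have zero variance")
    return math.sqrt(len(diffs)) * statistics.fmean(diffs) / sd


def vuong_decision(v, critical=1.96):
    """Decision rule described in the text, at the two-sided 5% level."""
    if v > critical:
        return "model 1 preferred"
    if v < -critical:
        return "model 2 preferred"
    return "models not distinguishable"
```

The statistic is antisymmetric in the two models, so swapping their order only changes its sign.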
Multilevel models in cluster analysis
According to Ning Li's research, data can be organized in a nested or hierarchical format, where the hierarchy means that observations within the same group or context share similarities that imply a degree of homogeneity64. Consequently, a mixed-effects framework can be used to describe a two-level model for semi-continuous data. The first level consists of observations \((i=1,\dots ,{n}_{j})\) nested in level-two units \((j=1,\dots ,m)\), which refer to the center provinces.
The model's parameterization is divided into two parts that are fitted separately.
In Part I, the binary outcome is modeled as:
$$\text{logit}\left({\pi }_{ij}\right)=\text{logit}\left(\text{Pr}\left({Y}_{ij}>0\right)\right)={\boldsymbol{Z}}_{ij}^{\prime}\boldsymbol{\alpha }+{b}_{1i}$$
(4)
where \({b}_{1i}\sim N(0,{\sigma }_{b1}^{2})\) is the random effect that accounts for the correlation within a cluster (level 2) in the zero part.
Taking the logarithm as the link function \(g\), the location parameter \({\mu }_{ij}\) for the continuous component in Part II is modeled as:
$$g\left(E\left({Y}_{ij}|{Y}_{ij}>0\right)\right)=\text{log}\left({\mu }_{ij}|{Y}_{ij}>0\right)={\boldsymbol{X}}_{ij}^{\prime}\boldsymbol{\beta }+{b}_{2i}$$
(5)
where \({b}_{2i}\sim N(0,{\sigma }_{b2}^{2})\) is the random effect that accounts for the correlation within a cluster (level 2) in the continuous part. These random effects capture unobserved characteristics or factors that may influence the outcome variable within each cluster. Including them in the model accounts for the clustering of observations and gives a better estimate of the relationship between the predictors and the outcome variable. Here it is assumed that the random effects \({b}_{1i}\) and \({b}_{2i}\), which pertain to the zero and non-zero processes, are independent and uncorrelated.
\({\boldsymbol{Z}}_{ij}^{\prime}\) denotes the covariates for the i-th subject in the j-th cluster in the binary part, and \({\boldsymbol{X}}_{ij}^{\prime}\) denotes the covariates for the i-th subject in the j-th cluster in the continuous part. The two parts may share covariates or use completely different ones. \(\boldsymbol{\alpha }\) is the vector of coefficients for the binary part, while \(\boldsymbol{\beta }\) is the vector of coefficients for the continuous part, conditional on the values being non-zero.
For a two-part (TP) model, the marginal mean and variance of \({Y}_{ij}\) can be derived as follows:
$$E\left({Y}_{ij}\right)={\pi }_{ij}E\left({Y}_{ij}|{Y}_{ij}>0\right)$$
$$Var\left({Y}_{ij}\right)={\pi }_{ij}\left[E\left({Y}_{ij}^{2}|{Y}_{ij}>0\right)-{\pi }_{ij}{E\left({Y}_{ij}|{Y}_{ij}>0\right)}^{2}\right]$$
(6)
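Equation (6) follows from \(E({Y}_{ij}^{2})={\pi }_{ij}E({Y}_{ij}^{2}|{Y}_{ij}>0)\) together with \(Var({Y}_{ij})=E({Y}_{ij}^{2})-{E({Y}_{ij})}^{2}\). A minimal numerical sketch is given below; the function names are hypothetical, and the lognormal conditional moments are standard results added here for illustration.

```python
import math


def two_part_moments(pi, cond_mean, cond_second_moment):
    """Marginal mean and variance of a two-part variable, as in Eq. (6)."""
    mean = pi * cond_mean
    variance = pi * (cond_second_moment - pi * cond_mean ** 2)
    return mean, variance


def lognormal_conditional_moments(mu, sigma):
    """E(Y | Y > 0) and E(Y^2 | Y > 0) when ln(Y) | Y > 0 is N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2), math.exp(2.0 * mu + 2.0 * sigma ** 2)


# Example: half of the households have positive costs; the conditional
# mean is 2 and the conditional second moment is 5 (hypothetical values).
mean, variance = two_part_moments(0.5, 2.0, 5.0)
```

When the positive part is degenerate at 1 (for example \(\mu =0\), \(\sigma =0\)), the variance reduces to the Bernoulli variance \({\pi }_{ij}(1-{\pi }_{ij})\), which is a convenient check of the formula.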
When a lognormal distribution is assumed for the continuous part, the marginal mean is
$$E\left({Y}_{ij}\right)={\pi }_{ij}\times \text{exp}\left\{{\mu }_{ij}+\frac{{\sigma }^{2}}{2}\right\}=\frac{1}{1+\text{exp}\left\{-\left({\boldsymbol{Z}}_{ij}^{\prime}\boldsymbol{\alpha }+{b}_{1i}\right)\right\}}\times \text{exp}\left\{{\boldsymbol{X}}_{ij}^{\prime}\boldsymbol{\beta }+{b}_{2i}+\frac{{\sigma }^{2}}{2}\right\}$$
(7)
and when a gamma distribution is assumed for the continuous part, the marginal mean is
$$E\left({Y}_{ij}\right)={\pi }_{ij}\times \text{exp}\left\{{\mu }_{ij}\right\}=\frac{1}{1+\text{exp}\left\{-\left({\boldsymbol{Z}}_{ij}^{\prime}\boldsymbol{\alpha }+{b}_{1i}\right)\right\}}\times \text{exp}\left\{{\boldsymbol{X}}_{ij}^{\prime}\boldsymbol{\beta }+{b}_{2i}\right\}$$
(8)
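The two marginal means in Eqs. (7) and (8) differ only by the factor \(\text{exp}({\sigma }^{2}/2)\). The sketch below evaluates both for given linear predictors and random effects, taking \({\pi }_{ij}\) as the inverse logit of the Part I linear predictor in Eq. (4). The function and argument names are hypothetical and chosen for this example.

```python
import math


def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))


def marginal_mean(z_alpha, x_beta, b1=0.0, b2=0.0, sigma=None):
    """Marginal mean of a two-level two-part model.

    z_alpha, x_beta: values of the linear predictors Z'alpha and X'beta.
    b1, b2: random effects of the binary and continuous parts.
    sigma: log-scale standard deviation for the lognormal case (Eq. 7);
           leave as None for the gamma case with a log link (Eq. 8).
    """
    pi = expit(z_alpha + b1)
    shift = 0.5 * sigma ** 2 if sigma is not None else 0.0
    return pi * math.exp(x_beta + b2 + shift)


gamma_mean = marginal_mean(0.4, 1.5, b1=0.1, b2=-0.2)
lognormal_mean = marginal_mean(0.4, 1.5, b1=0.1, b2=-0.2, sigma=0.9)
```

For these hypothetical inputs the ratio `lognormal_mean / gamma_mean` equals \(\text{exp}({0.9}^{2}/2)\), so fitted coefficients from the two specifications are not directly comparable without this adjustment.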
In the binary part, the \(\boldsymbol{\alpha }\) estimates describe, on the log scale, the odds of a positive value in the population; on the exponential scale, \(\text{exp}(\alpha )\) is the odds ratio for a one-unit increase in the covariate. In the continuous part, the \(\boldsymbol{\beta }\) estimates apply only to the non-zero (positive) values, a subset of the data. When a log link is used, \(\text{exp}(\beta )\) gives the multiplicative change in the overall mean as the covariate increases by one unit, given that the observation is not zero. To summarize, the binary part estimates the odds of non-zero values in the population, while the continuous part gives the effects on the population mean when the values are non-zero. In what follows, to simplify the presentation, we refer to the marginalized two-part lognormal and marginalized two-part gamma models as MTP-LN and MTP-G, respectively.
Parameter estimation and inference for MTP
Let \(n=\sum_{j=1}^{m}{n}_{j}\) be the total number of subjects (which equals \({n}_{j}\times m\) when all clusters have the same size), and assume that subjects \((i=1,\dots ,{n}_{j})\) in different clusters \((j=1,\dots ,m)\) are independent. Given the random effects \({b}_{1i}\) and \({b}_{2i}\), the likelihood function can be written as:
$$L\left(\alpha ,\beta |{b}_{1},{b}_{2}\right)=\prod_{i=1}^{{n}_{j}}\prod_{j=1}^{m}\iint {\left(1-{\pi }_{ij}\right)}^{{\mathbb{I}}_{\left({y}_{ij}=0\right)}}{\left[{\pi }_{ij}\,g\left({y}_{ij}|{y}_{ij}>0,{b}_{2i}\right)\right]}^{{\mathbb{I}}_{\left({y}_{ij}>0\right)}}\varphi \left({b}_{1}\right)\varphi \left({b}_{2}\right)d{b}_{1i}\,d{b}_{2i}$$
(9)
where \({n}_{j}\) is the number of subjects in cluster j, \({\pi }_{ij}\) is given by Eq. (4) if the logit link function is used in Part I, \(g({y}_{ij}|{y}_{ij}>0)\) depends on the distributional assumption for \({y}_{ij}>0\) (lognormal or gamma), and \(\varphi \left({b}_{1}\right)\) and \(\varphi \left({b}_{2}\right)\) are the normal densities of the two random effects \({b}_{1}\) and \({b}_{2}\).
The likelihood in Eq. (9) requires integrating a nonlinear function over the two random effects. To obtain maximum likelihood estimators for \(\alpha\), \(\beta\), and the random effects, numerical optimization must therefore be combined with a technique for approximating the integral. Some researchers used a high-order Laplace approximation to estimate the marginal likelihood and an approximate Fisher scoring algorithm for maximization65,66,67. Similarly, Tooze et al. used a quasi-Newton algorithm together with adaptive Gaussian quadrature for likelihood maximization68. Hubin69 and Wang70 also investigated the Integrated Nested Laplace Approximation (INLA) and a generalized version of the Fisher scoring method for estimating the marginal likelihood and maximizing the likelihood, respectively.
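To make the integration step concrete, the sketch below approximates an expectation over a normal random effect with a fixed three-point Gauss-Hermite rule and applies it to the zero part of a single cluster. This is a deliberately simplified, non-adaptive version given for illustration: the function names are hypothetical, only one random effect is integrated out, and practical implementations use more nodes and center and scale them for each cluster.

```python
import math

# Three-point Gauss-Hermite rule for integrals of exp(-x^2) * h(x).
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (
    math.sqrt(math.pi) / 6.0,
    2.0 * math.sqrt(math.pi) / 3.0,
    math.sqrt(math.pi) / 6.0,
)


def normal_expectation(h, sd):
    """Approximate E[h(b)] for b ~ N(0, sd^2) using b = sqrt(2) * sd * x."""
    total = sum(w * h(math.sqrt(2.0) * sd * x) for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / math.sqrt(math.pi)


def cluster_zero_part_likelihood(is_positive, linear_predictors, sd_b1):
    """Likelihood of one cluster's zero/positive indicators with b1 integrated out.

    is_positive: booleans, True when the observation is greater than zero.
    linear_predictors: values of Z'alpha for the same observations.
    """
    def conditional_likelihood(b1):
        value = 1.0
        for positive, eta in zip(is_positive, linear_predictors):
            p = 1.0 / (1.0 + math.exp(-(eta + b1)))
            value *= p if positive else 1.0 - p
        return value

    return normal_expectation(conditional_likelihood, sd_b1)
```

A three-point rule integrates polynomials up to degree five exactly, so for instance `normal_expectation(lambda b: b * b, 2.0)` recovers the variance 4; for the logistic terms in the likelihood it is only an approximation whose accuracy improves with more nodes.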
In this article, we use different methods to estimate the parameters of the models. For the MTP-LN model, we use adaptive Gauss-Hermite quadrature as the integration method and an optimization method that combines hybrid EM and quasi-Newton approaches. For the MTP-G model, we use the Laplace approximation as the integration method and maximum likelihood estimation via 'TMB' (Template Model Builder) as the optimization method. These choices meet the particular requirements of each model and support accurate parameter estimation and robust model fitting for both the MTP-LN and MTP-G models. The approach can be implemented in widely used standard statistical packages. Data cleaning, statistical analyses, and data visualization were carried out primarily in R (version 4.3.2)71. The corresponding code is included in Appendix A for reference. Maps were created using the free plan of the Datawrapper website (https://app.datawrapper.de)72.
Model fit assessment
The log-likelihood \((LL)\) obtained by maximum likelihood estimation indicates how well a model fits the data, with higher values indicating a better fit. However, when comparing different models, it is more appropriate to use information criteria such as the deviance \((D=-2LL)\), the Akaike information criterion \((AIC=-2LL+2k)\), and Schwarz's Bayesian information criterion \((BIC=-2LL+k\,\text{log}\,n)\), where n is the sample size (the number of data points in X) and k is the number of estimable parameters73. These criteria are based on the log-likelihood function but include a penalty for the number of parameters in the model, which helps to prevent overfitting. The model with the smallest value of the information criterion is generally preferred, as it represents the best balance between model fit and complexity. Using consistent, large-sample data, we also evaluate the time to convergence of the models.
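The criteria above are simple functions of the maximized log-likelihood. A minimal sketch follows; the log-likelihood values and parameter counts are hypothetical and are not results of this study, and only the sample size of 8993 is taken from the text.

```python
import math


def information_criteria(loglik, k, n):
    """Deviance, AIC and BIC from a maximized log-likelihood.

    loglik: maximized log-likelihood (LL).
    k: number of estimable parameters.
    n: sample size.
    """
    deviance = -2.0 * loglik
    return {
        "deviance": deviance,
        "AIC": deviance + 2.0 * k,
        "BIC": deviance + k * math.log(n),
    }


def preferred_model(fits, criterion="AIC"):
    """Name of the model with the smallest value of the chosen criterion.

    fits: mapping from model name to a (loglik, k, n) tuple.
    """
    return min(fits, key=lambda name: information_criteria(*fits[name])[criterion])


# Hypothetical fits of two candidate models to the same 8993 households.
example_fits = {
    "MTP-LN": (-41250.0, 24, 8993),
    "MTP-G": (-41310.0, 24, 8993),
}
best = preferred_model(example_fits, criterion="BIC")
```

When two models have the same number of parameters and the same sample size, as in this hypothetical example, all three criteria give the same ranking as the log-likelihood itself.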
To assess the fit of the models, scatter plots and heat maps were created to compare the observed values with the fitted values for the MTP models. These plots provide a visual representation of how well the models predict the data, allowing a more comprehensive evaluation of their performance.