Statistical analysis on length of stay in hospital

ABSTRACT
The growing financial pressure on healthcare institutions makes studies of resource distribution increasingly important. Among these studies, the length of stay (LOS) of hospital patients has recently attracted much attention, since it contributes to a better understanding of hospital costs and helps institutions control them. This paper studies the length of stay of hospital inpatients. Although predicting the length of stay is difficult, it is useful and beneficial to determine the key factors that influence it. This paper also serves as the basis for a running example that illustrates alternative models of the length of stay of hospital patients. A total of 1189 episodes, each containing a patient record, were analyzed using parametric and nonparametric statistical methods. Several factors are first considered and investigated, including date of admission, medical admission unit, diagnosis result, International Classification of Diseases (ICD) code, age, province, profession, recovery status at discharge, and ethnicity. Multiple regression analysis was also carried out to model length of stay as a function of several independent variables. Because LOS is a count of inpatient days, the family of Poisson distributions is used in this study; this choice is also supported by the corresponding histogram. Univariate analyses showed that age, province, profession, admission quarter, recovery status at discharge, and disease significantly influence LOS. Finally, the multivariate regression model indicated that type of disease, admission quarter, age group, and profession are the key factors influencing LOS. These results may have economic and clinical implications for both patients and hospitals.


INTRODUCTION
With health care costs increasing rapidly, governments and humanitarian funding agencies are looking for mechanisms to control costs and evaluate the effectiveness of medical care. One of the key factors with a high impact on the medical examination and treatment process, and especially on its cost, is the length of stay (LOS) of inpatients. A valid approximation of LOS, together with accurate identification of the factors that influence it, can significantly improve hospital discharge planning. This information may also help patients and their families to prepare better. Many statistical models of inpatient LOS have been studied [1-4]. These studies used data sets of about 1000 to 2000 patients with various pathologies at local hospitals, and their results show that many factors influence the length of an inpatient hospital stay. Regarding research on inpatients in Vietnam, Viet et al. [5] studied how the number of inpatients changed during 2003-2007. However, the authors used only descriptive statistics; neither inferential statistics nor modeling was carried out. In this study, we explore factors that potentially influence LOS at a local hospital. The aim is to determine whether LOS differs among the categories defined by the independent variables, and which factors significantly predict the variation in LOS. The rest of this paper is arranged as follows. The Method section introduces the methods used in this research. The Results and Multivariate Analysis sections present the main results of the paper. Finally, the Discussion and Conclusion section gives some discussion and conclusions.

Subjects
The hospital data for the period December 2015 to December 2016 were obtained from the database system of a hospital in the Central Highlands of Vietnam, resulting in a total of 1209 episodes. The ICD-9 and ICD-10 reference tables published by the World Health Organization (WHO) (see Cartwright [6] and Weatherspoon [7] for more details) were used to classify the diseases and other health problems recorded in the database.
The ICD-10 diagnosis codes were grouped based on advice from local clinicians. To ensure an adequate sample size, groups containing less than 0.8% of all episodes were excluded, which reduced the number of episodes to 1189. The following major diagnostic categories were considered: i) certain infectious and parasitic diseases, ii) diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism, iii) diseases of the nervous system, iv) diseases of the ear and mastoid process, v) diseases of the circulatory system, vi) diseases of the respiratory system, vii) diseases of the digestive system, viii) diseases of the skin and subcutaneous tissue, ix) diseases of the musculoskeletal system and connective tissue, x) diseases of the genitourinary system, xi) injury, poisoning, and certain other consequences of external causes. The dependent variable in the study is LOS. The factors a) age, b) the province in which the patient resides, c) residential community, d) profession, e) seasonality of admission and discharge, f) day of the week of admission, g) status of the patient at discharge, h) gender, i) ethnicity, and j) diagnosis classification were analyzed for their influence on LOS.

Statistical Analysis
The Shapiro-Wilk test was used to test the normality of LOS within each group. If the distribution was reasonably normal, the t-test and ANOVA were used to test for a significant effect on LOS. Otherwise, the Mann-Whitney test was used to compare LOS between two groups, and the Kruskal-Wallis test was used to compare LOS among more than two groups. Results were considered significant if the p-value was less than 5%. An appropriate transformation of LOS was performed before Poisson regression was applied. The data were randomly split into a training set containing 90% of the records and a test set containing the remaining 10%. The analysis in this study was carried out using R.
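As an illustration of what the Kruskal-Wallis test computes, the following minimal sketch ranks the pooled observations (tied values share their mean rank) and forms the tie-corrected H statistic. It is written in Python with the standard library only, which is our choice for illustration; the study itself used R. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks for values; tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of positions start..end, 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [x for group in groups for x in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    # Sum over groups of (rank sum)^2 / group size.
    term = 0.0
    offset = 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        term += rank_sum ** 2 / len(group)
        offset += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_sum / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


# Made-up LOS values (days) for three hypothetical groups, not study data.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat = kruskal_wallis_h(toy_groups)
tied_ranks = average_ranks([10, 20, 20, 30])
```

Under the null hypothesis, H is approximately chi-square distributed with k − 1 degrees of freedom for k groups; the p-value computation is omitted here to keep the sketch dependency-free.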

RESULTS
The overall average LOS was 15.7 days with a standard deviation of 4.4 days, whereas the median LOS was 16 days and the interquartile range (IQR) was 5 days. This indicates a negatively skewed distribution of LOS.
The average age of patients was 51 years (range 1 to 84). The plot of LOS versus age shows no pattern. However, when the patients are divided into four age groups (1-17, 18-39, 40-60, and 61-84), the boxplot shows an increasing pattern. The Kruskal-Wallis test indicates that LOS differs significantly among the age groups (p = 2.096 × 10⁻⁷). Moreover, the Dunn test in Table 1 shows a significant increasing tendency of LOS across the age groups, reflecting the fact that older people need more time to recover from illness. For convenience, we use [*], [**], and [***] to indicate that 0.01 < p < 0.05, 0.005 < p < 0.01, and p < 0.005, respectively.
The provinces where the patients reside were collapsed into two groups: the local province and the others. By the Wilcoxon rank-sum test, LOS is significantly shorter for those who reside locally (p = 0.01451).
The patients' professions were classified into six groups: government officers (1), intellectuals (2), senior citizens (3), students (4), farmers (5), and others (6). The Kruskal-Wallis test confirms that LOS differs significantly among these groups (p = 2.535 × 10⁻⁷). Table 2 provides more details on the multiple comparisons between the groups. The profession groups can thus be arranged in increasing order of LOS as follows: intellectuals (2) and students (4), government officers (1) and farmers (5), senior citizens (3) and others (6).
Patients' dates of admission to the hospital were categorized into quarters: January to March (1), April to June (4), July to September (7), and October to December (10). The Kruskal-Wallis test shows a significant difference in LOS among the quarters with p < 2.2 × 10⁻¹⁶ (Table 3).
In particular, LOS is longer during the second and third quarters of the year and shorter during the other quarters. In addition, after omitting the 12 records of patients admitted during weekends, the Kruskal-Wallis test shows that LOS does not differ significantly among weekdays (p-value > 0.5).
There are three categories for the status of patients at discharge: fully recovered, partially recovered, and unrecovered. This status significantly affected LOS (p = 7.765 × 10⁻⁶). The median LOS for those who had not recovered was 5 days, while the median LOS for those fully or partially recovered was 16 days (Table 4). There was no significant effect of ethnicity on LOS; the p-value of the corresponding Kruskal-Wallis test is 0.228. The median LOS for patients of majority ethnic groups was 16 days, while that of minority ethnic groups was 17 days. In addition, slightly more women (53.7%) than men (46.3%) were admitted to the hospital, and there is no significant evidence that gender influences LOS (p = 0.1944, Wilcoxon rank-sum test).
Concerning the patients' diseases, the WHO disease classification standard was used to categorize the recorded diseases into: certain infectious and parasitic diseases (1), diseases of the nervous system (6), diseases of the ear and mastoid process (8), diseases of the circulatory system (9), diseases of the respiratory system (10), diseases of the digestive system (11), diseases of the skin and subcutaneous tissue (12), diseases of the musculoskeletal system and connective tissue (13), diseases of the genitourinary system (14), and injury, poisoning and certain other consequences of external causes (19). The Kruskal-Wallis test shows that the disease group affects LOS (p < 2.2 × 10⁻¹⁶). Furthermore, the Dunn multiple comparison test (Table 5) allows us to conclude that the LOS of patients in the following groups is significantly longer than in the other groups: diseases of the nervous system (6), diseases of the circulatory system (9), diseases of the musculoskeletal system and connective tissue (13), and injury, poisoning and certain other consequences of external causes (19). Table 6.

MULTIVARIATE ANALYSIS
In this section, the Poisson regression model, a popular multivariate statistical method for count data, is applied to investigate which factors truly affect LOS and to predict LOS. Figure 1 suggests that LOS itself does not follow a Poisson distribution. We therefore first apply the transformation TLOS = 35 − LOS so that the distribution more closely resembles a Poisson distribution (see Figure 2).
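As a rough check that uses only the summary statistics reported in the RESULTS section (mean LOS 15.7 days, standard deviation 4.4 days), note that the reflection TLOS = 35 − LOS shifts the mean but leaves the variance unchanged, and turns the long left tail of LOS into a right tail. The resulting mean and variance are nearly equal, which is the moment property expected of a Poisson variable. This is an observation drawn from the reported numbers; the source does not state how the constant 35 was chosen.

```latex
\begin{align*}
\mathbb{E}[\mathrm{TLOS}] &= 35 - \mathbb{E}[\mathrm{LOS}] \approx 35 - 15.7 = 19.3, \\
\operatorname{Var}[\mathrm{TLOS}] &= \operatorname{Var}[\mathrm{LOS}] \approx 4.4^{2} = 19.36 .
\end{align*}
```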

Figure 2: The transformed LOS distribution
Based on a test statistic proposed by Hoaglin [8,9], we obtain the plot in Figure 3. The linear pattern in Figure 3 suggests that TLOS may follow a Poisson distribution, so we fit the data with a multiple Poisson regression model. We built the model by testing the inclusion of each new independent variable, one at a time. Using the results of the univariate tests, we ordered the factors by increasing p-value (see Table 7). A variable was kept in the model if its addition was significant at the 5% significance level. Note that the factor "recovery status" is not included in the model: it is recorded only at discharge, so using it to predict LOS makes no sense. The resulting Poisson regression model is summarized in Table 9. The factor "province" was excluded from this model since its p-value was 0.634905. Note that only 90% of the data set was used to train the model. In this table, the reference baselines are: certain infectious and parasitic diseases (for the factor "disease"), quarter I (for the factor "quarter admission"), the age group under 18 years old (for the factor "age"), and government officers (for the factor "profession").
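The source cites Hoaglin's statistic without stating it. A common form of this diagnostic is the Poissonness plot, which plots φ_k = ln(n_k) + ln(k!) − ln(N) against k, where n_k is the number of observations equal to k and N is the sample size. For Poisson data with mean λ, the points lie close to a straight line with slope ln λ and intercept −λ. The Python sketch below assumes that this is the statistic behind Figure 3; the function names are hypothetical, and the input frequencies are idealized Poisson values rather than the study data.

```python
import math


def poissonness_points(freq):
    """Map frequencies {k: n_k} to Poissonness-plot points (k, phi_k).

    phi_k = ln(n_k) + ln(k!) - ln(N). For Poisson data with mean lam,
    phi_k is approximately -lam + k * ln(lam), a straight line in k.
    """
    total = sum(freq.values())
    return [
        (k, math.log(n_k) + math.lgamma(k + 1) - math.log(total))
        for k, n_k in sorted(freq.items())
        if n_k > 0  # ln(0) is undefined, so empty cells are skipped
    ]


def fit_line(points):
    """Ordinary least-squares slope and intercept through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Idealized expected frequencies under Poisson(19.3) for k = 0..60.
# Illustrative only: 19.3 is 35 minus the reported mean LOS, and 1189 is
# the number of episodes; these are not the observed TLOS counts.
lam = 19.3
ideal = {
    k: 1189 * math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    for k in range(0, 61)
}
slope, intercept = fit_line(poissonness_points(ideal))
lam_from_slope = math.exp(slope)
```

With real data the points scatter around the line, and the more complete versions of the plot add confidence intervals for each point; those refinements are omitted here.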
The AIC of the fitted model was 5950, and the residual deviance was 793.95 on 1048 degrees of freedom. A summary of the deviance residuals is given in Table 9. Using the remaining 10% of the data, we tested the model and computed the errors, defined as the differences between the predicted and observed values (Table 8). The values IQR = 3.134 and SD = 3.819 suggest that the prediction was reasonably accurate.
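The hold-out evaluation described above can be sketched as follows: compute the errors (predicted minus observed) on the test records, then summarize their spread with the IQR and the sample standard deviation. The sketch is in Python with hypothetical names and made-up values. The source does not say whether the errors were computed on the LOS or the TLOS scale; because TLOS = 35 − LOS is a reflection, the errors only change sign between the two scales, so the IQR and SD are the same either way.

```python
import statistics


def summarize_errors(predicted, observed):
    """Return the IQR and sample SD of the errors (predicted minus observed)."""
    errors = [p - o for p, o in zip(predicted, observed)]
    # method="inclusive" interpolates between order statistics in the same
    # way as the default quantile definition in R.
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {"iqr": q3 - q1, "sd": statistics.stdev(errors)}


# Made-up predictions and observations (days) for five test records.
predicted = [16.0, 14.5, 18.0, 15.0, 17.5]
observed = [15, 16, 17, 15, 20]
summary = summarize_errors(predicted, observed)

# Negating the errors (switching between LOS and TLOS) leaves the spread unchanged.
flipped = summarize_errors(observed, predicted)
```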

DISCUSSION AND CONCLUSION
There is no doubt that understanding the factors that influence LOS is helpful for patients, hospitals, and the health service system. This study provides not only information about the potential effect of many factors on LOS but also a model for predicting LOS, which supports better resource planning and decision making. To our knowledge, similar research has not been conducted in Vietnam, so this study complements and extends current research on LOS. Univariate analyses showed that the following factors have a significant influence on LOS: age, province, profession, admission quarter, recovery status at discharge, and disease. It is not surprising that patients' professions affect LOS, since a profession influences lifestyle, which in turn may have some impact on LOS. Moreover, intellectuals and students usually have good knowledge of their health and thus tend to live healthily. The senior citizen group stays longest at the hospital, since older people are not as strong as younger ones and often need more time to recover from disease. Concerning the admission quarter, people are more likely to fall ill from April to September, especially during the summer. In addition, people normally want to stay at home with their families during the Tet holiday and therefore leave the hospital as soon as possible in January and February. Many research papers have shown that the admission day of the week affects LOS, which was not the case in our study. This might be due to the small number of weekend admissions (only 12 records). If a patient's treatment does not show the expected progress after a few days, he or she is often transferred to another hospital with better doctors. That is, a patient who could not get appropriate treatment at the hospital is likely to leave sooner than the others. The factor "ethnicity" does not significantly affect LOS.
This is consistent with the government's policies on equity in providing public healthcare services to all citizens. The results obtained here show that patients in the following disease groups stay longer at the hospital: diseases of the nervous system (6), diseases of the circulatory system (9), diseases of the musculoskeletal system and connective tissue (13), and injury, poisoning and certain other consequences of external causes (19). A possible reason is that these diseases normally require longer inpatient treatment.
In the multiple regression model studied here, type of disease, admission quarter, age group, and profession are the key factors that influence LOS. This model can be applied to patients admitted to any hospital to estimate an individual patient's expected LOS and to prepare resources. Further research with more complete information is needed to reduce the error of the regression model. Further studies are also needed to investigate whether factors such as the doctor's specialty, admission year, and the distance between the patient's residence and the hospital could be included in the models considered.

ACKNOWLEDGMENT
The first author's research is funded by Ho Chi Minh City University of Technology - VNU-HCM under grant number T-KHUD-2018-18.

LIST OF ABBREVIATIONS
LOS: Length of stay
TLOS: Transformed length of stay

CONFLICT OF INTEREST
We hereby declare that we have no conflict of interest whatsoever in publishing this research.