Methods on COVID-19 Epidemic Curve Estimation During Emergency Based on Baidu Search Engine and ILI Traditional Surveillance in Beijing, China

2023-03-22 08:04
Engineering, 2023, Issue 12

Ting Zhang, Liuyang Yang, Xuan Han, Guohui Fan, Jie Qian, Xuancheng Hu, Shengjie Lai, Zhongjie Li, Zhimin Liu*, Luzhao Feng*, Weizhong Yang*

a School of Population Medicine and Public Health, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China

b Department of Management Science and Information System, Faculty of Management and Economics, Kunming University of Science and Technology, Kunming 650504, China

c WorldPop, School of Geography and Environmental Science, University of Southampton, Southampton SO17 1BJ, UK

d The Third Affiliated Hospital of Kunming Medical University, Kunming 650118, China

Keywords: COVID-19; Epidemic curve; Baidu search engine; Influenza-like illness; Deep learning; Transmission dynamics model

ABSTRACT

1.Introduction

In recent years, emerging infectious diseases have been a persistent threat, causing harm to human life, health, economic development, and social order [1], and posing a potential risk to humankind. Disease surveillance is a fundamental element of disease prevention and control and is also a requirement for ending the pandemic. Establishing a surveillance and early-warning system therefore helps detect diseases earlier and allows prompt response measures [2], which can lower the peak of the epidemic and reduce the impact on health.

The current global coronavirus disease 2019 (COVID-19) outbreak has highlighted the inadequacies of traditional surveillance systems. Under the policy of no longer treating infected individuals as the primary surveillance subjects, reported cases cannot accurately reflect the actual infection rate, which poses a challenge to traditional epidemic prevention and control. Nevertheless, the severity of the disease, the effects of symptoms on health, and the need for medical resources remain essential information that must be tracked. It is therefore necessary to reform traditional surveillance systems and to pay attention to new types of surveillance that can supplement existing systems. The application of big data and the advancement of modern technology can help significantly in this regard.

The World Health Organization (WHO) proposed in May 2021 to develop a new model for surveillance of emerging threats, the Global Hub for Pandemic and Epidemic Intelligence [3]. This model aims to integrate traditional and modern big data surveillance methods, such as artificial intelligence, to combine different data sources and conduct interdisciplinary collaboration, thus increasing the availability of various data and connections. This project will make a significant leap forward in data analysis to aid decision-making [4]. Furthermore, media surveillance based on network search engines can make up for the shortcomings of traditional surveillance, especially in areas with underdeveloped surveillance networks or in periods when surveillance is unstable because of major events and major infectious diseases. Studies have shown that Baidu, Daum, Twitter, Wikipedia, and other social media (including search engines) can be used to detect the prevalence of influenza, Zika virus [5], dengue fever [6], avian influenza [7], and hand, foot, and mouth disease [8].

Baidu is the most-used search engine in China. As of December 2021, the number of users was approximately 829 million, and 80.3% of them used search engines [9]. By July 2021, Baidu's monthly active users had exceeded 600 million, making it the largest search engine in the country with comprehensive coverage and usage. Thus, Baidu is an ideal choice for surveilling the development of the epidemic because of its large user base and widespread use, especially in Beijing. Given the prevalence of the Baidu search engine and the relatively stable usage habits of the population, this study verifies its effectiveness in surveilling the epidemic situation.

In the current global context, the public health emergency of international concern for COVID-19 has been declared over [10]. Pathogenicity has weakened, vaccination rates have increased, and experience in prevention and control has accumulated. In China, the goal is to reduce the impact on healthcare while considering economic and social impacts, given limited medical treatment and social prevention and control resources. To this end, greater attention should be paid to risk surveillance of key populations and treatment of severe and critical illnesses. Symptom surveillance can provide insight into the epidemic of infectious diseases and is an essential indicator of disease activity, which in turn drives the demand for medical resources.

As the White paper on Singapore’s response to COVID-19: lessons for the next pandemic summarized, “In dealing with a complex crisis, we should establish upfront which dimension to prioritise, and adapt more quickly to changing situations to not allow the perfect to become the enemy of the good” [11]. When faced with an emergency outbreak, it becomes necessary to adopt innovative approaches to overcome the limitations of traditional surveillance methods. This study examined the use of modern surveillance channels alongside conventional methods in emergency situations to evaluate the scale of COVID-19 infection. The results provide a valuable methodological reference for future infectious disease surveillance, using real-world observations of the pandemic to inform surveillance strategies.

2.Methods

2.1.Data sources

This study used the daily number of influenza-like illness (ILI) cases in Beijing and the daily proportion of ILI among outpatients (ILI%) as the dependent variables and the daily Baidu index as the independent variable. The research period was from July 1, 2013, to December 9, 2022. The ILI data were collected from 419 sentinel hospitals in 21 districts of Beijing, with a total of 1 275 742 samples. The Baidu index was formulated using six keywords (fever, pyrexia, cough, sore throat, anti-fever medicine, and runny nose), which were sourced from both mobile and personal computer platforms.

WHO and the Centers for Disease Control and Prevention (CDC) define an ILI as an acute respiratory illness with a temperature of at least 100°F (38°C) and associated cough, with onset within the past ten days [12]. For the 2021-2022 influenza season, case definitions no longer require “no other known etiology other than influenza” [13]. The ILI definition issued by the Department of Disease Control and Prevention of the National Health Commission of China is: fever (body temperature ≥38°C) accompanied by either cough or sore throat [14]. These definitions of ILI differ only slightly in body temperature, and the composition of symptoms is the same. Additionally, no etiological tests are conducted to confirm the diagnosis of ILI, which includes the current pandemic of COVID-19.

Data sharing statement: the Baidu search data in this study are publicly available, and the influenza virological surveillance data in Beijing were retrieved from a previously published study [15].

2.2.Data preprocessing

Data standardization involves the process of adjusting the values in a dataset to a specific scale, thereby enabling different variables to be compared with one another while also eliminating the impact of varying magnitudes. This technique can enhance data quality, streamline data processing, improve model precision, expedite model convergence, reduce model training duration, and enhance the stability and reliability of the model.

In the current study, the data were pre-processed using Min-Max scaling. This normalization method (also called range or deviation normalization) linearly rescales the data based on their maximum and minimum values so that the scaled values fall within the range [0, 1]. This range was deemed suitable for observation and training purposes. The normalized thermal distribution (heat map) of each feature is presented in Fig. 1.
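As a minimal sketch of the Min-Max scaling described above, the following Python snippet rescales one feature to [0, 1]. The function name `min_max_scale` and the sample values are illustrative assumptions, not the study's code or data; the handling of a constant feature is also a convention chosen here.

```python
def min_max_scale(values):
    """Linearly rescale a sequence of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map it to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical daily search-index values for one keyword.
search_index = [120.0, 300.0, 210.0, 480.0]
scaled = min_max_scale(search_index)
print(scaled)  # [0.0, 0.5, 0.25, 1.0]
```

In a forecasting setting, the minimum and maximum would normally be taken from the training period only and reused for later periods, so that no information from the evaluation data leaks into the scaling.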

2.3.Establishment of the dataset

(1) Training: July 1, 2013 to May 28, 2018 (1793 days).

(2) Validation: May 28, 2018 to March 24, 2019 (300 days).

(3) Testing: March 24, 2019 to March 23, 2020 (365 days).

(4) Prediction: October 10, 2022 to December 9, 2022 (60 days).

(5) Estimation: November 22, 2022 to January 20, 2023 (60 days).

2.4.Modeling

Fig. 1. Thermal distribution (heat map) of each feature after standardization. To ensure equitable inclusion in model training, we normalize multi-source data using a Min-Max scale within the range of [0, 1]. In the corresponding visual representation, lighter colors indicate values closer to 0, while darker colors signify values approaching 1, as illustrated in the legend.

This study employed a composite model that combined deep learning with a transmission dynamics model to predict the COVID-19 epidemic. First, we used the MABG model to predict the current ILI% and ILI cases. Given the multidimensional nature of our data, we developed the MABG prediction model, which is based on a multi-attention mechanism and a bidirectional gated recurrent unit, to handle multi-featured time series. By thoroughly exploring the inherent characteristics of multi-source heterogeneous data and establishing the connection between characteristics and results, the MABG model was able to complete the task of time series prediction effectively and reliably.

When a multi-featured time series was fed to the model, we first passed it to a bidirectional gated recurrent unit (GRU) layer, which is well suited to processing time series and capturing dependencies between time steps. The bidirectional GRU (BGRU) is an improved version of the GRU that offers several advantages, including better use of global information, prediction capability, and modeling ability. Unlike a traditional recurrent neural network, which can only consider the input at the current moment and the hidden state of the previous moment, the BGRU can use information from the states both before and after the current moment. This approach facilitates better global information capture and more accurate output prediction. The structure of the BGRU model is illustrated in Fig. 2.
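To make the bidirectional idea concrete, here is a minimal sketch of a GRU run in both directions over a short series. It is deliberately reduced to a scalar input and a scalar hidden state, uses one common form of the GRU update, and has hand-picked, untrained weights; the names `gru_step`, `run_gru`, `bidirectional_gru`, `P_FWD`, and `P_BWD` are hypothetical and do not come from the study.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and a scalar hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * r * h_prev)  # candidate state
    # One common convention: z weights the candidate, (1 - z) the old state.
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(series, p):
    """Run the GRU left to right and return the hidden state at every step."""
    h, states = 0.0, []
    for x in series:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bidirectional_gru(series, p_fwd, p_bwd):
    """Pair each step's forward state with its backward state."""
    forward = run_gru(series, p_fwd)
    # Read the series right to left, then flip so index t lines up again.
    backward = run_gru(series[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))


# Hypothetical, untrained weights and a short normalized series.
P_FWD = {"wz": 0.5, "uz": 0.1, "wr": 0.4, "ur": 0.2, "wh": 0.9, "uh": 0.3}
P_BWD = {"wz": 0.3, "uz": 0.2, "wr": 0.6, "ur": 0.1, "wh": 0.7, "uh": 0.4}
series = [0.1, 0.4, 0.9, 0.6]
states = bidirectional_gru(series, P_FWD, P_BWD)
for t, (h_fwd, h_bwd) in enumerate(states):
    print(t, round(h_fwd, 4), round(h_bwd, 4))
```

At each step t, the forward state summarizes the inputs up to t and the backward state summarizes the inputs from t onward, which is what lets later layers use context on both sides of the current moment.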

Then, we employed three different attention mechanism modules simultaneously: squeeze-and-excitation attention [16], channel attention, and spatial attention [17]. By combining these three attention mechanisms, we extracted important information between different features and key information within the same feature. In addition, to prevent vanishing gradients, after concatenating the results of the different attention modules, we connected the results with two pooling layers through a residual connection and output the prediction results through the dense layer (Fig. 3).
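The following sketch illustrates only the first of the three modules, squeeze-and-excitation attention, applied to a small window of a multi-featured time series. It assumes that features play the role of channels and that the excitation step is a two-layer bottleneck (ReLU, then sigmoid); the function name `squeeze_excite`, the layer sizes, and the weights are hypothetical and are not taken from the MABG implementation.

```python
import math


def squeeze_excite(window, w1, w2):
    """Reweight the features of a (time steps x features) window."""
    n_steps, n_feat = len(window), len(window[0])
    # Squeeze: summarize each feature by its average over the window.
    squeezed = [sum(step[f] for step in window) / n_steps for f in range(n_feat)]
    # Excitation: a small bottleneck network (ReLU, then sigmoid) produces
    # one gate in (0, 1) per feature.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every time step's features by their gates.
    scaled = [[g * v for g, v in zip(gates, step)] for step in window]
    return scaled, gates


# Hypothetical window: 3 days x 2 normalized features, with untrained weights.
window = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
w1 = [[1.0, 0.5]]     # 2 features -> 1 hidden unit
w2 = [[2.0], [-2.0]]  # 1 hidden unit -> 2 gates
scaled, gates = squeeze_excite(window, w1, w2)
print([round(g, 3) for g in gates])
```

With these weights the first feature receives a gate above 0.5 and the second a gate below 0.5, so the module emphasizes one feature and damps the other across the whole window. In a trained model the weights would be learned jointly with the rest of the network.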

Finally, the study used a classical transmission dynamics model to estimate the epidemic curve of COVID-19 infection in Beijing, incorporating the predicted results. Transmission dynamics models have various versions, depending on the study’s objectives, and require related parameters to be defined in order to evaluate the effectiveness of pharmaceutical and non-pharmaceutical interventions. To predict the epidemic trend, essential factors must be considered. This study aimed to estimate the epidemic trend based on actual information, using an optimal solution set based on real-time data. The equation used in this study marked the influence of different factors, but the focus was not on distinguishing the impact of each factor. Therefore, an index of comprehensive effect was used as a substitute when seeking the optimal solution. The total population, N, was divided into four classes: susceptible (S), exposed (E), infected (I), and recovered/removed (R). A continuous-time model was established to account for the continuous infection process, governed by the differential equations in Eq. (1).

Fig. 2. GRU and BGRU.

Fig. 3. MABG-susceptible-exposed-infected-removed (SEIR) model structure. Concat: concatenate.

where Eq. (1) is subject to the initial conditions S(0), E(0), I(0), and R(0). The parameters are defined as follows: t: time; Λ: per-capita natural birth rate; μ: per-capita natural death rate; c: the effectiveness of public health and social measures; v: the effectiveness of all kinds of pharmaceutical interventions; δ: the probability of disease transmission per contact (dimensionless) times the number of contacts per unit time; α: rate of progression from exposed to infectious (the reciprocal is the latent period); γ: recovery or death rate of infectious individuals (the reciprocal is the infectious period). In this study, we did not distinguish the effects of c, v, and δ but considered them together, denoted by β, the rate per unit of time at which the susceptible become infected, which could be calculated from R0 according to Eq. (2).
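For readers who want to experiment with the compartmental structure described above, the sketch below integrates a textbook SEIR model with births and deaths using forward-Euler steps. It is not a reproduction of the study's Eq. (1) or Eq. (2): the exact equations are not shown in this text, so the form used here, the relation β = R0 × γ, the initial conditions, and the values of R0, the latent period, and the infectious period are all assumptions chosen for illustration. Only the population size is taken from the article.

```python
def simulate_seir(n, e0, i0, beta, alpha, gamma,
                  birth=0.0, death=0.0, days=60, dt=0.1):
    """Forward-Euler integration of a textbook SEIR model with vital dynamics.

    Returns one (S, E, I, R) tuple per day, starting with day 0.
    """
    s, e, i, r = n - e0 - i0, e0, i0, 0.0
    steps_per_day = round(1 / dt)
    daily = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            total = s + e + i + r
            new_exposed = beta * s * i / total
            ds = birth * total - new_exposed - death * s
            de = new_exposed - (alpha + death) * e
            di = alpha * e - (gamma + death) * i
            dr = gamma * i - death * r
            s, e, i, r = s + ds * dt, e + de * dt, i + di * dt, r + dr * dt
        daily.append((s, e, i, r))
    return daily


N = 21_893_095     # resident population of Beijing, as stated in the article
R0 = 3.0           # hypothetical basic reproduction number
ALPHA = 1 / 2.0    # hypothetical: 2-day latent period
GAMMA = 1 / 3.0    # hypothetical: 3-day infectious period
BETA = R0 * GAMMA  # assumed relation, standing in for the study's Eq. (2)

daily = simulate_seir(N, e0=0.0, i0=1000.0, beta=BETA,
                      alpha=ALPHA, gamma=GAMMA, days=120)
peak_day = max(range(len(daily)), key=lambda d: daily[d][2])
print("infectious compartment peaks on day", peak_day)
print("share ever infected:", round(1 - daily[-1][0] / N, 3))
```

Forward Euler with a small step is used only for transparency; a production analysis would more likely rely on an adaptive ODE solver and would fit β to the estimated infections rather than fixing it in advance.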

2.5.Assessing the scale of COVID-19 infections in comparison to ILI

In the past, surveillance of ILI in China did not include patients with COVID-19 infections. However, this study took into account those with ILI among the existing COVID-19-infected patients (Fig. 4). In addition to those with ILI symptoms, COVID-19 infections also include asymptomatic cases. Therefore, based on the MABG model’s predictions of ILI, the excess ILI was calculated relative to the historical baseline levels of ILI. Subtracting this baseline population yielded the number of people with ILI who were infected with COVID-19. Then, an adjustment based on the proportion of asymptomatic Omicron infections was made to obtain a rough estimate of the scale of COVID-19 infection. The proportion of asymptomatic infections among overall infections depends on variables such as age distribution, general health status, underlying health conditions, and vaccination coverage. According to previous systematic reviews, meta-analyses [18,19], and official reports [20,21], the asymptomatic proportion ranged from 25.3% to 40.0%. This study assumed an asymptomatic proportion of 30.0%.
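The arithmetic of this adjustment can be sketched as follows. This is a simplified reading of the procedure, assuming that excess ILI is the predicted count minus the historical baseline (floored at zero, which is a choice made here) and that symptomatic cases make up one minus the asymptomatic proportion of all infections. The function name `estimate_infections` and the sample counts are hypothetical; the sketch ignores sentinel-hospital coverage and symptomatic infections that do not meet the ILI definition.

```python
def estimate_infections(predicted_ili, baseline_ili, asymptomatic_share=0.30):
    """Convert predicted ILI counts into a rough count of COVID-19 infections."""
    if not 0.0 <= asymptomatic_share < 1.0:
        raise ValueError("asymptomatic_share must be in [0, 1)")
    estimates = []
    for predicted, baseline in zip(predicted_ili, baseline_ili):
        # Excess ILI above the historical baseline, floored at zero.
        excess = max(predicted - baseline, 0.0)
        # Symptomatic cases are (1 - share) of all infections, so divide.
        estimates.append(excess / (1.0 - asymptomatic_share))
    return estimates


# Hypothetical daily counts: the first day is below baseline, the second above.
predicted = [900.0, 2400.0]
baseline = [1000.0, 1000.0]
infections = estimate_infections(predicted, baseline)
print([round(x) for x in infections])  # [0, 2000]
```

With a 30.0% asymptomatic proportion, every 7 excess ILI cases correspond to roughly 10 infections; using the ends of the reported range (25.3% and 40.0%) instead would give a simple sensitivity band.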

2.6.Study assumptions

(1) The motivation for search behavior is assumed to remain relatively constant once symptoms of ILI are present.

(2) The definition of ILI encompasses the primary symptoms of COVID-19.

(3) The current policy is assumed to be maintained, without considering potential policy alterations as the epidemic peak approaches.

(4) The prevalence of other ILI diseases did not differ from historical levels.

3.Results

3.1.Model validation

The model was validated by comparing the predicted and actual values from May 28, 2018 to March 24, 2019 (Fig. 5). The R² values (R² typically lies between 0 and 1 and quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables in the model) of ILI cases and ILI% were 0.6540 and 0.6057, the explained variance scores (EVSs) were 0.6596 and 0.6069, the mean absolute errors (MAEs) were 0.1145 and 0.5629, and the mean squared errors (MSEs) were 0.0298 and 0.5688, respectively (Table 1 [22]).

Fig. 4. Assessing the scale of COVID-19 infections based on ILI. The relationship between ILI and COVID-19 patients.

3.2.ILI estimation results based on the Baidu index

Analysis of the Baidu index and ILI data concerning the emergence of COVID-19 since January 2020 revealed that ILI cases and ILI% had surpassed the historical baseline levels from December 1, 2022 (p < 0.05). Furthermore, the number of ILI cases surged in November and December, prior to the government’s historic policy adjustments on December 7, 2022. These findings suggest that the epidemic had already reached a large scale before the official policy changes were enacted (Fig. 6(a)).

3.3.Comparison of ILI% and ILI cases among different models

We also compared the MABG model with other standard traditional statistical, machine learning, and deep learning models using four metrics: R² (Eq. (3)), EVS (Eq. (4)), MAE (Eq. (5)), and MSE (Eq. (6)). The calculation methods of the four metrics are shown below. The results are shown in Table 1, from which we can see that the MABG model outperforms the other models on most evaluation metrics.

where y denotes the actual observed values of the dependent variable; ŷ denotes the values of the dependent variable predicted or estimated by the model; n is the total number of data points or observations in the dataset; and i is an index over the individual data points in the dataset, ranging from 1 to n.
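As an illustration of how these four metrics are commonly computed, the sketch below uses their standard textbook definitions (the same ones used by widely available machine learning libraries). These are assumed to correspond to Eqs. (3)-(6), which are not reproduced in this text; the function names and the toy data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance (divides by n)."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def r2(y, y_hat):
    """R^2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2)."""
    y_bar = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def evs(y, y_hat):
    """EVS = 1 - Var(y - yhat) / Var(y)."""
    residuals = [a - b for a, b in zip(y, y_hat)]
    return 1.0 - variance(residuals) / variance(y)


def mae(y, y_hat):
    """MAE = (1/n) * sum(|y_i - yhat_i|)."""
    return mean([abs(a - b) for a, b in zip(y, y_hat)])


def mse(y, y_hat):
    """MSE = (1/n) * sum((y_i - yhat_i)^2)."""
    return mean([(a - b) ** 2 for a, b in zip(y, y_hat)])


# Toy example: every prediction is too high by exactly 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
print(round(r2(y_true, y_pred), 4))   # 0.8
print(round(evs(y_true, y_pred), 4))  # 1.0
print(mae(y_true, y_pred))            # 0.5
print(mse(y_true, y_pred))            # 0.25
```

The toy data show why R² and EVS are reported separately: a constant bias lowers R² but leaves EVS at 1, because EVS only measures the variance of the residuals and ignores their mean.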

Table 1. Comparison of ILI% and ILI cases between different models.

3.4.Model application on the epidemic curve estimation of COVID-19 infection in Beijing

The present study used a variant susceptible-exposed-infected-removed (SEIR) model to analyze the epidemiological characteristics of COVID-19 in Beijing. The parameters were calculated based on the infections estimated through the ILI model. The resident population of Beijing is 21 893 095 [23], with over 80% having received the COVID-19 vaccination booster [24]. The birth rate of Beijing in 2021 was 0.635%, and the death rate was 0.539% [25]. Approximately 30.0% of infections are assumed to be asymptomatic. The transmission dynamics of COVID-19 were modeled to simulate the epidemic curve in Beijing. The relevant parameter settings are shown in Table 2. The results of the variant SEIR model suggest that the epidemic’s peak is expected to occur on December 12, with about 1.66 (95% confidence interval (95% CI): 1.61-1.72) million new infections at peak time. The outbreak is expected to conclude in early January. The peak of the existing patients’ curve, which reflects the accumulation of new infections net of recoveries/deaths, is expected to occur on December 15, with more than 5.47 (95% CI: 5.22-5.73) million existing patients at peak time (Fig. 6(a)). The interval between the peak of new infections and the peak of existing patients is estimated to be three days. We estimated that the cumulative infection attack rate was 80.25% (95% CI: 77.51%-82.99%) on December 17, 2022, and 97.50% (95% CI: 97.00%-98.00%) on January 15, 2023 (Fig. 6(b)). The corresponding estimated effective reproduction number (Rt) fluctuated with an overall downward trend and has remained below 1 (0.92; 95% CI: 0.90-0.95) since December 17, 2022 (Fig. 6(c)).

4.Discussion

This research investigated the use of the Baidu index to predict the magnitude of ILI at sentinel hospitals in Beijing, aiming to supplement traditional surveillance and provide novel insights for countries and regions with less developed surveillance systems. The estimation of the size of the population infected by COVID-19 in cities with policy changes was also examined. The findings showed that the number of ILI cases in Beijing has surpassed the historical average since December, a trend that could be attributed to the rise in COVID-19 cases. However, an increase in other respiratory infection cases could not be ruled out. At the 419 sentinel hospitals included in the study, the number of people with ILI and related symptoms increased rapidly. Finally, Baidu provided new ideas for the surveillance of this round of the COVID-19 pandemic.

The positive nucleic acid testing rate [26] and Baidu search data both peaked on December 14, providing a valuable cross-validation of the COVID-19 epidemic trend estimation based on two distinct data sources. The purpose of COVID-19 nucleic acid testing is to detect new cases of infection, and once a positive result is obtained, frequent testing is unlikely. Therefore, nucleic acid testing does not reflect the currently infected individuals but rather identifies newly infected individuals in the early stages of the disease. In this study, the peak of the positive rate of nucleic acid testing is therefore compared with the peak of daily new infections. Since December 8, 2022, the nucleic acid testing strategy has shifted from population-wide testing to voluntary testing. Therefore, the absolute values in the nucleic acid testing data cannot represent the number of infections, and they are not directly comparable to the absolute values of infections in this study. To a certain degree, the concurrence of peak times provides empirical validation for the reliability of the study method. It is important to note that the model should be tailored to the specific application scenario of the transmission dynamics model, rather than striving for excessive complexity and detail.

Table 2. Parameters for the SEIR model used to estimate the epidemic curve of COVID-19 infection in Beijing.

Fig. 6. Simulation of the COVID-19 epidemic curve in Beijing based on the Baidu search engine and ILI surveillance. (a) Existing and new infections per day. The dark black points are the cases estimated by the MABG model, the blue lines represent new infections per day, and the orange line represents existing patients per day. (b) Cumulative infection attack rate per day. (c) Rt from November 28, 2022 to January 20, 2023.

This study aligns with Kathy Leung’s research [27], which estimated the transmission dynamics of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Omicron BF.7 in Beijing from November to December 2022. Both studies indicate that the infection peaked before mid-December 2022, with around 92% of the population infected as of December 22, 2022. However, our study found a 97.50% infection rate (95% CI: 97.00%-98.00%) as of January 15, 2023, notably higher than Kathy Leung’s estimates. This discrepancy may stem from our model’s uniform assumptions about social interaction, which overlook subgroups such as those who self-isolate or have limited mobility, potentially inflating the infection rate. In contrast, the maximum value of Rt in this study (2.79) is lower than theirs (3.44). This discrepancy may be attributed to different assumptions, data sources, and model parameter errors between the two studies. Therefore, the significance and applicability of the study’s results should be carefully considered in light of the underlying data source, research hypothesis, and model structure. It is important to acknowledge that this model encounters challenges when attempting to accurately reflect real-world circumstances.

WHO proposes that traditional surveillance of infectious diseases, such as ILI, includes patients receiving medical services, hospitalized patients, laboratory confirmation, gene sequencing, death estimation, active surveillance, tracking, etc. Modern surveillance sources such as network information, animal health, occupational health, policy reports, community-reported cases, mobile data, public databases, and wearable devices are being employed to supplement these traditional methods. In particular, the use of the Baidu index as a supplementary means of ILI surveillance is an example of this modern surveillance. Studies have demonstrated that modern surveillance methods, such as Google Flu Trends (GFT), can detect signs of disease occurrence earlier than traditional methods, detecting the occurrence of ILI one week in advance [28]. These Internet-based systems improve the sensitivity of surveillance for developed countries and may be more effective for countries with underdeveloped traditional surveillance systems [8].

The significance of syndrome surveillance lies in its ability to quantify the magnitude of an outbreak, ascertain the demand for medical resources, and strategize accordingly. The findings of this study demonstrate that a surge in new infections was followed by a surge in the number of existing patients, posing a significant challenge for the healthcare sector [29]. The severity of a disease’s symptoms often leads to an increased likelihood of seeking medical treatment. In situations where laboratory testing is unavailable or unnecessary, it is still important to consider the health and recovery of those infected. Therefore, estimating the number of ILI cases in a particular area can help assess the demand for medical resources. However, it is essential to note that the predicted number of cases refers to the number of people seeking treatment at sentinel surveillance sites, not the total number of ILI cases in the area. To obtain an accurate representation of the area’s ILI rate, the hospital’s coverage of services must be taken into account.

Syndrome surveillance is essential for the control and prevention of influenza at a global level [30]. The aim of these strategies should be to maximize the health benefits of the population while avoiding economic disruption. For this purpose, surveillance efforts should be concentrated on symptomatic infected individuals. A study [31] conducted in Chaoyang District, Beijing, demonstrated that intensifying influenza surveillance and conducting a comprehensive analysis of the surveillance results can assist in the timely detection of influenza and enable more precise measures to be taken. Additionally, public data from the Baidu search engine can be used to infer the prevalence of respiratory infectious diseases more comprehensively, which can be used to anticipate any potential shortage of medical resources, thus allowing for timely adjustments to prevention and control policies.

It is recommended to surveil the symptoms of COVID-19 based on or in reference to the ILI system of influenza surveillance. The COVID-19 pandemic is expected to persist [32]. Surveillance of the symptoms of COVID-19 is essential to comprehend the magnitude of the disease, evaluate the epidemic trend, and assess the demand for medical resources and the burden of the disease. In the past, ILI surveillance sentinel sites in China [33], the United States [34], Japan [35], and the United Kingdom [36] have been instrumental in the surveillance of influenza. The population’s susceptibility and the burden of the disease associated with COVID-19 are higher than those of influenza. Adjustment of preventive measures, preparation for a response, and monitoring of virus mutation all depend on effective surveillance.

This study has some limitations. It estimated only the number of people visiting a doctor or obtaining medication, which does not reflect the actual number of infections or of people with symptoms. The SEIR model calculates certain parameters based on assumptions, which can limit their credibility in representing the real world. As a result, not all parameters, such as the recovery rate, may be reliable indicators of real-world dynamics. The SEIR model also could not incorporate all real-world factors. Various factors, such as weather conditions, traffic conditions, holidays, and the risk of cross-infection, influence care-seeking behavior. Additionally, this study did not include all Baidu indexes related to influenza-like cases because the Baidu index is subject to interference and guidance from numerous sources, which introduces a degree of uncertainty. Furthermore, this study did not differentiate between influenza virus infection, COVID-19, rhinovirus infection, and other specific diseases.

5.Conclusion

The Baidu index effectively gauges, within a reliable range, the number and proportion of individuals who manifest influenza-like symptoms and subsequently visit sentinel hospitals or procure medication. Additionally, the Baidu index can be used to estimate the spread of a virus and the rate of contagion during a pandemic.

Acknowledgments

This study was supported by grants from the Chinese Academy of Medical Sciences (CAMS) Innovation Fund for Medical Sciences (2021-I2M-1-044). The authors thank Baidu for the data publication and Sinosoft Company Limited for technical support.

Authors’ contribution

Weizhong Yang, Luzhao Feng, and Ting Zhang contributed to the study design; Liuyang Yang, Xuan Han, and Xuancheng Hu were responsible for data collection and curation; Liuyang Yang, Ting Zhang, Zhongjie Li, and Zhimin Liu verified and analyzed the data; Jie Qian and Xuan Han conducted the literature review; Ting Zhang, Xuan Han, and Liuyang Yang wrote the first draft of the manuscript; Weizhong Yang, Luzhao Feng, Zhimin Liu, Zhongjie Li, Shengjie Lai, and Guohui Fan reviewed and contributed to the writing of the manuscript. All authors had full access to all the data in the study, approved the revisions, and had final responsibility for the decision to submit for publication.

Compliance with ethics guidelines

Ting Zhang, Liuyang Yang, Xuan Han, Guohui Fan, Jie Qian, Xuancheng Hu, Shengjie Lai, Zhongjie Li, Zhimin Liu, Luzhao Feng, and Weizhong Yang declare that they have no conflict of interest or financial conflicts to disclose.