Our findings confirm that combining historical dengue incidence information with dengue-related Google search data, in a self-adjusting manner, leads to better near real-time dengue activity estimates than those obtained with previous methodologies that exploit the information separately. This also confirms that the hidden Markov model framework used by ARGO is appropriate in this context.
ARGO’s uniform out-performance of other benchmark methods for Mexico, Brazil, Thailand, and Singapore demonstrates its robustness and broad applicability. ARGO achieves this by balancing the influence of Internet search data, which quickly change in the face of outbreaks, and auto-regressive information, which tempers the estimations to mitigate the problem of overshooting. The application of an L1 regularization approach helps identify the query terms most relevant to estimation at any given time, providing easy-to-interpret information as shown in the heatmaps in Figures A, B, C, D, and E.
ARGO dynamically trains on a two-year rolling window, allowing model parameters to adjust over time to account for changes in Internet users’ behavior. The success of our methodology is based on the intuition that the more people are affected by dengue, the higher the number of dengue-related searches will be during an outbreak, and therefore the more likely Google query information will be useful at detecting dengue activity. This is observed in our findings, where the median yearly dengue case counts are strongly associated with the performance of our methodology (i.e. the higher the median yearly cases the higher the correlation of ARGO). In addition, in Brazil, Mexico, and Thailand, the countries where our methodology works best, a clear seasonal pattern is observed in the disease incidence trends over time.
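The combination of auto-regressive terms, search-query terms, L1 regularization, and a rolling training window described above can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the number of auto-regressive lags (`n_lags=12`), the penalty value `lam`, and the plain coordinate-descent solver are all assumptions made for illustration; only the two-year (24-month) window and the use of an L1 penalty come from the text.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - b0 - X b||^2 + lam * ||b||_1 by coordinate descent.

    Hypothetical helper: a simple solver chosen to keep the sketch self-contained.
    """
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centering absorbs the intercept
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                       # constant column: leave at zero
            # Residual with feature j's current contribution added back in.
            resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    intercept = y_mean - x_mean @ beta
    return intercept, beta

def rolling_nowcast(cases, searches, window=24, n_lags=12, lam=0.1):
    """Estimate month t using a model refit on the preceding `window` months.

    cases:    1-D array of monthly case counts (possibly transformed)
    searches: 2-D array, months x query terms
    Returns {t: estimate} for every month with enough history.
    n_lags and lam are illustrative assumptions, not values from the paper.
    """
    def features(t):
        lags = cases[t - n_lags:t][::-1]       # y_{t-1}, ..., y_{t-n_lags}
        return np.concatenate([lags, searches[t]])

    estimates = {}
    for t in range(window + n_lags, len(cases)):
        X = np.array([features(s) for s in range(t - window, t)])
        y = cases[t - window:t]
        b0, beta = lasso_fit(X, y, lam)        # refit every month
        estimates[t] = b0 + features(t) @ beta
    return estimates
```

Because the model is refit each month on only the most recent window, coefficients on query terms that stop tracking dengue can be shrunk to zero by the L1 penalty, which is the behavior the heatmaps of selected terms are meant to display.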
On the other hand, the results from Taiwan illustrate the limitations of our approach. Taiwan does not present either an observable seasonal trend or a high number of dengue cases. As a result, neither ARGO nor the model using only Google search terms reliably tracks dengue. Low dengue-related Internet search activity during most years and sudden public interest during the outbreaks of 2014 and 2015, causing mis-calibration of the Google Trends data, may be another contributor. Other unique characteristics of the Taiwan outbreaks are that they were largely localized in South Taiwan, where Aedes aegypti is resident, and featured viral strains from neighboring countries rather than endemic strains. Also of interest is that the increased case counts occurred during periods of significantly increased temperature and rainfall. The unpredictable character of these outbreaks presents challenges for the performance of ARGO, and generally of all the methods considered in our comparison, but also highlights the potential of incorporating environmental predictors such as temperature and precipitation in our approaches.
While Internet penetration may seem to be an important factor in assessing the quality of Google Trends data, the statistics from Table 2 show that it alone is not as effective as dengue prevalence or seasonality in predicting the overall performance of our methodology. As an example, although Taiwan has high Internet penetration, the dengue case count may be low enough over most years that dengue-related searches motivated by other medical or educational purposes may introduce significant noise in the Google-query data.
On the other hand, ARGO shows strong improvement over the seasonal autoregressive model in Brazil and Singapore, two countries with moderate to high Internet access, compared to Mexico and Thailand, which have low Internet access, suggesting that web penetration is nevertheless still an important factor. Finally, the proportion of the population within a country using Google as a search engine also provides some insight into the performance of ARGO. ARGO shows the lowest correlation in Taiwan, which happens to have the lowest Google market share among the countries studied here.
Despite dengue and flu having very different biological transmission patterns, the fact that modifications to the ARGO methodology yield robust and accurate dengue estimates indicates the strength of our methodological framework. Although the monthly time scale used in this study was originally chosen based on data availability, inspection shows that a monthly surveillance approach is better suited for the 2-week serial interval of dengue.
The dengue activity estimates obtained with our methodology, like estimates from any novel digital disease detection tool, are not meant to replace dengue information obtained from traditional healthcare-based disease surveillance; instead, they can help decision-makers confirm (or deny) suspected disease trends ahead of traditional disease surveillance systems. Ultimately, the goal of this effort is to take a step closer to the development of an accurate, real-time modeling platform, where dengue case estimates can be constantly updated to provide authorities and non-governmental organizations with potentially actionable and close to real-time data on which they can make informed decisions, as well as providing travelers visiting high-risk areas with warnings. Such a platform could bring multiple information sources together, including but not limited to traditional epidemiological case reports, Google searches, crowd-sourced data, and climate and transportation information, creating a rapid response and alert system for users based on their specific location. Timely and precise detection may turn out to play a large role in reducing infections in the near future by informing the timing of vector control efforts, supporting hospital and clinical preparation, and enabling public and individual alerts.
The platform would also enable users to verify dengue risk information with their own observations, creating a positive feedback loop that would continuously improve the accuracy of the tool. We are currently implementing two building-blocks that could help shape such a platform. The first one consists of a webpage Healthmap.org/denguetrends where dengue estimates produced with the methodology introduced in this manuscript are continuously displayed, and the second one is a crowd-sourced tool (currently in beta) that offers a user-friendly online chat system which maps dengue cases worldwide, and gives the public free access to toolkits that help reduce their risk of infection. This second effort is led by Break Dengue’s “Dengue Track” initiative www.breakdengue.org/dengue-track/. The potential impact may be far reaching, as the same models could also be used to track and map other infectious and mosquito-borne diseases, like Zika, malaria, yellow fever or Chikungunya.
Real-time implementation of our methods requires robust responses to changes in data quality, availability, and format. For example, Google correlate data shows internal variability attributed to re-sampling when the tool is accessed at different times. In addition, epidemiological data is not always published consistently by countries, creating lags in reporting that would make our methodology (which assumes having access to last month’s dengue case counts) not applicable.
To understand the impact of these data limitations, we performed two robustness studies of ARGO with respect to (1) the variations in Google Trends data, and (2) the availability of the most recent dengue case count data. For the first, we obtained multiple data sets containing the search frequencies of the query terms displayed in Table A in S1 Text by accessing Google Trends 10 different times during a week. We then produced dengue activity estimates with ARGO using these 10 data sets as input. Table B in S1 Text shows that ARGO still outperforms all other methods in Brazil, Mexico, Thailand and Singapore, despite the random variations observed in Google Trends data. For the second, we retrained all the models under the assumption that the dengue case count from the past month was never available due to reporting delays. Table C in S1 Text shows that despite the unavailability of the last month's dengue counts, ARGO had competitive predictive performance in the five countries/states when compared to other models (similar to the full data case), suggesting that our methodology is robust to time delays in reporting in addition to variations in the input variables.
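The two robustness checks can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the code used for Tables B and C in S1 Text: the function names, the `first_lag` parameter, and the choice of RMSE and Pearson correlation as summary metrics are hypothetical. Setting `first_lag=2` mimics the scenario in which last month's case count is never available, and `summarize_across_pulls` scores the estimates obtained from each separate Google Trends download.

```python
import numpy as np

def lagged_features(cases, t, n_lags=12, first_lag=1):
    """Auto-regressive inputs for month t, most recent first.

    first_lag=1 uses y_{t-1}, ..., y_{t-n_lags} (full-data case).
    first_lag=2 drops last month's count to mimic a reporting delay.
    """
    stop = t - first_lag + 1
    start = stop - n_lags
    if start < 0:
        raise ValueError("not enough history for month t")
    return np.asarray(cases)[start:stop][::-1]

def summarize_across_pulls(estimates_by_pull, truth):
    """Score each Google Trends pull separately.

    estimates_by_pull: array of shape (n_pulls, n_months), one row of
                       estimates per data download
    truth:             array of shape (n_months,) with reported case counts
    Returns (rmse, corr), each an array with one value per pull.
    """
    est = np.asarray(estimates_by_pull, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rmse = np.sqrt(np.mean((est - truth) ** 2, axis=1))
    corr = np.array([np.corrcoef(row, truth)[0, 1] for row in est])
    return rmse, corr
```

The spread of the per-pull scores gives a rough sense of how much of a model's apparent performance depends on the particular sample of search data returned on a given day.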
While our methods are designed to self-correct over time, the introduction of an intervention to curb dengue activity that could lead to a reduction in dengue cases, such as vector control or behavioral education (e.g. use of bed nets), may potentially lead our models to temporarily over-predict incidence. However, once such an intervention has been established and remains active in a given location, our models will self-correct over time to predict the new levels of dengue activity. Sporadic, nation-wide mosquito control methods would provide a bigger challenge to dengue case count predictability and, therefore, our model’s usability.
In light of ARGO’s strengths and limitations, future work should analyze the feasibility of applying our methodology to other countries and to finer spatial and temporal resolutions. This should be followed by routine reassessments of our methods to identify changes in information or potential improvements, including new search terms. As an example of such a change, Brazil has published weekly dengue case counts since 2014. While our work used only the monthly resolution for fair comparison among all countries, adapting our methods to shorter time horizons for regions that provide such information would be useful.
Information on national-level dengue activity may not be ideal for decision-making at the local level since this information has been aggregated over a wide variety of potentially heterogeneous spatial environments. Future work should explore finer spatial resolution estimations to identify whether region-specific factors may improve or worsen results, similar to what has been done in [15, 38]. The five countries/states explored in this study vary by orders of magnitude in size; for example, Brazil, Mexico, and Thailand each span over 100 thousand square miles. As a result, these three countries contain wide ecological diversity and potentially varying patterns of dengue transmission among different sub-regions. It may be expected, for example, that Brazil would show different levels of seasonality in tropical compared to temperate areas. The success of finer spatial resolutions would depend on the quality of local case count and Google Trends data; the former can be affected by reporting efficiency, and the latter can be subject to Internet availability and Google use in a given region. Using national-level data, on the other hand, has the advantage of smoother incidence curves for extraction and extrapolation of signal, at the cost of more granular information. This is reflected in the observation that ARGO performed best in the three large countries despite the inherent heterogeneity within each country. This fits with our previous observation that a combination of higher dengue prevalence at the national level, seasonality and Google use in these countries leads to better results. We believe that these strengths and limitations also apply to extending our methodology to other countries/states besides those studied in the paper.
Producing short-term forecasts of dengue activity, in addition to the nowcast presented here, should also be pursued (See  for such an extension for flu forecasting). Our approach may help produce dengue activity estimates at higher spatial resolutions that can lead to alert systems for people with an increased risk of exposure to the dengue virus at any given point in time. It is important to keep in mind that state-level or city-level spatial scales with low dengue activity may present challenges to the applicability of our approach similar to those seen in Taiwan. Other Internet-based data sources [48, 49] and cross-country spatial relationships should also be exploited to improve prediction accuracy.