Introduction: Accurate and consistent data play a critical role in enabling health officials to make informed decisions regarding emerging trends in SARS-CoV-2 infections. Alongside traditional indicators such as the 7-day incidence rate, wastewater-based epidemiology can provide valuable insights into changes in SARS-CoV-2 concentrations. However, wastewater composition and wastewater systems are complex, and effects such as precipitation events or industrial discharges can affect the quantification of SARS-CoV-2 concentrations. Hence, analysing data from more than 150 wastewater treatment plants (WWTPs) in Germany necessitates an automated and reliable method to evaluate data validity, identify potential extreme events, and, if possible, improve overall data quality.
Methods: We developed a method that first categorises the data quality of WWTPs and their corresponding laboratories based on the number of outliers in the reproduction rate and the number of implausible inflection points within the SARS-CoV-2 time series. Subsequently, we scrutinised statistical outliers in several standard quality control parameters (QCP) that are routinely collected during the analysis process, such as the flow rate, the electrical conductivity, or surrogate viruses like the pepper mild mottle virus. Furthermore, we investigated outliers in the ratio of the analysed gene segments, which might indicate laboratory errors. To evaluate the success of our method, we measured the degree of accordance between identified QCP outliers and outliers in the SARS-CoV-2 concentration curves.
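The outlier and accordance steps described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the abstract specifies neither the outlier rule nor the accordance measure, so the robust z-score (median and MAD), the threshold of 3.5, the definition of accordance as the share of QCP outliers that coincide with a concentration outlier, the function names, and the example numbers are all hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag outliers with a robust z-score based on median and MAD.

    Assumption: the source does not name its outlier rule; this rule
    and the threshold are illustrative only.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be flagged reliably.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def accordance(qcp_flags, conc_flags):
    """Share of QCP outliers that coincide with a concentration outlier.

    Assumption: one possible accordance measure; returns None when the
    QCP series contains no outliers at all.
    """
    qcp_total = sum(qcp_flags)
    if qcp_total == 0:
        return None
    hits = sum(q and c for q, c in zip(qcp_flags, conc_flags))
    return hits / qcp_total


# Made-up example series for one hypothetical WWTP (not real measurements).
flow_rate = [100, 102, 98, 101, 250, 99, 103, 97]
concentration = [5.0, 5.2, 4.9, 5.1, 1.0, 5.0, 9.5, 5.1]

flow_flags = mad_outliers(flow_rate)
conc_flags = mad_outliers(concentration)
print(accordance(flow_flags, conc_flags))
```

In this toy series the flow-rate spike falls on the same sample as a drop in concentration, which is the pattern a precipitation event might produce. A measure that tolerates a lag of a few days between the two series, or that also counts concentration outliers not explained by any QCP, would be equally defensible choices.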
Results and discussion: Our analysis reveals that the flow rate and the gene segment ratio are typically the best indicators of outliers in the SARS-CoV-2 concentration curve, albeit with variations across WWTPs and laboratories. In most cases, excluding data points based on QCP plausibility checks improves data quality. Our derived data quality categories are in good accordance with visual assessments.
Conclusion: Good data quality is crucial for trend recognition, both at the WWTP level and when aggregating data from several WWTPs into regional or national trends. Our method can help to improve data quality in the context of health-related monitoring and can be optimised for each individual WWTP to account for the large diversity among WWTPs.