The opportunities and shortcomings of using big data and national databases for sarcoma research

Document Type

Review Article


General Surgery


The rarity and heterogeneity of sarcomas make performing appropriately powered studies challenging and magnify the significance of large databases in sarcoma research. Established large tumor registries and population-based databases have become increasingly relevant for answering clinical questions regarding sarcoma incidence, treatment patterns, and outcomes. However, the validity of large databases has been questioned and scrutinized because of the inaccuracy and wide variability of coding practices and the absence of clinically relevant variables. In addition, the utilization of large databases for the study of rare cancers such as sarcoma may be particularly challenging because of the known limitations of administrative data and poor overall data quality. Currently, there are several large national cancer databases, including the Surveillance, Epidemiology, and End Results database, the National Cancer Data Base of the American College of Surgeons and the American Cancer Society, and the National Program of Cancer Registries of the Centers for Disease Control and Prevention. These databases are often used for sarcoma research, but they are limited by their dependence on administrative or billing data, the lack of agreement between chart abstractors on diagnosis codes, and the use of preexisting documented hospital diagnosis codes for tumor registries, which lead to a significant underestimation of sarcomas in large data sets. Current and future initiatives to improve databases and big data applications for sarcoma research include increasing the utilization of sarcoma-specific registries and encouraging national initiatives to expand on real-world, evidence-based data sets.

Publication (Name of Journal)