Research Methodology

How We Calculate Our Numbers

Complete transparency on data sources, algorithms, and statistical methods used throughout SmileAccess AI

Overview

SmileAccess AI uses a combination of real-world data from government databases, statistical modeling, and machine learning algorithms to provide accurate dental access information. This page documents our complete methodology so users, researchers, and policymakers can understand exactly how we arrive at our numbers.

Data Sources

1. National Provider Identifier (NPI) Registry

Source: Centers for Medicare & Medicaid Services (CMS)

URL: https://npiregistry.cms.hhs.gov/

What we use: Provider names, practice names, addresses, phone numbers, specializations, taxonomy codes

Update frequency: Real-time API access

Accuracy: 99%+ (official government database maintained by CMS)

2. CDC Water Fluoridation Reporting System (WFRS)

Source: Centers for Disease Control and Prevention (CDC)

URL: https://www.cdc.gov/fluoridation/statistics/

What we use: State-level fluoride concentrations (mg/L), population coverage percentages

Data year: 2022 (most recent available)

Accuracy: 95%+ (direct measurements from water utilities)

3. U.S. Census Bureau

Source: American Community Survey (ACS)

What we use: Population density, median household income, insurance coverage rates

Data year: 2021 5-Year Estimates

Accuracy: 90%+ (sample-based survey with margin of error)

4. OpenStreetMap / Nominatim

Source: OpenStreetMap Foundation

What we use: Geocoding (converting addresses to coordinates), distance calculations

Accuracy: 95%+ for U.S. addresses

Algorithms & Calculations

Dentist Match Scoring

We calculate a match score (0-100%) for each dentist based on multiple weighted factors:

Match Score = (Distance Weight × 40%) + (Insurance Weight × 30%) + (Specialization Weight × 20%) + (Availability Weight × 10%)
  • Distance Weight: Closer dentists score higher (exponential decay function)
  • Insurance Weight: 100% if the dentist accepts the patient's insurance, 50% if the dentist accepts similar insurance, 0% if there is no match
  • Specialization Weight: 100% if specialization matches search criteria, 70% for general dentists
  • Availability Weight: Based on estimated appointment availability (see Monte Carlo section)
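
As a rough illustration of how the weighted sum above combines, here is a minimal Python sketch. The 40/30/20/10 weights come from the formula; everything else is an assumption made for the example: the function names, the 0-1 scale for each component, and the `decay_miles` parameter that controls how quickly the distance component falls off (this page does not state the actual decay rate).

```python
import math

# Weights taken from the match score formula above.
WEIGHTS = {
    "distance": 0.40,
    "insurance": 0.30,
    "specialization": 0.20,
    "availability": 0.10,
}


def distance_component(miles: float, decay_miles: float = 10.0) -> float:
    """Exponential decay: 1.0 at zero miles, shrinking as distance grows.

    decay_miles is a hypothetical scale chosen for illustration only.
    """
    return math.exp(-max(miles, 0.0) / decay_miles)


def match_score(miles: float, insurance: float,
                specialization: float, availability: float) -> float:
    """Combine component scores (each assumed to lie in 0-1) into a 0-100 score."""
    components = {
        "distance": distance_component(miles),
        "insurance": insurance,            # 1.0, 0.5, or 0.0
        "specialization": specialization,  # 1.0 for a match, 0.7 for general dentists
        "availability": availability,      # assumed to be pre-scaled to 0-1
    }
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

For instance, `match_score(0.0, 0.5, 0.7, 0.0)` combines a full distance component with a partial insurance match and a general dentist, which works out to 40 + 15 + 14 + 0 = 69.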

Monte Carlo Wait Time Predictions

We use Monte Carlo simulation to estimate wait times for new patient appointments:

1. Input Variables:
  • Provider density in area (providers per 10,000 population)
  • Insurance acceptance rate
  • Population demand (based on census data)
  • Historical appointment data (when available)
2. Simulation Process:
  • Run 10,000 simulations per provider
  • Each simulation models appointment scheduling with random variables
  • Variables include: daily appointment slots, cancellation rates, new patient acceptance rates
  • Calculate wait time for each simulation
3. Output:
  • Median wait time (50th percentile)
  • 90% confidence interval (5th to 95th percentile)
  • Example: "20 days (90% confidence: 13-30 days)" means 90% of simulations resulted in wait times between 13 and 30 days

Note: Monte Carlo simulations provide probabilistic estimates, not guarantees. Actual wait times may vary based on factors we cannot model (e.g., provider vacation schedules, sudden demand spikes).
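
The simulation steps above can be sketched in Python as follows. This is a simplified illustration, not the production model: the queue logic, the function names, and the `patients_ahead` input (a stand-in for local demand) are all assumptions. It only shows the general shape of the technique: many randomized scheduling runs, followed by the median and the 5th and 95th percentiles of the resulting wait times.

```python
import random
import statistics


def simulate_wait_days(rng, patients_ahead, daily_slots,
                       new_patient_rate, cancellation_rate, max_days=365):
    """One simulated scheduling run; returns days until the new patient is seen."""
    remaining = patients_ahead + 1  # queue ahead plus the patient we are tracking
    for day in range(1, max_days + 1):
        # Each daily slot is opened to new patients with probability new_patient_rate.
        offered = sum(rng.random() < new_patient_rate for _ in range(daily_slots))
        # Assumption: a cancelled visit is rebooked, so it does not shorten the queue.
        completed = sum(rng.random() >= cancellation_rate for _ in range(offered))
        remaining -= completed
        if remaining <= 0:
            return day
    return max_days  # cap runs that never clear the queue


def estimate_wait(patients_ahead, daily_slots, new_patient_rate,
                  cancellation_rate, n_runs=10_000, seed=None):
    """Run n_runs simulations and summarize them with percentiles."""
    rng = random.Random(seed)
    waits = [
        simulate_wait_days(rng, patients_ahead, daily_slots,
                           new_patient_rate, cancellation_rate)
        for _ in range(n_runs)
    ]
    cuts = statistics.quantiles(waits, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return {"median": statistics.median(waits), "p5": cuts[0], "p95": cuts[-1]}
```

A call such as `estimate_wait(40, 8, 0.3, 0.1, seed=1)` returns a dictionary whose values can be formatted in the same style as the example output above: the median, followed by the 5th to 95th percentile range.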

Geographic Risk Assessment

Our Coverage Map uses Bayesian inference to calculate dental access risk:

Risk Score = Weighted Average of:
- Provider Density (40%)
- Insurance Coverage Rate (25%)
- Median Income (20%)
- Population Density (15%)

Risk Categories:

  • Very Easy: Risk score 0-25 (abundant providers, high insurance coverage)
  • Easy: Risk score 26-40
  • Moderate: Risk score 41-60
  • Difficult: Risk score 61-75
  • Very Difficult: Risk score 76-100 (provider deserts, low insurance coverage)
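
To make the weighting and category thresholds concrete, here is a minimal Python sketch. The weights and category bands come from the lists above. The sketch assumes each factor has already been converted to a 0-100 sub-score in which higher values mean harder access (for example, low provider density maps to a high sub-score); that normalization step is not described on this page and is left out. Treating a fractional score between two bands (such as 25.5) as belonging to the higher band is also an assumption.

```python
# Weights taken from the Risk Score definition above.
RISK_WEIGHTS = {
    "provider_density": 0.40,
    "insurance_coverage": 0.25,
    "median_income": 0.20,
    "population_density": 0.15,
}

# (inclusive upper bound, label) pairs taken from the Risk Categories list.
RISK_CATEGORIES = [
    (25, "Very Easy"),
    (40, "Easy"),
    (60, "Moderate"),
    (75, "Difficult"),
    (100, "Very Difficult"),
]


def risk_score(sub_scores: dict) -> float:
    """Weighted average of 0-100 sub-scores (higher = harder access, by assumption)."""
    return sum(RISK_WEIGHTS[name] * sub_scores[name] for name in RISK_WEIGHTS)


def risk_category(score: float) -> str:
    """Map a 0-100 risk score to its category label."""
    if not 0 <= score <= 100:
        raise ValueError(f"risk score out of range: {score}")
    for upper, label in RISK_CATEGORIES:
        if score <= upper:
            return label
    raise AssertionError("unreachable: the last band ends at 100")
```

With hypothetical sub-scores of 90, 70, 60, and 30 (in the order listed above), the weighted average is 36 + 17.5 + 12 + 4.5 = 70, which falls in the Difficult band.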

Distance Calculations

We use the Haversine formula to calculate great-circle distances between two points on Earth:

a = sin²(Δlat/2) + cos(lat1) × cos(lat2) × sin²(Δlon/2)
c = 2 × atan2(√a, √(1−a))
distance = R × c
where R = Earth's radius (3,959 miles)

This provides accurate "as-the-crow-flies" distances. Actual driving distances may be 10-20% longer.
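
The formula above translates directly into a short Python function. This sketch assumes coordinates are supplied in decimal degrees and converts them to radians first, since the trigonometric functions expect radians. The function name is illustrative, and the clamp on `a` is a defensive addition against floating-point rounding rather than part of the formula.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # value of R given above


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_lat = math.radians(lat2 - lat1)
    delta_lon = math.radians(lon2 - lon1)

    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lon / 2) ** 2)
    a = min(1.0, a)  # guard against rounding pushing a slightly above 1
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c
```

As a sanity check, one degree of latitude along a meridian comes out to R × π/180, roughly 69.1 miles.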

Statistical Methods

Confidence Intervals

All probabilistic estimates (wait times, risk scores) include 90% confidence intervals calculated using percentile methods on the Monte Carlo simulation results (the 5th to 95th percentile of simulated values). This means 90% of simulated outcomes fall within the stated range.

Data Validation

We validate all data inputs through multiple checks:

  • Range validation: Fluoride levels must be 0-4 mg/L, distances must be positive
  • Cross-reference validation: NPI data cross-checked against state licensing boards
  • Outlier detection: Statistical outliers flagged for manual review
  • Temporal validation: Data timestamps checked to ensure freshness
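
The range and temporal checks above might look something like the following Python sketch. The record field names, the one-year freshness threshold, and the function name are hypothetical, chosen only for illustration. Cross-reference validation and outlier detection depend on external data and are not shown.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=365)  # assumed freshness threshold, not stated on this page


def validate_record(record: dict, now: datetime = None) -> list:
    """Return a list of validation problems; an empty list means the record passed."""
    now = now or datetime.now(timezone.utc)
    problems = []

    # Range validation: fluoride must be within 0-4 mg/L.
    fluoride = record.get("fluoride_mg_per_l")
    if fluoride is not None and not 0.0 <= fluoride <= 4.0:
        problems.append("fluoride level outside 0-4 mg/L")

    # Range validation: distances must be positive.
    distance = record.get("distance_miles")
    if distance is not None and distance <= 0:
        problems.append("distance must be positive")

    # Temporal validation: flag data older than the freshness threshold.
    updated_at = record.get("updated_at")
    if updated_at is not None and now - updated_at > MAX_AGE:
        problems.append("data is older than the freshness threshold")

    return problems
```

Returning a list of problems rather than raising on the first failure lets a record with several issues be flagged for manual review in one pass.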

Limitations & Assumptions

We acknowledge the following limitations:

  • NPI data completeness: Not all dentists maintain current NPI records (estimated 5-10% outdated)
  • Insurance acceptance: Insurance acceptance can change; we recommend calling to verify
  • Wait time variability: Actual wait times depend on many factors we cannot model (holidays, provider schedules, local events)
  • Fluoride data granularity: State-level averages may not reflect specific water systems; city-level data limited to major cities
  • Geographic simplification: Risk scores use county-level aggregation; neighborhood-level variation exists

Algorithm Version History

Version 1.0 (Current)

Released: January 2025

  • Initial release with NPI Registry integration
  • Monte Carlo wait time predictions
  • Bayesian risk assessment
  • CDC fluoride data integration

Questions About Our Methodology?

We're committed to transparency and scientific rigor. If you have questions about our data sources, algorithms, or calculations, please reach out.