Nightly Cleanup of Duplicate Data & Fixing Invalid Items

Every night (in production databases), some cleanup scripts run to remove duplicate data, and fix invalid items:

Schools

  • Duplicate schools (matching record, level of study, CEEB, name, from, to, degree, conferred, field, and major) are deleted. 
  • If one of the schools has a transcript attached to it, then that is the school that is kept. Otherwise, the school that was most recently updated is kept.
  • Duplicate schools (matching record, CEEB, and name) are deleted IF level of study is blank, from date is blank, to date is blank, there is no transcript associated with it, GPA is blank, class size is blank, and class rank is blank. 
  • Preference for the school record to keep is based on the most recently updated school.

Tests

  • Duplicate test scores (matching record, type, subtype, date, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.
  • Duplicate SELF-REPORTED test scores (matching record, type, subtype, date) are deleted IF a VERIFIED test score for the same test type and subtype exists for the same date.
  • Duplicate GMAT test scores (matching record, type, subtype, test year, test month, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.
  • To handle superscores (such as those that are imported from CommonApp, where the highest SAT Math score is reported with the highest SAT Math Date, and highest SAT Critical Reading score is reported with the highest SAT Critical Reading date, where those dates may or may not be the same), self-reported test scores that have different component scores on the same date are merged.

    Test types that are eligible for the merge are: ACT; SATI; SATII; AP; IB; TOEFL; GRE; IB; GMAT; IELTS. The following items must match for it to be considered for a merge: record, type, subtype, date, confirmed (verified status), and location.

    Test scores with no date are not eligible. The three other duplicate merge items above take place first, so if all of the component scores above existed and matched, they would not exist when this merge takes place.
  • To handle test types that expire (Database > Tests > "Expires After x Days" setting), the script will mark the tests as canceled if the current date is greater than the test date + the number of days specified.
  • "Score Report" checklist items are deleted if the record no longer has a self-reported test score of that type or if the Checklist setting for the test type in the Tests section of the Admin Tool is no longer set to Yes.

Passwords

  • Passwords for person records are automatically expired if a person has not attempted to log in for 12 months.
  • PINs for person records are automatically expired if the PIN was set more than 12 months ago.

Addresses

Addresses can be scrubbed in a database (both Production and Test) on-demand using the "Scrub Address Records" tool in the Database.

  • If an address is in the United States, and the region is the full name of the state rather than the abbreviated state code, it is updated to the abbreviated state code. (This can only occur if the name of the region matches the region name in Slate.

    For example, Connecticut will be updated to CT, but Conn will remain Conn).
  • If an address is in the United States, and the postal code is only three or four characters long, leading 0s are added to make it five characters long.
  • Duplicate addresses (matching record, type, effective, expires, first eight characters of the street) are deleted. Priority is given to addresses based on their priority (high > normal > low > inactive), followed by the quality (from an applicant > not validated from an import), and then the most recently updated address.
  • Addresses that have expired or are not yet effective are updated to set the rank to NULL so that it won't be included in exports or filters.
  • Addresses that do not have a rank that either do not have an expires date or have not yet expired and do not have an effective date or has an effective date in the past are updated to have the rank set on it (so it can be used in exports and filters).
  • Geocodes (used for geographic proximity things) are updated for addresses nightly for addresses that do not yet have a geocode
  • Addresses that have not yet been validated get scrubbed (via USPS)

Person Records

  • Phone numbers on the person table are reformatted so that they are uniform, using proper phone formatting. This includes "phone", "mobile", and "business" numbers. This does not include devices on the Device table
  • Secondary Citizenship is nulled out if it is the same as the Primary Citizenship.
  • Birthdates prior to January 1, 1800, are nulled out
  • Slate IDs are nine-digit values that are assigned automatically to the ref.internal column. Each night, the ref.internal column is checked for duplicates. If there is a duplicate (among the ref.internal column IDs), then a new value is assigned to the most recently created person record.

    Note: This dedupe does not look at override IDs. Users set override IDs, and there is nothing that prevents a user from duplicating a Slate ID. Likewise, a user may assign a nine-digit override ID that already exists (or may exist in the future) as an automatically generated Slate ID. These are not deduped.

Application Records

  • Just like person records, applications are assigned nine-digit ref IDs. Each night, these automatically assigned IDs are checked for duplicates.

    If more than one application has the same nine-digit ID, a new one is generated for the most recently created application.

Devices

  • Duplicate devices (matching record, type, value) are deleted. Priority is given to devices based on the rank, which is based on the priority (high > normal > low > inactive), and in the case of a tie, updated, then created dates.
  • Phone number device types (mobile, business, phone) are deleted if the only numeric characters end with "0000000."

Fields

  • CommonApp fields scoped to the person are updated to be application-scoped if the person record has a submitted application. The application used is the one that is ranked 1.
  • Duplicate field values (matching field, record, order, prompt, index, related) are deleted.
  • If a record has multiple field values for a field configured to store a single value, all but a single field is deleted. The priority is based on the field ID (which is random)

Origins

  • Person records updated in the past 48 hours will have their first and last origins recalculated using the Origin Groups and Origin Sources configured in the Admin Tool. To force process these (if they make changes to their Origin Groups or Origin Sources), they can run a Retroactive Refresh in the Rules Editor. For more information about how Origins work, see the Query Filters/Exports Specialist Knowledge article "Origin Sources: Exports/Filters, and other important information."
Was this article helpful?
8 out of 9 found this helpful