Starting every night in production databases, some cleanup scripts run to remove duplicate data, and fix invalid items.
The overnight cleanup process is placed into a queue of the overall nightly process, which begins at 2:30 AM eastern time. Depending on its placement in the queue and other cluster database activity, the cleanup process can run anytime after 2:30 AM eastern time. Typically everything is concluded by 10:30 AM eastern time. However, there can be times when the process runs later in the afternoon.
Person Records
- There are three phone number columns on the person table: phone, mobile, and business. These columns are reformatted so that they are uniform, using proper phone formatting.
- The Phone numbers columns are populated with the rank 1 value from devices that have a device type with a category of 'phone,' 'mobile,' or 'business.' The category corresponds with the person table column.
- Secondary Citizenship is nulled out if it is the same as Primary Citizenship.
- Birthdates before January 1, 1800, are nulled out.
- Slate Reference IDs are nine-digit values assigned automatically to the ref.internal column. Each night, the ref.internal column is checked for duplicates. A new value is assigned to the most recently created person record if there is a duplicate (among the ref.internal column IDs). The Slate Reference ID clean also runs hourly throughout the day in production databases.
Note: This dedupe does not look at override IDs. Users set override IDs, and nothing prevents a user from duplicating a Slate ID. Likewise, a user can assign a nine-digit override ID that already exists (or can exist in the future) as an automatically generated Slate ID. These are not deduped.
Passwords
- Passwords for person records automatically expire if a person has not attempted to log in for 12 months.
- PINs for person records automatically expire if the PIN was set more than 12 months ago.
Addresses
Addresses can be scrubbed in a database (both Production and Test) on-demand using the "Scrub Address Records" tool in the Database.
- If an address is in the United States, and the region is the full name of the state rather than the abbreviated state code, it is updated to the abbreviated state code. (This can only occur if the region's name matches the region's name in Slate.
For example, Connecticut will be updated to CT, but Conn will remain Conn). - If an address is in the United States, and the postal code is only three or four characters long, leading 0s are added to make it five characters long.
- Duplicate addresses (matching record, type, effective, expires, first eight characters of the street) are deleted. Priority is given to addresses based on their priority (high > normal > low > inactive), followed by the quality (from an applicant > not validated from an import), and then the most recently updated address.
- Addresses that have expired or are not yet effective are updated to set the rank to NULL so that it won't be included in exports or filters.
- Addresses that do not have a rank that either does not have an expiration date or has not yet expired and does not have an effective date or has an effective date in the past are updated to have the rank set on it (so it can be used in exports and filters).
- Geocodes (used for geographic proximity things) are updated for addresses nightly for addresses that do not yet have a geocode
- Addresses that have not yet been validated get scrubbed (via USPS)
Devices
- Duplicate devices (matching record, type, value) are deleted. Priority is given to devices based on the rank, which is based on the priority (high > normal > low > inactive), and in the case of a tie, updated, then created dates.
- Phone number device types (mobile, business, phone) are deleted if only the numeric characters end with "0000000."
Fields
- Duplicate field values (matching field, record, order, prompt, index, related) are deleted.
- If a record has multiple field values for a field configured to store a single value, all but a single field is deleted. The priority is based on the field ID (which is random)
Schools
- Duplicate schools (matching record, level of study, CEEB, name, from, to, degree, conferred, field, and major) are deleted.
- If one of the schools has a transcript attached to it, then that is the school that is kept. Otherwise, the school that was most recently updated is kept.
- Duplicate schools (matching record, CEEB, and name) are deleted IF the level of study is blank, from date is blank, to date is blank, there is no transcript associated with it, GPA is blank, class size is blank, and class rank is blank.
- Preference for the school record to keep is based on the most recently updated school.
Origins
- Person records updated in the past 48 hours will have their first and last origins recalculated using the Origin Groups and Origin Sources configured in the Admin Tool. To force process these (if they change their Origin Groups or Origin Sources), they can run a Retroactive Refresh in the Rules Editor. For more information about how Origins work, see the Query Filters/Exports Specialist Knowledge article "Origin Sources: Exports/Filters, and other important information."
Additional Cleanup for Combined Advancement and Admissions
Applications
- Just like person records, applications are assigned nine-digit Application Reference IDs. Each night, these automatically assigned IDs are checked for duplicates.
If more than one application has the same nine-digit ID, a new one is generated for the most recently created application.
Fields
- CommonApp fields scoped to the person are updated to be application-scoped if the person record has a submitted application. The application used is the one that is ranked 1.
Tests
- Duplicate test scores (matching record, type, subtype, date, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.
- Duplicate SELF-REPORTED test scores (matching record, type, subtype, date) are deleted IF a VERIFIED test score for the same test type and subtype exists for the same date.
- Duplicate GMAT test scores (matching record, type, subtype, test year, test month, total, score1, score2, score3, score4, score5, alpha, location) are deleted. Preference is given to the test score that is verified, then to the test record that was most recently updated.
- To handle superscores (such as those that are imported from CommonApp, where the highest SAT Math score is reported with the highest SAT Math Date, and the highest SAT Critical Reading score is reported with the highest SAT Critical Reading date, where those dates may or may not be the same), self-reported test scores that have different component scores on the same date are merged.
Test types eligible for the merge are ACT, SATI, SATII, AP, IB; TOEFL, GRE; IB; GMAT, and IELTS. The following items must match for it to be considered for a merge: record, type, subtype, date, confirmed (verified status), and location.
Test scores with no date are not eligible. The three other duplicate merge items above take place first, so if all of the component scores above existed and matched, they would not exist when this merge takes place. - To handle test types that expire (Database > Tests > "Expires After x Days" setting), the script will mark the tests as canceled if the current date is greater than the test date + the number of days specified.
- "Score Report" checklist items are deleted if the record no longer has a self-reported test score of that type or if the Checklist setting for the test type in the Tests section of the Admin Tool is no longer set to Yes.