Day 4 comes with a new challenge – we need to validate passport data that comes in batches that are pretty unstructured. SQL is usually used to tame structured data, but it’s totally possible to use it to deal with unstructured data, too.

The main challenge here for SQL folks is to get the data into a structured form where we can filter it.

I could do the same as before, loading the data via PL/SQL, but I want to do as much as possible inside the boundaries of pure (Oracle) SQL. Therefore, I will use the SQL Loader feature to get access to external data. To be able to do that, we need an Oracle directory and the file stored reachable for the database.

/* As SYSDBA */
create or replace directory ext_tables 
  as '/home/oracle/ext_tables';
grant read on directory ext_tables to sithdb;

The first thing I want to do is to have separate records. Therefore my RECORDS delimiter is actually two newlines, which is represented in my case as ‘\n\n’.

I create an externally organized table and then – to speed up things – create a new table from that input.

create table aoc_day4_ext_input (
  raw_record varchar2(4000)
)
  organization external (
  type oracle_loader
  default directory ext_tables
  access parameters (
    records delimited by '\n\n'
    fields terminated by ','
    missing field values are null
    (
      raw_record char(4000)
    )
  )
  location ('day_4_input.txt')
  );

create table aoc_day4_input as
  select * from aoc_day4_ext_input;

Now we have a separate row for each passport record – perfect starting conditions!

Next, I want to get rid of the newline and space discrepancy in field separators, therefore I normalize all newlines to spaces. In that process, I also add a new column holding the rownum which will be our passport-ID

select
  rownum passport_id,
  replace(raw_record, chr(10), ' ') line
from aoc_day4_input

We can now split the passport rows into several rows (one for each data field we need) by cross-joining it with a collection of data field names. The correct values can then be retreived by some regex in combination with the field name.

with
  fields as (
    select 'byr' name from dual union all
    select 'iyr'      from dual union all
    select 'eyr'      from dual union all
    select 'hgt'      from dual union all
    select 'hcl'      from dual union all
    select 'ecl'      from dual union all
    select 'pid'      from dual union all
    select 'cid'      from dual
  ),
  normalize_newline as (
    select
      rownum passport_id,
      replace(raw_record, chr(10), ' ') line
    from aoc_day4_input
  )
select
      passport_id,
      field.name key,
      substr(
        regexp_substr(line, field.name||':[^ ]+'),
        length(field.name)+2) value
    from normalize_newline, fields field

This again can be now pivoted back into a column-structure. And with that, it’s pretty easy to only select those rows which have all the required fields present.

with
  fields as (
    select 'byr' name from dual union all
    select 'iyr'      from dual union all
    select 'eyr'      from dual union all
    select 'hgt'      from dual union all
    select 'hcl'      from dual union all
    select 'ecl'      from dual union all
    select 'pid'      from dual union all
    select 'cid'      from dual
  ),
  normalize_newline as (
    select
      rownum passport_id,
      replace(raw_record, chr(10), ' ') line
    from aoc_day4_input
  ),
  split_columns as (
    select
      passport_id,
      field.name key,
      substr(
        regexp_substr(line, field.name||':[^ ]+'),
        length(field.name)+2) value
    from normalize_newline, fields field
  ),
  pivot_data as (
    select
      *
    from split_columns
    pivot (
      max(value)
      for key in (
        'byr' as BYR,
        'iyr' as IYR,
        'eyr' as EYR,
        'hgt' as HGT,
        'hcl' as HCL,
        'ecl' as ECL,
        'pid' as PID,
        'cid' as CID
      )
    )
  )
select count(*)
from pivot_data
where byr is not null
  and iyr is not null
  and eyr is not null
  and hgt is not null
  and hcl is not null
  and ecl is not null
  and pid is not null
;

Part 2 introduces a number of very specific validation metrics. This is the part where SQL can really shine, because to implement a validation we can just use a WHERE clause.

There is one more preparation we have to do because of the “Height” validation – this can be in cm or inch, so we have to prepare our data to provide this information. We do so again by using some regex-splitting.

The rest is really straight forwared.

Take care if you use REGEXP_LIKE to add start ^ and end $ indicators, otherwise you’ll get also partial results, which is wrong.

with
  fields as (
    select 'byr' name from dual union all
    select 'iyr'      from dual union all
    select 'eyr'      from dual union all
    select 'hgt'      from dual union all
    select 'hcl'      from dual union all
    select 'ecl'      from dual union all
    select 'pid'      from dual union all
    select 'cid'      from dual
  ),
  normalize_newline as (
    select
      rownum passport_id,
      replace(raw_record, chr(10), ' ') line
    from aoc_day4_input
  ),
  split_columns as (
    select
      passport_id,
      field.name key,
      substr(
        regexp_substr(line, field.name||':[^ ]+'),
        length(field.name)+2) value
    from normalize_newline, fields field
  ),
  pivot_data as (
    select
      *
    from split_columns
    pivot (
      max(value)
      for key in (
        'byr' as BYR,
        'iyr' as IYR,
        'eyr' as EYR,
        'hgt' as HGT,
        'hcl' as HCL,
        'ecl' as ECL,
        'pid' as PID,
        'cid' as CID
      )
    )
  ),
  data_with_height as (
    select
      passport_id,
      byr,
      iyr,
      eyr,
      hgt,
      regexp_replace(hgt, '^([0-9]+)([cmin]{2})$', '\2') hgt_measure,
      regexp_replace(hgt, '^([0-9]+)([cmin]{2})$', '\1') hgt_value,
      hcl,
      ecl,
      pid,
      cid
    from pivot_data
  )
select count(*)
from data_with_height
where byr between 1920 and 2002
  and iyr between 2010 and 2020
  and eyr between 2020 and 2030
  and (
    (hgt_measure = 'cm' and hgt_value between 150 and 193)
    or (hgt_measure = 'in' and hgt_value between 59 and 76)
  )
  and regexp_like(hcl, '^#[a-f0-9]{6}$')
  and ecl in ('amb','blu','brn','gry','grn','hzl','oth')
  and regexp_like(pid, '^[0-9]{9}$')
;

Due to constant “Bad Gateway” errors when I tried to insert the scripted data of aoc_day4_input, there is no Oracle LiveSQL script 🙁


0 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *