Day 4 comes with a new challenge – we need to validate passport data that comes in batches that are pretty unstructured. SQL is usually used to tame structured data, but it’s totally possible to use it to deal with unstructured data, too.
The main challenge here for SQL folks is to get the data into a structured form where we can filter it.
I could do the same as before, loading the data via PL/SQL, but I want to do as much as possible inside the boundaries of pure (Oracle) SQL. Therefore, I will use the SQL Loader feature to get access to external data. To be able to do that, we need an Oracle directory and the file stored reachable for the database.
/* As SYSDBA */
create or replace directory ext_tables
as '/home/oracle/ext_tables';
grant read on directory ext_tables to sithdb;
The first thing I want to do is to have separate records. Therefore my RECORDS delimiter is actually two newlines, which is represented in my case as ‘\n\n’.
I create an externally organized table and then – to speed up things – create a new table from that input.
create table aoc_day4_ext_input (
raw_record varchar2(4000)
)
organization external (
type oracle_loader
default directory ext_tables
access parameters (
records delimited by '\n\n'
fields terminated by ','
missing field values are null
(
raw_record char(4000)
)
)
location ('day_4_input.txt')
);
create table aoc_day4_input as
select * from aoc_day4_ext_input;
Now we have a separate row for each passport record – perfect starting conditions!
Next, I want to get rid of the newline and space discrepancy in field separators, therefore I normalize all newlines to spaces. In that process, I also add a new column holding the rownum which will be our passport-ID
select
rownum passport_id,
replace(raw_record, chr(10), ' ') line
from aoc_day4_input
We can now split the passport rows into several rows (one for each data field we need) by cross-joining it with a collection of data field names. The correct values can then be retreived by some regex in combination with the field name.
with
fields as (
select 'byr' name from dual union all
select 'iyr' from dual union all
select 'eyr' from dual union all
select 'hgt' from dual union all
select 'hcl' from dual union all
select 'ecl' from dual union all
select 'pid' from dual union all
select 'cid' from dual
),
normalize_newline as (
select
rownum passport_id,
replace(raw_record, chr(10), ' ') line
from aoc_day4_input
)
select
passport_id,
field.name key,
substr(
regexp_substr(line, field.name||':[^ ]+'),
length(field.name)+2) value
from normalize_newline, fields field
This again can be now pivoted back into a column-structure. And with that, it’s pretty easy to only select those rows which have all the required fields present.
with
fields as (
select 'byr' name from dual union all
select 'iyr' from dual union all
select 'eyr' from dual union all
select 'hgt' from dual union all
select 'hcl' from dual union all
select 'ecl' from dual union all
select 'pid' from dual union all
select 'cid' from dual
),
normalize_newline as (
select
rownum passport_id,
replace(raw_record, chr(10), ' ') line
from aoc_day4_input
),
split_columns as (
select
passport_id,
field.name key,
substr(
regexp_substr(line, field.name||':[^ ]+'),
length(field.name)+2) value
from normalize_newline, fields field
),
pivot_data as (
select
*
from split_columns
pivot (
max(value)
for key in (
'byr' as BYR,
'iyr' as IYR,
'eyr' as EYR,
'hgt' as HGT,
'hcl' as HCL,
'ecl' as ECL,
'pid' as PID,
'cid' as CID
)
)
)
select count(*)
from pivot_data
where byr is not null
and iyr is not null
and eyr is not null
and hgt is not null
and hcl is not null
and ecl is not null
and pid is not null
;
Part 2 introduces a number of very specific validation metrics. This is the part where SQL can really shine, because to implement a validation we can just use a WHERE clause.
There is one more preparation we have to do because of the “Height” validation – this can be in cm or inch, so we have to prepare our data to provide this information. We do so again by using some regex-splitting.
The rest is really straight forwared.
Take care if you use REGEXP_LIKE to add start ^
and end $
indicators, otherwise you’ll get also partial results, which is wrong.
with
fields as (
select 'byr' name from dual union all
select 'iyr' from dual union all
select 'eyr' from dual union all
select 'hgt' from dual union all
select 'hcl' from dual union all
select 'ecl' from dual union all
select 'pid' from dual union all
select 'cid' from dual
),
normalize_newline as (
select
rownum passport_id,
replace(raw_record, chr(10), ' ') line
from aoc_day4_input
),
split_columns as (
select
passport_id,
field.name key,
substr(
regexp_substr(line, field.name||':[^ ]+'),
length(field.name)+2) value
from normalize_newline, fields field
),
pivot_data as (
select
*
from split_columns
pivot (
max(value)
for key in (
'byr' as BYR,
'iyr' as IYR,
'eyr' as EYR,
'hgt' as HGT,
'hcl' as HCL,
'ecl' as ECL,
'pid' as PID,
'cid' as CID
)
)
),
data_with_height as (
select
passport_id,
byr,
iyr,
eyr,
hgt,
regexp_replace(hgt, '^([0-9]+)([cmin]{2})$', '\2') hgt_measure,
regexp_replace(hgt, '^([0-9]+)([cmin]{2})$', '\1') hgt_value,
hcl,
ecl,
pid,
cid
from pivot_data
)
select count(*)
from data_with_height
where byr between 1920 and 2002
and iyr between 2010 and 2020
and eyr between 2020 and 2030
and (
(hgt_measure = 'cm' and hgt_value between 150 and 193)
or (hgt_measure = 'in' and hgt_value between 59 and 76)
)
and regexp_like(hcl, '^#[a-f0-9]{6}$')
and ecl in ('amb','blu','brn','gry','grn','hzl','oth')
and regexp_like(pid, '^[0-9]{9}$')
;
Due to constant “Bad Gateway” errors when I tried to insert the scripted data of aoc_day4_input
, there is no Oracle LiveSQL script 🙁
0 Comments