The best strategy is to combine a manual review process with
automated quality control.
Simply reviewing the data visually goes a long way. If you are familiar
with the industry, your experience will help you quickly spot values that look off.
Pick random data points and make sure they match what the target website shows.
Row counts and checksums: if the target website
has 15 pages with 100 records each, you should have 1,500 records in your pipeline.
Null checks: be aware of how many fields are null or undefined, and be explicit that you want it that way.
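The row count and null checks above can be sketched in a few lines of Python. This is only an illustration: it assumes the records are already loaded as a list of dictionaries, and the function names and the nullable_fields allow-list are hypothetical, not part of any tool described here.

```python
def check_row_count(records, pages, per_page):
    """Return (ok, expected); ok is True when the record count matches."""
    expected = pages * per_page
    return len(records) == expected, expected


def unexpected_nulls(records, nullable_fields):
    """Count null values per field, ignoring fields declared nullable.

    nullable_fields is the explicit list of fields that are allowed
    to be empty (a hypothetical convention for this sketch).
    """
    counts = {}
    for record in records:
        for field, value in record.items():
            if value is None and field not in nullable_fields:
                counts[field] = counts.get(field, 0) + 1
    return counts
```

For the example above you would call check_row_count(records, pages=15, per_page=100) and expect the first returned value to be True. Any field reported by unexpected_nulls is either a data problem or a field you should add to the allow-list on purpose.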
You can inspect the result of your code run and make sure the
snapshot makes sense to you. You can simply review the output
as JSON in the inline editor.
Alternatively, you can download the JSON file, or copy the
link to the snapshot JSON file and load it into another tool (such as Postman).
You can review basic statistics for each field to see how many values are null or undefined.
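If you have downloaded the snapshot, similar per-field statistics can be reproduced locally. The sketch below assumes the snapshot is a JSON array of objects; the function name field_stats and the sample data are made up for illustration. It treats a key that is absent from a record as "undefined" and an explicit JSON null as "null".

```python
import json
from collections import Counter


def field_stats(records):
    """Return, per field, how many values are null, missing, or filled."""
    fields = set()
    for record in records:
        fields.update(record)

    nulls, missing = Counter(), Counter()
    for record in records:
        for field in fields:
            if field not in record:
                missing[field] += 1   # key absent: "undefined"
            elif record[field] is None:
                nulls[field] += 1     # key present, value is null

    total = len(records)
    return {
        field: {
            "null": nulls[field],
            "missing": missing[field],
            "filled": total - nulls[field] - missing[field],
        }
        for field in sorted(fields)
    }


# Illustrative data only; in practice, load the downloaded snapshot file.
sample = json.loads('[{"title": "A", "price": null}, {"title": "B"}]')
for field, stats in field_stats(sample).items():
    print(field, stats)
```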
Some automated tools for checking
Schema validation automatically checks, for each row and each field,
whether the data matches the schema.
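To make the idea of a per-row, per-field check concrete, here is a minimal sketch in Python. The schema format (field name mapped to a type and a required flag), the SCHEMA contents and the validate function are assumptions made for this example; they do not describe how the built-in check is implemented.

```python
# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "title": (str, True),
    "price": (float, True),
    "rating": (float, False),
}


def validate(records, schema):
    """Return a list of (row index, field, problem) tuples."""
    errors = []
    for i, record in enumerate(records):
        for field, (expected_type, required) in schema.items():
            value = record.get(field)
            if value is None:
                # Absent keys and explicit nulls are treated the same here.
                if required:
                    errors.append((i, field, "missing or null"))
            elif not isinstance(value, expected_type):
                problem = (
                    f"expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
                errors.append((i, field, problem))
        # A strict schema also rejects fields it does not know about.
        for field in sorted(record.keys() - schema.keys()):
            errors.append((i, field, "unexpected field"))
    return errors
```

An empty list means every row matched. Reporting unknown fields as errors is a deliberate choice in this sketch: it is what makes the schema strict rather than permissive.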
We also let you define custom rules for your dataset. A custom validator is a Python program that validates your code's output.
The most powerful tool we have is the interactive schema editor. It shows which fields your dataset is expected to have versus which fields it actually has.
You should strive to keep the schema as strict as possible.