RDF, Typescript and Deno - Part 2: sample data
In part 1, I laid out some reasons for looking at using RDF in TypeScript with Deno. In this installment, I’ll put together some quick sample data before I start looking at RDF specifically.
Sample data is always a challenge. Real-world data is often large and messy, or has inconvenient license terms. On the other hand, synthetic data tends to hide the kinds of issues we have to deal with day-to-day in data processing. Who needs another todo list app, really?
For this exercise, I’m using recent sightings of cetaceans around the coast of the UK by the Sea Watch Foundation. There is - as far as I can see - no explicit license on this data, but as it’s reported by anyone and concerns wild animals in their natural habitat I’m going to assume it’s OK to use this data for tutorial purposes. Obviously if I find out otherwise, I’ll use a different dataset.
There’s a tool on the Sea Watch web site to list recent sightings. It looks like this:
Bottlenose dolphin (x6) - Portland Bill, Dorset at 08:00 on 1 Jul 2021 by Des and Shirley Peadon
Grey Seal (x11) - The Chick Island, Cornwall at 15:17 on 30 Jun 2021 by Newquay Sea Safaris Newquay Sea Safaris
Sunfish (x1) - Towan Headland, Cornwall at 14:53 on 30 Jun 2021 by Newquay Sea Safaris Newquay Sea Safaris
This is reasonably well structured data, to human eyes, but still needs a bit of processing to get it into a form we can conveniently process. By inspection, each sighting contains:
- a species name
- a quantity (optional)
- a location
- a time (optional)
- a date
- a reporter, which could be one or more individuals, or an organisation
There’s a bit of variability in this structure, so it would be convenient to have it as a more structured format, such as JSON:
{
"species": "Bottlenose dolphin",
"quantity": 12,
"location": "Portland Bill, Dorset",
"spotter": "Alan Hold",
"date": "2021-06-30T23:00:00.000Z",
"time": "09:15"
}
Mostly this is a case of splitting each line of data up in a robust way, but also
performing some basic data transformation. The quantity field is parsed as an integer,
not a string, and the date is parsed as a JavaScript Date object. I decided to keep the
time field separate, as only around half of the sightings record the time. In theory we
could do some data reconciliation to try to recognise the location and the species, but
the data is likely to be too noisy to be able to do this robustly. So for now, just
robustly parsing the strings is enough.
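As a rough illustration of that transformation step, here is a minimal sketch. The names RawFields, Sighting, parseSightingDate and toSighting are made up for this example, and the hand-rolled date parsing is only a stand-in so that the snippet has no dependencies. It is not the code from the repository.

```typescript
// Hypothetical shape of the string fields after a line has been split up.
interface RawFields {
  species: string
  quantity?: string
  location: string
  date: string // e.g. "1 Jul 2021"
  time?: string
  spotter: string
}

// Converted record, shaped like the JSON example above.
interface Sighting {
  species: string
  quantity?: number
  location: string
  spotter: string
  date: Date
  time?: string
}

const MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

// Parse "1 Jul 2021" into a Date at local midnight.
// Assumes English three-letter month names, as in the sample lines.
function parseSightingDate(text: string): Date {
  const [day, month, year] = text.trim().split(/\s+/)
  const monthIndex = MONTHS.indexOf(month)
  const date = new Date(Number(year), monthIndex, Number(day))
  if (monthIndex < 0 || Number.isNaN(date.getTime())) {
    throw new Error(`Unrecognised date: ${text}`)
  }
  return date
}

// Convert the raw strings: quantity becomes a number, date becomes a Date,
// and the optional time is passed through untouched.
function toSighting(raw: RawFields): Sighting {
  return {
    species: raw.species,
    quantity: raw.quantity === undefined ? undefined : parseInt(raw.quantity, 10),
    location: raw.location,
    spotter: raw.spotter,
    date: parseSightingDate(raw.date),
    time: raw.time,
  }
}

console.log(JSON.stringify(toSighting({
  species: 'Bottlenose dolphin',
  quantity: '6',
  location: 'Portland Bill, Dorset',
  date: '1 Jul 2021',
  time: '08:00',
  spotter: 'Des and Shirley Peadon',
}), null, 2))
```

Because the Date here is built at local midnight and JSON serialises dates in UTC, a date parsed on a machine running on British Summer Time comes out as 23:00 on the previous day, which would be consistent with the date value in the JSON example above.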
The code for this data conversion step is available on GitHub.
It’s mostly fairly straightforward; perhaps the main point of note (other than TypeScript
and Deno, see below) is that the work of the recogniser is done by one rather large regex.
In JavaScript, regular expressions can capture segments of the input in groups, and these
capture groups can be given a name using the construct (?<name>...). Conveniently, these
named groups are returned as a JavaScript object looking like the JSON structure above, as
long as the names of the fields and capture groups line up.
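To make the named-group idea concrete, here is a small sketch. SIGHTING_PATTERN and namedGroups are names invented for this example, and the pattern is a simplified guess based only on the three sample lines above, so the real regex in the script may well differ.

```typescript
// Hypothetical, simplified pattern; the real regex in the script may differ.
// Each (?<name>...) is a named capture group; (?: ...)? is an optional, uncaptured part.
const SIGHTING_PATTERN =
  /^(?<species>.+?)(?: \(x(?<quant>\d+)\))? - (?<location>.+?)(?: at (?<time>\d{1,2}:\d{2}))? on (?<date>\d{1,2} \w{3} \d{4}) by (?<spotter>.+)$/

// Return the named groups for one line, or undefined if the line does not match.
function namedGroups(line: string): Record<string, string> | undefined {
  const match = SIGHTING_PATTERN.exec(line)
  return match === null ? undefined : match.groups
}

const sample =
  'Bottlenose dolphin (x6) - Portland Bill, Dorset at 08:00 on 1 Jul 2021 by Des and Shirley Peadon'

// Logs an object with the keys species, quant, location, time, date and spotter.
console.log(namedGroups(sample))
```

Every value in the groups object is still a string at this point (for example quant is "6"), and an optional group that did not take part in the match is present with the value undefined.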
Of Typescript
TypeScript uses type inference to determine, and then check, the types of variables
and parameters automatically where it can. In this simple program, type inference
was mostly sufficient. The main thing I needed to add to make the type checking work
was to constrain the output of the regular expression matcher. As mentioned above, I
wanted the output of the regex match to be an object closely resembling my eventual
result. The groups field of the match result is an object, but TypeScript needs
to have more information before it can check that usages of that object are legal. So
I defined an interface type SightingsData, which is the interim, not final, form of
a line from the sightings data file:
interface SightingsData {
  species: string
  quant?: string
  location: string
  date: string
  time?: string
  spotter: string
}
Then we can tell the compiler that this will be the result type from parsing, and
use the as operator to assert the type of the result (note that LINE_MATCHER is just
the large regex):
function parseLine(line: string): SightingsData | undefined {
  return line.match(LINE_MATCHER)?.groups as (SightingsData | undefined)
}
Since the match can fail, we need the optional chaining operator ?., so that if the
left-hand expression evaluates to null or undefined, the whole expression evaluates to
undefined rather than throwing an error. And since the overall result may then be
undefined, the return type needs to be the type expression SightingsData | undefined.
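A tiny sketch of how ?. behaves when a match fails. The QUANTITY pattern and the quantityOf function are stand-ins invented for this example, not part of the script:

```typescript
// Tiny stand-in pattern for illustration only.
const QUANTITY = /\(x(?<quant>\d+)\)/

// String.prototype.match returns null when nothing matches. With ?. the chain
// short-circuits to undefined instead of throwing a TypeError on null.groups.
function quantityOf(line: string): string | undefined {
  return line.match(QUANTITY)?.groups?.quant
}

console.log(quantityOf('Sunfish (x1) - Towan Headland, Cornwall')) // logs 1
console.log(quantityOf('no quantity here')) // logs undefined

// The compiler now makes callers deal with the undefined case before using the value.
const raw = quantityOf('Grey Seal (x11) - The Chick Island, Cornwall')
const count = raw === undefined ? undefined : parseInt(raw, 10)
console.log(count) // logs 11
```

The union return type is what forces the explicit check on raw: passing a possibly undefined value straight to parseInt would be rejected by the type checker when strict null checks are on.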
With VS Code as my editor, TypeScript works out the rest of the types, and the editor
shows them on mouse hover. Here, for example, is the result of hovering on function asData:
Of Deno
This simple script doesn’t get to exercise much of Deno, but does show a couple of
interesting aspects. First, promises are used as the basis for every (potentially)
asynchronous operation, like writing or reading files. That means lots of async / await
statements, or .then() calls. No callback functions.
Second, imports can be loaded directly from a URL:
import parse from 'https://deno.land/x/date_fns@v2.22.1/parse/index.js'
In a Node.js script, this would have meant adding date-fns to the package.json file
dependencies, then running yarn install or npm install to get the dependency cached into
node_modules. Deno can work this way (spoiler alert for part 3 of this blog series),
but by default doesn’t need to.
Not diving for yarn add from the command line did feel a bit weird, but I expect I will
get used to it. More of an issue, I think, will be keeping the versions consistent. If I
use date_fns from more than one file, and then I need to upgrade to version 2.22.2,
it seems I’ll have to grep for every use of date_fns@v2.22.1. That doesn’t seem very
DRY, but maybe there are working practices I’ve not got used to yet.
The third thing about Deno is that scripts don’t have permission to perform risky operations by default. “Risky” in this context means things like: reading a file, writing a file, writing to the network, etc. Running without the permission causes an error:
$ deno run data-generation.ts
Check file:///home/ian/projects/personal/deno-experiments/data-generation.ts
error: Uncaught (in promise) PermissionDenied: Requires read access to "./sightings-data.txt", run again with the --allow-read flag
const sourceData = await Deno.readTextFile(source)
^
at deno:core/core.js:86:46
at unwrapOpResult (deno:core/core.js:106:13)
at async open (deno:runtime/js/40_files.js:46:17)
at async Object.readTextFile (deno:runtime/js/40_read_file.js:40:18)
at async readLines (file:///home/ian/projects/personal/deno-experiments/data-generation.ts:18:22)
It’s a nice clear error. But I found it quite easy to forget to add the appropriate flags. The correct version:
$ deno run --allow-read --allow-write data-generation.ts
Check file:///home/ian/projects/personal/deno-experiments/data-generation.ts
Success.
It would be quite easy to create a bash alias with those permissions turned on, but that rather defeats the goal. Security or convenience: pick one!