This chapter is an introduction to the handling of columnar data in Goal.
Goal is a small language and hence doesn’t come with as many specialized features for dataframe manipulation as languages like R. Still, Goal is well-suited for many typical dataframe handling problems, without requiring to learn how to use any new abstractions or DSLs: most of the functionality is there in the core language, without extra magic. Its small size and fast startup-time can also be an advantage in typical short-running scripts.
To visualize the tables in the various examples, we’ll use
for convenience the
fmt.tbl[t;r;c;f]
formatting function available at
lib/fmt.goal
in the language’s repository, which displays at most
r
rows and
c
columns from a table
t
,
using format string
f
for numbers.
/ NOTE: replace with actual path on your system: we assume
/ here we're in the "examples" directory in Goal's repository.
import"../lib/fmt.goal"
/ pp[t] pretty prints dict as table, assuming string keys, limiting results to
/ at most 5 rows and 8 columns, using "%.1f" to format floats.
pp:fmt.tbl[;5;8;"%.1f"]
Recall from the help that
import
can either take an absolute or relative path, or take a
library name with no extension, for example
import"fmt"
,
and search for a file
"fmt.goal"
at the location provided by the
GOALLIB
environment variable (if set).
Also, note how
pp
is defined using projection syntax to derive a new function
from
fmt.tbl
with some arguments fixed.
In this chapter, we’ll use as data the CSV file found at
examples/data/wesnoth-units.csv
in Goal’s repository, that contains (simplified) game data
for most units from the FOSS game
Wesnoth
(for version 1.18).
t:{(*'x)!1_'x}@csv 'read"data/wesnoth-units.csv" / read csv file into t
t:+_+t / lowercase column names
t:@[t;!"level cost xp hp movement";"i"$] / parse integer columns
The first line reads the CSV data into a list of columns
using the
csv
verb, and we then extract the header with keys for the table
and keep the rest as values. Note that Goal represents
tables as dicts: the keys are the names of the columns, and
the values contain the list of data columns. We also use the
'
return early on error syntactic sugar to propagate any
IO errors returned by
read
.
The next line lowercases all column names for simplicity.
We use the monadic form of the
+
verb to exchange keys and values so that we can apply
_
on the column names, and then swap again: this handy
manipulation is possible because tables in Goal are actually
just dicts, which are essentially a pair of key and value
arrays.
The third line maps the
"i"$
parsing form on integer columns.
We can now use
pp
to visualize the first 5 rows and 8 columns of the table:
pp t
/ === Table 307x9 ===
/ id race alignment level cost xp upgrade hp
/ ------------- ------------ --------- ----- ---- --- ----------- --
/ "Blood Bat" "bats" "chaotic" 1 23 37 "Dread Bat" 27
/ "Dread Bat" "bats" "chaotic" 2 32 100 "" 33
/ "Vampire Bat" "bats" "chaotic" 0 13 22 "Blood Bat" 16
/ "Boat" "mechanical" "lawful" 1 10 50 "" 1
/ "Galleon" "mechanical" "lawful" 1 10 50 "" 1
(the output for all examples is showed in comments for reading convenience)
As you can see, unlike in R, primitives in Goal have a simple scope and are designed to be combined with other primitives. They do not have their own mini-language with lots of options configuring all sorts of desired behaviors that could be useful in a specific task like CSV parsing: you use a primitive for reading the file, another for parsing the data, another for performing renames, another for parsing numbers, and so on. This combinatoric, idiom-based, Unix-like approach shows clearly each do-one-thing-and-do-it-well array operation without hiding everything behind a single black box abstraction. This style is something that most array languages share to some degree.
A fundamental operation on tables is filtering, like with dplyr’s
filter
verb or the WHERE clause in SQL. Filtering on tables can be
done easily in Goal using the
X#t
dyadic form. The left argument is a list of filters. The
first applies to the keys (playing the role of
select
),
and the rest applies to the rows in succession. Note that
this means order can have performance implications.
The filters are functions returning booleans, as in the
f#Y
array filtering form. The key selection filter and the first
row filter can also be given directly as booleans; moreover,
the former can also consist of a list of column names.
For example selecting all the rows having
"goblin"
in the
race
column can be done as follows:
pp(1;..race="goblin")#t / rows where race is "goblin"
/ === Table 3x9 ===
/ id race alignment level cost xp upgrade hp
/ ----------------- -------- --------- ----- ---- -- ---------------- --
/ "Goblin Impaler" "goblin" "chaotic" 1 13 50 "" 26
/ "Goblin Rouser" "goblin" "chaotic" 1 13 50 "" 31
/ "Goblin Spearman" "goblin" "chaotic" 0 9 18 "Goblin Impaler" 18
pp(rx/p$/!:;..race="goblin")#t / same but only columns with name ending in p
/ === Table 3x2 ===
/ xp hp
/ -- --
/ 50 26
/ 50 31
/ 18 18
In the first example, the true value
1
tells the filtering verb to keep all columns,
while
..race="goblin"
is a function described using
field expression syntax,
equivalent to
{x["race"]="goblin"}
,
and is used to filter rows. The field expression syntax is
handy when manipulating tables, because it allows to refer
to columns directly by name, as if they were variables. The
second example is the same but uses a regexp to filter
column names.
Several row filtering functions can be used at once:
pp(1;..race="human";..alignment="chaotic")#t
/ === Table 17x9 ===
/ id race alignment level cost xp upgrade hp
/ ---------- ------- --------- ----- ---- --- ------------ --
/ "Outlaw" "human" "chaotic" 2 24 71 "Fugitive" 47
/ "Assassin" "human" "chaotic" 3 46 150 "" 51
/ "Bandit" "human" "chaotic" 2 22 61 "Highwayman" 50
/ "Footpad" "human" "chaotic" 1 14 35 "Outlaw" 30
/ "Fugitive" "human" "chaotic" 3 55 150 "" 68
Because column names are simply the keys of a dictionary, while the column values are its array of values, it’s easy to do location-based selections of both columns and rows:
pp 3#t / select first 3 columns
/ === Table 307x3 ===
/ id race alignment
/ ------------- ------------ ---------
/ "Blood Bat" "bats" "chaotic"
/ "Dread Bat" "bats" "chaotic"
/ "Vampire Bat" "bats" "chaotic"
/ "Boat" "mechanical" "lawful"
/ "Galleon" "mechanical" "lawful"
krng:{[t;x;y](k?x)_(1+k?y)#k:!t} / inclusive key range from x to y
pp krng[t;"alignment";"cost"]#t / select range of columns
/ === Table 307x3 ===
/ alignment level cost
/ --------- ----- ----
/ "chaotic" 1 23
/ "chaotic" 2 32
/ "chaotic" 0 13
/ "lawful" 1 10
/ "lawful" 1 10
Similar to filtering, row indexing can be used to select a subset of rows or columns. This uses a special Goal feature that allows table-like indexing for dicts with string keys when indexing first by row. Unlike filtering, indexing also allows to reorder columns as specified.
pp t[10+!5] / select from row index 10 to 14
/ === Table 5x9 ===
/ id race alignment level cost xp upgrade hp
/ ---------------- ------- --------- ----- ---- --- ---------------- --
/ "Drake Burner" "drake" "lawful" 1 21 43 "Fire Drake" 42
/ "Drake Clasher" "drake" "lawful" 1 19 41 "Drake Thrasher" 43
/ "Drake Enforcer" "drake" "lawful" 3 58 150 "" 85
/ "Drake Fighter" "drake" "lawful" 1 17 41 "Drake Warrior" 39
/ "Fire Drake" "drake" "lawful" 2 35 84 "Inferno Drake" 63
pp t[-5?#*t] / randomly select 5 distinct rows
/ === Table 5x9 ===
/ id race alignment level cost xp upgrade hp
/ ---------------- --------- --------- ----- ---- --- ----------------- --
/ "Giant Rat" "monster" "chaotic" 0 5 25 "" 15
/ "Nibbler" "monster" "neutral" 0 15 26 "Caribe" 28
/ "Gryphon Master" "gryphon" "neutral" 2 40 100 "" 51
/ "Royal Warrior" "human" "lawful" 3 47 150 "" 64
/ "Saurian Oracle" "lizard" "chaotic" 2 28 74 "Saurian Prophet" 29
pp t[!-5;"hp""id"] / select last 5 rows of columns "hp" "id" (reorder)
/ === Table 5x2 ===
/ hp id
/ -- --------------
/ 80 "Ancient Wose"
/ 64 "Elder Wose"
/ 52 "Wose"
/ 26 "Wose Sapling"
/ 56 "Wose Shaman"
pp -5#'"hp""id"#t / same using take on each column and "keep keys" (keep order)
/ === Table 5x2 ===
/ id hp
/ -------------- --
/ "Ancient Wose" 80
/ "Elder Wose" 64
/ "Wose" 52
/ "Wose Sapling" 26
/ "Wose Shaman" 56
Sorting rows so that some column is in ascending or descending order can be done easily:
pp t@>t..level / sort by level in descending order
/ === Table 307x9 ===
/ id race alignment level cost xp upgrade hp
/ ------------------ ---------- --------- ----- ---- --- ------- ---
/ "Elder Mage" "human" "neutral" 5 90 250 "" 60
/ "Fire Dragon" "monster" "chaotic" 5 100 250 "" 101
/ "Armageddon Drake" "drake" "lawful" 4 118 200 "" 98
/ "Dune Paragon" "dunefolk" "lawful" 4 78 200 "" 82
/ "Wyvern Rider" "monster" "liminal" 4 102 200 "" 85
Note how the verb
>
is used to return sorting indices. This method is flexible,
but unlike dataframe libraries in R or python, or SQL
queries, we refer here twice to the table
t
,
because we are simply using the indexing verb
@
that can take inputs unrelated to the table. Field
expressions make it easy to write order-by functions for
tables suitable for chaining:
pp (..x@<level)(..x@<cost)t / ascending order by level and cost
/ === Table 307x9 ===
/ id race alignment level cost xp upgrade hp
/ ---------------- --------- --------- ----- ---- -- --------------- --
/ "Dark Horse" "" "chaotic" 0i 0i 24 "Black Horse" 0i
/ "White Horse" "" "" 0i 0i 0i "" 50
/ "Sand Scamperer" "" "" 0i 0i 0i "Sand Scuttler" 27
/ "Sand Scuttler" "" "" 0i 0i 0i "" 40
/ "Giant Ant Egg" "monster" "neutral" 0 4 4 "Giant Ant" 32
The last example illustrates the case of NaN-like integer
entries
0i
.
In this case, it corresponds to units for which the CSV file
doesn’t provide a level or cost. We may want to remove rows
for which there are NaNs in the sorting columns so that we
can see the cheapest units by level, discarding rows with no
cost or level data.
rmnan:{(0;..|/nan'x@!p.x)^y} / removes rows with NaNs in columns x from table y
pp (..x@<level)(..x@<cost)rmnan["level cost";t]
/ === Table 303x9 ===
/ id race alignment level cost xp upgrade hp
/ --------------- --------- --------- ----- ---- -- ------------------ --
/ "Giant Ant Egg" "monster" "neutral" 0 4 4 "Giant Ant" 32
/ "Giant Rat" "monster" "chaotic" 0 5 25 "" 15
/ "Mudcrawler" "monster" "neutral" 0 5 18 "Giant Mudcrawler" 20
/ "Ruffian" "human" "chaotic" 0 7 18 "Thug" 16
/ "Giant Ant" "monster" "neutral" 0 7 16 "Soldier Ant" 22
Note how we used in
rmnan
field expression syntax
..
for a non-table usage to concisely define a lambda
projection using the prefixed
p.x
form to refer to variable
x
from the parent function (note how it’s different from the
q.
prefix which would have referred to a global or local
variable). It could have been written too as
{(0;{|/nan'x@!y}[;x])^y}
.
As usual for values in array languages, tables are
immutable, so mutation is done by returning a new table with
modified values or even new columns. Goal provides a
convenient syntax for amending columns of a table, which is
simple syntax sugar for the tetradic amend
@[d;y;F;z]
form but allows to refer to columns directly by name, like
with normal field expressions.
pp t..[id:_id] / lower case id names: same as @[t;"id";_]
/ === Table 307x9 ===
/ id race alignment level cost xp upgrade hp
/ ------------- ------------ --------- ----- ---- --- ----------- --
/ "blood bat" "bats" "chaotic" 1 23 37 "Dread Bat" 27
/ "dread bat" "bats" "chaotic" 2 32 100 "" 33
/ "vampire bat" "bats" "chaotic" 0 13 22 "Blood Bat" 16
/ "boat" "mechanical" "lawful" 1 10 50 "" 1
/ "galleon" "mechanical" "lawful" 1 10 50 "" 1
pp t..[level+:1] / increment all levels by one
/ === Table 307x9 ===
/ id race alignment level cost xp upgrade hp
/ ------------- ------------ --------- ----- ---- --- ----------- --
/ "Blood Bat" "bats" "chaotic" 2 23 37 "Dread Bat" 27
/ "Dread Bat" "bats" "chaotic" 3 32 100 "" 33
/ "Vampire Bat" "bats" "chaotic" 1 13 22 "Blood Bat" 16
/ "Boat" "mechanical" "lawful" 2 10 50 "" 1
/ "Galleon" "mechanical" "lawful" 2 10 50 "" 1
pp t..[n:!#id][;!"n id level alignment"] / add new column, then select & reorder
/ === Table 307x4 ===
/ n id level alignment
/ - ------------- ----- ---------
/ 0 "Blood Bat" 1 "chaotic"
/ 1 "Dread Bat" 2 "chaotic"
/ 2 "Vampire Bat" 0 "chaotic"
/ 3 "Boat" 1 "lawful"
/ 4 "Galleon" 1 "lawful"
We may for example want to compute the mean level, cost, xp
and hp for units in the game, returning them as a simple
flat dict. Note how we use the each
'
adverb to apply a function to each column of a table.
avg:{(+/x)%#x:(nan)^x} / return average of x with NaNs filtered out
fmt.dict[;"%.1f"]@avg'(!"level cost xp hp")#t
/ === Dict (4 keys) ===
/ level| 2.0
/ cost | 36.1
/ xp | 90.5
/ hp | 47.1
It should be noted that those results are not very useful from a gameplay perspective: they’re just a simple aggregation that conflates everything together! More interesting results could be for example the mean HP of units by level, race or alignment, which leads us to the next section, introducing “group by” functionality.
Goal doesn’t have a “grouped table” concept like R’s dplyr, nor a specific DSL for performing “group by” operations on tables. It doesn’t have either “keyed tables”, unlike some K dialects, which can be useful to group some columns by some others. Goal offers instead functionality for performing grouping operations with arrays, inspired from BQN, that can easily be used to express the various kinds of interesting grouping operations on tables.
For example, we might want to compute the average cost and
hp of units by race. In order to do that, we’ll combine
various Goal features, among them filtering (to remove rows
with
""
as race), classify
%X
,
and group by indices
=d
.
Also, as is usual in dataframe packages, we’ll sort the
result according to the “by” column.
pp (..x@<race)((0;..""=race)^t).. ..[
race:&race!¿q.by:%race / classify and keep unique
cost:q.avg'=cost!q.by / average cost of each group by race classification
hp:q.avg'=hp!q.by / same for hp
]
/ === Table 22x3 ===
/ race cost hp
/ ---------- ---- ----
/ "bats" 22.7 25.3
/ "drake" 44.5 62.6
/ "dunefolk" 37.2 50.0
/ "dwarf" 36.8 48.1
/ "elf" 48.0 48.1
The first line just does filtering and sorting in the same
way as previously in this tutorial. Next, we use dict syntax
to produce a table with the three columns we want. As
mentioned in the help, within a field expression, it’s
possible to use the
q.
variable prefix to refer to actual variables instead of
column names. We use this feature here several times,
starting with assigning to
q.by
the classification indices needed for grouping by race and
filtering the race column with a unique mask. This last part
could also have been done too with
race:?race
,
but using the unique mask
¿q.by
is faster, because the already computed
q.by
contains small-range integers, and self-search primitives
like
¿
(also named
firsts
)
are very fast on those inputs. The rest should be quite
self-explanatory: use group by indices
=d
on the cost and hp columns using the grouping indices
q.by
,
and then apply the
avg
function, without forgetting the
q.
prefix and the each
'
adverb, so that the average is computed on each group.
The kind of grouping operation that produces a summary table after applying reducing functions on a set of columns grouped by some others is very common, so it can be convenient to write a user-defined function handling that usage.
/ rby groups by cols k, summarises with reducing f[t;by] using grouping indices
/ by, then sorts result by k.
rby:{[k;f;t]
by:%?["A"~@g:t k;{(1+|/'x)/x}@%'g;g]
(t[&¿by;k],f[t;by]){x@<x y}/|k
}
/ average cost, hp by race
pp rby["race";..(..q.avg'=x!p.y)'..[cost;hp]](0;..""=race)^t
/ === Table 22x3 ===
/ race cost hp
/ ---------- ---- ----
/ "bats" 22.7 25.3
/ "drake" 44.5 62.6
/ "dunefolk" 37.2 50.0
/ "dwarf" 36.8 48.1
/ "elf" 48.0 48.1
/ average cost, hp by level
pp rby["level";..(..q.avg'=x!p.y)'..[cost;hp]]rmnan["level";t]
/ === Table 6x3 ===
/ level cost hp
/ ----- ----- ----
/ 0 9.6 21.5
/ 1 15.6 31.7
/ 2 29.5 47.4
/ 3 57.3 61.6
/ 4 123.3 82.8
/ average cost, hp by level, race
pp rby[!"level race";..(..q.avg'=x!p.y)'..[cost;hp]]rmnan["level";t]
/ === Table 73x4 ===
/ level race cost hp
/ ----- -------- ---- ----
/ 0 "bats" 13.0 16.0
/ 0 "falcon" 12.0 18.0
/ 0 "goblin" 9.0 18.0
/ 0 "human" 8.3 17.3
/ 0 "merman" 9.0 20.0
This
rby
function is actually an extension of our previous example,
handling grouping by several columns (not just one).
Studying it can be an interesting exercise, so we’ll just
give some pointers. The first line
by:%?["A"~@g:t k;{(1+|/'x)/x}@%'g;g]
is similar in spirit to the previous example: we compute
grouping indices
by
,
but we also handle several columns when
t k
returns a generic array. That case simply uses classify
%X
on each column, and merges all those columns into one using
decode
I/
and applying classify again on the result.
The second line uses the unique mask
¿by
to filter the grouping columns
t[&¿by;k]
,
and then merges the summary aggregate columns returned by
f[t;by]
.
The final
{x@<x y}/|k
sorts the result rows by the grouping columns.
In the above examples, we aply the same reducing function
avg
to grouped original columns: this allows for common
aggregation using
(..q.avg'=x!p.y)'
on
..[cost;hp]
.
Note that
p.y
refers to the grouping indices
y
of the parent function, and that
..[cost;hp]
is a shorthand for
..[cost:cost;hp:hp]
.
Sometimes, we may want to perform different computations in the summary or introduce new columns, without actually grouping any columns everytime.
/ number of units and max hp by level
pp rby["level";....[n:=y;hpmax:|/'=hp!y];rmnan["level";t]]
/ === Table 6x3 ===
/ level n hpmax
/ ----- --- -----
/ 0 20 35
/ 1 81 52
/ 2 102 72
/ 3 88 85
/ 4 10 142
Here we use the freq
=I
form on the grouping indices to get the number of elements
in each group, without actually performing any grouping or
operation on existing columns. Also, we introduce a new
hpmax
column to list the maximum HP values, instead of reusing the name
hp
.
We could in a similar way compute an
hpmin
column, for example.
The
lib/table.goal
library provides a similarly flexible
by
function with the same semantics as the above
rby
but without the extra sorting.
The example script files examples/dplyr.goal and examples/lil.goal from Goal’s repository provide a translation of the examples found in R’s dplyr’s introduction and Lil’s manual (respectively). They cover similar uses as the present tutorial but with different datasets. The links above point to html-higlighted versions of the code for reading convenience.
The
lib/table.goal
file provides a few table-related user-defined functions for
grouping, sorting, and some common cases of joins.
You might also be interested in the
Goal’s fmt.tbl function
article by semperos that gives a (slightly outdated)
walk-through of the
fmt.tbl
function that we used for printing tables.