Working with tables

Goal is a small language and hence doesn’t come with as many specialized features for dataframe manipulation as languages like R. Still, Goal is well-suited for many typical dataframe handling problems, without requiring to learn how to use any new abstractions or DSLs: most of the functionality is there in the core language, without extra magic. Its small size and fast startup-time can also be an advantage in typical short-running scripts.

To visualize the tables in the various examples, we’ll use for convenience the fmt.tbl[t;r;c;f] formatting function available at lib/fmt.goal in the language’s repository, which displays at most r rows and c columns from a table t, using format string f for numbers.

/ NOTE: replace with actual path on your system: we assume
/ here we're in the "examples" directory in Goal's repository.
import"../lib/fmt.goal"
/ pp[t] pretty prints dict as table, assuming string keys, limiting results to
/ at most 5 rows and 8 columns, using "%.1f" to format floats.
pp:fmt.tbl[;5;8;"%.1f"]

Recall from the help that import can either take an absolute or relative path, or take a library name with no extension, for example import"fmt", and search for a file "fmt.goal" at the location provided by the GOALLIB environment variable (if set).

Also, note how pp is defined using projection syntax to derive a new function from fmt.tbl with some arguments fixed.

6.2 Reading CSV data into a table #

In this chapter, we’ll use as data the CSV file found at examples/data/wesnoth-units.csv in Goal’s repository, that contains (simplified) game data for most units from the FOSS game Wesnoth (for version 1.18).

t:{(*'x)!1_'x}@csv 'read"data/wesnoth-units.csv" / read csv file into t
t:+_+t                                           / lowercase column names
t:@[t;!"level cost xp hp movement";"i"$]         / parse integer columns

The first line reads the CSV data into a list of columns using the csv verb, and we then extract the header with keys for the table and keep the rest as values. Note that Goal represents tables as dicts: the keys are the names of the columns, and the values contain the list of data columns. We also use the ' return early on error syntactic sugar to propagate any IO errors returned by read.

The next line lowercases all column names for simplicity. We use the monadic form of the + verb to exchange keys and values so that we can apply _ on the column names, and then swap again: this handy manipulation is possible because tables in Goal are actually just dicts, which are essentially a pair of key and value arrays.

pp t
/ === Table 307x9 ===
/            id         race alignment level cost  xp     upgrade hp
/ ------------- ------------ --------- ----- ---- --- ----------- --
/   "Blood Bat"       "bats" "chaotic"     1   23  37 "Dread Bat" 27
/   "Dread Bat"       "bats" "chaotic"     2   32 100          "" 33
/ "Vampire Bat"       "bats" "chaotic"     0   13  22 "Blood Bat" 16
/        "Boat" "mechanical"  "lawful"     1   10  50          ""  1
/     "Galleon" "mechanical"  "lawful"     1   10  50          ""  1

As you can see, unlike in R, primitives in Goal have a simple scope and are designed to be combined with other primitives. They do not have their own mini-language with lots of options configuring all sorts of desired behaviors that could be useful in a specific task like CSV parsing: you use a primitive for reading the file, another for parsing the data, another for performing renames, another for parsing numbers, and so on. This combinatoric, idiom-based, Unix-like approach shows clearly each do-one-thing-and-do-it-well array operation without hiding everything behind a single black box abstraction. This style is something that most array languages share to some degree.

6.3 Filtering and indexing #

A fundamental operation on tables is filtering, like with dplyr’s filter verb or the WHERE clause in SQL. Filtering on tables can be done easily in Goal using the X#t dyadic form. The left argument is a list of filters. The first applies to the keys (playing the role of select), and the rest applies to the rows in succession. Note that this means order can have performance implications.

The filters are functions returning booleans, as in the f#Y array filtering form. The key selection filter and the first row filter can also be given directly as booleans; moreover, the former can also consist of a list of column names.

For example selecting all the rows having "goblin" in the race column can be done as follows:

pp(1;..race="goblin")#t / rows where race is "goblin"
/ === Table 3x9 ===
/                id     race alignment level cost xp          upgrade hp
/ ----------------- -------- --------- ----- ---- -- ---------------- --
/  "Goblin Impaler" "goblin" "chaotic"     1   13 50               "" 26
/   "Goblin Rouser" "goblin" "chaotic"     1   13 50               "" 31
/ "Goblin Spearman" "goblin" "chaotic"     0    9 18 "Goblin Impaler" 18
pp(rx/p$/!:;..race="goblin")#t / same but only columns with name ending in p
/ === Table 3x2 ===
/ xp hp
/ -- --
/ 50 26
/ 50 31
/ 18 18

In the first example, the true value 1 tells the filtering verb to keep all columns, while ..race="goblin" is a function described using field expression syntax, equivalent to {x["race"]="goblin"}, and is used to filter rows. The field expression syntax is handy when manipulating tables, because it allows to refer to columns directly by name, as if they were variables. The second example is the same but uses a regexp to filter column names.

pp(1;..race="human";..alignment="chaotic")#t
/ === Table 17x9 ===
/         id    race alignment level cost  xp      upgrade hp
/ ---------- ------- --------- ----- ---- --- ------------ --
/   "Outlaw" "human" "chaotic"     2   24  71   "Fugitive" 47
/ "Assassin" "human" "chaotic"     3   46 150           "" 51
/   "Bandit" "human" "chaotic"     2   22  61 "Highwayman" 50
/  "Footpad" "human" "chaotic"     1   14  35     "Outlaw" 30
/ "Fugitive" "human" "chaotic"     3   55 150           "" 68

Because column names are simply the keys of a dictionary, while the column values are its array of values, it’s easy to do location-based selections of both columns and rows:

pp 3#t / select first 3 columns
/ === Table 307x3 ===
/            id         race alignment
/ ------------- ------------ ---------
/   "Blood Bat"       "bats" "chaotic"
/   "Dread Bat"       "bats" "chaotic"
/ "Vampire Bat"       "bats" "chaotic"
/        "Boat" "mechanical"  "lawful"
/     "Galleon" "mechanical"  "lawful"
krng:{[t;x;y](k?x)_(1+k?y)#k:!t} / inclusive key range from x to y
pp krng[t;"alignment";"cost"]#t / select range of columns
/ === Table 307x3 ===
/ alignment level cost
/ --------- ----- ----
/ "chaotic"     1   23
/ "chaotic"     2   32
/ "chaotic"     0   13
/  "lawful"     1   10
/  "lawful"     1   10

Similar to filtering, row indexing can be used to select a subset of rows or columns. This uses a special Goal feature that allows table-like indexing for dicts with string keys when indexing first by row. Unlike filtering, indexing also allows to reorder columns as specified.

pp t[10+!5] / select from row index 10 to 14
/ === Table 5x9 ===
/               id    race alignment level cost  xp          upgrade hp
/ ---------------- ------- --------- ----- ---- --- ---------------- --
/   "Drake Burner" "drake"  "lawful"     1   21  43     "Fire Drake" 42
/  "Drake Clasher" "drake"  "lawful"     1   19  41 "Drake Thrasher" 43
/ "Drake Enforcer" "drake"  "lawful"     3   58 150               "" 85
/  "Drake Fighter" "drake"  "lawful"     1   17  41  "Drake Warrior" 39
/     "Fire Drake" "drake"  "lawful"     2   35  84  "Inferno Drake" 63
pp t[-5?#*t] / randomly select 5 distinct rows
/ === Table 5x9 ===
/               id      race alignment level cost  xp           upgrade hp
/ ---------------- --------- --------- ----- ---- --- ----------------- --
/      "Giant Rat" "monster" "chaotic"     0    5  25                "" 15
/        "Nibbler" "monster" "neutral"     0   15  26          "Caribe" 28
/ "Gryphon Master" "gryphon" "neutral"     2   40 100                "" 51
/  "Royal Warrior"   "human"  "lawful"     3   47 150                "" 64
/ "Saurian Oracle"  "lizard" "chaotic"     2   28  74 "Saurian Prophet" 29
pp t[!-5;"hp""id"] / select last 5 rows of columns "hp" "id" (reorder)
/ === Table 5x2 ===
/ hp             id
/ -- --------------
/ 80 "Ancient Wose"
/ 64   "Elder Wose"
/ 52         "Wose"
/ 26 "Wose Sapling"
/ 56  "Wose Shaman"
pp -5#'"hp""id"#t / same using take on each column and "keep keys" (keep order)
/ === Table 5x2 ===
/             id hp
/ -------------- --
/ "Ancient Wose" 80
/   "Elder Wose" 64
/         "Wose" 52
/ "Wose Sapling" 26
/  "Wose Shaman" 56

6.4 Sorting rows #

Sorting rows so that some column is in ascending or descending order can be done easily:

pp t@>t..level / sort by level in descending order
/ === Table 307x9 ===
/                 id       race alignment level cost  xp upgrade  hp
/ ------------------ ---------- --------- ----- ---- --- ------- ---
/       "Elder Mage"    "human" "neutral"     5   90 250      ""  60
/      "Fire Dragon"  "monster" "chaotic"     5  100 250      "" 101
/ "Armageddon Drake"    "drake"  "lawful"     4  118 200      ""  98
/     "Dune Paragon" "dunefolk"  "lawful"     4   78 200      ""  82
/     "Wyvern Rider"  "monster" "liminal"     4  102 200      ""  85

Note how the verb > is used to return sorting indices. This method is flexible, but unlike dataframe libraries in R or python, or SQL queries, we refer here twice to the table t, because we are simply using the indexing verb @ that can take inputs unrelated to the table. Field expressions make it easy to write order-by functions for tables suitable for chaining:

pp (..x@<level)(..x@<cost)t / ascending order by level and cost
/ === Table 307x9 ===
/               id      race alignment level cost xp         upgrade hp
/ ---------------- --------- --------- ----- ---- -- --------------- --
/     "Dark Horse"        "" "chaotic"    0i   0i 24   "Black Horse" 0i
/    "White Horse"        ""        ""    0i   0i 0i              "" 50
/ "Sand Scamperer"        ""        ""    0i   0i 0i "Sand Scuttler" 27
/  "Sand Scuttler"        ""        ""    0i   0i 0i              "" 40
/  "Giant Ant Egg" "monster" "neutral"     0    4  4     "Giant Ant" 32

The last example illustrates the case of NaN-like integer entries 0i. In this case, it corresponds to units for which the CSV file doesn’t provide a level or cost. We may want to remove rows for which there are NaNs in the sorting columns so that we can see the cheapest units by level, discarding rows with no cost or level data.

rmnan:{(0;..|/nan'x@!p.x)^y} / removes rows with NaNs in columns x from table y
pp (..x@<level)(..x@<cost)rmnan["level cost";t]
/ === Table 303x9 ===
/              id      race alignment level cost xp            upgrade hp
/ --------------- --------- --------- ----- ---- -- ------------------ --
/ "Giant Ant Egg" "monster" "neutral"     0    4  4        "Giant Ant" 32
/     "Giant Rat" "monster" "chaotic"     0    5 25                 "" 15
/    "Mudcrawler" "monster" "neutral"     0    5 18 "Giant Mudcrawler" 20
/       "Ruffian"   "human" "chaotic"     0    7 18             "Thug" 16
/     "Giant Ant" "monster" "neutral"     0    7 16      "Soldier Ant" 22

Note how we used in rmnan field expression syntax .. for a non-table usage to concisely define a lambda projection using the prefixed p.x form to refer to variable x from the parent function (note how it’s different from the q. prefix which would have referred to a global or local variable). It could have been written too as {(0;{|/nan'x@!y}[;x])^y}.

6.5 Amending values #

As usual for values in array languages, tables are immutable, so mutation is done by returning a new table with modified values or even new columns. Goal provides a convenient syntax for amending columns of a table, which is simple syntax sugar for the tetradic amend @[d;y;F;z] form but allows to refer to columns directly by name, like with normal field expressions.

pp t..[id:_id] / lower case id names: same as @[t;"id";_]
/ === Table 307x9 ===
/            id         race alignment level cost  xp     upgrade hp
/ ------------- ------------ --------- ----- ---- --- ----------- --
/   "blood bat"       "bats" "chaotic"     1   23  37 "Dread Bat" 27
/   "dread bat"       "bats" "chaotic"     2   32 100          "" 33
/ "vampire bat"       "bats" "chaotic"     0   13  22 "Blood Bat" 16
/        "boat" "mechanical"  "lawful"     1   10  50          ""  1
/     "galleon" "mechanical"  "lawful"     1   10  50          ""  1
pp t..[level+:1] / increment all levels by one
/ === Table 307x9 ===
/            id         race alignment level cost  xp     upgrade hp
/ ------------- ------------ --------- ----- ---- --- ----------- --
/   "Blood Bat"       "bats" "chaotic"     2   23  37 "Dread Bat" 27
/   "Dread Bat"       "bats" "chaotic"     3   32 100          "" 33
/ "Vampire Bat"       "bats" "chaotic"     1   13  22 "Blood Bat" 16
/        "Boat" "mechanical"  "lawful"     2   10  50          ""  1
/     "Galleon" "mechanical"  "lawful"     2   10  50          ""  1
pp t..[n:!#id][;!"n id level alignment"] / add new column, then select & reorder
/ === Table 307x4 ===
/ n            id level alignment
/ - ------------- ----- ---------
/ 0   "Blood Bat"     1 "chaotic"
/ 1   "Dread Bat"     2 "chaotic"
/ 2 "Vampire Bat"     0 "chaotic"
/ 3        "Boat"     1  "lawful"
/ 4     "Galleon"     1  "lawful"

6.6 Aggregations #

We may for example want to compute the mean level, cost, xp and hp for units in the game, returning them as a simple flat dict. Note how we use the each ' adverb to apply a function to each column of a table.

avg:{(+/x)%#x:(nan)^x} / return average of x with NaNs filtered out
fmt.dict[;"%.1f"]@avg'(!"level cost xp hp")#t
/ === Dict (4 keys) ===
/ level| 2.0
/ cost | 36.1
/ xp   | 90.5
/ hp   | 47.1

It should be noted that those results are not very useful from a gameplay perspective: they’re just a simple aggregation that conflates everything together! More interesting results could be for example the mean HP of units by level, race or alignment, which leads us to the next section, introducing “group by” functionality.

6.7 Group by #

Goal doesn’t have a “grouped table” concept like R’s dplyr, nor a specific DSL for performing “group by” operations on tables. It doesn’t have either “keyed tables”, unlike some K dialects, which can be useful to group some columns by some others. Goal offers instead functionality for performing grouping operations with arrays, inspired from BQN, that can easily be used to express the various kinds of interesting grouping operations on tables.

For example, we might want to compute the average cost and hp of units by race. In order to do that, we’ll combine various Goal features, among them filtering (to remove rows with "" as race), classify %X, and group by indices =d. Also, as is usual in dataframe packages, we’ll sort the result according to the “by” column.

pp (..x@<race)((0;..""=race)^t)..  ..[
  race:&race!¿q.by:%race / classify and keep unique
  cost:q.avg'=cost!q.by  / average cost of each group by race classification
  hp:q.avg'=hp!q.by      / same for hp
]
/ === Table 22x3 ===
/       race cost   hp
/ ---------- ---- ----
/     "bats" 22.7 25.3
/    "drake" 44.5 62.6
/ "dunefolk" 37.2 50.0
/    "dwarf" 36.8 48.1
/      "elf" 48.0 48.1

The first line just does filtering and sorting in the same way as previously in this tutorial. Next, we use dict syntax to produce a table with the three columns we want. As mentioned in the help, within a field expression, it’s possible to use the q. variable prefix to refer to actual variables instead of column names. We use this feature here several times, starting with assigning to q.by the classification indices needed for grouping by race and filtering the race column with a unique mask. This last part could also have been done too with race:?race, but using the unique mask ¿q.by is faster, because the already computed q.by contains small-range integers, and self-search primitives like ¿ (also named firsts) are very fast on those inputs. The rest should be quite self-explanatory: use group by indices =d on the cost and hp columns using the grouping indices q.by, and then apply the avg function, without forgetting the q. prefix and the each ' adverb, so that the average is computed on each group.

The kind of grouping operation that produces a summary table after applying reducing functions on a set of columns grouped by some others is very common, so it can be convenient to write a user-defined function handling that usage.

/ rby groups by cols k, summarises with reducing f[t;by] using grouping indices
/ by, then sorts result by k.
rby:{[k;f;t]
  by:%?["A"~@g:t k;{(1+|/'x)/x}@%'g;g]
  (t[&¿by;k],f[t;by]){x@<x y}/|k
}

/ average cost, hp by race
pp rby["race";..(..q.avg'=x!p.y)'..[cost;hp]](0;..""=race)^t
/ === Table 22x3 ===
/       race cost   hp
/ ---------- ---- ----
/     "bats" 22.7 25.3
/    "drake" 44.5 62.6
/ "dunefolk" 37.2 50.0
/    "dwarf" 36.8 48.1
/      "elf" 48.0 48.1

/ average cost, hp by level
pp rby["level";..(..q.avg'=x!p.y)'..[cost;hp]]rmnan["level";t]
/ === Table 6x3 ===
/ level  cost   hp
/ ----- ----- ----
/     0   9.6 21.5
/     1  15.6 31.7
/     2  29.5 47.4
/     3  57.3 61.6
/     4 123.3 82.8

/ average cost, hp by level, race
pp rby[!"level race";..(..q.avg'=x!p.y)'..[cost;hp]]rmnan["level";t]
/ === Table 73x4 ===
/ level     race cost   hp
/ ----- -------- ---- ----
/     0   "bats" 13.0 16.0
/     0 "falcon" 12.0 18.0
/     0 "goblin"  9.0 18.0
/     0  "human"  8.3 17.3
/     0 "merman"  9.0 20.0

This rby function is actually an extension of our previous example, handling grouping by several columns (not just one). Studying it can be an interesting exercise, so we’ll just give some pointers. The first line

  by:%?["A"~@g:t k;{(1+|/'x)/x}@%'g;g]

is similar in spirit to the previous example: we compute grouping indices by, but we also handle several columns when t k returns a generic array. That case simply uses classify %X on each column, and merges all those columns into one using decode I/ and applying classify again on the result.

The second line uses the unique mask ¿by to filter the grouping columns t[&¿by;k], and then merges the summary aggregate columns returned by f[t;by]. The final {x@<x y}/|k sorts the result rows by the grouping columns.

In the above examples, we aply the same reducing function avg to grouped original columns: this allows for common aggregation using (..q.avg'=x!p.y)' on ..[cost;hp]. Note that p.y refers to the grouping indices y of the parent function, and that ..[cost;hp] is a shorthand for ..[cost:cost;hp:hp].

Sometimes, we may want to perform different computations in the summary or introduce new columns, without actually grouping any columns everytime.

/ number of units and max hp by level
pp rby["level";....[n:=y;hpmax:|/'=hp!y];rmnan["level";t]]
/ === Table 6x3 ===
/ level   n hpmax
/ ----- --- -----
/     0  20    35
/     1  81    52
/     2 102    72
/     3  88    85
/     4  10   142

Here we use the freq =I form on the grouping indices to get the number of elements in each group, without actually performing any grouping or operation on existing columns. Also, we introduce a new hpmax column to list the maximum HP values, instead of reusing the name hp. We could in a similar way compute an hpmin column, for example.

The lib/table.goal library provides a similarly flexible by function with the same semantics as the above rby but without the extra sorting.

6.8 Learn more #

The example script files examples/dplyr.goal and examples/lil.goal from Goal’s repository provide a translation of the examples found in R’s dplyr’s introduction and Lil’s manual (respectively). They cover similar uses as the present tutorial but with different datasets. The links above point to html-higlighted versions of the code for reading convenience.

The lib/table.goal file provides a few table-related user-defined functions for grouping, sorting, and some common cases of joins.

You might also be interested in the Goal’s fmt.tbl function article by semperos that gives a (slightly outdated) walk-through of the fmt.tbl function that we used for printing tables.

6 Working with tables #

6.1 Introduction #