JuliaDB

February 19, 2021

Key Features

  • Has two main data structures: IndexedTable and NDSparse.
  • Can work with data that is larger than the machine RAM.
  • Allows parallel processing with the addprocs function.
  • Works with Dagger.jl’s save and load functions to make a sort of index file adjacent to your table which allows for fast loading and access.

Table

  • “Table” refers to both an IndexedTable and NDSparse
  • Tables store data in columns
  • Tables are typed, meaning changing a table requires returning a new table
  • JuliaDB has few mutating operations because a new table is necessary in most cases

Indexed Table

  • Basically a named tuple of vectors which behaves like a vector of named tuples
  • IndexedTable can be sorted by any number of primary keys (defined using parameter pkey). The table will be sorted by the first priamry key, then the second, and so on.
  • Iterates over NamedTuples of rows
  • Use table function to create or loadtable to load existing data
  • loadtable should only be used once, then the save function should be used to save the table in bianry format to then be quickly loaded using load in the next session. The save and load functions are Dagger.jl fucntions.
x = 1:6
y = vcat(fill('a', 3), fill('b', 3))
z = randn(6);

t = table(x=x, y=y, z=z); pkey = [:x, :y])

Table with 6 rows, 3 columns:
x  y    z
─────────────────
1  'a'  1.06936
2  'a'  -1.29235
3  'a'  -0.725817
4  'b'  -0.422334
5  'b'  0.0246361
6  'b'  0.669536
t[1]

(x = 1, y = 'a', z = 1.069356265804105)
t[1].y

'a'
for item in t
    println(item)
end

(x = 1, y = 'a', z = -0.5856775236297086)
(x = 2, y = 'a', z = -1.8225049388821863)
(x = 3, y = 'a', z = 0.2538693543229162)
(x = 4, y = 'b', z = 3.0769831992975276)
(x = 5, y = 'b', z = 1.0156881097767552)
(x = 6, y = 'b', z = 0.7371713978290413)

NDSparse

  • Behaves like a sparse array with arbitrary indices
  • The keys of an NDSparse array are sorted
  • Use loadndsparse to load existing data
  • Its made for working with data that is sparse over the domain of the index (stock data)
  • Iterates over NamedTuples of values
nd = ndsparse((x=x, y=y), (z=z,))

2-d NDSparse with 6 values (1 field named tuples):
x  y    z
───────┼──────────
1  'a'  1.0548
2  'a'  1.64493
3  'a'  -0.738508
4  'b'  0.325126
5  'b'  0.299526
6  'b'  0.787615
nd[1, 'a']

(z = 1.0548018299672375,)
nd[:, 'a']

1-d NDSparse with 3 values (1 field named tuples):
x  z
──┼──────────
1  1.0548
2  1.64493
3  -0.738508
for item in nd
    println(item)
end

(z = 1.0548018299672375,)
(z = 1.6449349801308328,)
(z = -0.7385076399317578,)
(z = 0.3251263120611191,)
(z = 0.2995262607206383,)
(z = 0.7876150538404173,)

When To Use a Table or NDSparse?

  • NDSparse cases
    • Stocks. First two columns are stock name and date. You will often want to know the closing price of a particular stock on a particular day. In a table, you would need to query where the apple stock has that particular date. In an NDSparse, it is just getting the index with apple and that date.

API

select

Selects particular data and optionally applies function on it.

select(t, (:x, :z))

Table with 6 rows, 2 columns:
x  z
─────────────
1  1.37942
2  0.525809
3  0.754992
4  -0.0998402
5  -1.43908
6  1.96262
select(t, (:x, :z) => row -> row.x > row.z)

6-element Array{Bool,1}:
 0
 1
 1
 1
 1
 1

reduce

Applies a function pairwise. Its good for functions with the associative property, meaning order doesnt matter. For a table that is four rows long, reduce(+, t) is the same as:

out = +(row1, row2)

out = +(out, row3)

out = +(out, row4)

Which will get the sum of the four rows.

More on JuliaDB API.

map

Map function. map and select have overlapping functionality.

map(row -> row.x > row.z, t)

6-element Array{Bool,1}:
 0
 1
 1
 1
 1
 1

columns

Gives you a NamedTuple of Vectors where each name is the column name and each vector is the column data.

columns(t)

(x = [1, 2, 3, 4, 5, 6], y = ['a', 'a', 'a', 'b', 'b', 'b'], z = [1.5812131489064223, -0.5545552318128536, -0.5490833373597503, 0.291760065744581, -1.411530524868723, -1.449557749618286])

rows

Gvies you a StructArray which is basically a Vector of NamedTuples. Each row is one NamedTuple.

rows(t)[1]

(x = 1, y = 'a', z = 1.5812131489064223)

summarize

Applies a function (or functions) column-wise.

summarize(function, table, by: select)

addprocs

Adds as many workers as there are CPU cores available. addprocs(2) would add two working processes. When there are multiple processes, tables will be loaded as distributed tables across all the workers. So oyu cant index a row based on row number or iterate over all rows.

Statistics

JuliaDB integrates with OnlineStats (a julia stats package) using the functions reduce and groupreduce.

julia > using JuliaDB, OnlineStats
julia > t = table(1:100, rand(Bool, 100), randn(100));
julia > reduce(Mean(), t; select = 3)

Mean: n=100 | value=0.159962

General Wisdom

  • Be as specific as possible when selecting to minimize the amount of data you are passing around.
  • After loading a table for the first time, use Dagger.jl’s save and load functions to work with the table.