Relation for Excel 2.3

by Matthias Bürcher @ 2016-2020 Version 2.3

Freeware

This macro provides functions to make simple relational algebra.
Unlike other Excel solutions, this one is purely functional, not using macros.
Relations are saved as text in one cell with double colon (::) as field separator and space+newline as row separator. The first line are the column names.
Note that in a cell, the text cannot be more than 32k characters.

Download

relation-for-excel-230.zip

The zip file contains:

  • relation-for-excel-230.xlam: The add-in
  • relation-for-excel-230.xlsm: The macro sheet, from that the add-in was exported
  • relation-for-excel-230-examples.xlsm: An Excel sheet with examples

This presentation gives you a good start to understand relational algebra and its use with these functions
relation-for-excel-200-presentation.pdf

Github: https://github.com/bellenuit/relation-for-excel

Relation for Google Sheets is about to be developped.

Installation

You can either copy the module in every sheet or you can install the add-in.

Copy relation-for-excel-230.xlam into the add-in folder of Excel. The location of this folder changes from Excel version to Excel version, on my version it is
/Applications/Microsoft Office 2016/Office/Add-ins

Launch Excel and enable the add-in. On my version, this is the menu Tools:Add-Ins

Open the file relation-for-excel-230-examples.xlsm. The file has the module included, so it works also without the enabled add-in.

To test, if the add-in works, you have two possibilities:

  • The menu Insert:Function should show all functions when you type "rel"
  • The cell D26 with the formula =relRange(C13:D19) should show this result

code::name
1::Sony
2::Creative Labs
3::Hewlett Packard
4::Iomega
5::Fujitsu
6::Winchester

Note that you need to copy the add-in on all Excel installations you use the functions or you will see the error #NAME?# in the cell that use the functions.

Introduction

Many people use Excel as a database because nothing else is available or allowed in the company they work. They collect data, which is often relational, but Excel cannot express the relations within the worksheet. Excel provides advanced filtering and pivot tables, but as these tools are interesting to explore data, they have no persistence: If you change a filter or a sort order, you loose the old one. It is also very difficult to build relations between data. This leads to duplicate data entries and many repetitive hand work.

What Excel is missing is a kind of query language like SQL to explore data and to display the result. Google introduced a query function in Sheets, which is already a step in the right direction. There are also some people who have developed macros that compile SQL queries to Excel formulas and then apply them to the data. The problem of these approaches is that queries are not updated when data changes.

The problem is also conceptual. The result of a SQL query is not a single value, but a table of values. Functions normally calculate only a single value they return to a single cell. You can have a common formula for a range of cells, but things gets quickly complex because you may not know the number of rows and columns in advance.

Relation for Excel goes a different approach in many ways:

  • The solution is purely functional, so results are updated immediately if data is.
  • The solution compiles the entire table in one text that can take place in a single cell. This makes it easy to handle intermediate results when you build up complex queries.
  • The solution does not use SQL but basic relational algebra. The relational algebra can express anything than SQL can, but it integrates more easily as an Excel function.

This is the deal:

  • The relation is created from a range in form of a table in a cell: tuples (rows) are separated by space+newlines, properties (columns) by double colon (::).
  • You apply all relational algebra operators select, project, extend, union, difference, join. We have also non relational operators order, limit and rotate.
  • If you want, you can then retrieve the results again in cells and do whatever you want with it for further calculations.

There is a limitation:

  • The result of an Excel function cannot be more than 32K characters. This might make it not usable for big tables.
  • The separators double colon and newlines cannot be used inside a value
  • Column names must start with a letter, have at least to characters and cannot only have letters and numbers in it.
  • As for now, no column name should be part of another column name. id1 and id2 are fine together, but id and id1 not.

You have to know:

  • Relational algebra works with sets and therefore has no duplicates and also no null values. It may have empty values however.
  • Tuples (rows) in a relation have no particular order. We added the relOrder function because sometimes humans would like to view the tuples in a particular order.

Relation for Excel does simplify the relational model in two ways:

  • Properties (columns) in tuples are named, but the columns can also be adressed by position.
  • Properties are not typed. Excel works with text and numbers, you define only the type when you write the expression or condition. You can however in the result of relFilter explicitely return a number or a text.

I hope you can live with that. The reason for the simplification was to keep the syntax simple.

Imagine you have the SQL query

SELECT country, SUM(sales)
FROM manufacturers
WHERE manufacturer
LIKE 'S%'
GROUP BY country

If the data is in the range A2:C30 (manufacturer,country,sales) then the query would be expressed like

=relOrder(relProject(relSelect(relRange(A2:C30),"relLike($manufactuer,""S*""")"country::sales SUM"),"country A")

or you can also write in one function

=relFilter(A2:C30,"S relLike($manufacturer,""S*"")","P country::sales SUM","O country A")

But this is only half of the story. The power of Relation for Excel is in the fact that you do not have to write one big formula in one cell. You can split it up and each cell is a view with intermediate results you can inspect and use for further operations.

Explore the example workbook, where we have sheets with a presentation and exercises computer store, employees, warehouse, pieces and providers, accounts.

Usage

You can work two ways:

  • Use the basic functions for each operator and build up your query one by one
  • Use the relFilter() function that creates a stack calculator and applies all oprators in one formula

The basics is to convert a range of cells into a relation with relRange(), then make all operations on the relations and then either display the result multiline in a cell or recover the results in individual cells with relCell() or relCellArray().

Relations have more than one line in a cell. To display the relation, you can format the cell: Menu Format:Cells Tab Alignment set Wrap Text and also Text alignment vertical "top". You need also to adjust the size of the row to display all lines.

RelRange(rn ar Range)


Creates a relation from a Range. The result is a text where rows are separated with newlines and columns with ::

Exemple

- A B
1 id manufacturer
2 1 Sony
3 2 Creative Labs
4 3 Hewlett Packard
5 3 Iomega
6 5 Fujitsu
7 6 Winchester
8 id::manufacturer
1::Sony
2::Creative Labs
3::Hewlett Packard
4::Iomega
5::Fujitsu
6::Winchester

A8 has the formula =relRange(A1:B7)

relCell(rel as String, row as integer, col as Integer, Optional Numeric As Boolean = False, Optional noError As Boolean = False)


Gets a cell in a relation.
One based row and column.

Exemple:

relCell(A8,3;2) "Hewlett Packard"

relCellArray(rel as String, Optional noHeader As Boolean = False)


There is an alternative syntax, where you can enter the formula for an entire range with control-shift-enter (adding curly braces) and suppressing errors.

relSelect(rel as String, condition as String)

Filters a relation by a condition, which can be any Excel formula. The formula can use values from the current tuple: columns are selected by name or position and preceded by dollar sign if used as string and a percent sign if used as number. You may have to double quote strings inside the formula and use the comma as parameter separator inside Excel functions.

Exemples:

relSelect(A8,"%id<3") id::manufacturer
1::Sony
2::Creative Labs
relSelect(A8, "LEFT($manufacturer,1)=""F""") id::manufacturer
5::Fujitsu

relProject(rel as String, list as String)

Filters columns based on list of names or columns. The list is separated by double colon(::). Columns are 1-based. You can also aggregate on columns using the aggregators SUM COUNT AVG MIN MAX. As relations work on sets, the operator project can also work as group operator.

Exemples:

relProject(A8, "id") id
1
2
3
4
5
6
relProject(A8, "id SUM) id_sum
21

relExtend(rel as String, expression as String, Optional ByVal name As String)


You can add columns with calculations based on existing columns. The expression can by any Excel formula. Columns are selected by name or position and preceded by dollar sign, if used as string and a percent sign, if used as number. You may have to double quote strings inside the formula.

Example:

relExtend(A8, "%1*2","double") id::manufacturer::double
1::Sony::2
2::Creative Labs::4
3::Hewlett Packard::6
4::Iomega::8
5::Fujitsu::10
6::Winchester::12

relRename(rel as String, list As String)


You can rename columns. This is most needed before joins. You can rename multiple columns separated by double colons ::.

Example:

relRename(A8, "id manid") manid::manufacturer:
1::Sony
2::Creative Labs
3::Hewlett Packard
4::Iomega
5::Fujitsu
6::Winchester

relUnion(rel1 as String, rel2 as String)


Union checks for arity, common columns and eliminates duplicates.

Example:

relUnion(relSelect(A8,"%id4")) id::manufacturer
1::Sony
5::Fujitsu
6::Winchester

relDifference(rel1 as String, rel2 as String)


Difference checks for arity.

Example:

relDifference(A8,relSelect(A8,"%1id>4")) id::manufacturer
1::Sony
2::Creative Labs
3::Hewlett Packard
4::Iomega

relIntersect(rel1 as String, rel2 as String)


Intersection checks for arity.

relJoin(rel1 as String, rel2 as String,condition as string)


Join can be a theta join (specific condition) or a special join based on common column names: Use the as condition the keywords NATURAL, LEFT, RIGHT, OUTER, LEFTSEMI, RIGHTSEMI, LEFTANTISEMI, RIGHTANTISEMI.
For cross product set condition to true.
Self join is also possible.

Example:

- A B
10 id product
11 1 Walkman
12 1 Playstation
13 3 Printer
14 relJoin(A7,Range(A10:B13),"%1=%3") id1::manufacturer::id2::product
1::Sony::1::Walkman
1::Sony::1::Playstation
3::Hewlett Packard::3::Printer
15 relJoin(A7,Range(A10:B13),"NATURAL") id::manufacturer::product
1::Sony::Walkman
1::Sony::Playstation
3::Hewlett Packard::Printer

relOrder(rl as String, list as String)


This is not really relational algebra, as sets have no order, but it is still useful.
List separated by ::, each item is name or column 1-based, space and an order operator
A alpha, Z alpha reverse, 9 number top down, 1 number bottom up.

Example:

relOrder(A8 "manufacturer A") id::manufacturer
2::Creative Labs
5::Fujitsu
3::Hewlett Packard
4::Iomega
1::Sony
6::Winchester

relLimit(rel as String, start as Integer, limit as Integer)


This is not really relational algebra, but useful. Start and limit are 1-based. Limit -1 does not limit.

Example:

relLimit(relOrder(A8,"manufacturer A"),2,3) id::manufacturer
3::Hewlett Packard
4::Iomega
1::Sony

relLike(s as String, pattern as String)


Exposes the VBA LIKE operator to Excel.
Like is not full regex, but has some limited syntax that might be useful for select and join
Operators: ? (any once), * (any zero or more), # (number), [] list [!] exclude list.
Returns true if pattern is matched, so you can use it in relSelect.
relSelect("relLike($2,"*a*,A7) 2::Creative Labs
3::Hewlett Packard
4::Iomega

relRotate(rel as String)


Rotate is not really relational algebra either, but useful, too.
relRotate(A8) col1::col2::col3::col4::col5::col6
Sony::Creative Labs::Hewlett Packard::Iomega:Fujitsu::Winchester

relFilter(paramArray list)


relFilter is a shortcut to push further the limit of the result (VBA has no 32k Limit, Excel has). With some experience, relFilter makes also more readable formulas then the operator functions. We recommend to use relFilter whenever possible.

A parameter can be

  • range (when it has at least 2 cells) is considered as relation
  • any text of more than one line is considered as relation
  • any text with one line but not a space on the second left position is considered as a relation (actually an empty relation with only the header
  • selection "S condition"
  • projection "P list"
  • extension "E expression"
  • rename "R list"
  • union "U"
  • difference "D"
  • intersection "I"
  • join "J condition"
  • rotate "Q"
  • order "O list"
  • limit "L start limit"
  • single value (cell 1,1) "C"
  • single value (cell 1,1) explicitely as string "K"
  • single value (cell 1,1) explicitely as number "Z"
  • comment "#"
  • cut for debugging "!"

relFilter works as a stack machine, add operators and relations like in RPN notation.

Example:

relFilter(C14:D19;"S %id<4"; "P manufacturer"; "O manufacturer A";"L 1 2")

Tip: Give the original tables names so that the formulas are more readable

Example:

relFilter(Products;"S %id<4"; "P manufacturer"; "O manufacturer A";"L 1 2")

relFixpoint(ByVal rel As String, fixpoint As String, ByVal start As String, connect As String)

With relFixpoint() you can walk through tree or a graph of a relation.
This is a recursive union of joins. The fixpoint column will have the initial value start and then get each time the connect column, as long as there are new values.
The column order is forced to the initial one.
See the example file.

relAssert(ByVal rel As String, ByVal constraint As String, ByVal expression As String)

Can be used to assure consistency of th data. Possible asserts

  • ALL expression
  • EXISTS expression
  • UNIQUE expression
  • COLUMNS expression

The function returns true or the error with the tuple.

Current limitations


Text size 32k and therefore a result can not have more than 32k characters or you have an error.
This means typically about 350 rows with 12 columns.
relFilter allows you to make more in one step, as VBA has not the limit, only the Excel worksheet.
Peformance not tested, though it looks decent even with a lot of data.
A select statement can return no tuples, but has still properties. Such an empty relation is further treated as the empty relation with no properties at all.
Code is still subject of refactoring, though it is very stable and used in production.

History


20.8.2020 Relation for Excel 2.3.00
  • Hash join improves dramatically performance of natural join.
  • Hash for duplicate elimination improves performance.

4.4.2017 Relation for Excel 2.0.0

  • Complete rewrite on named relations.

15.8.2016 Relation for Excel 1.1.2

  • relCell and relRow do not return error any more, but empty strings, when out of range

15.6.2016 Relation for Excel 1.1.1

  • relProject gives no results for sum and count even if there is no row
  • relParseExpression: In expressions with more than 10 columns, you do not need to preceed any more the 1-9 columns with a 0 (substution starts top, so there is no ambiguity)

10.5.2016 Relation for Excel 1.1

  • Fixed several small bugs
  • Redefined swFilter as unlimited parameters

20.4. Relation for Excel 1.0