Sortable Semantic Version Strings in Rails (part 1)

Example of converting a Semantic Version string to a sortable integer in Ruby

Sometimes we need to sort string values that aren’t lexicographically sortable. For example, a list of Semantic Version strings like 1.0.0-rc1, 1.2.4, 1.1.0-beta and 2.0.0.

This post describes how these strings can be encoded as a unique integer value that is sortable, using a technique called bit packing.

  • Part 1 explains the concept and the constraints of this method
  • Part 2 describes how to implement this in ActiveRecord
  • Part 3 (TODO) describes an alternative approach that is SQL driven

Defining the problem

Semantic Version strings aren’t sortable using lexicographical sorting methods. So, a version string like 1.11.0 should come after a version string 1.9.0 when ordered using Semantic Versioning, but would come before it in an ascending lexicographical sort.

To see this problem in action, try running the following Ruby code in a console:

%w[ 1.10.0 1.9.0 2.4.1 0.123.0 4.0.0-rc1 ].sort
# => ["0.123.0", "1.10.0", "1.9.0", "2.4.1", "4.0.0-rc1"]

Version 1.9.0 should come before 1.10.0 in this list (because 9 is less than 10 😵‍💫). But comes after, because the strings are compared character by character until a difference appears, and so 1 is compared to 9 and found to be lower.

Sorting records by SemanticVersion as part of a SQL SELECT can be tricky, as well as slower, because we have to apply various functions on each row to split the strings into their component parts (or do a WHERE ... IN (?)) with a set of preselected version numbers.

But what if there was a way to convert our Semantic Version strings into a format that was both sortable and lightning-fast?!

Bit Packing

You probably know that natural numbers can be represented in binary form, and conversely, that binary numbers can be represented as natural numbers. But what if we could store non-numerical information—our Semantic Version string—in an encoded binary form, and then represent that as a unique integer that was sortable by its Semantic Version precedence?

Here’s a simple example to demonstrate the concept…

We can represent integers from zero to 255 using 8 bits of binary. For example, the number 12 can be represented as: 00001100 (8 + 4 = 12)

Bit place value 128 64 32 16 8 4 2 1
State 0 0 0 0 1 1 0 0

This simple example should hopefully be straightforward—we can store natural numbers as binary. But we’re only using 4 bits of data here, not all 8. Database Integers are typically up to 32-bits, and BigInt types are 64-bits long. What if we were to use more of the available bits as places to store additional information?

Let’s look at a slightly more complicated example. The numbers 15 and 12 could separately be represented as the following two 4-bit binary representations 1111 and 1100:

Bit place value 128 64 32 16 8 4 2 1
State 0 0 0 0 1 1 1 1
Bit place value 128 64 32 16 8 4 2 1
State 0 0 0 0 1 1 0 0

But we could join these two separate 4 bit binary values into single 8-bit binary value:

Bit place value 128 64 32 16 8 4 2 1
State 1 1 1 1 1 1 0 0

Here, we’ve represented the numbers 15 and 12 as the binary number 11111100, which is just 1111 and 1100 concatenated together. The natural number value this binary represents is not 1512 (15 and 12), but 252.

In other words, we have found a way to represent our two numbers in a format that is more compact (252 has fewer digits than 1512 would), is unique to this particular combination of values, and can lexicographically sorted and indexed.

Knowing how these values were stored, we can also decode this value, converting the number 252 back into the binary 11111100; splitting it in two to give 1111 and 1100, and converting these back to integers 15 and 12. In other words, the process is reversible.

Can we use this method to store a whole Semantic Version string as an integer? (hint: yes we can!).

Components of a Semantic Version String

The Semantic Versioning Specification (SemVer) is fairly flexible, and allows for a variety of valid version string formats. The majority of use cases, though, can be boiled down to the following components:

MAJOR.MINOR.PATCH-PRERELEASE+BUILD

  • Major version (required). Typically a natural number greater than or equal to zero. Usually, most software projects will have a major version between 0 and 10, but some will go much higher.
  • Minor version (required). Typically a natural number greater than or equal to zero. It’s not uncommon for a major version of a software to have dozens of minor releases.
  • Patch version (required). Typically a natural number greater than or equal to zero. It’s not uncommon for a minor version of a software to have dozens of minor releases.
  • Prerelease label (optional). Something like “alpha”, “beta”, or “rc”. Can often be followed by a prerelease version number.
  • Prerelease version (optional). Typically, a natural number greater than zero; commonly lower than 10. Follows the prerelease label, and indicates the nth version for that prerelease type (e.g. beta.2).
  • Build metadata (optional). The metadata for the current build. Can hold a variety of alphanumeric data. MUST be ignored when determining version precedence.

So, in the case of a version string: 1.0.2-beta.3+foobar, we can say the major version is 1, the minor version is 0, the patch version is 2, the prerelease label is beta and the prerelease version is 3, and the build metadata is foobar (but we don’t care about build metadata).

Note that major, minor, and patch are required for a valid SemVer value, whereas everything after that is optional.

Encoding SemVer strings as unique integers

All of this is very interesting, but how can we pack a long string of mixed numbers, text, and symbols, as an incremental integer?

The trick here is to designate specific bit places (or columns) of our 32-bit integer data to specific components of our SemVer string—just like we did with the numbers 15 and 12 above. To do this, we have to agree to some constraints on the version information we allow ourselves to store.

For example, let’s agree that we’re not likely to release more than 100 major versions of our application in its lifespan? In that case, we can say that the highest major version we will likely store is 99. In binary, 99 is represented as 1100011. So, we need at least 7 bits to represent the major component of our SemVer string (7 bits actually allows gives us up to 127 major versions).

We will likely release more minor versions, so let’s allow ourselves at least up to 999 minor versions per major release. 999 is represented as the binary value 1111100111, and so we’ll need at least 10 bits to represent the minor component of our SemVer string.

We can repeat this for patch versions, and allow another 10 bits to represent our patch version number.

So far, we could represent any MAJOR.MINOR.PATCH version string using 32 bits like so:

(this table is scrollable)

Bit place 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
Represents MAJOR MAJOR MAJOR MAJOR MAJOR MAJOR MAJOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH unused unused unused unused unused

and still have 5 bits left to spare (we’ll use those later).

So for example, version 8.1.4 could be broken into its constituent parts: 8, 1, and 4.

These are represented in binary as: 1000, 1, and 100, respectively.

ℹ️ You can prove this to yourself by writing 8.to_s(2) in a Ruby console:

8.to_s(2)
# => "1000"

Since we decided above that we would allocate 7 bits to the major version, and 10 bits each to the minor and patch versions, we would represent these values fully as 0001000, 0000000001, and 0000000100.

Concatenate them all together, leaving the last 5 bits unset, we have the 32 bit binary value: 0001000 0000000001 0000000100 00000.

Binary is typically written in 8 bit groups called bytes, so we can adjust the spacing on this to 00010000 00000000 10000000 10000000.

ALLLLLL of this, is to say that: we can represent the SemVer number 8.1.4 with the binary value 00010000 00000000 10000000 10000000, which is equal to the integer 268468352. This number is unique to this SemVer value.

ℹ️ You can prove this to yourself by writing 0b00010000_00000000_10000000_10000000 in a Ruby console:

0b00010000_00000000_10000000_10000000
# => 268468352

So far, so good?


Including the prerelease type

Remember the unused bits in the previous section? We’re going to put them to good use now…

Since SemVer includes the option to have prereleases, such as alpha1 or rc.2, we ought to add some support for these too. To do so, we have to again commit ourselves to some reasonable constraints.

In this case, the constraints are that we should have a limited number of prerelease types that we may use: say (alpha, beta, rc) and assume that everything else is a release release type. With these four types agreed upon, we can represent each of them using just two bits of binary data.

alpha
This comes first in order precedence, and should be represented as 00
beta
This comes second in order precedence, and should be represented as 01
rc
This comes third in order precedence, and should be represented as 10
release
This comes last in order precedence, and should be represented as 11

Now, we have a two-bit representation for the prerelease type component of our SemVer string that will preserve the proper order precedence when we include it in our long binary number. Let’s update that number, using two of the unused bits from our 32 bit binary number in the previous section.

Bit place 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
Represents MAJOR MAJOR MAJOR MAJOR MAJOR MAJOR MAJOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PRERELEASE TYPE PRERELEASE TYPE unused unused unused

Now, we could represent the following versions by the these binary values (I’ve added parentheses around the bits that have been changed):

SemVer Binary Integer
8.1.4-alpha 00010000_00000000_10000000_100(00)000 268468352
8.1.4-beta 00010000_00000000_10000000_100(01)000 268468360
8.1.4-rc 00010000_00000000_10000000_100(10)000 268468368
8.1.4 00010000_00000000_10000000_100(11)000 268468376

Storing the prerelease version

We have three unused bits of our 32 bit integer left, and we still don’t have a way of supporting prerelease version numbers (like the 3 in 1.0.0-rc.3). We can store this information in the same way we have stored the major, minor, and patch versions—again, if we accept some reasonable constraints.

Since we have three bits of space left, we can support up to seven numbered prerelease version numbers per prerelease type (alpha7, beta7, or rc7). That’s probably plenty. Encoding these as binary is relatively straightforward, as we don’t have to do anything other than convert the prerelease version number to binary and occupy the first three bits.

Our final mapping of bits to values should look like this:

(again, the table is scrollable)

Bit place 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
Represents MAJOR MAJOR MAJOR MAJOR MAJOR MAJOR MAJOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR MINOR PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PATCH PRERELEASE TYPE PRERELEASE TYPE PRERELEASE VERSION PRERELEASE VERSION PRERELEASE VERSION

Let’s look at a complete example of how valid SemVer strings would look as binary and integers using this system (I’ve added parentheses around the bits that have been changed):

SemVer Binary Integer
8.1.4-alpha 00010000_00000000_10000000_10000(000) 268468352
8.1.4-alpha.1 00010000_00000000_10000000_10000(001) 268468353
8.1.4-alpha.2 00010000_00000000_10000000_10000(010) 268468354
8.1.4-alpha.3 00010000_00000000_10000000_10000(011) 268468355

Note that, in this case, the integer values go up sequentially by 1 each time.

Summary

Let’s summarise what we just did.

We took a 32-bit binary value, and separated it into smaller chunks that represent the various components of a valid SemVer string.
Reading from left to right, the first 7 bits represent the major, the next 10 bits represent the minor, and the next 10 the patch. The following 2 bits represent the prerelease type, and the last 3 bits represent the prerelease version number.

To spare you from taking off your socks, 7 + 10 + 10 + 2 + 3 = 32 bits.

A version string like 52.123.12-beta.2 can be broken into:

  • 52 => 0110100
  • 123 => 0001111011
  • 12 => 0000001100
  • beta => 01
  • 2 => 010

Which can all be concatenated together as 0110100 0001111011 0000001100 01 010.

This is the binary representation for 1748861322.

52.123.12-beta.2 can be represented as 1748861322.

In the next section, we’ll look at how we can implement this in an ActiveRecord model.

Written by

Photo of Gavin Morrice
Gavin Morrice

Software engineer based in Scotland

Connect with me