Modeling semi-structured data in Rails
Relational databases are very powerful. Their power comes from their ability to...
- Preserve data integrity with a predefined schema.
- Model complex relationships through joins.
But sometimes, we stumble across data that doesn't fit the relational model. We call this kind of data semi-structured data.
When this happens, the things that make relational databases powerful are the very things that get in our way, and they complicate our model instead of simplifying it.
That's why document databases exist: to model and store semi-structured data. However, if we choose a document database, we lose all the power of a relational database.
Luckily for us, relational databases like Postgres and MySQL now have good JSON support. So most of us won't need a document database like MongoDB, as it would be overkill. Most of the time, we only need to denormalize some parts of our model, so it makes more sense to use simple JSON columns for those parts instead of going all-in and dumping our beloved relational database for MongoDB.
In Rails, we can have full control over how our JSON data is stored in and retrieved from the database by using the Attributes API to serialize and deserialize it. So let's see how we can model semi-structured data in a more convenient way.
Let's say that we are building an app to help libraries build and manage an online catalog. When we're browsing through a catalog, we often see item information formatted like this:
Author: Shakespeare, William, 1564-1616.
Title: Hamlet / William Shakespeare.
Description: xiii, 295 pages : illustrations ; 23 cm.
Series: NTC Shakespeare series.
Local Call No: 822.33 S52 S7
ISBN: 0844257443
Series Entry: NTC Shakespeare series.
But in the library world, data is produced and exchanged in this form:
LDR 00815nam 2200289 a 4500
001 ocm30152659
003 OCoLC
005 19971028235910.0
008 940909t19941994ilua 000 0 eng
010 $a92060871
020 $a0844257443
040 $aDLC$cDLC$dBKL$dUtOrBLW
049 $aBKLA
099 $a822.33$aS52$aS7
100 1 $aShakespeare, William,$d1564-1616.
245 10$aHamlet /$cWilliam Shakespeare.
264 1$aLincolnwood, Ill. :$bNTC Pub. Group,$c[1994]
264 4$c©1994.
300 $axiii, 295 pages :$billustrations ;$c23 cm.
336 $atext$btxt$2rdacontent.
337 $aunmediated$bn$2rdamedia.
338 $avolume$bnc$2rdacarrier.
490 1 $aNTC Shakespeare series.
830 0$aNTC Shakespeare series.
907 $a.b108930609
948 $aLTI 2018-07-09
948 $aMARS
This is what we call a MARC (Machine-Readable Cataloging) record. That's how libraries describe the resources they own.
As you can see, it's really verbose! That's because in the library world, resources are described very precisely, in order to be "machine-readable".
For convenience, developers usually represent MARC data in JSON:
{
"leader": "00815nam 2200289 a 4500",
"fields": [
{ "tag": "001", "value": "ocm30152659" },
{ "tag": "003", "value": "OCoLC" },
{ "tag": "005", "value": "19971028235910.0" },
{ "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
{ "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
{ "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
{ "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" } ] },
{ "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
{ "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" } ] },
{ "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." } ] },
{ "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." } ] },
{ "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" } ] },
{ "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
{ "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." } ] },
{ "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." } ] },
{ "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." } ] },
{ "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." } ] },
{ "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
{ "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
{ "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
{ "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
{ "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
]
}
By looking at this JSON representation, we can see that the data is...
- Nested: A MARC record contains many fields, and most of them contain multiple subfields.
- Dynamic: Some fields are repeatable ("264" and "948"), and so are subfields. The first fields have neither subfields nor indicators (they're called control fields).
- Encapsulated: The meaning of a subfield depends on the field it's in (take a look at the "a" subfield for example).
All those characteristics can be grouped under one name: semi-structured data.
Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. - Wikipedia
A perfect example of that is an HTML document. An HTML document contains different types of tags which can be nested in multiple ways. It wouldn't make sense to model HTML documents with tables and columns. Imagine having to access nested tags through joins, considering that we could potentially have hundreds of them in a single HTML document. That's why we usually store this kind of data in a text field.
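Coming back to our MARC record, the three characteristics listed earlier can be checked directly on the JSON. Here's a minimal plain-Ruby sketch (no Rails needed) that works on a trimmed-down copy of the record; the variable names are just illustrative.

```ruby
require "json"

# A trimmed-down copy of the record: two control fields, the title
# field, and the two repeated 264 fields.
record = JSON.parse(<<~JSON)
  {
    "leader": "00815nam 2200289 a 4500",
    "fields": [
      { "tag": "001", "value": "ocm30152659" },
      { "tag": "003", "value": "OCoLC" },
      { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." }] },
      { "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" }] },
      { "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] }
    ]
  }
JSON

fields = record["fields"]

# Dynamic: control fields carry a plain value and no subfields.
control_tags = fields.reject { |field| field.key?("subfields") }.map { |field| field["tag"] }
# => ["001", "003"]

# Dynamic: some tags appear more than once.
repeated_tags = fields.map { |field| field["tag"] }.tally.select { |_tag, count| count > 1 }.keys
# => ["264"]

# Nested and encapsulated: the "a" subfield lives inside a field,
# and it means something different in each one.
a_subfields = fields.flat_map do |field|
  (field["subfields"] || [])
    .select { |subfield| subfield["code"] == "a" }
    .map { |subfield| [field["tag"], subfield["value"]] }
end
# => [["245", "Hamlet"], ["264", "Lincolnwood, Ill. :"]]
```

In "245" the "a" subfield is a title, while in "264" it's a place of publication, which is exactly why a flat `subfields` table would be awkward to query.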
In our case, we're using JSON to represent MARC data. Luckily for us, we can store JSON data directly in relational databases like Postgres or MySQL:
# config/initializers/inflections.rb
ActiveSupport::Inflector.inflections(:en) do |inflect|
  inflect.acronym "MARC"
end
$ rails g model marc/record leader:string fields:json
$ rails db:migrate
We can then create a MARC record like this:
MARC::Record.create leader: "00815nam 2200289 a 4500", fields: [
{ "tag": "001", "value": "ocm30152659" },
{ "tag": "003", "value": "OCoLC" },
{ "tag": "005", "value": "19971028235910.0" },
{ "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
{ "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
{ "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
{ "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" } ] },
{ "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
{ "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" } ] },
{ "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." } ] },
{ "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." } ] },
{ "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" } ] },
{ "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
{ "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." } ] },
{ "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." } ] },
{ "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." } ] },
{ "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." } ] },
{ "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
{ "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
{ "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
{ "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
{ "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
]
And access it this way:
record = MARC::Record.first
field = record.fields.find { |field| field["tag"] == "245" }
subfield = field["subfields"].first
subfield["value"]
=> "Hamlet"
It works, but...
- It's not very convenient to access nested data this way.
- We cannot easily attach logic to our JSON data without polluting our model.
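To make those two points concrete, here's a rough plain-Ruby sketch of what we end up writing when the data stays as raw hashes. `subfield_value` is a hypothetical helper, not something Rails provides; it only exists to show how much of the hash layout every caller has to know.

```ruby
fields = [
  { "tag" => "001", "value" => "ocm30152659" },
  { "tag" => "245", "subfields" => [{ "code" => "a", "value" => "Hamlet" }] }
]

# Hypothetical helper: it has to guard against missing fields,
# control fields without subfields, and missing subfields.
def subfield_value(fields, tag, code)
  field = fields.find { |f| f["tag"] == tag }
  subfield = field&.fetch("subfields", [])&.find { |sf| sf["code"] == code }
  subfield&.fetch("value")
end

title = subfield_value(fields, "245", "a")   # => "Hamlet"
missing = subfield_value(fields, "999", "a") # => nil
control = subfield_value(fields, "001", "a") # => nil
```

Helpers like this tend to pile up in the model or in a module next to it, because the hashes themselves have nowhere to hold behaviour.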
What if we could interact with our JSON data the same way we do with ActiveRecord associations? Enter ActiveModel and the Attributes API!
First, we have to define a custom type which...
- Maps JSON objects to ActiveModel-compliant objects.
- Handles collections.
To do that, we'll add the following options to our type:
- :class_name: The class name of an ActiveModel-compliant object.
- :collection: Specifies whether the attribute is a collection. Defaults to false.
class DocumentType < ActiveModel::Type::Value
  attr_reader :document_class, :collection

  def initialize(class_name:, collection: false)
    @document_class = class_name.constantize
    @collection = collection
  end

  # Turns raw attributes (hashes) into document objects
  def cast(value)
    return if value.nil?

    if collection
      value.map { |attributes| process attributes }
    else
      process value
    end
  end

  # Values that are already documents are passed through untouched
  def process(value)
    value.is_a?(document_class) ? value : document_class.new(value)
  end

  def serialize(value)
    value.to_json
  end

  # A NULL column has nothing to decode
  def deserialize(json)
    return if json.nil?

    value = ActiveSupport::JSON.decode(json)
    cast value
  end

  # Track changes
  def changed_in_place?(old_value, new_value)
    deserialize(old_value) != new_value
  end
end
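To see the cast / serialize / deserialize round trip on its own, here's a plain-Ruby imitation of the type above. It's only a sketch: `SketchDocumentType` and `SketchSubfield` are made-up names, a `Struct` stands in for the ActiveModel-compliant class, and nothing here touches the real Rails API.

```ruby
require "json"

# Stand-in for an ActiveModel-compliant class: built from a hash of
# attributes and able to dump itself back to JSON.
SketchSubfield = Struct.new(:code, :value, keyword_init: true) do
  def to_json(*args)
    to_h.to_json(*args)
  end
end

# Plain-Ruby imitation of the custom type (not the Rails API itself).
class SketchDocumentType
  def initialize(document_class:, collection: false)
    @document_class = document_class
    @collection = collection
  end

  # Raw input (hashes) -> objects.
  def cast(value)
    return if value.nil?

    @collection ? value.map { |attributes| build(attributes) } : build(value)
  end

  # Objects -> JSON string for the database.
  def serialize(value)
    value.to_json
  end

  # JSON string from the database -> objects.
  def deserialize(json)
    return if json.nil?

    cast(JSON.parse(json))
  end

  private

  def build(attributes)
    @document_class.new(**attributes.transform_keys(&:to_sym))
  end
end

type = SketchDocumentType.new(document_class: SketchSubfield, collection: true)

subfields = type.cast([{ "code" => "a", "value" => "Hamlet" }])
json = type.serialize(subfields)  # => '[{"code":"a","value":"Hamlet"}]'
restored = type.deserialize(json) # => equal to subfields
```

The round trip only works because the objects know how to dump themselves to JSON and can be rebuilt from a plain hash, which is what the ActiveModel modules give us in the real type.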
Let's register our type, since we're going to use it multiple times:
# config/initializers/type.rb
ActiveModel::Type.register(:document, DocumentType)
ActiveRecord::Type.register(:document, DocumentType)
Now we can use it in our models:
class MARC::Record < ApplicationRecord
  attribute :fields, :document,
    class_name: "MARC::Record::Field",
    collection: true

  def at(tag)
    fields.find { |field| field.tag == tag }
  end
end

class MARC::Record::Field
  include ActiveModel::Model
  include ActiveModel::Attributes
  include ActiveModel::Serializers::JSON

  attribute :tag, :string
  attribute :value, :string
  attribute :indicator1, :string
  attribute :indicator2, :string
  attribute :subfields, :document,
    class_name: "MARC::Record::Field::Subfield",
    collection: true

  # Control fields don't have subfields
  def attributes
    if control_field?
      {
        "tag" => tag,
        "value" => value
      }
    else
      {
        "tag" => tag,
        "indicator1" => indicator1,
        "indicator2" => indicator2,
        "subfields" => subfields
      }
    end
  end

  # Control fields use the 00X tags
  def control_field?
    /\A00\d\z/.match?(tag)
  end

  def at(code)
    subfields.find { |subfield| subfield.code == code }
  end

  alias [] at

  # Used to track changes
  def ==(other)
    attributes == other.attributes
  end
end

class MARC::Record::Field::Subfield
  include ActiveModel::Model
  include ActiveModel::Attributes
  include ActiveModel::Serializers::JSON

  attribute :code, :string
  attribute :value, :string

  # Used to track changes
  def ==(other)
    attributes == other.attributes
  end
end
Let's test this in the console:
record.at("245")["a"].value
=> "Hamlet"
record.changed?
=> false
record.at("245")["a"].value = "Romeo and Juliet"
record.at("245")["a"].value
=> "Romeo and Juliet"
record.changed?
=> true
Et voilà! Home-made associations!
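The reason `record.changed?` flips to `true` after an in-place edit is the value-based `==` we defined on our documents. Here's a small plain-Ruby sketch of the same comparison, assuming a `Struct` (which already compares by value) in place of the real classes; `TrackedSubfield`, `load_subfields` and `in_place_change?` are illustrative names.

```ruby
require "json"

# Structs compare member by member, like our custom == does.
TrackedSubfield = Struct.new(:code, :value, keyword_init: true)

# What the database row held when the record was loaded.
raw_old_value = '[{"code":"a","value":"Hamlet"}]'

def load_subfields(json)
  JSON.parse(json).map do |attrs|
    TrackedSubfield.new(code: attrs["code"], value: attrs["value"])
  end
end

# Same idea as changed_in_place?: rebuild the old objects from the raw
# value and compare them with the ones currently in memory.
def in_place_change?(raw_old_value, new_value)
  load_subfields(raw_old_value) != new_value
end

current = load_subfields(raw_old_value)
before = in_place_change?(raw_old_value, current) # => false

current.first.value = "Romeo and Juliet"
after = in_place_change?(raw_old_value, current)  # => true
```

Without a value-based `==`, two freshly built objects would never be equal, and every loaded record would look dirty.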
Luckily, you won't need to implement this yourself, as the activemodel-embedding gem does it for you (and even more).
Here's how we can simplify our models:
class MARC::Record < ApplicationRecord
  include ActiveModel::Embedding::Associations

  embeds_many :fields
  # ...
end

class MARC::Record::Field
  include ActiveModel::Embedding::Document
  # ...
  embeds_many :subfields
  # ...
end

class MARC::Record::Field::Subfield
  include ActiveModel::Embedding::Document
  # ...
end
We can then code our views with nested attributes support out-of-the-box:
# app/views/marc/records/_form.html.erb
<%= form_with model: @record do |record_form| %>
  <% @record.fields.each do |field| %>
    <%= record_form.fields_for :fields, field do |field_fields| %>
      <%= field_fields.label :tag %>
      <%= field_fields.text_field :tag %>

      <% if field.control_field? %>
        <%= field_fields.text_field :value %>
      <% else %>
        <%= field_fields.text_field :indicator1 %>
        <%= field_fields.text_field :indicator2 %>

        <%= field_fields.fields_for :subfields do |subfield_fields| %>
          <%= subfield_fields.label :code %>
          <%= subfield_fields.text_field :code %>
          <%= subfield_fields.text_field :value %>
        <% end %>
      <% end %>
    <% end %>
  <% end %>

  <%= record_form.submit %>
<% end %>
We can even use validations:
class MARC::Record < ApplicationRecord
  # ...
  validates :fields, presence: true
  validates_associated :fields
end

class MARC::Record::Field
  # ...
  validates :subfields, presence: true, unless: :control_field?
  validates_associated :subfields, unless: :control_field?
end

class MARC::Record::Field::Subfield
  # ...
  validates_presence_of :code, :value
end
record = MARC::Record.new
record.valid?
=> false
record.fields = [{ tag: "245" }]
record.valid?
=> false
record.at("245").subfields = [{ code: "a", value: "Ruby on Rails" }]
record.valid?
=> true
We can use custom collections if we need to add custom behaviour:
class MARC::Record::FieldCollection
  include ActiveModel::Embedding::Collecting
  include Enumerable

  def at(tag)
    find { |field| field.tag == tag }
  end

  def repeated?(field)
    # ...
  end
  # ...
end

class MARC::Record < ApplicationRecord
  include ActiveModel::Embedding::Associations

  embeds_many :fields, collection: "FieldCollection"

  delegate :at, :repeated?, to: :fields
  # ...
end
record = MARC::Record.first
record.at("245")["a"].value
=> "Hamlet"
record.repeated?("245")
=> false
record.repeated?("264")
=> true
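If you're curious what such a collection could look like without the gem, here's a plain-Ruby sketch built on `Enumerable` alone. Everything in it is an assumption made for illustration: `FieldCollectionSketch` and `SketchField` are made-up names, and the body of `repeated?` is one possible reading of the method elided above (a tag is repeated when more than one field carries it).

```ruby
# Minimal stand-in for a field: only the tag matters here.
SketchField = Struct.new(:tag, keyword_init: true)

class FieldCollectionSketch
  include Enumerable

  def initialize(fields)
    @fields = fields
  end

  # Enumerable only needs #each; find, count, map, etc. come for free.
  def each(&block)
    @fields.each(&block)
  end

  def at(tag)
    find { |field| field.tag == tag }
  end

  # Assumed behaviour: true when more than one field has this tag.
  def repeated?(tag)
    count { |field| field.tag == tag } > 1
  end
end

fields = FieldCollectionSketch.new(
  %w[245 264 264].map { |tag| SketchField.new(tag: tag) }
)

fields.at("245").tag    # => "245"
fields.repeated?("245") # => false
fields.repeated?("264") # => true
```

Keeping this behaviour on the collection rather than on the record means the model only has to delegate to it.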
We can use custom types if we need to cast the elements of a collection:
class MARC::Record::FieldType < ActiveModel::Type::Value
  def cast(value)
    # ...
  end
end

class MARC::Record < ApplicationRecord
  include ActiveModel::Embedding::Associations

  embeds_many :fields, cast_type: "FieldType"
  # ...
end
So the next time you need to model semi-structured data in your Rails application...
- Give this gem a try!
- Or use the Attributes API.