Modeling semi-structured data in Rails

Relational databases are very powerful. Their power comes from their ability to...
  • Preserve data integrity with a predefined schema.
  • Make complex relationships through joins.
    But sometimes we stumble across data that doesn't fit the relational model. We call this kind of data semi-structured data.
    When this happens, the things that make relational databases powerful are the things that get in our way and complicate our model instead of simplifying it.
    That's why document databases exist: to model and store semi-structured data. However, if we choose a document database, we lose all the power of a relational database.
    Luckily for us, relational databases like Postgres and MySQL now have good JSON support. So most of us won't need a document database like MongoDB, which would be overkill. Most of the time, we only need to denormalize some parts of our model, so it makes more sense to use simple JSON columns for those parts instead of going all-in and dumping our beloved relational database for MongoDB.
    Currently in Rails, we have full control over how our JSON data is stored in and retrieved from the database by using the Attributes API to serialize and deserialize it. So let's see how we can model semi-structured data in a more convenient way.
    Use case: Dealing with bibliographic data
    Let's say that we are building an app to help libraries build and manage an online catalog. When we're browsing through a catalog, we often see item information formatted like this:
    Author:        Shakespeare, William, 1564-1616.
    Title:         Hamlet / William Shakespeare.
    Description:   xiii, 295 pages : illustrations ; 23 cm.
    Series:        NTC Shakespeare series.
    Local Call No: 822.33 S52 S7
    ISBN:          0844257443
    Series Entry:  NTC Shakespeare series.
    But in the library world, data is produced and exchanged in this form:
    LDR 00815nam  2200289 a 4500
    001 ocm30152659
    003 OCoLC
    005 19971028235910.0
    008 940909t19941994ilua          000 0 eng
    010   $a92060871
    020   $a0844257443
    040   $aDLC$cDLC$dBKL$dUtOrBLW
    049   $aBKLA
    099   $a822.33$aS52$aS7
    100 1 $aShakespeare, William,$d1564-1616.
    245 10$aHamlet /$cWilliam Shakespeare.
    264  1$aLincolnwood, Ill. :$bNTC Pub. Group,$c[1994]
    264  4$c©1994.
    300   $axiii, 295 pages :$billustrations ;$c23 cm.
    336   $atext$btxt$2rdacontent.
    337   $aunmediated$bn$2rdamedia.
    338   $avolume$bnc$2rdacarrier.
    490 1 $aNTC Shakespeare series.
    830  0$aNTC Shakespeare series.
    907   $a.b108930609
    948   $aLTI 2018-07-09
    948   $aMARS
    This is what we call a MARC (Machine-Readable Cataloging) record. That's how libraries describe the resources they own.
    As you can see, that's really verbose! That's because in the library world, resources are described very precisely, in order to be "machine-readable".
    For convenience, developers usually represent MARC data in JSON:
    {
      "leader": "00815nam 2200289 a 4500",
      "fields": [
        { "tag": "001", "value": "ocm30152659" },
        { "tag": "003", "value": "OCoLC" },
        { "tag": "005", "value": "19971028235910.0" },
        { "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
        { "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
        { "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
        { "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" } ] },
        { "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
        { "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" } ] },
        { "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." } ] },
        { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." } ] },
        { "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" } ] },
        { "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
        { "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." } ] },
        { "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." } ] },
        { "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." } ] },
        { "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." } ] },
        { "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
        { "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
        { "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
        { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
        { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
      ]
    }
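To make the mapping between the two notations concrete, here is a minimal sketch in plain Ruby that turns one line of the textual listing into the JSON shape above. The method name `parse_marc_line` is hypothetical, and the column layout (tag in the first three characters, then a space, then two indicator characters) is an assumption read off the listing shown here; it is not taken from any MARC library.

```ruby
require "json"

# Hypothetical helper: parse one line of the textual listing into a Hash
# shaped like the JSON fields above.
def parse_marc_line(line)
  tag  = line[0, 3]
  rest = line[4..] || ""

  # Control fields (001 to 009) carry a bare value: no indicators, no subfields.
  return { "tag" => tag, "value" => rest } if tag.match?(/\A00\d\z/)

  # Data fields: two indicator characters, then "$"-delimited subfields.
  subfields = rest[2..].to_s.split("$").reject(&:empty?).map do |chunk|
    { "code" => chunk[0], "value" => chunk[1..] }
  end

  {
    "tag"        => tag,
    "indicator1" => rest[0],
    "indicator2" => rest[1],
    "subfields"  => subfields
  }
end

puts parse_marc_line('245 10$aHamlet /$cWilliam Shakespeare.').to_json
puts parse_marc_line("001 ocm30152659").to_json
```

Note that this sketch keeps the punctuation exactly as it appears in the listing; any cleanup of the values would be a separate step.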
    By looking at this JSON representation, we can see that the data is...
  • Nested: A MARC record contains many fields, and most of them contain multiple subfields.
  • Dynamic: Some fields are repeatable ("264" and "948"), and so are subfields. The first fields have neither subfields nor indicators (they're called control fields).
  • Encapsulated: The meaning of a subfield depends on the field it's in (take a look at the "a" subfield, for example).
    All those characteristics can be grouped into what we call semi-structured data.

    Semi-structured data is a form of structured data that does not obey the tabular structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure. - Wikipedia

    A perfect example of that is an HTML document. An HTML document contains different types of tags, which can be nested in multiple ways. It wouldn't make sense to model HTML documents with tables and columns. Imagine having to access nested tags through joins, considering that a single HTML document can contain hundreds of them. That's why we usually store this kind of data in a text field.
    In our case, we're using JSON to represent MARC data. Luckily for us, we can store JSON data directly in relational databases like Postgres or MySQL:
    # config/initializers/inflections.rb
    ActiveSupport::Inflector.inflections(:en) do |inflect|
      inflect.acronym "MARC"
    end
    $ rails g model marc/record leader:string fields:json
    $ rails db:migrate
    We can then create a MARC record like this:
    MARC::Record.create leader: "00815nam 2200289 a 4500", fields: [
      { "tag": "001", "value": "ocm30152659" },
      { "tag": "003", "value": "OCoLC" },
      { "tag": "005", "value": "19971028235910.0" },
      { "tag": "008", "value": "940909t19941994ilua 000 0 eng " },
      { "tag": "010", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "92060871" }] },
      { "tag": "020", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "0844257443" }] },
      { "tag": "040", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "DLC" }, { "code": "c", "value": "DLC" }, { "code": "d", "value": "BKL" }, { "code": "d", "value": "UtOrBLW" } ] },
      { "tag": "049", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "BKLA" }] },
      { "tag": "099", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "822.33" }, { "code": "a", "value": "S52" }, { "code": "a", "value": "S7" } ] },
      { "tag": "100", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "Shakespeare, William," }, { "code": "d", "value": "1564-1616." } ] },
      { "tag": "245", "indicator1": "1", "indicator2": "0", "subfields": [{ "code": "a", "value": "Hamlet" }, { "code": "c", "value": "William Shakespeare." } ] },
      { "tag": "264", "indicator1": " ", "indicator2": "1", "subfields": [{ "code": "a", "value": "Lincolnwood, Ill. :" }, { "code": "b", "value": "NTC Pub. Group," }, { "code": "c", "value": "[1994]" } ] },
      { "tag": "264", "indicator1": " ", "indicator2": "4", "subfields": [{ "code": "c", "value": "©1994." }] },
      { "tag": "300", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "xiii, 295 pages :" }, { "code": "b", "value": "illustrations ;" }, { "code": "c", "value": "23 cm." } ] },
      { "tag": "336", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "text" }, { "code": "b", "value": "txt" }, { "code": "2", "value": "rdacontent." } ] },
      { "tag": "337", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "unmediated" }, { "code": "b", "value": "n" }, { "code": "2", "value": "rdamedia." } ] },
      { "tag": "338", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "volume" }, { "code": "b", "value": "nc" }, { "code": "2", "value": "rdacarrier." } ] },
      { "tag": "490", "indicator1": "1", "indicator2": " ", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
      { "tag": "830", "indicator1": " ", "indicator2": "0", "subfields": [{ "code": "a", "value": "NTC Shakespeare series." }] },
      { "tag": "907", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": ".b108930609" }] },
      { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "LTI 2018-07-09" }] },
      { "tag": "948", "indicator1": " ", "indicator2": " ", "subfields": [{ "code": "a", "value": "MARS" }] }
    ]
    And access it this way:
    record = MARC::Record.first
    field = record.fields.find { |field| field["tag"] == "245" }
    subfield = field["subfields"].first
    subfield["value"]
    => "Hamlet"
    It works, but...
  • It's not very convenient to access nested data this way.
  • We cannot easily attach logic to our JSON data without polluting our model.
    What if we could interact with our JSON data the same way we do with ActiveRecord associations? Enter ActiveModel and the Attributes API!
    First, we have to define a custom type which...
  • Maps JSON objects to ActiveModel-compliant objects.
  • Handles collections.
    To do that, we'll add the following options to our type:
  • :class_name: The class name of an ActiveModel-compliant object.
  • :collection: Specifies whether the attribute is a collection. Defaults to false.
    class DocumentType < ActiveModel::Type::Value
      attr_reader :document_class, :collection
    
      def initialize(class_name:, collection: false)
        @document_class = class_name.constantize
        @collection     = collection
      end
    
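      def cast(value)
        # Nothing to build for NULL columns or unset attributes
        return if value.nil?
    
        if collection
          value.map { |attributes| process attributes }
        else
          process value
        end
      end
    
      # Keep existing documents as-is, build the others from their attributes
      def process(value)
        value.is_a?(document_class) ? value : document_class.new(value)
      end
    
      def serialize(value)
        value.to_json
      end
    
      def deserialize(json)
        return if json.nil?
    
        value = ActiveSupport::JSON.decode(json)
    
        cast value
      end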
    
      # Track changes
      def changed_in_place?(old_value, new_value)
        deserialize(old_value) != new_value
      end
    end
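To see the cast/serialize/deserialize contract in isolation, here is a dependency-free sketch of the same round trip. It does not use the Attributes API: `PlainDocumentType` and `PlainSubfield` are hypothetical stand-ins built on `Struct` and the `json` standard library, and they only imitate the idea that raw JSON comes out of the database, objects live in memory, and change tracking compares the two.

```ruby
require "json"

# Hypothetical stand-in for an ActiveModel-compliant document class.
PlainSubfield = Struct.new(:code, :value, keyword_init: true) do
  def to_json(*args)
    to_h.to_json(*args)
  end
end

# Dependency-free imitation of the type's contract (not the Rails API).
class PlainDocumentType
  def initialize(document_class, collection: false)
    @document_class = document_class
    @collection     = collection
  end

  # Raw attributes (hashes) in, document objects out.
  def cast(value)
    return if value.nil?

    @collection ? value.map { |attributes| build(attributes) } : build(value)
  end

  # Document objects in, JSON string for the database out.
  def serialize(value)
    value.to_json
  end

  # JSON string from the database in, document objects out.
  def deserialize(json)
    return if json.nil?

    cast(JSON.parse(json, symbolize_names: true))
  end

  # Compare the stored JSON with the current in-memory objects.
  def changed_in_place?(raw_old_value, new_value)
    deserialize(raw_old_value) != new_value
  end

  private

  def build(attributes)
    attributes.is_a?(@document_class) ? attributes : @document_class.new(**attributes)
  end
end

type      = PlainDocumentType.new(PlainSubfield, collection: true)
raw       = '[{"code":"a","value":"Hamlet"}]'
subfields = type.deserialize(raw)

type.changed_in_place?(raw, subfields) # => false
subfields.first.value = "Romeo and Juliet"
type.changed_in_place?(raw, subfields) # => true
```

The comparison only works because the document objects define value-based equality (here provided by `Struct`), which is why the real document classes need an `==` method of their own.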
    Let's register our type, since we're going to use it multiple times:
    # config/initializers/type.rb
    ActiveModel::Type.register(:document, DocumentType)
    ActiveRecord::Type.register(:document, DocumentType)
    Now we can use it in our models:
    class MARC::Record < ApplicationRecord
      attribute :fields, :document,
        class_name: "MARC::Record::Field",
        collection: true
    
      def at(tag)
        fields.find { |field| field.tag == tag }
      end
    end
    class MARC::Record::Field
      include ActiveModel::Model
      include ActiveModel::Attributes
      include ActiveModel::Serializers::JSON
    
      attribute :tag, :string
      attribute :value, :string
      attribute :indicator1, :string
      attribute :indicator2, :string
      attribute :subfields, :document,
        class_name: "MARC::Record::Field::Subfield",
        collection: true
    
      # Control fields don't have subfields
      def attributes
        if control_field?
          {
            "tag" => tag,
            "value" => value
          }
        else
          {
            "tag" => tag,
            "indicator1" => indicator1,
            "indicator2" => indicator2,
            "subfields" => subfields
          }
        end
      end
    
      # Control fields are tagged 001 to 009
      def control_field?
        /\A00\d\z/.match?(tag.to_s)
      end
    
      def at(code)
        subfields.find { |subfield| subfield.code == code }
      end
    
      alias [] at
    
      # Used to track changes
      def ==(other)
        attributes == other.attributes
      end
    end
    class MARC::Record::Field::Subfield
      include ActiveModel::Model
      include ActiveModel::Attributes
      include ActiveModel::Serializers::JSON
    
      attribute :code, :string
      attribute :value, :string
    
      def ==(other)
        attributes == other.attributes
      end
    end
    Let's test this in the console:
    record.at("245")["a"].value
    => "Hamlet"
    
    record.changed?
    => false
    
    record.at("245")["a"].value = "Romeo and Juliet"
    record.at("245")["a"].value
    => "Romeo and Juliet"
    
    record.changed?
    => true
    Et voilà! Home-made associations!
    Luckily, you won't need to implement this yourself: this gem (the one providing the ActiveModel::Embedding modules used in the rest of this post) does it for you, and more.
    Here's how we can simplify our models:
    class MARC::Record < ApplicationRecord
      include ActiveModel::Embedding::Associations
    
      embeds_many :fields
    
      # ...
    end
    class MARC::Record::Field
      include ActiveModel::Embedding::Document
    
      # ...
    
      embeds_many :subfields
    
      # ...
    end
    class MARC::Record::Field::Subfield
      include ActiveModel::Embedding::Document
    
      # ...
    end
    We can then code our views with nested attributes support out-of-the-box:
    # app/views/marc/records/_form.html.erb
    <%= form_with model: @record do |record_form| %>
      <% @record.fields.each do |field| %>
        <%= record_form.fields_for :fields, field do |field_fields| %>
    
          <%= field_fields.label :tag %>
          <%= field_fields.text_field :tag %>
    
          <% if field.control_field? %>
            <%= field_fields.text_field :value %>
          <% else %>
            <%= field_fields.text_field :indicator1 %>
            <%= field_fields.text_field :indicator2 %>
    
            <%= field_fields.fields_for :subfields do |subfield_fields| %>
              <%= subfield_fields.label :code %>
              <%= subfield_fields.text_field :code %>
              <%= subfield_fields.text_field :value %>
            <% end %>
          <% end %>
        <% end %>
      <% end %>
    
      <%= record_form.submit %>
    <% end %>
    We can even use validations:
    class MARC::Record < ApplicationRecord
      # ...
    
      validates :fields, presence: true
      validates_associated :fields
    end
    class MARC::Record::Field
      # ...
    
      validates :subfields, presence: true, unless: :control_field?
      validates_associated :subfields, unless: :control_field?
    end
    class MARC::Record::Field::Subfield
      # ...
    
      validates_presence_of :code, :value
    end
    record = MARC::Record.new
    record.valid?
    => false
    
    record.fields = [{ tag: "245" }]
    record.valid?
    => false
    
    record.at("245").subfields = [{ code: "a", value: "Ruby on Rails" }]
    record.valid?
    => true
    We can use custom collections if we need to add custom behaviour:
    class MARC::Record::FieldCollection
      include ActiveModel::Embedding::Collecting
      include Enumerable
    
      def at(tag)
        find { |field| field.tag == tag }
      end
    
      def repeated?(field)
        # ...
      end
    
      # ...
    end
    class MARC::Record < ApplicationRecord
      include ActiveModel::Embedding::Associations
    
      embeds_many :fields, collection: "FieldCollection"
    
      delegate :at, :repeated?, to: :fields
    
      # ...
    end
    record = MARC::Record.first
    record.at("245")["a"].value
    => "Hamlet"
    
    record.repeated?("245")
    => false
    
    record.repeated?("264")
    => true
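The body of `repeated?` is elided above. As an illustration only, here is one way such a collection could behave, written as a dependency-free sketch. `PlainFieldCollection` and `PlainField` are hypothetical names, the sketch does not use the gem's modules, and the assumption that `repeated?` takes a tag and counts matching fields is inferred from the console session rather than from the gem.

```ruby
# Hypothetical stand-in for a field document: only the tag matters here.
PlainField = Struct.new(:tag)

# Hypothetical, dependency-free collection (not the gem's implementation).
class PlainFieldCollection
  include Enumerable

  def initialize(fields)
    @fields = fields
  end

  # Enumerable only needs #each to provide find, count, map, and so on.
  def each(&block)
    @fields.each(&block)
  end

  def at(tag)
    find { |field| field.tag == tag }
  end

  # A tag is repeated when more than one field carries it.
  def repeated?(tag)
    count { |field| field.tag == tag } > 1
  end
end

fields = PlainFieldCollection.new(%w[245 264 264 300].map { |tag| PlainField.new(tag) })

fields.repeated?("245") # => false
fields.repeated?("264") # => true
fields.at("300").tag    # => "300"
```

Including `Enumerable` keeps the custom class small: the domain methods are built on `find` and `count` instead of reimplementing iteration.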
    We can use custom types if we need to cast the elements of a collection:
    class MARC::Record::FieldType < ActiveModel::Type::Value
      def cast(value)
        # ...
      end
    end
    class MARC::Record < ApplicationRecord
      include ActiveModel::Embedding::Associations
    
      embeds_many :fields, cast_type: "FieldType"
    
      # ...
    end
    So the next time you need to model semi-structured data in your Rails application...
  • Give this gem a try!
  • Or use the Attributes API.