Should You Invent a New Query Language? (Probably Not)

"What's worse than data silos? Data silos that invent their own query language." - Erik Bernhardsson

In his infamous and widely discussed blog post named 'I don't want to learn your garbage query language,' Erik Bernhardsson expressed what so many other Data Engineers and Analysts related with so strongly. Namely, that he "really [doesn't] like software that invents its own query language" and that he "just [wants his] SQL back.

The fairly short-yet-passionate rant summarized the almost universal experience that technologies that require their own language often produce a whole new different set of complexities.

There has been an explosion of different types of databases in the last 10 years (Just take a look at the sheer number of NoSQL database types in my last article), and increased specialization is forcing Data Engineers and Analysts to learn an even greater number of new programming languages.

Each of them has its own syntax and specific weirdness, and while ORMs sometimes help, they often make things even more difficult.

That being said, new Query Languages are often necessary (like in the case of NoSQL databases). This begs the question, how can we tell whether a new language actually adds more value than problems it creates? To answer this, we need to take a deep look at what causes all this conflict.

Why are there so many Query Languages?

There are a lot of Query Languages out there. So many that even the most comprehensive lists leave out many important players. I wouldn't even want to attempt to gather a list like this because it seems new ones are added every minute. So let's take a look at what causes this?

In short, there are so many different Query Languages because there are so many different things to query. In addition, data often has to be stored in various models with varying kinds of schemas and varying degrees of separation between physical and logical schema. For example, data stored in a relational database requires different technology and syntax than data stored in JSON files.

An early hierarchical or network database query language will look quite different from a relational query language. In the early days of databases, one had to explicitly navigate the paths between records. That also explains why the last decade has been particularly "fruitful" due to new challenges created by NoSQL. Just take a look at the number of different NoSQL categories in my last blog post.

Even within relational databases, different vendors want to offer special features and functions through their own SQL dialects (T-SQL, PostgreSQL, etc.).

All of this adds up quickly, and in retrospect, it was inevitable that we would end up with a myriad of options to store and query data. Overall that is actually an excellent thing because that way, there is constant improvement through competition.

However, development slowdowns due to too many different query languages persist, so Object-Relational-Mappers (ORMs) were invented. As we are about to see, though, they also often complicate things even more.

Why ORMs Don't Always Help

Let's first take a deeper look at what ORMs are and why you might want to use them.

Object-Relational-Mapping is a process often used in computer science to convert data between incompatible type systems through object-oriented programming languages. The goal is to create an abstract database layer of 'database objects' that can then be queried using the programming language of your choice.

Long story short, it allows you to interact with your database using your language of choice instead of that database's Query Language. This is best explained through a quick example using SQL:

SELECT * FROM users WHERE name = 'Jane Doe';

Object-relational mapping is the idea of writing queries like the one above and much more complicated ones, using the object-oriented paradigm of your preferred programming language. That is where ORMs come in. ORMs libraries that implement this technique in different programming languages. For example, using a JavaScript ORM, the above query would now look something like this:

var orm = require('generic-orm-library');var user = orm("users").where({ name: 'Jane Doe' });

There are, of course, plenty of good things that can come from using an ORM:

  1. No SQL: If you've read my blog post about learning SQL but still aren't convinced that you should learn SQL, then ORMs allow you to use a programming language of your choice.
  2. No Dialects: ORMs abstract away the database system so that switching from MySQL to PostgreSQL, for example, is much easier.
  3. Extra Features: Some ORMs have advanced features such as support for transactions, connection pooling, migrations, seeds, and streams that can be useful in some cases.
  4. Performance: If you really aren't any good at SQL, then an ORM will perform better than you for some queries.

As you probably already deduced from the title of this section, there are also plenty of cons:

  1. Performance: If you are a master at SQL, you can probably get more performant queries by writing them yourself.
  2. Learning: ORMs don't work like normal Query Languages, so if that's what you are used to, there is overhead involved in learning how to use them.
  3. Too Many: The learning problem only worsens when you consider just how many different ORMs there are, and each has its own learning process.
  4. Configuration: The initial configuration of an ORM can take a long time and be a pain.
  5. Black Box: The layer of abstraction that ORMs provide can hinder getting at the real problem of a query or database issue.

To summarize, ORM's alleged benefit is that they cut down development time by adding an abstraction layer, but it often produces an even bigger headache. Instead of using SQL, you now have to scroll back and forth in some ORM documentation to determine how to write the most basic queries. Debugging can also be an issue because sometimes an ORM translates a query "into some monstrosity that joins 17 tables using a full table scan", and you end up "with bloated higher-level data classes rather than easy to understand tuples or dicts that contain the data in a dumb simple format that is trivial to introspect."

So, ORMs can be useful but don't necessarily help with the underlying problem—there are already too many query languages. So what can be done? Should we follow Erik's advice and put a "30-year moratorium on inventing new query languages."

When Should You Invent a New Query Language?

While Erik's call for a moratorium was certainly more of a joke than a serious solution (I think), it did express the sentiment that we need to stop adding so many query languages so quickly pretty well.

It's certainly appreciated that many of these are necessary to deal with new classes of data, new ways that data is stored, or simply too much new data, but how do we know which ones were necessary and which ones weren't?

This is why I suggest a few simple guiding questions and rules for any new developer who might feel the urge to add to the query language ecosystem:

Is My Language Just Another Dialect?

This one seems too obvious, but even big organizations like Oracle and Microsoft couldn't quite follow this rule, resulting in dialects like T-SQL and PL/SQL.

At this point, almost every database has its own ways of defining stored procedures, functions, triggers, and differences that are even more pronounced than for queries and data definitions.

For example, the scripting language for Sybase and SQL Server is called T-SQL. Strictly speaking, SQL Server is the database, and T-SQL is the language, but the two are often used interchangeably when referring to code.

MySQL's scripting language is quite different from T-SQL, as can be seen in the documentation in the reference manual. The first major difference you will probably encounter is that IF is only allowed in programming blocks. The second will be the delimiter, then the word go, and the differences multiply from there.

So please don't create another SQL dialect. That doesn't mean that we shouldn't iterate and improve on SQL, but don't try to do something unnecessarily fancy that nobody asked for because no one will bother learning it.

Does My Language Solve Anything New?

This rule is very similar to the first one, only more broadly applicable. There are many databases and types of data where standard SQL can't get the job done. Graph Databases are a good example of this because, in their case, both the storage and languages were designed specifically to do things that relational databases simply couldn't.

That doesn't mean that you should add to the list of well-established Graph Query languages like Cypher,which is already extremely popular. It was created by Neo4j for use with their own graph database and eventually open-sourced as a separate project called openCypher, which allows any database to implement the same language. It's very visual—using ASCII characters to make shapes for clean and expressive queries.

MATCH (d:Database)-[:USES]->(Cypher)-[:QUERIES]->(:Model:Graph)

Another widely used graph query language is SPARQL, which looks like SQL and was created by the W3C to query RDF graph models. It’s not as common as the other two languages but has unique features because of the subject-predicate-object data model. It can even be used with SQL databases that have tables modeled after RDF structures. Some graph databases also support their own languages like ArangoDB, DGraph, TigerGraph, etc.

The most common graph query language at the moment would be Gremlin, which is part of the Apache TinkePop graph computing framework. It is simple to write, easy to learn, and widely supported by many graph databases and even non-graph databases that can emulate graph queries. On the other hand, it can be verbose for long queries but generally works well for both OLTP and analysis work.

g.V().has("lang","gremlin").out("supported_by").values("db_name")

That being said, the time is ripe for an international standard graph query language. Industry vendors, including Neo4j and TigerGraph, have called this out, and I definitely agree. As graphs continue to see widespread adoption, we have certainly reached a tipping point for this industry, and at this point, you'd probably be doing more harm than good by adding another language to the mix.

Is My Language Easy to Use?

If you have made it this far and your language solves a novel use case that no one has worked on before, you might actually have a point by making your own query language—this is why the last question focuses on how you should go about it that.

Ask yourself, "How long will it take for others to learn my new language?". If your answer is "a couple of weeks," then you should start dedicating a lot of time to simplify your syntax as much as possible.

This doesn't just mean creating tutorials and documentation, which should be self-explanatory. Instead, this means that you will have to make a serious effort to provide processes and abstractions that let a novice pick up your language and solve your main use case in a matter of hours, if not minutes.

If you don't think you can do that, your language probably isn't ready yet.

Conclusion

SQL is one of the most ubiquitous programming languages today, but countless others deserve recognition as well. These languages are often tied to specific technologies that solve use cases that relational databases simply can't.

However, this has lead to an overproduction of query languages, which has significantly slowed down development times due to steep learning curves. ORMs were introduced to solve that problem but ended up complicating matters even more.

That is why smart developers should be wary of introducing yet another query language to the mix. To guide this decision, I laid out some key questions they should be asking themselves to take a good look at the existing ecosphere, use cases, and simplicity.

That being said, nothing quite beats working in good ole' SQL, especially when you have access to a world-class SQL editor like Arctype.

Arctype has built a collaborative SQL editor that allows you to share databases, queries, and dashboards with anyone easily. Join our growing community and try out Arctype today.

29