Big Data: What to learn

Migrating an oldie but a goodie to the new blog. Back in 2016, I gave a friend who wanted to change careers into data science some advice on getting started with big data, and I think a lot of it has stood the test of time, even if we don't hear that catchphrase as much as we used to.

Firstly, decide whether you want to specialize in:

  1. "data science" / "analysis", or
  2. making the databases such people work with ... well ... work

  • For the former, get good at statistics and program something at least once.
  • For the latter, get good at programming and expose yourself to a few statistics & algebra fundamentals if you didn't as a youth.
  • For both, learn about the 3 V's & the 5 database types.

Great -- so let's presume that you've decided you are willing to learn stats or programming -- or that you already know them and you want to talk your way into a job that will let you learn the "big data" details on the job.

What do you need to know about "big data" to show that you have the fundamental knowledge to pick up the details after you start work?

3 V's

Your homework, for learning about "big data," is not just to read, but to study all 21 articles in this series by Pinal Dave. And by that, I mean be able to answer the questions I suggest below about anything you learn from it (and from various Google tangents I hope you follow as you run into unfamiliar terms).

Pay special attention to "the 3 V's": velocity, variety, & volume. These are the 3 problems with data that make people call data "big."

Always ask yourself:

  • "Which 3 V's problems does this approach to handling big data attempt to solve?"
  • "How?"
  • "How well is/isn't it doing so?"
    • "In what situations is it strong/weak at solving those problems?"
    • "Why?"

Do not skip asking yourself those questions.

Study this like it's school. Grab a blank notebook, take notes on things you learn, and in those notes, answer these questions.

Being able to answer these questions about the things you learn is what will make you understand them so well "you could explain it to a 6-year-old."

Which also means you can explain it to the non-technical department heads interviewing you for a job, and explain how well-suited this deep understanding makes you to solve their particular business problems.

5 database types

If you want to go deeper, I suggest that you especially learn about the 5 main types of database involved in "big data." Be sure to take notes with "why?" questions as you learn about these principles, too.

Relational

This is your classic FileMaker Pro / Microsoft Access / Oracle / MySQL / PostgreSQL / SQL Server database. It's the way you think of the interlinking between Salesforce objects. It's the product you probably think of as a database.

If you've never used any of these tools, a relational database is a bit like having a bunch of Excel spreadsheets that cross-reference each other with VLOOKUP, and you're not allowed to add or remove columns willy-nilly or merge cells.
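To make the spreadsheet analogy concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables and their contents are made up for illustration; the point is only that each table has fixed columns and that a JOIN does the cross-referencing a VLOOKUP would do.

```python
import sqlite3

# In-memory database; table names, columns, and rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " total REAL NOT NULL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 5.0), (12, 2, 40.0)],
)

# The JOIN plays the role of VLOOKUP: each order row looks up its customer.
rows = conn.execute(
    "SELECT c.name, SUM(o.total) "
    "FROM orders AS o JOIN customers AS c ON c.id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(rows)
```

Notice that the columns were declared up front. That rigidity is the "no adding columns willy-nilly" part of the analogy.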

Key-Value

The heart of most "NoSQL" (nonrelational) database architecture.
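As a rough mental model (not how any particular product is built), a key-value store behaves like a giant dictionary: you hand it a key, it hands back whatever blob you stored there, and it does not look inside the blob. The `KeyValueStore` class and the key names below are hypothetical.

```python
class KeyValueStore:
    """Toy in-memory key-value store: opaque values looked up by exact key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a key that is not there is treated as a no-op.
        self._data.pop(key, None)


kv = KeyValueStore()
kv.put("user:42", b'{"name": "Ada"}')       # the store never parses this value
kv.put("session:abc", b"expires=1700000000")
print(kv.get("user:42"))
print(kv.get("user:99"))                    # unknown key falls back to None
```

Lookups by exact key are simple and fast, but a question like "which users are named Ada?" would mean reading every value. That trade-off is worth writing a "why?" note about.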

Columnar

A specialized form of key-value database.

Learn how, plus why "columnar" is different enough to get its own name.
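One way to picture the key-value connection, assuming "columnar" here means the wide-column (column-family) style of store: the key is no longer a single string but a combination of row key and column family, and each row may hold a different set of columns. The `WideColumnStore` class below is a hypothetical toy, not any real product's API.

```python
from collections import defaultdict


class WideColumnStore:
    """Toy wide-column store: (row_key, column_family) -> {column: value}."""

    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, row_key, family, column, value):
        self._cells[(row_key, family)][column] = value

    def get_family(self, row_key, family):
        # Return a copy so callers cannot mutate the stored cells.
        return dict(self._cells.get((row_key, family), {}))


wide = WideColumnStore()
wide.put("user:1", "profile", "name", "Ada")
wide.put("user:1", "profile", "city", "London")
wide.put("user:2", "profile", "name", "Grace")    # no "city" column for this row
wide.put("user:1", "activity", "last_login", "2016-03-01")

print(wide.get_family("user:1", "profile"))
print(wide.get_family("user:2", "profile"))
```

Reading one column family for one row touches only that slice of the data, which hints at why this layout earned its own name.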

Document

A specialized form of key-value database.

Learn how, plus why "document" is different enough to get its own name.
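A hedged sketch of the idea: a document store still maps a key to a value, but the value is a nested structure (think JSON) that the database can look inside. The `DocumentStore` class, its `find` method, and the sample documents are all invented for illustration.

```python
_MISSING = object()  # sentinel for "this document has no such field"


def _lookup(document, path):
    """Walk a dotted path like 'address.city' through nested dicts."""
    node = document
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


class DocumentStore:
    """Toy document store: a key-value store whose values are nested dicts."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, document):
        self._docs[doc_id] = document

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def find(self, path, expected):
        """Return sorted ids of documents whose field at `path` equals `expected`."""
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if _lookup(doc, path) == expected
        )


docs = DocumentStore()
docs.put("u1", {"name": "Ada", "address": {"city": "London"}})
docs.put("u2", {"name": "Grace", "address": {"city": "Arlington"}, "tags": ["navy"]})
docs.put("u3", {"name": "Linus"})  # documents need not share a shape

print(docs.find("address.city", "London"))
```

Unlike the plain key-value case, the store can answer questions about what is inside the values, and documents in the same collection do not have to share a schema.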

Graph

Some versions are a specialized form of key-value database.

Learn how, plus why "graph" is different enough to get its own name.
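As an illustrative sketch only: one simple way to store a graph on top of key-value ideas is an adjacency list, where each node is a key and its value is the list of nodes it points to. The names in `follows` and the `hops_between` helper are made up; real graph databases offer far richer query languages.

```python
from collections import deque

# Hypothetical "who follows whom" data: node -> list of neighbors.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def hops_between(graph, start, goal):
    """Breadth-first search: fewest edges from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


print(hops_between(follows, "alice", "erin"))
print(hops_between(follows, "erin", "alice"))  # edges are one-way here
```

The interesting question for your notes: a "how many hops apart?" query is a short loop here, so what would the same query look like as joins in a relational database?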

Compare and contrast

About each of these 5 database types, learn the following principles deeply enough to do "compare & contrast" between them:

  • "Which '3 Vs' problems + other problems does it try to address, how, and how well / for what types of data storage-retrieval needs?"
  • "What types of data storage & retrieval is it optimized for, speed-wise?"
  • "What types of data storage & retrieval is it optimized for, coding-wise? (What operations do & don't give programmers a nervous breakdown?)"
  • "How well can it be 'distributed' so that multiple computers can break up & simultaneously work on sub-pieces of a 'store' or 'retrieve' or 'retrieve-and-aggregate' (min, max, avg, etc.) request?"
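The last question, about distributing a "retrieve-and-aggregate" request, can be sketched in a few lines. This is a simplified illustration that uses threads on one machine to stand in for separate computers; the `shards` data and the `summarize` / `combine` function names are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(shard):
    """Work done independently on each machine: a small partial summary."""
    return {"count": len(shard), "sum": sum(shard), "min": min(shard), "max": max(shard)}


def combine(partials):
    """Work done once at the end: merge the partial summaries."""
    partials = list(partials)
    count = sum(p["count"] for p in partials)
    total = sum(p["sum"] for p in partials)
    return {
        "count": count,
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
        # avg must come from total sum / total count, not an average of averages.
        "avg": total / count,
    }


# Pretend each inner list lives on a different computer.
shards = [[3, 9, 4], [10, 2], [7, 7, 6]]

with ThreadPoolExecutor() as pool:
    result = combine(pool.map(summarize, shards))

print(result)
```

Min, max, and count merge trivially, while avg only works because each shard ships its sum and count rather than its own average. Asking which aggregates split up this cleanly, and which do not, is a good "why?" exercise for each database type.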

If you want to start a successful catering business out of your home, you need to have some sense of how a kitchen's layout and tools impact what's easy to cook in it. There's no point adding ice cream to the menu if you can't easily freeze things.

Same idea with "big data." It's important to be able to recognize the pros & cons of a given environment + set of tools for solving a given problem.

If an interviewer says a company is having trouble with their [insert brand here] database and that they're trying to solve [insert problem here], how important to the company would you be if you're the person who can see that there's a fundamental mismatch between what they're trying to do and how they're organizing the data they need to do it with?

Alternatively, if there isn't a mismatch and they just need a good analyst/programmer on board, you will be able to recognize the interesting challenges that you could tackle for them.

For this deeper dive, I highly recommend a book called Making Sense of NoSQL by Ann Kelly and Dan McCreary.

What's old is new again

Finally, be aware that these are all just new ways of thinking about applying very old mathematics and logic to data storage/retrieval/analysis/visualization problems, now that the problems are "bigger" (see "3 V's").

You can't go wrong studying the old stuff.

  • For analysts, statistics and visualization fundamentals date back to the 1800s.
  • For programmers, important "information retrieval" principles from throughout the 1900s apply.

Good luck!
