XML, JSON, YAML - Python data structures and visualization for infrastructure engineers
As developers, we need to store persistent data for a variety of reasons.
Let's cover a computer science concept being used here - semaphores. Edsger Dijkstra (you may remember him from OSPF) coined this term from the Greek sema (sign) and phero (bearer) to solve Inter-Process Communication (IPC) issues.

To provide a reductionist example: process A and process B need to communicate somehow, but shouldn't access each other's memory - or, in the '60s, shared memory simply wasn't available. To solve this problem, the developer needs a method of storing variables in a manner that is both efficient and can be consistently interpreted by either process.

Dijkstra's example, in this case, was binary and required a schema to interpret the same data - but the idea was not specifically limited to single binary blocks. This specific need actually influenced one of the three data formats we're comparing here - as it happens, the oldest of the three.
Spoiler alert - anyone working with automation MUST learn all three to be truly effective. My general guidance would be:
- Use YAML for human inputs. YAML user-input drivers can also parse JSON, making this an extremely flexible approach.
- Use JSON for outputs. `json.dumps(dict, indent=4)` is pretty handy for previewing what your code thinks structured data looks like. Technically this is possible with YAML, but conventions on, say, a string literal can be squishy:
  - YAML with `name: True` could be interpreted as:
    - JSON of `"name": true`, indicating a Boolean value
    - JSON of `"name": "True"`, indicating a String
  - Sure, this is oversimplified, and YAML can be explicitly typed, but generally, YAML is awesome for its speed and low initial friction. If an engineer knows YAML really well (and writes their own classes for it), going all-YAML here is completely possible - but to me that's just too much work.
- If you use it in the way I recommend, just learn to interpret JSON and use Python's `json` library natively, and remember `json.dumps(dict, indent=4)` for outputs. You'll pick it up in about half an hour and just passively get better over time.
`Element` and `ElementTree` constructs are more nuanced than dictionaries, so a package like `defusedxml` is probably the best way to get started. There are a lot of binary invocations/security issues with XML, so using the basic XML libraries by themselves is ill-advised. `xmltodict` is pretty handy if you just want to convert XML into another format.

Note: JSON and XML both support schema validation, an important aspect of semaphores. YAML doesn't have a native function like this, but I have used Python's Cerberus library to do the same thing here.
YAML is pretty easy to start using in Python. I'm a big fan of the `ruamel.yaml` library, which adds on some interesting capabilities when parsing human inputs. I've found a nifty way to parse using `try`/`except` blocks - making a parser that is supremely agnostic, ingesting JSON or YAML, as a string or a file!

```yaml
message:
  items:
    item:
      "@tag": Blue
      "#text": Hello, World!
```
```python
#!/usr/bin/python3
import json
from ruamel.yaml import YAML
from ruamel.yaml import scanner

# Load Definition Classes
yaml_input = YAML(typ='safe')
yaml_dict = {}

# Input can take a file first, but will fall back to YAML processing of a string
try:
    yaml_dict = yaml_input.load(open('example.yml', 'r'))
except FileNotFoundError:
    print('Not found as file, trying as a string...')
    yaml_dict = yaml_input.load('example.yml')
finally:
    print(json.dumps(yaml_dict, indent=4))
```
```json
{
    "message": {
        "items": {
            "item": {
                "@tag": "Blue",
                "#text": "Hello, World!"
            }
        }
    }
}
```
```python
#!/usr/bin/python3
import json

with open('example.json', 'r') as file:
    print(json.dumps(json.loads(file.read())))
```
Typically, I'll just use `json.dumps(dict, indent=4)` on a live `dict` when I'm done with it - dumping it to a file. JSON is a well-defined standard, and software support for it is excellent.

Due to its IETF bias, JSON's future seems to focus on the streaming/logging required for infrastructure management. JSON-serialized Syslog is a neat application here, as you can write it to a file as a single line, but also explode it for readability, infuriating `grep` users everywhere.

XML's document and W3C bias read very strongly. Older Java-oriented platforms like Jenkins CI heavily leverage XML for semaphores, document reporting, and configuration management. Strict validation (documents MUST be well-formed) is required for compiled languages to synergize well with the capabilities provided. XML also heavily uses HTML-style escaping and tagging approaches, making it familiar to some web developers.
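The single-line-versus-exploded JSON logging tradeoff mentioned above can be sketched in a few lines - the log record fields here are illustrative, not any Syslog standard:

```python
#!/usr/bin/python3
# Sketch of JSON-serialized logging: one compact line per record for
# appending to a file, exploded on demand for human readability.
import json

record = {"severity": "info", "host": "router1", "message": "Hello, World!"}

one_line = json.dumps(record)             # single line, easy to append and stream
readable = json.dumps(record, indent=4)   # exploded view, hard on grep

print(one_line)
print(readable)
```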
XML has plenty of downsides. Crashing on invalid input is generally considered excessive or "Steve-Ballmer"-esque, making the language favorable for mission-critical applications where misinterpreted data MUST not be processed, and miserable everywhere else. For human inputs, it's pretty wordy, which impacts readability quite a bit.
XML has two tiers of schema: Document Type Definition (DTD) and XML Schema. DTDs are very similar to HTML DTDs and provide a method of validating that the language is correctly used. XML Schema Definitions (XSD) provide typing and structures for validation and are the more commonly used tool.
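For a flavor of XSD validation in practice, here's a minimal sketch using the third-party `lxml` library (not used elsewhere in this post) - the schema and documents are invented for illustration:

```python
#!/usr/bin/python3
# Illustrative XSD sketch with lxml: declare one string element, then
# validate a conforming and a non-conforming document against it.
from lxml import etree

xsd_doc = etree.XML(
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:element name="item" type="xs:string"/>'
    '</xs:schema>'
)
schema = etree.XMLSchema(xsd_doc)

good = etree.XML('<item>Hello, World!</item>')
bad = etree.XML('<wrong>Hello</wrong>')

print(schema.validate(good))  # True - matches the declared element
print(schema.validate(bad))   # False - no such element declared
```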
XML leverages the `Element` and `ElementTree` constructs in Python instead of `dict`s. This is due to XML being capable of so much more, but it's still pretty easy to use.

XML Document:
```xml
<?xml version="1.0" encoding="ISO-8859-1" ?>
<message>
  <items>
    <item tag="Blue">Hello, World!</item>
  </items>
</message>
```
```python
#!/usr/bin/python3
from defusedxml.ElementTree import parse
import xmltodict
import json

document = parse('example.xml').getroot()
print(document[0][0].text + " " + json.dumps(document[0][0].attrib))

file = open('example.xml', "r").read()
print(json.dumps(xmltodict.parse(file), indent=4))
```
After using both methods, I generally prefer using `xmltodict` for data processing - it lets me use a common language, Python `list`s and `dict`s, to process all data regardless of source, allowing me to focus more on the payload. We're really fortunate to have this fantastic F/OSS community enabling that!