XML, making everything just a little bit harder.
So here's a fun exercise in XML, standards, and data catalogs.
I'm working on ingesting a bunch of records from a variety of data catalogs, of a variety of types. One I'm looking at now uses OAI-PMH. Fortunately there's a nice little Python library called Sickle that abstracts most of the pain away. Until you're dealing with non-Dublin Core datasets.
Sickle does allow you to plug in a parser for other types (oh hi XPath, I haven't missed you at all).
The dataset I'm using uses the ANZLIC profile for OAI-PMH (fun side note: the official repo for the info had a broken link to the standard, because bitrot even comes for ISO committees). Its catchier name is "AS/NZS ISO 19115.1:2015 Metadata".
So I need to write a custom parser for this.
Then I hit this, when looking for keywords in a data record.
<gmd:descriptiveKeywords>
  <gmd:MD_Keywords>
    <gmd:keyword>
      <gco:CharacterString>040104</gco:CharacterString>
    </gmd:keyword>
    <gmd:thesaurusName>
      <gmd:CI_Citation>
        <gmd:title>
          <gco:CharacterString>Australian and New Zealand Standard Research Classification</gco:CharacterString>
        </gmd:title>
        <gmd:alternateTitle>
          <gco:CharacterString>ANZSRC</gco:CharacterString>
        </gmd:alternateTitle>
        <gmd:date>
          <gmd:CI_Date>
            <gmd:date>
              <gco:Date>2008</gco:Date>
            </gmd:date>
            <gmd:dateType>
              <gmd:CI_DateTypeCode codeList="http://asdd.ga.gov.au/asdd/profileinfo/gmxCodelists.xml#CI_DateTypeCode" codeListValue="creation">creation</gmd:CI_DateTypeCode>
            </gmd:dateType>
          </gmd:CI_Date>
        </gmd:date>
      </gmd:CI_Citation>
    </gmd:thesaurusName>
  </gmd:MD_Keywords>
</gmd:descriptiveKeywords>
What on earth is this nonsense? So we can ignore the date code bit (fun sidenote: the asdd.ga.gov.au domain no longer exists, lucky it's not important - I reckon I can interpret "2008" as a date without an XSD file).
So obviously 040104 is a reference to something.
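For what it's worth, pulling the code and the thesaurus title out of a block like the one above could look roughly like this. It's a minimal sketch using only the standard library rather than Sickle's own hooks. The namespace URIs are an assumption (they're the usual ISO 19139 ones, but check what your record's root element declares), and `extract_keywords` and `SAMPLE` are names made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (the usual ISO 19139 ones); verify against the real record.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Trimmed copy of the snippet above, with the namespace declarations added
# so it parses on its own.
SAMPLE = """
<gmd:descriptiveKeywords xmlns:gmd="http://www.isotc211.org/2005/gmd"
                         xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:MD_Keywords>
    <gmd:keyword>
      <gco:CharacterString>040104</gco:CharacterString>
    </gmd:keyword>
    <gmd:thesaurusName>
      <gmd:CI_Citation>
        <gmd:title>
          <gco:CharacterString>Australian and New Zealand Standard Research Classification</gco:CharacterString>
        </gmd:title>
      </gmd:CI_Citation>
    </gmd:thesaurusName>
  </gmd:MD_Keywords>
</gmd:descriptiveKeywords>
"""


def extract_keywords(element):
    """Yield (keyword, thesaurus_title) pairs for every MD_Keywords block."""
    for block in element.iter("{%s}MD_Keywords" % NS["gmd"]):
        # The thesaurus title is optional; findtext returns None if it is absent.
        thesaurus = block.findtext(
            "gmd:thesaurusName/gmd:CI_Citation/gmd:title/gco:CharacterString",
            default=None,
            namespaces=NS,
        )
        for kw in block.findall("gmd:keyword/gco:CharacterString", NS):
            yield (kw.text or "").strip(), thesaurus


if __name__ == "__main__":
    for keyword, thesaurus in extract_keywords(ET.fromstring(SAMPLE)):
        print(keyword, "|", thesaurus)
```

Keeping the thesaurus title next to each keyword matters here, because the bare code means nothing until you know which classification it belongs to.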
A bunch of googling and staring at the wall finally led me to the Australian Bureau of Statistics, in particular to standard 1297.0, where Australian and New Zealand Standard Research Classification (2008) is the current version.
From there, you can go to the downloads tab and find the table that maps the 2008 codes to the 2020 codes. In a 1.5 MB Excel file. So I exported the relevant bits of the 2008 codes, threw them in a small SQLite db, and ended up with 040104,Climate Change Processes
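The lookup side of that could be sketched like this. It assumes the exported rows are plain (code, label) pairs. The table name, column names and helper functions are invented for the example, and the only row included is the one from this post.

```python
import sqlite3

# The single row known from this post; real rows would come from the exported
# 2008 codes. Codes are kept as text so the leading zero survives.
ROWS = [("040104", "Climate Change Processes")]


def build_lookup(rows):
    """Create an in-memory SQLite table mapping codes to labels (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE anzsrc_2008 (code TEXT PRIMARY KEY, label TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO anzsrc_2008 (code, label) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


def label_for(conn, code):
    """Return the label for a code, or the code itself if it is unknown."""
    row = conn.execute(
        "SELECT label FROM anzsrc_2008 WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else code


conn = build_lookup(ROWS)
print(label_for(conn, "040104"))
```

Falling back to the raw code for anything not in the table is a judgment call: it keeps ingestion moving, at the cost of leaving some keywords unreadable.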
So all that XML up above? It could have just been
<keyword>Climate Change Processes</keyword>
But nooooo.