Convert multiple xsd files into SQL - mysql

I have a challenging situation.
ECHA publishes the schema for their IUCLID product as zipped collections of xsd files. I want to be able to import data into SQL and use those xsd files to build the SQL tables.
The trouble is while they publish a structure showing an ERD style relationship layout - https://poisoncentres.echa.europa.eu/documents/1789887/5577602/pcn_format_data_model_en.pdf/d667afb6-a36b-4596-48dc-3b2de436d3de?t=1635233313478 - the xsd does not have any sign of those relationships.. (though I might simply be missing them)
The xsd are found in here - https://poisoncentres.echa.europa.eu/documents/1789887/10223884/PCN+Format_v4.0.zip/7d7641c0-facc-898c-bffd-45b080bfdb75?t=1635502393893 and the general page is here - https://poisoncentres.echa.europa.eu/poison-centres-notification-format
I think my option is to do it the long way by converting xsd into SQL (pref Mysql dialect) via maybe a json schema and then hand stitching things together with the hope that there are no huge differences when they release a new version - but this feels icky :)
Does anyone have any thoughts on a better method?
EDIT:
So I also noticed this https://poisoncentres.echa.europa.eu/documents/1789887/6428404/echa_example_1_dossier.i6z/98873135-5373-d2df-0cc2-9177a403cad0?t=1590667188569 which is an example PCN document.. this at least gives relationships between chunks of xml, but I'm not sure these will always be available on new versions of the schema.. and boy its painful to parse..

So I'm going to steadily update this as I progress..
The first challenge is to bring all these XSDs into a single file so we can toss what is not being used and throw it into things like xsd2xml for generation of sample XML files..
How could I merge an XSD schema with imports and includes into a single file? Covers this - but much of what is mentioned is gone.
This - The 'minOccurs' attribute cannot be present looks tantalizing, but as usual all gone.
This looks the thing - https://sourceforge.net/projects/graphvisu/ but only wants HTTP - doesn't like file
Managed to get xsdwalker running - which shows up a problem.. it seems that the set of xsds doesn't really fit cleanly into a hierarchy.. which means we need to run with the sample xmls instead
Update:
Stupid me - the format uses href:xlink to reference other portions - but this isn't an XML cross reference, its just a href - so I am manually replacing the "name": { "xlink:type": "simple", "xlink:href": "52a44784-64a6-4836-b8d6-b84315fd958e_f53d48a9-17ef-48f0-8d0e-76d03007bdfe.i6d", "content": "Ethane-1,2-diol" },
with the expanded i6c:Document format - this is not pretty at all, and once I've coded something then I'll be cherry picking out what is needed... also as a complete facepalm and abuse of the XML format - almost everything in the XML i6c document is coded values - even though the idea of wasting all that bandwidth with readable variable names - is to have readable variables as well.. FFS
I compromised and inserted a "contents' key holding the referenced file contents - see Searching an XML structure but modifying a node higher in the hierarchy

Related

How should a "project" file be written?

With popular software packages, like Microsoft Word or Photoshop, we often have an option to save our progress as a "project" file and later can open that file to edit our works furthermore. This file often contains all the options and the progress that the user has made (i.e the essay you typed in Word).
So my question is, if I am doing a similar application that requires creating a similar "project" file, how should I go about this? My application is a scientific application, which means it required a lot of (multi-dimension) arrays. I understand there will be a lot of options to do this, but I would like to know the de facto way.
Here are some of the options I have outline out:
XML: Human readable. The size is too big and it's too much work to deal with arrays.
JSON: More popular/modern. Good with array.
Protocol Buffer: It is created by Google. Probably faster.
Database: Probably not a good use case since "project" files are most likely "temporary". Also, working with arrays is not very straight forward.
Creating your own binary format: Might be the most difficult solution for an inexperienced programmer like myself.
???
I would like to get some advice from you guys. Thank you :).
(Good question. :) Only some thoughts) I'd prefer text format for the main project file. You can make diffs and open and read and modify it easily. Large ascii or binary data can be stored as serialized data in external files or in a database like SQLite from where it can be easily accessed and processed through the application. The main project has links to the external data store. My advice for the main project file is a simple XML format that can easily be transformed to JSON format. A list of key value pairs (dict) is good for the beginning. value can be of basic datatype or be an array or dict. A complicated XML tree is not good. The key name can also help to describe and structure data. So i'd prefer key="rect.4711.pos.x" value="500" and not <rect id="4711"><pos><x>500</x>...</pos>.... Important aspect is that the project data is portable and self-contained, and the user can see the project as a single unit even if it is a directory on the file system, for this purpose supporting some kind of zipped format of project data is good.

Cutting a config file down to size

Morning!
I've got an app with a config file that's become unwieldy - many switches with no intuition as to which combinations are valid. Right now, all the switches are stored in an XML file. The config file specifies inputs for a large HPC job.
I'm thinking of writing some a formal grammar for a run - that is, the sort of combinations that are acceptable, and from the parsing of it, the switches needed will automatically be inferred. The values would still be read from the XML file, but only when needed.
Is this sort of approach reasonable? How would I go about implementing a grammar without a parser?
If I understand you correctly, you want to implement a Domain Specific Language (DSL), the purpose of which is to specify validation rules for the contents of an XML-based configuration file.
Some people implement a DSL by defining a parser specific to the needs of the DSL. However, some other people shoehorn the semantics of their DSL into the syntax of an existing file format, such as XML or JSON. So if you want to avoid having to write a parser, you could express your DSL in XML syntax.

Application Input Specification: Drawing input data of method

Does anyone know a good way to to draw the exact structure of input data for a method? In my case I have to specify the correct input data for a server application. The server gets an http post with data. Because this data is a very complex json data structure, I want to draw this, so next developer can easily check the drawing and is able to understand, what data is needed for the http post. It would be nice if I can also draw http headers mark data as mandatory or nice to have.
I dont need a data flow diagramm or sth. like that. What I need is a drawing, how to build a valid json for the server method.
Please if anyone have an idea, just answer or comment this question, even if you just have ideas for buzz words, I can google myself.
In order to describe data structure consider (1) using the UML class diagram with multiplicities and ownership and "named association ends". Kirill Fakhroutdinov's examples uml-diagrams.org: Online Shopping and uml-diagrams.org: Sentinel HASP Licensing Domain illustrate what your drawing might look like.
As you need to specifically describe json structure then (2) Google: "json schema" to see how others approached the same problem.
Personally, besides providing the UML diagram I'd (3) consider writing a TypeScript definition file which actually can describe json structure including simple types, nested structures, optional parts etc. and moreover the next developer can validate examples of data structures (unit tests) against the definition by writing a simple TypeScript script and trying to compile it

Is there anything wrong with YAML format to be joined to the web standards

Well, I think YAML is really fantastic...
It's beautiful, easy to read, clever syntax...compared to any other data serialization format.
As a superset of JSON we could say it's more elaborated, hence its language evolution.
But I see some different opinions out there, such:
YAML is dead,
don't use yaml and so on...
I simply can't understand on what this is based because it seems so nice :)
If we take few well succeeded examples over the web such as Ruby on Rails, we know they use yaml for simple configuration, but one thing that gets me curious is why yaml is not being part of most used formats over web like XML and JSON.
If you take twitter for example...why not offer the data in YAML format from the API as well?
Is there something wrong by doing it?
We can see the evolution on no-sql databases like couchdb, mongo, all json based, even one great project called jsondb which looks very lightweight and it definitely can do the job.
But when writing data structures in json I really can't understand why YAML is not being used instead.
So one of my concerns would be if is there something wrong with YAML?
People can say it's complex, but well, if you pretend to use the same features you would get in json it's definitely not. You will get a more beautiful file for sure tho and with no hassle. It would be indeed more complex if you decide to use more features, but that's how things are, at least you have the possibility to use it if you want to.
The possibility to choose if you want or not to use double-quotes for string is fantastic makes everything cleaner and easier to read....well you see what's my point :)
So my question would be, why YAML is not vastly used in place of JSON?
Why it doesn't seem that it will be used for data structure transfers within the online community?
All I can see is people using it for simple configuration files and nothing else...
Please bear with me since I might be completely wrong and very big projects might be happening and my ignorance on the subject didn't allow me to be a part of it :)
If is there any big project based on yaml out there I would be very happy to know about it
Thanks in advance
It's not that there's something wrong with YAML — it's just that it doesn't offer any compelling benefits in many cases. YAML is basically a superset of JSON. For most purposes, JSON is quite sufficient — people wouldn't be using advanced YAML features even if they had a full YAML parser — and its close ties to JavaScript make it fit in well with the technologies that Web developers are using anyway.
TLDR: People are already using as much YAML as they need. In most cases, that's JSON.
YAML uses more data than non-prettified JSON. It's great for files that humans might want to edit themselves but when all you're doing is passing data around, you're wasting bandwidth if you're using YAML.
If you need an explanation: each space in UTF-16 is two bytes. YAML uses spaces for indentation, and newline characters for nesting.
Take this example:
foo:
bar:
- foo
- bar
This requires 44 characters (including newline characters). The equivalent JSON would be only 29 characters:
{"foo":{"bar":["foo","bar"]}}
Then just imagine what happens if you URL-encode the YAML. It becomes 95 characters:
foo%3A%0A%20%20%20%20bar%3A%0A%20%20%20%20%20%20%20%20-%20foo%0A%20%20%20%20%20%20%20%20-%20bar
Meanwhile the JSON just becomes 64 characters:
%7B%22foo%22%3A%7B%22bar%22%3A%5B%22foo%22%2C%22bar%22%5D%7D%7D
The size increase to YAML from JSON is more than double when it's URL-encoded, in the above example. And I'm sure you can just imagine that the longer your YAML file, the more and more this difference will increase.
Oh, and one other reason not to use YAML: stackoverflow.com does not support YAML syntax highlighting... ! (Of course, I would argue that YAML is so beautiful that it doesn't need syntax highlighting. That's kind of the point of YAML, I think.)
In Ruby many people argue that configuration should be Ruby, rather than YAML. This saves the parsing stage, means you don't have to learn the new syntax, and don't end up with ERB tags everywhere when you are dynamically generating YAML content (Rails fixtures).
Personally I have to agree, and can't see what YAML would offer to network transfers that would make it a worthwhile consideration over JSON.
YAML has an amount of problems, there is a good article
YAML: probably not so great after all on that.
Short summary (in addition to problems already listed in other answers):
Unreadable except for simple and short things
Insecure by default
Has portability issues
Very complex, with amount of surprising behaviors
I considered using YAML few times and never did. The reason always had to do with white spaces for indentation. While I personally love this, even to me it sounded like asking for trouble, because
For sure someone will make a mistake, not expecting that changing white spaces will break the file. Sometimes someone who has no idea about the language / format has to go to the file to change one number or string.
You can't guarantee that everybody everywhere will have it's comparison / merging / SC software configured properly to catch white space or empty lines differences.

Simple HTML interface to XSD?

I'm writing an app that, at its heart, uses a hierarchical tree of nodes
in XML, it looks like this:
<node>
<name>Node1</name>
<Attribute1>Something</Attribute1>
<Attribute2>SomethingElse</Attribute2>
<child>Node2</child>
<child>Node4</child>
<child>Node7</child>
</node>
And so on (all child elements must refer to an existing node, though the node inquestion doesnt have to precede the first reference to it)
For a simple structure like this is there a simple tool to generate a html page that will allow a user to enter Nodes and dynamically update a server-side xml file?
Im basically writing a tool that will use such a file, but the people who's job it is to create the file arent especially techno-literate, so creating the XML by hand is a no-no.
I could hand-crank one fairly quickly, but if I can get a tool to do it, even better (especially as the format may change in future)....
Xopus is a browser based XML editor that you could use for this. It is designed for the non techno-literate people out there.
Disclaimer: I work at Xopus.
I am pretty sure there is nothing that will do that for you automagically and you'll need to write that bit yourself.
Your options are to create a web based interface to do it, using HTML POST and writing the output to a file or database (then reloading it on submission) or something more advance with Javascript (e.g. that could do it dynamically with AJAX).
You can't do it in HTML alone - either way you'd need something to handle outputting the existing data and accepting HTTP POST requests, but you don't mention what language or platform you are using to write this. Being clear on that will help people suggest appropriate solutions.
You might want to rethink the XML structure ... Elements called "attribute{anything}" are ill advised (as are elements named in the convention foo1, foo2, (etc)). The whole <child>Node2</child> thing doesn't seem like a good way to go either. I suggest posting an actual example of the XML in question.
From what you've said, it sounds like there is no specific need for it to be in XML at all. Not that XML is bad (it isn't) but if putting it in an SQL database is a valid option and you have one of those anyway (e.g. your using a LAMP stack) then that's something to consider.
Would an XML editor like http://www.oxygenxml.com/ suffice? I don't know of any html web ones unless you write one yourself and use AJAX to send the data. At least an XML editor can generate a form that you can use to create and edit XML documents. Microsoft do infopath as well - which is actually designed more for questionaires but might do what you need, if the non-tech people would prefer something more office like.