Saving an sklearn model to HDF5 via h5py / json

When I'm doing a grid search in sklearn it would be convenient to save the different outputs in the same HDF5 file using h5py. Keeping track of many separate files quickly becomes a mess.
Of particular interest are:
the best parameters (best_params_, a dict)
the grid-search results (cv_results_, a dict of numpy (masked) ndarrays)
the model itself (best_estimator_, an estimator)
Since the first two are dictionaries, they can be converted to strings with json.dumps and saved into the HDF5 file as strings. The model, however, is a class instance, so it needs to be serialized with pickle.
h5py doesn't seem to support pickled objects directly, so is there any way around this that would let me save the model into the same HDF5 file?

It's possible to use the numpy void dtype (https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.void) to hold arbitrary binary data, which h5py will store as the HDF5 OPAQUE type (storing other types as opaque data is covered at https://docs.h5py.org/en/stable/special.html#storing-other-types-as-opaque-data, but see the warning there).
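As a sketch of what that could look like for a fitted GridSearchCV (the variable and file names below are placeholders):

```python
import pickle

import h5py
import numpy as np

# `search` is assumed to be an already-fitted GridSearchCV instance and
# "results.h5" an output file of your choosing; both names are illustrative.
with h5py.File("results.h5", "w") as f:
    payload = pickle.dumps(search.best_estimator_)
    # np.void wraps the raw bytes; h5py stores them as HDF5 OPAQUE data.
    f.create_dataset("best_estimator", data=np.void(payload))

# Reading back: recover the bytes and unpickle.
with h5py.File("results.h5", "r") as f:
    model = pickle.loads(f["best_estimator"][()].tobytes())
```

As usual with pickle, the stored bytes will only load back cleanly in an environment with compatible sklearn and Python versions.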
It's unclear, though, why you're using HDF5 if you're planning on dumping arbitrary data into it. If you're going to dump things as JSON and pickles, why not use a zip file, for example, and take advantage of compressing the JSON?
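A rough sketch of that alternative (again assuming a fitted GridSearchCV named search; the archive and entry names are placeholders):

```python
import json
import pickle
import zipfile

# One compressed archive holding the JSON-able parts plus the pickled model.
with zipfile.ZipFile("results.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("best_params.json", json.dumps(search.best_params_))
    zf.writestr("best_estimator.pkl", pickle.dumps(search.best_estimator_))
```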
For example, naively I'd expect best_params_ to consist of simple key-value pairs, which could be saved directly as HDF5 attributes on a group or dataset (e.g. dset.attrs.update(best_params_)), avoiding all the issues with JSON and numbers. The other things you're interested in could likely be encoded in a less roundabout way too, which would let other code read your data meaningfully (pickled data is opaque to anything but Python, and JSON is generally a poor format for non-string data).
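A sketch of that more structured layout (again assuming a fitted GridSearchCV named search, and assuming the best parameter values are plain scalars or strings):

```python
import h5py
import numpy as np

with h5py.File("results.h5", "a") as f:
    grp = f.require_group("grid_search")
    # Simple scalar/string parameters map directly onto HDF5 attributes.
    grp.attrs.update(search.best_params_)

    # Store each cv_results_ column as its own dataset.
    cv = grp.require_group("cv_results")
    for key, values in search.cv_results_.items():
        arr = np.asarray(values)  # note: any mask is dropped here
        if arr.dtype.kind in ("O", "U"):
            # e.g. the "params" column; these would need their own encoding.
            continue
        cv.create_dataset(key, data=arr)
```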

Related

What is the cleanest way to perform nested conversion of numpy types to python (JSON friendly) types?

What is the cleanest way to perform nested conversion of a "deep" object that contains mixed python / numpy types to an object containing only python types?
The question is motivated by the need to send the data as JSON, but here I do not have control over json.dumps() because that is the province of a different application. In other words, I cannot specify the JSON encoder.
One possible solution is to adopt the custom JSON encoder approach anyway, followed by a conversion back with json.loads(). That would mean every message makes two round trips through JSON rather than one, which might not be the end of the world. But is there a "better" alternative?
Note that I need to apply the conversion recursively, so the fact that tolist() or item() sometimes works isn't a complete solution here.
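For illustration, the kind of recursive conversion I mean is something like this (the function name is mine):

```python
import numpy as np

def to_builtin(obj):
    """Recursively convert numpy scalars/arrays inside nested
    dicts/lists/tuples into plain Python types."""
    if isinstance(obj, dict):
        return {to_builtin(k): to_builtin(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_builtin(v) for v in obj]
    if isinstance(obj, np.ndarray):
        return to_builtin(obj.tolist())
    if isinstance(obj, np.generic):  # any numpy scalar type
        return obj.item()
    return obj
```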

How should I process nested data structures (e.g. JSON, XML, Parquet) with Dask?

We often work with scientific datasets distributed as small (<10 GB compressed), individual, but complex files (XML/JSON/Parquet). UniProt is one example, and here is a schema for it.
We typically process data like this using Spark, since it is well supported. I wanted to see, though, what exists for doing this kind of work with the Dask DataFrame or Bag APIs. A few specific questions I had are:
Does anything exist for this other than writing custom Python functions for Bag.map or DataFrame/Series.apply? (A sketch of what I mean follows this list.)
Given any dataset compatible with Parquet, are there any secondary ecosystems of more generic (possibly JIT-compiled) functions for at least doing simple things like querying individual fields along an XML/JSON path?
Has anybody done work to efficiently infer a nested schema from XML/JSON? Even if that schema were an object that Dask/Pandas can't use, simply knowing it would be helpful for figuring out how to write functions for something like Bag.map. I know there are a ton of Python JSON schema inference libraries, but none of them look to be compiled or otherwise built for performance when applied to thousands or millions of individual JSON objects.
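For reference, the most direct Bag-based version of the "custom function" approach from my first question would be something like this (the file pattern and field name are made up):

```python
import json

import dask.bag as db

# Assumes line-delimited JSON records; the glob and field name are illustrative.
records = db.read_text("uniprot/*.jsonl").map(json.loads)

# Pull one nested field per record, akin to querying a simple JSON path.
accessions = records.map(lambda rec: rec.get("accession"))

print(accessions.take(5))
```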

Convert huge linked data dumps (RDF/XML, JSON-LD, TTL) to TSV/CSV

Linked data collections are usually published in RDF/XML, JSON-LD, or TTL format. Relatively large data dumps seem fairly difficult to process. What is a good way to convert an RDF/XML file to a TSV of linked-data triples?
I've tried OpenRefine, which should handle this, but a 10 GB file (e.g. the person authority data from the German National Library) is too much to process on a laptop with decent processing power.
I'm looking for software recommendations or some code, e.g. Python/R, to convert it. Thanks!
Try these:
Lobid GND API
http://lobid.org/gnd/api
Supports OpenRefine (see blogpost) and a variety of other queries. The data is hosted as JSON-LD (see context) in an Elasticsearch cluster. The service offers a rich HTTP API.
Use a Triple Store
Load the data into a triple store of your choice, e.g. rdf4j. Many triple stores provide some sort of CSV serialization. Together with SPARQL this could be worth a try.
Catmandu
http://librecat.org/Catmandu/
A strong Perl-based data toolkit that comes with a useful collection of ready-to-use transformation pipelines.
Metafacture
https://github.com/metafacture/metafacture-core/wiki
A Java toolkit for designing transformation pipelines in Java.
You could use the ontology editor Protégé: there you can query the data with SPARQL according to your needs and save the results as a TSV file. It might be important, however, to configure the software beforehand to keep these amounts of data manageable.
Canonical N-Triples may already be what you are after, as it is essentially a space-separated, line-based format for RDF (you cannot naively split at spaces, though, as you need to take care of literals, see below). Many files of the dataset you cited are available as N-Triples. If not, use a parsing tool like rapper for the conversion to N-Triples, e.g.:
rapper -i turtle -o ntriples rdf-file-in-turtle-format.ttl > rdf-file-in-ntriples-format.nt
Typically, N-Triples exporters do not exploit all the whitespace freedom that the specification allows and instead emit canonical N-Triples. Hence, given a line in a canonical N-Triples file such as:
<http://example.org/s> <http://example.org/p> "a literal" .
you can get CSV by replacing the first and the second space character of each line with a comma and removing everything after and including the last space character. As literals are the only RDF terms in which spaces are allowed, and as literals are only allowed in the object position, this should work for canonical N-Triples.
You can get TSV by replacing those space characters with tabs. If you also do that for the last space character and do not remove the dot, you have a file that is both valid N-Triples and valid TSV. If you treat these positions as split positions, you can work with canonical N-Triples files without converting to CSV/TSV at all.
Note that you may have to deal with commas/tabs inside the RDF terms (e.g. by escaping), but that problem exists in any solution representing RDF as CSV/TSV.
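If you do want an explicit TSV file, a small script following exactly this splitting rule might look like the following (file names are placeholders; as noted above, commas/tabs inside literals are not escaped here):

```python
# Convert canonical N-Triples to TSV by splitting each line at the first
# two spaces and dropping the trailing " ." (file names are illustrative).
with open("data.nt", encoding="utf-8") as src, \
        open("data.tsv", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        subj, pred, obj = line.split(" ", 2)
        if obj.endswith(" ."):
            obj = obj[:-2]  # strip the terminating " ."
        dst.write(f"{subj}\t{pred}\t{obj}\n")
```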

JSON library in Scala and distribution of the computation

I'd like to process very large JSON files (about 400 MB each) in Scala.
My use case is batch processing. I can receive several very big files (up to 20 GB, which are then split for processing) at the same moment and I really want to process them quickly as a queue (but that's not the subject of this post!). So it's really about distributed architecture and performance issues.
My JSON file format is an array of objects, and each JSON object contains at least 20 fields. My flow is composed of two major steps. The first one is mapping each JSON object to a Scala object. The second step is some transformations I make on the Scala object's data.
To avoid loading the whole file into memory, I'd like a parsing library that supports incremental parsing. There are so many libraries (Play JSON, Jerkson, Lift-JSON, the built-in scala.util.parsing.json.JSON, Gson) that I cannot figure out which one to pick, with the added requirement of minimizing dependencies.
Do you have any suggestions for a library I can use for high-volume parsing with good performance?
Also, I'm looking for a way to parallelize the mapping of the JSON file and the transformations made on the fields (across several nodes).
Do you think I can use Apache Spark to do it? Or are there alternative ways to accelerate/distribute the mapping/transformation?
Thanks for any help.
Best regards, Thomas
Considering a scenario without Spark, I would advise streaming the JSON with Jackson Streaming (Java) (see for example there), mapping each JSON object to a Scala case class, and sending them to an Akka router with several routees that do the transformation part in parallel.

Getting entity data from AutoCAD

This is a two part question.
1) Is there any way to get a CSV file of all the entity data, including xdata, for an AutoCAD DWG, either using AutoCAD or some other method?
2) Is there an easy way to parse an AutoCAD DXF file to get the entity data into a CSV file?
Unfortunately, neither approach provides an easy method, but it is possible with a little effort.
With a DWG file, the file itself is binary, so your best bet would be to write a plugin or script for AutoCAD using .NET or ObjectARX, but this may be a troublesome approach. AutoLISP would be easier, but I don't think you could write the output to a file.
Getting the entity data out of a DXF would be significantly easier, since DXF is primarily a text format. This would be possible with any programming language, but since there are many possible entity types it would take some effort to handle all of the cases. The DXF reference is available on the Autodesk website. Xdata is also included in the DXF in text form, so that shouldn't be a problem.
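If it helps, the tag structure of an ASCII DXF (a group-code line followed by a value line) is simple enough that a rough extractor can be sketched in a few lines. The following only pulls each entity's type and layer, and the file names and column choice are purely illustrative:

```python
import csv

def iter_dxf_pairs(path):
    """Yield (group code, value) pairs from an ASCII DXF file, which stores
    every tag as a group-code line followed by a value line."""
    with open(path, encoding="utf-8", errors="replace") as f:
        while True:
            code = f.readline()
            value = f.readline()
            if not value:
                break
            yield int(code.strip()), value.rstrip("\r\n")

# Write each entity's type and layer (group code 8) to CSV.
in_entities = False
entity, layer = None, ""
with open("entities.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["entity", "layer"])
    for code, value in iter_dxf_pairs("drawing.dxf"):
        if code == 2 and value == "ENTITIES":
            in_entities = True          # start of the ENTITIES section
        elif in_entities and code == 0 and value == "ENDSEC":
            if entity:
                writer.writerow([entity, layer])
            in_entities = False
        elif in_entities and code == 0:
            if entity:
                writer.writerow([entity, layer])  # flush the previous entity
            entity, layer = value, ""
        elif in_entities and code == 8:
            layer = value
```

Extending this to xdata (group codes 1000 and above) or to more attributes is where the real effort mentioned above comes in.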
You can write output to a file using AutoLISP, even binary output with some sleight of hand. However, writing DXF data to a CSV file, with or without xdata, by either reading the data directly (in situ) or by parsing a DXF file, is completely impractical, given the nature of DXF group codes and their associated data. Perhaps the OP can identify what he wants to achieve, rather than specifying what appears to me to be an inappropriate format for the data.
Michael.