I'm planning to save my q table to a text file (as a string) for future use, but I wondered what the pitfalls of this might be? Also, any advice on what might be a better way to store the q table would be appreciated – would it better to store it as JSON, for example?
In machine learning, a common solution to save models and data is to use the HDF5 format. You could try http://www.h5py.org/, which is a Python library that allows you to save, read and manipulate HDF5 files in a quite easy way.
I don't know if it's the best solution for your purpose or if there's a more specific solution (because this also depends e.g. on the size of your Q tables), but this format/library allows you to save data in a hierarchical way, among other advantages.
For example, if you have multiple trained agents, you could save them as a dictionary e.g. d = {"agent1": q_table1, "agent2": q_table2 }. Also, not only can you save them in this hierarchical fashion, you can also read them and then work with their content as if they were dictionaries. Of course, this is just an example to give you an idea of what you can do with this library.
So, HDF5 is probably a good solution if you plan to have multiple agents and your Q tables can be large, but, at the same time, you want to be able to inspect and change your Q-tables easily and flexibly.
For more details about the advantages of this format compared to other common alternatives (such as databases, simple text files, etc.), see e.g. the questions Is there an analysis speed or memory usage advantage to using HDF5 for large array storage (instead of flat binary files)? and What are the advantages of HDF compared to alternative formats?.
Related
Can machine learning be used to transform/modifiy a list of numbers.
I have many pairs of binary files read from vehicle ECUs, an original or stock file before the vehicle was tuned, and a modified file which has the engine parameters altered. The files are basically lists of little or big endian 16 bit numbers.
I was wondering if it is at all possible to feed these pairs of files into machine learning, and for it to take a new stock file and attempt to transform or tune that stock file.
I would appreciate it if somebody could tell me if this is something which is at all possible. All of the examples I've found appear to make decisions on data rather than do any sort of a transformation.
Also I'm hoping to use azure for this.
We would need more information about your specific problem to answer. But, supervised machine learning can take data with a lot of inputs (like your stock file, perhaps) and an output (say a tuned value), and learn the correlations between those inputs and output, and then be able to predict the output for new inputs. (In machine learning terminology, these inputs are called "features" and the output is called a "label".)
Now, within supervised machine learning, there is a category of algorithms called regression algorithms. Regression algorithms allow you to predict a number (sounds like what you want).
Now, the issue that I see, if I'm understanding your problem correctly, is that you have a whole list of values to tune. Two things:
Do those values depend on each other and influence each other? Do any other factors not included in your stock file affect how the numbers should be tuned? Those will need to be included as features in your model.
Regression algorithms predict a single value, so you would need to build a model for each of the values in your stock file that you want to tune.
For more information, you might want to check out Choosing an Azure Machine Learning Algorithm and How to choose algorithms for Microsoft Azure Machine Learning.
Again, I would need to know more about your data to make better suggestions, but I hope that helps.
Is it possible to display only significant P-values and/or R-values in the output of SPSS?
This would simplify output significantly and reduces the tables to display only the relevant parts (the ones I need).
I'm not sure that this is a good idea, but if you want to do things such as highlight significant coefficients in a regression or blank out nonsignificant correlations in a correlation matrix, the SPSSINC MODIFY OUTPUT extension command can do this. It is included in the Python Essentials for SPSS Statistics and can be downloaded from the SPSS Community site at www.ibm.com/developerworks/spssdevcentral or, for V21, from the same site where Statistics is kept for download or the trial site.
I agree that there are many cases where this is not a good idea.
In general, I find post-processing of SPSS output tables to be a little bit awkward. This is one area in which R is a lot easier to use.
For occasional analyses I often find it useful to paste an SPSS output table into Excel for further processing. For example, you could sort columns by size (e.g., mean difference, p-value, r etc.), calculate new values (e.g., mean differences, absolute correlation, etc.), make table easier to read and so on.
Am I completely off-track to think about using Semantic MediaWiki to store (and organise, report on, etc.) 'tabular' data such as financial transactions or weather readings that would usually live in a spreadsheet or database?
It seems that one would need a separate, tiny, page for each tuple; but then, that's by design and perhaps it's perfectly okay.
I ask, simply because SMW seems like such a quick and easy way to get a collaborative data repository up and running.
Semantic MediaWiki is better suited for keeping track of Factual or Encyclopedic data, where you can have pages about everything you need to know about a certain topic.
For tabular or numerical data such as measurements, financial, sensor data, you would indeed need to create little pages about each data point, which is not practical in many cases.
However, there are extensions to Media Wiki that allow you to integrate external data sources (in MySQL databases or CSV files somewhere) with MediaWiki pages. This can allow you to have the best of both worlds - dynamic access and queries of tabular data and semantic annotations of pages around them.
Take a look at :
http://www.mediawiki.org/wiki/Extension:External_Data
No, I don't think it's such a bad idea.
Using SemanticForms you could enter lots of little data pages quickly and easily (for example, an invoice might require additional pages for each line item, but they could all be entered from one form using the 'multiple' feature of the for template form tag). So although I've never tried logging weather data in SMW, I think it would be pretty easy. I don't see what the problem would be with storing data across so many pages; it's easy enough to combine it in whatever formats you require.
Give it a go and let us know how it goes!
You can use either the Semantic Internal Objects extension (SIO), or SMW's built in subobjects (the former works well with the already mentioned External Data extension), to store multiple semantic objects (could be the rows of your spreadsheet) in one page.
However, unless you are really looking for a collaborative tool with semantic capabilities, I doubt SMW is the best suited piece of software for your task.
edit (november 2015): Since SMW version 1.9, there nothing that SIO can do that the built-in subobjects can't, so I would recommend the latter.
Does anyone know how I can store large binary values in Riak?
For now, they don't recommend storing files larger than 50MB in size without splitting them. See: FAQ - Riak Wiki
If your files are smaller than 50MB, than proceed as you would with storing non binary data in Riak.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online-queries, conflict resolution, increased internal semantics, and robust expressions of relationships.Schema Design in Riak - Introduction
#Brian Mansell's answer is on the right track - you don't really want to store large binary values (over 50 MB) as a single object, in Riak (the cluster becomes unusably slow, after a while).
You have 2 options, instead:
1) If a binary object is small enough, store it directly. If it's over a certain threshold (50 MB is a decent arbitrary value to start with, but really, run some performance tests to see what the average object size is, for your cluster, after which it starts to crawl) -- break up the file into several chunks, and store the chunks separately. (In fact, most people that I've seen go this route, use chunks of 1 MB in size).
This means, of course, that you have to keep track of the "manifest" -- which chunks got stored where, and in what order. And then, to retrieve the file, you would first have to fetch the object tracking the chunks, then fetch the individual file chunks and reassemble them back into the original file. Take a look at a project like https://github.com/podados/python-riakfs to see how they did it.
2) Alternatively, you can just use Riak CS (Riak Cloud Storage), to do all of the above, but the code is written for you. That's exactly how RiakCS works -- it breaks an incoming file into chunks, stores and tracks them individually in plain Riak, and reassembles them when it comes time to fetch it back. And provides an Amazon S3 API for file storage, for your convenience. I highly recommend this route (so as not to reinvent the wheel -- chunking and tracking files is hard enough). Yes, CS is a paid product, but check out the free Developer Trial, if you're curious.
Just like every other value. Why would it be different?
Use either the Erlang interface ( http://hg.basho.com/riak/src/461421125af9/doc/basic-client.txt ) or the "raw" HTTP interface ( http://hg.basho.com/riak/src/tip/doc/raw-http-howto.txt ). It should "just work."
Also, you'll generally find a better response on the riak-users mailing list than you will here. http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com (No offense to z8000, who seems to also have answers.)
What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random data, for example, will tell you that this part is probably compressed/encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings and so on. Try to spot various offsets. Try to convert parts of the binary into 2 or 4 byte integers or into floats, print them and see if they make sence. Write some functions that will search for repeating or very similar parts in the data, this way you can easily spot headers.
Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni; to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which can be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (# ACM digital library)
Abstract
Recent work has established the importance of automatic reverse
engineering of protocol or file format specifications. However, the
formats reverse engineered by previous tools have missed important
information that is critical for security applications. In this
paper, we present Tupni, a tool that can reverse engineer an input
format with a rich set of information, including record sequences,
record types, and input constraints. Tupni can generalize the format
specification over multiple inputs. We have implemented a
prototype of Tupni and evaluated it on 10 different formats: five
file formats (WMF, BMP, JPG, PNG and TIF) and five network
protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all
record sequences in the test inputs. We also show that, by aggregating
over multiple WMF files, Tupni can derive a more complete
format specification for WMF. Furthermore, we demonstrate the
utility of Tupni by using the rich information it provides for zeroday
vulnerability signature generation, which was not possible with
previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free, works on all platforms (Win, Mac, Linux), but as it's personal tool which I just released to the public to share it, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
There is Hachoir which is a Python library for parsing any binary format into fields, and then browse the fields. It has lots of parsers for common formats, but you can also write own parsers for your files (eg. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). Looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this using python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine managment computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.