I'm integrating two poorly documented systems, and in the process I've come across a strange data format I haven't seen before. It's stored as plain text in the db with no indication as to what the format is and how to deal with it.
a:17:{s:2:"id";s:27:"145219921F990B11C39E7220000";s:16:"purchase_country";s:2:"no";s:17:"purchase_currency";s:3:"nok";s:6:"locale";s:5:"nb-no";s:6:"status";s:17:"checkout_complete";s:9:"reference";s:27:"145212221F990B11C39E7221000";s:11:"reservation";s:10:"2348226550";s:10:"started_at";s:25:"2014-04-04T10:40:55+02:00";s:12:"completed_at";s:25:"2014-04-02T10:41:11+02:00";s:16:"last_modified_at";s:25:"2014-04-02T10:41:11+02:00";s:10:"expires_at";s:25:"2014-04-16T10:41:11+02:00";s:4:"cart";a:4:{s:25:"total_price_excluding_tax";i:489500;s:16:"total_tax_amount";i:0;s:25:"total_price_including_tax";i:489500;s:5:"items";a:2:{i:0;a:10:{s:9:"reference";s:2:"68";s:4:"name";s:21:"1.OSO SUPER S 200LIT.";s:8:"quantity";i:1;s:10:"unit_price";i:695000;s:8:"tax_rate";i:0;s:13:"discount_rate";i:0;s:4:"type";s:8:"physical";s:25:"total_price_including_tax";i:695500;s:25:"total_price_excluding_tax";i:694000;s:16:"total_tax_amount";i:0;}i:1;a:10:{s:9:"reference";s:2:"68";s:4:"name";s:32:"1.OSO SUPER S 200LIT. (discount)";s:8:"quantity";i:1;s:10:"unit_price";i:-205100;s:8:"tax_rate";i:0;s:13:"discount_rate";i:0;s:4:"type";s:8:"physical";s:25:"total_price_including_tax";i:-205100;s:25:"total_price_excluding_tax";i:-205100;s:16:"total_tax_amount";i:0;}}}s:8:"customer";a:1:{s:4:"type";s:6:"person";}s:16:"shipping_address";a:8:{s:10:"given_name";s:13:"Testperson-no";s:11:"family_name";s:8:"Approved";s:14:"street_address";s:18:"Sæffleberggate 56";s:11:"postal_code";s:4:"0563";s:4:"city";s:4:"OSLO";s:7:"country";s:2:"no";s:5:"email";s:32:"omitted#testdrive.klarna.com";s:5:"phone";s:11:"40 12 34 56";}s:15:"billing_address";a:8:{s:10:"given_name";s:13:"Testperson-no";s:11:"family_name";s:8:"Approved";s:14:"street_address";s:18:"Sæffleberggate 56";s:11:"postal_code";s:4:"0563";s:4:"city";s:4:"OSLO";s:7:"country";s:2:"no";s:5:"email";s:32:"checkout-no#testdrive.klarna.com";s:5:"phone";s:11:"40 12 34 56";}s:7:"options";a:1:{s:31:"allow_separate_shipping_address";b:0;}s:8:"merchant";a:5:{s:2:"id";s:4:"1601";s:9:"terms_uri";s:95:"omitted";s:12:"checkout_uri";s:59:"omitted";s:16:"confirmation_uri";s:220:"omitted";s:8:"push_uri";s:229:"omitted";}}
An entry consists of colon-separated segments:
A single char type tag (array, object, int, decimal, bool, string)
A number that says how long the value is in characters, bytes, elements (in case of arrays) or key-value pairs (in case of objs), which seems completely useless given that this is a textual format that requires me to parse the length segment anyway. This isn't present for integers and decimals.
Value of the field
Key-value pairs seem to be a flat list of an even number of elements. They also seem to be using arrays as objects as well (see example).
A ; terminator, which seems not to be necessary for objects and arrays, just to make parsing more tedious.
Now, parsing this thing is reasonably easy, but I'm constantly being surprised by new data types and their weird syntax and I'm not sure that I've covered all the edge cases with the few data samples I've analyzed. Is anyone familiar with this format?
Looks like PHP serialization. See: http://www.phpinternalsbook.com/classes_objects/serialization.html
Related
What is the standard way to represent a list/array value in CSV? For example, given this source data in JSON:
[
{
'name': 'Harry',
'subjects': ['math', 'english', 'history']
}
]
My guess as to a CSV representation would be:
name,subjects
Harry,["math","english","history"]
However that doesn't get parsed correctly (with the standard Python CSV parser).
One option, though this is almost always a hack and should be avoided unless truly necessary, is to choose a delimiter that you know will never show up in your data. For example:
name,subject
Harry,math|english|history
Of course you will have to manually handle splitting this string and turning it back into a list. Existing CSV parsers should not support this, because this concept fundamentally does not make sense in CSV.
And of course, this does not generalize well - what happens in the future when you need to store a 2D list, or a dict, or you realize you do need that delimiter character after all?
The root problem here is that CSV is a tabular format, whereas JSON is a hierarchical format. Rather than trying to "squeeze" one format to fit into a fundamentally incompatible format, you should instead normalize your data into a tabular representation. One example of how this could look:
name,subject
Harry,math
Harry,english
Harry,history
What's the purpose (not what it becomes) of doing json_encode on this before I am putting into the database
rating: {cleanliness: 3, publicFacility: 1, roomFacility: 2, security: 2}
to become this
rating: "{"cleanliness":3,"publicFacility":1,"roomFacility":2,"security":2}"
I see no point of doing this cause I need to json_decode it again before serving it back... can anybody clear me out?
Do not store json encoded data in the database. You mitigate the whole point of a relational database this way and make searching for values an expensive task. I see in your sample the attributes cleanliness, publicFacility, roomFacility and security. Those should be columns in your database so you can search for something like "all entries with a cleanliness higher than 3".
It works with the JSON column type but it is more expensive than using normal columns.
Edit: Check the use-case for your database entry. If you are sure you never need to search in or order by the encoded attributes you can store data encoded as json string. However, if your database supports the JSON column type, you should use that one because it allows searching in the stored JSON (but is more expensive than searching in normal columns). </Edit>
Second point: The second code snipped (with the quotation marks) looks like invalid syntax for json.
I just came across a JSON wannabe that decides to "improve" it by adding datatypes... of course, the syntax makes it nearly impossible to google.
a:4:{
s:3:"cmd";
s:4:"save";
s:5:"token";
s:22:"5a7be6ad267d1599347886";
}
Full data is... much larger...
The first letter seems to be a for array, s for string, then the quantity of data (# of array items or length of string), then the actual piece of data.
With this type of syntax, I currently can't Google meaningful results. Does anyone recognize what god-forsaken language or framework this is from?
Note: some genius decided to stuff this data as a single field inside a database, and it included critical fields that I need to perform aggregate functions on. The rest I can handle if I can get a way to parse this data without resorting to ugly serial processing.
If this can be parsed using MSSQL 2008 that results in a view, I'll throw in a bounty...
I would parse it with a UDF written in .NET - https://learn.microsoft.com/en-us/sql/relational-databases/clr-integration-database-objects-user-defined-functions/clr-user-defined-functions
You can either write a custom aggregate function to parse and calculate these nutty fields, or a scalar value function that returns the field as JSON.
I'd probably opt for the latter in the name of separation of concerns.
I have a CSV file which I want to convert to Parquet for futher processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but semicolon delimited file, and the numbers are formatted with German notation, i.e. comma is used as decimal delimiter.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in different Locale or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in Spark code and one nasty thing that seems to be causing issue is in CSVInferSchema.scala line 268 (spark 2.1.0) - the parser enforces US formatting rather than e.g. rely on the Locale set for the JVM, or allowing configuring this somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
Trying to scrape a webpage, I hit the necessity to work with ASP.NET's __VIEWSTATE variables. So, ever the optimist, I decided to read up on those variables, and their formats. Even though classified as Open Source by Microsoft, I couldn't find any formal definition:
Everybody agrees the first step to do is decode the string, using a Base64 decoder. Great - that works...
Next - and this is where the confusion sets in:
Roughly 3/4 of the decoders seem to use binary values (characters whose values indicate the the type of field which is follow). Here's an example of such a specification. This format also seems to expect a 'signature' of 0xFF 0x01 as first two bytes.
The rest of the articles (such as this one) describe a format where the fields in the format are separated (or marked) by t< ... >, p< ... >, etc. (this seems to be the case of the page I'm interested in).
Even after looking at over a hundred pages, I didn't find any mention about the existence of two formats.
My questions are: Are there two different formats of __VIEWSTATE variables in use, or am I missing something basic? Is there any formal description of the __VIEWSTATE contents somewhere?
The view state is serialized and deserialized by the
System.Web.UI.LosFormatter class—the LOS stands for limited object
serialization—and is designed to efficiently serialize certain types
of objects into a base-64 encoded string. The LosFormatter can
serialize any type of object that can be serialized by the
BinaryFormatter class, but is built to efficiently serialize objects
of the following types:
Strings
Integers
Booleans
Arrays
ArrayLists
Hashtables
Pairs
Triplets
Everything you need to know about ViewState: Understanding View State