I'm trying to find a standard (or, at least, not completely obscure) binary data format with which to represent an X/Y/Z+R/G/B point cloud, with the added requirement that I need to have some additional metadata attached to each point. Specifically, I need to attach zero or more "source" attributes, each of which is an integer, to each point.
Is there an existing binary data format which is well-suited to this? Or, perhaps, would it be wiser for me to go with two separate data files, where the metadata just refers to the points in the cloud by their index into the full list of points?
From what I can tell, the PLY format allows arbitrary length lists of attributes attached to each element and can have either ASCII or "binary" format:
http://www.mathworks.com/matlabcentral/fx_files/5459/1/content/ply.htm
As far as I know, PCD only allows a fixed number of fields:
http://pointclouds.org/documentation/tutorials/pcd_file_format.php
Related
bq --location=US load --source_format=NEWLINE_DELIMITED_JSON --autodetect ERIC_KOLOTYLUK_BQ_POC_DATASET.Test2 small_data_clean.jsonl
seems to work well if the JSON is very clean, but very fragile with abstruse diagnostics when the JSON is not very clean. Well, the feature is still experimental, so no point in complaining. For example, if JSON property names contain - characters, while these are valid JSON, they are not valid BigQuery Column Names.
My questions is, are there some existing tools/utilities for ingesting generic JSON into BigQuery that works better than --autodetect?
Presumably Google will improve --autodetect over time, but for now I am looking for any advice/experience people may have. I have already written some code to replace - with _ in property names, so I was wondering if other people have similarly creates tools/utilities...
As you've mentioned for a sample scenario, you have a JSON property that has "-" for a column name, however per this Specifying a Schema Documentation,
A column name must contain only letters (a-z, A-Z), numbers (0-9), or
underscores (_), and it must start with a letter or underscore.
Any characters non-compliant with the above will result to error on creation of columns during the definition of schema.
On the other hand, when using --autodetect, BigQuery's official documentation on Auto-Detection already has a disclaimer saying,
When BigQuery detects schemas, it might, on rare occasions, change a
field name to make it compatible with BigQuery SQL syntax.
Since there is no other tools yet available to auto-correct/format JSON data to fit the BigQuery requirements for schema definition, the best approach for this kind of scenario is to write a code to replace unwanted characters on JSON data's column names, in which you already did.
What is the standard way to represent a list/array value in CSV? For example, given this source data in JSON:
[
{
'name': 'Harry',
'subjects': ['math', 'english', 'history']
}
]
My guess as to a CSV representation would be:
name,subjects
Harry,["math","english","history"]
However that doesn't get parsed correctly (with the standard Python CSV parser).
One option, though this is almost always a hack and should be avoided unless truly necessary, is to choose a delimiter that you know will never show up in your data. For example:
name,subject
Harry,math|english|history
Of course you will have to manually handle splitting this string and turning it back into a list. Existing CSV parsers should not support this, because this concept fundamentally does not make sense in CSV.
And of course, this does not generalize well - what happens in the future when you need to store a 2D list, or a dict, or you realize you do need that delimiter character after all?
The root problem here is that CSV is a tabular format, whereas JSON is a hierarchical format. Rather than trying to "squeeze" one format to fit into a fundamentally incompatible format, you should instead normalize your data into a tabular representation. One example of how this could look:
name,subject
Harry,math
Harry,english
Harry,history
I have a Python script which collects data and sends it to my MySQL table.
I noticed that the "Cost" sometimes is 0,95 which results in 0 in my table since my table use "0.95" instead of "0,95".
I assume the best solution is to convert the , to . in my Python script by using:
variable.replace(",", ".")
However, couldn't one solution be to change format in my MySQL table? So that I store numbers in this format:
1100
0,95
0,1
150000
My Django Model
cost = models.DecimalField(max_digits=10, decimal_places=4, default=None)
Any feedback on how to best solve this issue?
Thanks
Your first instinct is correct: convert the "unusual" (comma-decimal) input into the standard format that MySQL used by default (dot-decimal) at the first point where you receive it.
there's lots of ways to write numbers
Be careful, though that you don't get stung by people using commas as thousands separators like "3,203,907.23", or the European form "3.203.907,23", the Swiss "3'203'907,23' or even this form, which is widely used in India: "32,03,907.71" (yes, I did mean to type only two digits there!)
To make your life easier, the rule for currencies is relatively simple:
where a dot or comma is followed by only two digits at the end of the string, that character is acting as the decimal separator.
Once you know which is the decimal separator, you can safely remove all other non-digits from the string, change the decimal separator you found to . then use any standard library string-to-number conversion.
Storage format isn't presentation format
Yes, you can tell MySQL to use comma as its decimal separator, but doing that will break so much of your code - including the parts of the framework that read from the database and expect dot-decimal numbers - that you'll regret doing it that way very quickly...
There's a general principle at work here: you should do your data storage and processing using a format that is easy to process, interchangeable with other systems, and understood by other software developers.
Consider what happens if you need to allow a different framework to access your MySQL database to generate reports... whoever develops that software (and it may be you) will be glad that the numbers are all stored the way numbers are "always" stored in databases.
Convert on the way in, re-convert on the way out
Where you need to accept input in a different format, convert that input into your standardised format as early as possible.
When you need to use an output format, do the conversion to that format as late as possible.
The idea is to keep as much of your system "unexceptional" as possible. A programmer who has to remember what numeric format will in force at the time when a given method is called is not a happy programmer.
P.S.
The option you're talking about in MySQL is an example of this pattern: it doesn't change how numeric data is stored. All that changes is how you pass numbers to MySQL and how it presents them back to you.
If I have two properties:
foo=1
bar=2345
Is there a way to specify that foo is a number and bar is a string?
I assume: bar="2345" would do but I wonder if there's a widely accepted convention
A properties file is a text file which contains data in some standard format, which can be read by the application using it. It is mostly used for configuration of the application and also for internationalization.
As per the wiki document https://en.wikipedia.org/wiki/.properties
Each parameter is stored as a pair of strings, one storing the name of
the parameter (called the key), and the other storing the value.
There is no way to specify / force the value to be number or string only (instead it is always a string). It is majorly the functionality of the framework / application which; while reading the properties file tries to parse the values. If it fails to parse the value (of certain specific type like number) it may fallback to some default value or will simply terminate the program.
Trying to scrape a webpage, I hit the necessity to work with ASP.NET's __VIEWSTATE variables. So, ever the optimist, I decided to read up on those variables, and their formats. Even though classified as Open Source by Microsoft, I couldn't find any formal definition:
Everybody agrees the first step to do is decode the string, using a Base64 decoder. Great - that works...
Next - and this is where the confusion sets in:
Roughly 3/4 of the decoders seem to use binary values (characters whose values indicate the the type of field which is follow). Here's an example of such a specification. This format also seems to expect a 'signature' of 0xFF 0x01 as first two bytes.
The rest of the articles (such as this one) describe a format where the fields in the format are separated (or marked) by t< ... >, p< ... >, etc. (this seems to be the case of the page I'm interested in).
Even after looking at over a hundred pages, I didn't find any mention about the existence of two formats.
My questions are: Are there two different formats of __VIEWSTATE variables in use, or am I missing something basic? Is there any formal description of the __VIEWSTATE contents somewhere?
The view state is serialized and deserialized by the
System.Web.UI.LosFormatter class—the LOS stands for limited object
serialization—and is designed to efficiently serialize certain types
of objects into a base-64 encoded string. The LosFormatter can
serialize any type of object that can be serialized by the
BinaryFormatter class, but is built to efficiently serialize objects
of the following types:
Strings
Integers
Booleans
Arrays
ArrayLists
Hashtables
Pairs
Triplets
Everything you need to know about ViewState: Understanding View State