I have a Python script which collects data and sends it to my MySQL table.
I noticed that the "Cost" sometimes is 0,95 which results in 0 in my table since my table use "0.95" instead of "0,95".
I assume the best solution is to convert the , to . in my Python script by using:
variable.replace(",", ".")
However, couldn't one solution be to change format in my MySQL table? So that I store numbers in this format:
1100
0,95
0,1
150000
My Django Model
cost = models.DecimalField(max_digits=10, decimal_places=4, default=None)
Any feedback on how to best solve this issue?
Thanks
Your first instinct is correct: convert the "unusual" (comma-decimal) input into the standard format that MySQL used by default (dot-decimal) at the first point where you receive it.
there's lots of ways to write numbers
Be careful, though that you don't get stung by people using commas as thousands separators like "3,203,907.23", or the European form "3.203.907,23", the Swiss "3'203'907,23' or even this form, which is widely used in India: "32,03,907.71" (yes, I did mean to type only two digits there!)
To make your life easier, the rule for currencies is relatively simple:
where a dot or comma is followed by only two digits at the end of the string, that character is acting as the decimal separator.
Once you know which is the decimal separator, you can safely remove all other non-digits from the string, change the decimal separator you found to . then use any standard library string-to-number conversion.
Storage format isn't presentation format
Yes, you can tell MySQL to use comma as its decimal separator, but doing that will break so much of your code - including the parts of the framework that read from the database and expect dot-decimal numbers - that you'll regret doing it that way very quickly...
There's a general principle at work here: you should do your data storage and processing using a format that is easy to process, interchangeable with other systems, and understood by other software developers.
Consider what happens if you need to allow a different framework to access your MySQL database to generate reports... whoever develops that software (and it may be you) will be glad that the numbers are all stored the way numbers are "always" stored in databases.
Convert on the way in, re-convert on the way out
Where you need to accept input in a different format, convert that input into your standardised format as early as possible.
When you need to use an output format, do the conversion to that format as late as possible.
The idea is to keep as much of your system "unexceptional" as possible. A programmer who has to remember what numeric format will in force at the time when a given method is called is not a happy programmer.
P.S.
The option you're talking about in MySQL is an example of this pattern: it doesn't change how numeric data is stored. All that changes is how you pass numbers to MySQL and how it presents them back to you.
Related
bq --location=US load --source_format=NEWLINE_DELIMITED_JSON --autodetect ERIC_KOLOTYLUK_BQ_POC_DATASET.Test2 small_data_clean.jsonl
seems to work well if the JSON is very clean, but very fragile with abstruse diagnostics when the JSON is not very clean. Well, the feature is still experimental, so no point in complaining. For example, if JSON property names contain - characters, while these are valid JSON, they are not valid BigQuery Column Names.
My questions is, are there some existing tools/utilities for ingesting generic JSON into BigQuery that works better than --autodetect?
Presumably Google will improve --autodetect over time, but for now I am looking for any advice/experience people may have. I have already written some code to replace - with _ in property names, so I was wondering if other people have similarly creates tools/utilities...
As you've mentioned for a sample scenario, you have a JSON property that has "-" for a column name, however per this Specifying a Schema Documentation,
A column name must contain only letters (a-z, A-Z), numbers (0-9), or
underscores (_), and it must start with a letter or underscore.
Any characters non-compliant with the above will result to error on creation of columns during the definition of schema.
On the other hand, when using --autodetect, BigQuery's official documentation on Auto-Detection already has a disclaimer saying,
When BigQuery detects schemas, it might, on rare occasions, change a
field name to make it compatible with BigQuery SQL syntax.
Since there is no other tools yet available to auto-correct/format JSON data to fit the BigQuery requirements for schema definition, the best approach for this kind of scenario is to write a code to replace unwanted characters on JSON data's column names, in which you already did.
I have a CSV file which I want to convert to Parquet for futher processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but semicolon delimited file, and the numbers are formatted with German notation, i.e. comma is used as decimal delimiter.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in different Locale or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in Spark code and one nasty thing that seems to be causing issue is in CSVInferSchema.scala line 268 (spark 2.1.0) - the parser enforces US formatting rather than e.g. rely on the Locale set for the JVM, or allowing configuring this somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
I'm trying to find a standard (or, at least, not completely obscure) binary data format with which to represent an X/Y/Z+R/G/B point cloud, with the added requirement that I need to have some additional metadata attached to each point. Specifically, I need to attach zero or more "source" attributes, each of which is an integer, to each point.
Is there an existing binary data format which is well-suited to this? Or, perhaps, would it be wiser for me to go with two separate data files, where the metadata just refers to the points in the cloud by their index into the full list of points?
From what I can tell, the PLY format allows arbitrary length lists of attributes attached to each element and can have either ASCII or "binary" format:
http://www.mathworks.com/matlabcentral/fx_files/5459/1/content/ply.htm
As far as I know, PCD only allows a fixed number of fields:
http://pointclouds.org/documentation/tutorials/pcd_file_format.php
I have to generate codes with custom fields: id of field+name of field+values of the field.
How long is the data I can encode inside the QRcode? I need to know how many fields\values I can insert.
Should I use XML or JSON or CSV? What is most generic and efficient?
XML / JSON will not qualify for a QR code's alphanumeric mode since it will include lower-case letters. You'll have to use byte mode. The max is 2,953 characters. But, the practical limit is far less -- perhaps a few hundred characters.
It is far better to encode a hyperlink to data if you can.
As Terence says, no reader will do anything with XML/JSON except show it. You need a custom reader anyway to do something useful with that data. (Which suggests this is not a good use case for QR codes.) But if you're making your own reader, you can use gzip compression to make the payload much smaller. Your reader would know to unzip it.
You might get away with something workable but this is not a good approach in general.
The maximum number of alphanumeric characters you can have is 4,296. Although this will require the lowest form of error correction and will be very hard to scan.
JSON is generally more efficient at data storage than XML.
However, you will need to write your own app to scan the code - I don't know of any which will process raw JSON or XML. All the scanners will show you the text, though.
I have a MySQL query that returns a result with a single column of integers. Is there any way to get the MySQL C API to transfer this as actually integers rather than as ASCII text? For that matter is there a way to get MySQL to do /any/ of the API stuff as other than ASCII text. I'm thinking this would save a bit of time in sprintf/sscanf or whatever else is used as well as in bandwidth.
You're probably out of luck, to be honest. Looking at the MySQL C API (http://dev.mysql.com/doc/refman/5.0/en/mysql-fetch-row.html, http://dev.mysql.com/doc/refman/5.0/en/c-api-datatypes.html, look at MYSQL_ROW) there doesn't seem to be a mechanism for returning data in its actual type... the joys of using structs I guess.
You could always implement a wrapper which checks against the MYSQL_ROW's type attribute (http://dev.mysql.com/doc/refman/5.0/en/c-api-datatypes.html) and returns a C union, but that's probably poor advice; don't do that.