I am using Telegraf to insert a huge CSV file that contains 4 years of data into my InfluxDB database.
I am facing a problem: the contents of the CSV file's columns are very badly arranged.
I'm looking for a way to parse a field key and create multiple tag values based on that field key.
Let's take the following example:
Column "date":"1527151459"
Column "meter_a10$1_power_L123_factor_min":"-0.990000"
In Influx Line Protocol the result would be:
Measurement meter_a10$1_power_L123_factor_min=-0.990000 1527151459
But I would like to parse the Field Key to get new Tag Values, so the Influx Line Protocol would look like this:
Measurement,meter="a10$1",type="power",modbus="L123",power="factor" min=-0.990000 1527151459
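A minimal sketch of one way this rewriting could be done, using Telegraf's starlark processor and assuming every such field key splits on underscores into the same six parts (the tag names below simply mirror the example above):
[[processors.starlark]]
  source = '''
def apply(metric):
    # Copy the field items first so metric.fields can be modified inside the loop
    pairs = metric.fields.items()
    for key, value in pairs:
        parts = key.split("_")  # e.g. ["meter", "a10$1", "power", "L123", "factor", "min"]
        if len(parts) == 6 and parts[0] == "meter":
            metric.fields.pop(key)
            metric.tags["meter"] = parts[1]
            metric.tags["type"] = parts[2]
            metric.tags["modbus"] = parts[3]
            metric.tags["power"] = parts[4]
            metric.fields[parts[5]] = value  # becomes min=-0.990000
    return metric
'''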
I have the following JSON stored in S3:
{"data":"this is a test for firehose"}
I have created the table test_firehose with a VARCHAR column data, and a file format called JSON with type JSON and the rest left at default values. I want to copy the content from S3 to Snowflake, and I have tried the following statement:
COPY INTO test_firehose
FROM 's3://s3_bucket/firehose/2020/12/30/09/tracking-1-2020-12-30-09-38-46'
FILE_FORMAT = 'JSON';
And I receive the error:
SQL compilation error: JSON file format can produce one and only one column of type
variant or object or array. Use CSV file format if you want to load more than one column.
How could I solve this? Thanks.
If you want to keep your data as JSON (rather than just as text), then you need to load it into a column with a data type of VARIANT, not VARCHAR.
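For example, a minimal sketch reusing the names from the question (the final SELECT is only illustrative):
-- Recreate the target table with a VARIANT column instead of VARCHAR
CREATE OR REPLACE TABLE test_firehose (data VARIANT);
COPY INTO test_firehose
FROM 's3://s3_bucket/firehose/2020/12/30/09/tracking-1-2020-12-30-09-38-46'
FILE_FORMAT = (FORMAT_NAME = 'JSON');
-- Individual attributes can then be queried with the path syntax
SELECT data:data::STRING FROM test_firehose;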
I am writing a dataset to json using:
ds.coalesce(1).write.format("json").option("nullValue",null).save("project/src/test/resources")
For records that have columns with null values, the JSON document does not write those keys at all.
Is there a way to force null-valued keys into the JSON output?
This is needed because I use this JSON to read it into another dataset (in a test case) and cannot enforce a schema if some documents do not have all the keys of the case class (I read it by putting the JSON file under the resources folder and transforming it into a dataset via RDD[String], as explained here: https://databaseline.bitbucket.io/a-quickie-on-reading-json-resource-files-in-apache-spark/).
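For context, a minimal sketch of that read-back step, assuming a SparkSession named spark is in scope; the case class and resource path are only illustrative:
import scala.io.Source
import spark.implicits._
// Hypothetical case class the test expects the JSON to match
case class Record(a: String, b: String)
// Read the resource file line by line and let Spark infer the schema from the JSON,
// so keys that were dropped at write time are simply missing here
val lines = Source.fromInputStream(getClass.getResourceAsStream("/expected.json")).getLines().toSeq
val ds = spark.read.json(spark.createDataset(lines)).as[Record]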
I agree with @philantrovert.
ds.na.fill("")
.coalesce(1)
.write
.format("json")
.save("project/src/test/resources")
Since Datasets are immutable, you are not altering the data in ds, and you can process it (complete with null values and all) in any following code. You are simply replacing the null values with an empty string in the saved file.
Since PySpark 3, one can use the ignoreNullFields option when writing to a JSON file.
spark_dataframe.write.json(output_path, ignoreNullFields=False)
PySpark docs: https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json
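As a small self-contained illustration (the example data and output path are made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("x", None), ("y", "z")], ["a", "b"])
# ignoreNullFields=False keeps null-valued keys in the written documents
df.write.json("/tmp/json_with_nulls", ignoreNullFields=False)
# Produces records like {"a":"x","b":null} instead of dropping "b"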
I am loading JSON data from S3 into Redshift using the COPY command:
COPY <table> FROM '<s3 path of input json data>'
CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
GZIP format as json '<s3 path of jsonpathfile>';
What I want is for the COPY command to fail, or at least raise a warning, when a column of the Redshift table is missing as a key in the input JSON. That is, if the table has, say, two columns A and B and the input JSON has only the key B, COPY should fail or raise a warning saying that column A is missing. Right now it sets the missing column A to NULL and copies the remaining columns (B). One obvious workaround is to declare all the columns as NOT NULL when creating the table, but I do not want to do that because my data can contain a JSON document like {"A" : null, "B" : "something"}. In that case key A is indeed present, yet the load would still fail because the value is NULL while the schema says NOT NULL. I want it to fail only when I receive a JSON document such as {"B" : "something"}, i.e. when key A is not present at all.
Is there any other neat way to achieve that using the COPY command?
I also tried the auto option as follows, with the same results:
COPY <table> FROM '<s3 path of input json data>'
CREDENTIALS 'aws_access_key_id=<>;aws_secret_access_key=<>'
GZIP format as json 'auto';
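For reference, a sketch of the NOT NULL workaround mentioned above (the table name and column types are made up), which shows why it is too strict:
CREATE TABLE example_table (
    A VARCHAR(64) NOT NULL,  -- rejects {"B": "something"} (key A missing) ...
    B VARCHAR(64) NOT NULL   -- ... but also rejects {"A": null, "B": "something"} (A present but null)
);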
I have a file named key and another CSV file named val.csv. As you can imagine, the file named key looks something like this:
123
012
456
The file named val.csv has multiple columns and corresponding values. It looks like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
9,0,0,452,K,p
1,2,2,000,L,x
I would like to get the subset of lines from val.csv whose value in the KEY column matches one of the values in the key file. Using the above example, I would like to get output like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
Obviously these are just toy examples. The real key file I am using has nearly 500,000 'keys' and the val.csv file has close to 5 million lines in it. Thanks.
$ awk -F, 'FNR==NR{k[$1]=1;next;} FNR==1 || k[$4]' key val.csv
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
How it works
FNR==NR { k[$1]=1;next; }
This saves the values of all keys read from the first file, key.
The condition is FNR==NR. FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, if FNR==NR, we are still reading the first file.
When reading the first file, key, this saves each key as an index of the associative array k. The next statement then skips the rest of the commands and starts over on the next line.
FNR==1 || k[$4]
If we get here, we are working on the second file.
This condition is true either for the first line of the file, FNR==1, or for lines whose fourth field is in array k. If the condition is true, awk performs the default action which is to print the line.
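To save the matching rows instead of printing them, the same command can simply be redirected to a file (filtered.csv is just an example name):
$ awk -F, 'FNR==NR{k[$1]=1;next;} FNR==1 || k[$4]' key val.csv > filtered.csv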