Actually, I'm sending data to Cosmos via Cygnus. The Cosmos directory where Cygnus puts the data is, for example, /user/myUser/mysetdata. I've created my Hive table with these columns: recvTimeTs, recvTime, entityId, entityType, attrName, attrType, attrValue.
Now I want to put data into Cosmos directly via HttpFS, into the same directory Cygnus is writing to.
What should the ".txt" file format be? Does it have to be comma-delimited? For example:
recvTimeTs;recvTime;entityId;entityType;attrName;attrType;attrValue
value;value;value;...
Hive tables contain structured data stored in files located in the HDFS folder given in the Hive table creation command.
With Cygnus 0.1, such structured data is achieved by using CSV-like files, so adding a new file to the HDFS folder or appending new data to an already existing file within that folder is as easy as composing new CSV-like lines of data. The separator character must be the same one you specified when creating the table, e.g.:
create external table <table_name> (recvTimeTs bigint, recvTime string, entityId string, entityType string, attrName string, attrType string, attrValue string) row format delimited fields terminated by '|' location '/user/<myusername>/<mydataset>/';
Thus, with | being the example separator, the new data lines must look like:
<ts>|<ts_ms>|<entity_name>|<entity_type>|<attribute_name>|<attribute_type>|<value>
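For instance, a data line with purely illustrative values could be:
13453464536|2014-02-27T14:46:21|Room1|Room|temperature|centigrade|26.5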
From Cygnus 0.2 (inclusive), the structured data is achieved by using JSON-like files. In this case you do not have to deal with separators, nor with table creation (see this question), since JSON does not use separators and the table creation is automatic. In this case, you have to compose a new file, or new data to be appended to an already existing file, following either of these formats (depending on whether you are storing the data in row or column mode, respectively):
{"recvTimeTs":"13453464536", "recvTime":"2014-02-27T14:46:21", "entityId":"Room1", "entityType":"Room", "attrName":"temperature", "attrType":"centigrade", "attrValue":"26.5", "attrMd":[{name:ID, type:string, value:ground}]}
{"recvTime":"2014-02-27T14:46:21", "temperature":"26.5", "temperature_md":[{"name":"ID", "type":"string", "value":"ground"}]}
It is worth mentioning that there are scripts in charge of converting the 0.1-like format into the 0.2-like (or higher) format.
I'm trying to load, filter, and unload some JSON files using AWS Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS
json_based_table(file_line string)
LOCATION 's3://<my_bucket>/<some_path>/';
UNLOAD
(SELECT file_line from json_based_table limit 10)
TO 's3://<results_bucket>/samples/'
WITH (format = 'JSON');
The problem is that the output is a set of files containing one JSON object per line, each with a single key "file_line" whose value is a JSON line from the original file, as a string.
How do I UNLOAD only the values of such a table (ignoring the column name I had to create in order to load the files)?
It seems that by choosing
WITH (format = 'TEXTFILE');
I can get what I want.
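As a sketch, the full statement (reusing the same hypothetical bucket placeholders as above) would be:
UNLOAD
(SELECT file_line FROM json_based_table LIMIT 10)
TO 's3://<results_bucket>/samples/'
WITH (format = 'TEXTFILE');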
Choosing JSON as the format is good for preserving the tabular structure of the table in a file, but the name was misleading in this case.
I have a CSV file with all fields quoted with ".
I have a Sink to SQL Server and the Copy Data activity is supposed to insert data into a Table directly.
Empty strings from the CSV file are not treated as NULL values in the SQL table; they are inserted as empty strings.
Unfortunately, I can't find a way to configure the Copy Data activity to change this behavior.
There is no way to do this in the Copy activity.
Some workarounds:
Use a Data Flow to change the empty strings to NULL and save the CSV file to Azure Blob Storage. Then use the Copy activity to copy it to SQL Server.
Create a stored procedure in your SQL Server to change the empty strings to NULL, and invoke it in a Stored Procedure activity (a minimal sketch follows this list).
Create a trigger to change the empty strings to NULL, and just use the Copy activity to copy the data.
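For the stored procedure option, a minimal T-SQL sketch (the table dbo.MyTable and column MyColumn are hypothetical placeholders; adapt them to your schema):
-- Hypothetical target table and column; run after the Copy activity finishes
CREATE PROCEDURE dbo.usp_EmptyStringsToNull
AS
BEGIN
    SET NOCOUNT ON;
    -- Turn empty strings inserted by the Copy activity into NULLs
    UPDATE dbo.MyTable
    SET MyColumn = NULL
    WHERE MyColumn = '';
END
The same UPDATE logic could live in a trigger instead, if you prefer the third option.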
I have a bunch of json snappy compressed files in HDFS.
They are Hadoop snappy compressed (not Python snappy, cf. other SO questions)
and have nested structures.
I could not find a method to load them into Hive (using json_tuple).
Can I get some resources/hints on how to load them?
Previous references (which do not have valid answers):
pyspark how to load compressed snappy file
Hive: parsing JSON
Put all the files in an HDFS folder and create an external table on top of it. If the file names end in .snappy, Hive will automatically recognize them. You can specify the SNAPPY output format for writing the table:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
CREATE EXTERNAL TABLE mydirectory_tbl(
id string,
name string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' --this is HDFS/S3 location
;
JSONSerDe can parse all complex structures; it is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as is. Everything in square brackets [] is an array<>, everything in {} is a struct<> or map<>, and complex types can be nested. Carefully read the README: https://github.com/rcongiu/Hive-JSON-Serde. There is a section about nested structures and many examples of CREATE TABLE.
If you still want to use json_tuple, then create a table with a single STRING column and parse it using json_tuple, but it is much more difficult.
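A minimal sketch of that json_tuple approach (the staging table name and the id/name keys are illustrative, assuming the same /mydir location as above):
-- One STRING column holding the raw JSON line
CREATE EXTERNAL TABLE mydirectory_raw (json_line STRING)
LOCATION '/mydir';

-- Extract top-level keys with json_tuple via LATERAL VIEW
SELECT t.id, t.name
FROM mydirectory_raw r
LATERAL VIEW json_tuple(r.json_line, 'id', 'name') t AS id, name;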
All JSON records should be on a single line (no newlines inside JSON objects, nor \r). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde
If your data is partitioned (e.g., by date):
Create the table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
filename STRING,
cnt BIGINT,
size DOUBLE
) PARTITIONED BY ( `date` STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs'
Recover the partitions (before the recovery, the table appears to be empty):
MSCK REPAIR TABLE database.table
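Optionally, as a quick check (standard Hive commands; the partition value is illustrative):
SHOW PARTITIONS database.table;

SELECT filename, cnt, size
FROM database.table
WHERE `date` = '2021-01-01';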
I want to upload CSV data into BigQuery. When the data has different types (like string and int), BigQuery is capable of inferring the column names from the headers, because the headers are all strings, whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row as the names of your columns.
I know I can specify the column names manually, but I would prefer not to do that, as I have a lot of tables. Is there another solution?
If your data is all of "string" type and the first row of your CSV file contains the metadata, then I guess it is easy to write a quick script that parses the first line of your CSV and generates a similar "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
14/02/2018: Update thanks to Felipe's comment:
The above can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
It's not possible with the current API. You can file a feature request in the public BigQuery tracker: https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and SELECT * EXCEPT the added extra column, but querying is not free.
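A sketch of that query approach (the table and extra column names are hypothetical):
-- BigQuery Standard SQL: drop the jagged extra column when materializing the result
SELECT * EXCEPT (extra_col)
FROM mydataset.loaded_table;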
I have thousands of JSON files on Google Cloud Storage, but they have a specific field name (campaign name) with a space in it. Before loading them (or creating an external table) in BigQuery, I need to replace the space with an underscore (campaign_name). I'm getting the following error when I try to create the table without replacing it:
Error in query string: Illegal field name: campaign name Table: raw_km_all_data
Is there any other solution that does not involve downloading all the files to a server, doing the replacement, and then uploading them again to Cloud Storage?
Thanks!
You can pretend that these JSON files are CSVs with a single column containing one big string. Then, once the data is loaded into BigQuery as a single-column table, use the REPLACE or REGEXP_REPLACE functions to replace the spaces with underscores. Then you can use the JSON_EXTRACT family of functions to parse the JSON and populate a table with real columns.
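A minimal sketch in BigQuery Standard SQL (the staging table mydataset.raw_lines and its single column line are assumptions):
SELECT
  -- Parse the fixed JSON string and pull out the renamed field
  JSON_EXTRACT_SCALAR(fixed_line, '$.campaign_name') AS campaign_name
FROM (
  SELECT REPLACE(line, '"campaign name"', '"campaign_name"') AS fixed_line
  FROM mydataset.raw_lines
);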