I have JSON data from a Firebase backup.
The data is generated such that every top-level key is preceded by a hyphen.
Sample data is as follows:
"-GuGCJDEprMKczAMDUj8":{"deviceId":"399a649c6cee6209","dow":"Thursday","downloadFlag":"N","event":"streamStart","halfHourFull":"18h1","liveFlag":"Y","localDate":"2009-01-01","localHalfHour":1,"minutesSinceMidnight":1080,"quarterHourFull":"18q1","stationName":"hit 105","streamListenMethod":"Headphones","timestampLocal":"2009-01-01T18:00:33.679+10:00","timestampUTC":"2009-01-01T08:00:33.679Z"}
When we try to load this data into BigQuery, we get the following error:
Fields must contain only letters, numbers, and underscores, start with
a letter or underscore, and be at most 128 characters long.
Is this a BigQuery limitation?
If so, what is the proposed solution?
Any help or suggestion is much appreciated.
You need to use different field names instead. One option is to load the data into a single STRING column, e.g. by using 'CSV' for the format with a field delimiter of '|' (or any other character that doesn't appear in your data). Then you can use the JSON_EXTRACT_SCALAR function to extract fields from the JSON, e.g.:
CREATE TABLE dataset.table AS
SELECT
JSON_EXTRACT_SCALAR(json_string, '$.-GuGCJDEprMKczAMDUj8.deviceId') AS deviceId,
JSON_EXTRACT_SCALAR(json_string, '$.-GuGCJDEprMKczAMDUj8.dow') AS dow,
JSON_EXTRACT_SCALAR(json_string, '$.-GuGCJDEprMKczAMDUj8.downloadFlag') AS downloadFlag,
...
FROM dataset.single_column_table
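If preprocessing the backup is an option, an alternative to the single-column load (not what the query above does) is to move each hyphen-prefixed key into an ordinary field and write newline-delimited JSON. A minimal sketch, assuming the export is one object keyed by push ID; the column name push_id and the helper names are made up for illustration:

```python
import json


def firebase_to_rows(export):
    """Turn {"-pushId": {...}, ...} into a list of flat rows."""
    rows = []
    for push_id, record in export.items():
        row = {"push_id": push_id}  # hypothetical column name for the old key
        row.update(record)
        rows.append(row)
    return rows


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows)


sample = json.loads(
    '{"-GuGCJDEprMKczAMDUj8": {"deviceId": "399a649c6cee6209",'
    ' "dow": "Thursday", "event": "streamStart"}}'
)
rows = firebase_to_rows(sample)
print(to_ndjson(rows))
```

Every field name in the output then starts with a letter, so the original key survives as data rather than as a column name.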
What exactly is the format for Hive LazySimpleSerDe?
A format like ParquetHiveSerDe tells me that Hive will read the HDFS files in parquet format.
But what is LazySimpleSerDe? Why not call it something explicit like CommaSepHiveSerDe or TabSepHiveSerDe, given LazySimpleSerDe is for delimited files?
LazySimpleSerDe is a fast and simple SerDe. It does not recognize quoted values, but it can work with different delimiters, not only commas; the default field delimiter is Ctrl-A (\001). If you specify STORED AS TEXTFILE in the table DDL, LazySimpleSerDe will be used. For quoted values use OpenCSVSerDe; it is not as fast as LazySimpleSerDe but works correctly with quoted values.
LazySimpleSerDe is simple for the sake of performance. It also creates objects lazily, which improves performance further, and this is why it is preferable when possible (for text files).
See this example with pipe-delimited (|) file format: https://stackoverflow.com/a/68095278/2700344
The SHOW CREATE TABLE command for such a table prints the serde class as org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe; STORED AS TEXTFILE is a shortcut.
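To illustrate the difference between a plain delimiter split and quote-aware parsing, here is a small Python sketch. It only mimics the two behaviours with the standard library; it is not Hive code, and the sample line is made up:

```python
import csv
import io

# A pipe-delimited line whose second field contains the delimiter inside quotes.
line = '1|"Smith | Sons"|NY'

# Plain split on the delimiter: quotes get no special treatment,
# so the quoted field is cut in two.
naive = line.split("|")

# Quote-aware parsing: the delimiter inside quotes stays part of the field.
parsed = next(csv.reader(io.StringIO(line), delimiter="|", quotechar='"'))

print(naive)
print(parsed)
```

The first result has four pieces, the second has the intended three fields.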
I am working on some benchmarks and need to compare ORC, Parquet and CSV formats. I have exported TPC/H (SF1000) to ORC based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try a similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other datatypes (e.g. bigint) to text and store the column format in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable at the SF1000 scale factor. Another disadvantage is that my Hive metastore wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Does anyone have an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain varchar columns only.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar) AS regionkey, CAST(name AS varchar) AS name, CAST(comment AS varchar) AS comment
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw up!
It is possible to create typed CSV-ish tables when using the TEXTFILE format and use ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. TEXTFILE normally uses the ^A character (ASCII 1) as the field separator, but when it is set to ',' the files have the same structure as CSV.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be inserted correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text will be copied unmodified to the file without escaping quotes or commas. This should have been written like this to be proper CSV:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...
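The difference between the two outputs can be reproduced outside Trino. The following Python sketch (standard library only, not Trino code) writes the same row once with a plain join, which is what the unescaped output looks like, and once with a CSV writer that quotes and escapes:

```python
import csv
import io

row = [1, 'A "quote", with comma', "The comment contains a newline\nin it"]

# Plain join: values are copied unmodified, nothing is quoted or escaped.
naive = ",".join(str(v) for v in row)

# CSV writer: fields containing quotes, commas or newlines are quoted,
# and embedded double quotes are doubled.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
quoted = buf.getvalue().rstrip("\n")

# Only the quoted form can be parsed back into the original three fields.
roundtrip = next(csv.reader(io.StringIO(quoted)))

print(naive)
print(quoted)
print(roundtrip)
```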
I have thousands of JSON files on Google Cloud Storage, and they have a specific field name (campaign name)
with a space in it. Before loading them (or creating an external table) in BigQuery, I need to replace the space with an underscore (campaign_name). I'm getting the following error when I try to create the table without the replacement:
Error in query string: Illegal field name: campaign name Table: raw_km_all_data
Is there any solution that does not involve downloading all the files to a server, doing the replacement, and then uploading them again to Cloud Storage?
Thanks!
You can pretend that these JSON files are CSV with a single column containing one big string. Then, once the data is loaded into BigQuery as a single-column table, use the REPLACE or REGEXP_REPLACE functions to replace spaces with underscores. After that you can use the JSON_EXTRACT family of functions to parse the JSON and populate a table with real columns.
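One thing to watch for is that a blanket replacement of spaces would also change the values, not just the field names. The intended transformation, renaming keys only, can be sketched in Python as follows (an illustration of the logic, not BigQuery code; the helper name is made up):

```python
import json


def sanitize_keys(value):
    """Recursively replace spaces in object keys with underscores."""
    if isinstance(value, dict):
        return {k.replace(" ", "_"): sanitize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(v) for v in value]
    return value  # strings, numbers, booleans and null are left untouched


raw = '{"campaign name": "spring sale", "clicks": 12}'
clean = json.dumps(sanitize_keys(json.loads(raw)))
print(clean)
```

In SQL, the same effect would need a pattern that targets only the key, for example the literal "campaign name" including its quotes.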
And so we found a 3.6GB csv that we have uploaded onto S3 and now want to import into Redshift, then do the querying and analysis from iPython.
Problem 1:
This comma-delimited file contains free-text values that also contain commas. This interferes with the delimiting, so we can't load it into Redshift.
When we tried opening the sample dataset in Excel, Excel surprisingly put the values into columns correctly.
Problem 2:
A column that is supposed to contain integers has some records containing letters to indicate some other scenario.
So, the only way to get the import through is to declare this column as varchar. But then we can't do calculations later on.
Problem 3:
The datetime data type requires the date time value to be in the format YYYY-MM-DD HH:MM:SS, but the csv doesn’t contain the SS and the database is rejecting the import.
We can’t manipulate the data on a local machine because it is too big, and we can’t upload onto the cloud for computing because it is not in the correct format.
The last resort would be to scale the instance running iPython all the way up so that we can read the big csv directly from S3, but this approach doesn’t make sense as a long-term solution.
Your suggestions?
Train: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train.csv (3.4GB)
Train Sample: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train-sample.csv (133MB)
Try using a different delimiter or escape characters.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_preparing_data.html
For the second issue, if you want to extract only the numbers from the column after loading it as char, use regexp_replace or other functions.
For the third issue, you can likewise load it into a VARCHAR field and then use string functions such as cast(left(column_name, 10)||' '||right(column_name, 6)||':00' as timestamp)
to load it into the final table from the staging table.
For the first issue, you need to find out a way to differentiate between the two types of commas - the delimiter and the text commas. Once you have done that, replace the delimiters with a different delimiter and use the same as delimiter in the copy command for Redshift.
For the second issue, you first need to figure out whether this column is needed for numerical aggregations once loaded. If yes, you need to get this data cleaned up before loading. If no, you can load it directly as a char/varchar field. All your queries will still work, but you will not be able to do any aggregations (sum, avg and the like) on this field.
For problem 3, you can use Text(date, "yyyy-mm-dd hh:mm:ss") function in excel to do a mass replace for this field.
Let me know if this works out.
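The three fixes can also be applied in a single streaming pass, so the file never has to fit in memory. This is a rough Python sketch under several assumptions: the file has no header row, the timestamps look like YYYY-MM-DD HH:MM, and the column positions (int_col, ts_col) are placeholders you would have to set for the real file:

```python
import csv
import io


def clean_rows(src, dst, int_col, ts_col):
    """Stream rows from src to dst, fixing delimiter, integers and timestamps."""
    reader = csv.reader(src)  # understands commas inside quoted fields
    writer = csv.writer(dst, delimiter="|", lineterminator="\n")
    for row in reader:
        # Problem 2: blank out non-numeric markers in the integer column.
        if not row[int_col].strip().isdigit():
            row[int_col] = ""
        # Problem 3: add the missing seconds to "YYYY-MM-DD HH:MM".
        if len(row[ts_col]) == 16:
            row[ts_col] += ":00"
        # Problem 1: written with '|' so commas in text are no longer delimiters.
        writer.writerow(row)


src = io.StringIO(
    '1,"free text, with comma",42,2012-01-01 10:15\n'
    "2,plain,n/a,2012-01-02 11:30\n"
)
dst = io.StringIO()
clean_rows(src, dst, int_col=2, ts_col=3)
print(dst.getvalue())
```

Fields that still need quoting (for example text containing '|' or a newline) are quoted by the csv module, so the COPY options would have to match the output format.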
I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quote) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes have commas themselves, making it difficult to key in to the commas for search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file with 131246 lines that I got from data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
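As a rough illustration of the library approach, here is a Python sketch that parses one line with the csv module and builds an INSERT statement, quoting text but not numbers. The input line is a cleaned-up guess at what the original, unedited CSV looks like, and the table name fema_events is made up:

```python
import csv
import io
import re

# Digits with optional thousands separators and decimals, e.g. 25 or 412,007.74
NUMBER = re.compile(r"-?\d[\d,]*(\.\d+)?")


def sql_literal(field):
    """Return numbers bare (commas stripped) and everything else single-quoted."""
    if NUMBER.fullmatch(field):
        return field.replace(",", "")
    return "'" + field.replace("'", "''") + "'"  # double any embedded quote


line = ('1239,1998-08-26,Severe Storm(s),Texas,Val Verde,'
        '"DEL RIO, PARKS",No,25,"412,007.74"')
fields = next(csv.reader(io.StringIO(line)))
values = ", ".join(sql_literal(f) for f in fields)
statement = f"INSERT INTO fema_events VALUES ({values});"  # hypothetical table
print(statement)
```

For anything beyond a one-off conversion, passing the parsed fields to a database driver as query parameters is safer than building SQL strings.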
I think I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: the first command looks, after each comma, for either a double-quoted run of characters or (\|) a run of characters containing no comma or double quote, and wraps that run in single quotes.
The second command does the same for the first column of each row, which begins at the start of the line (^) rather than after a comma.
Try the csv plugin. It lets you convert the data into other formats. The help includes an example of how to convert the data for importing it into a database.
Just to bring this to a close, I ended up using @Eric Andres' idea, which was the MySQL LOAD DATA option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I were to do it by hand.
When I commented that the LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out to not be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all data has to be in place within the file to load before the load, or the load will not work.
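The quick script is not shown above; a minimal Python sketch of the same idea, assuming the ID is simply a running line number starting at 1, could look like this:

```python
import io


def prepend_ids(src, dst, start=1):
    """Copy lines from src to dst, prefixing each with a running ID and a comma."""
    for line_id, line in enumerate(src, start=start):
        dst.write(f"{line_id},{line}")


src = io.StringIO("1998-08-26,Texas\n1998-08-27,Ohio\n")
dst = io.StringIO()
prepend_ids(src, dst)
print(dst.getvalue())
```

With real files, src and dst would be file objects opened for reading and writing.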
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.