BigQuery export splits into multiple files with some empty files - JSON

I am trying to use the BigQuery export functionality to push data out to GCS in JSON format.
At the end of the process, in order to validate the count of exported records in the GCS files, I create an external table with schema auto-detection just to take a count of records in the exported files.
This works for single exported files. But for tables greater than 1 GB in size, I use the wildcard in the URI so that the export splits into multiple files. This results in multiple files, some of which are empty.
The empty files cause an error while querying the external table: "400 Schema has no fields".
Please suggest any ideas to:
Either make sure that empty files do not get created by the export operation in the multiple-files scenario,
Or ignore empty files when creating the external table,
Or any other way to take a count of records in GCS after the export operation.

I had the same problem but I found a workaround: it seems a TEMP TABLE does the trick.
(EDIT: reading the docs I noticed that EXPORT DATA has always been described for BigQuery tables, not for custom SELECTs. And since I never experienced empty files when exporting real tables, I gave temp tables the same chance.)
Imagine we have the following query:
EXPORT DATA OPTIONS(
  uri='gs://mybucket/extract-here/*.csv.gz'
  , format='CSV'
  , compression='GZIP'
  , overwrite=true
  , header=true
  , field_delimiter=","
) AS (
  WITH mytable AS (
    SELECT col FROM UNNEST([1,2,3,4,5,6,7,8]) AS col
  )
  SELECT * FROM mytable
);
You can rewrite it as follows:
BEGIN
  CREATE TEMP TABLE _SESSION.tmpExportTable AS (
    WITH mytable AS (
      SELECT col FROM UNNEST([1,2,3,4,5,6,7,8]) AS col
    )
    SELECT * FROM mytable
  );
  EXPORT DATA OPTIONS(
    uri='gs://mybucket/extract-here/*.csv.gz'
    , format='CSV'
    , compression='GZIP'
    , overwrite=true
    , header=true
    , field_delimiter=","
  ) AS
  SELECT * FROM _SESSION.tmpExportTable;
END;
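A side benefit of the temp table is that it also gives you a number to validate against: you can take the row count just before the export and later compare it with the count read back from the external table over the GCS files. A minimal sketch, assuming the same temp table as above (expected_row_count is just an illustrative variable name):
BEGIN
  DECLARE expected_row_count INT64;

  CREATE TEMP TABLE _SESSION.tmpExportTable AS (
    SELECT col FROM UNNEST([1,2,3,4,5,6,7,8]) AS col
  );

  -- Record how many rows are about to be exported; compare this later
  -- with the record count read back from the exported GCS files.
  SET expected_row_count = (SELECT COUNT(*) FROM _SESSION.tmpExportTable);
  SELECT expected_row_count AS rows_to_export;

  EXPORT DATA OPTIONS(
    uri='gs://mybucket/extract-here/*.csv.gz'
    , format='CSV'
    , compression='GZIP'
    , overwrite=true
    , header=true
    , field_delimiter=","
  ) AS
  SELECT * FROM _SESSION.tmpExportTable;
END;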

Related

Export SQL data to .csv and generate separate .csv files per table based on query

Currently, I am using Go to access my database. Ideally, I'd like to generate .csv based on the table's name and export data to those files based on the query.
For example, if I ran:
select t1.*,t2.* from table1 t1
inner join table2 t2 on t2.table_1_id = t1.id
where t1.linking_id = 22
I'd like a .csv file generated for both table1 and table2, where the data from each table is exported into its own file named after that table.
I know in PHP I can use
$fp = fopen(getcwd().'/table1.csv', 'w');
fputcsv($fp, $columns);
to generate the .csv files with the table's column names. But I don't believe Go should need that kind of repeated foreach over the columns to generate separate .csv files.
I would like some guidance on exporting and generating sql data to .csv files.
Thank you!
My current import set-up is:
import (
"database/sql"
_ "github.com/go-sql-driver/mysql"
"github.com/joho/sqltocsv"
)
I'm able to connect to my database without issue, and query my database when initializing the query via rows, _ := db.Query("SELECT * FROM table1 WHERE id = 22")
I am able to write the query data to a pre-existing .csv file using err = sqltocsv.WriteFile("results.csv", rows)
Iterate the rows with a loop to fetch each row.
Use the encoding/csv package to write a CSV file.

Importing a series of .CSV files that contain one field while adding additional 'known' data in other fields

I've got a process that creates a csv file that contains ONE set of values that I need to import into a field in a MySQL database table. This process creates a specific file name that identifies the values of the other fields in that table. For instance, the file name T001U020C075.csv would be broken down as follows:
T001 = Test 001
U020 = User 020
C075 = Channel 075
The file contains a single row of data separated by commas for all of the test results for that user on a specific channel and it might look something like:
12.555, 15.275, 18.333, 25.000 ... (there are hundreds, maybe thousands, of results per user, per channel).
What I'm looking to do is to import directly from the CSV file adding the field information from the file name so that it looks something like:
insert into results (test_no, user_id, channel_id, result) values (1, 20, 75, 12.555)
I've tried to use "Bulk Insert" but that seems to want to import all of the fields where each ROW is a record. Sure, I could go into each file and convert the row to a column and add the data from the file name into the columns preceding the results but that would be a very time consuming task as there are hundreds of files that have been created and need to be imported.
I've found several "import CSV" solutions but they all assume all of the data is in the file. Obviously, it's not...
The process that generated these files is unable to be modified (yes, I asked). Even if it could be modified, it would only provide the proper format going forward and what is needed is analysis of the historical data. And, the new format would take significantly more space.
I'm limited to using either MATLAB or MySQL Workbench to import the data.
Any help is appreciated.
Bob
A possible SQL approach to getting the data loaded into the table would be to run a statement like this:
LOAD DATA LOCAL INFILE '/dir/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = '001'
, user_id = '020'
, channel_id = '075'
;
We need the comma to be the line separator, and as the field separator we can specify some character that is guaranteed not to appear in the data. So we get LOAD DATA to see a single "field" on each "line".
(If there isn't a trailing comma at the end of the file, after the last value, we need to test to make sure we are actually getting the last value, i.e. the last "line" as we're telling LOAD DATA to read the file.)
We could use user-defined variables in place of the literals, but that leaves the part about parsing the filename. That's really ugly in SQL, but it could be done, assuming a consistent filename format...
-- parse filename components into user-defined variables
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'T',-1),'U',1) AS t
     , SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'U',-1),'C',1) AS u
     , SUBSTRING_INDEX(f.n,'C',-1) AS c
     , f.n AS n
  FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX( i.filename ,'/',-1),'.csv',1) AS n
           FROM ( SELECT '/tmp/T001U020C075.csv' AS filename ) i
       ) f
  INTO @ls_t
     , @ls_u
     , @ls_c
     , @ls_n
;
While we're testing, we probably want to see the result of the parsing.
-- for debugging/testing
SELECT @ls_t
     , @ls_u
     , @ls_c
     , @ls_n
;
And then there's the part about running the actual LOAD DATA statement. We've got to specify the filename again, and we need to make sure we're using the same filename ...
LOAD DATA LOCAL INFILE '/tmp/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = @ls_t
, user_id = @ls_u
, channel_id = @ls_c
;
(The client will need read permission on the .csv file.)
Unfortunately, we can't wrap this in a procedure, because running a LOAD DATA statement is not allowed from a stored program.
Some would correctly point out that as a workaround, we could compile/build a user-defined function (UDF) to execute an external program, and a procedure could call that. Personally, I wouldn't do it. But it is an alternative we should mention, given the constraints.

cypher - load multiple csv files

I have many csv files with names 0_0.csv, 0_1.csv, 0_2.csv, ..., 1_0.csv, 1_1.csv, ..., z_17.csv.
I wanted to know how I can import them in a loop or something.
Also, I wanted to know whether I am doing it right. (Each file is 50 MB and the total size is about 100 GB.)
This is my code :
create index on :name(v)
create index on :value(v)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.txt" AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
You could handle multiple files by constructing the file name dynamically. Unfortunately this seems to break when using the USING PERIODIC COMMIT query hint, so it won't be a good option for you. You could, however, create a script that wraps it up and sends the commands to bin/cypher-shell.
UNWIND ['0','1','z'] as outer
UNWIND range(0,17) as inner
LOAD CSV WITH HEADERS FROM 'file:///'+ outer +'_' + toString(inner) + '.csv' AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
As far as your actual load query goes: do your name and value nodes come up multiple times in the files? If they are unique, you would be better off loading the data in multiple passes. Load the nodes first without the indexes; then add the indexes once the nodes are loaded; and then do the relationships as the last step.
Using CREATE for the :kind relationship will result in multiple relationships even if it is the same value for csv.kind. You might want to use MERGE instead if that is the case.
For 100 GB of data, though, if you are starting with an empty database and are looking for speed, I would take a look at using bin/neo4j-admin import.

Athena AWS bad field name and multiple folders with Hive DDL

I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I ran into a number of problems that don't have any answer in the documentation (sadly the error log is not informative enough for me to solve them myself):
How do I query a JSON field whose name contains parentheses? For example I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL TABLE statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
My files are located in subfolders in S3 with the following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). Is there any 'wildcard' I can use to replace the dates in the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
Do I have to load the raw data as-is using CREATE EXTERNAL TABLE and only then query it, or can I add some WHERE statements built in? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcome in terms of billing? Because right now I'm building this CREATE statement only to reuse the data in a SQL query once again.
Thanks!
1. JSON field named with parentheses
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON then you can use the SerDe's mapping property (which I saw in the question "issue with Hive Serde dealing nested structs"):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data but that would require a different naming format for your directories.
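To illustrate what partitioning could look like, here is a rough sketch that assumes (hypothetically) the files were re-organised as s3://location_name/appstring/dt=YYYY-MM-DD/...; the table name and the dt partition column are just placeholders:
-- Sketch only: assumes a Hive-style layout s3://location_name/appstring/dt=YYYY-MM-DD/
CREATE EXTERNAL TABLE test_table_partitioned (
  device string,
  capacity string
)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://location_name/appstring/';

-- register the dt=... directories as partitions
MSCK REPAIR TABLE test_table_partitioned;

-- queries can then prune on dt, which also reduces the data scanned (and billed)
SELECT COUNT(*) FROM test_table_partitioned WHERE dt = '2017-01-15';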
3. Creating tables
You cannot specify WHERE statements as part of the CREATE TABLE statement.
If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned or store in a column-based format such as Parquet.
For examples, see: Analyzing Data in S3 using Amazon Athena

SSIS - Reading from a multi part text file

I'm trying to build an SSIS package that reads from a text file and outputs into another text file. The catch is that the file I'm trying to read from has multiple sections and I can't find anything that shows how to do that.
The file looks like this:
[sectionA]
key1=value1
key2=value2
key3=value3
[sectionB]
key4=value4
key5=value5
key6=value6
I started with a couple of tutorials that read from a flat file source but the data gets pulled into an equally ugly table. Hoping someone has some input on this.
The SSIS Flat File Connection is built for speed, so it doesn't allow for niceties like that.
I would still use the Flat File Connection but just load all the data into a single, wide NVARCHAR column in a SQL table. I would add an IDENTITY column to that table to get a relative Row Number.
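A minimal sketch of such a staging table (the names Staging_Table, File_Row_Number and nvarchar_column are just placeholders, matching the snippet below, and the column size is arbitrary):
CREATE TABLE Staging_Table (
    File_Row_Number INT IDENTITY(1,1) PRIMARY KEY, -- relative row number within the file
    nvarchar_column NVARCHAR(4000) NULL            -- one raw line from the text file
);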
Then I would add downstream tasks using SQL to select by Sections e.g. for Section A rows:
WHERE File_Row_Number > ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionA]' )
AND File_Row_Number < ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionB]' )
If the split requirements are as simple as those shown I might attempt them in SQL e.g.
How do I split a string so I can access item x?
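For the simple key=value rows shown, a minimal sketch of that SQL approach (assuming the staging table above) could be:
-- split 'key1=value1' style rows into two columns; NULLIF guards rows without '='
SELECT LEFT(nvarchar_column, NULLIF(CHARINDEX('=', nvarchar_column), 0) - 1) AS KeyName,
       SUBSTRING(nvarchar_column, CHARINDEX('=', nvarchar_column) + 1, LEN(nvarchar_column)) AS KeyValue
FROM Staging_Table
WHERE nvarchar_column LIKE '%=%';  -- skips the [sectionX] header rows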
But I would probably lean towards using Strings.Split in a Script Task where the code will be simpler and safer.