I have a MySQL table whose data I have to export to .csv and then ingest that .csv into GeoMesa.
My MySQL table structure is shown below:
Now, as you can see, the the_geom attribute of the table has the POINT data type, and in the database it is stored as a BLOB, as shown below:
Now I have two problems:
When I export the MySQL data into a .csv file, the file shows (...) for the the_geom attribute, as shown below, instead of any binary representation or anything that would allow it to be ingested into GeoMesa. How do I overcome this?
The .csv file also shows # for any attribute with a DATETIME data type, but if you expand the column the date and time can be seen, as shown in the picture below (will this cause a problem in GeoMesa?).
For #1, MySQL's export does not automatically convert the POINT data type to text for you. You need to call a conversion function such as AsWKT() (or its newer synonym ST_AsText()) to output the geometry as Well-Known Text (WKT). GeoMesa can read the POINT data from the WKT format.
For #2, I think you'll need to do the same for the date field; check out MySQL's date and time functions (e.g. DATE_FORMAT). A combined sketch follows below.
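A minimal sketch of such an export query, assuming a table named my_table with a POINT column the_geom and a DATETIME column created_at (table, column names, and output path are assumptions, not from the original post); SELECT ... INTO OUTFILE writes the result server-side as CSV:

-- Sketch only: adjust names and path to your schema and server.
SELECT
    id,
    ST_AsText(the_geom)                          AS the_geom_wkt,   -- e.g. 'POINT(30 10)', readable by GeoMesa
    DATE_FORMAT(created_at, '%Y-%m-%d %H:%i:%s') AS created_at_txt  -- plain-text timestamp
FROM my_table
INTO OUTFILE '/tmp/my_table.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

Note that INTO OUTFILE writes to the database server's filesystem, so the MySQL user needs the FILE privilege and write access to that path.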
I have a table with 2 million records. I am trying to dump the contents of the table in JSON format. The issue is that TPT export does not allow JSON columns, and a BTEQ export would take a lot of time. Is there any way to handle this export in a more optimized way?
Your help is really appreciated.
If the JSON values are not too large, you could potentially CAST them in your SELECT as VARCHAR(64000) CHARACTER SET LATIN, or VARCHAR(32000) CHARACTER SET UNICODE if you have non-LATIN characters, and export them in-line, as in the sketch below.
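A minimal sketch of that in-line approach, assuming a table mydb.my_table with a JSON column jsn_obj (database, table, and column names are assumptions):

SELECT col1,
       col2,
       CAST(jsn_obj AS VARCHAR(64000) CHARACTER SET LATIN) AS jsn_txt  -- exportable in-line as plain VARCHAR
FROM mydb.my_table;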
Otherwise each JSON object has to be transferred DEFERRED BY NAME, where each object is stored in a separate file and the corresponding filename is stored in the output row. In that case you would need to use BTEQ, the TPT SQL Selector operator, or write your own application.
You can do one thing: load the JSON-formatted rows into another Teradata table.
Keep that table's column as VARCHAR and then do a TPT export of that column/table.
It should work.
INSERT INTO test (col1, col2, ..., jsn_obj)
SELECT col1,
       col2,
       ...,
       JSON_Compose(<columns you want to include in your JSON object>)
FROM <schemaname>.<tablename>;
I have a table in a Cassandra DB and one of the columns has values in JSON format. I am using DataStax DevCenter to query the DB, and when I try to export the result to CSV, the JSON value gets broken into separate columns wherever there is a comma (,). I even tried to export from the command prompt without specifying any delimiter, but that too resulted in broken JSON values.
Is there any way to achieve this?
Use the COPY command to export the table as a whole with a different delimiter.
For example:
COPY keyspace.your_table (your_id, your_col) TO 'your_table.csv' WITH DELIMITER = '|';
Then filter on this data programmatically in whatever way you want.
I am trying to load some CSV files into BigQuery from Google Cloud Storage and am wrestling with schema generation. There is an auto-generate option, but it is poorly documented. The problem is that if I let BigQuery generate the schema, it does a decent job of guessing data types, but it only sometimes recognizes the first row of the data as a header row; other times it treats the first row as data and generates column names like string_field_N. The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing 2 things here:
1. Preprocess your file and store the final layout of the file without the first row, i.e. the header row.
2. BQ load accepts an additional parameter in the form of a JSON schema file; use this to explicitly define the table schema and pass this file as a parameter. This gives you the flexibility to alter the schema at any point in time, if required.
Allowing BQ to autodetect schema is not advised.
Schema auto-detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One case where column-name detection fails is when every field in your CSV file has the same data type. For instance, BigQuery schema auto-detection would not be able to detect header names for the following file, since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fixing this shortcoming of schema auto detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip the first n rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the web UI, but it is also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).
Yes, you can retrieve the existing schema (the generated DDL) with bq show and then modify it:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that this will result in you creating a new BQ table altogether.
I have a workaround for the schema when loading CSV into BigQuery: just edit the values in the first data row. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you use BigQuery's schema auto-generation, the fields weight and total will be defined as INT64, but inserting the second row will then fail. So just edit the first data row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
You must make the fields weight and total come out as STRING; if you want to aggregate them, just convert the data type in BigQuery, as in the sketch below.
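For example, a hedged sketch in BigQuery standard SQL (the project, dataset, and table names are assumptions):

-- Cast the STRING columns back to numbers before aggregating;
-- SAFE_CAST returns NULL instead of failing on values such as 'just string'.
SELECT
  SUM(SAFE_CAST(weight AS FLOAT64)) AS total_weight,
  SUM(SAFE_CAST(total  AS FLOAT64)) AS grand_total
FROM `my_project.my_dataset.my_table`;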
cheers
If the 'column name' row and the data rows have the same data type throughout the CSV file, then BigQuery mistakes the column names for data and adds self-generated names for the columns. I couldn't find any technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose name is a string and whose values are all numbers, e.g. a column named 'Test' with every value set to 0. Upload the file to BigQuery and use this query to drop that column:
ALTER TABLE <table_name> DROP COLUMN <column_name>;
Change <table_name> and <column_name> according to your table.
I am trying to select data over MySQL JDBC in MATLAB. One of the columns (file_data) in the table has the type "longblob". More details of the content of file_data are shown in the picture below.
The first two lines are the header, and the remaining rows are the data. Before I save this file into the MySQL database (using Spring MVC and JSP), the file is a CSV file with ";" as the delimiter.
The picture below shows the table in the database. You can see that the type of the column file_data is "longblob".
But when I try to preview the table in the Database Explorer in MATLAB, the column shows up as "int8" instead of a blob. For more details, please look at the picture below.
Given that, I have two questions:
How can file_data be read in MATLAB as a "longblob" and not as "int8"?
Once the column file_data has successfully been read as a longblob, how can I read it as a table so that I can process the data?
A CountOfBytes-by-1 int8 array is the longblob: int8 is MATLAB's byte type, and CountOfBytes is the number of bytes stored.
You can convert it back to a file with something like:
fid = fopen(file_name, 'w');     % open (or create) the output file for writing
fwrite(fid, file_data, 'int8');  % write the raw bytes fetched from the longblob column
fclose(fid);                     % close the file handle
Code adapted from MATLAB Answers, where file_name and file_data are the values fetched by your query.
Alternatively, you can try one of the solutions of Retrieve blob field from MySQL database with MATLAB
To read the CSV data, use csvread.
When you have solved your question, please add the solution to your cross-post too.
BTW: I have never used MATLAB, but a simple web search for "Matlab int8 longblob" guided me.
So we have a 3.6 GB CSV that we have uploaded to S3 and now want to import into Redshift, then do the querying and analysis from IPython.
Problem 1:
This comma-delimited file contains free-text values that themselves contain commas, and this interferes with the delimiting, so we can't upload it to Redshift.
When we tried opening the sample dataset in Excel, Excel surprisingly put the values into columns correctly.
Problem 2:
A column that is supposed to contain integers has some records containing letters to indicate some other scenario.
So the only way to get the import through is to declare this column as VARCHAR, but then we can't do calculations on it later.
Problem 3:
The datetime data type requires the value to be in the format YYYY-MM-DD HH:MM:SS, but the CSV doesn't contain the SS part, and the database rejects the import.
We can't manipulate the data on a local machine because it is too big, and we can't upload it to the cloud for computing because it is not in the correct format.
The last resort would be to scale the instance running IPython all the way up so that we can read the big CSV directly from S3, but this approach doesn't make sense as a long-term solution.
Your suggestions?
Train: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train.csv (3.4GB)
Train Sample: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train-sample.csv (133MB)
Try using a different delimiter or use escape characters.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_preparing_data.html
For the second issue, if you want to extract only the numbers from the column after loading it into a char field, use regexp_replace or other functions, for example:
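A rough sketch (the table and column names are assumptions), stripping everything that is not a digit and casting the remainder:

-- NULLIF guards against rows that contained no digits at all.
SELECT NULLIF(REGEXP_REPLACE(qty_raw, '[^0-9]', ''), '')::INTEGER AS qty
FROM staging_table;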
For the third issue, you can load it into a VARCHAR field in a staging table and then use substring/cast logic such as cast(left(column_name, 10)||' '||right(column_name, 6)||':00' as timestamp) to load it into the final table from the staging table. A rough sketch follows below.
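A rough end-to-end sketch of the staging approach (table and column names are assumptions; here the missing seconds are simply appended, a slightly simpler variant of the expression above):

-- The raw value arrives from the CSV as 'YYYY-MM-DD HH:MM' text.
CREATE TABLE staging_events (
    event_id VARCHAR(32),
    event_ts VARCHAR(20)
);

INSERT INTO final_events (event_id, event_ts)
SELECT event_id,
       CAST(event_ts || ':00' AS TIMESTAMP)  -- append the missing :SS
FROM staging_events;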
For the first issue, you need to find a way to differentiate between the two kinds of commas: the delimiters and the commas inside the free text. Once you have done that, replace the delimiters with a different character and specify that character as the delimiter in the Redshift COPY command, roughly as sketched below.
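A minimal sketch of such a COPY (the bucket path, re-delimited file name, table name, and IAM role are all assumptions; the file is assumed to have been re-written with '|' as the delimiter):

COPY stack_overflow_train
FROM 's3://my-bucket/stack_overflow_train.psv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '|'
IGNOREHEADER 1;  -- skip the header row

If the free-text fields are already quoted in the source file, the CSV option of COPY may also handle the embedded commas without changing the delimiter.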
For the second issue, you need to first figure out whether this column needs to be available for numerical aggregations once loaded. If yes, you need to clean this data up before loading. If no, you can load it directly as a char/varchar field. All your queries will still work, but you will not be able to do any aggregations (sum/avg and the like) on this field.
For problem 3, you can use the TEXT(date, "yyyy-mm-dd hh:mm:ss") function in Excel to do a mass reformat of this field.
Let me know if this works out.