I'm trying to export a MySQL table's data to MongoDB by generating a set of "create" statements in Rails.
My issue is this: in my original table I have "created_at" and "updated_at" fields and I would like to keep the original values even when I export the data to my new MongoDB document. But after I create a new row in Mongo, even if I tell it to set "created_at" = [my original date], Mongo sets it to the current datetime.
How can I avoid this? This is my MongoMapper model:
class MongoFeedEvent
  include MongoMapper::Document

  key :event_type, String
  key :type_id, Integer
  key :data, String

  timestamps!
end
You're probably better off dumping your MySQL table as JSON and then using mongoimport to import that JSON; this will be a lot faster than doing it row by row through MongoMapper and it will bypass your problem completely as a happy side effect.
There's a gem called mysql2xxxx that will help you dump your MySQL database to JSON:
How to export a MySQL database to JSON?
I haven't used it but the author seems to hang out on SO so you should be able to get help with it if necessary. Or, write a quick one-off script to dump your data to JSON.
Once you have your JSON, you can import it with mongoimport and move on to more interesting problems.
Also, mongoimport understands CSV and mysqldump can write CSV directly:
The mysqldump command can also generate output in CSV, other delimited text, or XML format.
So skip MongoMapper and row-by-row copying completely for the data transfer. Dump your data to CSV or JSON and then import that all at once.
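For example, something along these lines might work (database, collection, column, and path names are placeholders, and note that mongoimport will load the date columns as plain strings unless you convert them afterwards):

# Ask the MySQL server to write the table data as CSV-style text (requires the FILE privilege)
mysqldump --tab=/tmp/dump --fields-terminated-by=',' --fields-enclosed-by='"' --no-create-info mydb feed_events

# Load the resulting /tmp/dump/feed_events.txt into MongoDB, keeping the original timestamp values
mongoimport --db mydb --collection mongo_feed_events --type csv \
  --fields event_type,type_id,data,created_at,updated_at \
  --file /tmp/dump/feed_events.txt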
Related
I am exporting a GA360 table from BigQuery to Snowflake in JSON format using the bq CLI command. I am losing some fields when I load it as a table in Snowflake. I use the COPY command to load my JSON data from a GCS external stage into Snowflake tables, but I am missing some fields that are part of a nested array. I even tried compressing the file when I export to GCS, but I still lose data. Can someone suggest how I can do this? I don't want to flatten the table in BigQuery and transfer that. My daily table size ranges from a minimum of 1.5 GB to a maximum of 4 GB.
bq extract \
--project_id=myproject \
--destination_format=NEWLINE_DELIMITED_JSON \
--compression GZIP \
datasetid.ga_sessions_20191001 \
gs://test_bucket/ga_sessions_20191001-*.json
I have set up my integration, file format, and stage in Snowflake, and I am copying data from this bucket to a table that has one variant field. The row count matches BigQuery, but some fields are missing.
I am guessing this is due to Snowflake's limit that each variant value must be under 16 MB. Is there some way I can compress each variant field so it stays under 16 MB?
I had no problem exporting GA360, and getting the full objects into Snowflake.
First I exported the demo table bigquery-public-data.google_analytics_sample.ga_sessions_20170801 into GCS, JSON formatted.
Then I loaded it into Snowflake:
create or replace table ga_demo2(src variant);
COPY INTO ga_demo2
FROM 'gcs://[...]/ga_sessions000000000000'
FILE_FORMAT=(TYPE='JSON');
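The question mentions setting up an integration, file format, and stage; if you load through a named stage instead of the direct URL above, the setup could look roughly like this (the integration and object names are placeholders):

create or replace file format ga_json_format type = 'JSON';
create or replace stage ga_gcs_stage
  url = 'gcs://test_bucket/'
  storage_integration = my_gcs_integration
  file_format = ga_json_format;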
And then to find the transactionIds:
SELECT src:visitId, hit.value:transaction.transactionId
FROM ga_demo2, lateral flatten(input => src:hits) hit
WHERE src:visitId='1501621191'
LIMIT 10
Cool things to notice:
I read the GCS files easily from Snowflake deployed in AWS.
JSON manipulation in Snowflake is really cool.
See https://hoffa.medium.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1 for more.
I am trying to migrate some JSON blocks stored inside a Postgres database using the scripts below. They work fine in most cases but fail on a few data sets (I think the ones where there are double quotes in the JSON). Basically, I am trying to promote some individual approved blocks of JSON to the prod database using an export script that copies the data out to a file, and an import script that locates the destination node of JSON in the destination database and rewrites that node with my new data appended.
I think the data sets that fail are the ones where the JSON data contains double quotes. An example of how the JSON with quotes is exported is shown below. I think the issue is the way the copy function is writing out (escaping) this data. My reasons for this guess: 1) only the records that fail seem to be ones with quotes in the data, and 2) a few JSON tools failed when parsing the JSON in the area where the quotes were escaped.
"description": "See the chapter \\"Key fields\\" for a general description of keys.
How can I adjust my scripts to be more tolerant of whatever kind of data is stored in the JSON?
Export
psql -h demo_ip -c "\COPY (select f_values from view_export_f where id = '$id' and f_kind='$f_kind') TO '/tmp/demo.json'"
Import
psql -h prod_ip -ef "import.psql" -v ex="$ex"
import.psql
\set json_block `cat /tmp/demo.json`
UPDATE data
SET value = jsonb_set(value, '{json_location}', value->'json_location' || (:'json_block'), true) WHERE id = (:'ex');
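If the doubled backslashes really are coming from COPY's text-format escaping, one workaround to try is exporting with psql's unaligned, tuples-only output instead of \COPY, since that writes the value out verbatim (same query as the export above, only the output mechanism changes):

psql -h demo_ip -At -c "select f_values from view_export_f where id = '$id' and f_kind='$f_kind'" -o /tmp/demo.json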
I have a table in a Cassandra DB and one of the columns has a value in JSON format. I am using DataStax DevCenter for querying the DB, and when I try to export the result to CSV, the JSON value gets broken into separate columns wherever there is a comma (,). I even tried to export from the command prompt without giving any delimiter, and that too resulted in a broken JSON value.
Is there anyway to achieve this task?
Use the COPY command to export the table as a whole with a different delimiter.
For example :
COPY keyspace.your_table (your_id, your_col) TO 'your_table.csv' WITH DELIMITER = '|';
Then filter on this data programmatically in whatever way you want.
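Alternatively, if your Cassandra version supports it (2.2 and later), cqlsh can return each row as a single JSON document, which sidesteps the comma problem entirely; something along these lines, reusing the columns from the COPY above:

SELECT JSON your_id, your_col FROM keyspace.your_table;

Each result row then comes back as one JSON string that you can post-process however you like.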
I have around four self-contained *.sql dumps (about 20 GB each) which I need to convert to datasets in Apache Spark.
I have tried installing and setting up a local database using InnoDB and importing the dump, but that seems too slow (I spent around 10 hours on that).
I directly read the file into Spark using:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._

// Read the dump line by line through the session's SparkContext
val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")

// Convert this to an indexed DataFrame so you can parse multi-line CREATE / INSERT statements.
// This will also show you the structure of the SQL dump for your use case.
val myQueryFileDF = myQueryFile.toDF.withColumn("index", monotonically_increasing_id()).withColumnRenamed("value", "text")

// Identify all tables and data in the SQL dump along with their indexes
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println("If there is a count mismatch between these values, choose a different substring: " + tableStructures.count() + " " + tableStructureEnds.count())

val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset. To start, I need to understand whether this can be done for even one table. Is there any .sql parser written for Scala Spark?
Is there a faster way of going about it? Can I read the self-contained .sql file directly into Hive?
UPDATE 1: I am writing the parser for this based on the input given by Ajay.
UPDATE 2: Changing everything to dataset-based code to use the SQL parser, as suggested.
Is there any .sql parser written for scala spark ?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply expr standard function. You may also use the SQL parser directly.
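For example, you can hand a single statement to the parser directly and get a logical plan back (the statement below is only an illustration):

scala> :type spark.sessionState.sqlParser.parsePlan("INSERT INTO t VALUES (1, 'a')")
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan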
Shameless plug: you may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
You need to parse it yourself. It requires the following steps (a rough sketch follows the list):
Create a class for each table.
Load files using textFile.
Filter out all the statements other than insert statements.
Then split the RDD, using filter, into multiple RDDs based on the table name present in the insert statement.
For each RDD, use map to parse the values present in the insert statement and create an object.
Now convert the RDDs to datasets.
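A rough sketch of those steps, assuming a hypothetical feed_events table with (id, event_type, data) columns and simple single-row INSERT statements (a real dump needs a more careful value parser):

import org.apache.spark.sql.SparkSession

// One case class per table you want to extract
case class FeedEvent(id: Long, eventType: String, data: String)

val spark = SparkSession.builder().appName("dumpParser").getOrCreate()
import spark.implicits._

// Load the dump and keep only the INSERT statements for the table we care about
val lines = spark.sparkContext.textFile("C:/Users/some_db.sql")
val inserts = lines.filter(_.startsWith("INSERT INTO `feed_events`"))

// Very naive value parsing: assumes one row per INSERT and no commas inside string values
val events = inserts.map { stmt =>
  val raw = stmt.substring(stmt.indexOf("(") + 1, stmt.lastIndexOf(")"))
  val values = raw.split(",").map(_.trim.stripPrefix("'").stripSuffix("'"))
  FeedEvent(values(0).toLong, values(1), values(2))
}

// Finally, turn the RDD into a typed Dataset
val eventsDS = events.toDS()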
I would like to export each table of my SQLite3 database to CSV files for further manipulation with Python, and after that I want to export the CSV files into a different database format (PSQL). The ID column in SQLite3 is of type GUID, hence gibberish when I export tables to CSV as text:
l_yQ��rG�M�2�"�o
I know that there is a way to turn it into a readable format, since the SQLite Manager add-on for Firefox does this automatically, sadly without any reference to how or which query it uses:
X'35B17880847326409E61DB91CC7B552E'
I know that QUOTE (GUID) displays the desired hexadecimal string, but I don't know how to dump it to the CSV instead of the BLOB.
I found out what my error was - not why it doesn't work, but how to get around it.
I tried to export my tables as stated in https://www.sqlite.org/cli.html , namely with a multi-line command, which didn't work:
sqlite3 'path_to_db'
.headers on
.mode csv
.output outfile.csv
SELECT statement
and so on.
I was testing a few things and since I'm lazy while testing, I used the single line variant, which got the job done:
sqlite3 -header -csv 'path_to_db' "SELECT QUOTE (ID) AS Hex_ID, * FROM Table" > 'output_file.csv'
Of course it would be better if I specified all column names instead of using *, but this suffices as an example.
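If you would rather have the bare hex digits without the X'...' wrapper (which may be easier to post-process in Python), SQLite's hex() function should also work here, for example:

sqlite3 -header -csv 'path_to_db' "SELECT hex(ID) AS Hex_ID, * FROM Table" > 'output_file.csv'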