How can I import nested JSON data into multiple connected Redshift subtables?

I have server log data that looks something like this:
2014-04-16 00:01:31-0400,583 {"Items": [
{"UsageInfo"=>"P-1008366", "Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0},
{"Role"=>"Text", "ProjectCode"=>"", "PublicationCode"=>"", "RetailPrice"=>2},
{"Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0, "ParentItemId"=>"396487"}
]}
What I'd like is a relational structure that connects two tables - a UsageLog table and a UsageLogItems table - joined by a primary key id.
You can see that the UsageLog table would have fields like:
UsageLogId
Date
Time
and the UsageLogItems table would have fields like
UsageLogId
UsageInfo
Role
RetailPrice
...
However, I am having trouble writing these into Redshift and being able to associate each record with unique and related ids as keys.
What I am currently doing: a Ruby script reads each line of the log file, parses out the UsageLog info (such as date and time), and writes it to the database (writing single rows to Redshift is VERY slow). It then builds a CSV of the UsageLogItems data and imports that into Redshift via S3, querying the largest id in the UsageLogs table and using that number to relate the two. This is also slow, because many UsageLogs contain no items, so I frequently load 0 records from the CSV files.
This does work, but it is far too slow to be effective. Is there a better way to handle this?

Amazon Redshift supports JSON ingestion via the COPY command using JSONPaths.
http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html
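For the two-table layout in the question, here is a minimal sketch of how this could fit together (the bucket, file, and column names are placeholders, and the parsing assumes each log entry has been collapsed onto a single line): pre-assign UsageLogId while parsing, write one flat file per table, and bulk-load each with a single COPY instead of row-by-row inserts.
# Hedged sketch: assign UsageLogId client-side in one pass over the log, emit
# one flat file per target table, then bulk-load each with a single COPY.
# Paths, bucket names, and the exact log layout are assumptions.
import csv
import json

usage_logs, usage_log_items = [], []

with open("server.log") as f:
    for log_id, line in enumerate(f, start=1):      # surrogate key, no max-id query needed
        header, _, payload = line.partition(" {")    # header like "2014-04-16 00:01:31-0400,583"
        date, time = header.split(",")[0].split(" ")
        usage_logs.append([log_id, date, time])
        # the logs use Ruby-style hashes, so turn "=>" into ":" before parsing
        items = json.loads("{" + payload.replace("=>", ":")).get("Items", [])
        for item in items:
            usage_log_items.append([log_id,
                                    item.get("UsageInfo", ""),
                                    item.get("Role", ""),
                                    item.get("RetailPrice", "")])

with open("usage_logs.csv", "w", newline="") as out:
    csv.writer(out).writerows(usage_logs)
with open("usage_log_items.csv", "w", newline="") as out:
    csv.writer(out).writerows(usage_log_items)

# After uploading both files to S3, two COPY statements load everything at once:
#   COPY usage_logs      FROM 's3://my-bucket/usage_logs.csv'      IAM_ROLE '...' CSV;
#   COPY usage_log_items FROM 's3://my-bucket/usage_log_items.csv' IAM_ROLE '...' CSV;
# (Keeping the data as newline-delimited JSON and using COPY ... JSON with a
#  JSONPaths file, as in the documentation linked above, works the same way.)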

Related

mongoimport error when importing JSON from BigQueryToCloudStorageOperator output

Summary: I am trying to improve our daily transfer of data from BigQuery into a MongoDB cluster using Airflow, and I'm getting a mongoimport error at the end.
We need to transfer 5-15GB of data daily from BigQuery to MongoDB. This transfer is not brand-new data; it replaces the previous day's data, which is outdated (our DB size mostly stays the same and does not grow 15GB per day). Our Airflow DAG used to use a PythonOperator to transfer data from BigQuery into MongoDB, which relied on a Python function that did everything (connect to BQ, connect to Mongo, query data into pandas, insert into Mongo, create Mongo indexes). It looked something like the function in this post.
This task takes the name of a BigQuery table as a parameter and, using Python, creates a collection with the same name in our MongoDB cluster. However, it is slow: as_dataframe() is slow, and insert_many() is slow-ish as well. When this task is run for many big tables at once, we also get out-of-memory errors. To handle this, we've created a separate transfer_full_table_chunked function that also transfers from BigQuery to Mongo but uses a for-loop, querying and inserting small chunks at a time (roughly like the sketch below). This prevents the out-of-memory errors, but it results in 100 queries + inserts per table, which seems like a lot for a single table transfer.
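For context, a rough sketch of what that chunked transfer looks like (the client setup, connection string, chunk size, and the LIMIT/OFFSET pagination are stand-ins, not our actual code):
# Hedged sketch of the chunked BigQuery -> MongoDB transfer described above.
# Project, dataset, credentials, and pagination strategy are placeholders.
from google.cloud import bigquery
from pymongo import MongoClient

def transfer_full_table_chunked(table_name, chunk_size=100_000):
    bq = bigquery.Client()
    mongo = MongoClient("mongodb+srv://user:password@cluster.example.mongodb.net")
    collection = mongo["mydb"][table_name]

    offset = 0
    while True:
        rows = bq.query(
            f"SELECT * FROM `myproject.mydataset.{table_name}` "
            f"LIMIT {chunk_size} OFFSET {offset}"   # a stable ORDER BY would be needed in practice
        ).result()
        batch = [dict(row) for row in rows]
        if not batch:
            break
        collection.insert_many(batch)               # one insert per chunk instead of per row
        offset += chunk_size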
Per an answer from here with a few upvotes:
I would recommend writing data from BigQuery to a durable storage service like Cloud Storage then loading into MongoDB from there.
Following the suggestion, we've created this task:
transfer_this_table_to_gcs = BigQueryToCloudStorageOperator(
    task_id='transfer_this_table_to_gcs',
    source_project_dataset_table='myproject.mydataset.mytable',
    destination_cloud_storage_uris=['gs://my-bucket/this-table/conference-*.json'],
    export_format='JSON',
    bigquery_conn_id='bigquery_conn_id'
)
...which does successfully export our BigQuery table into GCS, automatically creating multiple files using the partition key we've set in BigQuery for the table. The exported files appear to be newline-delimited JSON.
However, when we try to run mongoimport --uri "mongodb+srv://user#cluster-cluster.dwxnd.gcp.mongodb.net/mydb" --collection new_mongo_collection --drop --file path/to/conference-00000000000.json
we get the error:
2020-11-13T11:53:02.709-0800 connected to: localhost
2020-11-13T11:53:02.866-0800 dropping: mydb.new_mongo_collection
2020-11-13T11:53:03.024-0800 Failed: error processing document #1: invalid character '_' looking for beginning of value
2020-11-13T11:53:03.025-0800 imported 0 documents
I see that the very first thing in the file is an _id, whose leading underscore is probably the invalid character the error message is referring to. This _id is explicitly in there as MongoDB's unique identifier for the row. I'm not quite sure what to do about this, or whether it is an issue with the data format (NDJSON) or with our table specifically.
How can I transfer large data daily from BigQuery to MongoDB in general?

Parsing CSV in Athena by column names

I'm trying to create an external table based on CSV files. My problem is that not all CSV files are the same (for some of them there are missing columns) and the order of columns is not always the same.
The question is whether I can make Athena parse the columns by name instead of by their order.
No, Athena cannot parse the columns by name instead of by their order. The data must be in exactly the same order as defined in your table schema. You will need to preprocess your CSVs and fix the column order before writing them to S3 (a small preprocessing sketch follows the documentation quote below).
Quoting from the AWS Athena documentation:
When you create a new table schema in Athena, Athena stores the schema in a data catalog and uses it when you run queries.
Athena uses an approach known as schema-on-read, which means a schema is projected on to your data at the time you execute a query. This eliminates the need for data loading or transformation.
When you create a database and table in Athena, you are simply describing the schema and the location where the table data are located in Amazon S3 for read-time querying. Database and table, therefore, have a slightly different meaning than they do for traditional relational database systems because the data isn't stored along with the schema definition for the database and table.
Reference: Tables and Databases in Athena
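A minimal sketch of that preprocessing step, assuming pandas is available and the canonical column order is known (the column names and paths below are invented): reindex every incoming CSV to the table schema's order, creating any missing columns as empty, before the files are written to S3.
# Hedged sketch: force every CSV into the column order Athena expects.
# CANONICAL_COLUMNS and the paths are examples, not from the question.
import glob
import os
import pandas as pd

CANONICAL_COLUMNS = ["id", "name", "price", "created_at"]   # must match the Athena table schema

for path in glob.glob("incoming/*.csv"):
    df = pd.read_csv(path)
    df = df.reindex(columns=CANONICAL_COLUMNS)   # reorders columns, adds missing ones as NaN
    df.to_csv(os.path.join("normalized", os.path.basename(path)), index=False)
# upload the files in normalized/ to the table's S3 location afterwards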

Querying data from 2 MySQL Databases to a new MySQL database

I want to query data from two different MySQL databases to a new MySQL database.
I have two databases with a lot of irrelevant data, and I want to create what is effectively a data warehouse where only the relevant data from the two databases is present.
As of now all data is written to the two old databases, but I would like scheduled updates so the new database stays current. There is a shared key between the two databases, so in the best case all the data would end up in one table, though this is not crucial.
I have done similar work with Logstash and ES, however I do not know how to do it when it comes to MySQL.
The best way to do that is to create an ETL process with Pentaho Data Integration or any other ETL tool, where your sources are the two different databases; in the transformation step you can remove or add any business logic, then load the data into the new database.
If you create this ETL you can schedule it to run once a day so that your database stays up to date.
If you want to do this without an ETL tool, the databases must be on the same host. Then you can just prefix the table name with the database name in your query, like SELECT * FROM database.table_name (see the sketch below).
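A minimal sketch of that same-host option, run from Python so it can be scheduled (database, table, and column names are invented for illustration):
# Hedged sketch: cross-database INSERT ... SELECT joining the two sources on
# their shared key and refreshing the warehouse table. All names are made up.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="etl_user", password="secret")
cur = conn.cursor()
cur.execute("TRUNCATE TABLE warehouse_db.relevant_data")    # replace yesterday's copy
cur.execute("""
    INSERT INTO warehouse_db.relevant_data (customer_id, order_total, signup_date)
    SELECT a.customer_id, b.order_total, a.signup_date
    FROM source_db_1.customers AS a
    JOIN source_db_2.orders    AS b ON b.customer_id = a.customer_id
""")
conn.commit()
cur.close()
conn.close()
# Scheduling this script with cron (or using the MySQL event scheduler) gives
# the once-a-day refresh mentioned above.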

SSIS Script component - Reference data validation

I am in the process of extending an SSIS package, which takes in data from a text file, 600,000 lines of data or so, modifies some of the values in each line based on a set of business rules and persists the data to a database, database B. I am adding in some reference data validation, which needs to be performed on each row before writing the data to database B. The reference data is stored in another database, database A.
The reference data in database A is stored in seven different tables; each table has only 4 or 5 columns, all of type varchar. Six of the tables contain < 1 million records and the seventh has 10+ million rows. I don't want to keep hammering the database for each line in the file, so I'd like some feedback on my proposed approach and ideas on how best to manage the largest table.
The reference data checks will need to be performed in the script component, which acts as a source in the data flow and has an ADO.NET connection. On pre-execute, I am going to retrieve the reference data from database A (the tables with < 1 million rows) over the ADO.NET connection, loop through it with a SqlDataReader, convert the rows to .NET objects (one type per table), and add them to dictionaries.
As I process each line in the file, I can use the dictionaries to perform the reference data validation. Is this a good approach? Anybody got any ideas on how best to manage the largest table?
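To make the proposed approach concrete, here is a language-agnostic illustration of the pre-loaded lookup pattern, sketched in Python (the real SSIS script component would be C# or VB.NET, and the table and column names are invented):
# Hedged illustration: load the smaller reference tables into in-memory
# dictionaries once, then validate each file row against them without any
# further round trips to database A. Names are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=DatabaseA")
reference = {}                                   # one lookup dictionary per reference table
for table in ["countries", "currencies", "products"]:        # the < 1 million-row tables
    cursor = conn.execute(f"SELECT code, description FROM {table}")
    reference[table] = {code: description for code, description in cursor}

def validate(row):
    # per-row reference check done entirely in memory
    return (row["country"] in reference["countries"]
            and row["currency"] in reference["currencies"])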

Importing a MySQL database to Neo4j

I have a mysql database on a remote server which I am trying to migrate into Neo4j database. For this I dumped the individual tables into csv files and am now planning to use the LOAD CSV functionality to create graphs from the tables.
How does loading each table preserve the relationship between tables?
In other words, how can I generate a graph for the entire database and not just a single table?
Load each table as a CSV
Create indexes on your relationship field (Neo4j only does single property indexes)
Use MATCH() to locate related records between the tables
Use MERGE(a)-[:RELATIONSHIP]->(b) to create the relationship between the tables.
Run "all at once", this'll create a large transaction, won't go to completion, and most likely will crash with a heap error. Getting around that issue will require loading the CSV first, then creating the relationships in batches of 10K-100K transaction blocks.
One way to accomplish that goal is:
MATCH (a:LabelA)
MATCH (b:LabelB {id: a.id}) WHERE NOT (a)-[:RELATIONSHIP]->(b)
WITH a, b LIMIT 50000
MERGE (a)-[:RELATIONSHIP]->(b)
What this does is find :LabelB records that don't have a relationship with the :LabelA records and then creates that relationship for the first 50,000 records it finds. Running this repeatedly will eventually create all the relationships you want.
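A small driver loop (sketched here with the Neo4j Python driver; the URI, credentials, and labels are placeholders) that re-runs the batched MERGE until no more relationships are created:
# Hedged sketch: repeat the 50K-pair batch until a run creates nothing new,
# keeping every transaction small enough to avoid heap errors.
from neo4j import GraphDatabase

BATCH_QUERY = """
MATCH (a:LabelA)
MATCH (b:LabelB {id: a.id}) WHERE NOT (a)-[:RELATIONSHIP]->(b)
WITH a, b LIMIT 50000
MERGE (a)-[:RELATIONSHIP]->(b)
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    while True:
        created = session.run(BATCH_QUERY).consume().counters.relationships_created
        if created == 0:                 # nothing left to link
            break
driver.close()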