mongoimport error when importing JSON from BigQueryToCloudStorageOperator output - json

Summary: I am trying to improve our daily transfer of data from BigQuery into a MongoDB cluster using Airflow, and I'm getting a mongoimport error at the end.
We need to transfer 5 - 15GB of data daily from BigQuery to MongoDB. This is not brand-new data; it replaces the previous day's data, which is outdated (our DB size mostly stays the same and does not grow 15GB per day). Our Airflow DAG used to use a PythonOperator to transfer data from BigQuery into MongoDB, which relied on a Python function that did everything (connect to BQ, connect to Mongo, query data into pandas, insert into Mongo, create Mongo indexes). It looked something like the function in this post.
This task takes the name of a BigQuery table as a parameter and, using Python, creates a table with the same name in our MongoDB cluster. However, it is slow: as_dataframe() is slow, and insert_many() is slow-ish as well. When this task is run for many big tables at once, we also receive out-of-memory errors. To handle this, we've created a separate transfer_full_table_chunked function that also transfers from BigQuery to Mongo, but uses a for-loop, querying and inserting small chunks at a time (sketched below). This prevents the out-of-memory errors, but it results in 100 queries + inserts per table, which seems like a lot for one table transfer.
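For context, the chunked function follows roughly the pattern below. This is a simplified sketch, not our actual DAG code: the project, dataset, and connection names are placeholders, the chunk size is arbitrary, and index creation and error handling are omitted.
from google.cloud import bigquery
from pymongo import MongoClient

def transfer_full_table_chunked(table_name, chunk_size=100000):
    bq = bigquery.Client()
    mongo = MongoClient("mongodb+srv://user:password@cluster.example.mongodb.net")
    collection = mongo["mydb"][table_name]

    offset = 0
    while True:
        # Page through the BigQuery table in fixed-size chunks to bound memory use
        rows = list(bq.query(
            "SELECT * FROM `myproject.mydataset.{t}` LIMIT {n} OFFSET {o}".format(
                t=table_name, n=chunk_size, o=offset)
        ).result())
        if not rows:
            break
        # Row.items() turns each BigQuery Row into key/value pairs pymongo can insert
        collection.insert_many([dict(row.items()) for row in rows])
        offset += chunk_size
Each pass of the loop is one query plus one insert_many, which is where the ~100 round trips per table come from.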
Per an answer from here with a few upvotes:
I would recommend writing data from BigQuery to a durable storage service like Cloud Storage then loading into MongoDB from there.
Following the suggestion, we've created this task:
transfer_this_table_to_gcs = BigQueryToCloudStorageOperator(
    task_id='transfer_this_table_to_gcs',
    source_project_dataset_table='myproject.mydataset.mytable',
    destination_cloud_storage_uris=['gs://my-bucket/this-table/conference-*.json'],
    export_format='JSON',
    bigquery_conn_id='bigquery_conn_id'
)
...which does successfully export our BigQuery table into GCS, automatically creating multiple files using the partition key we've set in BigQuery for the table. Here's a snippet of one of those files (this is newline-delimited JSON, I think):
However, when we try to run mongoimport --uri "mongodb+srv://user#cluster-cluster.dwxnd.gcp.mongodb.net/mydb" --collection new_mongo_collection --drop --file path/to/conference-00000000000.json
we get the error:
2020-11-13T11:53:02.709-0800 connected to: localhost
2020-11-13T11:53:02.866-0800 dropping: mydb.new_mongo_collection
2020-11-13T11:53:03.024-0800 Failed: error processing document #1: invalid character '_' looking for beginning of value
2020-11-13T11:53:03.025-0800 imported 0 documents
I see that the very first thing in the file is an _id, which is probably the invalid character the error message is referring to. This _id is explicitly in there for MongoDB as the unique identifier for the row. I'm not quite sure what to do about this, or whether it is an issue with the data format (ndjson) or with our table specifically.
How can I transfer large data daily from BigQuery to MongoDB in general?

Related

AWS Lambda use case: Processing CSV, comparing values to records in database

I have a particular requirement for processing a CSV of rows, each containing a database row id and other basic information (prices, stock amount, etc).
Currently, the file is uploaded via my RESTful API, then processing happens on an EC2 instance that's connected to the DB. It loops through each row and checks whether there's a matching row id in the DB; if so, it updates the values of the existing row, otherwise it creates a new row in the database.
Also, if there are any errors in the CSV (validation issues, etc) then I don't process any of the rows and exit from the process entirely (with error messages).
The question is: if I convert this to a Lambda function, would I also need the DB id-checking code in the Lambda? I believe it would slow things down if the Lambda had to process a large CSV.
One approach would be to have one function initially check the CSV for errors and split it into row-sized parts, then push those onto a queue (SQS?), and have separate Lambdas watch the queue and add/update each row in the database. Does this sound like a reasonable solution?
Your recommended approach to validate the CSV and then send each update task to an SQS queue sounds perfectly reasonable. I recommend going with that approach.
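For illustration, a minimal sketch of the first half of that flow (validate everything, then fan rows out to SQS), assuming the CSV arrives via an S3 event; the queue URL, bucket wiring, and the validation rules are placeholders rather than details from the question.
import csv
import io
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/csv-row-updates"

def validate_and_enqueue(event, context):
    s3 = boto3.client("s3")
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    rows = list(csv.DictReader(io.StringIO(body)))

    # Placeholder validation: require an id and a price on every row.
    # Validate everything first; if anything is wrong, enqueue nothing at all.
    errors = [i for i, r in enumerate(rows) if not r.get("id") or not r.get("price")]
    if errors:
        raise ValueError("invalid rows: %s" % errors)

    # SQS batches are limited to 10 messages per call.
    for start in range(0, len(rows), 10):
        batch = rows[start:start + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{"Id": str(start + i), "MessageBody": json.dumps(r)}
                     for i, r in enumerate(batch)],
        )
The second Lambda would then be triggered by the queue, parse each message body, and issue the corresponding insert or update against the database.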

How can I import nested json data into multiple connected redshift subtables?

I have server log data that looks something like this:
2014-04-16 00:01:31-0400,583 {"Items": [
{"UsageInfo"=>"P-1008366", "Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0},
{"Role"=>"Text", "ProjectCode"=>"", "PublicationCode"=>"", "RetailPrice"=>2},
{"Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0, "ParentItemId"=>"396487"}
]}
What I'd like is a relational database that connects two tables - a UsageLog table and a UsageLogItems table, connected by a primary key id.
You can see that the UsageLog table would have fields like:
UsageLogId
Date
Time
and the UsageLogItems table would have fields like
UsageLogId
UsageInfo
Role
RetailPrice
...
However, I am having trouble writing these into Redshift and being able to associate each record with unique and related ids as keys.
What I currently do is use a Ruby script that reads each line of the log file, parses out the UsageLog info (such as date and time), writes it to the database (writing single lines to Redshift is VERY slow), then creates a CSV of the UsageLogItems data and imports that into Redshift via S3, querying the largest id of the UsageLogs table and using that number to relate the two (this is also slow, because lots of UsageLogs do not contain any items, so I frequently load 0 records from the CSV files).
This currently does work, but it is far too painfully slow to be effective at all. Is there a better way to handle this?
Amazon Redshift supports JSON ingestion using JSONPaths via the COPY command.
http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html
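For illustration, a rough sketch of that approach (none of these names come from the question): the UsageLogItems records are assumed to already sit in S3 as newline-delimited JSON, a jsonpaths file such as {"jsonpaths": ["$.UsageLogId", "$.UsageInfo", "$.Role", "$.RetailPrice"]} maps the fields to columns, and the load is a single COPY, here issued through psycopg2.
import psycopg2

copy_sql = """
    COPY usage_log_items (usage_log_id, usage_info, item_role, retail_price)
    FROM 's3://my-bucket/usage-items/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 's3://my-bucket/jsonpaths/usage_log_items.jsonpaths';
"""

conn = psycopg2.connect("host=my-cluster.redshift.amazonaws.com port=5439 "
                        "dbname=analytics user=loader password=secret")
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # one COPY bulk-loads every file under the S3 prefix
This replaces both slow paths in the current script: no single-row inserts, and no querying for the largest id, since the UsageLogId is carried in each JSON record.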

Big data migration from Oracle to MySQL

I received over 100GB of data with 67 million records from one of the retailers. My objective is to do some market-basket analysis and CLV. This data is a direct SQL dump from one of the tables, with 70 columns. I'm trying to find a way to extract information from this data, as managing it on a small laptop/desktop setup is becoming time-consuming. I considered the following options:
Parse the data and convert it to CSV format. The file size might come down to around 35-40GB, as more than half of the information in each record is column names. However, I may still have to use a DB, as I can't use R or Excel with 66 million records.
Migrate the data to a MySQL DB. Unfortunately I don't have the schema for the table, so I'm trying to recreate it by looking at the data. I may have to replace to_date() in the data dump with str_to_date() to match the MySQL format.
Is there a better way to handle this? All I need to do is extract the data from the SQL dump by running some queries. Hadoop etc. are options, but I don't have the infrastructure to set up a cluster. I'm considering MySQL as I have storage space and some memory to spare.
Suppose I go down the MySQL path, how would I import the data? I'm considering one of the following:
Use sed to replace to_date() with the appropriate str_to_date() inline. Note that I need to do this on a 100GB file. Then import the data using the mysql CLI.
Write a Python/Perl script that reads the file, converts the data, and writes to MySQL directly.
What would be faster? Thank you for your help.
In my opinion writing a script will be faster, because you are going to skip the sed part.
I think you need to set up a server on a separate PC and run the script from your laptop.
Also use tail to quickly grab a part from the bottom of this large file, so you can test your script on that part before running it on the 100GB file.
I decided to go with the MySQL path. I created the schema by looking at the data (I had to increase a few of the column sizes, as there were unexpected variations in the data) and wrote a Python script using the MySQLdb module. The import completed in 4hr 40min on my 2011 MacBook Pro, with 8,154 failures out of 67 million records. Those failures were mostly data issues. Both client and server are running on my MBP.
@kpopovbg, yes, writing a script was faster. Thank you.
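For readers looking for a starting point, a script of that kind could look roughly like the sketch below. This is a hypothetical reconstruction, not the original: it assumes one INSERT statement per line in the dump, a single 'YYYY-MM-DD HH24:MI:SS' TO_DATE format, and that the target table already exists; the file name and credentials are placeholders.
import re
import MySQLdb

# Map Oracle TO_DATE('<value>', '<fmt>') calls to MySQL STR_TO_DATE with a fixed
# format; assumes the dump only ever uses one date format.
TO_DATE_RE = re.compile(r"to_date\('([^']*)',\s*'[^']*'\)", re.IGNORECASE)

def oracle_to_mysql(line):
    return TO_DATE_RE.sub(r"STR_TO_DATE('\1', '%Y-%m-%d %H:%i:%s')", line)

conn = MySQLdb.connect(host="localhost", user="loader", passwd="secret", db="retail")
cur = conn.cursor()

with open("dump.sql") as dump:
    for i, line in enumerate(dump, start=1):
        stmt = oracle_to_mysql(line).strip().rstrip(";")
        if not stmt.upper().startswith("INSERT"):
            continue  # skip DDL, comments, and blank lines
        try:
            cur.execute(stmt)
        except MySQLdb.MySQLError as exc:
            print("line %d failed: %s" % (i, exc))  # log bad rows and keep going
        if i % 10000 == 0:
            conn.commit()

conn.commit()
cur.close()
conn.close()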

How to export data from Amazon DynamoDB into MySQL server

I have no experience dealing with nosql databases such as Amazon AWS DynamoDB.
I have some data stored in Amazon AWS DynamoDB.
Is it possible to export data from DynamoDB to MySQL Server ?
If so, how to go about accomplishing that ?
Thanks,
I would extract the data in CSV format. This "DynamoDBtoCSV" tool seems promising. Then you can import this CSV file into your MySQL database with LOAD DATA INFILE.
The drawback is that you 1. need to create the receiving structure first and 2. repeat the process for each table. But it shouldn't be too complicated to 1. generate a corresponding CREATE TABLE statement from the first line output by DynamoDBtoCSV, and 2. run the operation in a loop from a batch.
Now I am asking myself if MySQL is your best call as a target database. MySQL is a relational database, while DynamoDB is NoSQL (with variable-length aggregates, non-scalar field values, and so on). Flattening this structure into a relational schema may not be such a good idea.
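For illustration, a small sketch of that loop, assuming DynamoDBtoCSV has already produced a file with a header line; every column is created as TEXT for simplicity, LOAD DATA LOCAL INFILE has to be enabled on both client and server, and all names and credentials are placeholders.
import csv
import MySQLdb

def load_csv_into_mysql(csv_path, table_name):
    # Read only the header line to generate the receiving structure.
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    columns = ", ".join("`%s` TEXT" % col for col in header)

    conn = MySQLdb.connect(host="localhost", user="loader", passwd="secret",
                           db="dynamo_export", local_infile=1)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS `%s` (%s)" % (table_name, columns))
    cur.execute(
        "LOAD DATA LOCAL INFILE '%s' INTO TABLE `%s` "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "IGNORE 1 LINES" % (csv_path, table_name)
    )
    conn.commit()
    cur.close()
    conn.close()
Running this once per exported CSV covers the "repeat the process for each table" part.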
Even though this is a pretty old question, I'm still leaving this here for future researchers.
DynamoDB supports streams, which can be enabled on any table (from the Overview section of the DynamoDB table). The stream can then be consumed by a Lambda function (look for the Triggers tab of the DynamoDB table) and written to any storage, including but not limited to MySQL.
Data flow:
DynamoDB update/insert > Stream > Lambda > MySQL.
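A minimal sketch of that flow, assuming the Lambda is attached as a trigger on the table's stream with the NEW_IMAGE view type and that the target MySQL table has a matching primary key; table, column, and connection details are placeholders.
import pymysql
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
conn = pymysql.connect(host="mysql.example.com", user="loader",
                       password="secret", database="mirror")

def handler(event, context):
    with conn.cursor() as cur:
        for record in event["Records"]:
            if record["eventName"] not in ("INSERT", "MODIFY"):
                continue  # a REMOVE event would need its own DELETE statement
            # Convert DynamoDB-typed attributes ({"S": ...}, {"N": ...}) to plain values
            image = record["dynamodb"]["NewImage"]
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            # Store the whole item as text in one column for simplicity;
            # a real mirror would map individual columns explicitly.
            cur.execute(
                "INSERT INTO items (id, payload) VALUES (%s, %s) "
                "ON DUPLICATE KEY UPDATE payload = VALUES(payload)",
                (str(item["id"]), str(item)),
            )
    conn.commit()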

Is it possible to read MongoDB data, process it with Hadoop, and output it into an RDBMS (MySQL)?

Summary:
Is it possible to:
Import data into Hadoop with the «MongoDB Connector for Hadoop».
Process it with Hadoop MapReduce.
Export it with Sqoop in a single transaction.
I am building a web application with MongoDB. While MongoDB works well for most of the work, in some parts I need stronger transactional guarantees, for which I use a MySQL database.
My problem is that I want to read a big MongoDB collection for data analysis, but the size of the collection means that the analytic job would take too long to process. Unfortunately, MongoDB's built-in map-reduce framework would not work well for this job, so I would prefer to carry out the analysis with Apache Hadoop.
I understand that it is possible to read data from MongoDB into Hadoop by using the «MongoDB Connector for Hadoop», which reads data from MongoDB, processes it with MapReduce in Hadoop, and finally outputs the results back into a MongoDB database.
The problem is that I want the output of the MapReduce to go into a MySQL database, rather than MongoDB, because the results must be merged with other MySQL tables.
For this purpose I know that Sqoop can export the result of a Hadoop MapReduce job into MySQL.
Ultimately, I want to read MongoDB data, then process it with Hadoop, and finally output the result into a MySQL database.
Is this possible? Which tools are available to do this?
TL;DR: Set an output formatter that writes to an RDBMS in your Hadoop job:
job.setOutputFormatClass( DBOutputFormat.class );
Several things to note:
Exporting data from MongoDB to Hadoop using Sqoop is not possible. This is because Sqoop uses JDBC, which provides a call-level API for SQL-based databases, but MongoDB is not an SQL-based database. You can look at the «MongoDB Connector for Hadoop» to do this job. The connector is available on GitHub. (Edit: as you point out in your update.)
Sqoop exports are not made in a single transaction by default. Instead, according to the Sqoop docs:
Since Sqoop breaks down export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction.
The «MongoDB Connector for Hadoop» does not seem to force the workflow you describe. According to the docs:
This connectivity takes the form of allowing both reading MongoDB data into Hadoop (for use in MapReduce jobs as well as other components of the Hadoop ecosystem), as well as writing the results of Hadoop jobs out to MongoDB.
Indeed, as far as I understand from the «MongoDB Connector for Hadoop» examples, it would be possible to specify an org.apache.hadoop.mapred.lib.db.DBOutputFormat in your Hadoop MapReduce job to write the output to a MySQL database. Following the example from the connector repository:
job.setMapperClass( TokenizerMapper.class );
job.setCombinerClass( IntSumReducer.class );
job.setReducerClass( IntSumReducer.class );
job.setOutputKeyClass( Text.class );
job.setOutputValueClass( IntWritable.class );
job.setInputFormatClass( MongoInputFormat.class );
/* Instead of:
* job.setOutputFormatClass( MongoOutputFormat.class );
* we use an OutputFormatClass that writes the job results
* to a MySQL database. Beware that the following OutputFormat
* will only write the *key* to the database, but the principle
* remains the same for all output formatters
*/
job.setOutputFormatClass( DBOutputFormat.class );
I would recommend you take a look at Apache Pig (which runs on top of Hadoop's MapReduce). It will output to MySQL (no need to use Sqoop). I used it to do what you are describing. It is possible to do an "upsert" with Pig and MySQL. You can use Pig's STORE command with PiggyBank's DBStorage and MySQL's INSERT ... ON DUPLICATE KEY UPDATE (http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html).
Use the MongoDB Connector for Hadoop (mongo-hadoop) to read data from MongoDB and process it using Hadoop.
Link:
https://github.com/mongodb/mongo-hadoop/blob/master/hive/README.md
Using this connector you can use Pig and Hive to read data from MongoDB and process it using Hadoop.
Example of Mongo Hive table:
CREATE EXTERNAL TABLE TestMongoHiveTable
(
id STRING,
Name STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id","Name":"Name"}')
LOCATION '/tmp/test/TestMongoHiveTable/'
TBLPROPERTIES('mongo.uri'='mongodb://{MONGO_DB_IP}/userDetails.json');
Once the data is in the Hive table you can use Sqoop or Pig to export it to MySQL.
Here is the flow:
MongoDB -> process data using the MongoDB Hadoop connector (Pig) -> store it in a Hive table/HDFS -> export data to MySQL using Sqoop.