How to know the Cygnus notifications table name in Cosmos? - fiware

I'm using Cygnus to send Orion Context Broker notifications to Cosmos via httpfs.
Where is the data sent to Cosmos stored, as seen from Hive? What's the name of the table where the Cygnus data is stored?

The Orion context data persisted by Cygnus in Cosmos is stored in plain text HDFS files. The content of these files, if properly structured, can be loaded into Hive tables, which can then be queried using HiveQL, a SQL-like language.
The way the Hive tables are created depends on the Cygnus version you are using:
Cygnus 0.1: you have to create the Hive external table by yourself. In order to do that:
Log into the Cosmos Head Node using your SSH credentials.
Invoke the Hive CLI by typing hive
Run the following HiveQL statement:
create external table <table_name> (recvTimeTs bigint, recvTime string, entityId string, entityType string, attrName string, attrType string, attrValue string) row format delimited fields terminated by '|' location '/user/<myusername>/<mydataset>/';
Please observe that all the entities' data is stored within one and the same Hive table. This is possible because every line/row within the HDFS files/Hive table refers to an attribute of a certain type belonging to an entity with a certain identifier and type.
Cygnus 0.2: the above Hive external table is automatically created. The table name is <myusername>_<mydataset>. As in Cygnus 0.1, all the entities' data is stored within one and the same Hive table.
Cygnus 0.3 or greater: at the moment of writing this response, Cygnus 0.3 has not yet been released, but within such a release the Orion data will no longer be persisted exclusively by adding a new line/row per attribute; the possibility of adding new lines/rows containing the entity's full attribute list is expected. In that case, since the lines/rows may not all have the same dimensions, it is envisioned that a separate Hive table will be created per entity.
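As an illustrative sketch only (not part of the original answer), once the table exists it can be queried from Python through PyHive; the host, port, username and table name below are placeholders following the Cygnus 0.2 naming convention:

# A minimal sketch, assuming a reachable HiveServer2 endpoint; host, port,
# username and table name are placeholders, not values from the answer.
from pyhive import hive

conn = hive.Connection(host='cosmos.example.org', port=10000, username='myusername')
cursor = conn.cursor()

# All entities share the same table, so filter by entityId/entityType.
cursor.execute(
    "SELECT recvTime, entityId, attrName, attrValue "
    "FROM myusername_mydataset "
    "WHERE entityId = 'Room1' AND entityType = 'Room' "
    "LIMIT 10"
)
for row in cursor.fetchall():
    print(row)

conn.close()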

Related

Data migration from MySQL to MongoDB, I have an issue with conversion between IDs

I'm doing a data migration from MySQL to MongoDB and it's my first time. I followed these steps:
select all data from the specific SQL table and save it in one .CSV file.
set headers on the file data so every object has a key.
import the .csv file into the DB using MongoDB Compass.
The problem is that the IDs in SQL are way different from MongoDB's ObjectId, so how can I handle this?
Note that the old "SQL" database has primary and foreign keys, and my MongoDB schema also has references using ObjectId.
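A minimal sketch of one common way to handle the ID mapping (this is not from the post; the file, collection and column names are hypothetical): generate a fresh ObjectId for every legacy MySQL row, remember the old-id-to-new-id mapping, and use that mapping to rewrite the foreign keys in a second pass.

# Hypothetical sketch: users.csv / orders.csv, the "user_id" foreign key and the
# connection string are all placeholders.
import csv
from bson import ObjectId
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# First pass: give every legacy row a new ObjectId and remember the mapping.
id_map = {}
users = []
with open("users.csv", newline="") as f:
    for row in csv.DictReader(f):
        new_id = ObjectId()
        id_map[row["id"]] = new_id
        users.append({"_id": new_id, "name": row["name"]})
db.users.insert_many(users)

# Second pass: translate the foreign keys through the same mapping.
orders = []
with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        orders.append({"_id": ObjectId(), "user_id": id_map[row["user_id"]], "total": float(row["total"])})
db.orders.insert_many(orders)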

mongoimport error when importing JSON from BigQueryToCloudStorageOperator output

Summary: I am trying to improve our daily transfer of data from BigQuery into a MongoDB Cluster using Airflow - getting mongoimport error at the end.
We need to transfer 5 - 15GB of data daily from BigQuery to MongoDB. This transfer is not brand new data, it replaces data from the previous day that is outdated (our DB size mostly stays the same, does not grow 15GB per day). Our Airflow DAG used to use a PythonOperator to transfer data from BigQuery into MongoDB, which relied on a Python function that did everything (connect to BQ, connect to Mongo, query data into pandas, insert into Mongo, create Mongo indexes). It looked something like the function in this post.
This task took the name of a BigQuery table as a parameter and, using Python, created a table with the same name in our MongoDB cluster. However, it is slow: as_dataframe() is slow, and insert_many() is slow-ish as well. When this task is run for many big tables at once, we also receive out-of-memory errors. To handle this, we've created a separate transfer_full_table_chunked function that also transfers from BigQuery to Mongo, but uses a for-loop, querying/inserting small chunks at a time. This prevents the out-of-memory errors, but it results in 100 queries + inserts per table, which seems like a lot for one table transfer.
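A chunked loop of this kind might look roughly like the sketch below. It is illustrative only (not our actual function); the project, table, connection string and chunk size are placeholders.

# Illustrative sketch: chunked BigQuery -> MongoDB transfer using the
# google-cloud-bigquery page iterator and pymongo. All names are placeholders.
from google.cloud import bigquery
from pymongo import MongoClient

bq = bigquery.Client(project="myproject")
coll = MongoClient("mongodb+srv://user:password@cluster.example.mongodb.net")["mydb"]["mytable"]

rows = bq.query("SELECT * FROM `myproject.mydataset.mytable`").result(page_size=50000)
for page in rows.pages:  # one page per iteration keeps memory bounded
    docs = [dict(row) for row in page]
    if docs:
        coll.insert_many(docs, ordered=False)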
Per an answer from here with a few upvotes:
I would recommend writing data from BigQuery to a durable storage service like Cloud Storage then loading into MongoDB from there.
Following the suggestion, we've created this task:
transfer_this_table_to_gcs = BigQueryToCloudStorageOperator(
    task_id='transfer_this_table_to_gcs',
    source_project_dataset_table='myproject.mydataset.mytable',
    destination_cloud_storage_uris=['gs://my-bucket/this-table/conference-*.json'],
    export_format='JSON',
    bigquery_conn_id='bigquery_conn_id'
)
...which does successfully export our BigQuery table into GCS, automatically creating multiple files using the partition key we've set in BigQuery for the table. Here's a snippet of one of those files (this is newline delimited JSON I think):
However, when we try to run mongoimport --uri "mongodb+srv://user#cluster-cluster.dwxnd.gcp.mongodb.net/mydb" --collection new_mongo_collection --drop --file path/to/conference-00000000000.json
we get the error:
2020-11-13T11:53:02.709-0800 connected to: localhost
2020-11-13T11:53:02.866-0800 dropping: mydb.new_mongo_collection
2020-11-13T11:53:03.024-0800 Failed: error processing document #1: invalid character '_' looking for beginning of value
2020-11-13T11:53:03.025-0800 imported 0 documents
I see that the very first thing in the file is a _id, which is probably the invalid character that the error message is referring to. This _id is explicitly in there for MongoDB as the unique identifier for the row. I'm not quite sure what to do about this / if this is an issue with the data format (ndjson), or with our table specifically.
How can I transfer large data daily from BigQuery to MongoDB in general?

Parsing CSV in Athena by column names

I'm trying to create an external table based on CSV files. My problem is that not all CSV files are the same (for some of them there are missing columns) and the order of columns is not always the same.
The question is whether I can make Athena parse the columns by name instead of by their order.
No, Athena cannot parse the columns by name instead of by their order. The data should be in exactly the same order as defined in your table schema. You will need to preprocess your CSVs and change the column order before writing them to S3.
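For instance, a minimal preprocessing sketch with pandas (the column names are hypothetical; whether you keep a header row depends on how your Athena table/SerDe is defined):

import pandas as pd

# Column order must match the Athena table schema; these names are hypothetical.
EXPECTED = ["id", "name", "price", "created_at"]

df = pd.read_csv("input.csv")
# Add any missing columns as empty, then reorder to the schema's order.
for col in EXPECTED:
    if col not in df.columns:
        df[col] = None
df = df[EXPECTED]
df.to_csv("normalized.csv", index=False)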
Adding a quote from the AWS Athena documentation:
When you create a new table schema in Athena, Athena stores the schema in a data catalog and uses it when you run queries.
Athena uses an approach known as schema-on-read, which means a schema is projected on to your data at the time you execute a query. This eliminates the need for data loading or transformation.
When you create a database and table in Athena, you are simply describing the schema and the location where the table data are located in Amazon S3 for read-time querying. Database and table, therefore, have a slightly different meaning than they do for traditional relational database systems because the data isn't stored along with the schema definition for the database and table.
Reference: Tables and Databases in Athena

Query metadata from HIVE using MySQL as metastore

I am looking for a way to query the metadata of my Hive data with a HiveQL command.
I configured a MySQL metastore, but it is necessary to query the metadata via a Hive command, because I then want to access the data through an ODBC connection to the Hive system.
To see the metadata from Hive, you must use the commands that display the DDL. You'll probably need to parse the output.
Database metadata:
describe database extended <db_name>
To see table and columns metadata:
describe formatted <db_name>.<table_name>
Another option is to directly connect to the metastore database, but you'll be outside Hive.
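For example, a minimal sketch of querying the MySQL metastore directly from Python (DBS and TBLS are standard Hive metastore schema tables; the host, credentials and database name are placeholders):

import pymysql

# Placeholder connection details for the MySQL-backed metastore.
conn = pymysql.connect(host="localhost", user="hive", password="hivepass", database="metastore")
with conn.cursor() as cur:
    # List every table per database straight from the metastore schema.
    cur.execute(
        "SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE "
        "FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID "
        "ORDER BY d.NAME, t.TBL_NAME"
    )
    for db_name, tbl_name, tbl_type in cur.fetchall():
        print(db_name, tbl_name, tbl_type)
conn.close()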
You can do it currently using Hive JDBC StorageHandler: https://github.com/qubole/Hive-JDBC-Storage-Handler
Example of table creation from their page:
DROP TABLE HiveTable;
CREATE EXTERNAL TABLE HiveTable(
    id INT,
    id_double DOUBLE,
    names STRING,
    test INT
)
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
TBLPROPERTIES (
    "mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
    "mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
    "mapred.jdbc.username"="root",
    "mapred.jdbc.input.table.name"="JDBCTable",
    "mapred.jdbc.output.table.name"="JDBCTable",
    "mapred.jdbc.password"="",
    "mapred.jdbc.hive.lazy.split"= "false"
);
I tested it; it works fine with MySQL, and FilterPushDown also works.
There are already a bunch of tables in the SYS db and the INFORMATION_SCHEMA db which map to the tables in the RDBMS.
Check this file:
https://github.com/apache/hive/blob/1e3e07c87e71dc16f05ad269b250d65ad7c02232/metastore/scripts/upgrade/hive/hive-schema-4.0.0-alpha-2.hive.sql
If the table isn't there, you can create Hive tables pointing to MySQL, the same way it is done in this file.
E.g.:
https://github.com/apache/hive/blob/1e3e07c87e71dc16f05ad269b250d65ad7c02232/metastore/scripts/upgrade/hive/hive-schema-4.0.0-alpha-2.hive.sql#L1520-L1546
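As an aside, a quick sketch of reading those schema tables over HiveServer2 from Python (this assumes Hive 3+ where the INFORMATION_SCHEMA database exists; host, port and username are placeholders):

from pyhive import hive

# Placeholder HiveServer2 connection details.
conn = hive.Connection(host="localhost", port=10000, username="hive")
cursor = conn.cursor()
cursor.execute("SELECT table_schema, table_name FROM information_schema.tables LIMIT 20")
for schema, table in cursor.fetchall():
    print(schema, table)
conn.close()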

Neo4j - How to emulate MySQL multiple schema deployment in Neo4j

With a single instance of the Neo4j server (not embedded), how can I add a multiple-schema kind of deployment (similar to MySQL) in Neo4j?
How is it possible to add/delete a schema at runtime in Neo4j deployed as a server?
You can translate each table to a node type, columns to node (or relationship) properties, and foreign keys to relationships (where you can store more properties).
Neo4j is schema-free, but what you can do in Neo4j is create nodes linked to your root node, each one representing a "class". If you link all the instances to the "class" node, you can navigate through them like iterating over a SQL-like table, or know the "schema" that the instances of this node follow.
Here is an example about how to model categories from SQL to Neo4j:
http://blog.neo4j.org/2010/03/modeling-categories-in-graph-database.html
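A minimal sketch of the "class node" pattern described above, using the official Neo4j Python driver (the URI, credentials, labels and properties are placeholders):

from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One "class" node per former SQL table, hanging off a root node.
    session.run(
        "MERGE (root:Root) "
        "MERGE (c:Class {name: 'User'}) "
        "MERGE (root)-[:HAS_CLASS]->(c)"
    )
    # Each former row becomes an instance node linked to its class node;
    # a former foreign key would become a relationship instead of a column.
    session.run(
        "MATCH (c:Class {name: 'User'}) "
        "CREATE (u:User {id: 1, name: 'Ann'})-[:INSTANCE_OF]->(c)"
    )

driver.close()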