Parsing multiple JSON schemas with Spark

I need to collect a few key pieces of information from a large number of somewhat complex nested JSON messages which are evolving over time. Each message refers to the same type of event but the messages are generated by several producers and come in two (and likely more in the future) schemas. The key information from each message is similar but the mapping to those fields is dependent on the message type.
I can’t share the actual data but here is an example:
Message A
- header:
|- attribute1
|- attribute2
- typeA:
|- typeAStruct1:
||- property1
|- typeAStruct2:
||- property2
Message B
- attribute1
- attribute2
- contents:
|- message:
||- TypeB:
|||- property1
|||- TypeBStruct:
||||- property2
I want to produce a table of data which looks something like this regardless of message type:
| MessageSchema | Property1 | Property2 |
| :------------ | :-------- | :-------- |
| MessageA      | A1        | A2        |
| MessageB      | B1        | B2        |
| MessageA      | A3        | A4        |
| MessageB      | B3        | B4        |
My current strategy is to read the data with schema A and union it with the data read with schema B. Then I can filter out the nulls that result from parsing a type A message with the B schema and vice versa. This seems very inefficient, especially once a third or fourth schema emerges. I would like to parse each message correctly on the first pass and apply the correct schema.

As I see it, there is only one way:
For each message type, create an 'adapter' that builds a DataFrame from the input and transforms it to the common-schema DataFrame.
Then union the outputs of the adapters.
Obviously, if you change the 'common' schema, you will need to adjust your 'adapters' as well.
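For illustration, here is a minimal PySpark sketch of that adapter approach, under some assumptions: schema_a and schema_b are StructType definitions you maintain per message type, the input path is hypothetical, and type A messages are detected by the presence of a top-level typeA key.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw message text once; schema_a / schema_b are assumed to be
# StructType definitions maintained per message type.
raw = spark.read.text("path/to/messages")

# Route lines to the right adapter; here type A is detected by the
# presence of a top-level "typeA" key (an assumption for this sketch).
is_type_a = F.get_json_object("value", "$.typeA").isNotNull()

def adapt_type_a(df):
    # Parse with schema A and map onto the common schema.
    m = df.select(F.from_json("value", schema_a).alias("m"))
    return m.select(
        F.lit("MessageA").alias("MessageSchema"),
        F.col("m.typeA.typeAStruct1.property1").alias("Property1"),
        F.col("m.typeA.typeAStruct2.property2").alias("Property2"),
    )

def adapt_type_b(df):
    # Parse with schema B and map onto the common schema.
    m = df.select(F.from_json("value", schema_b).alias("m"))
    return m.select(
        F.lit("MessageB").alias("MessageSchema"),
        F.col("m.contents.message.TypeB.property1").alias("Property1"),
        F.col("m.contents.message.TypeB.TypeBStruct.property2").alias("Property2"),
    )

# Union the adapter outputs; adding a schema C later is one more adapter.
common = adapt_type_a(raw.filter(is_type_a)).unionByName(
    adapt_type_b(raw.filter(~is_type_a))
)

Each new schema then costs one adapter plus one union branch, and each message is parsed once with the right schema instead of once per schema.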


More than 255 Fields in Access 2000/2010

I am converting a 20-year-old system from DBase IV into Access 2010, via Access 2000, in order to be more suitable for Windows 10. However, I have about 350 fields in the database as it is a parameters table, and MS Access 2000 and MS Access 2010 are complaining about it. I have repaired the database to remove the internal field-count problem, but I am rather surprised that Windows 10 software would have such a low restriction. Does anyone know how to bypass this? Obviously I can break it into 2 tables, but this seems rather archaic.
When you start to run up against limitations such as this, it reeks of poor database design.
Given that you state that the table in question is a 'parameters' table, with so many parameters, have you considered structuring the table such that each parameter occupies its own record?
For example, consider the following approach, where ParamName is the primary key for the table:
+----------------+------------+
| ParamName (PK) | ParamValue |
+----------------+------------+
| Param1         | Value1     |
| Param2         | Value2     |
| ...            |            |
| ParamN         | ValueN     |
+----------------+------------+
Alternatively, if there is the possibility that each parameter may have multiple values, you can simply add one additional field to differentiate between multiple values for the same parameter, e.g.:
+----------------+--------------+------------+
| ParamName (PK) | ParamID (PK) | ParamValue |
+----------------+--------------+------------+
| Param1         | 1            | Value1     |
| Param1         | 2            | Value2     |
| Param1         | 3            | Value3     |
| Param2         | 1            | Value2     |
| ...            | ...          | ...        |
| ParamN         | 1            | Value1     |
| ParamN         | N            | ValueN     |
+----------------+--------------+------------+
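For reference, a rough DDL sketch of the two-column-key variant (Access/Jet SQL; the table name tblParams is an assumption, since "Parameters" itself is a reserved word in Access SQL):

CREATE TABLE tblParams (
    ParamName  TEXT(50)  NOT NULL,
    ParamID    LONG      NOT NULL,
    ParamValue TEXT(255),
    CONSTRAINT PK_tblParams PRIMARY KEY (ParamName, ParamID)
);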
I had a similar problem - we have more than 300 fields in one Contact table on SQL Server linked to Access. You probably do not need to display 255 fields on one form - that would not be user friendly. You can split it into several sub-forms with different underlying queries, each returning fewer fields than the limit. All sub-forms would be linked by the ID.
Sometimes splitting tables as suggested above is not the best idea because of performance.
As Lee Mac described, restructuring the "parameters" table really would be your better choice. You could then define constants for each parameter name, to be used in code, preventing accidental misspellings in case a parameter is used in many places.
Then you could create a function (or functions) that takes the name of the parameter setting you are looking for, queries the table with that name as the key, and returns the value. Not being a VB/Access developer, I would still expect that you can't overload a single function to return different data types such as string, int, dates, etc. So you may want functions something like the samples below in C#, but the principle would be the same.
public int GetAppParmInt( string whatField )
public DateTime GetAppParmDate( string whatField )
public string GetAppParmString( string whatField )
etc...
Then you could get the values by calling the function whose sole purpose is to query the parameters table for that one key and return the value as stored.
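For example, a minimal sketch of one such function in C# (the OleDb provider, the connection string, and the table/column names are assumptions for illustration):

using System.Data.OleDb;

public static string GetAppParmString(string whatField)
{
    // connectionString is assumed to be defined elsewhere in the app.
    using (var conn = new OleDbConnection(connectionString))
    using (var cmd = new OleDbCommand(
        "SELECT ParamValue FROM tblParams WHERE ParamName = ?", conn))
    {
        // OleDb parameters are positional, so the "?" name is a placeholder.
        cmd.Parameters.AddWithValue("?", whatField);
        conn.Open();
        return (string)cmd.ExecuteScalar();
    }
}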
Hopefully a combination of the solutions offered here can help you with your upgrade, even if your parameter table (expanding a bit on Lee Mac's answer) has one column per data type you are storing, corresponding to the GetAppParm[type] functions:
ParmsTable
+------+-----------------+---------+------------+--------------+
| PkID | ParmDescription | ParmInt | ParmDate   | ParmString   |
+------+-----------------+---------+------------+--------------+
| 1    | CompanyName     |         |            | Your Company |
| 2    | StartFiscalYear |         | 2019-06-22 |              |
| 3    | CurrentQuarter  | 4       |            |              |
| 4    | ...             |         |            |              |
+------+-----------------+---------+------------+--------------+
Then you don't have to worry about conversions all over the place: values are stored in the proper data type you expect and are returned as that type.

Using MySQL JSON field to join on a table with custom fields

So I made this system to store custom objects with custom fields for an app that I'm developing. First I have object_def where I save the object definitions:
id  | name    | fields
------------------------------------------------------------
101 | Group 1 | [{"name": "Title", "id": "AbCdE123"}, ...]
102 | Group 2 | [{"name": "Name", "id": "FgHiJ456"}, ...]
So we have id (INT), name (VARCHAR) and fields (LONGTEXT). The fields column holds the object's field definitions, shaped like this: {id: string, type: string, name: string}[].
Now in the object table, I have this:
id  | object_def_id | object_values
------------------------------------------------------------
235 | 101           | {"AbCdE123": "The Object", ... }
236 | 102           | {"FgHiJ456": "John Perez", ... }
Where object_values is also a LONGTEXT. With that system, I'm able to show the objects in a table in my app using JSON.parse().
Now I've learned that there is a JSON type in MySQL and I want to use it to do queries and such (I'm really new to this).
I've changed the LONGTEXT columns to JSON, and now I want to do a SELECT that shows the results like this:
#Select objects in group 1:
id  | group   | Title      | ... | other_custom_field
-------------------------------------------------------
235 | Group 1 | The Object | ... | other_custom_value

#Select objects in group 2:
id  | group   | Name       | ... | other_custom_field
-------------------------------------------------------
236 | Group 2 | John Perez | ... | other_custom_value
Id, then group name (I can do this with INNER JOIN) and then all the custom fields with the respective values.
Is this possible? How can I achieve this (hopefully without changing my database structure)? I'm learning MySQL, SQL and databases as I go so I really appreciate your help. Thanks!
Problems I see with your design:
Incorrect JSON format.
[{name: 'Title', id: 'AbCdE123'}, ...]
Should be:
[{"name": "Title", "id": "AbCdE123"}, ...]
You should use the JSON data type instead of LONGTEXT, because JSON will at least reject invalid JSON syntax.
Setting column headings based on data. You can't do this in SQL. Columns and headings must be fixed at the time you prepare the query. You can't do an SQL query that changes its own column headings.
Your object def has an array of attributes, but there's no way in MySQL 5.7 to loop over the "rows" of a JSON array. You'll need the JSON_TABLE() function in MySQL 8.0.
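For illustration, a minimal JSON_TABLE() sketch that unnests the fields array of object_def (the output column types here are assumptions):

SELECT d.id, d.name, f.field_id, f.field_name
FROM object_def AS d,
     JSON_TABLE(
       d.fields,
       '$[*]' COLUMNS (
         field_id   VARCHAR(20)  PATH '$.id',
         field_name VARCHAR(100) PATH '$.name'
       )
     ) AS f;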
That will get you closer to being able to look up object values, but then you'll still have to pivot the data into the result set you describe, with one attribute in each column, as if the data had been stored in a traditional way. But SQL doesn't allow you to do dynamic pivoting in a single query. You can't make an SQL query that dynamically grows its own select-list based on the data it finds.
This all makes me wonder...
Why don't you just store the data in the traditional way?
Create a table per object type. Add one column to that table per attribute. That way you get column names. You get column types. You get column constraints — for example, how would you simulate NOT NULL or UNIQUE in your current system?
If you don't want to use SQL, then don't. There are alternatives, like document databases or key/value databases. But don't torture poor SQL by using it to implement an Inner-Platform.

What's the best way to retrieve array data from MySQL

I'm storing an object / data structure like this inside a MySQL (actually a MariaDB) database:
{
  idx: 7,
  a: "content A",
  b: "content B",
  c: ["entry c1", "entry c2", "entry c3"]
}
And to store it I'm using 2 tables, very similar to the method described in this answer: https://stackoverflow.com/a/17371729/3958875
i.e.
Table 1:
+-----+---+---+
| idx | a | b |
+-----+---+---+
Table 2:
+------------+-------+
| owning_obj | entry |
+------------+-------+
And then made a view that joins them together, so I get this:
+-----+------------+------------+-----------+
| idx | a          | b          | c         |
+-----+------------+------------+-----------+
| 7 | content A1 | content B1 | entry c11 |
| 7 | content A1 | content B1 | entry c21 |
| 7 | content A1 | content B1 | entry c31 |
| 8 | content A2 | content B2 | entry c12 |
| 8 | content A2 | content B2 | entry c22 |
| 8 | content A2 | content B2 | entry c32 |
+-----+------------+------------+-----------+
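(The view is roughly the following; the table and column names are assumed from the sketches above.)

CREATE VIEW combined AS
SELECT t1.idx, t1.a, t1.b, t2.entry AS c
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.owning_obj = t1.idx;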
My question is: what is the best way to get this back into object form? (E.g. I want an array of objects of the type specified above for all entries with idx between 5 and 20.)
There are two ways I can think of, but neither seems very efficient.
Firstly, we can just send this whole table back to the server, which can build a hashmap keyed on the primary key (or some other unique index), collect up the different c values, and rebuild the objects that way. But that means sending a lot of duplicate data, and it takes a bit more memory and processing time to rebuild on the server. This method also won't scale pleasantly if we have multiple arrays, or arrays within arrays.
The second method would be to do multiple queries: filter Table 1 to get the list of idx's you want, then for each idx send a query for Table 2 where owning_obj = current idx. This means sending a whole lot more queries.
Neither of these options seems very good, so I'm wondering if there is a better way. Currently I'm thinking it can be something that utilizes JSON_OBJECT(), but I'm not sure how.
This seems like a common situation, but I can't seem to find the exact wording to search for to get the answer.
PS: The server interfacing with MySQL/MariaDB is written in Rust; I don't think this is relevant to this question though.
You can use GROUP_CONCAT to combine all the c values into a comma-separated string.
SELECT t1.idx, t1.a, t1.b, GROUP_CONCAT(entry) AS c
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t1.idx = t2.owning_obj
GROUP BY t1.idx
Then explode the string in PHP:
$result_array = [];
while ($row = $result->fetch_assoc()) {
    $row['c'] = explode(',', $row['c']);
    $result_array[] = $row;
}
However, if the entries can be long, make sure you increase group_concat_max_len.
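For example (the value here is an arbitrary assumption; size it to your data):

SET SESSION group_concat_max_len = 1000000;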
If you're using MySQL 8.0 you can also use JSON_ARRAYAGG(). This will create a JSON array of the entry values, which you can convert to a PHP array using json_decode(). This is a little safer, since GROUP_CONCAT() will mess up if any of the values contains a comma. You can change the separator, but you need one that will never appear in any value. Unfortunately, JSON_ARRAYAGG() isn't in MariaDB.
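A sketch of the same query using JSON_ARRAYAGG() (MySQL 8.0):

SELECT t1.idx, t1.a, t1.b, JSON_ARRAYAGG(t2.entry) AS c
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t1.idx = t2.owning_obj
GROUP BY t1.idx;

On the PHP side, $row['c'] = json_decode($row['c']); then replaces the explode() call.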

"You cannot add or change a record because a related record is required", but related record exists?

I have two related tables, results and userID.
results looks like this:
+----+--------+--------+
| ID | userID | result |
+----+--------+--------+
| 1  | abc    | 124    |
| 2  | abc    | 792    |
| 3  | def    | 534    |
+----+--------+--------+
userID looks like this:
+----+--------+--------+
| id | userID | name   |
+----+--------+--------+
| 1  | abc    | Angela |
| 2  | def    | Gerard |
| 3  | zxy    | Enrico |
+----+--------+--------+
In results, the userID field is a lookup field; it stores userID.id but the combo box has userID.userID as its choices.
When I try to enter data into results by setting the userID combo box and entering a value for result, I get this error message:
You cannot add or change a record because a related record
is required in table `userID`.
This is strange, because I'm specifically selecting a value that's provided in the userID combo box.
Oddly, there are about 100 rows of data already in results with the same value for userID.
I thought this might be a database corruption issue, so I created a blank database and imported all the tables into it. But I still got the same error. What's going on here?
Both tables include a text field named userID, and you are using that field in the relationship, which enforces referential integrity.
The problem you're facing is due to the Lookup field properties. This is the Row Source:
SELECT [userID].ID, [userID].userID FROM userID ORDER BY [userID].[userID];
But the value which gets stored (the Bound Column property) is the first column of that SELECT statement, which is the Long Integer [userID].ID. That number will not satisfy the relationship, which requires results.userID = [userID].userID.
You must change the relationship or change the Lookup properties so both reference the same field value.
But if it were me, I would just eliminate the Lookup on the grounds that simple operations (such as this) become unnecessarily confusing when Lookup fields are involved. Make results.userID a plain numeric or text field. If you want some kind of user-friendly drop-down for data entry, build a form with a combo or list box.
For additional arguments against Lookup fields, see The Evils of Lookup Fields in Tables.
If you are using a parameter query, make sure you have them in the same order as the table you are modifying and the query you have created. You might have one parameter inserting the conflicting data. Parameters are used in the order they are created...not the name of the parameter. I had the same problem and all I had to do was switch the order they were in so they matched the query. This is an old thread, so I hope this helps someone who is just now having this problem.

How to create dynamic 'types' in database

My question relates to a PHP website with a MySQL database.
Currently in the database we have a 'token' entity. A token can hold 0 or more 'object' entities, and each of these objects can be of type x, y, or z.
So:
+--------------+
| Token        |
+--------------+

+--------------+
| token_object |
|--------------|
| tokenID      |
| objectID     |
+--------------+

+--------------+
| object       |
|--------------|
| objectType   |
+--------------+

+---------+  +---------+  +---------+
| objectX |  | objectY |  | objectZ |
+---------+  +---------+  +---------+
Now this works OK-ish, because when I want the data for an object (which, due to the nature of our caching system, is almost always one object at a time), I can just make two queries to get the data from object and objectX/Y/Z:
(SELECT * FROM object WHERE id = $id)
and
switch ($objectType)
    case 'objectX':
        SELECT * FROM objectX WHERE id = $id
    case 'objectY':
        etc etc..
But now I've been asked to allow users of the website to create new object types, i.e. objectA/B etc.
I could just create a new objectXn table for each new object type, and by choosing a good value for objectType that would work
(SELECT * FROM $objectType WHERE blabla)
but that doesn't seem to be a very clean solution, especially the creation of new tables.
Are there any better solutions?
Worth noting: I expect that there will be a lot of objects relative to tokens, and that extra object types will be created rarely, if ever.
What exactly are you storing? Do you need complex queries against it, or do you need simple object storage by id? In the latter case, you can create a table that has an ID, a Type and a Data column, where Data is the serialized object that you can restore in your code (for example, storing your object as JSON and de-serializing it using the stored Type and Data). A little more context in your question would be appreciated if this doesn't satisfy your needs.
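A minimal sketch of that generic storage table (all names are assumptions; use LONGTEXT for data on versions without the JSON type):

CREATE TABLE object_store (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    type VARCHAR(50)  NOT NULL,  -- which object type to deserialize into
    data JSON         NOT NULL   -- the serialized object itself
);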