Manual Insert into MySQLx Collection - mysql

I have a large amount of JSON data that needs to be inserted into a MySQLx Collection table. The current Node implementation keeps crashing when I attempt to load my JSON data, and I suspect it's because I'm inserting too much at once through the collection API. I'd like to manually insert the data into the database using a traditional SQL statement (in the hope that this will get me past this Node.js crash).
The problem is that I have this table def:
+--------------+---------------+------+-----+---------+-------------------+
| Field        | Type          | Null | Key | Default | Extra             |
+--------------+---------------+------+-----+---------+-------------------+
| doc          | json          | YES  |     | NULL    |                   |
| _id          | varbinary(32) | NO   | PRI | NULL    | STORED GENERATED  |
| _json_schema | json          | YES  |     | NULL    | VIRTUAL GENERATED |
+--------------+---------------+------+-----+---------+-------------------+
But when running
insert into documents values ('{}', DEFAULT, DEFAULT)
I get:
ERROR 3105 (HY000): The value specified for generated column '_id' in table 'documents' is not allowed.
I've tried not providing the DEFAULTs, using NULL (but _id doesn't allow NULL even though that's listed as its default), 0 for _id, other numbers, and uuid_to_bin(uuid()), but I still get the same error.
How can I insert this data into the table directly? (I'm using session.sql('INSERT...').bind(JSON.stringify(data)).execute() with the @mysql/xdevapi library.)

The _id column is auto-generated from the value of the namesake field in the JSON document. When you use the CRUD interface to insert documents, the X Plugin is capable of generating a unique value for this field. However, by executing a plain SQL statement, you are bypassing that logic. So, you are able to insert documents if you generate the _ids yourself; otherwise you will bump into that error.
As an example (using crypto.randomInt()):
const { randomInt } = require('crypto')

session.sql('insert into documents (doc) values (?)')
  .bind(JSON.stringify({ _id: randomInt(Math.pow(2, 48) - 1) }))
  .execute()
Though I'm curious about the issue with the CRUD API and I wanted to see if I was able to reproduce it as well. How are you inserting those documents in that case and what kind of feedback (if any) is provided when it "crashes"?
Disclaimer: I'm the lead developer of the MySQL X DevAPI connector for Node.js

Related

More than 255 Fields in Access 2000/2010

I am converting a 20-year old system from DBase IV into Access 2010, via Access 2000, in order to be more suitable for Windows 10. However, I have about 350 fields in the database as it is a parameters table, and MS-Access 2000 and MS-Access 2010 are complaining about it. I have repaired the database to remove the internal count problem but am rather surprised that Windows 10 software would have such a low restriction. Does anyone know how to bypass this? Obviously I can break it into 2 tables but this seems rather archaic.
When you start to run up against limitations such as this, it reeks of poor database design.
Given that you state that the table in question is a 'parameters' table, with so many parameters, have you considered structuring the table such that each parameter occupies its own record?
For example, consider the following approach, where ParamName is the primary key for the table:
+----------------+------------+
| ParamName (PK) | ParamValue |
+----------------+------------+
| Param1         | Value1     |
| Param2         | Value2     |
| ...            |            |
| ParamN         | ValueN     |
+----------------+------------+
Alternatively, if there is the possibility that each parameter may have multiple values, you can simply add one additional field to differentiate between multiple values for the same parameter, e.g.:
+----------------+--------------+------------+
| ParamName (PK) | ParamID (PK) | ParamValue |
+----------------+--------------+------------+
| Param1         | 1            | Value1     |
| Param1         | 2            | Value2     |
| Param1         | 3            | Value3     |
| Param2         | 1            | Value2     |
| ...            | ...          | ...        |
| ParamN         | 1            | Value1     |
| ParamN         | N            | ValueN     |
+----------------+--------------+------------+
I had a similar problem - we have more than 300 fields in one Contact table on SQL Server linked to Access. You probably do not need to display 255 fields on one form - that would not be user friendly. You can split it into several sub-forms, each with a different underlying query containing fewer fields than the limit. All sub-forms would be linked by the ID.
Sometimes splitting tables as suggested above is not the best idea because of performance.
As Lee Mac described, a change in the structure of your "parameters" table really would be your better choice. You could then define constants for each of these parameters to be used in code, to prevent accidental misspellings later in case they are used in many places.
Then you could create a function (or functions) that takes the name of the parameter setting you are looking for, queries the table using it as the key, and returns the value. Not being a VB/Access developer, I would think you can't overload functions so that a single function returns different data types such as string, int, date, etc. So you may want functions something like the samples below in C#; the principle would be the same.
public int GetAppParmInt( string whatField )
public DateTime GetAppParmDate( string whatField )
public string GetAppParmString( string whatField )
etc...
Then you could get the values by calling the function whose sole purpose is to query the parameters table for that one key and return the value as stored.
Hopefully a combination of the solutions offered here can help you in your upgrade, even if your parameter table (expanding a bit on Lee Mac's answer) has a column for each data type you are storing, corresponding to the "GetAppParm[type]" functions:
ParmsTable
+------+-----------------+---------+------------+--------------+
| PkID | ParmDescription | ParmInt | ParmDate   | ParmString   |
+------+-----------------+---------+------------+--------------+
| 1    | CompanyName     |         |            | Your Company |
| 2    | StartFiscalYear |         | 2019-06-22 |              |
| 3    | CurrentQuarter  | 4       |            |              |
| ...  | etc.            |         |            |              |
+------+-----------------+---------+------------+--------------+
Then you don't have to worry about changes / data conversions all over the place. The values are stored in the proper data type you expect, and are returned as that type.
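To make that concrete, here is an illustrative sketch in Python against a generic DB-API connection (the answer's own samples are only C# signatures; the table and column names follow the hypothetical ParmsTable above, and the connection itself is an assumption):
import sqlite3  # stand-in here for the Access/ODBC connection; the principle is the same
from datetime import date

def get_app_parm_int(conn, parm_description):
    # Single-purpose lookup: fetch the one parameter row and return the typed column.
    row = conn.execute(
        "SELECT ParmInt FROM ParmsTable WHERE ParmDescription = ?",
        (parm_description,)).fetchone()
    return row[0] if row else None

def get_app_parm_date(conn, parm_description):
    row = conn.execute(
        "SELECT ParmDate FROM ParmsTable WHERE ParmDescription = ?",
        (parm_description,)).fetchone()
    return date.fromisoformat(row[0]) if row and row[0] else None

def get_app_parm_string(conn, parm_description):
    row = conn.execute(
        "SELECT ParmString FROM ParmsTable WHERE ParmDescription = ?",
        (parm_description,)).fetchone()
    return row[0] if row else None

# Usage:
# quarter = get_app_parm_int(conn, "CurrentQuarter")   # -> 4
# company = get_app_parm_string(conn, "CompanyName")   # -> "Your Company"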

How to INSERT or UPDATE a large number of rows (regarding the auto_increment value of a table)

I have a MySQL table with around 3 million rows (listings) at the moment. These listings are updated 24/7 (around 30 listings/sec) by a Python script (Scrapy) using pymysql - so the performance of the queries is relevant!
If a listing doesn't exist (i.e. the UNIQUE url), a new record will be inserted (which is around every hundredth listing). The id is set to auto_increment and I am using an INSERT INTO listings ... ON DUPLICATE KEY UPDATE last_seen_at = CURRENT_TIMESTAMP. The update on last_seen_at is necessary to check if the item is still online, as I am crawling the search results page with multiple listings on it and not checking each individual URL each time.
+--------------+-------------------+-----+----------------+
| Field        | Type              | Key | Extra          |
+--------------+-------------------+-----+----------------+
| id           | int(11) unsigned  | PRI | auto_increment |
| url          | varchar(255)      | UNI |                |
| ...          | ...               |     |                |
| last_seen_at | timestamp         |     |                |
| ...          | ...               |     |                |
+--------------+-------------------+-----+----------------+
The problem:
At first, it all went fine. Then I noticed larger and larger gaps in the auto_incremented id column and found out it's due to the INSERT INTO ... statement: MySQL attempts to do the insert first. This is when the id gets auto incremented. Once incremented, it stays. Then the duplicate is detected and the update happens.
Now my question is: which is the best solution regarding performance, with a long-term perspective?
Option A: Set the id column to unsigned INT or BIGINT and just ignore the gaps. Problem here is I'm afraid of hitting the maximum after a couple of years updating. I'm already at an auto_increment value of around 12,000,000 for around 3,000,000 listings after two days of updating...
Option B: Switch to an INSERT IGNORE ... statement, check the affected rows and UPDATE ... if necessary.
Option C: SELECT ... the existing listings, check existence within Python and INSERT ... or UPDATE ... accordingly.
Any other wise options?
Additional info: I need an id for information related to a listing that is stored in other tables (e.g. listings_images, listings_prices, etc.). IMHO using the URL (which is unique) won't be the best option for foreign keys.
+------------+-------------------+
| Field      | Type              |
+------------+-------------------+
| listing_id | int(11) unsigned  |
| price      | int(9)            |
| created_at | timestamp         |
+------------+-------------------+
I was in the exact same situation as yours.
I had millions of records being entered into a table by a scraper, and the scraper ran every day.
I tried the following, but failed:
Load all URLs into a Python tuple or list and, while scraping, only scrape those which are not in the list - FAILED, because loading the URLs into a Python tuple or list consumed so much of the server's RAM.
Check each record before entering - FAILED, because it made the INSERT process too slow: it first has to query a table with millions of rows and then decide whether to INSERT or not.
SOLUTION THAT WORKED FOR ME (for a table with millions of rows):
I removed the id column because it is irrelevant and I do not need it.
Make url the PRIMARY KEY, since it will be unique.
Add a UNIQUE INDEX -- THIS IS A MUST - it will increase your table's performance drastically.
Do bulk inserts instead of inserting one-by-one (see the pipeline code below).
Notice it uses INSERT IGNORE INTO, so only new records will be entered; if a record already exists, it will be ignored completely.
If you use REPLACE INTO instead of INSERT IGNORE INTO in MySQL, new records will still be entered, but if a record already exists, it will be updated (replaced).
class BatchInsertPipeline(object):
    def __init__(self):
        self.items = []
        self.query = None

    def process_item(self, item, spider):
        table = item['_table_name']
        del item['_table_name']
        if self.query is None:
            placeholders = ', '.join(['%s'] * len(item))
            columns = '`' + '`, `'.join(item.keys()).rstrip(' `') + '`'
            self.query = 'INSERT IGNORE INTO ' + table + ' ( %s ) VALUES ( %s )' \
                % (columns, placeholders)
        self.items.append(tuple(item.values()))
        if len(self.items) >= 500:
            self.insert_current_items(spider)
        return item

    def insert_current_items(self, spider):
        spider.cursor.executemany(self.query, self.items)
        self.items = []

    def close_spider(self, spider):
        self.insert_current_items(spider)
        self.items = []
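For context, the pipeline above assumes the spider exposes a pymysql cursor as spider.cursor. Here is a minimal sketch of that spider-side setup (the pipeline path, connection details, and autocommit choice are assumptions, not part of the original answer):
# settings.py (illustrative module path and priority)
# ITEM_PIPELINES = {'myproject.pipelines.BatchInsertPipeline': 300}

import pymysql
import scrapy

class ListingsSpider(scrapy.Spider):
    name = 'listings'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The pipeline calls spider.cursor.executemany(...), so the
        # connection and cursor are created on the spider itself.
        self.conn = pymysql.connect(host='localhost', user='scraper',
                                    password='secret', database='scrapedb',
                                    autocommit=True)
        self.cursor = self.conn.cursor()

    def closed(self, reason):
        # Close the connection once crawling (and the pipeline's final
        # close_spider flush) is finished.
        self.conn.close()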

Parsing multiple JSON schemas with Spark

I need to collect a few key pieces of information from a large number of somewhat complex nested JSON messages which are evolving over time. Each message refers to the same type of event but the messages are generated by several producers and come in two (and likely more in the future) schemas. The key information from each message is similar but the mapping to those fields is dependent on the message type.
I can’t share the actual data but here is an example:
Message A
- header:
  - attribute1
  - attribute2
- typeA:
  - typeAStruct1:
    - property1
  - typeAStruct2:
    - property2

Message B
- attribute1
- attribute2
- contents:
  - message:
    - TypeB:
      - property1
      - TypeBStruct:
        - property2
I want to produce a table of data which looks something like this regardless of message type:
| MessageSchema | Property1 | Property2 |
| :------------ | :-------- | :-------- |
| MessageA      | A1        | A2        |
| MessageB      | B1        | B2        |
| MessageA      | A3        | A4        |
| MessageB      | B3        | B4        |
My current strategy is to read the data with schema A and union it with the data read with schema B. Then I can filter out the nulls that result from parsing a type A message with the B schema and vice versa. This seems very inefficient, especially once a third or fourth schema emerges. I would like to parse each message correctly on the first pass and apply the correct schema.
As I see it, there is only one way:
For each message type, create an 'adapter' that builds a dataframe from the input and transforms it into a dataframe with the common schema.
Then union the outputs of the adapters (see the sketch below).
Obviously, if you change the 'common' schema, you will need to tailor your 'adapters' as well.
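A minimal PySpark sketch of that idea (the field paths follow the example message trees above; the input paths, function names, and the use of spark.read.json are assumptions, not code from the answer):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Each adapter maps one producer's nested layout onto the common
# (MessageSchema, Property1, Property2) schema.
def adapt_message_a(df):
    return df.select(
        F.lit("MessageA").alias("MessageSchema"),
        F.col("typeA.typeAStruct1.property1").alias("Property1"),
        F.col("typeA.typeAStruct2.property2").alias("Property2"),
    )

def adapt_message_b(df):
    return df.select(
        F.lit("MessageB").alias("MessageSchema"),
        F.col("contents.message.TypeB.property1").alias("Property1"),
        F.col("contents.message.TypeB.TypeBStruct.property2").alias("Property2"),
    )

# Hypothetical input locations; each producer's messages are read (or
# filtered) separately and pushed through the matching adapter.
df_a = spark.read.json("/data/messages_type_a")
df_b = spark.read.json("/data/messages_type_b")

common = adapt_message_a(df_a).unionByName(adapt_message_b(df_b))
common.show()
Adding a third schema then only means writing one more adapter and one more union.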

Seeking coding example for TSQLTimeStamp

Delphi XE2 and MySQL.
My previous question led to the recommendation that I should be using MySQL's native TIMESTAMP datatype to store date/time.
Unfortunately, I can't seem to find any coding examples, and I am getting weird results.
Given this table:
mysql> describe test_runs;
+------------------+-------------+------+-----+---------------------+-------+
| Field            | Type        | Null | Key | Default             | Extra |
+------------------+-------------+------+-----+---------------------+-------+
| start_time_stamp | timestamp   | NO   | PRI | 0000-00-00 00:00:00 |       |
| end_time_stamp   | timestamp   | NO   |     | 0000-00-00 00:00:00 |       |
| description      | varchar(64) | NO   |     | NULL                |       |
+------------------+-------------+------+-----+---------------------+-------+
3 rows in set (0.02 sec)
I would like to:
declare a variable into which I can store the result of SELECT CURRENT_TIMESTAMP - what type should it be? TSQLTimeStamp?
insert a row at test start which has start_time_stamp = the variable above and end_time_stamp = some "NULL" value ... "0000-00-00 00:00:00"? Can I use that directly, or do I need to declare a TSQLTimeStamp and set each field to zero? (there doesn't seem to be a TSQLTimeStamp.Clear; it's a structure, not a class)
update the end_time_stamp when the test completes
calculate the test duration
Can someone please point me at a URL with some Delphi code which I can study to see how to do this sort of thing? GINMF.
I don't know why you want to hassle around with that TIMESTAMP and why you want to retrieve the CURRENT_TIMESTAMP just to put it back.
And as already stated, it is not good advice to use a TIMESTAMP field as the PRIMARY KEY.
So my suggestion is to use this TABLE SCHEMA
CREATE TABLE `test_runs` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `start_time_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `end_time_stamp` timestamp NULL DEFAULT NULL,
  `description` varchar(64) NOT NULL,
  PRIMARY KEY (`id`)
);
Starting a test run is handled by
INSERT INTO test_runs ( description ) VALUES ( :description );
SELECT LAST_INSERT_ID() AS id;
and to finalize the record you simply call
UPDATE test_runs SET end_time_stamp = CURRENT_TIMESTAMP WHERE id = :id
Just declare a TSQLQuery (or the correct component for the data access layer of your choice), attach it to a valid connection, and populate its SQL property with:
select * from test_runs;
Double-click on the query to launch its fields editor and select Add all fields from the contextual menu of that editor.
It will create the correct field type, according to the data access layer and driver you're using to access your data.
Once that's done, if you need to use the value in code, usually you do it by using the AsDateTime property of the field, so you just use a plain TDateTime Delphi type and let the database access layer deal with the specific database details to store that field.
For example, if your query object is named qTest and the table field is named start_time_stamp, your Delphi variable associated with that persistent field will be named qTeststart_time_stamp, so you can do something like this:
var
  StartTS: TDateTime;
begin
  qTest.Open;
  StartTS := qTeststart_time_stamp.AsDateTime;
  ShowMessage('start date is ' + DateTimeToStr(StartTS));
end;
If you use dbExpress and are new to it, read A Guide to Using dbExpress in Delphi database applications
I don't know about MySQL, but if the TField subclass generated is a TSQLTimeStampField, you will need to use the type and functions in the SqlTimSt unit (Data.SqlTimSt for XE2+).
You want to declare the local variables as TSQLTimeStamp
uses Data.SqlTimSt....;
....
var
  StartTS: TSQLTimeStamp;
  EndTS: TSQLTimeStamp;
begin
  StartTS := qTeststart_time_stamp.AsSQLTimeStamp;
The SqlTimSt unit also includes functions to convert to and from TSQLTimeStamp, e.g. SQLTimeStampToDateTime and DateTimeToSQLTimeStamp.
P.S. I tend to agree that using a timestamp as a primary key is likely to cause problems. I would tend to use an auto-incrementing surrogate key, as Sir Rufo suggests.

MySQL: Appending records: find then append or append only

I'm writing a program, in C++, to access tables in MySQL via the MySql C++ Connector.
I retrieve a record from the User (via GUI or Xml file).
Here are my questions:
Should I search the table first for the given record, then append if it doesn't exist,
Or append the record, and let MySQL append the record only if it is unique?
Here is my example table:
mysql> describe ing_titles;
+----------+----------+------+-----+---------+-------+
| Field    | Type     | Null | Key | Default | Extra |
+----------+----------+------+-----+---------+-------+
| ID_Title | int(11)  | NO   | PRI | NULL    |       |
| Title    | char(32) | NO   |     | NULL    |       |
+----------+----------+------+-----+---------+-------+
In judging the options, I am looking for a solution that will enable my program to respond quickly to the User.
During development, I have small tables (less than 5 records), but I am expecting them to grow when I formally release the application.
FYI: I am using Visual Studio 2008, C++, wxWidgets, and the MySQL C++ Connector on Windows XP and Vista.
Mark the field in question with a UNIQUE constraint and use INSERT ... ON DUPLICATE KEY UPDATE or INSERT IGNORE.
The former will update the records if they already exist; the latter will just do nothing.
Searching the table first is not efficient, since it requires two roundtrips to the server: the first one to search, the second one to insert (or to update).
The syntaxes above do the same in one statement.
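For illustration, here is a minimal sketch of both statements against the ing_titles table, shown in Python with pymysql just to keep the example short (the connection details, the sample values, and the added UNIQUE key on Title are assumptions; the same SQL works unchanged through the MySQL C++ Connector's prepared statements):
import pymysql

# Hypothetical connection details.
conn = pymysql.connect(host='localhost', user='app', password='secret',
                       database='recipes')
with conn.cursor() as cur:
    # Assumes a unique constraint was added beforehand:
    # ALTER TABLE ing_titles ADD UNIQUE KEY uq_title (Title);

    # Variant 1: if the title (or ID_Title) already exists, update it in place.
    cur.execute(
        "INSERT INTO ing_titles (ID_Title, Title) VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE Title = VALUES(Title)",
        (42, 'Granulated Sugar'))

    # Variant 2: if it already exists, silently do nothing.
    cur.execute(
        "INSERT IGNORE INTO ing_titles (ID_Title, Title) VALUES (%s, %s)",
        (43, 'Brown Sugar'))

conn.commit()
conn.close()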