I am writing an application that communicates with an SQL server, which provides an array of bytes in a blob field. I have a TObjectDictionary where I store objects; each object stores the start byte and the number of bytes I need to read and convert to the required datatype.
The objects in the TObjectDictionary refer to different SQL queries. So, to reduce the response time, my plan is to fire the queries at the same time and, whenever one of them finishes, update the global TObjectDictionary.
I know TObjectDictionary itself is not thread-safe, and if another thread were to delete an object from the TObjectDictionary I would have an issue, but this won't happen. Also, two or more threads won't be writing to the same object.
At the moment I use TCriticalSection so that only one thread writes to objects in the dictionary, but I was wondering: is this really necessary?
Most RTL containers, including TObjectDictionary, are NOT thread-safe and do require adequate cross-thread serialization to avoid problems. Even just the act of adding an object to the dictionary will require protection. A TCriticalSection would suffice for that.
The exception is if you add all of the objects to the dictionary from the main thread before starting the worker threads, and the workers then only access the existing objects without adding or removing any. In that case you shouldn't need to serialize access to the dictionary.
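The question is about Delphi's TObjectDictionary, but the pattern the answer describes can be sketched in C++ as well: a mutex (the counterpart of TCriticalSection) guards anything that changes the container's structure, while lookups of entries that were all added up front need no lock. The FieldSlice type and every name below are invented for illustration.

    // A shared map whose insertions are serialized with a mutex.
    // FieldSlice and its fields are hypothetical.
    #include <cstddef>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    struct FieldSlice {
        std::size_t startByte = 0;   // offset into the blob
        std::size_t byteCount = 0;   // number of bytes to read
        std::string value;           // converted result, written by a single worker
    };

    class SliceRegistry {
    public:
        // Adding an entry mutates the map's internal structure, so it is locked.
        void add(const std::string& key, FieldSlice slice) {
            std::lock_guard<std::mutex> guard(lock_);
            slices_[key] = std::move(slice);
        }

        // If every key was inserted before the workers started, and no worker
        // adds or removes entries, lookups like this need no extra locking.
        FieldSlice* find(const std::string& key) {
            auto it = slices_.find(key);
            return it == slices_.end() ? nullptr : &it->second;
        }

    private:
        std::mutex lock_;
        std::unordered_map<std::string, FieldSlice> slices_;
    };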
I'll describe the application I'm trying to build and the technology stack I'm considering at the moment, to get your opinion.
Users should be able to work on a list of tasks. These tasks come from an API with all the information about them: id, image URLs, description, etc. The API is only available in one datacenter, so to avoid the latency, for example in China, the tasks are stored in a queue.
So you'll have different queues depending on your country, and once you finish a task it is sent to another queue, which later writes this information back to the original datacenter.
The list of tasks is quite large; that's why there is an API call to fetch the tasks (~10k rows) and store them in a queue, and users work on them from the queue for the country they are in.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the task-list requests (e.g. get 5k rows for the China queue, write 500 rows to the write queue, etc.).
The API responses come as lists of JSON objects. These 10k rows, for example, need to be stored somewhere. Because you need to be able to filter within a queue, MySQL isn't an option unless I store every field of the JSON object as a separate row. The first thought is a NoSQL DB, but I wasn't too happy with MongoDB in the past, and the API response doesn't change much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database and it gives you the ability to store JSON and filter by it.
What do you think? Ask me if something isn't clear.
You can use the HStore extension from PostgreSQL to store JSON, or dynamic columns from MariaDB (a MySQL fork).
If you can move your persistence stack to Java, then many interesting options are available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the engine behind H2).
All of these would let you store JSON with decent performance. I suggest using a full-text search engine like Lucene to avoid searching the JSON content in a slow way.
I want to cache data that I got from my MySQL DB and for this I am currently storing the data in an object.
Before querying the database, I check whether the needed data already exists in the mentioned object. If not, I query for it and insert it.
This works quite well, and my webserver now fetches the data once and reuses it.
My concern is: do I have to think about concurrent writes/reads for such data structures that live in the object when using Node.js's clustering feature?
Every single line of JavaScript that you write in your Node.js program is thread-safe, so to speak: at any given time, only a single statement is ever executed. The fact that you can do async operations is implemented at a low level that is completely transparent to the programmer. To be precise, code only runs "truly in parallel" when you do some input/output operation, i.e. reading a file, doing TCP/UDP communication, or spawning a child process. And even then, the only code that executes in parallel to your application is Node's native C/C++ code.
Since you use a JavaScript object as a cache store, you are guaranteed no one will ever read or write from/to it at the same time.
As for cluster, every worker is created as its own process and thus has its own copy of every JavaScript variable or object that exists in your code.
I have a C++ application that loads lots of data from a database, then executes algorithms on that data (these algorithms are quite CPU- and data-intensive, which is why I load all the data beforehand), then saves all the data that has been changed back to the database.
The database part is nicely separated from the rest of the application. In fact, the application does not need to know where the data comes from. The application could even be run from files (in that case a separate file module loads the files into the application and at the end saves all the data back to the files).
Now:
the database layer only wants to save the changed instances back to the database (not the full data), therefore it needs to know what has been changed by the application.
on the other hand, the application doesn't need to know where the data comes from, hence it does not want to feel forced to keep a change-state per instance of its data.
To keep my application and its datastructures as separate as possible from the layer that loads and saves the data (could be database or could be file), I don't want to pollute the application data structures with information about whether instances were changed since startup or not.
But to make the database layer as efficient as possible, it needs a way to determine which data has been changed by the application.
Duplicating all data and comparing the data while saving is not an option since the data could easily fill several GB of memory.
Adding observers to the application data structures is not an option either, since performance within the application algorithms is critical (and looping over all observers and calling virtual functions could become a significant performance bottleneck in the algorithms).
Any other solution? Or am I trying to be too 'modular' if I don't want to add logic to my application classes in an intrusive way? Is it better to be pragmatic in these cases?
How do ORM tools solve this problem? Do they also force application classes to keep a kind of change-state, or do they force the classes to have change-observers?
If you can't copy the data and compare, then clearly you need some kind of record somewhere of what has changed. The question, then, is how to update those records.
ORM tools can (if they want) solve the problem by keeping flags in the objects, saying whether the data has been changed or not, and if so what. It sounds as though you're making raw data structures available to the application, rather than objects with neatly encapsulated mutators that could update flags.
So an ORM doesn't normally require applications to track changes in any great detail. The application generally has to say which object(s) to save, but the ORM then works out what needs persisting to the DB in order to do that, and might apply optimizations there.
I guess that means that in your terms, the ORM is adding observers to the data structures in some loose sense. It's not an external observer, it's the object knowing how to mutate itself, but of course there's some overhead to recording what has changed.
One option would be to provide "slow" mutators for your data structures, which update the flags, alongside "fast" direct access plus a function that marks the object dirty (a sketch of this split follows the two cases below). It would then be the application's choice whether to use the potentially slower mutators that let it ignore the issue, or the potentially faster access that requires it to mark the object dirty before it starts (or after it finishes, perhaps, depending on what you do about transactions and inconsistent intermediate states).
You would then have two basic situations:
I'm looping over a very large set of objects, conditionally making a single change to a few of them. Use the "slow" mutators, for application simplicity.
I'm making lots of different changes to the same object, and I really care about the performance of the accessors. Use the "fast" mutators, which perhaps directly expose some array in the data. You gain performance in return for knowing more about the persistence model.
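A minimal sketch of that split in the question's own C++. The Record class, its fields, and the persistence hooks are all hypothetical; the only point is where the dirty flag gets set.

    // "Slow" mutator vs. "fast" direct access with an explicit dirty mark.
    #include <cstddef>
    #include <vector>

    class Record {
    public:
        explicit Record(std::size_t size) : values_(size) {}

        // "Slow" mutator: simple and safe, marks the record dirty on every write.
        void setValue(std::size_t index, double value) {
            values_[index] = value;
            dirty_ = true;
        }

        // "Fast" path: direct access to the underlying storage.
        // The caller takes responsibility for calling markDirty() itself.
        std::vector<double>& rawValues() { return values_; }
        void markDirty() { dirty_ = true; }

        // The persistence layer uses these to decide what to write back.
        bool isDirty() const { return dirty_; }
        void clearDirty() { dirty_ = false; }

    private:
        std::vector<double> values_;
        bool dirty_ = false;
    };

    // Case 2 above: lots of changes to one record, so mark it dirty once
    // and then hit the raw storage with no per-write bookkeeping.
    void tightLoop(Record& r) {
        r.markDirty();
        auto& v = r.rawValues();
        for (std::size_t i = 0; i < v.size(); ++i) {
            v[i] *= 2.0;
        }
    }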
There are only two hard problems in Computer Science: cache invalidation and naming things.
Phil Karlton
I've seen the term "serialized" all over, but never explained. Please explain what that means.
Serialization usually refers to the process of converting an abstract datatype to a stream of bytes (you sometimes serialize to text, XML, CSV or other formats as well; the important thing is that it is a simple format that can be read/written without understanding the abstract objects that the data represents). When saving data to a file, or transmitting it over a network, you can't just store a MyClass object; you're only able to store bytes. So you need to take all the data necessary to reconstruct your object and turn it into a sequence of bytes that can be written to the destination device, and at some later point read back and deserialized, reconstructing your object.
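A tiny illustration of that byte-level view, as a C++ sketch. The Point type and the byte layout are invented, and a real format would also pin down endianness, versioning and error handling.

    // Turn an object's state into a flat sequence of bytes, and back again.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct Point {
        std::int32_t x;
        std::int32_t y;
    };

    // Serialization: copy the fields into a byte buffer in a fixed order.
    std::vector<std::uint8_t> serialize(const Point& p) {
        std::vector<std::uint8_t> bytes(sizeof(p.x) + sizeof(p.y));
        std::memcpy(bytes.data(), &p.x, sizeof(p.x));
        std::memcpy(bytes.data() + sizeof(p.x), &p.y, sizeof(p.y));
        return bytes;
    }

    // Deserialization: read the bytes back in the same order to rebuild the object.
    Point deserialize(const std::vector<std::uint8_t>& bytes) {
        Point p{};
        std::memcpy(&p.x, bytes.data(), sizeof(p.x));
        std::memcpy(&p.y, bytes.data() + sizeof(p.x), sizeof(p.y));
        return p;
    }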
Serialization is the process of taking an object instance and converting it to a format in which it can be transported across a network or persisted to storage (such as a file or database). The serialized format contains the object's state information.
Deserialization is the process of using that serialized state to reconstruct the object in its original state.
A really simple explanation: serialization is the act of taking something that is in memory, like an instance of a class (an object), and transforming it into a structure suitable for transport or storage.
A common example is XML serialization for use in web services: I have an instance of a class on the server and need to send it over the web to you. I first serialize it into XML, which means creating an XML version of the data in that class; once it is XML, I can use a transport like HTTP to send it easily.
There are several serialization formats, such as XML or JSON.
There are (at least) two entirely different meanings to serialization. One is turning a data structure in memory into a stream of bits, so it can be written to disk and reconstituted later, or transmitted over a network connection and used on another machine, etc.
The other meaning relates to serial vs. parallel execution -- i.e. ensuring that only one thread of execution does something at a time. For example, if you're going to read, modify and write a variable, you need to ensure that one thread completes a read, modify, write sequence before another can start it.
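A small C++ illustration of that second meaning, with invented names: the mutex serializes the read-modify-write so the two threads' updates cannot interleave.

    #include <mutex>
    #include <thread>

    int counter = 0;
    std::mutex counterLock;

    void increment() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> guard(counterLock);
            ++counter;   // read, modify, write as one serialized step
        }
    }

    int main() {
        std::thread a(increment), b(increment);
        a.join();
        b.join();
        // counter is reliably 200000 because the updates were serialized.
    }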
What they said. The word "serial" refers to the fact that the data bytes must be put into some standardized order to be written to a serial storage device, like a file output stream or serial bus. In practice, the raw bytes seldom suffice. For example, a memory address from the program that serializes the data structure may be invalid in the program that reconstructs the object from the stored data. So a protocol is required. There have been many, many standards and implementations over the years. I remember one from the mid 80's called XDR, but it was not the first.
You have data in a certain format (e.g. list, map, object, etc.)
You want to transport that data (e.g. via an API or function call)
The means of transport only supports certain data types (e.g. JSON, XML, etc.)
Serialization: You convert your existing data to a supported data type so it can be transported.
The key is that you need to transport data and the means by which you transport only allows certain formats. Your current data format is not allowed so you must "serialize" it. Hence as Mitch answered:
Serialization is the process of taking an object instance and converting it to a format in which it can be transported.
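As a sketch of that conversion step in C++, assuming the third-party nlohmann/json library is available; the Task structure and field names are invented for illustration.

    // Convert in-memory data to a transport-friendly format (here JSON).
    #include <string>
    #include <nlohmann/json.hpp>

    struct Task {
        int id;
        std::string description;
    };

    // Serialization: the object becomes a JSON string the transport can carry.
    std::string toJson(const Task& t) {
        nlohmann::json j;
        j["id"] = t.id;
        j["description"] = t.description;
        return j.dump();
    }

    // Deserialization on the receiving side rebuilds an equivalent object.
    Task fromJson(const std::string& text) {
        nlohmann::json j = nlohmann::json::parse(text);
        return Task{j["id"].get<int>(), j["description"].get<std::string>()};
    }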
I would like to store very large sets of serialized Ruby objects in a DB (MySQL).
1) What are the cons and pros?
2) Is there any alternative way?
3) What are technical difficulties if the objects are really big?
4) Will I face memory issues while serializing and de-serializing if the objects are really big?
Pros
Allows you to store arbitrarily complex objects
Simplifies your DB schema (no need to represent those complex objects)
Cons
Complicates your models and data layer
Potentially need to handle multiple versions of serialized objects (changes to object definition over time)
Inability to directly query serialized columns
Alternatives
As the previous answer stated, an object database or document oriented database may meet your requirements.
Difficulties
If your objects are quite large, you may run into difficulties when moving data between your DBMS and your program. You could minimize this by separating the storage of the object data from the metadata related to the object.
Memory Issues
Running out of memory is definitely a possibility with large enough objects. It also depends on the type of serialization you use. To know how much memory you'd be using, you'd need to profile your app. I'd suggest ruby-prof, bleak_house or memprof.
I'd suggest using a non-binary serialization wherever possible. You don't have to use only one type of serialization for your entire database, but mixing them could get complex and messy.
If this is how you want to proceed, using an object-oriented DBMS like ObjectStore or a document-oriented DBMS like CouchDB would probably be your best option. They're better designed for, and targeted at, object serialization.
As an alternative you could use any of the multitude of NoSQL databases. If you can serialize your object to JSON then it should be easily stored in CouchDB.
Bear in mind that, in terms of disk space, serialized objects are far larger than if you saved and loaded them in your own format. I/O from the hard drive is very slow, and if you're looking at complex objects that take a lot of processing power, it may actually be faster to load the raw file(s) and process them on each startup, or to save the data in a form that's easy to load.