Data theory, JSON and client applications

So, I've been coding web apps for some time now... typically I've done both the data structures/retrieval and the client-side coding. I now have a data admin teammate working with me whose sole job is to return data from a database to an API that serves JSON; standard stuff.
Recently, I have been having a disagreement with him about how this data should be returned. Essentially, we have two JSON objects: the first is loaded remotely once, when the application starts, and includes racer name, racer number, etc. The second, a recurring timed data call during the race, delivers positions incrementally and contains each racer's lat/lon, speed, etc.
Where we differ is that he says it is "inefficient" to return the racer name (from the first call) in the telemetry string (the second call). What this forces me to do is keep the first data object in a global object, and then stitch the racer's lat/lon and speed from the second data object onto it "on the fly" using a join/lookup function, which returns a new JSON object that I feed into a racer grid built with jqGrid. Roughly: getRaceDataByID(id) looks up the racer's static data by id, pulls the matching lat/lon and speed out of the telemetry object, and returns a new row object for jqGrid (see the sketch below).
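Simplified, the client-side stitch looks something like this sketch (names shortened; it assumes both payloads arrive as arrays of objects sharing an id, with racers from the first call and telemetry from the recurring one):

// look up a racer's static data by id and merge in the latest telemetry for that id
function getRaceDataByID(id, racers, telemetry) {
  var racer = racers.filter(function (r) { return r.id === id; })[0];
  var telem = telemetry.filter(function (t) { return t.id === id; })[0];
  if (!racer || !telem) { return null; }
  return {
    id: id,
    name: racer.name,
    number: racer.number,
    lat: telem.lat,
    lon: telem.lon,
    spd: telem.spd
  };
}

// on every telemetry update, rebuild every row before handing it to jqGrid
var rows = telemetry.map(function (t) {
  return getRaceDataByID(t.id, racers, telemetry);
});
$('#racerGrid').jqGrid('clearGridData').jqGrid('addRowData', 'id', rows);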
The result seems to me to be an overly-coded, slow client-side (jQuery) application.
My question is about theory. Of course I understand traditional data structures, normalization, SQL, etc. But in today's world of "web apps", it seems that larger web services are moving away from "traditional SQL" data structures and simply returning the data in the shape the client needs. In this case, that would mean adding about 3 fields (name, bib number, vehicle type, etc.) to the SQL call behind each position/telemetry call, so I can display the data on the client per the interface's requirement (a data table that displays real-time speed, lat/lon, etc.).
So finally, my question: has anyone had to deal with a situation like this, and am I "all wet" in thinking that 3 extra fields per row, in today's world of massive data-dependent web applications, is not a huge issue to be squabbling over?
Please note: I understand that traditionally you would not want to send more data than you need, and that his understanding of data structures and inefficient data transfers (not sending more data than you need) is, in itself, correct.
But many times when I'm coding a web app, it's looked at a bit differently because of the stateless nature of the browser, and IMHO it's much easier to just send the data that is needed. My question is not driven by not wanting to code the solution, but by trying to put less load on the client by not having to re-stitch the JSON objects into something I needed in the first place.

I think it makes sense to send these 3 fields with the rest of the data, even if this means some duplication. You get the following advantages:
You don't have to maintain the racer names from the first call in your browser
Your coding logic is simplified (you don't have to match racer names up to subsequent calls; the packet already contains the info)
As far as speed goes, the majority of the work happens in the remote call; adding another 3 fields doesn't matter IMHO, and it makes your app cleaner.
So I guess I agree w/you.
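For what it's worth, the difference per telemetry row is tiny; something like this (field names assumed from the question):

Without duplication: {"id": 42, "lat": 33.4484, "lon": -112.0740, "spd": 87.3}
With the extra fields: {"id": 42, "name": "J. Smith", "bib": 17, "vehicle": "moto", "lat": 33.4484, "lon": -112.0740, "spd": 87.3}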


Best way to store multiple JSON values for analytics?

Suppose one is trying to save such API responses for analytics later, i.e., a single response has about 1,000 persons.
Each object has about 26 properties.
The API query is made every 5 minutes for example.
{person1 : {propertyA:a1, propertyB:b1 ....... propertyZ:z1}
person2 : {propertyA:a2, propertyB:b2 ....... propertyZ:z2}
....
....
person999: {propertyA:a999, propertyB:b999 ....... propertyZ:z999}
person1000: {propertyA:a1000, propertyB:b1000 ....... propertyZ:z1000}}
What is the best way to store this kind of data for analytics later? What kind of database? (The simpler the better.)
Should the multiple responses from such API calls be stored in single rows, or should there be multiple columns for each object? Or some other way, like a JSON DB?
Note - a person might change over time; e.g. person100 might stop being updated or become inactive, so an API response in the future might not include person100, while another record for person1001 might be added (unrelated to person100 becoming inactive).
Additional info :
Data would be updated every 5 minutes for, say, 5 years (to give an idea about usage/retention of the data).
Queries would mostly be limited to how a personX is changing over a given time frame that is likely to range from a few hours to over 6 months.
Properties of a person are likely to have the same/similar profile of attributes, although their values would obviously change over time.
the simpler the better
The simplest would presumably be to keep the results of each API query in a single file, though if you did so, it would probably be best to use a JSON Lines format, with one line per person. In either case, I would almost certainly add an 'id' field to make it trivially easy to query for a particular person, and to migrate the data elsewhere should that become necessary.
A variant of the above would be to have one file per person, again with a JSONLines format, but with the addition of some kind of timestamp.
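Appending one JSON Lines record per person, with the 'id' and a timestamp added, could look like this sketch (assuming Node.js; response is the parsed API reply and the file name is illustrative):

// write one line per person, tagged with its id and the snapshot timestamp
const fs = require('fs');

function appendSnapshot(response, file = 'snapshots.jsonl') {
  const ts = new Date().toISOString();
  const lines = Object.entries(response).map(([id, props]) =>
    JSON.stringify({ id, ts, ...props })
  );
  fs.appendFileSync(file, lines.join('\n') + '\n');
}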
Next up the ladder of complexity, you might want to consider a SQLite database. If you want to retain the JSON format, then you'd presumably want to add
indices, e.g. on the person id.
If the JSON object representation of each person is flat and the property list stable, then the conventional wisdom would be to store the data in columnar format. A reasonable compromise would be to move the properties of interest to columns, and to relegate all the other (relevant) details to JSON-valued columns.
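A sketch of that compromise (assuming Node.js with the better-sqlite3 package; response is the parsed API reply, and which properties get promoted to real columns is up to you):

const Database = require('better-sqlite3');
const db = new Database('snapshots.db');

// one row per person per snapshot: promoted columns plus a JSON blob for the rest
db.exec(`CREATE TABLE IF NOT EXISTS snapshot (
  ts        TEXT NOT NULL,
  person_id TEXT NOT NULL,
  propertyA TEXT,            -- a property you query often, promoted to a column
  extra     TEXT             -- remaining properties as a JSON string
);
CREATE INDEX IF NOT EXISTS idx_snapshot_person ON snapshot (person_id, ts);`);

const insert = db.prepare(
  'INSERT INTO snapshot (ts, person_id, propertyA, extra) VALUES (?, ?, ?, ?)'
);
const ts = new Date().toISOString();
for (const [id, props] of Object.entries(response)) {
  const { propertyA, ...rest } = props;
  insert.run(ts, id, propertyA, JSON.stringify(rest));
}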
Of course there are umpteen other database options, and you can climb the complexity ladder as high as it goes. Likewise for cost. You might like to look at TimescaleDB for starters.
Managing Scale
If the data for an individual does not change very often, there will
presumably be various ways to reduce the redundancy.
At one end of the spectrum of possibilities, you could simply discard
an entire record if the prior retained record for that person is essentially the same.
Towards the other end of the spectrum, you could recast the data as a
series of events that would be easy to store as a table:
(timestamp, id, propertyName, value)
This would have the advantage of giving you flexibility w.r.t. both
the universe of persons and the set of properties of interest.
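A sketch of that recasting (assuming Node.js; response is the parsed API reply and previous is the last retained snapshot per person):

// emit one (timestamp, id, propertyName, value) event per property,
// skipping values that haven't changed since the previous snapshot
function toEvents(response, previous = {}) {
  const ts = new Date().toISOString();
  const events = [];
  for (const [id, props] of Object.entries(response)) {
    for (const [name, value] of Object.entries(props)) {
      if (!previous[id] || previous[id][name] !== value) {
        events.push({ ts, id, name, value });
      }
    }
  }
  return events;
}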
See also https://www.timescale.com/blog/time-series-compression-algorithms-explained/
Footnote: The PmWiki system https://en.m.wikipedia.org/wiki/PmWiki illustrates how a fairly complex “database” system can be constructed using the underlying file system.

Can a JSON file be used as a database?

Theoretically I have a group (2,000-3,000) of data structures; it's a local obituary project, with 5 items/values for each (name, birth, death, function, photo URL), and I want:
to load this data in an HTML page,
to format each record into a styled div,
to allow the user to filter these divs (hide/show by name, function, etc.) via JS or jQuery and inputs.
I know the best way to do this is MySQL + PHP, but since I have no knowledge yet in this field, my question is:
Is it correct to use a JSON file to store this data and populate the divs?
Can we filter JSON data with jQuery?
Thanks
Define database. Can it be used to store data? Yes. But it has no search abilities, it isn't ACID compliant, and accessing it would run in the current process. Any non-trivial filtering would require an O(N) pass, unless you're going to build index tables in memory. It would also be nearly impossible to keep in sync across processes/machines if the user is allowed to edit any of the data at all. If it's read-only and N is small, it can work. If N is not small, or you need to be able to update it, a JSON file is a bad idea.
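That said, for a read-only set of 2,000-3,000 records, loading the file and filtering the divs with jQuery is perfectly doable. A minimal sketch, assuming a file named obituaries.json containing an array of records with the fields from the question, and made-up element ids:

// load the JSON file once, render one div per person, then hide/show on input
$.getJSON('obituaries.json', function (people) {
  var $list = $('#list');

  $.each(people, function (i, p) {
    $('<div class="person">')
      .text(p.name + ' (' + p.birth + ' - ' + p.death + ') - ' + p.function)
      .data('record', p)
      .appendTo($list);
  });

  $('#search').on('input', function () {
    var term = $(this).val().toLowerCase();
    $list.children('.person').each(function () {
      var p = $(this).data('record');
      var match = (p.name + ' ' + p.function).toLowerCase().indexOf(term) !== -1;
      $(this).toggle(match);
    });
  });
});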

socket.io performance: one emit per database row

I am trying to understand what is the best way to read and send a huge amount of database rows (50K-100K) to the client.
Should I simply read all the rows at once from the database on the backend and then send them all in JSON format? This isn't very responsive, as the user is just left waiting for a long time, but it is faster for a small number of rows.
Should I stream the rows from the database and call socket.emit() for each row as it is read? This causes too many socket emits; it is more responsive, but slow...
I am using node.js, socket.io
Rethink the Interface
First off, a user interface design that shows 50-100k rows on a client is probably not the best user interface in the first place. Not only is that a large amount of data to send down to the client and for the client to manage, and perhaps impractical on some mobile devices, but it's obviously way more rows than any single user is actually going to read in any given interaction with the page. So, the first order of business might be to rethink the user interface design and create some sort of more demand-driven interface (paged, virtual scroll, keyed by letter, etc.). There are lots of possibilities for a different (and hopefully better) user interface design that lessens the data transfer amount. Which design would be best depends entirely upon the data and the likely usage models of the user.
Send Data in Chunks
That said, if you were going to transfer that much data to the client, then you're probably going to want to send it in chunks (groups of rows at a time). The idea with chunks is that you send a consumable amount of data in one chunk such that the client can parse it, process it, show the results, and then be ready for the next chunk. The client can stay active the whole time since it has cycles available between chunks to process other user events. And sending the data in chunks reduces the overhead of sending a separate message for each single row. If your server is using compression, then chunking gives a greater chance for compression efficiency too. How big a chunk should be (e.g. how many rows of data it should contain) depends upon a bunch of factors and is likely best determined through experimentation with likely clients or the lowest-power expected client. For example, you might want to send 100 rows per message.
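A rough sketch of the chunked approach (fetchAllRows and renderRows are hypothetical helpers, and the event names are made up):

// server (Node.js + socket.io): send rows in batches instead of one emit per row
const CHUNK_SIZE = 100; // tune through experimentation
io.on('connection', (socket) => {
  socket.on('getRows', async () => {
    const rows = await fetchAllRows(); // hypothetical DB helper
    for (let i = 0; i < rows.length; i += CHUNK_SIZE) {
      socket.emit('rowChunk', rows.slice(i, i + CHUNK_SIZE));
    }
    socket.emit('rowsDone');
  });
});

// client: process each chunk as it arrives so the page stays responsive
socket.on('rowChunk', (chunk) => renderRows(chunk)); // hypothetical render function
socket.on('rowsDone', () => console.log('all rows received'));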
Use an Efficient Transfer Format for the Data
And, if you're using socket.io to transfer large amounts of data, you may want to revisit how you use the JSON format. For example, sending 100,000 objects that all repeat exactly the same property names is not very efficient. You can often invent your own optimizations that avoid repeating property names that are exactly the same in every object. For example, rather than sending 100,000 of these:
{"firstname": "John", "lastname": "Bundy", "state": "Az", "country": "US"}
if every single object has the exact same properties, then you can either code the property names into your own code or send the property names once and then just send a comma separated list of values in an array that the receiving code can put into an object with the appropriate property names:
["John", "Bundy", "Az", "US"]
Data size can sometimes be reduced by 2-3x by simply removing redundant information.
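A sketch of that optimization (assuming every row really does have the same four fields, and using made-up event names):

// sender: property names once, then one array of values per row
const keys = ['firstname', 'lastname', 'state', 'country'];
socket.emit('people', {
  keys: keys,
  rows: people.map((p) => keys.map((k) => p[k])),
});

// receiver: rebuild full objects from the shared key list
socket.on('people', ({ keys, rows }) => {
  const people = rows.map((values) =>
    Object.fromEntries(values.map((v, i) => [keys[i], v]))
  );
  // ... use people as before
});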

Is there any negatives to storing JSON in MySQL?

I am building a complex ordering system, and I am struggling with whether I should store some of the more detailed information in a single column as JSON or whether I should create the multiple tables and logic needed to keep JSON out of the picture.
Since each order will have multiple required dates, ship dates, parts, kits (collections of parts), and more, it just seems easier to store this as JSON in a single 'order' row.
Are there any major down sides to doing this?
JSON is geared more towards short-term storage, to send data from one thing to another. It is horribly inefficient, space-wise and computationally, for long-term storage compared to a database. You will also lose the ability to query the data directly without parsing it first (e.g. "select * from table where orderdate < today"). You'll also have to develop your own tools to view the data, since if you try to view it in the database directly, everything will run together.
In short, this is almost always a really bad idea.
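To make the trade-off concrete, a rough sketch (assuming Node.js with the mysql2 package; table and column names are made up) of what the answer above describes:

const mysql = require('mysql2/promise');

async function lateOrders(pool) {
  // with real columns, the database does the filtering
  const [rows] = await pool.query(
    'SELECT id FROM orders WHERE required_date < CURDATE()'
  );
  return rows;
}

async function lateOrdersFromJson(pool) {
  // with one JSON blob per order, rows typically come back whole and get parsed in app code
  const [rows] = await pool.query('SELECT id, details_json FROM orders');
  return rows.filter(
    (r) => new Date(JSON.parse(r.details_json).requiredDate) < new Date()
  );
}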

Database optimized for searching in large number of objects with different attributes

I am currently searching for an alternative to our aging MySQL database, which uses an EAV approach. Current projects seem to have outgrown traditional table-oriented database structures, and especially searches in such a database.
I have read and researched various NoSQL database systems, but I can't find anything that seems to be what I'm looking for. Maybe you can help.
I'll show you a generalized example of the kind of data I have and the operations I want to execute on it:
I have an object that has a small number of META attributes - attributes that are common to all instances of my objects. For example these:
DataObject Common (META) Attributes
Unique ID (Some kind of string containing a unique identifier)
Created Date (A date time showing creation time of the object)
Type (Some kind of type identifier, maybe something like "Article", "News", "Image" or "Video")
... I think you get the idea
Then each of my objects has a variable number of other attributes. Most probably, many objects will share a number of these attributes, but there is no rule. For my sample, let's say each object instance has between 5 and 20 such attributes. Here are some samples:
Data Object variable Attributes
Color (Some CSS like color string)
Name (A string)
Category (The category or Tag of this item) (Maybe we also have more than one of these?)
URL (a url containing some website)
Cost (a number with decimals)
... And a whole lot of other stuff mostly being of the usual column types
References to other data are an idea, but not a MUST at the moment. I could provide those within my application logic if needed.
A small sample:
Image
Unique ID = "0s987tncsgdfb64s5dxnt"
Created Date = "2013-11-21 12:23:11"
Type = "Image"
Title = "A cute cat"
Category = "Animal"
Size = "10234"
Mime = "image/jpeg"
Filename = "cat_123.jpg"
Copyright = "None"
Typical Operations
An average storage would probably have around 1-5 million such objects, each with 5-20 attributes.
Apart from the usual stuff like writing one object to the database or reading it by its UID, the most problematic operations are these:
Search by several attributes - Select every DataObject whose Type is "News", whose Title contains "blue", and whose Created Date is after 2012.
Paged bulk read - Get a large number of objects from a search (see above), starting at element 100 and ending at 250.
Get many objects with all of their attributes - When reading larger numbers of objects, I need to get every object with all of its attributes in one call.
Storage Requirements
Persistence - The storage needs to be persistent and not in-memory only. If the server reboots, the data has to be at the same point in time as when it shut down. No memory-only systems.
Integrity - All data is important; nothing can be ignored. So every single write action has to be securely stored. Systems (Redis?) that tend to lose something now and then aren't usable. Systems with huge asynchronicity are also problematic: if data changes, every responsible node should see that.
Complexity - The system should be fairly easy to set up and maintain. So, systems that force the admin to take many week-long courses in their use aren't really a solution here. The same goes for huge data warehouses with loads of nodes. Clustering is nice, but it should also be possible to get a cheap system with one node.
tl;dr
Need a super fast database system with object-oriented data and fast searches, even with hundreds of thousands of items.
A reason as to why I am searching for a better alternative to mysql can be found here: Need MySQL optimization for complex search on EAV structured data
Update
Key-value stores like Redis weren't an option, as we need to do some heavy searching inside our data, something which isn't possible in a typical key-value store.
In the end, we are using MongoDB with a slightly optimized schema to make the best use of MongoDB's indexes.
Some small drawbacks still remain but are acceptable at the moment:
- MongoDB's aggregate function cannot work with very large result sets. We have to use find (and refine our data structure to make that sufficient).
- You cannot sort large data sets on specific values, as it would take up too much memory. You also can't create indexes on those values, as they are schema-free.
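As a rough sketch of what such a schema can look like (field names are illustrative; this assumes the Node.js MongoDB driver with a connected db): keep the META attributes as indexed top-level fields and the variable attributes in a sub-document.

// META attributes live at the top level so they can be indexed;
// the variable attributes go into an 'attrs' sub-document
const col = db.collection('objects');
await col.createIndex({ type: 1, created: 1 });

await col.insertOne({
  _id: '0s987tncsgdfb64s5dxnt',
  created: new Date('2013-11-21T12:23:11Z'),
  type: 'Image',
  attrs: { title: 'A cute cat', category: 'Animal', size: 10234, mime: 'image/jpeg' },
});

// "Type is News, Title contains blue, Created Date after 2012", paged 100-250
const page = await col
  .find({
    type: 'News',
    'attrs.title': /blue/i,
    created: { $gt: new Date('2012-12-31') },
  })
  .skip(100)
  .limit(150)
  .toArray();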
I don't know if you want a more sophisticated answer than mine, but maybe I can inspire you a little.
MySQL is scalable and can be used for exactly your case. I think it's more of an optimization and server problem if your database is slow. Many systems with massive amounts of data use MySQL and work perfectly, though NoSQL (Not Only SQL) is built for large amounts of data with different attributes.
There are many different NoSQL providers, and they have different ways of handling your data.
Think about that before you choose a NoSql platform.
The possibilities are:
Key–value Stores - ex. Redis, Voldemort, Oracle BDB
Column Store - ex. Cassandra, HBase
Document Store - ex. CouchDB, MongoDb
Graph Database - ex. Neo4J, InfoGrid, Infinite Graph
Most websites use document-based storage, but Facebook, for example, uses column-based storage because of its many dynamic attributes.
You can try the Document based NoSql at http://try.mongodb.org/
In the end, it really depends on how you build and optimize your database, and not on which technology you choose, though choosing the right technology can save a bunch of time.
The system we have developed uses a combination of MySQL and NoSQL, depending on what data we are working with: MySQL for the system itself and NoSQL for all the data we import via APIs.
Hope this inspires a little, and feel free to ask any questions.