How to store large numbers of Python dictionaries in a datastore and filter/query them? - mysql

I have a Python dictionary with the following fields:

import datetime

record = {
    "attribute_a": "3898801b-4595-4113-870b-ee5906457edf",  # UUID
    "attribute_b": "50df4979-7448-468a-994c-96797b0f958b",  # UUID
    "attribute_c": "0f6b2331-f86b-4e76-9efe-42ef8d843273",  # UUID
    "attribute_d": "blah1",  # string
    "attribute_e": "blah2",  # string
    "attribute_f": "72.154.80.0",  # IP address
    "attribute_g": "blah3",  # string
    "created_timestamp": datetime.datetime.now()  # datetime
}
Now comes the hard part: about 3 million such records will be created daily, and I need to keep at least the last 90 days' worth of them in some type of datastore.
Once each record is stored, it will never need to be updated (but it can be deleted). I will occasionally need to query this datastore for all records matching any of the first 7 attributes and/or a date comparison on created_timestamp. Mostly I will be filtering on attribute_a only and will want the matching records sorted by created_timestamp.
Which datastore should I use? I fear that if I store this quantity of data in a MySQL table with even one index on it, insertions will become too slow; and with no indexes at all, querying becomes impossible. So I'm leaning towards a NoSQL solution like MongoDB.
However, I have no experience with NoSQL and am not sure whether I can use it for this purpose. Will I be able to filter across multiple fields? Will it handle the created_timestamp field as an actual date instead of a string? Should I set attribute_a as the primary key and index all the other attributes as secondary keys? If I do so, will inserts become exceedingly slow? Can it return the data to me sorted by created_timestamp?
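For reference, here is roughly what I imagine the MongoDB/PyMongo version would look like (an untested sketch, with made-up database/collection names and only a couple of the attributes shown):

import datetime
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017/")
events = client.mydb.events

# One compound index for the common query: all records for a given
# attribute_a, sorted newest first.
events.create_index([("attribute_a", ASCENDING), ("created_timestamp", DESCENDING)])

events.insert_one({
    "attribute_a": "3898801b-4595-4113-870b-ee5906457edf",
    "attribute_f": "72.154.80.0",
    "created_timestamp": datetime.datetime.utcnow(),  # stored as a real BSON date
})

# Filter on attribute_a and the last 90 days, newest first.
cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=90)
cursor = (events
          .find({"attribute_a": "3898801b-4595-4113-870b-ee5906457edf",
                 "created_timestamp": {"$gte": cutoff}})
          .sort("created_timestamp", DESCENDING))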

Related

How does a relational database organize data?

I was thinking that a relational database would store every possible query, and the values to return for that query, in a hash table.
So, for example, if each entry in your table had 5 attributes, you would make a copy of that entry for each subset of the 5 attributes that appears in any query that should return that specific entry. Every individual entry would then appear 2^5 = 32 times in the table. This seems very memory-inefficient for large data sets with many entries, but it also allows for the fastest possible query time.
Do real-world relational databases have a mixed version of this, where some response time for queries/lookups is traded off for more memory efficiency? If so, how would this be implemented?
That's not how relational databases store data. Keep in mind that the number of possible queries is far more than 2^5 = 32, because queries can contain expressions, not simply references to attribute columns, and also joins, which expand the possibilities immensely.
Even if you could store all possible combinations, it would be a waste because most of them will never be needed.
Instead, databases typically store records, where a record includes all columns of one table. If you run a query that only needs some columns, the DBMS still fetches the whole record, and simply ignores columns that you didn't ask for. Then it evaluates any expressions in your query. And finally returns the result set.
MySQL does not use hash tables to store these records; it uses a B+Tree data structure, so looking up a record by its primary key takes O(log n) time.
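To make that last point concrete, here is a toy sketch (in no way how MySQL is actually implemented) of the idea behind an index: rather than precomputing every possible query, the keys are kept in sorted order so that a single lookup costs O(log n) comparisons, shown here with Python's bisect over a sorted list.

import bisect

# (primary_key, row) pairs kept sorted by primary_key -- a stand-in for a B+Tree.
index = [(1, "row-1"), (5, "row-5"), (9, "row-9"), (42, "row-42")]
keys = [k for k, _ in index]

def lookup(pk):
    # Binary search: O(log n) comparisons to find the row, or None if absent.
    i = bisect.bisect_left(keys, pk)
    if i < len(keys) and keys[i] == pk:
        return index[i][1]
    return None

print(lookup(9))   # 'row-9'
print(lookup(10))  # None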

Storing a large amount of data as a single JSON field - extract important fields to their own fields?

I'm planning on storing a large amount of data from a user-submitted form (around 100 questions) in a JSON field.
I will only need to query two pieces of data from the form: name and type.
Would it be advisable (and more efficient) to extract name and type into their own fields for querying, or shall I just whack it all into one JSON field and query that JSON field, since JSON searching is now supported?
If you are concerned about performance, then maintaining separate fields for the name and type is probably the way to go here. The reason is that if these two pieces of data exist as separate fields, it leaves open the possibility of adding indexes to those columns. While you can use MySQL's JSON functions to query by name and type, they would most likely never be able to compete with an index lookup, at least not in terms of performance.
From a storage point of view, you would not pay much of a price to maintain two separate columns. The main price you would pay is that every time the JSON gets updated, you also have to update the name and type columns.
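As a rough illustration (the table and column names are made up, this assumes MySQL 5.7+ for the JSON column type, and the cursor is an ordinary DB-API cursor from a driver such as mysql-connector or PyMySQL), the layout could look something like this, with the full form kept as JSON but name and type duplicated into their own indexed columns at insert time:

import json

CREATE_SQL = """
CREATE TABLE form_submission (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(255) NOT NULL,
    type    VARCHAR(50)  NOT NULL,
    answers JSON         NOT NULL,
    INDEX idx_name (name),
    INDEX idx_type (type)
)
"""

INSERT_SQL = "INSERT INTO form_submission (name, type, answers) VALUES (%s, %s, %s)"

def save_form(cursor, form):
    # Store the whole form as JSON, but keep name/type in their own
    # indexed columns so lookups never have to touch the JSON at all.
    cursor.execute(INSERT_SQL, (form["name"], form["type"], json.dumps(form)))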

When to use Json over key/value tables in postgres for billions of rows

I am doing a project where I need to store billions of rows of unstructured history data in a SQL database (Postgres) for 2-3 years. The data/columns may change from day to day.
For example, on day one a user might save {"user_id": "2223", "website": "www.mywebsite.org", "webpage": "mysubpageName"},
and the following day {"name": "username", "user_id": "2223", "bookclub_id": "1"}.
I worked on an earlier project where we used the classic entity key/value table model for this problem. We stored maybe up to 30 key/values per entity. But once we exceeded 70-100 million rows, the queries began to run slower and slower (too many inner joins).
Therefore I am wondering whether I should switch to the JSON model in Postgres. After searching the web and reading blogs, I am really confused. What are the pros and cons of changing this to JSON in Postgres?
You can think about this in terms of query complexity. If you have an index on the JSON documents (maybe on user_id), you can do a simple index scan to access the whole JSON string very fast.
You then have to dissect it on the client side, or you can pass it to functions in Postgres if, for example, you only want to extract data for specific values.
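A small sketch of the client-side option, assuming a hypothetical table mytable(user_id, jsonvalue) and a psycopg2 connection:

import json
import psycopg2

conn = psycopg2.connect("dbname=history")
cur = conn.cursor()

# The index scan on user_id fetches the whole JSON document quickly ...
cur.execute("SELECT jsonvalue FROM mytable WHERE user_id = %s", ("2223",))
for (doc,) in cur:
    # ... and the client picks out the fields it needs. psycopg2 decodes
    # json/jsonb columns to dicts automatically; json.loads is only needed
    # if the value is stored as plain text.
    record = doc if isinstance(doc, dict) else json.loads(doc)
    if record.get("bookclub_id") == "1":
        print(record)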
One of the most important features of Postgres when dealing with JSON is functional (expression) indexes. In comparison to a "normal" index, which indexes the value of a column, a functional index applies a function to the value of one (or even more) columns and indexes the return value. In Postgres the ->> operator extracts a JSON field as text, so if you want the users that have bookclub_id = 1, you can create an index like
create index idx_bookclub_id on mytable ((jsonvalue ->> 'bookclub_id'));
Afterwards, queries like
select * from mytable where jsonvalue ->> 'bookclub_id' = '1';
are lightning fast.

indexing varchars without duplicating the data

I have a huge data set of ~1 billion records in the following format:
|KEY (varchar(300), UNIQUE, PK) | DATA1 (int) | DATA2 (bool) | DATA4 (varchar(10)) |
Currently the data is stored in a MyISAM MySQL table, but the problem is that the key data (10G out of the 12G table size) is stored twice: once in the table and once in the index. (The data is append-only; there will never be an UPDATE query on the table.)
There are two major actions that run against the data-set :
contains - Simple check if a key is found
count - aggregation functions (mostly) over the data fields
Is there a way to store the key data only once?
One idea I had is to drop the DB altogether and simply create a folder structure out of 2-5 character chunks of the key.
This way, the data assigned to the key "thesimon_wrote_this" would be stored in the filesystem as
~/data/the/sim/on_/wro/te_/thi/s.data
The data set would then function much like a b-tree, and the "contains" and data-retrieval functions would run in almost O(1) time (with the obvious HDD limitations).
This makes backups pretty easy (backing up only files with the archive attribute set), but the aggregating functions become almost useless, as I would need to grep a billion files every time. The allocation unit size is irrelevant, as I can adjust the file structure so that only 5% of the disk space is wasted.
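Roughly, the mapping from key to path would be something like this (just a sketch; the chunk size and nesting depth are picked arbitrarily):

import os

DATA_ROOT = "~/data"   # expand with os.path.expanduser before touching the filesystem
CHUNK = 3              # characters per directory level
DEPTH = 6              # levels of nesting before the remainder becomes the file name

def key_to_path(key):
    # Split the key into fixed-size chunks; the first DEPTH chunks become
    # nested folders and whatever is left becomes the file name.
    chunks = [key[i:i + CHUNK] for i in range(0, len(key), CHUNK)]
    dirs, rest = chunks[:DEPTH], "".join(chunks[DEPTH:])
    return os.path.join(DATA_ROOT, *dirs, rest + ".data")

print(key_to_path("thesimon_wrote_this"))
# ~/data/the/sim/on_/wro/te_/thi/s.data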
I'm pretty sure there is another, much more elegant way to do this, but I can't Google it out :).
It would seem like a very good idea to consider having a fixed-width, integral key, like a 64-bit integer. Storing and searching a varchar key is very slow by comparison! You can still add an additional index on the KEY column for fast lookup, but it shouldn't be your primary key.
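A minimal sketch of that layout (MySQL/InnoDB flavoured, with hypothetical table and column names; the SQL strings would be run through whatever Python driver you already use, with %s placeholders):

CREATE_SQL = """
CREATE TABLE keyed_data (
    id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- fixed-width integral PK
    `key` VARCHAR(300) NOT NULL,
    data1 INT,
    data2 BOOL,
    data4 VARCHAR(10),
    UNIQUE KEY idx_key (`key`)  -- secondary index used for the 'contains' check
) ENGINE=InnoDB
"""

# 'contains' becomes a probe of the secondary index only.
CONTAINS_SQL = "SELECT 1 FROM keyed_data WHERE `key` = %s LIMIT 1"

def contains(cursor, key):
    cursor.execute(CONTAINS_SQL, (key,))
    return cursor.fetchone() is not None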

Dedicated SQL table containing only unique strings

I can't seem to find any examples of anyone doing this on the web, so am wondering if maybe there's a reason for that (or maybe I haven't used the right search terms). There might even already be a term for this that I'm unaware of?
To save on database storage space for regularly reoccurring strings, I'm thinking of creating a MySQL table called unique_string. It would only have two columns:
"id" : INT : PRIMARY_KEY index
"string" : varchar(255) : UNIQUE index
Any other tables anywhere in the database can then use INT columns instead of VARCHAR columns. For example a varchar field called browser would instead be an INT field called browser_unique_string_id.
I would not use this for anything where performance matters. In this case I'm using it to track details of every single page request (logging web stats) and as an "audit trail" of user actions on intranets, but potentially other things too.
I'm also aware the SELECT queries would be complex, so I'm not worried about that. I'll most likely write some code to generate the queries to return the "real" string data.
Thoughts? I feel like I might be overlooking something obvious here.
Thanks!
I have used this structure for a similar application -- keeping track of URIs for web logs. In this case, the database was Oracle.
The performance issues are not minimal. As the database grows, there are tens of millions of URIs. So, just identifying the right string during an INSERT is challenging. We handled this by building most of the update logic in hadoop, so the database table was, in essence, just a copy of a hadoop table.
In a regular database, you would get around this by building an index, as you suggest in your question. An index solution would work well up to the limits of your available memory. In fact, this is a rather degenerate case for an index, because you really only need the index and not the underlying table. I do not know whether MySQL or SQL Server recognizes this, although columnar databases (such as Vertica) should.
SQL Server has another option. If you declare the string as VARCHAR(MAX), then it is stored on a separate data page from the rest of the data. During a full table scan, there is no need to load the additional pages into memory if the column is not referenced in the query.
This is a very common design pattern in databases where the cardinality of the data is relatively small compared to the transaction table it's linked to. The queries wouldn't be very complex, just a simple join to the lookup table. You can include more than just a string in the lookup table: any other information that is commonly repeated. You're simply normalizing your model to remove duplicate data.
Example:
Request Table:
Date
Time
IP Address
Browser_ID
Browser Table:
Browser_ID
Browser_Name
Browser_Version
Browser_Properties
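Here is a runnable sketch of that pattern, using SQLite purely for illustration (the table and column names are made up; the idea is identical in MySQL):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE browser (
    browser_id      INTEGER PRIMARY KEY,
    browser_name    TEXT,
    browser_version TEXT,
    UNIQUE (browser_name, browser_version)
);
CREATE TABLE request (
    request_time TEXT,
    ip_address   TEXT,
    browser_id   INTEGER REFERENCES browser (browser_id)
);
""")

# Look up (or create) the browser row once, then log requests by id only.
conn.execute("INSERT OR IGNORE INTO browser (browser_name, browser_version) VALUES (?, ?)",
             ("Firefox", "124.0"))
browser_id = conn.execute(
    "SELECT browser_id FROM browser WHERE browser_name = ? AND browser_version = ?",
    ("Firefox", "124.0")).fetchone()[0]
conn.execute("INSERT INTO request VALUES (datetime('now'), ?, ?)",
             ("203.0.113.7", browser_id))

# The "real" strings come back with a simple join to the lookup table.
for row in conn.execute("""
        SELECT r.request_time, r.ip_address, b.browser_name, b.browser_version
        FROM request r
        JOIN browser b ON b.browser_id = r.browser_id"""):
    print(row)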
If you are planning on logging data in real time (as opposed to a batch job), then you want to ensure the time taken to write a record to the database is as short as possible. If you are logging synchronously, then obviously the record-creation time will directly affect the time it takes for an HTTP request to complete. If this is async, then slow record-creation times will lead to a bottleneck. However, if this is a batch job, then performance will not matter as long as you can confidently create all the batched records before the next batch runs.
In order to reduce the time it takes to create a record, you really want to flatten out your database structure. Your current query in pseudocode might look like:
DECLARE @id INT

SELECT @id = id FROM PagesTable
WHERE PageName = @RequestedPageName

IF @id IS NULL
BEGIN
    INSERT INTO PagesTable (PageName) VALUES (@RequestedPageName)
    SET @id = SCOPE_IDENTITY()  -- or whatever method your db supports for
                                -- fetching the id of a newly created record
END

INSERT INTO BrowserLogTable (PageId, BrowserName) VALUES (@id, @BrowserName)
Whereas in a flat structure you would just need one INSERT.
If you are concerned about data integrity, which you should be, then typically you would normalise this data by querying it and writing it into a separate set of tables (or a separate database) at regular intervals, and use those for querying against.