Database to store compressed files - mysql

We use a Redis database as:
key -> (file1, file2, file3)
The value is always a list of three compressed files.
Not all the keys have the 3 files yet, e.g.:
key2 -> (file4, file5)
Files are compressed using zlib.
Each file is between 50 and 120 KB compressed.
I want to store the "complete" keys (the ones with 3 files in the list)
in a database.
Until now I was storing the data in a MySQL table:
key_id : INTEGER , PRIMARY KEY
first : BLOB
second : BLOB
third : BLOB
This works pretty well, with the exception of slow inserts
(the MySQL server is doing other things at the same time).
I will query the data very rarely, but I want to be able to get the files
one by one easily.
Redis is a database and I know I can dump it to a file (rdb files),
so I think it is redundant to use another DBMS.
But the Redis database is limited by memory, so I cannot just
wait until the production of all the values (files) is finished and then
dump everything to an rdb file.
I would like to create smaller rdb files that contain
only the "complete" keys.
i.e.
at time 1 Redis contains the following:
key3 -> (a, b, c)
key14 -> (e, f)
key1 -> (g, h, i)
Then if I decide to dump, the file 1.rdb should contain only:
key3 and key1
If the dump is successful I will delete the dumped keys (key3, key1)
and Redis should have:
key14 -> (e, f)
Now at time 5 Redis contains:
key5 -> (i, j, k)
key14 -> (d, e, f)
key6 -> (l, m)
So if I save to 2.rdb, the file should contain only:
key5, key14
and then those keys should be deleted from Redis.
Is that possible? I am using Python, if it matters.
Do you have another idea for this task?
Another DBMS, storing directly to the filesystem, etc.
P.S. I forgot to mention that in total there would be around
15,000,000 keys, so there would be 15,000,000 * 3 files.
Also, I use Linux with an ext3 filesystem.

I'd try to keep everything in one system. Do you really need Redis to manage the keysets? A database table that gets accessed frequently should stay mostly cached in memory.
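If Redis has to stay as the staging area, one option (a sketch, not a definitive implementation) is a small Python worker that periodically moves the complete keys into MySQL in batches; batching also softens the slow single-row inserts. Assumed here: redis-py and PyMySQL, a hypothetical table named "files" shaped like the one above, placeholder connection details, and that each Redis key holds a list and can serve as key_id.

import pymysql
import redis

r = redis.Redis()  # placeholder connection details
db = pymysql.connect(host="localhost", user="user",
                     password="password", database="files")
BATCH = 500  # hypothetical batch size

def _flush(rows, done_keys):
    with db.cursor() as cur:
        cur.executemany(
            "INSERT INTO files (key_id, first, second, third) VALUES (%s, %s, %s, %s)",
            rows,
        )
    db.commit()
    r.delete(*done_keys)  # drop keys only after MySQL has confirmed the batch

def flush_complete_keys():
    rows, done_keys = [], []
    for key in r.scan_iter(count=1000):
        files = r.lrange(key, 0, -1)       # the value is a Redis list of blobs
        if len(files) == 3:                # only the "complete" keys
            rows.append((key.decode(), files[0], files[1], files[2]))
            done_keys.append(key)
            if len(rows) >= BATCH:
                _flush(rows, done_keys)
                rows, done_keys = [], []
    if rows:
        _flush(rows, done_keys)

Run flush_complete_keys() on a schedule (cron, a loop with a sleep, etc.); because keys are deleted only after the batch commits, a crash mid-run leaves the data either in Redis or in MySQL, never lost.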

Related

How many tasks are created when Spark reads or writes from MySQL?

As far as I know, Spark executors handle many tasks at the same time to guarantee that data is processed in parallel. Here comes the question: when connecting to external data storage, say MySQL, how many tasks are there to finish this job? In other words, are multiple tasks created at the same time, each reading all the data, or is the data read by only one task and distributed to the cluster in some other way? How about writing data to MySQL, how many connections are there?
Here is a piece of code that reads or writes data from/to MySQL:
def jdbc(sqlContext: SQLContext, url: String, driver: String, dbtable: String, user: String, password: String, numPartitions: Int): DataFrame = {
  sqlContext.read.format("jdbc").options(Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> s"(SELECT * FROM $dbtable) $dbtable",
    "user" -> user,
    "password" -> password,
    "numPartitions" -> numPartitions.toString
  )).load
}
def mysqlToDF(sparkSession: SparkSession, jdbc: JdbcInfo, table: String): DataFrame = {
  val dF1 = sparkSession.sqlContext.read.format("jdbc")
    .option("url", jdbc.jdbcUrl)
    .option("user", jdbc.user)
    .option("password", jdbc.passwd)
    .option("driver", jdbc.jdbcDriver)
    .option("dbtable", table)
    .load()
  // dF1.show(3)
  dF1.createOrReplaceTempView(s"${table}")
  dF1
}
Here is a good article which answers your question:
https://freecontent.manning.com/what-happens-behind-the-scenes-with-spark/
In simple words: the workers separate the reading task into several parts, and each worker only reads a part of your input data. The number of tasks depends on your resources and your data volume. Writing follows the same principle: Spark writes the data to a distributed storage system, such as HDFS, where the data is stored in a distributed way: each worker writes its data to some storage node in HDFS.
By default, data from a JDBC source is loaded by one thread, so you will have one task processed by one executor; that is what you can expect in your second function, mysqlToDF.
In the first function, "jdbc", you are closer to a parallel read, but some parameters are still needed: numPartitions is not enough. Spark needs a numeric/date column and lower/upper bounds to be able to read in parallel (it will execute x queries for partial results).
Spark JDBC documentation
In this documentation you will find:
partitionColumn, lowerBound, upperBound (none): These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions (none): The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. (Scope: read/write.)
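To illustrate what the quoted options look like together, here is a minimal PySpark sketch of a parallel JDBC read; the URL, table name, credentials, and the numeric "id" column are all placeholders, and the bounds are assumed to roughly match the column's min/max values.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-jdbc-read").getOrCreate()

# Placeholder connection details; "id" is assumed to be a numeric column.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")   # column used to split the read into ranges
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")      # up to 8 concurrent JDBC connections
      .load())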
Regarding write:
How about writing data to MySQL, how many connections are there?
As stated in the documentation, it also depends on numPartitions: if the number of partitions at write time is higher than numPartitions, Spark will figure it out and call coalesce. Remember that coalesce may generate skew, so sometimes it may be better to repartition explicitly with repartition(numPartitions) to distribute the data equally before the write.
If you don't set numPartitions, the number of parallel connections on write may be the same as the number of active tasks at a given moment, so be aware that with too high parallelism and no upper bound you may choke the source server.
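A matching PySpark sketch of the write side, reusing the df from the read sketch above (connection details again placeholders): repartitioning explicitly before the JDBC write spreads the data evenly and caps the number of concurrent connections.

# Repartition to a known number of partitions, which also bounds the number of
# concurrent JDBC connections opened during the write.
(df.repartition(8)
   .write
   .format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/mydb")
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .option("dbtable", "my_table_copy")
   .option("user", "user")
   .option("password", "password")
   .mode("append")
   .save())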

Find all possible paths between two nodes on graph using a graph database

I have a collection of nodes that make up a DAG (directed acyclic graph), guaranteed to have no cycles. I want to store the nodes in a database and have the database execute a search that shows me all paths between two nodes.
For example, you could imagine that I have the git history of a complex project.
Each node can be described with a JSON object that has:
{"id": "id",
 "outbound": ["id1", "id2", "id3"]}
So if I had these nodes in the database:
{"id": "id0",
 "outbound": ["id1", "id2"]}
{"id": "id1",
 "outbound": ["id2", "id3", "id4", "id5", "id6"]}
{"id": "id2",
 "outbound": ["id2", "id3"]}
And if I wanted to know all of the paths connecting id0 and id3, I would want to get three lists:
id0 -> id1 -> id3
id0 -> id2 -> id3
id0 -> id1 -> id2 -> id3
I have thousands of these nodes today, I will have tens of thousands of them tomorrow. However, there are many DAGs in the database, and the typical DAG only has 5-10 nodes, so this problem is tractable.
I believe that there is no way to do this efficiently in MySQL (right now all of the objects are stored in a table in a JSON column); however, I believe it is possible to do it efficiently in a graph database like Neo4j.
I've looked at the Neo4J documentation on Path Finding Algorithms and perhaps I'm confused, but the examples don't really look like working examples. I found a MySQL example which uses stored procedures and it doesn't look like it parallelizes very well. I'm not even sure what Amazon Neptune is doing; I think that it is using Spark GraphX.
I'm sort of lost as to where to start on this.
It's perfectly doable with Neo4j.
Importing json data
[
{"id":"id0",
"outbound":["id1","id2"]
},
{"id":"id1",
"outbound":["id2","id3","id4","id5","id6"]
},
{"id":"id2",
"outbound":["id2","id3"]
}
]
CALL apoc.load.json("graph.json")
YIELD value
MERGE (n:Node {id: value.id})
WITH n, value.outbound AS outbound
UNWIND outbound AS o
MERGE (n2:Node {id: o})
MERGE (n)-[:Edge]->(n2)
Apparently the data you provided is not acyclic...
Getting all paths between two nodes
As you are not mentioning shortest paths, but all paths, there is no specific algorithm required:
MATCH p=(:Node {id: "id0"})-[:Edge*]->(:Node {id: "id3"}) RETURN nodes(p)
"[{""id"":id0},{""id"":id1},{""id"":id3}]"
"[{""id"":id0},{""id"":id2},{""id"":id3}]"
"[{""id"":id0},{""id"":id1},{""id"":id2},{""id"":id3}]"
"[{""id"":id0},{""id"":id2},{""id"":id2},{""id"":id3}]"
"[{""id"":id0},{""id"":id1},{""id"":id2},{""id"":id2},{""id"":id3}]"
Comparison with MySQL
See how-much-faster-is-a-graph-database-really
The Graph Data Science library pathfinding algorithms are designed to find the shortest weighted paths and use algorithms similar to Dijkstra's to find them. In your case, it seems that you are dealing with a directed unweighted graph, so you could use the native Cypher allShortestPaths function:
An example would be:
MATCH (n1:Node{id:"A"}),(n2:Node{id:"B"})
MATCH path=allShortestPaths((n1)-[*..10]->(n2))
RETURN [n in nodes(path) | n.id] as outbound_nodes_id
It is always useful to check the Cypher refcard to see what is available with Cypher in Neo4j

MemoryError during reading csv

I am trying to read some information into a Pandas DataFrame and am facing problems due to the volume of the data.
Specs of PC:
RAM 32 GB
Intel Core i7, 4 GHz
Setup:
Data is in MySQL DB, 9 columns (7 int, 1 date, 1 DateTime). DB is on the local machine, so no internet bandwidth issues.
22 mil. rows of data.
Tried to read directly from MySQL server - it never ends.
import sqlalchemy
import pandas as pd

engine = sqlalchemy.create_engine('mysql+pymysql://root:#localhost:3306/database')
search_df = pd.read_sql_table('search', engine)
I checked on SO and got the impression that instead of using the connector, it is better to parse a CSV. I exported the table to CSV.
CSV file size - 1.5GB
My code
dtype = {
    'search_id': int,
    'job_count_total': int,
    'job_count_done': int,
    'city_id_start': int,
    'city_id_end': int,
    'date_start': str,
    'datetime_create': str,
    'agent_id': int,
    'ride_segment_found_cnt': int,
}

search_df = pd.read_csv('search.csv', sep=',', dtype=dtype)
I tried both engines, c and python, different chunk sizes, low_memory as True and False, with dtypes specified and not, but I am still getting a MemoryError.
I tried everything mentioned in the question above (the one marked as the original, with mine marked as a duplicate), but nothing changes.
I spotted only two differences:
If I parse without chunks, I get the MemoryError while parsing.
When I parse in chunks, I get it on concatenation into one DataFrame.
Also, chunking by 5,000,000 rows gives an error while parsing; fewer rows give it on concatenation.
Here is an error message on concatenation:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
Basically, the problem was with memory.
I played a bit with the chunk size and added, on each chunk, some filtering that I previously had later in the code.
That allowed me to fit the DataFrame into memory.
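A sketch of that chunk-and-filter approach; the filter column and condition are made up for illustration, and dtype is the mapping defined in the question:

import pandas as pd

chunks = []
for chunk in pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=1_000_000):
    # Filter each chunk right away (hypothetical condition) so the full,
    # unfiltered table never has to sit in memory at once.
    chunk = chunk[chunk['ride_segment_found_cnt'] > 0]
    chunks.append(chunk)

search_df = pd.concat(chunks, ignore_index=True)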

Should I use Neo4j's Import Tool or Load Command to Insert Several Million Rows?

I have several CSV files that range from 25-100 MB in size. I have created constraints, created indices, am using periodic commit, and increased the allocated memory in the neo4j-wrapper.conf and neo4j.properties.
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
neo4j-wrapper.conf changes:
wrapper.java.initmemory=5000
wrapper.java.maxmemory=5000
However my load is still taking a very long time, and I am considering using the recently released Import Tool (http://neo4j.com/docs/milestone/import-tool.html). Before I switch to it, I was wondering whether I could be doing anything else to improve the speed of my imports.
I begin by creating several constraints to make sure that the IDs I'm using are unique:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
//and constraints for other name identifiers as well..
I then use periodic commit...
USING PERIODIC COMMIT 10000
I then LOAD in the CSV where I ignore several fields
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" as line
WITH line
WHERE line.CountryName IS NOT NULL AND line.CityName IS NOT NULL AND line.NeighborhoodName IS NOT NULL
I then create the necessary nodes from my data.
WITH line
MERGE(country:Country {name : line.CountryName})
MERGE(city:City {name : line.CityName})
MERGE(neighborhood:Neighborhood {
name : line.NeighborhoodName,
size : toInt(line.NeighborhoodSize),
nickname : coalesce(line.NeighborhoodNN, ""),
... 50 other features
})
MERGE (city)-[:IN]->(country)
CREATE (neighborhood)-[:IN]->(city)
//Note that each neighborhood only appears once
Does it make sense to use CREATE UNIQUE rather than applying MERGE to any COUNTRY reference? Would this speed it up?
A ~250,000-line CSV file took over 12 hours to complete, and seemed excessively slow. What else can I be doing to speed this up? Or does it just make sense to use the annoying-looking Import Tool?
A couple of things. Firstly, I would suggest reading Mark Needham's "Avoiding the Eager" blog post:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Basically, what it says is that you should add PROFILE to the start of each of your queries to see if any of them use the Eager operator. If they do, this can really cost you performance-wise, and you should probably split your queries up into separate MERGEs.
Secondly, your neighborhood MERGE contains a lot of properties, so each time it tries to match on every single one of those properties before deciding whether it should create the node or not. I'd suggest something like:
MERGE (neighborhood:Neighborhood {name: line.NeighborhoodName})
ON CREATE SET
neighborhood.size = toInt(line.NeighborhoodSize),
neighborhood.nickname = coalesce(line.NeighborhoodNN, ""),
... 50 other features

Redis strings vs Redis hashes to represent JSON: efficiency?

I want to store a JSON payload in Redis. There are really two ways I can do this:
One using simple string keys and values:
key:user, value:payload (the entire JSON blob which can be 100-200 KB)
SET user:1 payload
Using hashes
HSET user:1 username "someone"
HSET user:1 location "NY"
HSET user:1 bio "STRING WITH OVER 100 lines"
Keep in mind that if I use a hash, the value length isn't predictable. They're not all short such as the bio example above.
Which is more memory efficient? Using string keys and values, or using a hash?
This article can provide a lot of insight here: http://redis.io/topics/memory-optimization
There are many ways to store an array of Objects in Redis (spoiler: I like option 1 for most use cases):
Store the entire object as JSON-encoded string in a single key and keep track of all Objects using a set (or list, if more appropriate). For example:
INCR id:users
SET user:{id} '{"name":"Fred","age":25}'
SADD users {id}
Generally speaking, this is probably the best method in most cases. If there are a lot of fields in the Object, your Objects are not nested with other Objects, and you tend to only access a small subset of fields at a time, it might be better to go with option 2.
Advantages: considered a "good practice." Each Object is a full-blown Redis key. JSON parsing is fast, especially when you need to access many fields for this Object at once. Disadvantages: slower when you only need to access a single field.
Store each Object's properties in a Redis hash.
INCR id:users
HMSET user:{id} name "Fred" age 25
SADD users {id}
Advantages: considered a "good practice." Each Object is a full-blown Redis key. No need to parse JSON strings. Disadvantages: possibly slower when you need to access all/most of the fields in an Object. Also, nested Objects (Objects within Objects) cannot be easily stored.
Store each Object as a JSON string in a Redis hash.
INCR id:users
HMSET users {id} '{"name":"Fred","age":25}'
This allows you to consolidate a bit and only use two keys instead of lots of keys. The obvious disadvantage is that you can't set the TTL (and other stuff) on each user Object, since it is merely a field in the Redis hash and not a full-blown Redis key.
Advantages: JSON parsing is fast, especially when you need to access many fields for this Object at once. Less "polluting" of the main key namespace. Disadvantages: About same memory usage as #1 when you have a lot of Objects. Slower than #2 when you only need to access a single field. Probably not considered a "good practice."
Store each property of each Object in a dedicated key.
INCR id:users
SET user:{id}:name "Fred"
SET user:{id}:age 25
SADD users {id}
According to the article above, this option is almost never preferred (unless the property of the Object needs to have specific TTL or something).
Advantages: Object properties are full-blown Redis keys, which might not be overkill for your app. Disadvantages: slow, uses more memory, and not considered "best practice." Lots of polluting of the main key namespace.
Overall Summary
Option 4 is generally not preferred. Options 1 and 2 are very similar, and they are both pretty common. I prefer option 1 (generally speaking) because it allows you to store more complicated Objects (with multiple layers of nesting, etc.) Option 3 is used when you really care about not polluting the main key namespace (i.e. you don't want there to be a lot of keys in your database and you don't care about things like TTL, key sharding, or whatever).
If I got something wrong here, please consider leaving a comment and allowing me to revise the answer before downvoting. Thanks! :)
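For concreteness, a short redis-py sketch of options 1 and 2 side by side; it assumes a local Redis instance, and the key names and fields are placeholders taken from the question:

import json
import redis

r = redis.Redis()  # assumes a local Redis instance

user = {"username": "someone", "location": "NY", "bio": "STRING WITH OVER 100 lines"}

# Option 1: the whole object as one JSON string under a single key
r.set("user:1", json.dumps(user))
whole_user = json.loads(r.get("user:1"))

# Option 2: one hash per object, one field per property
r.hset("user:2", mapping=user)
location_only = r.hget("user:2", "location")  # read a single field without parsing JSON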
It depends on how you access the data:
Go for Option 1:
If you use most of the fields on most of your accesses.
If there is variance in the possible keys.
Go for Option 2:
If you use just single fields on most of your accesses.
If you always know which fields are available.
P.S.: As a rule of thumb, go for the option which requires fewer queries in most of your use cases.
Some additions to the given set of answers:
First of all, if you are going to use a Redis hash efficiently, you must know
the maximum number of keys and the maximum value size; otherwise, if they exceed hash-max-ziplist-value or hash-max-ziplist-entries, Redis will convert it to practically ordinary key/value pairs under the hood (see hash-max-ziplist-value, hash-max-ziplist-entries). And breaking out of the hash options under the hood IS REALLY BAD, because each ordinary key/value pair inside Redis costs +90 bytes per pair.
It means that if you start with option two and accidentally exceed hash-max-ziplist-value, you will get +90 bytes per EACH ATTRIBUTE you have inside the user model! (actually not +90 but +70, see the console output below)
# you need me-redis and awesome-print gems to run exact code
redis = Redis.include(MeRedis).configure( hash_max_ziplist_value: 64, hash_max_ziplist_entries: 512 ).new
=> #<Redis client v4.0.1 for redis://127.0.0.1:6379/0>
> redis.flushdb
=> "OK"
> ap redis.info(:memory)
{
"used_memory" => "529512",
**"used_memory_human" => "517.10K"**,
....
}
=> nil
# me_set( 't:i' ... ) same as hset( 't:i/512', i % 512 ... )
# txt is some english fictionary book around 56K length,
# so we just take some random 63-symbols string from it
> redis.pipelined{ 10000.times{ |i| redis.me_set( "t:#{i}", txt[rand(50000), 63] ) } }; :done
=> :done
> ap redis.info(:memory)
{
"used_memory" => "1251944",
**"used_memory_human" => "1.19M"**, # ~ 72b per key/value
.....
}
> redis.flushdb
=> "OK"
# setting **only one value** +1 byte per hash of 512 values equal to set them all +1 byte
> redis.pipelined{ 10000.times{ |i| redis.me_set( "t:#{i}", txt[rand(50000), i % 512 == 0 ? 65 : 63] ) } }; :done
> ap redis.info(:memory)
{
"used_memory" => "1876064",
"used_memory_human" => "1.79M", # ~ 134 bytes per pair
....
}
redis.pipelined{ 10000.times{ |i| redis.set( "t:#{i}", txt[rand(50000), 65] ) } };
ap redis.info(:memory)
{
"used_memory" => "2262312",
"used_memory_human" => "2.16M", #~155 byte per pair i.e. +90 bytes
....
}
Regarding TheHippo's answer, the comments on Option one are misleading:
hgetall/hmset/hmget to the rescue if you need all fields or multiple get/set operations.
Regarding BMiner's answer:
The third option is actually really fun: for a dataset with max(id) < hash-max-ziplist-value this solution has O(N) complexity, because, surprise, Redis stores small hashes as an array-like container of length/key/value objects!
But many times hashes contain just a few fields. When hashes are small we can instead just encode them in an O(N) data structure, like a linear array with length-prefixed key value pairs. Since we do this only when N is small, the amortized time for HGET and HSET commands is still O(1): the hash will be converted into a real hash table as soon as the number of elements it contains will grow too much
But you should not worry: you'll exceed hash-max-ziplist-entries very quickly, and then you are effectively at solution number 1.
The second option will most likely go to the fourth solution under the hood, because, as the question states:
Keep in mind that if I use a hash, the value length isn't predictable. They're not all short such as the bio example above.
And as you already said: the fourth solution is surely the most expensive, at +70 bytes per attribute.
My suggestion on how to optimize such a dataset:
You've got two options:
If you cannot guarantee the maximum size of some user attributes, then go for the first solution, and if memory is crucial,
compress the user JSON before storing it in Redis.
If you can force a maximum size on all attributes,
then you can set hash-max-ziplist-entries/value and use hashes either as one hash per user representation OR as the hash memory optimization from this topic of the Redis guide: https://redis.io/topics/memory-optimization and store the user as a JSON string. Either way you may also compress long user attributes.
We had a similar issue in our production environment; we came up with the idea of gzipping the payload if it exceeds some threshold in KB.
I have a repo dedicated only to this Redis client lib here.
The basic idea is to detect whether the payload is larger than some threshold, and if so gzip it and base64 it, then keep the compressed string as a normal string in Redis. On retrieval, detect whether the string is a valid base64 string and, if so, decompress it.
The whole compressing and decompressing is transparent, plus you gain close to 50% in network traffic.
Compression Benchmark Results
BenchmarkDotNet=v0.12.1, OS=macOS 11.3 (20E232) [Darwin 20.4.0]
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.201
[Host] : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT DEBUG
Method                       |       Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated
WithCompressionBenchmark     |   668.2 ms | 13.34 ms | 27.24 ms |     - |     - |     - |   4.88 MB
WithoutCompressionBenchmark  | 1,387.1 ms | 26.92 ms | 37.74 ms |     - |     - |     - |   2.39 MB
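A rough Python sketch of that threshold-then-compress idea; the threshold, key handling, and use of gzip are assumptions for illustration, not the linked library's actual API:

import base64
import gzip
import json
import redis

r = redis.Redis()
THRESHOLD = 10 * 1024  # hypothetical: only compress payloads above 10 KB

def save_payload(key, obj):
    raw = json.dumps(obj).encode()
    if len(raw) > THRESHOLD:
        raw = base64.b64encode(gzip.compress(raw))  # gzip, then base64, stored as a plain string
    r.set(key, raw)

def load_payload(key):
    raw = r.get(key)
    try:
        # If it decodes as base64 and un-gzips cleanly, it was stored compressed.
        raw = gzip.decompress(base64.b64decode(raw, validate=True))
    except Exception:
        pass  # stored uncompressed
    return json.loads(raw)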
To store JSON in Redis you can use the Redis JSON module.
This gives you:
Full support for the JSON standard
A JSONPath syntax for selecting/updating elements inside documents
Documents stored as binary data in a tree structure, allowing fast access to sub-elements
Typed atomic operations for all JSON value types
https://redis.io/docs/stack/json/
https://developer.redis.com/howtos/redisjson/getting-started/
https://redis.com/blog/redisjson-public-preview-performance-benchmarking/
You can use the JSON module: https://redis.io/docs/stack/json/
It is fully supported and allows you to use JSON as a data structure in Redis.
There are also Redis Object Mappers for some languages: https://redis.io/docs/stack/get-started/tutorials/