Increase column value by X every T seconds - mysql

I want to increase (or generally tamper with) a database column value (e.g. time_passed) every set amount of time.
How do I address this issue? Should I approach this server side or database side?
My app is using Express, Socket.IO and I am using a MySQL database.

Let's say I store a timestamp at midnight (00:00:00), and I'm interested in increasing a column value (say 100) by 1 every 10 seconds. In that case I don't actually need to update anything: the current value can be derived whenever it is read. After 60 seconds, I know that the new value is simply:
100 + (1 x ((60-0)/10)) = 106
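For instance, a minimal sketch of this compute-on-read idea in Python (the names and storage details are made up, not from the question):

import time

STEP_SECONDS = 10   # T: seconds between increments
INCREMENT = 1       # X: amount added per step

def current_value(base_value, started_at):
    # Derive the up-to-date value on read instead of ever updating the row.
    elapsed = time.time() - started_at
    return base_value + INCREMENT * int(elapsed // STEP_SECONDS)

# 60 seconds after midnight: current_value(100, midnight_ts) == 106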

I suggest using Cron jobs for this.
The software utility Cron is a time-based job scheduler in Unix-like
computer operating systems. People who set up and maintain software
environments use cron to schedule jobs (commands or shell scripts) to
run periodically at fixed times, dates, or intervals. It typically
automates system maintenance or administration—though its
general-purpose nature makes it useful for things like downloading
files from the Internet and downloading email at regular intervals.
https://en.wikipedia.org/wiki/Cron
You can set up a cron job to automatically run a script periodically, and that script can perform your database maintenance.
Cron is driven by a crontab (cron table) file, a configuration file that specifies shell commands to run periodically on a given schedule. Users can have their own individual crontab files and often there is a system-wide crontab file (usually in /etc or a subdirectory of /etc) that only system administrators can edit.
Each line of a crontab file represents a job, and looks like this:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * command to execute
The following example would run the script at /home/oracle/scripts/export_dump.sh every Saturday at 11:45 PM.
45 23 * * 6 /home/oracle/scripts/export_dump.sh
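If you do want the column physically updated, here is a hedged sketch of a script such a cron job could run (database, table, and credentials are hypothetical; note that cron's finest granularity is one minute, so each run applies six 10-second increments at once):

import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="mydb"
)
cur = conn.cursor()
# One run per minute = six 10-second increments of 1 each.
cur.execute("UPDATE my_table SET time_passed = time_passed + 6")
conn.commit()
conn.close()

A matching crontab entry to run it every minute:
* * * * * /usr/bin/python3 /home/app/increment.py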

Related

Incorrect handling of metrics by the delta() function in Grafana

I am using the prometheus-cpp library, Prometheus, and Grafana.
I want to calculate the intensity of my application. The application processes files; I count the processing time of each file and send it to Prometheus via counter.Increment(Processed_time). Then in Grafana I use the delta() function to calculate the intensity. The idea is this: if files were processed for 40 seconds within 1 minute, then the intensity is 40/60.
I am using the delta() function
delta(application_time_total_sec{}[1m]) / (60*1)
But there are two problems:
sometimes incorrect values are obtained from the delta() function: they are greater than 1, as if my application had worked for 75 seconds within 1 minute
if I increase the interval from 1 minute to 5 minutes, I get even stranger data:
delta(application_time_total_sec{}[5m]) / (60*5)
Am I using the delta() function incorrectly?

How to get Ethereum token holder count history for daily analysis

I want to create a chart to analyze the history of a token's holder count, for example:
holders amount
▲
│ ┌─────
│ │
│ │
│ │
│ ┌──────┘
│ │
│ │
│ ┌─────┘
│ │
│ │
└─┴───────────────────► date
The X axis is the date and the Y axis is the holder count; the chart above means that the holder count increases over time.
However, the problem is that I can't get the holder count for past dates; Etherscan only provides the real-time holder count.
You can calculate the number of token holders over time from the Transfer() event logs.
Use, for example, the getPastLogs() web3 function to collect the historical event logs:
fromBlock: the block in which the token contract was deployed
toBlock: the latest (current) block, or whichever range suits your needs
address: the address of the token contract emitting these event logs
topics: lets you filter only some events, or only events containing a specific value
For example, in your case the Transfer(address,address,uint256) event produces the value ddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef (the keccak-256 hash of the event signature) as the topics[0] item. So you can choose to filter only logs of this event, and not approvals and other events.
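For illustration, a sketch using web3.py, whose w3.eth.get_logs() is the Python counterpart of web3.js's getPastLogs(); the RPC endpoint and token address are placeholders, and for a busy token you would need to query in block ranges to stay under provider limits:

from collections import defaultdict
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://mainnet.infura.io/v3/<your-key>"))

TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
TOKEN_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder

logs = w3.eth.get_logs({
    "fromBlock": 0,            # ideally the contract's deployment block
    "toBlock": "latest",
    "address": TOKEN_ADDRESS,
    "topics": [TRANSFER_TOPIC],
})

balances = defaultdict(int)
for log in logs:
    # The indexed from/to addresses are the last 20 bytes of topics[1]/[2].
    sender = "0x" + log["topics"][1].hex()[-40:]
    receiver = "0x" + log["topics"][2].hex()[-40:]
    value = int(log["data"].hex(), 16)
    balances[sender] -= value
    balances[receiver] += value

# Holder count now; replaying the logs block by block gives the history.
print(sum(1 for b in balances.values() if b > 0))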

How many files are output by a Foundry Transform in various combinations of repartition, hive partitioning, and bucketing?

I think I understand how each of repartition, hive partitioning, and bucketing affects the number of output files, but I am not quite clear on the interaction of the various features. Can someone help fill in the number of output files for each of the situations below where I've left a blank? The intent is to understand the right code for a situation where I have a mix of high- and low-cardinality columns that I need to partition/bucket by, with frequent operations that filter on the low-cardinality columns and join on the high-cardinality columns.
Assume that we have a data frame df that starts with 200 input partitions, colA has 10 unique values, and colB has 1000 unique values.
First, a few to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files?
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions)
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = 10 output files?
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = ???
All together now:
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ???
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
For hive partitioning + bucketing, the number of output files is not constant and will depend on the actual data of the input partitions. To clarify, let's say df is 200 partitions, not 200 files. Output file counts scale with the number of input partitions, not the number of files; "200 files" could be misleading, as that could correspond to anywhere from 1 partition to thousands of partitions.
First, a few to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files
df.repartition(50, 'colB') = 50 output files
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = upper bound of 2,000 output files (200 input partitions * max 10 values per partition)
output.write_dataframe(df, partition_cols=['colB']) = max 200,000 output files (200 * 1000 values per partition)
output.write_dataframe(df, partition_cols=['colA', 'colB']) = max 2,000,000 output files (200 partitions * 10 values * 1000)
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = max 20,000 files (200 partitions * max 100 buckets per partition)
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = max 2,000 files (200 partitions * max 10 buckets per partition)
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = exactly 10 files (repartitioned dataset makes 10 input partitions, each partition outputs to only 1 bucket)
All together now:
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = I could be wrong on this, but I believe it's a max of 400,000 output files (200 input partitions * 10 colA partitions * 200 colB buckets)
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = I believe this is exactly 10,000 output files (repartition colA,colB = 10,000 partitions, each partition contains exactly 1 colA and 1 bucket of colB)
Background
The key to being able to reason about output file counts is understanding at which level each concept applies.
Repartition (df.repartition(N, 'colA', 'colB')) creates a new spark stage with the data shuffled as requested, into the specified number of shuffle partitions. This will change the number of tasks in the following stage, as well as the data layout in those tasks.
Hive partitioning (partition_cols=['colA', 'colB']) and bucketing (bucket_cols/bucket_count) only have an effect within the scope of the final stage's tasks, and affect how each task writes its data into files on disk.
In particular, each final-stage task will write one file per hive-partition/bucket combination present in its data. Combinations not present in a task's data will not produce a file at all (not even an empty one) when hive-partitioning or bucketing is in use.
Note: if not using hive-partitioning or bucketing, each task will write out exactly one file, even if that file is empty.
So in general you always want to repartition your data before writing, so that the data layout matches your hive-partitioning/bucketing settings (i.e. no hive-partition/bucket combination is split between multiple tasks); otherwise you could end up writing huge numbers of files. A sketch of this pattern follows.
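For instance, a minimal sketch of that pattern in a Foundry transform (dataset paths are placeholders; the write_dataframe keywords follow the question's usage):

from transforms.api import transform, Input, Output

@transform(
    out=Output("/project/datasets/output"),
    source=Input("/project/datasets/input"),
)
def compute(out, source):
    df = source.dataframe()
    # Shuffle so that no hive-partition/bucket combination spans multiple
    # tasks: 200 shuffle partitions by colB, matching bucket_count below.
    df = df.repartition(200, 'colB')
    out.write_dataframe(
        df,
        partition_cols=['colA'],
        bucket_cols=['colB'],
        bucket_count=200,
    )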
Your examples
I think there is some misunderstanding floating around, so let's go through these one by one.
First, a few to check my understanding:
df.repartition(100) = 100 output files of the same size
Yes - the data will be randomly, evenly shuffled into 100 partitions, causing 100 tasks, each of which will write exactly one file.
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
No - the number of partitions to shuffle into is unspecified, so it will default to 200. So you'll have 200 tasks, at most 10 of which will contain any data (could be fewer due to hash collisions), so you will end up with at least 190 empty files and at most 10 with data.
Note: with AQE in Spark 3, Spark may decide to coalesce the 200 partitions into fewer when it realizes most of them are very small. I don't know the exact logic there, so technically the answer is actually "200 or fewer, of which only 10 will contain data".
df.repartition('colB') = 1000 output files
No - Similar to above, the data will be shuffled into 200 partitions. However in this case they will (likely) all contain data, so you will get 200 roughly-equally sized files.
Note: due to hash collisions, files may be larger or smaller depending on how many values of colB happened to land in each partition.
df.repartition(50, 'colA') = 50 output files?
Yes - Similar to before, except now we've overridden the partition count from 200 to 50. So 10 files with data, 40 empty. (or fewer because of AQE)
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Yes - Same as before, we'll get 50 files of slightly varying sizes depending on how the hashes of the colB values work out.
Hive partitions:
(I think the below examples are written assuming df is in 100 partitions to start rather than 200 as specified, so I'm going to go with that)
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions)
Yes - You'll have 100 tasks, each of which will write one file for each colA value they see. So up to 1,000 files in the case the data is randomly distributed.
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
No - Missing a 0 here. 100 tasks, each of which could write as many as 1,000 files (one for each colB value), for a total of up to 100,000 files.
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
No - 100 tasks, each of which will write one file for each combination of partition cols it sees. There are 10,000 such combinations, so this could write as many as 100 * 10,000 = 1,000,000 files!
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Yes - The repartition will shuffle our data into 200 tasks, but only 10 will contain data. Each will contain exactly one value of colA, so will write exactly one file. The other 190 tasks will write no files. So 10 files exactly.
Bucketing:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
No - Since we haven't laid out the data carefully, we have 100 tasks with (maybe) randomly distributed data. Each task will write one file per bucket it sees. So this could write up to 100 * 100 = 10,000 files!
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = 10 output files?
No - Similar to above, 100 tasks, each could write up to 10 files. So worst-case is 1,000 files here.
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = ???
Now that we're adjusting the data layout before writing, we'll have 200 tasks, at most 10 of which will contain any data. Each value of colA will exist in only one task.
Each task will write one file per bucket it sees. So we should get at most 10 files here.
Note: Due to hash collisions, one or more buckets might be empty, so we might not get exactly 10.
All together now:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ???
100 tasks. 10 hive-partitions. 200 buckets.
Worst case is each task writes one file per hive-partition/bucket combination. i.e. 100 * 10 * 200 = 200,000 files.
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
This one is sneaky. We have 200 tasks and the data is shuffled carefully so each colA/colB combination is in just one task. So everything seems good.
BUT each bucket contains multiple values of colB, and we have done nothing to make sure that an entire bucket is localized to one spark task.
So at worst, we could get one file per value of colB, per hive partition (colA value). i.e. 10 * 1,000 = 10,000 files.
Given our particular parameters, we can do slightly better by just focusing on getting the buckets laid out optimally:
output.write_dataframe(df.repartition(200, 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200)
Now we're making sure that colB is shuffled exactly how it will be bucketed, so each task will contain exactly one bucket.
Then we'll get one file for each colA value in the task (likely 10 since colA is randomly shuffled), so at most 200 * 10 = 2,000 files.
This is the best we can do, assuming colA and colB are not correlated.
Conclusion
There's no one-size-fits-all approach to controlling file sizes.
Generally you want to make sure you shuffle your data so it's laid out in accordance with the hive-partition/bucketing strategy you're applying before writing.
However the specifics of what to do may vary in each case depending on your exact parameters.
The most important thing is to understand how these 3 concepts interact (as described in "Background" above), so you can reason about what will happen from first principles.

Hive query does not begin MapReduce process after starting job and generating Tracking URL

I'm using Apache Hive.
I created a table in Hive (similar to external table) and loaded data into the same using the LOAD DATA LOCAL INPATH './Desktop/loc1/kv1.csv' OVERWRITE INTO TABLE adih; command.
While I am able to retrieve simple data from the hive table adih (e.g. select * from adih, select c_code from adih limit 1000, etc), Hive gives me errors when I ask for data involving slight computations (e.g. select count(*) from adih, select distinct(c_code) from adih).
The Hive CLI output is as follows:
hive> select distinct add_user from adih;
Query ID = latize_20161031155801_8922630f-0455-426b-aa3a-6507aa0014c6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1477889812097_0006, Tracking URL = http://latize-data1:20005/proxy/application_1477889812097_0006/
Kill Command = /opt/hadoop-2.7.1/bin/hadoop job -kill job_1477889812097_0006
[6]+ Stopped $HIVE_HOME/bin/hive
Hive stops displaying any further logs / actions beyond the last line of "Kill Command"
Not sure where I have gone wrong; many answers on Stack Overflow tend to point back to YARN configs (environment config detailed below).
I have the log as well but it contains more than 30000 characters (Stack Overflow limit)
My hadoop environment is configured as follows -
1 Name Node & 1 Data Node. Each has 20 GB of RAM with sufficient disk space. I have allocated 13 GB of RAM to each of yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb, with mapreduce.map.memory.mb set to 4 GB and mapreduce.reduce.memory.mb set to 12 GB. The number of reducers is currently set to the default (-1). Also, Hive is configured to run with a MySQL DB (rather than Derby).
You should set appropriate values for the properties shown in your trace,
e.g. edit the properties in hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>67108864</value>
</property>
Looks like you have set mapred.reduce.tasks = -1, which makes Hive refer to its config to decide the number of reduce tasks.
You are getting an error as the number of reducers is missing in Hive config.
Try setting it using the command below:
Hive> SET mapreduce.job.reduces=XX
As per official documentation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (< no. of nodes > * < no. of maximum containers per node >).
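As a purely hypothetical worked example: with 2 nodes that can each run 10 containers, the guideline gives 0.95 * (2 * 10) = 19 reducers, i.e.:
Hive> SET mapreduce.job.reduces=19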
I managed to get Hive and MR to work - increased the memory configurations for all the processes involved:
Increased the RAM allocated to YARN Scheduler and maximum RAM allocated to the YARN Nodemanager (in yarn-site.xml), alongside increasing the RAM allocated to the Mapper and Reducer (in mapred-site.xml).
Also incorporated parts of the answers by #Sathiyan S and #vmorusu - set the hive.exec.reducers.bytes.per.reducer property to 1 GB of data, which directly affects the number of reducers that Hive uses (through application of its heuristic techniques).
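For reference, a sketch of that last setting as a hive-site.xml fragment (1 GB = 1073741824 bytes; the YARN and MapReduce memory values depend on your hardware, so they are not repeated here):
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>1073741824</value>
</property>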

Where might I begin on this optimization problem?

I have a simple program which reads a bunch of things from the filesystem, filters the results, and prints them. This simple program implements a domain-specific language to make selection easier. This DSL "compiles" down into an execution plan that looks like this (the input was C:\Windows\System32\* OR -md5"ABCDEFG" OR -tf):
Index  Success  Failure  Description
0      S        1        File Matches C:\Windows\System32\*
1      S        2        File MD5 Matches ABCDEFG
2      S        F        File is file. (Not directory)
The filter is applied to the given file; if it succeeds, the index pointer jumps to the index indicated in the success field, and if it fails, the index pointer jumps to the index indicated in the failure field. "S" means that the file is accepted, "F" means that the file is rejected.
Of course, a filter based on a simple file attribute check (!FILE_ATTRIBUTE_DIRECTORY) is much faster than a check based on the MD5 of the file, which requires opening the file and actually hashing it. Each filter "opcode" has a time class associated with it; MD5 gets a high timing number, ISFILE gets a low timing number.
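To make the semantics concrete, a minimal sketch of how such a plan might be evaluated for one file; the predicate callables are hypothetical stand-ins for the DSL's filters:

def evaluate(plan, file):
    # plan: list of (predicate, success, failure) triples, where success
    # and failure are either another index, 'S' (accept) or 'F' (reject).
    target = 0
    while True:
        predicate, success, failure = plan[target]
        target = success if predicate(file) else failure
        if target == 'S':
            return True
        if target == 'F':
            return False

# The plan above, with made-up predicates:
# plan = [(matches_system32, 'S', 1),
#         (md5_is_abcdefg,   'S', 2),
#         (is_regular_file,  'S', 'F')]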
I would like to reorder this execution plan so that opcodes that take a long time are executed as rarely as possible. For the above plan, that would mean it would have to be:
Index  Success  Failure  Description
0      S        1        File is file. (Not directory)
1      S        2        File Matches C:\Windows\System32\*
2      S        F        File MD5 Matches ABCDEFG
According to the "Dragon Book", picking the best order of execution for three-address code is an NP-complete problem (at least according to page 511 of the second edition of that text), but in that case they are talking about register allocation and other issues of the machine. In my case, the actual "intermediate code" is much simpler. I'm wondering if a scheme exists that would allow me to reorder the source IL into the optimal execution plan.
Here is another example:
{ C:\Windows\Inf* AND -tp } OR { -tf AND NOT C:\Windows\System32\Drivers* }
Parsed to:
Index  Success  Failure  Description
0      1        2        File Matches C:\Windows\Inf\*
1      S        2        File is a Portable Executable
2      3        F        File is file. (Not directory)
3      F        S        File Matches C:\Windows\System32\Drivers\*
which is optimally:
Index  Success  Failure  Description
0      1        2        File is file. (Not directory)
1      2        S        File Matches C:\Windows\System32\Drivers\*
2      3        F        File Matches C:\Windows\Inf\*
3      S        F        File is a Portable Executable
It sounds like it might be easier to pick an optimal order before compiling down to your opcodes. If you have a parse tree, and it is as "flat" as possible, then you can assign a score to each node and then sort each node's children by the lowest total score first.
For example:
{ C:\Windows\Inf* AND -tp } OR { -tf AND NOT C:\Windows\System32\Drivers* }
        1             2          3            4

       OR
      /  \
   AND    AND
   / \    / \
  1   2  3   4
You could sort the children of each AND node ((1, 2) and (3, 4)) by lowest score first and then assign that score to each AND node. Then sort the children of the OR node by the lowest score of their children.
Since AND and OR are commutative, this sorting operation won't change the meaning of your overall expression.
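A minimal sketch of that idea; the Node class and the cost numbers are hypothetical, not part of the question's DSL:

class Node:
    def __init__(self, kind, children=(), cost=0):
        self.kind = kind                # 'AND', 'OR', or a leaf predicate name
        self.children = list(children)
        self.cost = cost                # leaf time class, e.g. ISFILE=1, MD5=100

def order(node):
    # Sort children of commutative AND/OR nodes cheapest-first and
    # return the node's total score.
    if not node.children:
        return node.cost
    scores = {id(c): order(c) for c in node.children}
    if node.kind in ('AND', 'OR'):      # commutative, so reordering is safe
        node.children.sort(key=lambda c: scores[id(c)])
    return sum(scores.values())

# Second example: the cheap { -tf AND NOT Drivers* } branch sorts first.
tree = Node('OR', [Node('AND', [Node('PATH', cost=2), Node('PE', cost=5)]),
                   Node('AND', [Node('ISFILE', cost=1), Node('NOTPATH', cost=2)])])
order(tree)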
#Greg Hewgill is right; this is easier to perform on the AST than on the intermediate code. As you want to work on the intermediate code, the first goal is to transform it into a dependency tree (which will look like the AST /shrug).
Start with the leaves - and it is probably easiest if you use negative-predicates for NOT.
Index  Success  Failure  Description
0      1        2        File Matches C:\Windows\Inf\*
1      S        2        File is a Portable Executable
2      3        F        File is file. (Not directory)
3      F        S        File Matches C:\Windows\System32\Drivers\*
Extract Leaf (anything with both children as S, F, or an extracted Node; insert NOT where required; Replace all references to Leaf with reference to parent node of leaf)
Index  Success  Failure  Description
0      1        2        File Matches C:\Windows\Inf\*
1      S        2        File is a Portable Executable
2      L1       F        File is file. (Not directory)

L1=NOT(cost(child))
    |
    Pred(cost(PATH))
Extract Node (If Success points to Extracted Node use conjunction to join; Failure uses disjunction; Replace all references to Node with reference to resulting root of tree containing Node).
Index  Success  Failure  Description
0      1        L3       File Matches C:\Windows\Inf\*
1      S        L3       File is a Portable Executable
L3=AND L1 L2  (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
    /      \
L1=NOT(cost(child))   L2=IS(cost(child))
    |                     |
3=Pred(cost(PATH))    2=Pred(cost(ISFILE))
Extract Node
Index  Success  Failure  Description
0      L5       L3       File Matches C:\Windows\Inf\*
L5=OR L3 L4  (cost(Min(L3,L4) + (1.0 - Selectivity(Min(L3,L4))) * Max(L3,L4)))
   /      \
   |       L4=IS(cost(child))
   |           |
   |       1=Pred(cost(PORT_EXE))
   |
L3=AND L1 L2  (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
    /      \
L1=NOT(cost(child))   L2=IS(cost(child))
    |                     |
3=Pred(cost(PATH))    2=Pred(cost(ISFILE))
Extract Node (In the case where Success and Failure both refer to Nodes, you will have to inject the Node into the tree by pattern matching on the root of the sub-tree defined by the Node)
If root is OR, invert predicate if necessary to ensure reference is Success and inject as conjunction with child not referenced by Failure.
If root is AND, invert predicate if necessary to ensure reference is Failure and inject as disjunction with child root referenced by Success.
Resulting in:
L5=OR L3 L4  (cost(Min(L3,L4) + (1.0 - Selectivity(Min(L3,L4))) * Max(L3,L4)))
   /      \
   |       L4=AND(cost(as for L3))
   |           /         \
   |       L6=IS(cost(child))    L7=IS(cost(child))
   |           |                     |
   |       1=Pred(cost(PORT_EXE))    0=Pred(cost(PATH))
   |
L3=AND L1 L2  (cost(Min(L1,L2) + Selectivity(Min(L1,L2)) * Max(L1,L2)))
    /      \
L1=NOT(cost(child))   L2=IS(cost(child))
    |                     |
3=Pred(cost(PATH))    2=Pred(cost(ISFILE))
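Finally, a small sketch of the cost formulas used in the trees above; Selectivity() is a hypothetical estimate of how often the cheaper child succeeds:

def and_cost(a, b, selectivity_of_min):
    # Cheaper child runs first; the expensive child only runs when the
    # cheaper one succeeds.
    lo, hi = sorted((a, b))
    return lo + selectivity_of_min * hi

def or_cost(a, b, selectivity_of_min):
    # Cheaper child runs first; the expensive child only runs when the
    # cheaper one fails.
    lo, hi = sorted((a, b))
    return lo + (1.0 - selectivity_of_min) * hi

# e.g. ISFILE (cost 1, passes 80% of files) ANDed with an MD5 check (cost 100):
# and_cost(1, 100, 0.8) == 81.0, versus roughly 100 if the MD5 ran first.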