Inserting several "new" items into the database with DBIC - mysql

I'm working on a bioinformatics project that requires me to read genomic data (nothing too fancy, just think of it as strings) from various organisms and insert it into a database. Each read belongs to one organism and can contain from 5,000 to 50,000 genes, which I need to process and analyze prior to storage.
The script currently doing this is written in Perl and, after all calculations, stores the results in a hash like this:
$new{$id}{gene_name} = $id;
$new{$id}{gene_database_source} = $gene_database_source;
$new{$id}{product} = $product;
$new{$id}{sequence} = $sequence;
$new{$id}{seqlength} = $seqlength;
$new{$id}{digest} = $digest;
$new{$id}{mw} = $mw;
$new{$id}{iep} = $iep;
$new{$id}{tms} = $tms;
After all genes are read, the insertions are made by looping through the hash inside an eval {} block.
eval {
    foreach my $id (keys %new) {
        my $rs = $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source}
            }
        );
    }
};
While this "works", it has at least two problems I'd like to solve:
The eval statement is intended to "failsafe" the insertions: if one of the insertions fails, the eval dies and no insertion is done. This is clearly not how eval works. I'm pretty sure all insertions made up to the failure point will be committed, and there's no rollback whatsoever.
The script needs to loop twice through very large datasets (once while reading and creating the hashes, and once again when reading the hashes and performing the insertions). This makes the process's performance rather poor.
Instead of creating the hashes, I'd been thinking of using the new method of DBIx::Class, $schema->new({..stuff..}), and then doing one massive insert transaction. That would solve the double iteration, and the eval would either work (or not) on a single transaction, which would give the expected behaviour of either all insertions or none... Is there a way to do this?

You can create your massive transaction by using a TxnScopeGuard in DBIC. In the most basic form, that would be as follows.
eval { # or try from Try::Tiny
    my $guard = $schema->txn_scope_guard;

    foreach my $id ( keys %new ) {
        my $rs = $schema->resultset('Genes')->create(
            {
                gene_name              => $new{$id}{gene_name},
                gene_product           => $new{$id}{product},
                sequence               => $new{$id}{sequence},
                gene_protein_length    => $new{$id}{seqlength},
                digest                 => $new{$id}{digest},
                gene_isoelectric_point => $new{$id}{iep},
                gene_molecular_weight  => $new{$id}{mw},
                gene_tmd_count         => $new{$id}{tms},
                gene_species           => $species,
                species_code           => $spc,
                user_id                => $tdruserid,
                gene_database_source   => $new{$id}{gene_database_source}
            }
        );
    }

    $guard->commit;
};
You create a scope guard object, and when you're done setting up your transaction, you commit it. If the object goes out of scope, i.e. because something died, it will roll back the transaction automatically.
The eval can catch the die, and your program will not crash. You had that part correct, but you're also right that your code will not undo previous inserts. Note that Try::Tiny's try provides nicer syntax, but it's not needed here.
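For illustration, here is a minimal sketch of the same guard wrapped in Try::Tiny. The loop body is simplified and assumes the hash keys already match the column names, which is not quite the case in your schema, so treat it as a shape rather than drop-in code.

use Try::Tiny;

try {
    my $guard = $schema->txn_scope_guard;

    # simplified loop body; assumes the hash keys match the column names
    $schema->resultset('Genes')->create( $new{$_} ) for keys %new;

    $guard->commit;
}
catch {
    # the guard has gone out of scope by now, so the transaction was rolled back
    warn "Insert failed: $_";
};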
A transaction in this case means that all the inserts either take effect together when you commit, or not at all.
Note that this will still run one INSERT statement per row!
If you want to instead create larger INSERT statements, like the following, you need populate, not new.
INSERT INTO foo (bar, baz) VALUES
(1, 1),
(2, 2),
(3, 3),
...
The populate method lets you pass in an array reference with multiple rows at one time. This is supposed to be way faster than inserting one at a time.
$schema->resultset("Artist")->populate([
[ qw( artistid name ) ],
[ 100, 'A Formally Unknown Singer' ],
[ 101, 'A singer that jumped the shark two albums ago' ],
[ 102, 'An actually cool singer' ],
]);
Translated to your loop, that would be as follows. Note that the documentation claims that it's faster if you run it in void context.
eval {
    $schema->resultset('Genes')->populate(
        [
            [
                qw(
                    gene_name gene_product sequence
                    gene_protein_length digest gene_isoelectric_point
                    gene_molecular_weight gene_tmd_count gene_species
                    species_code user_id gene_database_source
                )
            ],
            map {
                [
                    $new{$_}{gene_name}, $new{$_}{product},
                    $new{$_}{sequence},  $new{$_}{seqlength},
                    $new{$_}{digest},    $new{$_}{iep},
                    $new{$_}{mw},        $new{$_}{tms},
                    $species,            $spc,
                    $tdruserid,          $new{$_}{gene_database_source},
                ]
            } keys %new
        ],
    );
};
Done like this, the scope guard is not needed. However, I would advise you not to do more than 1000 rows per statement. Processing it in chunks might be a good idea for performance reasons. In that case, you'd loop over the keys 1000 at a time. List::MoreUtils has a nice natatime function for that.
use List::MoreUtils 'natatime';

eval {
    my $guard = $schema->txn_scope_guard;

    my $it = natatime 1_000, keys %new;

    while ( my @keys = $it->() ) {
        $schema->resultset('Genes')->populate(
            [
                [
                    qw(
                        gene_name gene_product sequence
                        gene_protein_length digest gene_isoelectric_point
                        gene_molecular_weight gene_tmd_count gene_species
                        species_code user_id gene_database_source
                    )
                ],
                map {
                    [
                        $new{$_}{gene_name}, $new{$_}{product},
                        $new{$_}{sequence},  $new{$_}{seqlength},
                        $new{$_}{digest},    $new{$_}{iep},
                        $new{$_}{mw},        $new{$_}{tms},
                        $species,            $spc,
                        $tdruserid,          $new{$_}{gene_database_source},
                    ]
                } @keys
            ],
        );
    }

    $guard->commit;
};
Now it will do 1000 rows per insertion, and run all those queries in one big transaction. If one of them fails, none will be done.
The script needs to loop twice through very large datasets (once while reading and creating the hashes, and once again when reading the hashes and performing the insertions). This makes the process's performance rather poor.
You're not showing how you create the data, besides this assignment.
$new{$id}{gene_name} = $id;
$new{$id}{gene_database_source} = $gene_database_source;
$new{$id}{product} = $product;
If that's all there is to it, nothing is stopping you from using the approach I've shown above directly where you're processing the data the first time and building the hash. The following code is incomplete, because you're not telling us where the data is coming from, but you should get the gist.
eval {
    my $guard = $schema->txn_scope_guard;

    # we use this to collect rows to process
    my @rows;

    # the columns we insert, in the order the row values are collected below
    my @columns = qw(
        gene_name gene_product sequence
        gene_protein_length digest gene_isoelectric_point
        gene_molecular_weight gene_tmd_count gene_species
        species_code user_id gene_database_source
    );

    # this is where your data comes in
    while ( my $foo = <DATA> ) {

        # here you process the data and come up with your variables
        my ( $id, $gene_database_source, $product, $sequence, $seqlength,
            $digest, $mw, $iep, $tms );

        # collect the row so we can insert it later
        push(
            @rows,
            [
                $id, $product, $sequence, $seqlength, $digest, $iep,
                $mw, $tms, $species, $spc, $tdruserid, $gene_database_source,
            ]
        );

        # only insert if we reached the limit
        if ( scalar @rows == 1000 ) {
            $schema->resultset('Genes')->populate( [ \@columns, @rows ] );

            # empty the list of values
            @rows = ();
        }
    }

    # don't forget the rows left over after the last full chunk
    $schema->resultset('Genes')->populate( [ \@columns, @rows ] ) if @rows;

    $guard->commit;
};
Essentially we collect up to 1000 rows directly as array references while we process them, and when we've reached the limit, we pass them to the database. We then reset our row array and start over, and insert whatever is left over after the loop in one final populate call. Again, all of this is wrapped in a transaction, so it will only be committed if all the inserts are fine.
There is more information on transactions in DBIC in the cookbook.
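As an aside, DBIC also provides txn_do, which wraps a code reference in a transaction for you and rolls back (then rethrows) if the code dies, so you would still wrap the call in eval or try if you want to survive the error. A minimal sketch, with the same simplified loop body and column-name assumption as above:

$schema->txn_do(sub {
    # everything in here runs inside one transaction
    $schema->resultset('Genes')->create( $new{$_} ) for keys %new;
});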
Please note that I have not tested any of this code.

Related

decode_json and return first key in hash

JSON string input: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=MSFT&apikey=demo
I am trying to return just the first key (current day) in the hash but have been unable to do so. My code looks like the following
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;
use JSON;
my $html = get("https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=AMD&apikey=CMDPTEHVYH7W5VSZ");
my $decoded = decode_json($html);
my ($open) = $decoded->{'Time Series (Daily)'}->[0]->{'1. open'};
I keep getting "Not an ARRAY reference" which I researched and got more confused.
I can access what I want directly with the below code but I want to access just the first result or the current day:
my ($open) = $decoded->{'Time Series (Daily)'}{'2017-12-20'}{'1. open'};
Also if I do something like this:
my ($open) = $decoded->{'Time Series (Daily)'};
print Dumper($open);
The output is as follows:
$VAR1 = {
'2017-09-07' => {
'1. open' => '12.8400',
'5. volume' => '35467788',
'2. high' => '12.9400',
'4. close' => '12.6300',
'3. low' => '12.6000'
},
'2017-11-15' => {
'3. low' => '10.7700',
'4. close' => '11.0700',
'2. high' => '11.1300',
'5. volume' => '33326871',
'1. open' => '11.0100'
},
'2017-11-30' => {
'1. open' => '10.8700',
'2. high' => '11.0300',
'5. volume' => '43101899',
'3. low' => '10.7600',
'4. close' => '10.8900'
},
Thank you in advance for any help you can provide a noob.
Problem 1: { denotes the start of a JSON object, which gets decoded into a hash. Trying to dereference it as an array is going to fail.
Problem 2: Like Perl hashes, JSON objects are unordered, so talking about the
"first key" makes no sense. Perhaps you want the most recent date?
use List::Util qw( maxstr );
my $time_series_daily = $decoded->{'Time Series (Daily)'};
my $latest_date = maxstr keys %$time_series_daily;
my $open = $time_series_daily->{$latest_date}{'1. open'};
You are picking among hashref keys, not array (sequential container) elements. Since hashes are inherently unordered you can't index into that list but need to sort keys as needed.
With the exact format you show this works
my $top = (sort { $b cmp $a } keys %{ $decoded->{'Time Series (Daily)'} } )[0];
say $decoded->{'Time Series (Daily)'}{$top}{'1. open'};
It gets the list of keys, inverse-sorts them (alphabetically), and takes the first element of that list.
If your date-time format may vary then you'll need to parse it for sorting.
If you really only ever want the most recent one, this is inefficient since it sorts the whole list. In that case, use a more specific tool to extract only the "largest" element, like
use List::Util qw(reduce);
my $top = reduce { $a gt $b ? $a : $b }
keys %{ $decoded->{'Time Series (Daily)'} };
But then in your case this can be done simply by maxstr from the same List::Util module, as shown in ikegami's answer. On the other hand, if the datetime format doesn't lend itself to the direct lexicographical comparison used by maxstr, then the reduce allows use of custom comparisons.
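For the "parse it for sorting" case mentioned above, here is a small self-contained sketch using the core Time::Piece module. The '%d-%m-%Y' format and the sample keys are assumptions for illustration, not something from the API in the question.

use strict;
use warnings;
use Time::Piece;

# Pretend the API returned keys in a non-lexicographic format like '20-12-2017'.
my %series = (
    '19-12-2017' => { '1. open' => '170.00' },
    '20-12-2017' => { '1. open' => '171.50' },
);

# Parse each key with strptime and sort descending by epoch seconds.
my ($latest) =
    sort {
        Time::Piece->strptime( $b, '%d-%m-%Y' )->epoch
            <=> Time::Piece->strptime( $a, '%d-%m-%Y' )->epoch
    } keys %series;

print "$latest: $series{$latest}{'1. open'}\n";    # 20-12-2017: 171.50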

Symfony3 : How to do a massive import from a CSV file as fast as possible?

I have a .csv file with more than 690 000 rows.
I found a solution to import data that works very well but it's a little bit slow... (around 100 records every 3 seconds = 63 hours !!).
How can I improve my code to make it faster ?
I do the import via a console command.
Also, I would like to import only prescribers that aren't already in the database (to save time). To complicate things, no field is really unique (except for id).
Two prescribers can have the same last name and first name, live in the same city, and have the same RPPS and professional codes, but it's the combination of these 6 fields which makes them unique!
That's why I check every field before creating a new one.
<?php
namespace AppBundle\Command;
use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Helper\ProgressBar;
use AppBundle\Entity\Prescriber;
class PrescribersImportCommand extends ContainerAwareCommand
{
protected function configure()
{
$this
// the name of the command (the part after "bin/console")
->setName('import:prescribers')
->setDescription('Import prescribers from .csv file')
;
}
protected function execute(InputInterface $input, OutputInterface $output)
{
// Show when the script is launched
$now = new \DateTime();
$output->writeln('<comment>Start : ' . $now->format('d-m-Y G:i:s') . ' ---</comment>');
// Import CSV on DB via Doctrine ORM
$this->import($input, $output);
// Show when the script is over
$now = new \DateTime();
$output->writeln('<comment>End : ' . $now->format('d-m-Y G:i:s') . ' ---</comment>');
}
protected function import(InputInterface $input, OutputInterface $output)
{
$em = $this->getContainer()->get('doctrine')->getManager();
// Turning off doctrine default logs queries for saving memory
$em->getConnection()->getConfiguration()->setSQLLogger(null);
// Get php array of data from CSV
$data = $this->getData();
// Start progress
$size = count($data);
$progress = new ProgressBar($output, $size);
$progress->start();
// Processing on each row of data
$batchSize = 100; # frequency for persisting the data
$i = 1; # current index of records
foreach($data as $row) {
$p = $em->getRepository('AppBundle:Prescriber')->findOneBy(array(
'rpps' => $row['rpps'],
'lastname' => $row['nom'],
'firstname' => $row['prenom'],
'profCode' => $row['code_prof'],
'postalCode' => $row['code_postal'],
'city' => $row['ville'],
));
# If the prescriber does not exist we create one
if(!is_object($p)){
$p = new Prescriber();
$p->setRpps($row['rpps']);
$p->setLastname($row['nom']);
$p->setFirstname($row['prenom']);
$p->setProfCode($row['code_prof']);
$p->setPostalCode($row['code_postal']);
$p->setCity($row['ville']);
$em->persist($p);
}
# flush each 100 prescribers persisted
if (($i % $batchSize) === 0) {
$em->flush();
$em->clear(); // Detaches all objects from Doctrine!
// Advancing for progress display on console
$progress->advance($batchSize);
$progress->display();
}
$i++;
}
// Flushing and clear data on queue
$em->flush();
$em->clear();
// Ending the progress bar process
$progress->finish();
}
protected function getData()
{
// Getting the CSV from filesystem
$fileName = 'web/docs/prescripteurs.csv';
// Using service for converting CSV to PHP Array
$converter = $this->getContainer()->get('app.csvtoarray_converter');
$data = $converter->convert($fileName);
return $data;
}
}
EDIT
According to @Jake N's answer, here is the final code.
It's very, very fast! 10 minutes to import 653 727 / 693 230 rows (39 503 duplicate items!)
1) Add two columns in my table : created_at and updated_at
2) Add a single UNIQUE index covering every column of my table (except id and the date columns), via phpMyAdmin, to prevent duplicate items.
3) Add ON DUPLICATE KEY UPDATE in my query, to update just the updated_at column.
foreach($data as $row) {
$sql = "INSERT INTO prescripteurs (rpps, nom, prenom, code_prof, code_postal, ville)
VALUES(:rpps, :nom, :prenom, :codeprof, :cp, :ville)
ON DUPLICATE KEY UPDATE updated_at = NOW()";
$stmt = $em->getConnection()->prepare($sql);
$r = $stmt->execute(array(
'rpps' => $row['rpps'],
'nom' => $row['nom'],
'prenom' => $row['prenom'],
'codeprof' => $row['code_prof'],
'cp' => $row['code_postal'],
'ville' => $row['ville'],
));
if (!$r) {
$progress->clear();
$output->writeln('<comment>An error occured.</comment>');
$progress->display();
} elseif (($i % $batchSize) === 0) {
$progress->advance($batchSize);
$progress->display();
}
$i++;
}
// Ending the progress bar process
$progress->finish();
1. Don't use Doctrine
Try not to use Doctrine if you can: it eats memory and, as you have found, is slow. Use just raw SQL for the import, with simple INSERT statements:
$sql = <<<SQL
INSERT INTO `category` (`label`, `code`, `is_hidden`) VALUES ('Hello', 'World', '1');
SQL;
$stmt = $this->getDoctrine()->getManager()->getConnection()->prepare($sql);
$stmt->execute();
Or you can prepare the statement with values:
$sql = <<<SQL
INSERT INTO `category` (`label`, `code`, `is_hidden`) VALUES (:label, :code, :hidden);
SQL;
$stmt = $this->getDoctrine()->getManager()->getConnection()->prepare($sql);
$stmt->execute(['label' => 'Hello', 'code' => 'World', 'hidden' => 1]);
Untested code, but it should get you started as this is how I have done it before.
2. Index
Also, for your checks, have you got an index on all those fields, so that the lookup is as quick as possible?

Parse any number of subkeys in a perl hash

I have a Perl hash that is obtained from parsing JSON. The JSON could be anything a user-defined API could generate. The goal is to obtain a date/time string and determine if that date/time is out of bounds according to a user-defined threshold. The only issue I have is that Perl seems a bit cumbersome when dealing with hash key/subkey iteration. How can I look through all the keys and determine if a key or subkey exists throughout the hash? I have read many threads on Stack Overflow, but nothing that exactly meets my needs. I only started Perl last week, so I may be missing something... Let me know if that's the case.
Below is the "relevant" code/subs. For all code see: https://gitlab.com/Jedimaster0/check_http_freshness
use warnings;
use strict;
use LWP::UserAgent;
use Getopt::Std;
use JSON::Parse 'parse_json';
use JSON::Parse 'assert_valid_json';
use DateTime;
use DateTime::Format::Strptime;
# Verify the content-type of the response is JSON
eval {
assert_valid_json ($response->content);
};
if ( $@ ){
print "[ERROR] Response isn't valid JSON. Please verify source data. \n$@";
exit EXIT_UNKNOWN;
} else {
# Convert the JSON data into a perl hashrefs
$jsonDecoded = parse_json($response->content);
if ($verbose){print "[SUCCESS] JSON FOUND -> ", $response->content , "\n";}
if (defined $jsonDecoded->{$opts{K}}){
if ($verbose){print "[SUCCESS] JSON KEY FOUND -> ", $opts{K}, ": ", $jsonDecoded->{$opts{K}}, "\n";}
NAGIOS_STATUS(DATETIME_DIFFERENCE(DATETIME_LOOKUP($opts{F}, $jsonDecoded->{$opts{K}})));
} else {
print "[ERROR] Retreived JSON does not contain any data for the specified key: $opts{K}\n";
exit EXIT_UNKNOWN;
}
}
sub DATETIME_LOOKUP {
my $dateFormat = $_[0];
my $dateFromJSON = $_[1];
my $strp = DateTime::Format::Strptime->new(
pattern => $dateFormat,
time_zone => $opts{z},
on_error => sub { print "[ERROR] INVALID TIME FORMAT: $dateFormat OR TIME ZONE: $opts{z} \n$_[1] \n" ; HELP_MESSAGE(); exit EXIT_UNKNOWN; },
);
my $dt = $strp->parse_datetime($dateFromJSON);
if (defined $dt){
if ($verbose){print "[SUCCESS] Time formatted using -> $dateFormat\n", "[SUCCESS] JSON date converted -> $dt $opts{z}\n";}
return $dt;
} else {
print "[ERROR] DATE VARIABLE IS NOT DEFINED. Pattern or timezone incorrect."; exit EXIT_UNKNOWN
}
}
# Subtract JSON date/time from now and return delta
sub DATETIME_DIFFERENCE {
my $dateInitial = $_[0];
my $deltaDate;
# Convert to UTC for standardization of computations and it's just easier to read when everything matches.
$dateInitial->set_time_zone('UTC');
$deltaDate = $dateNowUTC->delta_ms($dateInitial);
if ($verbose){print "[SUCCESS] (NOW) $dateNowUTC UTC - (JSON DATE) $dateInitial ", $dateInitial->time_zone->short_name_for_datetime($dateInitial), " = ", $deltaDate->in_units($opts{u}), " $opts{u} \n";}
return $deltaDate->in_units($opts{u});
}
Sample Data
{
"localDate":"Wednesday 23rd November 2016 11:03:37 PM",
"utcDate":"Wednesday 23rd November 2016 11:03:37 PM",
"format":"l jS F Y h:i:s A",
"returnType":"json",
"timestamp":1479942217,
"timezone":"UTC",
"daylightSavingTime":false,
"url":"http:\/\/www.convert-unix-time.com?t=1479942217",
"subkey":{
"altTimestamp":1479942217,
"altSubkey":{
"thirdTimestamp":1479942217
}
}
}
[SOLVED]
I have used the answer that @HåkonHægland provided. Here are the code changes. Using the flatten module, I can use any input string that matches the JSON keys. I still have some work to do, but you can see the issue is resolved. Thanks @HåkonHægland.
use warnings;
use strict;
use Data::Dumper;
use LWP::UserAgent;
use Getopt::Std;
use JSON::Parse 'parse_json';
use JSON::Parse 'assert_valid_json';
use Hash::Flatten qw(:all);
use DateTime;
use DateTime::Format::Strptime;
# Verify the content-type of the response is JSON
eval {
assert_valid_json ($response->content);
};
if ( $@ ){
print "[ERROR] Response isn't valid JSON. Please verify source data. \n$@";
exit EXIT_UNKNOWN;
} else {
# Convert the JSON data into a perl hashrefs
my $jsonDecoded = parse_json($response->content);
my $flatHash = flatten($jsonDecoded);
if ($verbose){print "[SUCCESS] JSON FOUND -> ", Dumper($flatHash), "\n";}
if (defined $flatHash->{$opts{K}}){
if ($verbose){print "[SUCCESS] JSON KEY FOUND -> ", $opts{K}, ": ", $flatHash>{$opts{K}}, "\n";}
NAGIOS_STATUS(DATETIME_DIFFERENCE(DATETIME_LOOKUP($opts{F}, $flatHash->{$opts{K}})));
} else {
print "[ERROR] Retreived JSON does not contain any data for the specified key: $opts{K}\n";
exit EXIT_UNKNOWN;
}
}
Example:
./check_http_freshness.pl -U http://bastion.mimir-tech.org/json.html -K result.creation_date -v
[SUCCESS] JSON FOUND -> $VAR1 = {
'timestamp' => '20161122T200649',
'result.data_version' => 'data_20161122T200649_data_news_topics',
'result.source_version' => 'kg_release_20160509_r33',
'result.seed_version' => 'seed_20161016',
'success' => 1,
'result.creation_date' => '20161122T200649',
'result.data_id' => 'data_news_topics',
'result.data_tgz_name' => 'data_news_topics_20161122T200649.tgz',
'result.source_data_version' => 'seed_vtv: data_20161016T102932_seed_vtv',
'result.data_digest' => '6b5bf1c2202d6f3983d62c275f689d51'
};
Odd number of elements in anonymous hash at ./check_http_freshness.pl line 78, <DATA> line 1.
[SUCCESS] JSON KEY FOUND -> result.creation_date:
[SUCCESS] Time formatted using -> %Y%m%dT%H%M%S
[SUCCESS] JSON date converted -> 2016-11-22T20:06:49 UTC
[SUCCESS] (NOW) 2016-11-26T19:02:15 UTC - (JSON DATE) 2016-11-22T20:06:49 UTC = 94 hours
[CRITICAL] Delta hours (94) is >= (24) hours. Data is stale.
You could try using Hash::Flatten. For example:
use Hash::Flatten qw(flatten);
my $json_decoded = parse_json($json_str);
my $flat = flatten( $json_decoded );
say "found" if grep /(?:^|\.)\Q$key\E(?:\.?|$)/, keys %$flat;
You can use Data::Visitor::Callback to traverse the data structure. It lets you define callbacks for different kinds of data types inside your structure. Since we're only looking at a hash it's relatively simple.
The following program has a predefined list of keys to find (those would be user input in your case). I converted your example JSON to a Perl hashref and included it in the code because the conversion is not relevant. The program visits every hashref in this data structure (including the top level) and runs the callback.
Callbacks in Perl are code references. These can be created in two ways. We're doing the anonymous subroutine (sometimes called lambda function in other languages). The callback gets passed two arguments: the visitor object and the current data substructure.
We'll iterate over all the keys we want to find and simply check if they exist in that current data structure. If we see one, we count its existence in the %seen hash. Using a hash to store things we have seen is a common idiom in Perl.
We're using a postfix if here, which is convenient and easy to read. %seen is a hash, so we access the value behind the $key with $seen{$key}, while $data is a hash reference, so we use the dereferencing operator -> to access the value behind $key with $data->{$key}.
The callback needs us to return the $data again so it continues. The last line is just there, it's not important.
I've used Data::Printer to output the %seen hash because it's convenient. You can also use Data::Dumper if you want. In production, you will not need that.
use strict;
use warnings;
use Data::Printer;
use Data::Visitor::Callback;
my $from_json = {
"localDate" => "Wednesday 23rd November 2016 11:03:37 PM",
"utcDate" => "Wednesday 23rd November 2016 11:03:37 PM",
"format" => "l jS F Y h:i:s A",
"returnType" => "json",
"timestamp" => 1479942217,
"timezone" => "UTC",
"daylightSavingTime" =>
0, # this was false, I used 0 because that's a non-true value
"url" => "http:\/\/www.convert-unix-time.com?t=1479942217",
"subkey" => {
"altTimestamp" => 1479942217,
"altSubkey" => {
"thirdTimestamp" => 1479942217
}
}
};
my @keys_to_find = qw(timestamp altTimestamp thirdTimestamp missingTimestamp);

my %seen;
my $visitor = Data::Visitor::Callback->new(
    hash => sub {
        my ( $visitor, $data ) = @_;

        foreach my $key (@keys_to_find) {
            $seen{$key}++ if exists $data->{$key};
        }

        return $data;
    },
);

$visitor->visit($from_json);

p %seen;
The program outputs the following. Note this is not a Perl data structure. Data::Printer is not a serializer, it's a tool to make data human readable in a convenient way.
{
altTimestamp 1,
thirdTimestamp 1,
timestamp 1
}
Since you also wanted to constrain the input, here's an example of how to do that. The following program is a modification of the one above. It allows giving a set of different constraints for every required key.
I've done that by using a dispatch table. Essentially, that's a hash that contains code references. Kind of like the callbacks we use for the Visitor.
The constraints I've included are doing some things with dates. An easy way to work with dates in Perl is the core module Time::Piece. There are lots of questions around here about various date things where Time::Piece is the answer.
I've only done one constraint per key, but you could easily include several checks in those code refs, or make a list of code refs and put them in an array ref (keys => [ sub(), sub(), sub() ]) and then iterate that later.
In the visitor callback we are now also keeping track of the keys that have %passed the constraints check. We're calling the coderef with $coderef->($arg). If a constraint check returns a true value, it gets noted in the hash.
use strict;
use warnings;
use Data::Printer;
use Data::Visitor::Callback;
use Time::Piece;
use Time::Seconds;    # for ONE_DAY

my $from_json = { ... };    # same as above

# prepare one of the constraints
# where I'm from, Christmas eve is considered Christmas
my $christmas = Time::Piece->strptime('24 Dec 2016', '%d %b %Y');

# set up the constraints per required key
my %constraints = (
    timestamp => sub {
        my ($epoch) = @_;

        # not older than one day
        return $epoch < time && $epoch > time - ONE_DAY;
    },
    altTimestamp => sub {
        my ($epoch) = @_;

        # epoch value should be an even number
        return !( $epoch % 2 );
    },
    thirdTimestamp => sub {
        my ($epoch) = @_;

        # before Christmas 2016
        return $epoch < $christmas;
    },
);

my %seen;
my %passed;
my $visitor = Data::Visitor::Callback->new(
    hash => sub {
        my ( $visitor, $data ) = @_;

        foreach my $key (keys %constraints) {
            if ( exists $data->{$key} ) {
                $seen{$key}++;
                $passed{$key}++ if $constraints{$key}->( $data->{$key} );
            }
        }

        return $data;
    },
);

$visitor->visit($from_json);

p %passed;
The output this time is:
{
thirdTimestamp 1,
timestamp 1
}
If you want to learn more about the dispatch tables, take a look at chapter two of the book Higher Order Perl by Mark Jason Dominus which is legally available for free here.

Activerecord query with group on multiple columns returning a hash with array as a key

I wrote an ActiveRecord query to fetch count of some data after grouping by two columns col_a and col_b
result = Sample.where(through: ['col_a', 'col_b'], status: [1, 5]).where("created_at > ?", 1.month.ago).group(:status, :through).count
This returns:
{[1, "col_a"]=>7, [1, "col_b"]=>7, [5, "col_a"]=>4, [5, "col_b"]=>1}
Now my question is, how do I access the values in this hash?
Doing something like results[1, "col_a"] throws an error (wrong no. of arguments).
I know I can do this by writing a loop and extracting the values one by one.
However I want to know if there is a more idiomatic way to access the values, something similar to results[1], maybe?
results[[1, "col_a"]]
# => 7
Four possible ways (I'm sure there are others):
# fetch one value at a time
results[[1, "col_a"]]
# => 7
# fetch all the values
results.values
# => [7, 7, 4, 1]
# loop through keys and values
results.each do |key, value|
puts key
puts value
end
# => [1, "col_a"], 7....
# convert results into a more usable hash
results = results.map { |k, v| { k.join("_") => v } }.reduce({}, :merge)
results['1_col_a']
# => 7
Another heavier option, especially if this is a query you will do often, is to wrap the results into a new Ruby object. Then you can parse and use the results in a more idiomatic way and define an accessor simpler than [1,'col_a'].
class SampleGroupResult
  attr_reader :key, :value

  def initialize(key, value)
    @key = key
    @value = value
  end
end

results.map { |k, v| SampleGroupResult.new(k, v) }

Dynamic Multi Insert with DBI placeholders for many sets of VALUES

I'm building a dynamic SQL statement, that will insert one or more sets of VALUES via a prepared DBI statement, my question is this:
Since I have a dynamic number of VALUES sets, I will need to add as many ( ?, ?, ? ),( ?, ?, ? ) etc. as necessary to extend the statement INSERT INTO `tblname` ( $columnsString ) VALUES in order to submit only one query using placeholders and bind values. Is this the preferred method (most efficient, etc.; reasoning behind the efficiency would be helpful in your answer if possible), or should I just be building this as a query string with sprintf and $dbh->quote()?
(As a little extra information: I'm actually using AnyEvent::DBI right now, which only exposes placeholders & bind values and not the quote() method so this wouldn't be easy for me to accomplish without creating another straight DBI $dbh and using another db server connection just to use the quote() method, or without altering the AnyEvent::DBI module myself.)
Normally I would just execute the statements as necessary but in this heavy workload case I'm trying to batch inserts together for some DB efficiency.
Also, if anyone could answer if it is possible( and then how to ) insert an sql DEFAULT value using placeholders and bind values that'd be awesome. Typically if I ever needed to do that I'd append the DEFAULTs to the string directly and use sprintf and $dbh->quote() only for the non DEFAULT values.
UPDATE:
Worked out the misunderstanding in a quick chat. User ikegami suggested that instead of building the query string myself without placeholders, that I just intermingle VALUES and placeholders such as:
$queryString .= '(DEFAULT,?,?),(DEFAULT,DEFAULT,DEFAULT)';
Part of the reasoning behind my first asking this question on SO was that I was somewhat against this intermingling, thinking it made the code less readable; though after being assured that an SQL DEFAULT couldn't be passed as a placeholder bind value, this was the method I began implementing.
Using placeholders where possible does seem to be the more accepted method of building queries, and if you want an SQL DEFAULT you just need to include it in the same query building as the placeholders. This does not apply to NULL values, as those CAN be inserted with placeholders and a bind value of undef.
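To make that concrete, here is a small illustrative sketch, assuming an existing $dbh; the table "widget" and its columns are made-up names. The column with a database default gets a literal DEFAULT keyword in the SQL, the rest are placeholders, and undef is bound where a NULL is wanted.

# hypothetical table: widget(id, name, note), where id has a DB default
my $sql = 'INSERT INTO widget (id, name, note) VALUES (DEFAULT, ?, ?), (DEFAULT, ?, ?)';

my @bind = (
    'first',  'has a note',
    'second', undef,          # undef is bound as SQL NULL
);

$dbh->do( $sql, undef, @bind );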
Update 2:
The reason I asked about performance, about the 'acceptance' of building your own query with quote() vs building with placeholders, and why I've gone with a solution that uses all columns in the SQL INSERT INTO tblname (cols), is that I have roughly 2-4 million rows a day going into a terrible db server, and my code is running on an equally terrible server. With my requirement of needing DEFAULT sql values, and these terrible performance constraints, I've chosen a solution for now.
For future devs who stumble upon this - take a look at @emazep's solution of using SQL::Abstract, or if for some reason you need to build your own, you might consider either using @Schwern's subroutine solution or possibly incorporating some of @ikegami's answer into it, as these are all great answers as to the current state of affairs regarding the usage of DBI and building dynamic queries.
Unless there is a specific reason to reinvent the wheel (there could be some), SQL::Abstract (among others) has already solved the problem of dynamic SQL generation for all of us:
my %data = (
    name    => 'Jimbo Bobson',
    phone   => '123-456-7890',
    address => '42 Sister Lane',
    city    => 'St. Louis',
    state   => 'Louisiana',
);

use SQL::Abstract;
my ($stmt, @bind)
    = SQL::Abstract->new->insert('people', \%data);

print $stmt, "\n";
print join ', ', @bind;
which prints:
INSERT INTO people ( address, city, name, phone, state)
VALUES ( ?, ?, ?, ?, ? )
42 Sister Lane, St. Louis, Jimbo Bobson, 123-456-7890, Louisiana
SQL::Abstract then offers a nice trick to iterate over many rows to insert without regenerating the SQL every time, but for bulk inserts there is also SQL::Abstract::Plugin::InsertMulti
use SQL::Abstract;
use SQL::Abstract::Plugin::InsertMulti;

my ($stmt, @bind)
    = SQL::Abstract->new->insert_multi( 'people', [
        { name => 'foo', age => 23 },
        { name => 'bar', age => 40 },
    ]);

# INSERT INTO people ( age, name ) VALUES ( ?, ? ), ( ?, ? )
# 23, foo, 40, bar
I have, on occasion, used a construct like:
#!/usr/bin/env perl
use strict; use warnings;
# ...
my @columns = ('a' .. 'z');
my $sql = sprintf(q{INSERT INTO sometable (%s) VALUES (%s)},
    join(',', map $dbh->quote($_), @columns),
    join(',', ('?') x @columns),
);
As for handling DEFAULT, wouldn't leaving that column out ensure that the DB sets it to the default value?
If you would use placeholders for "static" queries, you should use them for "dynamic" queries too. A query is a query.
my $stmt = 'UPDATE Widget SET foo=?';
my @params = $foo;

if ($set_far) {
    $stmt .= ', far=?';
    push @params, $far;
}

{
    my @where;

    if ($check_boo) {
        push @where, 'boo=?';
        push @params, $boo;
    }

    if ($check_bar) {
        push @where, 'bar=?';
        push @params, $bar;
    }

    $stmt .= ' WHERE ' . join ' AND ', map "($_)", @where
        if @where;
}

$dbh->do($stmt, undef, @params);
I used an UPDATE since it allowed me to demonstrate more, but everything applies to INSERT too.
my @fields = ('foo');
my @params = ($foo);

if ($set_far) {
    push @fields, 'bar';
    push @params, $far;
}

$stmt = 'INSERT INTO Widget ('
    . join(',', @fields)
    . ') VALUES ('
    . join(',', ('?') x @fields)
    . ')';

$dbh->do($stmt, undef, @params);
You've expressed concerns about the readability of the code and also about being able to pass in a DEFAULT. I'll take @ikegami's answer one step further...
sub insert {
    my($dbh, $table, $fields, $values) = @_;

    my $q_table      = $dbh->quote($table);
    my @q_fields     = map { $dbh->quote($_) } @$fields;
    my @placeholders = map { "?" } @q_fields;

    my $sql = qq{
        INSERT INTO $q_table
               ( @{[ join(', ', @q_fields) ]} )
        VALUES ( @{[ join(', ', @placeholders) ]} )
    };

    return $dbh->do($sql, undef, @$values);
}
Now you have a generic multi value insert routine.
# INSERT INTO foo ('bar', 'baz') VALUES ( 23, 42 )
insert( $dbh, "foo", ['bar', 'baz'], [23, 43] );
To indicate a default value, don't pass in that column.
# INSERT INTO foo ('bar') VALUES ( 23 )
# 'baz' will use its default
insert( $dbh, "foo", ['bar'], [23] );
You can optimize this to make your subroutine do multiple inserts with one subroutine call and one prepared statement saving CPU on the client side (and maybe some on the database side if it supports prepared handles).
sub insert {
    my($dbh, $table, $fields, @rows) = @_;

    my $q_table      = $dbh->quote($table);
    my @q_fields     = map { $dbh->quote($_) } @$fields;
    my @placeholders = map { "?" } @q_fields;

    my $sql = qq{
        INSERT INTO $q_table
               ( @{[ join(', ', @q_fields) ]} )
        VALUES ( @{[ join(', ', @placeholders) ]} )
    };

    my $sth = $dbh->prepare_cached($sql);

    for my $values (@rows) {
        $sth->execute(@$values);
    }
}
# INSERT INTO foo ('bar', 'baz') VALUES ( 23, 42 )
# INSERT INTO foo ('bar', 'baz') VALUES ( 99, 12 )
insert( $dbh, "foo", ['bar', 'baz'], [23, 43], [99, 12] );
Finally, you can write a bulk insert passing in multiple values in a single statement. This is probably the most efficient way to do large groups of inserts. This is where having a fixed set of columns and passing in a DEFAULT marker comes in handy. I've employed the idiom where values passed as scalar references are treated as raw SQL values. Now you have the flexibility to pass in whatever you like.
sub insert {
    my($dbh, $table, $fields, @rows) = @_;

    my $q_table  = $dbh->quote($table);
    my @q_fields = map { $dbh->quote($_) } @$fields;

    my $sql = qq{
        INSERT INTO $q_table
               ( @{[ join(', ', @q_fields) ]} )
        VALUES
    };

    # This would be more elegant building an array and then joining it together
    # on ",\n", but that would double the memory usage and there might be
    # a lot of values.
    for my $values (@rows) {
        $sql .= "( ";

        # Scalar refs are treated as bare SQL.
        $sql .= join ", ", map { ref $_ ? $$_ : $dbh->quote($_) } @$values;

        $sql .= "),\n";
    }
    $sql =~ s{,\n$}{};

    return $dbh->do($sql);
}
# INSERT INTO foo ('bar', 'baz') VALUES ( 23, NOW ), ( DEFAULT, 12 )
insert( $dbh, "foo", ['bar', 'baz'], [23, \"NOW"], [\"DEFAULT", 12] );
The down side is this builds a string in memory, possibly very large. To get around that you have to involve database specific bulk insert from file syntax.
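For the MySQL case in the question title, that database-specific route is LOAD DATA INFILE. A rough, untested sketch follows; the DSN, credentials, file path, table, and columns are all assumptions, and LOCAL INFILE has to be enabled on both the client and the server for this to work.

use DBI;

# hypothetical connection; mysql_local_infile=1 enables LOCAL INFILE on the client side
my $dbh = DBI->connect(
    'dbi:mysql:database=testdb;mysql_local_infile=1',
    'user', 'password',
    { RaiseError => 1 },
);

# hypothetical CSV file that you have written yourself, one row per line
my $file = $dbh->quote('/tmp/rows.csv');

$dbh->do(qq{
    LOAD DATA LOCAL INFILE $file
    INTO TABLE foo
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (bar, baz)
});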
Rather than writing all this SQL generation stuff yourself, go with @emazep's answer and use SQL::Abstract and SQL::Abstract::Plugin::InsertMulti.
Just make sure you profile.