DolphinDB: Data deduplication - duplicates

How can I deduplicate data in DolphinDB? Is there an example I can refer to? At the moment the database seems to keep all the data. I can see that keyedTable or indexedTable will discard duplicate data, but is there any other method?

Deduplicate when writing data (in version 2.0 or later):
To check your installed version, run the version() function. The latest release can be downloaded from www.dolphindb.com.
The TSDB storage engine is required for write-time deduplication. Set the engine parameter of the database function to "TSDB":
login(`admin,`123456)
if(existsDatabase("dfs://tsdb"))
{
dropDatabase("dfs://tsdb")
}
//create a distributed database and specify the parameter engine to be TSDB
db_tsdb = database(directory = "dfs://tsdb", partitionType = VALUE, partitionScheme = 2012.03.01..2022.12.31, engine = 'TSDB')
Then create 3 DFS tables with different deduplication rules. Specify the sort columns with the sortColumns parameter and the deduplication rule with the keepDuplicates parameter of createPartitionedTable. Deduplication is applied only when records with identical sort column values are found. keepDuplicates takes one of 3 values:
LAST: keep only the latest record.
ALL: keep all records.
FIRST: keep only the first record.
In the following example, we create 3 DFS tables, setting sortColumns to ["ID", "Date"] and keepDuplicates to LAST, ALL and FIRST respectively:
t = table(1:0,`Date`ID`Close,[DATETIME,SYMBOL,DOUBLE])
tsdb_table = createPartitionedTable(dbHandle = db_tsdb, table=t, tableName="last", sortColumns=["ID","Date"], partitionColumns=`Date, keepDuplicates=LAST)
last = db_tsdb.loadTable(`last)
tsdb_table = createPartitionedTable(dbHandle = db_tsdb, table=t, tableName="all", sortColumns=["ID","Date"], partitionColumns=`Date, keepDuplicates=ALL)
all = db_tsdb.loadTable(`all)
tsdb_table = createPartitionedTable(dbHandle = db_tsdb, table=t, tableName="first", sortColumns=["ID","Date"], partitionColumns=`Date, keepDuplicates=FIRST)
first = db_tsdb.loadTable(`first)
Write data whose sort column values contain no duplicates and query the tables:
Date = 2012.03.01 09:30:00..(2012.03.01 09:30:00 + 4)
ID = `000001`000001`000002`000002`000002
Close =1.0 2.0 3.0 4.0 5.0
first_table = table(Date,ID,Close)
last.append!(first_table)
all.append!(first_table)
first.append!(first_table)
select * from last
select * from all
select * from first
The above 3 queries return identical results, since no sort column values are duplicated yet.
Then write a second batch with the same sort column values (same ID and Date) but different Close values:
Date = 2012.03.01 09:30:00..(2012.03.01 09:30:00 + 4)
ID = `000001`000001`000002`000002`000002
Close =10.0 20.0 30.0 40.0 50.0
second_table = table(Date,ID,Close)
last.append!(second_table)
all.append!(second_table)
first.append!(second_table)
select * from last
select * from all
select * from first
This time the query results differ, because the 3 tables apply different deduplication rules.
The table created with keepDuplicates=LAST keeps only the latest record for each duplicated sort key: every (ID, Date) pair now holds the Close value from the second write (10.0–50.0).
The table created with keepDuplicates=FIRST keeps only the first record: it still holds the Close values from the first write (1.0–5.0).
The table created with keepDuplicates=ALL keeps every record written: it contains both batches (10 rows in total).
My Script:
login(`admin,`123456)
if(existsDatabase("dfs://tsdb"))
{
dropDatabase("dfs://tsdb")
}
db_tsdb = database(directory = "dfs://tsdb", partitionType = VALUE, partitionScheme = 2012.03.01..2022.12.31, engine = 'TSDB')
t = table(1:0,`Date`ID`Close,[DATETIME,SYMBOL,DOUBLE])
tsdb_table = createPartitionedTable(dbHandle = db_tsdb, table=t, tableName="last", sortColumns=["ID","Date"], partitionColumns=`Date, keepDuplicates=LAST)
last = db_tsdb.loadTable(`last)
tsdb_table = createPartitionedTable(dbHandle = db_tsdb, table=t, tableName="all", sortColumns=["ID","Date"], partitionColumns=`Date, keepDuplicates=ALL)
all = db_tsdb.loadTable(`all)
tsdb_table = createPartitionedTable(dbHandle = db_tsdb, table=t, tableName="first", sortColumns=["ID","Date"], partitionColumns=`Date, keepDuplicates=FIRST)
first = db_tsdb.loadTable(`first)
Date = 2012.03.01 09:30:00..(2012.03.01 09:30:00 + 4)
ID = `000001`000001`000002`000002`000002
Close =1.0 2.0 3.0 4.0 5.0
first_table = table(Date,ID,Close)
last.append!(first_table)
all.append!(first_table)
first.append!(first_table)
select * from last
select * from all
select * from first
Date = 2012.03.01 09:30:00..(2012.03.01 09:30:00 + 4)
ID = `000001`000001`000002`000002`000002
Close =10.0 20.0 30.0 40.0 50.0
second_table = table(Date,ID,Close)
last.append!(second_table)
all.append!(second_table)
first.append!(second_table)
select * from last
select * from all
select * from first

Related

SQL to equivalent pandas - Merge on columns where column is null

I opened up this new question because I'm not sure the user's request and wording matched each other: pandas left join where right is null on multiple columns
What is the equivalent pandas code to this SQL? Contextually we're finding entries from a column in table_y that aren't in table_x with respect to several columns.
SELECT
table_x.column,
table_x.column2,
table_x.column3,
table_y.column,
table_y.column2,
table_y.column3
FROM table_x
LEFT JOIN table_y
ON table_x.column = table_y.column
AND table_x.column2 = table_y.column2
WHERE
table_y.column2 IS NULL
Is this it?
import pandas

columns_join = ['column', 'column2']
data_y = data_y.set_index(columns_join)
data_x = data_x.set_index(columns_join)
data_diff = pandas.concat([data_x, data_y]).drop_duplicates(keep=False)  # any row not in both
# Select the diff representative from each dataset - in case datasets are too large
x1 = data_x[data_x.index.isin(data_diff.index)]
x2 = data_y[data_y.index.isin(data_diff.index)]
# Perform an outer join with the joined indices from each set,
# then keep only the entries contributed by table_x (left_only)
data_compare = x1.merge(x2, how='outer', indicator=True, left_index=True, right_index=True)
data_compare_final = (
    data_compare
    .query('_merge == "left_only"')
    .drop('_merge', axis=1)
)
I don't think that's equivalent because we only removed entries from table_x that aren't in the join based on multiple columns. I think we have to continue and compare the column against table_y.
data_compare = data_compare.reset_index().set_index('column2')
data_y = data_y.reset_index().set_index('column2')
mask_column2 = data_y.index.isin(data_compare.index)
result = data_y[~mask_column2]
Without test data it is a bit difficult to be sure this helps, but you can try:
# Only if columns to join on in the right dataframe have the same name as columns in left
table_y[['col_join_1', 'col_join_2']] = table_y[['column', 'column2']] # Else this is not needed
# Merge left (LEFT JOIN)
table_merged = table_x.merge(
table_y,
how='left',
left_on=['column', 'column2'],
right_on=['col_join_1', 'col_join_2'],
suffixes=['_x', '_y']
)
# Filter dataframe
table_merged = table_merged.loc[
table_merged.column2_y.isna(),
['column_x', 'column2_x', 'column3_x', 'column_y', 'column2_y', 'column3_y']
]
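For illustration, here is a self-contained sketch of the same left anti-join pattern on made-up data (the frame names and values below are hypothetical, not taken from the question, and it uses on= directly since the key columns share the same names in both frames):

import pandas as pd

# Hypothetical sample data, invented for illustration only
table_x = pd.DataFrame({'column': [1, 2, 3],
                        'column2': ['a', 'b', 'c'],
                        'column3': [10, 20, 30]})
table_y = pd.DataFrame({'column': [1, 9],
                        'column2': ['a', 'z'],
                        'column3': [100, 900]})

# LEFT JOIN on both key columns; indicator=True adds a _merge column
merged = table_x.merge(table_y, how='left', on=['column', 'column2'],
                       suffixes=['_x', '_y'], indicator=True)

# Keep rows with no match on the right (the WHERE table_y.column2 IS NULL part)
anti_join = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(anti_join)  # rows (2, 'b') and (3, 'c'), which exist only in table_x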
I found an equivalent that amounts to setting the index to the join column(s), concatenating the tables, dropping the duplicates, and performing an outer join between each table's contribution to the diff. From there, one can select left_only for this equivalent SQL:
SELECT
table_x.*,
table_y.*
FROM table_x
LEFT JOIN table_y
ON table_x.column = table_y.column
AND table_x.column2 = table_y.column2
WHERE
table_y.column2 IS NULL
and right_only for this equivalent SQL:
SELECT
table_x.*,
table_y.*
FROM table_y
LEFT JOIN table_x
ON table_y.column = table_x.column
AND table_y.column2 = table_x.column2
WHERE
table_x.column2 IS NULL
def create_dataframe_joined_diffs(dataframe_prod, dataframe_new, columns_join):
"""
Set the indices to the columns_key
Concat the dataframes and remove duplicates
Select the diff representative from each dataset
Perform an outer join on the shared indices
Pseudo-SQL:
SELECT
UNIQUE(*)
FROM dataframe_prod
OUTER JOIN dataframe_new
ON columns_join
"""
data_new = dataframe_new.set_index(columns_join)
data_prod = dataframe_prod.set_index(columns_join)
# Get any row not in both (may be removing too many)
data_diff = pandas.concat([data_prod, data_new]).drop_duplicates(keep=False) # any row not in both
# Select the diff representative from each dataset
x1 = data_prod[data_prod.index.isin(data_diff.index)]
x2 = data_new[data_new.index.isin(data_diff.index)]
# Perform an outer join and keep the joined indices from each set
# Sort the columns to make them easier to compare
data_compare = x1.merge(x2, how = 'outer', indicator=True, left_index=True, right_index=True).sort_index(axis=1)
return data_compare
# Example usage (assuming table_x and table_y are the DataFrames from the question):
dataframe_compare = create_dataframe_joined_diffs(table_x, table_y, ['column', 'column2'])
mask_left = dataframe_compare['_merge'] == 'left_only'
mask_right = dataframe_compare['_merge'] == 'right_only'
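Filtering on those masks then reproduces the two anti-joins described above (a small follow-up sketch, reusing the names from the snippet):

# Rows present only in dataframe_prod (the first SQL, WHERE table_y.column2 IS NULL)
prod_only = dataframe_compare[mask_left]
# Rows present only in dataframe_new (the second SQL, WHERE table_x.column2 IS NULL)
new_only = dataframe_compare[mask_right]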

MySQL query returns wrong data with WHERE clause

I am using the following query to get data from a MySQL database and I get wrong data. I want to get all rows with a cart_Status of 2 or 3 that also have view_Status = 1:
SELECT * FROM `cart` WHERE `view_Status` = 1 AND cart_Status = 2 OR `cart_Status` = 3
This is what my data structure and table look like (table screenshot not included):
But the result includes rows regardless of view_Status = 1, which is not my target. It returns the following (result screenshot not included). Of course, it should not return anything! But it does!
This is about operator precedence.
Your query is evaluated as:
SELECT * FROM `cart` WHERE (`view_Status` = 1 AND cart_Status = 2) OR `cart_Status` = 3
You should add parentheses:
SELECT * FROM `cart` WHERE `view_Status` = 1 AND (cart_Status = 2 OR `cart_Status` = 3)
or better
SELECT * FROM `cart` WHERE `view_Status` = 1 AND cart_Status in (2, 3);
You appear to be learning SQL. Use parentheses in the WHERE clause, particularly when you mix AND and OR.
However, in your case, IN is a better solution:
SELECT c.*
FROM `cart` c
WHERE c.view_Status = 1 AND cart_Status IN (2, 3);
It's a problem of operator precedence. Typically AND is evaluated before OR in programming languages (think of AND as multiplication of bits and OR as addition of bits, and the precedence becomes familiar). So your condition:
`view_Status` = 1 AND cart_Status = 2 OR `cart_Status` = 3
is parsed like this:
(`view_Status` = 1 AND cart_Status = 2) OR `cart_Status` = 3
which results in all rows with cart_Status = 3 being selected regardless of view_Status. You have to add parentheses around the second clause:
`view_Status` = 1 AND (cart_Status = 2 OR `cart_Status` = 3)
or, even shorter:
`view_Status` = 1 AND cart_Status IN (2, 3)
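As a side note, the same precedence behaviour can be reproduced in Python, where and also binds more tightly than or (the row values below are made up purely for illustration):

# A row that should NOT match: view_Status is not 1, but cart_Status is 3
view_Status, cart_Status = 0, 3

# How the original WHERE clause is parsed: (A AND B) OR C
original = (view_Status == 1 and cart_Status == 2) or cart_Status == 3
# The intended condition: A AND (B OR C)
intended = view_Status == 1 and (cart_Status == 2 or cart_Status == 3)

print(original, intended)  # True False -> the unparenthesised version wrongly matches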

Joining tables takes a very long time, Symfony 2.7

I have the following query which takes more than 20 secs (20138ms) to return the results.
$locale = 'en'; // test
$query = $this->getEntityManager()->createQuery('
SELECT
product.id, product.productnr, ProductGrp.productgrp' . $locale . ', Criteria.criteria'.$locale.'
FROM
Productbundle:product product
JOIN
Productbundle:Criteria Criteria WITH Criteria.criteriaid = product.criteriaid
JOIN
Productbundle:ProductGrp ProductGrp WITH ProductGrp.partgrpid = product.partgrpid
WHERE
product.productnr =:productnr
')
->setMaxResults(1)
->setParameter('productnr', $productnr)
->getResult();
When I ran the query from "runnable query" it took about 20 secs (20.7809 s) in phpMyAdmin.
Runnable query:
SELECT o0_.id AS id0, o0_.productnr AS productnr1, o1_.productgrpen AS productgrpen2, o2_.criteriaen AS criteriaen3
FROM product o0_
INNER JOIN Criteria o2_ ON (o2_.criteriaid = o0_.criteriaid)
INNER JOIN ProductGrp o1_ ON (o1_.partgrpid = o0_.partgrpid)
WHERE o0_.productnr = 'ABC1234'
LIMIT 1;
However, when I run the following query in phpMyAdmin, it takes less than 2 seconds to return the results:
SELECT product.id, product.productnr,ProductGrp.productgrpen ,Criteria.criteriaen
FROM `product`
INNER JOIN ProductGrp ON ProductGrp.partgrpid = product.partgrpid
INNER JOIN Criteria ON Criteria.criteriaid = product.criteriaid
Where productnr = 'ABC1234'
LIMIT 1
Table sizes:
Product:    over 5 million rows
ProductGrp: over 200 rows
Criteria:   over 600 rows
Symfony version : 2.7
Indexes: although not listed, I would suggest the following:
Table        Indexed on
Product      (productnr, id, criteriaid, partgrpid)
Criteria     (criteriaid)  -- would expect this as the primary key
ProductGrp   (partgrpid)   -- likewise
Also, how many "locale" string-version columns do you have / support?

Update MySQL table and, if value is 10, update another table

I have the following SQL to update the results table:
$mysqli->query("UPDATE results
SET result_value = IF('$logo_value' - result_tries < 0 OR '$logo_value' - result_tries = 0, 1, '$logo_value' - result_tries)
WHERE logo_id = '$logo_id'
AND user_id = '$user_id'
AND result_value = 0");
In the same SQL statement, is it possible to update another table based on result_value?
if result_value = 10
Update users SET user_hints = user_hints +1 WHERE user_id = '$user_id'
How would I incorporate this into the SQL syntax above?
The long way I can think of is to select the value into a PHP variable and then do another update based on that variable's value... but this seems long and tedious.
This is a long shot (not tested) but how about:
$mysqli->query("UPDATE results, users
SET result_value =
IF('$logo_value' - results.result_tries < 0 OR
'$logo_value' - results.result_tries = 0,
1, '$logo_value' - result_tries),
users.user_hints =
IF(results.result_value >= 10,
users.user_hints + 1, users.user_hints)
WHERE results.logo_id = '$logo_id'
AND results.user_id = '$user_id'
AND results.user_id = users.user_id
AND results.result_value = 0");
If both tables have some of the same column names, of course, you'll have to specify which table you mean (e.g. results.user_id or users.user_id).

Getting the closest time in a PDO statement

I am working with 2 databases and I need to find the records whose times match most closely. Both fields are datetime.
So in essence:
table1.time = 2012-06-07 15:30:00
table2.time = 2012-06-07 15:30:01
table2.time = 2012-06-07 15:30:02
table2.time = 2012-06-07 15:30:03
NOTE: The table I am querying (table2) is an MSSQL table, and table1.time is a datetime value. I need to find the row in table2 that most closely matches table1.time, but I have no guarantee of an exact match, so I need the closest one. I only need to return 1 result.
I tried the SQL below, based on an example from a previous Stack Overflow question, but it failed to work.
table1 lives in a MySQL database whereas table2 is in MSSQL, and the query runs against table2 (MSSQL).
try {
$sql = "
SELECT
PCO_AGENT.NAME,
PCO_INBOUNDLOG.LOGIN AS LOGINID,
PCO_INBOUNDLOG.PHONE AS CALLERID,
PCO_INBOUNDLOG.STATION AS EXTEN,
PCO_INBOUNDLOG.TALKTIME AS CALLLENGTH,
PCO_INBOUNDLOG.CHANNELRECORDID AS RECORDINGID,
PCO_SOFTPHONECALLLOG.RDATE,
PCO_INBOUNDLOG.RDATE AS INBOUNDDATE
FROM
PCO_INBOUNDLOG
INNER JOIN
PCO_LOGINAGENT ON PCO_INBOUNDLOG.LOGIN = PCO_LOGINAGENT.LOGIN
INNER JOIN
PCO_SOFTPHONECALLLOG ON PCO_INBOUNDLOG.ID = PCO_SOFTPHONECALLLOG.CONTACTID
INNER JOIN
PCO_AGENT ON PCO_LOGINAGENT.AGENTID = PCO_AGENT.ID
WHERE
PCO_INBOUNDLOG.STATION = :extension
AND ABS(DATEDIFF(:start,PCO_SOFTPHONECALLLOG.RDATE))
";
$arr = array(":extension" => $array['extension'], ":start" => $array['start']);
$query = $this->mssql->prepare($sql);
$query->execute($arr);
$row = $query->fetchAll(PDO::FETCH_ASSOC);
$this->pre($row);
}
I am getting the following error at the moment:
SQLSTATE[HY000]: General error: 174 General SQL Server error: Check messages from the SQL Server [174] (severity 15) [(null)]
I found a shorter version (it returns the closest row at or before '$var'):
SELECT * FROM `table` WHERE `date` < '$var' ORDER BY `date` DESC LIMIT 1;
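As an aside, if the closest-time matching could be done client-side instead of in SQL, pandas' merge_asof with direction='nearest' implements exactly this kind of nearest-timestamp join. A minimal sketch with made-up data (not the tables from the question):

import pandas as pd

# Hypothetical stand-ins for table1 and table2; values invented for illustration
table1 = pd.DataFrame({'time': pd.to_datetime(['2012-06-07 15:30:00'])})
table2 = pd.DataFrame({'time': pd.to_datetime(['2012-06-07 15:30:01',
                                               '2012-06-07 15:30:02',
                                               '2012-06-07 15:30:03']),
                       'value': [1, 2, 3]})

# Both frames must be sorted on the join key; direction='nearest' picks the
# closest table2 timestamp for each table1 row
closest = pd.merge_asof(table1.sort_values('time'),
                        table2.sort_values('time'),
                        on='time', direction='nearest')
print(closest)  # 15:30:00 is matched to the 15:30:01 row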