I am trying to create a database table in NoSQL so that I can retrieve the item with the maximum value in one of its columns.
Suppose the SQL schema looks like this:
Table_Page
PageId: int(10) - PK
Name: varbinary(255)
RevisionId: int(10) - FK
Table_Revision
RevisionId: int(10) - PK
Text: varbinary(255)
Rev_TimeStamp: binary(14)
How could I design the schema in Amazon DynamoDB Console such that it supports a query to retrieve the page with the latest Revision? Thanks!
I suppose the query you want is: given a page_id, find the revision of that page with the largest timestamp (as opposed to finding the revision with the largest timestamp regardless of page_id).
You can design your table in DynamoDB like this:
Table_Page_Revision
HashKey: PageId
RangeKey: Rev_TimeStamp
Attribute 1: RevisionId
Attribute 2: Text
Then another table just to store the name of a page:
Table_Page_Name
HashKey: PageId
Attribute: Name
To do your query, you can use this pseudo code:
Table_Page_Revision.query(HashKey="Your Page Id", ScanIndexForward=False, Limit=1)
We set the "scan index forward" parameter to false, meaning the query walks the range key from the largest value to the smallest (descending). We also set the limit to 1, which means we are only interested in getting one item back. Combined, this gives you the item with the largest "Rev_TimeStamp".
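In real code with boto3 the query might look like this (a sketch; the table name and key schema are assumed from the design above):

from boto3.dynamodb.conditions import Key
import boto3

table = boto3.resource("dynamodb").Table("Table_Page_Revision")

def latest_revision(page_id):
    # ScanIndexForward=False walks the range key (Rev_TimeStamp) in
    # descending order; Limit=1 keeps only the newest item.
    response = table.query(
        KeyConditionExpression=Key("PageId").eq(page_id),
        ScanIndexForward=False,
        Limit=1,
    )
    items = response["Items"]
    return items[0] if items else None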
I faced the same problem, and I think you are looking for an RDB-style solution that doesn't match the freedom of data modeling DynamoDB gives you.
What worked for me is saving a completely separate record that is the "latest version" of the data; with a suitable sort key, you can get the latest data directly, in a single query, and in "blazingly fast" (:D) fashion.
Simply add a chunk of code after every put command (or update, or upsert) that updates the single "latest version" record.
Your example doesn't show which attributes are the partition key and the sort key, but I imagine a model like this:
HashKey (int, partition key) | version (int, sort key) | data (json) | created (date)
If you switch the version type to a zero-padded string instead:
HashKey (int, partition key) | version (string, sort key) | data (json) | created (date)
you can lay out the sort key with something like:
HashKey | version | data | created
01 | 00000 | {latestData} | ...latestTimeStamp
01 | 00001 | { someData } | ...someTimeStamp
01 | 00002 | { someData } | ...someTimeStamp
02 | 00000 | {latestData} | ...latestTimeStamp
02 | 00001 | { someData } | ...someTimeStamp
When you insert the 01-00003 record, you overwrite the 00000 record too.
This way, whenever you get HashKey + 00000, you always have the latest data for that HashKey.
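A minimal sketch of the pattern with boto3 (the table and attribute names are assumptions based on the layout above):

import boto3

table = boto3.resource("dynamodb").Table("Table_Page_Revision")  # assumed name

def put_revision(hash_key, version, data, created):
    # Write the numbered revision...
    item = {"HashKey": hash_key, "version": "%05d" % version,
            "data": data, "created": created}
    table.put_item(Item=item)
    # ...and overwrite the dedicated "latest" record in the same step.
    table.put_item(Item=dict(item, version="00000"))

def get_latest(hash_key):
    # A single GetItem call: no query, no sorting needed.
    response = table.get_item(Key={"HashKey": hash_key, "version": "00000"})
    return response.get("Item")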
Hope it helps.
I have an insurance companies "dictionary" table in my database, let's say:
+----+-------------------+----------+
| ID | Name | Data |
+----+-------------------+----------+
| 1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
But I'm fetching data from another system, and as a result I get duplicates of insurance companies, but without my data:
+----+-------------------+----------+
| ID | Name | Data |
+----+-------------------+----------+
| 1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
| 2 | InsuranceCompany1 | |
+----+-------------------+----------+
Both records are related to a variety of models, but they refer to the same data. What I want is to pair these records, without changing queries or data in other tables, so that no one knows there are two records; both should refer to the one instance, which is
+----+-------------------+----------+
| 1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
My question is: Is there some proper way to handle situations like this?
I've come up with a solution: add a parent_id column, manually set parent_id on the duplicated rows, and then override Eloquent methods like find in the model to return the parent if parent_id is set.
Copying the SomeData column is not an option because there can be conditions like insurance_company_id == id;
You can try creating a view of your dict table, something like this:
CREATE VIEW unique_dict AS
SELECT MIN(ID) AS ID,
       Name,
       GROUP_CONCAT(Data) AS Data
FROM dict
GROUP BY Name
That will give you one row per name.
Then, in your queries requiring one row per name, SELECT from the unique_dict view rather than the dict table.
GROUP_CONCAT() yields a list of values from Data, which helps if more than one duplicated row contains a value: you get them all.
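For instance, a lookup that used to hit dict can hit the view instead; a sketch with pymysql (connection details are placeholders):

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="mydb")
with conn.cursor() as cursor:
    cursor.execute("SELECT ID, Name, Data FROM unique_dict WHERE Name = %s",
                   ("InsuranceCompany1",))
    row = cursor.fetchone()  # exactly one row per name, duplicates collapsed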
Longer term, you would be smart to treat these duplicates as "dirty data" and clean them up as you INSERT new rows. How to do that?
Create a unique index on Name.
CREATE UNIQUE INDEX unique_name ON dict(Name);
Then, when loading new data into dict, use Eloquent's updateOrCreate() method. Here's something to read about that: Laravel 5.1 Create or Update on Duplicate
I have a MySQL table with around 3 million rows (listings) at the moment. These listings are updated 24/7 (around 30 listings/sec) by a Python script (Scrapy) using pymysql - so the performance of the queries is relevant!
If a listing doesn't exist (i.e. the UNIQUE url), a new record is inserted (which happens for roughly every hundredth listing). The id is set to auto_increment and I am using an INSERT INTO listings ... ON DUPLICATE KEY UPDATE last_seen_at = CURRENT_TIMESTAMP. The update on last_seen_at is necessary to check whether an item is still online, as I am crawling the search results page with multiple listings on it rather than checking each individual URL every time.
+--------------+-------------------+-----+----------------+
| Field | Type | Key | Extra |
+--------------+-------------------+-----+----------------+
| id | int(11) unsigned | PRI | auto_increment |
| url | varchar(255) | UNI | |
| ... | ... | | |
| last_seen_at | timestamp | | |
| ... | ... | | |
+--------------+-------------------+-----+----------------+
The problem:
At first, it all went fine. Then I noticed larger and larger gaps in the auto-incremented id column, and found out it's due to the INSERT INTO ... statement: MySQL attempts the insert first, and that is when the id gets auto-incremented. Once incremented, it stays. Then the duplicate is detected and the update happens instead.
Now my question is: which is the best solution performance-wise, with a long-term perspective?
Option A: Set the id column to unsigned INT or BIGINT and just ignore the gaps. The problem here is that I'm afraid of hitting the maximum after a couple of years of updating. I'm already at an auto_increment value of around 12,000,000 for around 3,000,000 listings after two days of updating...
Option B: Switch to an INSERT IGNORE ... statement, check the affected rows and UPDATE ... if necessary (a sketch of this option follows the list).
Option C: SELECT ... the existing listings, check existence within Python and INSERT ... or UPDATE ... accordingly.
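For what it's worth, Option B might look like this with pymysql (a sketch; the column names besides url and last_seen_at are assumptions):

import pymysql

conn = pymysql.connect(host="localhost", user="scraper", password="...",
                       database="mydb", autocommit=True)

def upsert_listing(url, title):
    with conn.cursor() as cursor:
        # INSERT IGNORE returns 0 affected rows when the url already exists.
        affected = cursor.execute(
            "INSERT IGNORE INTO listings (url, title) VALUES (%s, %s)",
            (url, title))
        if affected == 0:
            # Duplicate: just refresh the last_seen_at timestamp.
            cursor.execute(
                "UPDATE listings SET last_seen_at = CURRENT_TIMESTAMP "
                "WHERE url = %s", (url,))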
Any other wise options?
Additional info: I need an id for information related to a listing that is stored in other tables (e.g. listings_images, listings_prices etc.). IMHO using the (unique) URL as a foreign key is not the best option.
+------------+-------------------+
| Field | Type |
+------------+-------------------+
| listing_id | int(11) unsigned |
| price | int(9) |
| created_at | timestamp |
+------------+-------------------+
I was in the exact same situation as yours.
I had millions of records being entered into a table by a scraper that ran every day.
I tried the following approaches, but they failed:
Load all URLs into a Python tuple or list and, while scraping, only scrape those that are not in the list - FAILED, because loading the URLs into a Python tuple or list consumed too much of the server's RAM.
Check each record before entering - FAILED, because it made the INSERT process far too slow: it first has to query a table with millions of rows and then decide whether to INSERT or not.
THE SOLUTION THAT WORKED FOR ME (for a table with millions of rows):
I removed the id column because it is irrelevant and I do not need it
Made url the PRIMARY KEY, since it is unique
Added a UNIQUE INDEX -- this is a must - it increases your table's performance drastically
Did bulk inserts instead of inserting one by one (see the pipeline code below)
Notice it uses INSERT IGNORE INTO, so only new records are entered; if a record already exists, it is ignored completely
If you use REPLACE INTO instead of INSERT IGNORE INTO in MySQL, new records are still entered, but if a record already exists it is deleted and re-inserted with the new values
class BatchInsertPipeline(object):
    # Assumes each spider exposes a live pymysql cursor as spider.cursor
    # and that the connection is committed (or autocommit) elsewhere.
    def __init__(self):
        self.items = []
        self.query = None

    def process_item(self, item, spider):
        table = item['_table_name']
        del item['_table_name']

        # Build the INSERT IGNORE statement once, from the first item's keys.
        if self.query is None:
            placeholders = ', '.join(['%s'] * len(item))
            columns = '`' + '`, `'.join(item.keys()) + '`'
            self.query = 'INSERT IGNORE INTO %s ( %s ) VALUES ( %s )' \
                % (table, columns, placeholders)

        self.items.append(tuple(item.values()))

        # Flush to MySQL in batches of 500 rows.
        if len(self.items) >= 500:
            self.insert_current_items(spider)
        return item

    def insert_current_items(self, spider):
        if self.items:
            spider.cursor.executemany(self.query, self.items)
            self.items = []

    def close_spider(self, spider):
        # Flush whatever is still buffered when the spider finishes.
        self.insert_current_items(spider)
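To enable the pipeline, register it in the Scrapy project's settings (the module path here is an assumption; adjust it to your project):

ITEM_PIPELINES = {
    'myproject.pipelines.BatchInsertPipeline': 300,
}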
I have a table that basically looks like the following:
Timestamp | Service | Observation
----------+---------+------------
... | vm-1 | 15
... | vm-1 | 20
... | vm-1 | 20
... | vm-1 | 20
... | vm-1 | 20
... | vm-1 | 20
... | bvm-2 | 184
... | bvm-2 | 104
... | bvm-2 | 4
... | bvm-2 | 14
... | bvm-2 | 657
... | bvm-2 | 6
... | bvm-2 | 6
The Service column will not have a lot of different values. I don't know at table creation time what all the possible values are going to be, so I can't use an enum, but the number of distinct values will grow very slowly (fewer than ~10 new distinct values per month), whereas I'll have thousands of new observations per day.
Right now I'm just thinking of using a VARCHAR or MySQL's TEXT type for the Service column, but given the specifics of the situation both seem wasteful.
Are databases usually smart about this sort of thing? Or is there some way I can hint to the database that this behavior is something that it can reliably exploit?
I'm using MySQL 5.7. I'd prefer something standards compliant or portable, but I'm also open to MySQL specific workarounds.
EDIT:
In other words, what I want is for the column to be treated like an enum, but have the database figure out dynamically based on the data that shows up in the table what the different enum values are.
Every time you need an enum you should consider creating another table and referencing it. It's basic normalization. So create a ServiceType table with an id and a name field; the name can be VARCHAR and the id should be INT. The actual table then just stores the id instead of the service name.
You can write a simple stored procedure that does the inserting and the duplicate-name lookup, as well as a view to access the results, so that outside of the DB you barely know how it is handled internally.
Your stored procedure needs to:
Check if the service exists and insert it if not. INSERT IGNORE ... is probably your friend here.
Get the ID of the service with SELECT id INTO @serv_id FROM ServiceType WHERE name = [service_name];
Insert into the table with the service ID instead of the service name (a sketch follows this list).
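A sketch of such a procedure, created here through pymysql (the observations table, its columns, and a UNIQUE index on ServiceType.name are assumptions):

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="mydb", autocommit=True)

CREATE_PROC = """
CREATE PROCEDURE add_observation(IN p_service VARCHAR(255),
                                 IN p_observation INT)
BEGIN
    -- Insert the service name if it is not known yet
    -- (requires a UNIQUE index on ServiceType.name).
    INSERT IGNORE INTO ServiceType (name) VALUES (p_service);
    -- Look up its id.
    SELECT id INTO @serv_id FROM ServiceType WHERE name = p_service;
    -- Store the observation against the id, not the name.
    INSERT INTO observations (service_id, observation)
    VALUES (@serv_id, p_observation);
END
"""

with conn.cursor() as cursor:
    cursor.execute(CREATE_PROC)
    cursor.execute("CALL add_observation(%s, %s)", ("vm-1", 15))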
Don't over-optimize. The storage difference between TINYINT and INT is marginal here, so just use the latter and it won't overflow until you have billions of services.
I think you have to create a new table to store the services, and then this table's primary key (service_id) can be used in place of the service text. The service column in the main table should be of int type to store the service id, so change the service column type to int(4).
Hope it will be helpful.
Update: Question refined, I still need help!
I have the following table structure:
table reports:
ID | time | title | (extra columns)
1 | 1364762762 | xxx | ...
Multiple object tables that have the following structure
ID | objectID | time | title | (extra columns)
1 | 1 | 1222222222 | ... | ...
2 | 2 | 1333333333 | ... | ...
3 | 3 | 1444444444 | ... | ...
4 | 1 | 1555555555 | ... | ...
In the object tables, on an object update a new version with the same objectID is inserted, so that the old versions are still available. For example, see the entries with objectID = 1.
In the reports table, a report is inserted but never updated/edited.
What I want to be able to do is the following:
For each entry in my reports table, I want to be able to query the state of all objects, like they were, when the report was created.
For example lets look at the sample report above with ID 1. At the time it was created (see the time column), the current version of objectID 1 was the entry with ID 1 (entry ID 4 did not exist at that point).
ObjectID 2 also existed, with its current version being the entry with ID 2.
I am not sure how to achieve this.
I could use a query that selects the object versions by the time column:
SELECT *
FROM (
    SELECT *
    FROM objects
    WHERE time < [reportTime]
    ORDER BY time DESC
) AS versions
GROUP BY objectID
Let's not talk about the performance of this query; it is just to make clear what I want to do. My problem is the comparison of the time columns. I think this is not a good way to make sure I get the right object versions: the system time may change "for any reason", the time column would then contain wrong data, and that would lead to wrong results.
What would be another way to do so?
I thought about not using a time column for this, but instead a GLOBAL incremental value, so that I know the insertion order across the database tables.
If you are inserting new versions of the object and your problem is the time column (I assume you use it to sort out which version is newer), I suggest you use an auto-increment ID column for the versions. Even if the time value is not reliable for you, the ID will be, since it is always increasing: the higher the ID, the newer the version.
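One way to apply this to the report use case: record on each report the highest object-version ID that existed when the report was created (a hypothetical reports.last_object_id column), and reconstruct the state with an ID comparison instead of a time comparison. A sketch with pymysql:

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="mydb")

def objects_as_of_report(report_id):
    # last_object_id is the assumed column storing MAX(objects.ID)
    # at the moment the report was created.
    with conn.cursor() as cursor:
        cursor.execute(
            """
            SELECT o.*
            FROM objects o
            JOIN (
                SELECT objectID, MAX(ID) AS max_id
                FROM objects
                WHERE ID <= (SELECT last_object_id
                             FROM reports WHERE ID = %s)
                GROUP BY objectID
            ) latest ON latest.max_id = o.ID
            """,
            (report_id,),
        )
        return cursor.fetchall()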
I have a large database with two tables: stat and total.
The example of the relation is the following:
STAT:
| ID | total event |
+--------+--------------+
| 7 | 2 |
| 8 | 1 |
TOTAL:
|ID | Event |
+---+--------------+
| 7 | "hello" |
| 7 | "everybody" |
| 8 | "hi" |
This is a very simplified version; also consider that the STAT table could have 500K records and, for each STAT row, about 200 TOTAL rows.
Currently, if I run a simple SELECT query on the TOTAL table, the system is terribly slow.
Could anyone give me some advice on the design of the TOTAL table? Is it possible to tell MySQL that the id column is already sorted, so that there is no reason to scan all the rows until it finds, for example, id = 7?
Add INDEX(ID) to both of your tables, if you have not already.
SELECT COUNT(*) FROM TOTAL WHERE ID = 7 will then be fast, because ID is indexed.
You can add an index, and furthermore you can partition your table.
As per @ypercube's comment, tables are not stored in a sorted state, so one cannot "tell" this to the database. However, you can add an index on the tables to make them faster to search.
One important thing to check: it looks like TOTAL.ID is intended as a foreign key. If so, the table TOTAL should have its own primary key called ID; rename the existing column to STAT_ID instead, so it is obvious what it is a foreign key for. Then add an index on STAT_ID.
Lastly, as a point of style, I recommend that you make your table and column names case-insensitive and write them in lower case. It makes SQL easier to read when keywords are in upper case and database objects are in lower case.
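Put together, the suggested changes might look like this (a sketch via pymysql; the surrounding schema is assumed from the question):

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="mydb", autocommit=True)

with conn.cursor() as cursor:
    # Rename the foreign-key column and give total its own primary key.
    cursor.execute("ALTER TABLE total CHANGE id stat_id INT NOT NULL")
    cursor.execute("ALTER TABLE total "
                   "ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY FIRST")
    # Index the foreign key so lookups by stat_id no longer scan every row.
    cursor.execute("CREATE INDEX idx_total_stat_id ON total (stat_id)")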