I have googled this a lot, and I have not found anything matching my problem.
I have a lot of time series containing readings from different sensors. Each time series is stored in its own .csv file, so each file contains a single column.
I have to populate this MySQL table:
CREATE TABLE scheme.sensor_readings (
id int unsigned not null auto_increment,
sensor_id int unsigned not null,
date_created datetime,
reading_value double,
PRIMARY KEY(id),
FOREIGN KEY (sensor_id) REFERENCES scheme.sensors (id) ON DELETE CASCADE
) ENGINE = InnoDB;
while the sensors table is:
CREATE TABLE scheme.sensors (
id int unsigned not null auto_increment,
sensor_title varchar(255) not null,
description varchar(255) not null,
date_created datetime,
PRIMARY KEY(id)
) ENGINE = InnoDB;
Now, I need to fill the reading_value field with the values contained in the .csv files described above. An example of this kind of file:
START INFO
Recording Time *timestamp*
Oil Pressure dt: 1,000000 sec
STOP INFO
0,445328
0,429459
0,4245
0,445099
0,432434
0,433426
...
EOF
What I need is to design an SQL query in which I populate this table while reading values from a .csv file.
I cannot figure out how to proceed: should I use some sort of temporary table as a buffer?
I use HeidiSQL as Client.
The kind of tool you are looking for is called an ETL (Extract, Transform, Load).
You can extract data from csv files (among other sources), transform them by adding the info from the sensors db-table (among other things), and load the result into the sensor_readings db-table.
There are plenty of ETL tools on the market. Although I should be agnostic, if you want one that is free, easy to learn, and likely to cover your future needs, you could start by evaluating PDI (Pentaho Data Integration, nicknamed Kettle). Download the latest Data Integrator, unzip it, and run spoon.bat / spoon.sh. There is a good getting-started guide, and questions under the Stack Overflow tag Pentaho Data Integration usually get answered quite quickly.
Alternatively you may try Talend or plenty others.
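If a full ETL tool feels like too much for a first attempt, the transform step can also be sketched in a few lines of Python. This is only a rough sketch under assumptions: the function name parse_sensor_file is made up, the sensor_id and start_time have to come from somewhere else (the sample file only shows a *timestamp* placeholder), and I assume one reading per dt interval of 1 second as in the example header. It skips the START INFO / STOP INFO block, converts the decimal commas, and produces (sensor_id, date_created, reading_value) rows.

```python
from datetime import datetime, timedelta


def parse_sensor_file(lines, sensor_id, start_time, dt_seconds=1):
    """Turn one sensor export into (sensor_id, date_created, reading_value) rows."""
    rows = []
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "START INFO":
            in_header = True
            continue
        if line == "STOP INFO":
            in_header = False
            continue
        # Skip header lines, blank lines and the trailing markers.
        if in_header or not line or line in ("EOF", "..."):
            continue
        # The files use a decimal comma, e.g. "0,445328".
        value = float(line.replace(",", "."))
        # Assumption: readings are evenly spaced, dt_seconds apart.
        timestamp = start_time + timedelta(seconds=dt_seconds * len(rows))
        rows.append((sensor_id, timestamp, value))
    return rows


sample = """START INFO
Recording Time *timestamp*
Oil Pressure dt: 1,000000 sec
STOP INFO
0,445328
0,429459
0,4245
EOF""".splitlines()

# sensor_id=1 and the start time are placeholder values for illustration.
rows = parse_sensor_file(sample, sensor_id=1, start_time=datetime(2020, 1, 1))
for row in rows:
    print(row)
```

The resulting rows have the same column order as sensor_id, date_created, reading_value in the sensor_readings table, so they could be handed to a parameterized INSERT through whichever MySQL driver you use.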
I need to store geo path data consisting of geo-points, indexed by a unique key. For example: the path traveled by a vehicle, indexed by its trip ID. This path can be of indefinite length.
As of now, I am thinking of storing the path as a JSON object. The options that I have in mind are Riak and MongoDB. I want to go with open-source technology. It would be nice if it supports clustering, so that if one node goes down we won't have any downtime in our application.
MySQL is currently our source of raw data (which we will eventually move to the NoSQL DB, but not yet). With the huge amount of data (2 million geo-point entries per day), it takes MySQL a lot of time to filter the data based on timestamp. MySQL will still be our primary data source. The solution I am looking for will act as a cache for faster path retrieval based on ID.
In current MySQL schema, the fields I have are:
system_timestamp,
gps_timestamp,
speed,
lat,
lng
This table stores all the geo-points of the vehicle, whether the vehicle is on a trip or not. Here a trip is defined by whether the driver wants to track the movement or not. If he wants to track the movement, we generate a unique trip ID and associate it with the driver along with the trip's start time and end time. Later, to display the path for a trip ID, we use the start and end time of the trip to filter the data from the raw table.
I want to store the trip path in a secondary database as a cache so that its retrieval will be fast.
Which database should be my ideal choice? What other options do I have?
I'm going to go out on a limb here and say that I believe there is a less complicated way of fixing your performance issue.
I assume you are using MySQL with InnoDB and you are indexing the timestamp field(s).
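If I were you, I would simply turn the relevant timestamp (system or gps) into the primary key. With InnoDB, the table data is physically organized to do ultra-fast lookups based on the primary key column(s). Also, make sure that the relevant timestamp column is of the unsigned non-null type. Note that a primary key must be unique, so this only works if no two rows share a timestamp; otherwise add a second column to the key to make it unique.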
Now, instead of doing a lookup for the paths in between start and end time (as you're currently doing), I would create a separate table within the same MySQL database containing pairs of trip ID/path timestamp, where "path timestamp" is the primary key from the paths table, as mentioned earlier. Make the trip ID the leading column of this table's primary key. Populate this table using the same logic/mechanism you initially imagined for Riak or MongoDB. This will basically be your "caching" system, using nothing but MySQL.
A typical lookup would take the trip ID to find all of the path timestamps associated and thus all of the path data.
-- Raw geo-points, clustered by timestamp
CREATE TABLE IF NOT EXISTS `paths` (
`system_timestamp` int(10) unsigned NOT NULL,
`gps_timestamp` int(10) unsigned NOT NULL,
`speed` smallint(5) unsigned NOT NULL,
`lat` decimal(10,6) NOT NULL,
`lng` decimal(10,6) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `paths` ADD PRIMARY KEY (`system_timestamp`);
-- One row per (trip, geo-point) pair; this is the "cache"
CREATE TABLE IF NOT EXISTS `trips` (
`trip_id` int(10) unsigned NOT NULL,
`system_timestamp` int(10) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Composite key: a trip has many timestamps, so trip_id alone cannot be unique
ALTER TABLE `trips` ADD PRIMARY KEY (`trip_id`, `system_timestamp`);
SELECT * FROM `trips`
INNER JOIN `paths` ON
`trips`.`system_timestamp` = `paths`.`system_timestamp`
WHERE `trip_id` = 1;
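For what it's worth, here is a minimal sketch of how the lookup table could be populated and queried. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a MySQL server; the helper names cache_trip and trip_path and the sample points are made up for illustration, and I assume a trip needs many rows in trips, hence the composite key on (trip_id, system_timestamp).

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the SQL is kept generic.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paths (
    system_timestamp INTEGER PRIMARY KEY,
    speed INTEGER NOT NULL,
    lat REAL NOT NULL,
    lng REAL NOT NULL
);
CREATE TABLE trips (
    trip_id INTEGER NOT NULL,
    system_timestamp INTEGER NOT NULL,
    PRIMARY KEY (trip_id, system_timestamp)
);
""")

# Hypothetical raw geo-points, one per second.
points = [(1000 + i, 40 + i, 12.0 + i / 1000, 77.0 + i / 1000) for i in range(10)]
conn.executemany("INSERT INTO paths VALUES (?, ?, ?, ?)", points)


def cache_trip(conn, trip_id, start_ts, end_ts):
    """Run the expensive time-range filter once, when the trip ends."""
    conn.execute(
        "INSERT INTO trips (trip_id, system_timestamp) "
        "SELECT ?, system_timestamp FROM paths "
        "WHERE system_timestamp BETWEEN ? AND ?",
        (trip_id, start_ts, end_ts),
    )


def trip_path(conn, trip_id):
    """Later lookups go through the trip ID only."""
    return conn.execute(
        "SELECT p.system_timestamp, p.lat, p.lng FROM trips t "
        "JOIN paths p ON p.system_timestamp = t.system_timestamp "
        "WHERE t.trip_id = ? ORDER BY p.system_timestamp",
        (trip_id,),
    ).fetchall()


cache_trip(conn, trip_id=1, start_ts=1002, end_ts=1005)
path = trip_path(conn, 1)
print(path)
```

The idea is that the range scan over the raw table happens once per trip, and every later display of the path is a primary-key lookup in both tables.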
I am trying to move from MySQL to Cassandra for a music service application I am building.
I have read the following Stack Exchange question: MySQL Data Model to Cassandra Help?
I have also checked out https://wiki.apache.org/cassandra/DataModel and the DataStax Cassandra modeling example for a music service, but the documentation I have found so far is too brief and narrow for me to let go of MySQL-style queries, so I need some help.
This is my album table that works so far in mysql
CREATE TABLE `albums` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(150) NOT NULL,
`description` varchar(300) NOT NULL,
`release_date` int(10) unsigned NOT NULL,
`status` enum('active','inactive','pending') NOT NULL,
`licensor_id` int(11) NOT NULL,
`score` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`),
KEY `licensor_id` (`licensor_id`),
KEY `batch_id` (`batch_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1720100 ;
I also have one-to-many relationships with the following tables: artist (many artists to one album), genre (many genres to one album), and songs (one album contains many songs).
I have many pivot tables to couple these together.
Because Cassandra doesn't allow joins, I figured that using set, list, and map columns would help me get to the proper dataset.
At first, my thought was to solve the mapping by just reusing the same table:
CREATE TABLE `albums` (
`id` int(10) ,
`title` varchar(150) ,
`description` varchar(300) ,
`release_date` date ,
`status` enum('active','inactive','pending') ,
`licensor_id` int(11) ,
`data_source_provider_id` int(10) ,
`score` int(10)
`genre` <set>
`artist` <set>
PRIMARY KEY (`id`),
) ;
(Apologies if the above is not the correct syntax for Cassandra; I've only begun installing it on a dev system.)
My queries are of the following:
Give me all albums sorted by Score (Descending)
Give me all albums from a particular genre, sorted by score
Give me all albums from a particular artist, sorted by score
Give me all albums sorted by release date, then by score.
In SQL all 4 are easy with joins. Since Cassandra doesn't allow joins, I figured my modelling was adequate, but #4 cannot be satisfied (there is no multi-column ORDER BY as far as I can tell).
Multiple indexes are slow, which matters on a large dataset (there are 1.8M records for now, but I'm planning on loading at least triple that amount, which is why Cassandra would be useful).
My questions are:
1) Is my path from MySQL to Cassandra correct despite being stuck on my 4 queries, or did I do it wrong? (I've done some Active Record work before with MongoDB, where you can have a sub-entity within the document, but Cassandra only has set, list, and map.)
2) If I want to expand my modelling to: " I want to create a list X that contains a predefined number elements from the albums table". Would tagging each Albums element with a new field "tag" that has X be the smart way to filter things OR would it be best to create a new table, that contains all the elements that I need and just query that.
The general advice for Cassandra is to write your tables based on your queries. Don't be shy about writing the same data to multiple tables if some of those queries are not compatible with each other. (Twitter, for example would write each tweet to a table of all the followers of that user.)
That said, looking at your queries, your challenge will be that Cassandra does not inherently have a way of handling some of your sorting needs. You will need to add an analytics engine like Spark or Hadoop's M/R to sort on a non-unique (constantly changing?) field like score.
Let's look at some table definitions that will be a good start. Then you can determine if you need a full blown distributed analytics engine or whether locally sorting the results of the query will be enough.
CREATE TABLE albums(
id uuid,
title text,
description text,
releasedate timestamp,
status text,
license_id varint,
data_source_provider_id varint,
-- int rather than counter: counter columns cannot share a table with regular columns
score int,
genre set<text>,
artist set<text>,
PRIMARY KEY (id)
);
This table will store all your albums by id. Based on your use case, selecting all the albums and sorting them by score would definitely be out of the question. You could, potentially, do something clever like modulo-ing the score and putting the albums in buckets, but I'm not convinced that would scale. Any of your queries could be answered using this table plus analytics, but in the interest of completeness, let's look at some other options for putting your data in Cassandra. Each of the following tables could readily reduce the load from any analytics investigations you run that have additional parameters (like a range of dates or set of genres).
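-- Second table, named differently so it can coexist with albums
CREATE TABLE albums_by_artist(
id uuid,
title text,
description text,
releasedate timestamp,
status text,
license_id varint,
data_source_provider_id varint,
-- int rather than counter: counter columns cannot share a table with regular columns
score int,
genre set<text>,
artist text,
PRIMARY KEY (artist, releasedate, title)
);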
Cassandra can automatically sort immutable fields. The table above will store each artist's albums in a separate partition (each partition is colocated in your cluster and replicated based on your replication factor). If an album has multiple artists, this record would be duplicated under each artist's entry, and that's OKAY. The second and third keys (releasedate and title) are the sorting keys, which Cassandra calls clustering columns. Cassandra will sort the albums first by releasedate and second by title (for the other priority, reverse their order above). Each combination of artist, releasedate, and title is logically one row (although on disk they will be stored as one wide row per artist). For one artist, you can probably sort the entries by score locally, without direct intervention from the database.
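To make "sort locally" concrete, here is a small sketch in plain Python. The rows are made-up sample data shaped like what a query against one artist's partition might return; fetching them from Cassandra is left out, and the helper names by_score and by_release_then_score are hypothetical.

```python
from operator import itemgetter

# Hypothetical rows for one artist, already ordered by releasedate, then title.
albums = [
    {"title": "First Light", "releasedate": "2010-03-01", "score": 72},
    {"title": "Harbour", "releasedate": "2012-06-15", "score": 91},
    {"title": "Inland", "releasedate": "2012-06-15", "score": 85},
    {"title": "Late Trains", "releasedate": "2015-01-20", "score": 60},
]


def by_score(rows, limit=None):
    """Query 3: one artist's albums, highest score first."""
    ranked = sorted(rows, key=itemgetter("score"), reverse=True)
    return ranked if limit is None else ranked[:limit]


def by_release_then_score(rows):
    """Query 4 for a small result set: newest first, ties broken by score."""
    return sorted(rows, key=itemgetter("releasedate", "score"), reverse=True)


print([a["title"] for a in by_score(albums)])
print([a["title"] for a in by_release_then_score(albums)])
```

This is fine while one partition's worth of rows fits comfortably in application memory; beyond that you are back to the analytics-engine discussion above.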
Sorting by release date can easily be accomplished by a similar table, but changing the PRIMARY KEY to: PRIMARY KEY (releasedate, ..?). In this case, however, you probably will face a challenge in sorting (locally) if you have a substantial range of release dates.
Finally, don't try something similar for genre. A genre covers too many albums to be contained in a single partition. Hypothetically, if you had a secondary way of splitting that set up, you could do PRIMARY KEY ((genre, artist)) (double parens intentional), but I don't think this fits well with your particular use case, as both keys are required to look up an entry.
Is there anything I can change in the my.ini file to speed up "LOAD DATA INFILE"?
I have two MySQL 5.5 instances each of which has one identical table structured as follows:
CREATE TABLE `log_access` (
`_id` bigint(20) NOT NULL AUTO_INCREMENT,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`type_id` int(11) NOT NULL,
`building_id` int(11) NOT NULL,
`card_id` varchar(15) NOT NULL,
`user_key` varchar(35) DEFAULT NULL,
`user_name` varchar(25) DEFAULT NULL,
`user_validation` varchar(10) DEFAULT NULL,
PRIMARY KEY (`_id`),
KEY `log_access__user_key_timestamp` (`user_key`,`timestamp`),
KEY `log_access__timestamp` (`timestamp`)
) ENGINE=MyISAM
On a daily basis I need to move the data from previous day from instance A to instance B, which consists of roughly 25 million records. At the moment I am doing the following:
1) On instance A, generate an OUTFILE with "WHERE timestamp BETWEEN '2014-09-23 00:00:00' AND '2014-09-23 23:59:59'". This usually takes less than 2 minutes.
2) On instance B, execute "LOAD DATA INFILE". This is the problem area as it takes about 13 hours.
3) On instance A, delete records from the previous day. This will probably be another
4) On instance B, run stats
5) On instance B, truncate the table
I have also considered partitioning the tables and just exchanging the partitions. EXCHANGE PARTITION is supported as of 5.6 and I am willing to update MySQL, however, all documentation discusses exchanging between tables and I haven't been able to confirm that I would be able to do that between DB instances.
I have also considered replication between the instances, but as I have not tinkered with replication in the past and this is a time-sensitive assignment, I am somewhat reluctant to tread into new waters.
Any words of wisdom much appreciated.
CREATE the table without PRIMARY KEY and _id column and add these after LOAD DATA INFILE is complete. MySQL checks the PRIMARY KEY integrity with each INSERT, so I think you can gain a lot of performance here. With MariaDB you can disable keys, but I think this won't work on some storage engines (see here)
A not-very-nice alternative:
I found it very easy to move a MyISAM database by simply copying or moving the files on disk. If you cut and paste the files and run REPAIR TABLE on the target machine, you can do this without restarting the server. Just make sure you copy all 3 files (.frm, .MYD, .MYI).
LOAD DATA INFILE in perfect PK-order, INTO a table that only has the PK-definition, so no secondary indexes yet. After import, add all secondary indexes at once, with 'ALTER TABLE mytable ALGORITHM=INPLACE, LOCK=NONE, ADD KEY ...'.
Consider adding back the secondary indexes on each involved box separately, so not via replication (sql_log_bin=0), to prevent replication lag.
Consider using a partitioned table, as then you can run a 'LOAD DATA INFILE' per partition, in parallel. (applies to RANGE and HASH partitioning, as the separate tsv-files (one or more per partition) are easy to prepare for those)
MariaDB doesn't have the variant 'INTO mytable PARTITION (p000)' yet.
You can load into a separate table first, and then exchange partitions, but MariaDB also doesn't have 'WITHOUT VALIDATION' yet.
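As a rough illustration of preparing the input for the "perfect PK-order" and "one file per partition" ideas, the sketch below splits tab-separated rows into per-partition lists, each sorted by primary key. The assumptions are mine: the primary key is an integer in the first column, the table uses plain HASH partitioning on that key (where the partition is the key modulo the partition count), and the function name split_for_parallel_load is made up. Writing each list to its own tsv file is left out.

```python
from collections import defaultdict


def split_for_parallel_load(lines, partitions, pk_column=0):
    """Group tsv rows by hash partition and sort each group by primary key."""
    buckets = defaultdict(list)
    for line in lines:
        pk = int(line.split("\t")[pk_column])
        # Assumption: PARTITION BY HASH(pk), i.e. partition = pk mod N.
        buckets[pk % partitions].append((pk, line))
    return {
        partition: [line for _, line in sorted(rows)]
        for partition, rows in sorted(buckets.items())
    }


# Tiny made-up export: primary key, then one data column.
export = ["5\ta", "2\tb", "4\tc", "1\td"]
files = split_for_parallel_load(export, partitions=2)
for partition, rows in files.items():
    print(partition, rows)
```

Each resulting list could then be written to its own file and loaded by a separate session, so that every load appends in key order.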
Please excuse any syntax errors in my examples; I am new to SQL.
For this question, let us suppose I have this hypothetical structure:
authors_list:
author_id INT NOT_NULL AUTO_INCREMENT PRIMARY
author_name VARCHAR(30) NOT_NULL
books_list:
book_id INT NOT_NULL AUTO_INCREMENT PRIMARY
book_author_id INT NOT_NULL FOREIGN_KEY(authors_list.author_id)
book_name VARCHAR(30) NOT_NULL
Generally when importing books, I would only know the book name and author name. I have finally figured out how to insert into books_list using only this data:
INSERT INTO `books_list`(`book_author_id`, `book_name`) VALUES ((SELECT `author_id` FROM `authors_list` WHERE `author_name` = 'SomeAuthorName'), 'SomeBookName')
However, I have a .csv file which only contains the columns author_name and book_name. I have previously been importing .csv files with phpMyAdmin, but those tables did not have foreign keys. Is there any way to import a .csv of the form described using this "on the fly lookup" functionality?
You can use SQL directly: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
If you need more logic than ID generation, you can import the data into another table and then write a script or procedure that copies the data from that table to books_list, using your own custom logic.
If your approach works, use it. There will probably be a limit on the amount of data it can handle. If you reach that limit, use the approach suggested above.
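The "import into another table, then copy" idea could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a MySQL server, so treat the DDL as illustrative rather than MySQL syntax. The staging table name books_staging, the sample CSV content, and the UNIQUE constraint on author_name are my assumptions, not part of the question.

```python
import csv
import io
import sqlite3

# In-memory SQLite stands in for MySQL; the INSERT ... SELECT statements are generic SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors_list (
    author_id INTEGER PRIMARY KEY AUTOINCREMENT,
    author_name VARCHAR(30) NOT NULL UNIQUE
);
CREATE TABLE books_list (
    book_id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_author_id INTEGER NOT NULL REFERENCES authors_list (author_id),
    book_name VARCHAR(30) NOT NULL
);
-- Staging table with the same shape as the .csv file (hypothetical name)
CREATE TABLE books_staging (author_name TEXT, book_name TEXT);
""")

# Made-up file content for illustration.
csv_text = (
    "author_name,book_name\n"
    "Jane Austen,Emma\n"
    "Jane Austen,Persuasion\n"
    "Mary Shelley,Frankenstein\n"
)
staged = [(r["author_name"], r["book_name"]) for r in csv.DictReader(io.StringIO(csv_text))]
conn.executemany("INSERT INTO books_staging VALUES (?, ?)", staged)

# Step 1: create any authors that do not exist yet.
conn.execute(
    "INSERT INTO authors_list (author_name) "
    "SELECT DISTINCT author_name FROM books_staging "
    "WHERE author_name NOT IN (SELECT author_name FROM authors_list)"
)
# Step 2: copy the books, resolving author_name to author_id with a join.
conn.execute(
    "INSERT INTO books_list (book_author_id, book_name) "
    "SELECT a.author_id, s.book_name FROM books_staging s "
    "JOIN authors_list a ON a.author_name = s.author_name"
)

books = conn.execute(
    "SELECT a.author_name, b.book_name FROM books_list b "
    "JOIN authors_list a ON a.author_id = b.book_author_id "
    "ORDER BY b.book_name"
).fetchall()
print(books)
```

The join in step 2 does for the whole file what the subquery in your single-row INSERT does for one book.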
I am currently evaluating a strategy for storing supplier catalogs.
A catalog can contain anywhere from 100 to 0.25 million items.
Each item may have multiple errors. The application should support browsing of catalog items:
Group by type of error, category, manufacturer, supplier, etc.
Browse items for any group, with sorting and searching on multiple columns (part ID, names, price, etc.)
The question is: when I have to provide "multiple SEARCH and SORT and GROUP" functionality, how should I create the indexes?
According to the MySQL docs and blogs on indexing, an index on an individual column will not be used by every query.
A multi-column index is also not a good fit for my case, because there might be 20-30 combinations of group, search, and sort.
How do I scale, and how can I make search fast?
I expect to handle 50 million records.
I am currently evaluating with 15 million records.
Suggestions are welcome.
CREATE TABLE CATALOG_ITEM
(
AUTO_ID BIGINT PRIMARY KEY AUTO_INCREMENT,
TENANT_ID VARCHAR(40) NOT NULL,
CATALOG_ID VARCHAR(40) NOT NULL,
CATALOG_VERSION INT NOT NULL,
ITEM_ID VARCHAR(40) NOT NULL,
VERSION INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
DESCRIPTION VARCHAR(2000) NOT NULL,
CURRENCY VARCHAR(5) NOT NULL,
PRICE DOUBLE NOT NULL,
UOM VARCHAR(10) NOT NULL,
LEAD_TIME INT DEFAULT 0,
SUPPLIER_ID VARCHAR(40) NOT NULL,
SUPPLIER_NAME VARCHAR(100) NOT NULL,
SUPPLIER_PART_ID VARCHAR(40) NOT NULL,
MANUFACTURER_PART_ID VARCHAR(40),
MANUFACTURER_NAME VARCHAR(100),
CATEGORY_CODE VARCHAR(40) NOT NULL,
CATEGORY_NAME VARCHAR(100) NOT NULL,
SOURCE_TYPE INT DEFAULT 0,
ACTIVE BOOLEAN,
SUPPLIER_PRODUCT_URL VARCHAR(250),
MANUFACTURER_PRODUCT_URL VARCHAR(250),
IMAGE_URL VARCHAR(250),
THUMBNAIL_URL VARCHAR(250),
UNIQUE(TENANT_ID,ITEM_ID,VERSION),
UNIQUE(TENANT_ID,CATALOG_ID,ITEM_ID)
);
CREATE TABLE CATALOG_ITEM_ERROR
(
ITEM_REF BIGINT,
FIELD VARCHAR(40) NOT NULL,
ERROR_TYPE INT NOT NULL,
ERROR_VALUE VARCHAR(2000)
);
If you are determined to do this solely in MySQL, then you should be creating indexes that will work for all your queries. It's OK to have 20 or 30 indexes if there are 20-30 different queries doing your sorting. But you can probably do it with far fewer indexes than that.
You also need to plan how these indexes will be maintained. I'm assuming that, because this is for supplier catalogs, the data is not going to change much. In that case, simply creating all the indexes you need should do the job nicely. If rows are going to be edited or inserted frequently in real time, you have to factor that into your indexing: having 20 or 30 indexes might not be such a good idea, since MySQL will constantly have to update them all.
You also have to consider which MySQL storage engine to use. If your data never changes, MyISAM (the default engine before MySQL 5.5, basically fast flat files) is a good choice. If it changes a lot, you should be using InnoDB so you get row-level locking. InnoDB also stores each table in a clustered index, which is the primary key and controls the order in which rows are stored on disk. So if you had one particular query that is run 99% of the time, you could choose the primary key to match it, and the data would already be in the right order on disk and would return very fast. The cost is that inserts and key updates that do not arrive in primary-key order force InnoDB to split and reorganize pages, which is not fast for lots of data. You would not choose such a key if the data changed frequently, and you might have to batch-load data updates (like new versions of a supplier's million rows). Again, it comes down to whether you will be updating it never, now and then, or constantly in real time.
Finally, you should consider alternatives to doing this in MySQL. There are a lot of really good search products out there now, such as Apache Solr or Sphinx (mentioned in a comment above), which could make your life a lot easier when coding up the search interfaces themselves. You could index the catalogs in one of these and then use it to provide some really powerful search features like full-text and/or faceted search. A good way to describe how these work is that it's like having a private Google search engine indexing your stuff. It takes time to write the code to interface with the search server, but you will most likely save that time by not having to wrap your head around the indexing problem and the other issues I mentioned above.
If you do just go with creating all the indexes, though, learn how to use the EXPLAIN command in MySQL. It lets you see what MySQL's plan for executing a query will be. You can create indexes, then re-run EXPLAIN on your queries and see how MySQL is going to use them. This way you can make sure that each of your query methods has indexes supporting it and is not falling back to scanning your entire table to find things. With as many rows as you're talking about, every query MUST be able to use indexes to find its data. If you get those right, it'll perform fine.
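The create-index-then-re-check workflow can be tried out in miniature without a MySQL server. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN, which is only an analogue of MySQL's EXPLAIN (the output format differs), on a cut-down, hypothetical version of the CATALOG_ITEM table. The index name idx_tenant_category_price and the helper plan are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for CATALOG_ITEM with only the columns the query touches.
conn.execute(
    "CREATE TABLE catalog_item ("
    " auto_id INTEGER PRIMARY KEY,"
    " tenant_id TEXT NOT NULL,"
    " category_code TEXT NOT NULL,"
    " price REAL NOT NULL,"
    " name TEXT NOT NULL)"
)

QUERY = (
    "SELECT * FROM catalog_item "
    "WHERE tenant_id = ? AND category_code = ? ORDER BY price"
)


def plan(conn, sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last field of each row is the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


before = plan(conn, QUERY, ("t1", "c1"))
# Equality columns first, then the sort column.
conn.execute(
    "CREATE INDEX idx_tenant_category_price "
    "ON catalog_item (tenant_id, category_code, price)"
)
after = plan(conn, QUERY, ("t1", "c1"))

print("before:", before)
print("after: ", after)
```

In MySQL you would do the same thing by prefixing the query with EXPLAIN before and after adding the index, and checking which index it reports using.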