I want to import a large CSV file (about 12 MB) into a MySQL table. First I tried LOAD DATA INFILE and it worked perfectly, but in my case I first want to test each CSV row to decide whether to update existing data or insert a new record.
So the solution is to read the file, compare the content of each row with the data already in the table, and take the right action.
This method also works, but it takes a lot of time and resources.
Now my questions are:
1: Can I use the import functions of phpMyAdmin (open source) even though my project is commercial?
2: If yes, do you know some tutorials about this (any idea)?
3: If not, are there commercial frameworks for exporting/importing?
Thanks
This is actually pretty common SQL: you either want to insert or update, yes? So you need two statements (one for update, one for insert) and a way to tell if it should be inserted. What you really need is a unique key that will never be duplicated for the individual record (can be a composite key) and two statements, like this:
UPDATE tgt SET
    tgt.col1 = src.col1,
    tgt.col2 = src.col2
    -- etc.
FROM the_import_table src
LEFT JOIN the_existing_data tgt ON src.key = tgt.key
WHERE tgt.key IS NOT NULL -- note that I write in T-SQL

INSERT INTO the_existing_data (
    col1,
    col2
    -- etc.
) SELECT
    src.col1,
    src.col2
    -- etc.
FROM the_import_table src
LEFT JOIN the_existing_data tgt ON src.key = tgt.key
WHERE tgt.key IS NULL
Note that you can handle composite keys with a set of ANDed conditions in the join/WHERE, and note that you're not updating the composite key columns, but you probably are inserting them. This should be enough to get you started. Before you ask for clarification, update your question with actual code.
MySQL has specific insert syntax to deal with duplicate rows.
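For example, INSERT ... ON DUPLICATE KEY UPDATE makes the update-or-insert decision for you, as long as the table has a PRIMARY KEY or UNIQUE index to detect the duplicate. A minimal sketch (the table and column names here are placeholders, not your schema):

INSERT INTO existing_data (code, name, qty)
VALUES ('A1', 'Widget', 10)
ON DUPLICATE KEY UPDATE
    name = VALUES(name),
    qty  = VALUES(qty);

Combined with LOAD DATA INFILE into a staging table, you can then do the whole compare-and-apply step in a single INSERT ... SELECT ... ON DUPLICATE KEY UPDATE pass instead of checking every row in PHP.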
I need to insert more than 50,000 records with 200+ fields, and I wondered whether it is important to line up the column list in the INSERT to match the table's columns one for one, or whether the time difference will be insignificant. For example:
INSERT INTO `wp_realty_listingsdb` (`id`,`fname`,`lname`,`age`)
VALUES (1,'Joe','Jordan',21);
INSERT INTO `wp_realty_listingsdb` (`id`,`age`,`fname`,`lname`)
VALUES (1,21,'Joe','Jordan');
To clarify, here is the layout of the database:
Field 1 = fname
Field 2 = lname
Field 3 = age
So the db has the fields laid out from left to right, but as you can see, in the second insert the order of the column list matches the data being entered, while only the top insert matches the order of the database from left to right.
Moral of the story: the huge amount of data is laid out like the second one, and the question is whether a quick rearrangement of the data to match the order of the db would save measurable time or not.
Psychologically it seems like if I line up 200 buckets and go along dropping acorns into the buckets as I come to them, it would be faster. Like I don't have to run down to bucket 155, drop one, then come back to bucket 3... LOL.
With this simple example speed of course won't matter, but there will be 50,000 acorns and 200+ buckets. If no one knows, then I will set it up and test to be sure.
I don't think there is any impact on speed if you swap fields.
It would be better to rearrange your values to match the original column order as in example#1, so that you can execute a single multi-row query as shown below, which is faster than example#2:
example#1
INSERT INTO wp_realty_listingsdb (`id`,`fname`,`lname`, `age`) VALUES(1,'Joe','Jordan',21),(2,'Joe','Jordan',21);
example#2
INSERT INTO `wp_realty_listingsdb` (`fname`,`lname`, `age`) values('Joe','Jordan',21);
INSERT INTO `wp_realty_listingsdb` (`id`,`age`, `fname`,`lname`) values(1,21,'Joe','Jordan');
Also, it would be better to import the records as a SQL file from the MySQL console instead of uploading via GUI applications.
You can use the commands below to import the file from the MySQL console:
mysql> use yourdbname;
mysql> source /path/to/sql/file.sql
Good Morning All,
I'm having a problem pulling the data I need from a SQL backend and keeping it up to date.
I've got two tables that hold the data I need. At one point they were split due to a software update we received. First table: dbo_PT_NC. Second table: dbo_PT_Task.
The primary key of PT_NC is the NCR field. The Task table has its own unique ID, but the PT_Task.TaskTypeID field is linked to the NCR field.
SELECT dbo_PT_Task.TaskTypeID,
dbo_PT_NC.NCR,
dbo_PT_NC.NCR_Date,
dbo_PT_NC.NC_type,
dbo_PT_NC.Customer,
dbo_PT_NC.Material,
dbo_PT_NC.Rev,
dbo_PT_NC.Qty_rejected,
dbo_PT_Task.TaskType,
dbo_PT_Task.Notes AS dbo_PT_Task_Notes,
dbo_PT_NC.Origin,
dbo_PT_NC.Origin_ref,
dbo_PT_NC.Origin_cause,
dbo_PT_NC.Origin_category
FROM dbo_PT_NC INNER JOIN dbo_PT_Task ON dbo_PT_NC.[NCR] = dbo_PT_Task.[TaskTypeID]
WHERE (((dbo_PT_NC.NCR_Date)>=#1/1/2016#) AND ((dbo_PT_Task.TaskSubType)="Origination"))
ORDER BY dbo_PT_NC.NCR_Date, dbo_PT_NC.Customer;
After I have this data pulled and put into a snapshot (I do not want the live data to be accessible to the front-end users), I'll be adding columns for a Weak Point Management System we are implementing, fields such as:
Scrap Code (a lookup field to another table I've built inside Excel)
Containment, Root Cause, Plan, Do, Check, and Act, all of which should most likely be Memo fields (as they may exceed 255 characters)
Date Completed: the date the process was completed
This table (the data I've snapshotted plus the new fields added) will need to be updated with new or changed records from the SQL backend I've previously connected to.
UPDATE
Big thanks to Andre, I got it working. Sample code below (I've added more update fields since):
UPDATE tblWeakPointMaster, dbo_PT_NC INNER JOIN dbo_PT_Task ON dbo_PT_NC.NCR = dbo_PT_Task.TaskTypeID
SET tblWeakPointMaster.Qty_rejected = [dbo_PT_NC].[Qty_rejected],
tblWeakPointMaster.dbo_PT_Task_Notes = [dbo_PT_Task].[Notes],
tblWeakPointMaster.Material = [dbo_PT_NC].[Material],
tblWeakPointMaster.Rev = [dbo_PT_NC].[Rev],
tblWeakPointMaster.NC_type = [dbo_PT_NC].[NC_type]
WHERE (((tblWeakPointMaster.NCR)=dbo_PT_NC.NCR) And ((tblWeakPointMaster.TaskID)=dbo_PT_Task.TaskID));
I assume there is a 1:n relation between PT_NC and PT_Task?
Then you should include both primary keys in the import SELECT.
Either use them as a composite primary key in the Access tables instead of the new KEY column, or, if that is impractical because other tables link to tblWeakPointMaster, you can also keep that primary key.
But in any case, these two columns form the JOIN between tblWeakPointMaster and tblWeakPointUpdates.
All other columns can be used to update tblWeakPointMaster from tblWeakPointUpdates (assuming they can be edited in the original database).
Edit: if you don't use them as a composite primary key, you need to create a unique index on the combination, or the JOIN will not be updateable, I think.
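For example, a unique index on the two linking columns of the local table could be created with DDL along these lines (the index name is just a placeholder; NCR and TaskID are the columns used in the UPDATE above):

CREATE UNIQUE INDEX idxNcrTask
ON tblWeakPointMaster (NCR, TaskID);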
I'm currently practicing with Sphinx. I haven't gotten much further than the configuration, which is what I'm working on now. The sql_query key is leaving me somewhat confused about what to put there. I read the Sphinx documentation for sql_query, but it still doesn't make clear what to do, since I have many SELECTs in my web application, I want to use Sphinx for my search, and the SQL is often changed (by user search filtering).
As for my current search using MySQL: I want to integrate Sphinx into my web application. If sql_query is not optional, should I expect to put my whole search SQL query into that field, or do I pick out the necessary fields from the tables to build the index?
Can someone point me in the right direction so I can get things going well with Sphinx and my web application?
sql_query is mandatory; it's run by Sphinx to get the data you want indexed from MySQL. You can have joins, conditions etc., but it must be a valid SQL query. You should have something like "SELECT id, field1, field2, fieldx FROM table". id must be a primary id. Each row returned by this query is considered a document (which is what Sphinx returns when you search).
If you have multiple tables (that are very different in meaning: users, articles etc.), you need to create an index for each.
Read the tutorials at http://sphinxsearch.com/info/articles/ to understand how Sphinx works.
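As a minimal sketch, the relevant part of sphinx.conf might look like this (the source/index names, credentials and columns are placeholders, not your schema):

source articles_src
{
    type      = mysql
    sql_host  = localhost
    sql_user  = myuser
    sql_pass  = mypass
    sql_db    = mydb

    # every row returned here becomes one document; the first
    # column must be a unique unsigned integer id
    sql_query = SELECT id, title, body, category_id FROM articles

    # attributes can later be used for filtering/sorting at search time
    sql_attr_uint = category_id
}

index articles
{
    source = articles_src
    path   = /var/lib/sphinx/articles
}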
You can create a SQL query to get a union set of records from the database. If you do multiple-table joins and query to select the best result set, you can do it with Sphinx too.
You may run into a few problems with your existing table structure in the database, like:
The base table does not have an integer primary key field.
In that case, create a new table that has two fields: one for the integer id and one to hold the primary key of the base table. Do an inner join with that table and select the id field from it.
E.g. SELECT t1.id, t2.name, t2.description, t2.content FROM table_new t1 INNER JOIN table_2 t2 ON t1.document_id = t2.thread_id INNER JOIN REST_OF_YOUR_SELECT_QUERY
The t1.id is for the Sphinx search engine to do its internal indexing.
You filter data by placing a WHERE clause and filtering:
You can do that in Sphinx by setting filters dynamically based on the conditions, as in the sketch after this list.
You select and join different tables to get results:
This can also be done by setting up different sources and indexes based on your requirements.
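Here is a minimal sketch of a dynamic filter using the official PHP Sphinx API (sphinxapi.php); the index name and attribute are placeholders, not your real schema:

<?php
require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);

// the equivalent of "WHERE category_id = 3" in plain SQL
$cl->SetFilter('category_id', array(3));

$result = $cl->Query('user search words', 'articles');
if ($result !== false) {
    // $result['matches'] is keyed by document id
    $ids = array_keys($result['matches']);
    // ...fetch the full rows from MySQL by these ids...
}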
Hope this helps you get an understanding of what you need to add and modify, and of how the Sphinx search engine can be configured to your requirements. Just come back here if you need more help.
I'm very new to MySQL, although I've used SQL databases in other contexts before. I have a test site set up which has an online cPanel with access to phpMyAdmin. I'm attempting to setup a MySQL database, and so far it's working fine (I can connect to the Database and the table).
The only problem I'm having is with inserting data. I'd like to insert an entire array (specifically, the array will be a double[]) into one column. After looking at the column types available in phpMyAdmin, it doesn't seem to support inserting arrays other than Binary arrays.
I've found many solutions for inserting arrays programmatically, including this thread, but for this site we will be inserting data via the online cPanel. Is there a way to do that?
If you want access to that data, and want to be able to use the power of SQL to search in your double[], you should do it this way:
First, you should spend some time researching relational databases. They allow you to create linked data.
An important part of every relational database is using good keys. A key is a unique identifier for a row that allows you to access the data on that row in an efficient manner.
Another important part of relational databases is indexes. Indexes are not required to be unique, but they are useful if you are trying to search on them (SQL builds an "index" of the table based on a column or group of columns).
If you wanted to create a table that would hold a double[] array, you might instead create a second table that relates to the first table by the first table's primary key.
CREATE TABLE base (
base_id INT AUTO_INCREMENT,
name VARCHAR(32),
PRIMARY KEY(base_id)
);
CREATE TABLE darray (
base_id INT,
data DOUBLE,
INDEX(base_id)
);
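For example, storing one double[] for the row 'a name' could look like this (a sketch; the actual values are placeholders):

INSERT INTO base (name) VALUES ('a name');

-- LAST_INSERT_ID() is the base_id generated by the insert above,
-- so every element of the array becomes one row in darray
INSERT INTO darray (base_id, data) VALUES
    (LAST_INSERT_ID(), 0.5),
    (LAST_INSERT_ID(), 1.25),
    (LAST_INSERT_ID(), 3.75);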
To get the information back out that you want, you can select using a JOIN statement. If you wanted to get all the information where the base_id was 3, you would write it like so:
SELECT * FROM base
JOIN darray ON darray.base_id = base.base_id
WHERE base.base_id = 3;
The more advanced form of writing this, with aliasing:
SELECT * FROM base b
JOIN darray d ON d.base_id = b.base_id
WHERE b.base_id = 3;
If you don't want to have access to the data, but are just storing and recalling it, you should do it this way (although this is debatable; I still recommend the above way if you are willing to learn more SQL):
I assume you will be using PHP; we will serialize the data (see: http://php.net/manual/en/function.serialize.php).
Note that in this case we don't have the darray table, but instead add a
data BLOB
to the base table.
Inserting with PHP serialized data
<?php
// escape the serialized string, since serialized data can contain quotes
$serializedData = mysql_real_escape_string(serialize($darray));
$result = mysql_query("INSERT INTO base (name, data) VALUES ('a name', '$serializedData')");
Getting the serialized data
<?php
$result = mysql_query("SELECT data FROM base WHERE base_id=3");
if($result && mysql_num_rows($result) > 0) { // mysql_num_rows(), not mysql_affected_rows(), for a SELECT
$serializedData = mysql_result($result, 0, 'data');
$darray = unserialize($serializedData);
}
You can import data for tables with a .sql file (basically just a file full of INSERT queries), but phpMyAdmin doesn't support inserting data from arbitrary data types. If you want to insert a double[] array as multiple rows in a table, you'll need to take an approach similar to the one in the thread you linked.
(Note that you can always write such a program for the explicit purpose of generating a .sql file which you then use for deployment.)
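Such a generated file could be as simple as a handful of multi-row INSERTs (the table name and values here are only illustrative):

-- values.sql, produced by your own script
INSERT INTO darray (base_id, data) VALUES
    (3, 0.12),
    (3, 4.56),
    (3, 7.89);

You can then load it through phpMyAdmin's Import tab or from the mysql command line.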
I am a bit rusty with MySQL and trying to jump in again, so sorry if this is too easy a question.
I basically created a data model that has a table called "Master" with required fields of a name and an IDcode, and then a "Details" table with a foreign key of IDcode.
Now here's where it's getting tricky. I am entering:
INSERT INTO Details (Name, UpdateDate) VALUES (name, updateDate)
I get an error saying IDcode on Details doesn't have a default value, so I add one; then it complains that field 'Master_IDcode' doesn't have a default value.
It all makes sense, but I'm wondering if there's an easy way to do what I am trying to do. I want to add data into Details, and if no IDcode exists, I want to add an entry into the Master table. The problem is I have to first add the name to the Master, wait for a unique ID to be generated (for IDcode), then figure that out and add it to my query when I enter the Details data. As you can imagine, the queries are going to get quite long since I have many tables.
Is there an easier way, where every time I add something it searches by name whether a foreign key exists, and if not adds it to all the tables it's linked to? Is there a standard way people do this? I can't imagine that, with all the complex databases out there, people have not figured out an easier way.
Sorry if this question doesn't make sense. I can add more information if needed.
P.S. This may be a different question, but I have heard of Django for Python and that it helps create queries; would it help my situation?
Thanks so much in advance :-)
(decided to expand on the comments above and put it into an answer)
I suggest creating a set of staging tables in your database (one for each data set/file).
Then use LOAD DATA INFILE (or insert the rows in batches) into those staging tables.
Make sure you drop indexes before the load, and re-create what you need after the data is loaded.
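A minimal sketch of such a load (the file path, delimiters and staging table name are placeholders for whatever your files actually look like):

LOAD DATA INFILE '/path/to/your_file.csv'
INTO TABLE staging_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;   -- skip the header row, if there is one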
You can then make a single pass over the staging table to create the missing master records. For example, let's say one of your staging tables contains a country code that should be used as a master ID. You could add the missing master records by doing something along the lines of:
insert
into master_table(country_code)
select distinct s.country_code
from staging_table s
left join master_table m on(s.country_code = m.country_code)
where m.country_code is null;
Then you can proceed and insert the rows into the "real" tables, knowing that all detail rows reference a valid master record.
If you need to get reference information along with the data (such as translating some code), you can do this with a simple join. Also, if you want to filter rows by some other table, that is now very easy:
insert
into real_table_x(
key
,colA
,colB
,colC
,computed_column_not_present_in_staging_table
,understandableCode
)
select x.key
,x.colA
,x.colB
,x.colC
,(x.colA + x.colB) / x.colC
,c.understandableCode
from staging_table_x x
join code_translation c on(x.strange_code = c.strange_code);
This approach is a very efficient one and it scales very nicely. Variations of the above are commonly used in the ETL part of data warehouses to load massive amounts of data.
One caveat with MySQL is that it doesn't support hash joins, which is a join mechanism very suitable to fully join two tables. MySQL uses nested loops instead, which mean that you need to index the join columns very carefully.
InnoDB tables with their clustering feature on the primary key can help to make this a bit more efficient.
One last point: once the staging data is inside the database, it is easy to add some analysis of the data and put aside "bad" rows in a separate table. You can then inspect the data using SQL instead of wading through CSV files in your editor.
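For example (the rejects table and the validity checks are only illustrative; use whatever "bad" means for your data):

insert
  into staging_rejects
select *
  from staging_table
 where country_code is null   -- required value missing
    or amount < 0;            -- obviously invalid number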
I don't think there's one-step way to do this.
What I do is issue a
INSERT IGNORE (..) values (..)
to the master table, wich will either create the row if it doesn't exist, or do nothing, and then issue a
SELECT id FROM master where someUniqueAttribute = ..
The other option would be stored procedures/triggers, but they are still pretty new in MySQL and I doubt wether this would help performance.
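Put together with the table names from your question, the pattern would look something like the sketch below (it assumes Master.Name has a UNIQUE index, otherwise INSERT IGNORE has nothing to detect the duplicate with; the foreign key column name is taken from the error message in your question, and the literal values are placeholders):

INSERT IGNORE INTO Master (Name) VALUES ('some name');

-- look the IDcode up by the unique attribute; LAST_INSERT_ID() is not
-- reliable here because it is not set when the row already existed
SELECT IDcode FROM Master WHERE Name = 'some name';

-- then use that IDcode in the Details insert
INSERT INTO Details (Master_IDcode, Name, UpdateDate)
VALUES (42, 'some name', NOW());   -- 42 stands in for the IDcode returned above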