How to batch load CSV columns into a MySQL table - mysql

I have numerous CSV files that will form the basis of a MySQL database. My problem is as follows:
The input CSV files are of the format:
TIME | VALUE PARAM 1 | VALUE PARAM 2 | VALUE PARAM 3 | ETC.
0.00001 | 10 | 20 | 30 | etc.
This is not the structure I want to use in the database. There I would like one big table for all of the data, structured something like:
TIME | PARAMETER | VALUE | Unit of Measure | Version
This means that I would like to insert the combination of TIME and VALUE PARAM 1 from the CSV into the table, then the combination of TIME and VALUE PARAM 2, and so on.
I haven't done anything like this before, but could a possible solution be to set up a bash script that loops through the columns and on each iteration inserts the combination of time + value into my database?
I have a reasonable understanding of MySQL but very limited knowledge of bash scripting, and I couldn't find a way to do this with the MySQL LOAD DATA INFILE command alone.
If you need more info to help me out, I'm happy to provide it!
Regards,
Erik

I do this all day, every day, and as a rule have the most success with the least headaches by using LOAD DATA INFILE into a temporary table, then leveraging the power of MySQL to get it into the final table/format successfully. Details at this answer.
To illustrate this further, we process log files for every video event of 80K high schools/colleges around the country (that's every pause/play/seek/stop/start for hundreds of thousands of videos).
They're served from a number of different servers, depending on the type of video (WMV, FLV, MP4, etc.), so there's some 200GB to handle every night, with each format having a different log layout. The old way we did it with CSV/PHP took literally days to finish, but changing it to LOAD DATA INFILE into temporary tables, unifying them into a second, standardized temporary table, and then using SQL to group and otherwise slice and dice cut the execution time to a few hours.
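For the CSV in the question, that pattern might look roughly like the sketch below. The table names (csv_staging, measurements), the column names, and the [unit]/[version] literals are all made up for illustration:
-- Staging table that mirrors the wide CSV layout.
CREATE TEMPORARY TABLE csv_staging (
    time_s DECIMAL(12,6),
    param1 DECIMAL(12,4),
    param2 DECIMAL(12,4),
    param3 DECIMAL(12,4)
);
-- Bulk-load the raw file into the staging table.
LOAD DATA LOCAL INFILE 'input.csv'
INTO TABLE csv_staging
FIELDS TERMINATED BY '|'
IGNORE 1 LINES;
-- Unpivot into the long format TIME | PARAMETER | VALUE | UNIT | VERSION.
INSERT INTO measurements (time_s, parameter, value, unit, version)
SELECT time_s, 'PARAM 1', param1, '[unit]', '[version]' FROM csv_staging
UNION ALL
SELECT time_s, 'PARAM 2', param2, '[unit]', '[version]' FROM csv_staging
UNION ALL
SELECT time_s, 'PARAM 3', param3, '[unit]', '[version]' FROM csv_staging;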

It would probably be easiest to preprocess your CSV with an awk script first, and then (as Greg P said) use LOAD DATA LOCAL INFILE. If I understand your requirements correctly, this awk script should work:
#!/usr/bin/awk -f
# Fields in the input are pipe-separated.
BEGIN { FS = "|" }
# First line: remember each parameter column's label and print the new header.
NR == 1 {
    for (col = 2; col <= NF; col++) label[col] = $col
    printf("TIME | PARAM | VALUE | UNIT | VERSION\n")
    next
}
# All following lines: emit one TIME/PARAM/VALUE row per parameter column.
{
    for (col = 2; col <= NF; col++) {
        printf("%s | %s | %s | [unit] | [version]\n", $1, label[col], $col)
    }
}
Output:
$ ./test.awk test.in
TIME | PARAM | VALUE | UNIT | VERSION
0.00001 | VALUE PARAM 1 | 10 | [unit] | [version]
0.00001 | VALUE PARAM 2 | 20 | [unit] | [version]
0.00001 | VALUE PARAM 3 | 30 | [unit] | [version]
0.00001 | ETC. | etc. | [unit] | [version]
Then
mysql> LOAD DATA LOCAL INFILE 'processed.csv'
    -> INTO TABLE `table`
    -> FIELDS TERMINATED BY '|'
    -> IGNORE 1 LINES;
(Note: I haven't tested the MySQL side.)

Related

Why does Redshift not have an entry for a CSV file in stl_load_commits?

I know AWS mentions in their documentation that a CSV is treated much like a TXT file, but why is there no entry for the CSV file?
For example:
If I am running a query like:
COPY "systemtable" FROM 's3://test/example.txt' <credentials> IGNOREHEADER 1 delimiter as ','
then it creates an entry in stl_load_commits, which I can query with:
select query, curtime as updated from stl_load_commits where query = pg_last_copy_id();
But when I try the same thing with:
COPY "systemtable" FROM 's3://test/example.csv'
<credentials> IGNOREHEADER 1 delimiter as ',' format csv;
then the return from
select query, curtime as updated from stl_load_commits where query = pg_last_copy_id();
is blank. Why does AWS not create an entry for the CSV?
That is the first part of the question. Secondly, there must be some way to check the status of the loaded file?
How can we check whether a file has loaded successfully into the DB when the file is of type CSV?
The format of the file does not affect the visibility of success or error information in system tables.
When you run COPY it returns confirmation of success and a count of rows loaded. Some SQL clients may not return this information to you but here's what it looks like using psql:
COPY public.web_sales from 's3://my-files/csv/web_sales/'
FORMAT CSV
GZIP
CREDENTIALS 'aws_iam_role=arn:aws:iam::01234567890:role/redshift-cluster'
;
-- INFO: Load into table 'web_sales' completed, 72001237 record(s) loaded successfully.
-- COPY
If the load succeeded you can see the files in stl_load_commits:
SELECT query, TRIM(file_format) format, TRIM(filename) file_name, lines, errors FROM stl_load_commits WHERE query = pg_last_copy_id();
query | format | file_name | lines | errors
---------+--------+---------------------------------------------+---------+--------
1928751 | Text | s3://my-files/csv/web_sales/0000_part_03.gz | 3053206 | -1
1928751 | Text | s3://my-files/csv/web_sales/0000_part_01.gz | 3053285 | -1
If the load fails you should get an error. Here's an example error (note the table I try to load):
COPY public.store_sales from 's3://my-files/csv/web_sales/'
FORMAT CSV
GZIP
CREDENTIALS 'aws_iam_role=arn:aws:iam::01234567890:role/redshift-cluster'
;
--ERROR: Load into table 'store_sales' failed. Check 'stl_load_errors' system table for details.
You can see the error details in stl_load_errors.
SELECT query, TRIM(filename) file_name, TRIM(colname) "column", line_number line, TRIM(err_reason) err_reason FROM stl_load_errors where query = pg_last_copy_id();
query | file_name | column | line | err_reason
---------+------------------------+-------------------+------+---------------------------
1928961 | s3://…/0000_part_01.gz | ss_wholesale_cost | 1 | Overflow for NUMERIC(7,2)
1928961 | s3://…/0000_part_02.gz | ss_wholesale_cost | 1 | Overflow for NUMERIC(7,2)

MySQL varchar column filled but not visible

I'm having a problem with a column (VARCHAR(513) NOT NULL) on a MySQL table. During an import from a CSV file, a bunch of rows got filled with some weird stuff coming from I don't know where.
This stuff is not visible from Workbench, but if I query the DBMS with:
SELECT * FROM MyTable;
I get:
ID | Drive | Directory        | URI          | Type ||
1  | Z:    | \Users\Data\     | \server\dati | 1    || <- correct row
...
32 | NULL  | \Users\OtherDir\ |              | 0    ||
While row 1 is correct, row 32 shows a URI filled with something. Now, if I query the DBMS with:
SELECT length(URI) FROM MyTable WHERE ID = 32;
I get 32. But running:
SELECT URI FROM MyTable WHERE ID = 32;
inside an MFC application gets a string with length 0. Inside this program I have a tool for handling this table, but it cannot work because I cannot build queries against rows with the bugged URI. How can I fix this? Where does this problem come from? If you need more information, please ask.
Thanks.
It looks like you have white space in the data, which is causing the issue; this most often happens when you import data from a CSV.
To fix it you may need to run the following update statement:
update MyTable set URI = trim(URI);
The above will remove the white space from the column.
Also, while importing data from a CSV it is better to apply TRIM() to the values before inserting them into the database; this will avoid these kinds of issues.
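If the import happens to be done with LOAD DATA INFILE, the trimming can also be applied during the load itself via a SET clause. This is only a sketch: the file path and delimiter are assumptions, while the table and column names follow the question:
LOAD DATA INFILE '/path/to/import.csv'
INTO TABLE MyTable
FIELDS TERMINATED BY ','        -- assumption: adjust to the real delimiter
IGNORE 1 LINES
(ID, Drive, Directory, @uri, Type)
SET URI = TRIM(@uri);           -- strips leading/trailing spaces before the insert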

Mysql table formatting with Ruby mysql gem

MySQL by default prints table results with its table formatting:
+----+----------+-------------+
| id | name     | is_override |
+----+----------+-------------+
|  1 | Combined |           0 |
|  2 | Standard |           0 |
+----+----------+-------------+
When calling mysql from the unix shell this table formatting is not preserved, but it's easy to request it via the -t option
mysql -t my_schema < my_query_file.sql
When using Ruby, I use the mysql gem to fetch results. Since the gem returns data as hashes, there's no option to preserve the table formatting. However, is there any way I can easily print a hash with that formatting, without having to calculate spacing and such?
db = Mysql.new(my_database, my_username, my_password, my_schema)
result = db.query("select * from my_table")
result.each_hash { |h|
  # Print row. Any way to print it with formatting here?
  puts h
}
Some gems and code samples:
https://rubygems.org/gems/datagrid
http://rubygems.org/gems/text-table
https://github.com/visionmedia/terminal-table
https://github.com/geemus/formatador
https://github.com/wbailey/command_line_reporter
https://github.com/arches/table_print
http://johnallen.us/?p=347
I have not tried any of them.

pdi spoon ms-access concat

Suppose I have this table named table1:
| f1 | f2   |
-------------
| 1  | str1 |
| 1  | str2 |
| 2  | str3 |
| 3  | str4 |
| 3  | str5 |
I wanted to do something like:
Select f1, group_concat(f2) from table1 group by f1
(this is MySQL syntax; I am working with MS Access!) and get the result:
| 1 | str1,str2 |
| 2 | str3      |
| 3 | str4,str5 |
So I searched for a function in MS Access that would do the same, and found one! xD
The problem is that every day I have to download some database in MS Access, create the concat function there, and then create a new table with those concatenated values.
I wanted to incorporate that process into the Pentaho Data Integration (Spoon) transformations that I use after all this work.
So what I want is a way to define an MS Access function in PDI Spoon, or some way to combine steps that would emulate group_concat from MySQL.
Simple - query from Access, and use the "Group by" step to do your group_concat - there is an option to concatenate fields separated by , or any string of your choice.
Don't forget that the stream must be sorted by whatever you're grouping by, unless you use the memory group by step.
A simple way is to move your data from MS Access to MySQL with the same structure (MySQL DB structure = MS Access DB structure), then execute your group_concat query there. In detail, follow the steps below:
Create transformation A to move/transfer your MS Access data to MySQL
Create transformation B to execute Select f1, group_concat(f2) from table1 group by f1 (a sketch of this query is shown after these steps)
Create a job to execute transformations A and B (you must run transformation A before B)
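Transformation B's query, for example in a Table input step, might look like the sketch below; the SEPARATOR clause is optional and the table/column names follow the example above:
SELECT f1,
       GROUP_CONCAT(f2 SEPARATOR ',') AS f2_list
FROM table1
GROUP BY f1;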

Storing a variable number of files' download statistics in mysql database

I have a number of files on my website that are private and pushed through php. I keep track of the downloads using a mysql database. Currently I just use a column for each file and insert a new row for every day, which is fine because I don't have many files.
However, I am going to be starting to add and remove files fairly often, and the number of files will be getting very large. As I see it I have two options:
The first is to add and remove columns for each file as they are added and removed. This would quickly lead to the table having very many columns. I am self-taught so I'm not sure, but I think that's probably a very bad thing. Adding and removing columns once there are a lot of rows sounds like a very expensive operation.
I could also create a new database with a generic 'fileID' field, and then add a new row every day for each file, but this would lead to a lot of rows. Also, it would be a lot of row insert operations to create tracking for the next day.
Which would be better? Or is there a third solution that I'm missing? Should I be using something other than mysql? I want something that can be queried so I can display the stats as graphs on the site.
Thank you very much for your help, and for taking the time to read.
I could also create a new database with a generic 'fileID' field, and then add a new row every day for each file, but this would lead to a lot of rows.
Yes, this is what you need to do — but you mean "a new table", not "a new database".
Basically you'll want a file table, which might look like this:
 id | name    | created_date | [other fields ...]
----+---------+--------------+--------------------
  1 | foo.txt | 2012-01-26   | ...
  2 | bar.txt | 2012-01-27   | ...
and your downloads_by_day table will refer to it:
 id | file_id | `date`     | download_count
----+---------+------------+----------------
  1 |       1 | 2012-01-27 |             17
  2 |       2 | 2012-01-27 |             23
  3 |       1 | 2012-01-28 |              6
  4 |       2 | 2012-01-28 |            195
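For completeness, here is a minimal sketch of how those two tables might be defined and used, with assumed column types; the ON DUPLICATE KEY UPDATE trick is one way to avoid pre-creating a row for every file each day:
CREATE TABLE file (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_date DATE NOT NULL
);
CREATE TABLE downloads_by_day (
    id INT AUTO_INCREMENT PRIMARY KEY,
    file_id INT NOT NULL,
    `date` DATE NOT NULL,
    download_count INT NOT NULL DEFAULT 0,
    UNIQUE KEY (file_id, `date`),
    FOREIGN KEY (file_id) REFERENCES file (id)
);
-- Record a download: insert the day's row on the first hit, increment it afterwards.
INSERT INTO downloads_by_day (file_id, `date`, download_count)
VALUES (1, CURDATE(), 1)
ON DUPLICATE KEY UPDATE download_count = download_count + 1;
-- Per-file, per-day data for graphing.
SELECT f.name, d.`date`, d.download_count
FROM downloads_by_day d
JOIN file f ON f.id = d.file_id
ORDER BY d.`date`;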