Importing a series of .CSV files that contain one field while adding additional 'known' data in other fields - mysql

I've got a process that creates a csv file that contains ONE set of values that I need to import into a field in a MySQL database table. This process creates a specific file name that identifies the values of the other fields in that table. For instance, the file name T001U020C075.csv would be broken down as follows:
T001 = Test 001
U020 = User 020
C075 = Channel 075
The file contains a single row of data separated by commas for all of the test results for that user on a specific channel and it might look something like:
12.555, 15.275, 18.333, 25.000 ... (there are hundreds, maybe thousands, of results per user, per channel).
What I'm looking to do is to import directly from the CSV file adding the field information from the file name so that it looks something like:
insert into results (test_no, user_id, channel_id, result) values (1, 20, 75, 12.555)
I've tried to use "Bulk Insert" but that seems to want to import all of the fields where each ROW is a record. Sure, I could go into each file and convert the row to a column and add the data from the file name into the columns preceding the results but that would be a very time consuming task as there are hundreds of files that have been created and need to be imported.
I've found several "import CSV" solutions but they all assume all of the data is in the file. Obviously, it's not...
The process that generated these files cannot be modified (yes, I asked). Even if it could be, that would only provide the proper format going forward, and what is needed is analysis of the historical data. And the new format would take significantly more space.
I'm limited to using either MATLAB or MySQL Workbench to import the data.
Any help is appreciated.
Bob

A possible SQL approach to getting the data loaded into the table would be to run a statement like this:
LOAD DATA LOCAL INFILE '/dir/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = '001'
, user_id = '020'
, channel_id = '075'
;
We need the comma to be the line separator, and as the field separator we can specify some character that is guaranteed not to appear in the data. That way we get LOAD DATA to see a single "field" on each "line".
(If there isn't a trailing comma at the end of the file, after the last value, we need to test to make sure we are getting the last value, i.e. the last "line" as we're telling LOAD DATA to read the file.)
We could use user-defined variables in place of the literals, but that leaves the part about parsing the filename. That's really ugly in SQL, but it could be done, assuming a consistent filename format...
-- parse filename components into user-defined variables
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'T',-1),'U',1) AS t
, SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'U',-1),'C',1) AS u
, SUBSTRING_INDEX(f.n,'C',-1) AS c
, f.n AS n
FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX( i.filename ,'/',-1),'.csv',1) AS n
FROM ( SELECT '/tmp/T001U020C075.csv' AS filename ) i
) f
INTO @ls_t
, @ls_u
, @ls_c
, @ls_n
;
While we're testing, we probably want to see the result of the parsing.
-- for debugging/testing
SELECT @ls_t
, @ls_u
, @ls_c
, @ls_n
;
And then there's the part about running the actual LOAD DATA statement. We've got to specify the filename again, and we need to make sure we're using the same filename...
LOAD DATA LOCAL INFILE '/tmp/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = @ls_t
, user_id = @ls_u
, channel_id = @ls_c
;
(The client will need read permission on the .csv file.)
Unfortunately, we can't wrap this in a procedure, because running a LOAD DATA statement is not allowed from a stored program.
Some would correctly point out that as a workaround, we could compile/build a user-defined function (UDF) to execute an external program, and a procedure could call that. Personally, I wouldn't do it. But it is an alternative we should mention, given the constraints.
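Since there are hundreds of files, one hedged option is to let SQL generate the LOAD DATA script itself, reusing the SUBSTRING_INDEX parsing above. This is just a sketch: it assumes a hypothetical one-column helper table, here called file_list, that you populate with the bare file names; the generated lines could then be saved from MySQL Workbench and run as a script.

-- sketch only: file_list(n) holds names like 'T001U020C075' (no path, no extension)
SELECT CONCAT(
         'LOAD DATA LOCAL INFILE ''/tmp/', f.n, '.csv'' INTO TABLE results'
       , ' FIELDS TERMINATED BY ''|'' LINES TERMINATED BY '','' ( result )'
       , ' SET test_no = ', SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'T',-1),'U',1)
       , ', user_id = ', SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'U',-1),'C',1)
       , ', channel_id = ', SUBSTRING_INDEX(f.n,'C',-1)
       , ';'
       ) AS load_stmt
  FROM file_list f;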

Related

MySql naming the automatically downloaded CSV file using first and last date

My MySQL query gives me data from 2020-09-21 to 2022-11-02. I want to save the file as FieldData_20200921_20221102.csv.
MySQL query:
SELECT 'datetime','sensor_1','sensor_2'
UNION ALL
SELECT datetime,sensor_1,sensor_2
FROM `field_schema`.`sensor_table`
INTO OUTFILE "FieldData.csv"
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
;
Present output file:
Presently I name the file FieldData.csv, and accordingly that is what I get. But I want the query to automatically append the first and last dates to this file, so it helps me know the duration of the data without having to open it.
Expected output file
FieldData_20200921_20221102.csv.
MySQL's SELECT ... INTO OUTFILE syntax accepts only a fixed string literal for the filename, not a variable or an expression.
To make a custom filename, you would have to format the filename yourself and then write dynamic SQL so the filename could be a string literal. But to do that, you first would have to know the minimum and maximum date values in the data set you are dumping.
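For illustration, here's a rough sketch of that dynamic SQL (untested; it assumes the table and columns from the question, and that your MySQL version allows SELECT ... INTO OUTFILE inside a prepared statement):

-- get the date bounds and build the filename
SELECT CONCAT('FieldData_'
            , DATE_FORMAT(MIN(datetime), '%Y%m%d'), '_'
            , DATE_FORMAT(MAX(datetime), '%Y%m%d'), '.csv')
  INTO @fname
  FROM `field_schema`.`sensor_table`;

-- splice the filename into the export statement as a string literal
SET @sql = CONCAT(
    'SELECT ''datetime'',''sensor_1'',''sensor_2'' UNION ALL '
  , 'SELECT datetime, sensor_1, sensor_2 '
  , 'FROM `field_schema`.`sensor_table` '
  , 'INTO OUTFILE ''', @fname, ''' '
  , 'FIELDS TERMINATED BY '','' ENCLOSED BY ''"'' LINES TERMINATED BY ''\\n''');

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;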
I hardly ever use SELECT ... INTO OUTFILE, because it can only create the outfile on the database server. I usually want the file to be saved on the server where my client application is running, and the database server's filesystem is not accessible to the application.
Both the file naming problem and the filesystem access problem are better solved by avoiding the SELECT ... INTO OUTFILE feature, and instead writing to a CSV file using application code. Then you can name the file whatever you want.

mysql query not recognizing string

I have a table with a varchar(50) column, columnName. I have uploaded values from a local csv file such that the table looks like below. There were no errors/warnings on the import, and I have imported other csv files of the same format (Windows Comma Separated) without having this issue.
***************
ID * columnName
***************
1 * any
2 * thing
3 * helpful
When I run:
SELECT * FROM myDB.tableName;
I see the table as shown above. However, when I run:
SELECT * FROM myDB.tableName WHERE columnName = "any";
I get no rows returned. If I then overwrite the csv loaded value in the table by:
UPDATE myDB.tableName SET columnName='any' WHERE ID= 1;
and then run the same query, then the row is returned as expected. So, at this point, I have two questions:
How can I prevent the csv uploading values that are not recognized as strings?
How can I bulk update all of the currently loaded values in columnName to be recognized as strings (I can't do individual updates as shown above, since there are too many rows affected)?
If the .csv file is from Windows, the file may use CRLF as the line delimiter.
And if the LOAD DATA specifies LINES TERMINATED BY '\n' you might be picking up the CR character as part of the last column.
It's also possible you are picking up trailing spaces.
That's really just a guess.
If that's the case, you might need your LOAD DATA to specify the CRLF as the line terminator, and you may also want to run that last field through a TRIM function.
My LOAD DATA from .csv file created on Windows would look something like this (excerpted, not complete):
LOAD DATA ...
...
LINES TERMINATED BY '\r\n'
...
( id
, @fld2
)
SET columnName = TRIM(@fld2)
To debug what is currently stored in the column from your load, you could use the HEX function. (That's the closest thing I've found in MySQL to an Oracle-style DUMP() function.)
With the latin1 character set, the CR character is shown as x'0D'. A space is x'20' and a tab character is x'09'.
SELECT HEX('abc'), HEX('abc \t\r');

HEX('abc')   HEX('abc \t\r')
----------   -----------------
61 62 63     61 62 63 20 09 0D
So, to check for what's stored, you could run something like this:
SELECT columnName, HEX(columnName)
FROM mytable
WHERE id = 1
Based on that, you can make appropriate adjustments to the LOAD DATA statement.
Using the technique of loading the field into a user-defined variable (as shown in my example LOAD DATA, which loads the field contents into @fld2), you can use a SET clause to assign an expression to a column. The expression can make use of any number of built-in MySQL functions. For example, to remove tab characters from the string:
SET columnName = REPLACE(@fld2,'\t','')
I agree with @bitfiddler that it looks like your data includes whitespace or non-printable characters. If you can't clean the data as it's loaded, executing
UPDATE myDB.tableName SET columnName=TRIM(columnName)
will do a bulk update of the data in place, but it might take a while if the dataset is large.

Creating Hive table - how to derive column names from CSV source?

...I really thought this would be a well-traveled path.
I want to create the DDL statement in Hive (or SQL for that matter) by inspecting the first record in a CSV file that exposes (as is often the case) the column names.
I've seen a variety of near-answers to this issue, but not many that can be automated or replicated at scale.
I created the following code to handle the task, but I fear that it has some issues:
#!/usr/bin/python
import sys
import csv

# get file name (and hence table name) from command line
# exit with usage if no suitable argument
if len(sys.argv) < 2:
    sys.exit('Usage: ' + sys.argv[0] + ': input CSV filename')
ifile = sys.argv[1]

# emit the standard invocation
print 'CREATE EXTERNAL TABLE ' + ifile + ' ('

with open(ifile + '.csv') as inputfile:
    reader = csv.DictReader(inputfile)
    for row in reader:
        k = row.keys()
        sprung = len(k)
        latch = 0
        for item in k:
            latch += 1
            dtype = '` STRING' if latch == sprung else '` STRING,'
            print '`' + item.strip() + dtype
        break

print ')\n'
print "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
print "LOCATION 'replacethisstringwith HDFS or S3 location'"
The first is that it simply datatypes everything as a STRING. (I suppose that coming from CSV, that's a forgivable sin. And of course one could doctor the resulting output to set the datatypes more accurately.)
The second is that it does not sanitize the potential column names for characters not allowed in Hive table column names. (I easily broke it immediately by reading in a data set where the column names routinely had an apostrophe as data. This caused a mess.)
The third is that the data location is tokenized. I suppose with just a little more coding time, it could be passed on the command line as an argument.
My question is -- why would we need to do this? What easy approach to doing this am I missing?
(BTW: no bonus points for referencing the CSV Serde - I think that's only available in Hive 14. A lot of us are not that far along yet with our production systems.)
Regarding the first issue (all columns are typed as strings), this is actually the current behavior even if the table were being processed by something like the CSVSerde or RegexSerDe. Depending on whether the particulars of your use case can tolerate the additional runtime latency, one possible approach is to define a view based upon your external table that dynamically recasts the columns at query time, and direct queries against the view instead of the external table. Something like:
CREATE VIEW my_view AS
SELECT CAST(col1 AS INT) AS col1,
       CAST(col2 AS STRING) AS col2,
       CAST(col3 AS INT) AS col3,
       ...
FROM my_external_table;
For the second issue (sanitizing column names), I'm inferring your Hive installation is 0.12 or earlier (0.13 supports any unicode character in a column name). If you import the re regex module, you can perform that scrubbing in your Python with something like the following:
import re

for item in k:
    ...
    print '`' + re.sub(r'\W', '', item.strip()) + dtype
That should get rid of any non-alphanumeric/underscore characters, which was the pre-0.13 expectation for Hive column names. By the way, I don't think you need the surrounding backticks anymore if you sanitize the column name this way.
As for the third issue (external table location), I think specifying the location as a command line parameter is a reasonable approach. One alternative may be to add another "metarow" to your data file that specifies the location somehow, but that would be a pain if you are already sitting on a ton of data files - personally I prefer the command line approach.
The Kite SDK has functionality to infer a CSV schema with the names from the header record and the types from the first few data records, and then create a Hive table from that schema. You can also use it to import CSV data into that table.

MySql load data infile STR_TO_DATE returning blank?

I'm importing 1M+ records into my table from a csv file.
Works great using the load data local infile method.
However, the dates are all different formats.
A quick google led me to this function:
STR_TO_DATE
However, when I implement that, I get nothing: an empty insert. Here's my SQL, cut down to include one date (I've got 4 with the same issue) and generic column names:
load data local infile 'myfile.csv' into table `mytable`
fields terminated by '\t'
lines terminated by '\n'
IGNORE 1 LINES
( `column name 1`
, `my second column`
, @temp_date
, `final column`)
SET `Get Date` = STR_TO_DATE(@temp_date, '%c/%e/%Y')
If I do:
SET `Get Date` = @temp_date
The date from the csv is captured in the format it was in the file.
However, when I try the first method, my table column is empty. I've changed the column type to varchar(255) from timestamp to capture whatever is going in, but ultimately I want to capture Y-m-d H:i:s (not sure if STR_TO_DATE can do that?).
I'm also unsure as to why I need the @ symbol... Google failed me there.
So, my questions are:
Why do I need the @ symbol to use this function?
Should the data format ('%c/%e/%Y') be the format of the inputted data or my desired output?
Can I capture time in this way too?
sorry for the large post!
Back to Google for now...
Why do I need the @ symbol to use this function?
The @ symbol means that you are using a user variable, so the string that is read isn't put right away into the table but into a piece of memory that lets you operate on it before inserting it. More info at http://dev.mysql.com/doc/refman/5.0/en/user-variables.html
Should the data format ('%c/%e/%Y') be the format of the inputted data or my desired output?
It's the format of the inputted data; more info at http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_str-to-date
Can I capture time in this way too?
You should be able to, as long as you choose the correct format, something like:
STR_TO_DATE(@temp_date, '%c/%e/%Y %H:%i:%s');
(Note that %H is the 24-hour format; %h only matches hours 01-12.)
I had this problem. What solved it for me was making sure I accounted for whitespace that wasn't part of a delimiter in my load file. So if ',' is the delimiter:
..., 4/29/2012, ...
might be interpreted as " 4/29/2012", so it should be:
...,4/29/2012,...
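If cleaning the file isn't practical, the trimming can also be done during the load itself; here's a sketch reusing the user-variable technique from the question:

load data local infile 'myfile.csv' into table `mytable`
fields terminated by '\t'
lines terminated by '\n'
IGNORE 1 LINES
( `column name 1`
, `my second column`
, @temp_date
, `final column`)
SET `Get Date` = STR_TO_DATE(TRIM(@temp_date), '%c/%e/%Y')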

how to populate a database?

I have a MySQL database with a single table that includes an autoincrement ID, a string, and two numbers. I want to populate this database with many strings, coming from a text file, with all numbers initially reset to 0.
Is there a way to do it quickly? I thought of creating a script that generates many INSERT statements, but that seems somewhat primitive and slow, especially since MySQL is on a remote site.
Yes - use LOAD DATA INFILE (docs are here). Example:
LOAD DATA INFILE 'csvfile'
INTO TABLE `table`
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 0 LINES
(cola, colb, colc)
SET cold = 0,
cole = 0;
Notice the SET line: that is where you set a default value.
Depending on your field separator, change the line FIELDS TERMINATED BY ','.
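For the single-string-per-line text file described in the question, a minimal sketch (file path, table, and column names are placeholders) might be:

-- one string per line; the auto-increment ID assigns itself,
-- and the two numeric columns are forced to 0
LOAD DATA INFILE '/path/to/strings.txt'
INTO TABLE mytable
LINES TERMINATED BY '\n'
(name_col)
SET num_col1 = 0,
    num_col2 = 0;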
The other answers only respond to half of your question. For the other half (zeroing numeric columns):
Either:
Set the default value of your number columns to 0 (see the sketch at the end of this answer),
In your text file, simply delete the numeric values.
This will cause the field to be read by LOAD DATA INFILE as null, and the default value will be assigned, which you have set to 0.
Or:
Once you have your data in the table, issue a MySQL command to zero the fields, like
UPDATE table SET first_numeric_column_name = 0, second_numeric_column_name = 0 WHERE 1;
And to sum everything up, use LOAD DATA INFILE.
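For the first option, a sketch of setting the defaults (column names taken from the UPDATE above; the table name is a placeholder):

-- give the numeric columns a default of 0,
-- so rows loaded without those values get 0
ALTER TABLE mytable
  ALTER COLUMN first_numeric_column_name SET DEFAULT 0,
  ALTER COLUMN second_numeric_column_name SET DEFAULT 0;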
If you have access to the server's file system, you can utilize LOAD DATA.
If you don't want to fight with syntax, the easiest way (if you're on Windows) is to use HeidiSQL.
It has a friendly wizard for this purpose.
Maybe I can help you with right syntax, if you post sample line from text file.
I recommend using SB Data Generator by Softbuilder (which I work for); download and install the free trial.
First, create a connection to your MySQL database then go to “Tools -> Virtual Data” and import your test data (the file must be in CSV format).
After importing the test data, you will be able to preview them and query them in the same window without inserting them into the database.
Now, if you want to insert test data into your database, go to “Tools -> Data Generation” and select "generate data from virtual data".