I have a flat file database (yeah, gross, I know - the worst part is that it's 1.4GB), and I'm in the process of moving it to a MySQL database. The problem is that I'm not sure how to go about doing this - I've checked through every related question on here, but none relates to what I want to do or to how my database is currently set up.
My current flat file database is set up so that a normal MySQL row is its own file, and a MySQL table would be the directory. So, for example, if you have a user named Jon, there would be a file for that user in a directory named /members/. Within that file would be various pieces of information for the user, including the user's id, rank, etc., with each key and value separated by a tab and each pair on its own line (userid\t4).
So here's an example user file:
userid 4
notes staff notes: bla bla staff2 notes: bla bla bla
username Example
So how can I convert each of these files into its own row, with its own fields, in MySQL? And if possible, could I do thousands of these files at once?
Thanks.
This seems like a fairly trivial scripting problem.
See the example below (a rough Python sketch) for how you might read the user directory into a user table.
Clearly, you would want it to be a bit more robust, with error checking and data validation, but just for perspective, see below:
import os
import mysql.connector  # assumes the MySQL Connector/Python package is installed

# connection details are placeholders; fill in your own
conn = mysql.connector.connect(user='dbuser', password='dbpass', database='mydb')
cursor = conn.cursor()

users_dir = '/path/to/users/'
for filename in os.listdir(users_dir):
    line_data = dict()
    with open(os.path.join(users_dir, filename), 'r') as f:
        for line in f:
            # each line is "key<TAB>value"
            key, value = line.rstrip('\n').split('\t', 1)
            line_data[key] = value
    cursor.execute(
        'INSERT INTO users (userid, notes, username) VALUES (%s, %s, %s)',
        (line_data['userid'], line_data['notes'], line_data['username'])
    )

# a single commit at the end keeps thousands of inserts in one transaction
conn.commit()
LOAD DATA INFILE is used for CSV files, and yours are not, so:
merge all the files in a directory into a single CSV file, removing the column names (userid, username, ...) and separating the columns with a delimiter ([TAB], ";", ...), then import it as CSV (see the sketch below).
Loop over every directory you have.
Or write a "stupid" program (PHP works well) that does all this work for you.
I've got a process that creates a csv file that contains ONE set of values that I need to import into a field in a MySQL database table. This process creates a specific file name that identifies the values of the other fields in that table. For instance, the file name T001U020C075.csv would be broken down as follows:
T001 = Test 001
U020 = User 020
C075 = Channel 075
The file contains a single row of data separated by commas for all of the test results for that user on a specific channel and it might look something like:
12.555, 15.275, 18.333, 25.000 ... (there are hundreds, maybe thousands, of results per user, per channel).
What I'm looking to do is to import directly from the CSV file adding the field information from the file name so that it looks something like:
insert into results (test_no, user_id, channel_id, result) values (1, 20, 75, 12.555)
I've tried to use "Bulk Insert" but that seems to want to import all of the fields where each ROW is a record. Sure, I could go into each file and convert the row to a column and add the data from the file name into the columns preceding the results but that would be a very time consuming task as there are hundreds of files that have been created and need to be imported.
I've found several "import CSV" solutions but they all assume all of the data is in the file. Obviously, it's not...
The process that generated these files cannot be modified (yes, I asked). Even if it could be, it would only provide the proper format going forward, and what is needed is analysis of the historical data. And the new format would take significantly more space.
I'm limited to using either MATLAB or MySQL Workbench to import the data.
Any help is appreciated.
Bob
A possible SQL approach to getting the data loaded into the table would be to run a statement like this:
LOAD DATA LOCAL INFILE '/dir/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = '001'
, user_id = '020'
, channel_id = '075'
;
We need the comma to be the line separator, and as the field separator we can specify some character that we are guaranteed will not appear in the data. That way we get LOAD DATA to see a single "field" on each "line".
(If there isn't a trailing comma at the end of the file, after the last value, we need to test to make sure we are still getting the last value, i.e. the last "line" as we're telling LOAD DATA to read the file.)
We could use user-defined variables in place of the literals, but that leaves the part about parsing the filename. That's really ugly in SQL, but it could be done, assuming a consistent filename format...
-- parse filename components into user-defined variables
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'T',-1),'U',1) AS t
, SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'U',-1),'C',1) AS u
, SUBSTRING_INDEX(f.n,'C',-1) AS c
, f.n AS n
FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX( i.filename ,'/',-1),'.csv',1) AS n
FROM ( SELECT '/tmp/T001U020C075.csv' AS filename ) i
) f
INTO @ls_t
, @ls_u
, @ls_c
, @ls_n
;
While we're testing, we probably want to see the result of the parsing.
-- for debugging/testing
SELECT @ls_t
, @ls_u
, @ls_c
, @ls_n
;
And then there's the part about running the actual LOAD DATA statement. We've got to specify the filename again, and we need to make sure we're using the same filename ...
LOAD DATA LOCAL INFILE '/tmp/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = @ls_t
, user_id = @ls_u
, channel_id = @ls_c
;
(The client will need read permission on the .csv file.)
Unfortunately, we can't wrap this in a procedure, because running a LOAD DATA statement is not allowed from a stored program.
Some would correctly point out that as a workaround, we could compile/build a user-defined function (UDF) to execute an external program, and a procedure could call that. Personally, I wouldn't do it. But it is an alternative we should mention, given the constraints.
I am using PrestaShop and have data in Zen Cart. I am matching up the information and want to select the data to be inserted into a different table under different fields.
insert into presta_table1 (c1, c2, ...)
select c1, c2, ...
from zen_table1
Since a lot is different, I need to do approximately 800 records once I match up which field is which in which table.
I recently found an example:
USE datab1;
INSERT INTO datab3.prestatable (author, editor)
SELECT author_name, editor_name
FROM author, datab2.editor
WHERE author.editor_id = datab2.editor.editor_id;
It would be nice to find a way to import while avoiding duplicates.
I am unable to find examples of this.
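One common pattern for skipping duplicates, assuming presta_table1 has a PRIMARY or UNIQUE key on the columns that define a duplicate (the table and column names below are only the placeholders from the question), is INSERT IGNORE; a rough sketch:

INSERT IGNORE INTO presta_table1 (c1, c2)
SELECT c1, c2
FROM zen_table1;

Rows that would violate the unique key are silently skipped; if existing rows should be refreshed instead of skipped, ON DUPLICATE KEY UPDATE is the usual alternative.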
Here is what I did to get data out of this POS (point of sale) system that uses MySQL for a database.
I found the tables with the data I needed and exported that data, which came out in CSV format. I used Calc in LibreOffice to open it and then, in another sheet, manipulated the data into the example CSV fields, and that worked well.
I had some issues with some of the data, but I used console commands to help me get past these; let me share them with you.
Zen Cart description data exported from MySQL Workbench had some model numbers that I needed to pull out into their own field:
43,1,"Black triple SS","Black Triple SS
12101-57 (7.5 inch)",,0
I used a command to add in " , " so I could extract the data and overlay it, essentially copy-pasting into the Calc spreadsheet where I needed it, in the normal order.
I did a database dump, removed the starting ( and the ending ), and saved it as zencart_product.csv. Then I ran the sed command below in a console; it removes every ),( in the file and replaces it with a newline:
sed 's/),(/\n/g' zencart_product.csv > zencart_productNEW.csv
I had about 1000 files with $ and # in their names, so I put them all in a directory and renamed them with the following commands.
get rid of the $ symbol
rename 's/\$//g' *
get rid of the # symbol
rename 's/#//g' *
get rid of spaces
rename "s/\s+//g" *
I hope people who are stuck in some software and want their data out are able to get it out with some time and effort, and that this helps someone. Thanks
I'm trying to build an SSIS package that reads from a text file and outputs into another text file. The catch is that the file I'm trying to read from has multiple sections and I can't find anything that shows how to do that.
The file looks like this:
[sectionA]
key1=value1
key2=value2
key3=value3
[sectionB]
key4=value4
key5=value5
key6=value6
I started with a couple of tutorials that read from a flat file source but the data gets pulled into an equally ugly table. Hoping someone has some input on this.
The SSIS Flat File Connection is built for speed, so it doesn't allow for niceties like that.
I would still use the Flat File Connection but just load all the data into a single, wide NVARCHAR column in a SQL table. I would add an IDENTITY column to that table to get a relative Row Number.
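A minimal sketch of such a staging table (the table and column names are assumptions, chosen to match the query fragments below) might be:

CREATE TABLE Staging_Table (
    File_Row_Number INT IDENTITY(1,1) PRIMARY KEY,
    nvarchar_column NVARCHAR(4000) NULL
);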
Then I would add downstream tasks using SQL to select by Sections e.g. for Section A rows:
WHERE File_Row_Number > ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionA]' )
AND File_Row_Number < ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionB]' )
If the split requirements are as simple as those shown, I might attempt them in SQL, e.g.
How do I split a string so I can access item x?
But I would probably lean towards using Strings.Split in a Script Task where the code will be simpler and safer.
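For completeness, here is a hedged sketch of the pure-SQL route for section A, using the assumed staging table above and CHARINDEX to split each key=value line (this assumes every data row contains an '='):

SELECT LEFT(nvarchar_column, CHARINDEX('=', nvarchar_column) - 1) AS [key]
     , SUBSTRING(nvarchar_column, CHARINDEX('=', nvarchar_column) + 1, LEN(nvarchar_column)) AS [value]
FROM Staging_Table
WHERE nvarchar_column LIKE '%=%'
  AND File_Row_Number > ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionA]' )
  AND File_Row_Number < ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionB]' )
;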
I need to process multivariate time series given as multiline, multirow *.csv files with Apache Pig. I am trying to use a custom UDF (EvalFunc) to solve my problem. However, all the Loaders I have tried (except org.apache.pig.impl.io.ReadToEndLoader, which I cannot get to work) to load the data in my csv files and pass it to the UDF return one line of the file as one record. What I need, however, is one column (or the content of the complete file) so that I can process a complete time series. Processing one value at a time is obviously useless, because I need longer sequences of values...
The data in the csv-files looks like this (30 columns, 1st is a datetime, all others are double values, here 3 sample lines):
17.06.2013 00:00:00;427;-13.793273;2.885583;-0.074701;209.790688;233.118828;1.411723;329.099170;331.554919;0.077026;0.485670;0.691253;2.847106;297.912382;50.000000;0.000000;0.012599;1.161726;0.023110;0.952259;0.024673;2.304819;0.027350;0.671688;0.025068;0.091313;0.026113;0.271128;0.032320;0
17.06.2013 00:00:01;430;-13.879651;3.137179;-0.067678;209.796500;233.141233;1.411920;329.176863;330.910693;0.071084;0.365037;0.564816;2.837506;293.418550;50.000000;0.000000;0.014108;1.159334;0.020250;0.954318;0.022934;2.294808;0.028274;0.668540;0.020850;0.093157;0.027120;0.265855;0.033370;0
17.06.2013 00:00:02;451;-15.080651;3.397742;-0.078467;209.781511;233.117081;1.410744;328.868437;330.494671;0.076037;0.358719;0.544694;2.841955;288.345883;50.000000;0.000000;0.017203;1.158976;0.022345;0.959076;0.018688;2.298611;0.027253;0.665095;0.025332;0.099996;0.023892;0.271983;0.024882;0
Has anyone an idea how I could process this as 29 time series?
Thanks in advance!
What do you want to achieve?
If you want to read all rows in all files as a single record, this can work:
a = LOAD '...' USING PigStorage(';') as <schema> ;
b = GROUP a ALL;
b will contain all the rows in a bag.
If you want to read each CSV file as a single record, this can work:
a = LOAD '...' USING PigStorage(';','-tagsource') as <schema> ;
b = GROUP a BY $0; --$0 is the filename
b will contain all the rows per file in a bag.
I have a database built from CSV files loaded from an external source. For some reason, an ID number in many of the tables is loaded into the CSV / database encased in single quotes - here's a sample line:
"'010010'","MARSHALL MEDICAL CENTER NORTH","8000 ALABAMA HIGHWAY 69","","","GUNTERSVILLE","AL","35976","MARSHALL","2565718000","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
Is there any SQL I can run on the already-established database to strip these single quotes, or do I have to parse every CSV file and re-import?
I believe the following would do it (test it first):
UPDATE U
SET YourID = REPLACE(YourID, '''', '')
FROM MyTable AS U
WHERE YourID LIKE '''%'''
If it works right, do a full backup before running it in production.
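The UPDATE ... FROM syntax above is SQL Server flavored; if this database happens to be MySQL (as in the rest of this thread), the equivalent, with the same assumed table and column names, simply drops the FROM clause:

UPDATE MyTable
SET YourID = REPLACE(YourID, '''', '')
WHERE YourID LIKE '''%''';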