Import text file into SQL database - mysql

I have a number of separate text files which I would like to import into an SQL database. The data is not comma-separated, which rules out my idea of importing it by comma. The data is also spread across a number of rows. See the example text file below. Could anyone please advise how I could import specific values, such as the programmed and mean values, the shift number, etc.?

It looks like you have a machine-generated report. The ideal approach is to have that machine produce a different report: one that has no '/////' or any of that crap, just the data you want to import. That new report's output might look like this:
shift_num, prog_min, mean_sec, att_sec, adt_min
1, 600, 599, 658, 210
...
In practice, though, it's often not "possible" to get reports like that. (That is, it's always possible for the machine to do it, but often humans are unwilling.) When that happens, use your favorite text-processing language to turn the report into usable data.
I like awk for this kind of stuff. Others like perl.
To illustrate, I keyed in this replica of your report. (Saved as test.dat.)
ORDER Nr FG68909 Q.ty Ordered 99
...
SHIFT Nr. 1
////////
PROGRAMMED MEAN
600 min JOB TIME 599 sec
AVERAGE Turnaround Time 658 sec
AVERAGE Delivery Time 210 mins
Then I wrote this awk program. It makes a lot of assumptions about the layout of your report. Some of them will probably fail on real data.
/SHIFT/ { shift = $NF }

/JOB TIME/ {
    programmed = sprintf("%d %s", $1, $2);
    mean = sprintf("%d %s", $(NF-1), $NF);
}

/AVERAGE Turnaround/ { avg_turnaround = sprintf("%d %s", $(NF-1), $NF); }

# Assumes the line "AVERAGE Delivery" is also the end of the record.
/AVERAGE Delivery/ {
    avg_delivery = sprintf("%d %s", $(NF-1), $NF);
    printf("%d, '%s', '%s', '%s', '%s'\n", shift, programmed, mean, avg_turnaround, avg_delivery);

    # Clear the vars for the next record.
    shift = "";
    programmed = "";
    mean = "";
    avg_turnaround = "";
    avg_delivery = "";
}
The output:
$ awk -f test.awk test.dat
1, '600 min', '599 sec', '658 sec', '210 mins'

You could write a simple application in C# to parse the contents of the file using regex, turn it into one line, and insert semicolons where required.
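If you go that route, the regex idea is easy to prototype in a scripting language first. Here is a minimal sketch of the same approach in Python rather than C#, assuming the test.dat layout keyed in above; the patterns and field names are illustrative and would need adjusting for the real report.

import re

# Minimal sketch: collect one record per shift from the replica report
# and emit a CSV-style line. Patterns assume the test.dat layout above.
row = {}
with open("test.dat") as report:
    for line in report:
        if m := re.search(r"SHIFT Nr\.\s*(\d+)", line):
            row["shift"] = m.group(1)
        elif m := re.search(r"(\d+)\s*min\s+JOB TIME\s+(\d+)\s*sec", line):
            row["programmed"] = m.group(1) + " min"
            row["mean"] = m.group(2) + " sec"
        elif m := re.search(r"AVERAGE Turnaround Time\s+(\d+)\s*(\w+)", line):
            row["turnaround"] = m.group(1) + " " + m.group(2)
        elif m := re.search(r"AVERAGE Delivery Time\s+(\d+)\s*(\w+)", line):
            row["delivery"] = m.group(1) + " " + m.group(2)
            # As in the awk version, "AVERAGE Delivery" ends the record.
            print("{shift}, '{programmed}', '{mean}', '{turnaround}', '{delivery}'".format(**row))
            row = {}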

How to save the results of an optimal power flow in MATPOWER for multiple runs?

I am using MATPOWER for optimal power flow of the IEEE 30-bus system. I am changing the real power generation of a particular bus 12 times and want to save the result of each of the 12 runs. But as written, only the result of the last run is saved in the results struct. The code is given:
P = xlsread('C:\Users\User\Documents\MATLAB\output\sp.xlsx');
for h = 1:12
    P(h);
    mpc.gen(NG,PG) = P(h);
    mpopt = mpoption('pf.alg', 'NR', 'verbose', 1, 'out.all', 0);
    results = runopf(mpc, mpopt);
end
You could store the result struct you get from each opf run in a struct array like so:
for h = 1:12
    P(h);
    mpc.gen(NG,PG) = P(h);
    mpopt = mpoption('pf.alg', 'NR', 'verbose', 1, 'out.all', 0);
    results(h) = runopf(mpc, mpopt);
end
Addressing the results should then be possible by calling e.g. results(3).branch or whatever you want to evaluate.

Unpivot CSV files with changing schemas on Linux

From one of our customers, we receive a number of CSV files on our SFTP server. The files usually vary in header names, column count, and of course row count (usually somewhere between a couple of thousand and a couple of million rows; file size for most of them does not exceed 350 MB). Currently we process all the files through SSIS using a custom C# script.
What I want to accomplish is this: move the entire process to Linux (our SFTP server), in order to shorten the data flow and the pre-processing time.
This may very well be a trivial task for a lot of you, but I can't say I belong to that category, having no real experience developing on Linux.
So how would I do this? Are there any feasible solutions with regard to time efficiency, memory consumption, etc.?
The CSV files could look like this, except that the number of user columns always changes:
e.g. filename: userdata.csv
Question; user1; user2; user3; user4
How old are you; 20; 22; 45; 54
How tall are you; 186; 176; 166; 195
And the output I'm after looks like this:
Question; Value; User; Filename
How old are you; 20; user1; userdata
How old are you; 22; user2; userdata
How old are you; 45; user3; userdata
How old are you; 54; user4; userdata
How tall are you; 186; user1; userdata
How tall are you; 176; user2; userdata
How tall are you; 166; user3; userdata
How tall are you; 195; user4; userdata
Suggestions, advice...anything is most welcome.
Update:
Just to elaborate on the input/output specifics:
input.csv (the result of a questionnaire)
2 questions, "How old are you" and "How tall are you", answered by 4 users: "user1", "user2", "user3" and "user4".
For the purpose of this example "user1" through "user4" are used; in our live data the users' real names appear.
The number of user columns will vary depending on how many participated in the questionnaire.
output.csv
The header row is changed to display 4 static fields: Question, Value, User and Filename.
Instead of having a row per question as in the input file, we need a row per user per question.
The Filename column should hold the name of the input file without the extension.
The character encoding is UTF-8 and the separator is a semicolon. Qualifiers are not used.
So, after a bit of reading on here and a lot of trial and error, it seems I have a working solution. Though it might not be pretty and leaves room for improvement, this is what I got:
A scheduled bash script loops over a filename array and passes the individual filenames to the awk script.
orgFile.sh
#!/bin/bash
# shopt and arrays are bashisms, so the shebang must be bash, not sh.
shopt -s nullglob
fileList=(*.csv)
for i in "${fileList[@]}"; do
    awk -v filename="$i" -f newFile.awk "$i"
done
newFile.awk
#!/usr/bin/awk -f

# Return the file name without its extension, e.g. "userdata.csv" -> "userdata".
function fname(file, a, n)
{
    n = split(file, a, ".")
    return a[1]
}

BEGIN {
    FS = ";"
    fn = "done_" filename
    print "Question;Value;User;Filename" > fn
}

{
    if (NR == 1)
    {
        # Header row: remember the user column names.
        for (i = 1; i <= NF; i++)
        {
            headers[i] = $i
        }
    }
    else
    {
        # Data rows: emit one output line per user column.
        for (i = 2; i <= NF; i++)
        {
            print $1 FS $i FS headers[i] FS fname(filename) >> fn
        }
    }
}
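For comparison, the same unpivot also fits in a short Python script; a minimal sketch, assuming semicolon-separated UTF-8 files with no qualifiers as specified above (the glob pattern and the done_ prefix mirror the bash/awk version):

import csv
import glob
import os

# Unpivot every *.csv in the current directory into a "done_" counterpart.
# Assumes semicolon separators, UTF-8, no quoting, header row first.
for path in glob.glob("*.csv"):
    stem = os.path.splitext(os.path.basename(path))[0]
    with open(path, encoding="utf-8", newline="") as src, \
         open("done_" + path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst, delimiter=";")
        headers = [h.strip() for h in next(reader)]  # "Question", then user names
        writer.writerow(["Question", "Value", "User", "Filename"])
        for row in reader:
            for user, value in zip(headers[1:], row[1:]):
                writer.writerow([row[0].strip(), value.strip(), user, stem])

Note that, like the shell version, rerunning this would pick up the done_ files as well, so in practice you would move processed files aside first.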

How to remove duplicates in a CSV file?

I have a large file with a bunch of movie data, including a unique ID for each movie. Although every ID on each line is unique, some lines include duplicate movie data.
For example:
ID,movie_title,year
1,toy story,1995
2,jumanji,1995
[...]
6676,toy story,1995
6677,jumanji,1995
In this case, I'd like to completely remove the 6676,toy story,1995 and 6677,jumanji,1995 lines. This occurs with more than just one movie, so I can't do a simple find-and-replace. I've tried Sublime Text's Edit > Permute Lines > Unique feature and it works fine, but I end up losing the first column of the data (the unique IDs).
Can anyone recommend a better way to get rid of these duplicate lines?
The following Perl script does the trick. Effectively, all occurrences of a movie but the first will be deleted from the list of entries. Do not forget to add the file paths. Execute with 'perl <scriptfile>' from the command line (macOS ships with Perl):
use IO::File;

my (
    $curline
    , $fh_in
    , $fh_out
    , $dict
    , @fields
    , $key
    , $value
);

$fh_in  = new IO::File("<...");  # add input file name
$fh_out = new IO::File(">...");  # add output file name

while (<$fh_in>) {
    chomp;
    $curline = $_;
    @fields  = split ( /,/, $curline );
    ($key, $value) = (join(',', @fields[1..$#fields]), $fields[0]);
    if (!exists($$dict{$key})) {
        $$dict{$key} = 1;
        $fh_out->print("$curline\n");
    }
}
$fh_out->close();
exit(0);
Explanation
The code processes the input line by line
It maintains a hash of movie identifiers seen.
Movie identifiers are defined as the line content without the ID number and the immediately following comma.
A line is printed iff the movie identifier has not yet been seen.
Caveat
Evidently, this solution is not robust against spelling errors.
A certain degree of error tolerance can be added by normalizing keys. Example (case-insensitive matching):
my $key_norm; # move that out of the loop in production code
$key_norm = lc($key);
if (!exists($$dict{$key_norm})) {
    $$dict{$key_norm} = 1;
    $fh_out->print("$curline\n");
}
Neither elegance nor performance had a say in authoring this code ;-)
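For anyone who prefers Python over Perl, here is a minimal sketch of the same first-occurrence-wins idea, with the case-insensitive normalization from the caveat built in (input.csv and output.csv are placeholder names):

# Keep the first occurrence of each movie, keyed on everything after the ID.
# input.csv / output.csv are placeholder names; adjust to your file paths.
seen = set()
with open("input.csv") as src, open("output.csv", "w") as dst:
    for line in src:
        key = line.rstrip("\n").split(",", 1)[1].lower()  # drop the ID column
        if key not in seen:
            seen.add(key)
            dst.write(line)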

How do I write a function that takes the average of a list of numbers

I want to avoid importing different modules, as that is mostly what I have found while looking online. I am stuck with this bit of code and I don't really know how to fix it or improve on it. Here's what I've got so far.
def avg(lst):
    '''lst is a list that contains lists of numbers; the
    function prints, one per line, the average of each list'''
    for i[0:-1] in lst:
        return (sum(i[0:-1]))//len(i)
Again, I'm quite new and this for-loop jargon is quite confusing to me, so could someone help me get it so that the output of, say, a list of grades would be separate lines containing the averages? So if for lst I inserted grades = [[95,92,86,87], [66,54], [89,72,100], [33,0,0]], it would print 4 lines that each had the average of one of those sublists. I am also to assume in the function that the sublists could hold any number of grades, but I can assume that the lists have non-zero length.
Edit1: @jramirez, could you explain what that is doing differently than mine, possibly? I don't doubt that it is better or that it will work, but I still don't really understand how to recreate this myself... regardless, thank you.
I think this is what you want:
def grade_average(grades):
    for grade in grades:
        avg = 0
        for num in grade:
            avg += num
        avg = avg / len(grade)
        print("Average for " + str(grade) + " is = " + str(avg))

if __name__ == '__main__':
    grades = [[95,92,86,87],[66,54],[89,72,100],[33,0,0]]
    grade_average(grades)
Result:
Average for [95, 92, 86, 87] is = 90.0
Average for [66, 54] is = 60.0
Average for [89, 72, 100] is = 87.0
Average for [33, 0, 0] is = 11.0
Problems with your code: the extraneous indexing of i; the use of // to truncate the average (use round if you want to round it); and the use of return in the loop, so it stops after the first average. Your docstring says 'print' but you return instead. This is actually a good thing: functions should not print the result they calculate, as that makes the answer inaccessible to further calculation. Here is how I would write this, as a generator function.
def averages(gradelists):
    '''Yield average for each gradelist.'''
    for glist in gradelists:
        yield sum(glist) / len(glist)

print(list(averages(
    [[95,92,86,87], [66,54], [89,72,100], [33,0,0]])))
[90.0, 60.0, 87.0, 11.0]
To return a list, change the body of the function to (beginner version)
ret = []
for glist in gradelists:
    ret.append(sum(glist) / len(glist))
return ret
or (more advanced, using list comprehension)
return [sum(glist) / len(glist) for glist in gradelists]
However, I really recommend learning about iterators, generators, and generator functions (defined with yield).

Dynamic Naming of MATLAB Function

I currently have a MATLAB function that looks like this:
function outfile = multi_read(modelfrom, modelto, type)
models = [modelfrom:1:modelto];
num_models = length(models);
model_path = '../MODELS/GRADIENT/';
for id = 1:num_models
    fn = [model_path num2str(models(id)) '/']; % Location of file to be read
    outfile = model_read(fn, type);            % model_read is a separate function
end
end
The idea of this function is to execute another function, model_read, for a series of files, and output the results to the workspace (not to disk). Note that the output from model_read is a structure! I want the function to save each file to the workspace using sequential names, similar to typing:
file1=multi_read(1,1,x)
file2=multi_read(2,2,x)
file3=multi_read(3,3,x)
etc.
which would give file1, file2 and file3 in the workspace, but instead by calling the command only once, something like:
multi_read(1,3,x)
which would give the same workspace output.
Essentially my question is: how do I get a function to output variables with multiple names without having to call the function multiple times?
As suggested in the comments, I would try this approach, which is more robust, at least IMHO:
N = tot_num_of_your_files; % whatever it is
file = cellfun(@(i) multi_read(i,i,x), mat2cell(1:N,1,ones(1,N)), ...
    'UniformOutput', false); % (x needs to be defined)
You will recover objects by doing file{i}.
Here is code to do what you ask:
for i = 1:3
    istr = num2str(i)
    line = ['file' istr '= multi_read(' istr ', ' istr ', x)']
    eval(line)
end
Alternatively, here is code to do what you should want:
for i = 1:3
    file{i} = multi_read(i,i,x)
end