Random records extracted from a large CSV file

I have 50 CSV files, with up to 2 million records in each.
Every day I need to pull 10,000 random records from each of the 50 files and build a new CSV file with all the info (10,000 × 50 rows).
I cannot do it manually, because it would take me a lot of time. I've also tried to use Access, but because the database is larger than 2 GB, I cannot use it.
I've also tried CSVed - a good tool - but it still did not help me.
Could someone please suggest an idea or a tool to get random records from the files and write them into a new CSV file?

There are many languages you could use; I would use C# and do this:
1) Get the number of lines in the file.
Lines in text file
2) Generate 10,000 random numbers (unique if you need that), with the maximum being the count from step 1.
Random without duplicates
3) Pull the records chosen in step 2 from the file and write them to the new file (see the sketch below).
4) Repeat for each file.
Other options, if you want to consider a database other than Access, are MySQL or SQL Server Express, to name a couple.
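The answer above suggests C#; here is a minimal sketch of the same four steps in Java, since the exact language doesn't matter. The file names, the combined_sample.csv output name, and the assumptions that the first line of each file is a header to skip and that every file holds more than 10,000 data rows are illustrative, not taken from the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class RandomCsvSampler {

    // Step 1: count the lines in one file.
    static long countLines(Path file) throws IOException {
        try (var lines = Files.lines(file)) {
            return lines.count();
        }
    }

    // Step 2: pick `sampleSize` unique random line indices in [1, lineCount),
    // treating line 0 as a header row that should not be sampled.
    static Set<Long> pickIndices(long lineCount, int sampleSize) {
        Set<Long> indices = new HashSet<>();
        while (indices.size() < sampleSize) {
            indices.add(ThreadLocalRandom.current().nextLong(1, lineCount));
        }
        return indices;
    }

    // Steps 3 and 4: stream each input file once more and copy the chosen lines
    // into one combined output file.
    public static void main(String[] args) throws IOException {
        int sampleSize = 10_000;
        try (BufferedWriter out = Files.newBufferedWriter(Path.of("combined_sample.csv"))) {
            for (String name : args) {                      // pass the 50 file names as arguments
                Path file = Path.of(name);
                Set<Long> wanted = pickIndices(countLines(file), sampleSize);
                try (BufferedReader in = Files.newBufferedReader(file)) {
                    String line;
                    long lineNo = 0;
                    while ((line = in.readLine()) != null) {
                        if (wanted.contains(lineNo)) {
                            out.write(line);
                            out.newLine();
                        }
                        lineNo++;
                    }
                }
            }
        }
    }
}
```

Run it with the 50 file names as arguments, e.g. java RandomCsvSampler file01.csv ... file50.csv. Each file is read twice (once to count, once to copy), which keeps memory use flat regardless of file size.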

Related

How to handle 25k .csv files and map them to a database?

I have a situation that requires mapping over 25k unique csv files, into a MySQL database.
The problem is that each csv file will have unique column headings, and thus requires mapping the columns to correct tables/columns in the MySQL database.
So for example, we may find that column 2 in one csv file is the Country, but in another csv file it is column 6. Or we might find that Country just doesn't exist in a particular csv file. You can see why doing manual mapping of columns for 25k files is not practical.
The reason it needs mapping is because we want the ability to perform a search across all the files, based on a predefined structure.
For example, we will want to find travel companies in the UK that have more than 20 employees.
We need the ability to perform this query across all data in the files to obtain the right results. The database structure has already been defined, and we had been parsing csv files into it fine; only now have we realised there will be a huge number of csv files to do this mapping for.
Are we better off with a NoSQL solution? Would you recommend something like Neo4j? Is there a better solution for mapping unique csv files to structured MySQL schema?
Edit:
I am now planning for us to first parse the first row of every file, and store those headings in a new table associated with the file through a many-to-many relationship.
From this we will allow the user to define which table column matches which file column (since I suspect a large number of columns will be practically the same).
We will then save this mapping data against the file and have an automated process perform the insertion based on this mapping.
This should hopefully reduce the workload from setting the mapping on each file individually to mapping associated columns across all files. It will still be a heavy task, though, I imagine.
Thanks
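To illustrate the header-harvesting step described in the edit, here is a rough Java/JDBC sketch. The schema (files, columns with a unique index on heading, and a file_columns link table), the connection URL and credentials, and the naive comma split are all assumptions made up for the example, not part of the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.stream.Stream;

public class HeaderHarvester {
    public static void main(String[] args) throws Exception {
        // Assumed schema: files(id, name), columns(id, heading UNIQUE),
        // file_columns(file_id, column_id, position) as the many-to-many link.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/csv_mapping", "user", "password");
             Stream<Path> csvs = Files.list(Path.of(args[0]));   // directory holding the csv files
             PreparedStatement insFile = conn.prepareStatement(
                "INSERT INTO files (name) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
             PreparedStatement insCol = conn.prepareStatement(
                "INSERT IGNORE INTO columns (heading) VALUES (?)");
             PreparedStatement insLink = conn.prepareStatement(
                "INSERT INTO file_columns (file_id, column_id, position) " +
                "SELECT ?, id, ? FROM columns WHERE heading = ?")) {

            for (Path csv : (Iterable<Path>) csvs::iterator) {
                // Register the file and fetch its generated id.
                insFile.setString(1, csv.getFileName().toString());
                insFile.executeUpdate();
                try (ResultSet keys = insFile.getGeneratedKeys()) {
                    keys.next();
                    long fileId = keys.getLong(1);

                    // Read only the first row: the column headings.
                    try (BufferedReader in = Files.newBufferedReader(csv)) {
                        String header = in.readLine();
                        if (header == null) continue;
                        String[] headings = header.split(",");   // naive split; ignores quoted commas
                        for (int pos = 0; pos < headings.length; pos++) {
                            String h = headings[pos].trim();
                            insCol.setString(1, h);              // dedup via the unique index
                            insCol.executeUpdate();
                            insLink.setLong(1, fileId);
                            insLink.setInt(2, pos);
                            insLink.setString(3, h);
                            insLink.executeUpdate();
                        }
                    }
                }
            }
        }
    }
}
```

It needs the MySQL Connector/J driver on the classpath, and the INSERT IGNORE dedup relies on the assumed unique index on columns.heading; the user-facing mapping UI would then work off the resulting columns and file_columns tables.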

Manipulating a 100 GB Table

I have a dataset in TSV format (tab-separated values): one big text file of around 100 GB (somewhere around 255 million rows). I have to filter and extract the relevant rows so I can work on them easily. So far I know that Excel can't handle that many rows, and familiar text editors either can't open the file or are very painful to work with for tables. I've tried LogParser: a 36-minute query gave me a CSV output, but unfortunately the number of exported rows is well below what I guess is present in the data. I also get some parsing errors, and some columns in the exported sets are shifted. Do you have any other alternatives? Could I perhaps turn the data into an SQL database somehow? Is that possible?
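For what it's worth, one common approach is to stream the file line by line and keep only the rows that match a filter, without ever holding the whole table in memory (you can also bulk-load the file into a database such as MySQL with LOAD DATA INFILE and filter with SQL). A minimal Java sketch of the streaming filter, with the file names, column index and matched value as made-up placeholders:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TsvFilter {
    public static void main(String[] args) throws IOException {
        Path input = Path.of("big_table.tsv");      // placeholder input path
        Path output = Path.of("filtered_rows.tsv"); // placeholder output path
        int column = 3;                             // hypothetical column to test
        String wanted = "UK";                       // hypothetical value to keep

        // Stream the 100 GB file one line at a time; memory use stays flat.
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
                if (fields.length > column && wanted.equals(fields[column])) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

Because only one line is held in memory at a time, the size of the file is not a limit; only disk throughput is.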

One Access table much bigger than the original files

I have 12 Excel files which are around 400 MB in total, but when importing them into Access I had already hit the 2 GB limit by file 9... is that normal?
I have tried to clean and compact, but that only reduced it to 1.8 GB...
I do not have any queries or reports, just one big plain table (2.2 million records x 30 columns so far).
If that is normal, do you think the options below would work as a solution?
1) Would linking to the Excel files instead of importing them reduce the file size considerably?
2) I could normalize the table a bit, reducing some data by moving the very repetitive fields into their own tables.
Thanks for any ideas...
(I could try the above by trial and error, but it would take me more than a couple of hours.)
Linking - if possible - is much preferable.
xlsx files are zipped: rename one to .zip, open it, and study the size of the main file in the xl\worksheets folder.

Should I store the audio file itself in a BLOB field or should I store its path?

I need to have a catalog of audio files along with some of their properties, like DateCreated, Title, Format, etc. I'm expecting to have those files in wav and/or gsm format. The maximum size of a file will be about 10 MB. The maximum number of files added daily will be around 600. About 25 percent of the records will be erased every 10 days, and roughly 80 percent of all records will be deleted every month (by my approximation). When I query the records I won't need the physical files returned with them, just their other properties. But then I might want to take a look at a specific file or several files, which is when I will pull the audio files themselves.
Now here's my question: should I store the audio files in a BLOB column along with their properties in the same table, or should I keep them in a separate location - this might be a database or just a file directory - and store the path to it? I'm planning to use MySQL, although I'm not quite sure whether it's appropriate for data of such magnitude. So if it's not a good choice, could you also suggest another DBMS that would be more suitable in this particular case?

MySQL, load data from file into a number of tables

My basic task is to import parts of data from one single file, into several different tables as fast as possible.
I currently have one file per table, and I manage to import each file into the relevant table using the LOAD DATA syntax.
Our product has received new requirements from a client: he is no longer interested in sending us multiple files; instead, he wants to send a single file which contains all the original records, rather than maintaining multiple such files.
I have thought of several options:
I may require the client to write a single row before each batch of lines in the file, describing the table into which the batch should be loaded and the number of lines in that batch.
e.g.
Table2,500
...
Table3,400
Then I could try to apply LOAD DATA to each such block of lines, discarding the table name and line count descriptions. Is that feasible?
I may require each record to contain the table name as an additional attribute; then I would need to iterate over the records and insert them one by one, although I am sure that is much slower than LOAD DATA.
I may also pre-process this file using, for example, Java, and execute the LOAD DATA statements in a loop.
I may require almost any format changes I desire, but it has to be one single file and the import must be fast.
(I should say that what I call the table description is actually the name of a feature; I have decided that all records relevant to a given feature should be saved under a different table name - this is transparent to the client.)
What sounds like the best solution? Is there any other suggestion?
It depends on your data file. We're doing something similar and wrote a small Perl script to read the data file line by line. If the line has the content we need (for example, it starts with table1,), we know that it should go into table 1, so we print that line.
Then you can either save that output to a file or to a named pipe and use that with LOAD DATA.
This will probably perform much better than loading everything into temporary tables and from there into the new tables.
The Perl script (but you can do it in any language) can be very simple; a sketch in Java follows below.
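Not the answerer's actual script, but here is what that filter could look like in Java, assuming the variant where every record starts with its table name followed by a comma; the input file name and the per-table output naming are placeholders:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class SplitByTable {
    public static void main(String[] args) throws IOException {
        Path input = Path.of("combined_export.csv");   // placeholder name for the client's single file
        Map<String, BufferedWriter> writers = new HashMap<>();

        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) continue;                        // skip malformed lines
                String table = line.substring(0, comma);        // e.g. "table1"
                String record = line.substring(comma + 1);      // the rest of the row

                // One output file per table, created lazily.
                BufferedWriter out = writers.computeIfAbsent(table, t -> {
                    try {
                        return Files.newBufferedWriter(Path.of(t + ".csv"));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                out.write(record);
                out.newLine();
            }
        } finally {
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
    }
}
```

You can then run LOAD DATA LOCAL INFILE 'table1.csv' INTO TABLE table1 (and so on) for each generated file, or write to named pipes as the answer suggests.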
You may have another option, which is to define a single staging table, load all your data into that table, and then use select-insert-delete to transfer the data from it into your target tables. Depending on the total number of columns this may or may not be possible. However, if it is possible, you don't need to write an external Java program and can rely entirely on the database to load your data, which can also offer you a cleaner and more optimized way of doing the job. You will most probably need an additional marker column, which can hold the name of the target table. If so, this can be considered a variant of option 2 above.
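Roughly, that select-insert-delete flow could look like the following; the table names, columns and connection details are invented for the example, and the statements are wrapped in JDBC only to keep the snippets in one language (you could just as well run them in the mysql client):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StagingTableLoad {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; requires MySQL Connector/J.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "password");
             Statement st = conn.createStatement()) {

            // 1) Bulk-load the whole client file into one wide staging table that has a
            //    target_table marker column plus a superset of all target columns.
            st.execute("LOAD DATA LOCAL INFILE 'combined_export.csv' " +
                       "INTO TABLE staging " +
                       "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' " +
                       "(target_table, col_a, col_b, col_c)");

            // 2) Select-insert each slice into its real table...
            st.executeUpdate("INSERT INTO table1 (col_a, col_b) " +
                             "SELECT col_a, col_b FROM staging WHERE target_table = 'table1'");
            st.executeUpdate("INSERT INTO table2 (col_a, col_c) " +
                             "SELECT col_a, col_c FROM staging WHERE target_table = 'table2'");

            // 3) ...then delete (or truncate) the transferred rows from the staging table.
            st.executeUpdate("DELETE FROM staging WHERE target_table IN ('table1', 'table2')");
        }
    }
}
```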