Best way to clean and store files - csv

Our employees are doing work for clients, and the clients send us files which contain information that we turn into performance metrics (we do not have direct access to this information - it needs to be sent from the clients). These files are normally .csv or .xlsx so typically I read them with pandas and output a much cleaner, smaller file.
1) Some files contain call drivers or other categorical information which repeats constantly (for example, Issue Driver 1 with like 20 possibilities and Issue Driver 2 with 100 possibilities) - these files are about 100+ million records per year so they become pretty large if I consolidate them. Is it better to create a dictionary and map each driver out to an integer? I read a bit about the category dtype in pandas - does this make output file sizes smaller too or just in-memory?
2) I store the output as .csv, which means that I lose the dtypes if I ever read the file again. How do I maintain dtypes, and should I perhaps save the files to SQLite instead of massive .csv files? My issue now is that I literally write code to break the files up into separate .csv files per month and then maintain a massive file which I use for analysis (normally I dump it into Tableau). If I need to make changes to the monthly files, I have to rewrite them all, which is slow on my laptop's non-SSD hard drive.
3) I normally only need to share data with one or two people. Most analysis requests are ad hoc but involve one to three years' worth of very granular data (individual surveys or interactions, each represented by a single row in separate files). In other words, I do not need a system with high read-write concurrency. I just want something fast, efficient, and consolidated.
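To make the integer-mapping idea from (1) and the SQLite idea from (2) concrete, here is a minimal sketch using only Python's standard sqlite3 module. The table names (driver, interaction), the column names, and the sample rows are made up for illustration. As far as I know, the pandas category dtype only shrinks the in-memory representation, and writing to .csv writes the labels back out as text, so a lookup table like this is one way to keep the saving on disk while also keeping column types.

```python
import sqlite3

def load_rows(conn, rows):
    """Store (month, driver) rows, replacing each driver string with an integer id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS driver (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS interaction "
        "(month TEXT, driver_id INTEGER REFERENCES driver(id))"
    )
    with conn:  # one transaction for the whole batch
        for month, driver in rows:
            # Each distinct driver label is stored once in the lookup table.
            conn.execute("INSERT OR IGNORE INTO driver (name) VALUES (?)", (driver,))
            driver_id = conn.execute(
                "SELECT id FROM driver WHERE name = ?", (driver,)
            ).fetchone()[0]
            conn.execute("INSERT INTO interaction VALUES (?, ?)", (month, driver_id))

# Hypothetical sample data; a real load would read the client files instead.
conn = sqlite3.connect(":memory:")
load_rows(conn, [("2016-01", "Billing"), ("2016-01", "Login"), ("2016-02", "Billing")])

# Joining back to the lookup table recovers the readable labels for analysis.
counts = conn.execute(
    "SELECT d.name, COUNT(*) FROM interaction i "
    "JOIN driver d ON d.id = i.driver_id GROUP BY d.name ORDER BY d.name"
).fetchall()
```

With a file path instead of ":memory:", a correction to one month becomes an UPDATE or DELETE on the affected rows rather than a rewrite of every monthly file.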

Related

Optimization MySQL vs Many Flat Files and HDD Utilization

My end goal is to run a machine learning algorithm for text mining: research code that is thus far unproven and unpublished. The text has already been obtained; it was scraped from WARC files from the Common Crawl. I'm in the process of preparing the data for machine learning, and one of the desirable analysis tasks is IDF (Inverse Document Frequency) analysis of the corpus before launching into the ML application proper.
It's my understanding that for IDF to work, each file should represent one speaker or one idea: generally a short paragraph of ASCII text not much longer than a tweet. The challenge is that I've scraped some 15 million files. I'm using Strawberry Perl on Windows 7 to read each file and split on the tag contained in the document, such that each comment from the social media site in question falls into an element of an array (and in a more strongly typed language would be of type string).
From here I'm experiencing performance issues. I've let my script run all day and it has only made it through 400,000 input files in a 24-hour period. From those input files it has spawned about 2 million output files, one file per speaker, containing text stripped of HTML with Perl's HTML::Strip module. As I look at my system, I see that disk utilization on my local data drive is very high: there's a tremendous number of ASCII text writes, each much smaller than 1 KB and each being crammed into a 1 KB sector of my local NTFS-formatted HDD.
Is it a worthwhile endeavor to stop the run, set up a MySQL database on my home system, set up a text field in the database that is perhaps 500-1000 characters in maximum length, then rerun the Perl script such that it slurps an input HTML file, splits it, HTML-strips it, then prepares and executes a string insert against a database table?
In general, will switching from an output format that is a tremendous number of individual text files to one that is a tremendous number of database inserts be easier on my hard drive and faster to write out in the long run, due to some caching or RAM/disk-space utilization magic in the DBMS?
A file system can be interpreted as a hierarchical key-value store, and it is frequently used as such by Unix-ish programs. However, creating files can be somewhat expensive, depending also on the OS and file system you are using. In particular, different file systems differ significantly by how access times scale with the number of files within one directory. E.g. see NTFS performance and large volumes of files and directories and How do you deal with lots of small files?: “NTFS performance severely degrades after 10,000 files in a directory.”
You may therefore see significant benefits by moving from a pseudo-database using millions of small files to a “real” database such as SQLite that stores the data in a single file, thus making access to individual records cheaper.
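As a rough illustration of the single-file approach, here is a minimal sketch in Python (the question's script is Perl, so treat this as a stand-in, not a drop-in). The table name comment, its columns, the batch size, and the sample data are all hypothetical. The point it tries to show is that many small records go into one database file, with inserts grouped into batches so that each batch is one transaction instead of one file creation per record.

```python
import sqlite3

def store_comments(conn, comments, batch_size=1000):
    """Insert (source, body) pairs in batches, one transaction per batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comment "
        "(id INTEGER PRIMARY KEY, source TEXT, body TEXT)"
    )
    for start in range(0, len(comments), batch_size):
        batch = comments[start:start + batch_size]
        with conn:  # commits once per batch, not once per row
            conn.executemany(
                "INSERT INTO comment (source, body) VALUES (?, ?)", batch
            )

# Hypothetical data standing in for the split, HTML-stripped comments.
conn = sqlite3.connect(":memory:")
comments = [(f"page{i // 5}.html", f"comment text {i}") for i in range(2500)]
store_comments(conn, comments)
total = conn.execute("SELECT COUNT(*) FROM comment").fetchone()[0]
```

Whether this is actually faster for a given workload is something to measure rather than assume.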
On the other hand, 2 million records are not that much, suggesting that file system overhead might not be the limiting factor for you. Consider running your software with a test workload and use a profiler or other debugging tools to see where the time is spent. Is it really the open() that takes so much time? Or is there other expensive processing that could be optimized? If there is a pre-processing step that can be parallelized, that alone may slash the processing time quite noticeably.
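If a full profiler is not at hand, even crude per-stage timers can show where the time goes. The sketch below is in Python for illustration only (for the Perl script itself, a Perl profiler such as Devel::NYTProf would be the natural tool). The stage names and the two stand-in functions split_comments and strip_tags are invented for this example and are far simpler than real HTML handling.

```python
import time
from collections import defaultdict

timings = defaultdict(float)  # stage name -> accumulated seconds

def timed(stage, func, *args):
    """Run func(*args) and add its wall-clock time to the given stage."""
    start = time.perf_counter()
    result = func(*args)
    timings[stage] += time.perf_counter() - start
    return result

def split_comments(page):
    # Stand-in for splitting a page on the tag that separates comments.
    return page.split("<hr>")

def strip_tags(text):
    # Stand-in for real HTML stripping; only removes <p> tags.
    return text.replace("<p>", "").replace("</p>", "")

pages = ["<p>first</p><hr><p>second</p>"] * 1000
outputs = []
for page in pages:
    for raw in timed("split", split_comments, page):
        outputs.append(timed("strip", strip_tags, raw))

# The stage with the largest accumulated time is the first candidate to optimize.
slowest = max(timings, key=timings.get)
```

Wrapping the file-writing step the same way would show directly whether open() and the writes dominate, or whether the parsing does.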
Wow!
A few years ago we had massive performance problems with a popular CMS. In its plain form it mostly performed well, but performance dropped once sidepass inlines came into play as well.
So I wrote some ugly code to find the fastest approach. Note that the available resources set different limits!
1st) I used the time (timecode) to establish a directly addressable location. Each user has their own set of flat files.
2nd) I created a RAM disk. Be sure that you have enough memory for your project!
3rd) For backup I used rsync, and for redundancy I compressed to and extracted from the RAM disk as a tar.gz.
In practice this is the fastest way. Converting the timecode and generating recursive folder structures is very simple. So are read, write, replace, and delete.
The final release resulted in these processing times:
PHP/MySQL > 5 sec
Perl/HDD ~ 1.2 sec
Perl/RamDisk ~ 0.001 sec
Seeing what you are doing there, this construct may be usable for you. I do not know the internals of your project.
The hard disk will live much longer, and your workflow can be optimized through direct addressing. The data is accessible from other stages; that is, you can work on the same base from other scripts too, whether data processing in R, a notifier from the shell, or anything else...
Buffering workarounds like those needed with MySQL are no longer required. Your CPU no longer spins on no-ops.

Should I use files or a database?

I'm building a cloud sync application which syncs a user's data across multiple devices. I am at a crossroads, deciding whether to store the data on the server as files or in a relational database. I am using Amazon Web Services and will use S3 for user files, or their database service if I choose to store the data in a table instead. The data I'm storing is the state of the application every ten seconds. Storing this in a database could be problematic because the average number of rows per user would be 100,000, and with my current user base of 20,000 people that's 2 billion rows right off the bat. Would I be better off storing that information in files? That would be about 100 files totaling 6 megabytes per user.
As discussed in the comments, I would store these as files.
S3 is perfectly suited to be a key/value store and if you're able to diff the changes and ensure that you aren't unnecessarily duplicating loads of data, the sync will be far easier to do by downloading the relevant files from S3 and syncing them client side.
You get a big cost saving of not having to operate a database server that can store tonnes of rows and stay up to provide them to the clients quickly.
My only real concern would be that the data in these files can be difficult to parse if you wanted to aggregate stats/data/info across multiple users as a backend or administrative view. You wouldn't be able to write simple SQL queries to sum up values etc, and would have to open the relevant files, process them with something like awk or regular expressions etc, and then compute the values that way.
You're likely doing that on the client side anyway for the specific files that relate to that user, though, so there's probably some overlap there!

Storing HTML files

We have about 60 million webpages in a compressed format. We would like to de-compress and work with these files individually.
Here are my questions!
First, if I decompress them into the file system, would the FS cope with such a number of files? My file system is ext4. (I have 4 different file systems, so I can divide the data between them, e.g. 15 million pages for each file system.)
Secondly, would storing these files in a relational database be a better option, assuming that all the hassle of cleaning the HTML text is done before inserting them into the database?
Thanks,
If you extract them into a single directory, you may exceed the maximum number of index entries allowed in that directory. If you extract them into multiple directories, you will fare better.
60 million is definitely a fair amount. If you plan on doing any indexing or searching on them, a database would be your best option, although you can also index files using something like Lucene. It all depends on what you want to do with the files after they have been extracted.
I currently have a similar issue with images on a large user site. I got around it by giving each image a GUID and using successive bytes of the GUID as directory levels: the first byte selects a directory, the next byte a subdirectory beneath it, and so on (down to 8 bytes). If my fill ratio goes up, I'll create more subdirectory levels to compensate. It also means I can spread the data across different network storage boxes.
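The GUID scheme can be sketched roughly as follows. This is my own minimal illustration in Python, not the actual code from that site; the function name shard_path, the root directory, and the default depth of two levels are assumptions.

```python
import uuid
from pathlib import PurePosixPath

def shard_path(root, guid, depth=2):
    """Build a nested path using one directory level per leading GUID byte."""
    hex_digits = guid.hex  # 32 hex characters, two per byte
    levels = [hex_digits[2 * i:2 * i + 2] for i in range(depth)]
    return PurePosixPath(root, *levels, hex_digits)

# A fixed GUID keeps the example reproducible; uuid.uuid4() would be used in practice.
guid = uuid.UUID("12345678-9abc-def0-1234-56789abcdef0")
path = shard_path("/data/pages", guid)
```

Each level holds at most 256 subdirectories, so two levels give 65,536 leaf directories. Spread evenly, 60 million files would come to roughly 900 per directory, and raising depth adds further levels if that fill ratio grows.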

Database for sequential data

I'm completely new to databases, so pardon the simplicity of the question. We have an embedded Linux system that needs to store data collected over a time span of several hours. The data will need to be searchable sequentially and includes things like GPS and environmental data. This data will need to be saved in a folder on a removable SSD and labeled as a "Mission". Several "Missions" can exist on a single SSD and should not be mixed together, because they need to be copied and saved individually to external media at the user's discretion. Data will be saved as often as 10 times a second, and storage needs to be very robust because of the potential for power outages.
The data will need to be searchable on the system it is created on, but after the removable disk is taken to another system (also Linux) it needs to be loaded and used there too. In the past we have used custom files to store the data, but it seems like a database might be the best option. How portable are databases like MySQL? Can a user easily remove a disk with a database on it and plug it into a new machine to use it without too much effort? Our queries will mostly be time based, because the user will be "playing" through the data after it is collected, at perhaps 10x the collection rate. Also, our code base is written in Qt (C++), so we would need to interact with the database that way.
I'd go with SQLite. It's small and light. It stores all its data in one file. You can copy or move the file to another computer and read it there. Your data writer can simply create a new, empty file when it detects that today's SSD does not already have one.
It's also worth mentioning that SQLite undergoes testing at a level afforded to only a select few pieces of safety-critical software. The test suite, while partly autogenerated, is of a staggering size, on the order of 100 million lines of code. It is not light at all when it comes to robustness. I would trust SQLite more than a random self-made database implementation.
SQLite is used in certified avionics AFAIK.
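A minimal sketch of the one-file-per-mission idea, written in Python for brevity (the question's code base is Qt/C++, where Qt's SQL module with its SQLite driver would play the same role). The table name sample, its columns, the integer-millisecond timestamps, and the sample values are all assumptions for illustration.

```python
import sqlite3

def open_mission(path):
    """Open (or create) the database file for one mission."""
    conn = sqlite3.connect(path)
    # t_ms is the sample time in milliseconds; as the primary key it is indexed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sample "
        "(t_ms INTEGER PRIMARY KEY, lat REAL, lon REAL, temperature REAL)"
    )
    return conn

def record(conn, t_ms, lat, lon, temperature):
    with conn:  # each sample is committed in its own transaction
        conn.execute(
            "INSERT INTO sample VALUES (?, ?, ?, ?)", (t_ms, lat, lon, temperature)
        )

def playback(conn, start_ms, end_ms):
    """Return samples with start_ms <= t_ms < end_ms in time order."""
    return conn.execute(
        "SELECT t_ms, lat, lon, temperature FROM sample "
        "WHERE t_ms >= ? AND t_ms < ? ORDER BY t_ms",
        (start_ms, end_ms),
    ).fetchall()

# ":memory:" keeps the example self-contained; a real mission would use a file path.
conn = open_mission(":memory:")
for i in range(50):  # 5 seconds of hypothetical data at 10 samples per second
    record(conn, i * 100, 48.0 + i * 1e-4, 11.0, 20.5)
window = playback(conn, 1000, 2000)
```

Committing per sample is meant to limit what an interrupted write can lose, at the cost of more disk syncs. Whether 10 commits per second is comfortable on the target hardware would need testing.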

Mysql Database Question about Large Columns

I have a table that has 100,000 rows, and soon that will double. The size of the database is currently 5 GB, and most of it goes to one particular column, which is a text column for PDF files. We expect to have a 20-30 GB or maybe 50 GB database after a couple of months, and this system will be used frequently.
I have a couple of questions regarding this setup.
1-) We are using InnoDB on every table, including the users table etc. Is it better to use MyISAM on the table where we store the text version of the PDF files (from a memory usage/performance perspective)?
2-) We use Sphinx for searching; however, the data must be retrieved for highlighting. Highlighting is done via the Sphinx API, but we still need to retrieve 10 rows in order to send them to Sphinx again. These 10 rows may take up 50 MB of memory, which is quite large. So I am planning to split these PDF files into chunks of 5 pages in the database. The current 100,000 rows would become around 3-4 million rows, and a couple of months later, instead of having 300,000-350,000 rows, we would have 10 million rows storing the text version of these PDF files. However, we would retrieve fewer pages: instead of retrieving 400 pages to send to Sphinx for highlighting, we could retrieve 5 pages, which would have a big impact on performance. Currently, when we search for a term and retrieve PDF files that have more than 100 pages, the execution time is 0.3-0.35 seconds; if we retrieve PDF files that have fewer than 5 pages, the execution time drops to 0.06 seconds, and it also uses less memory.
Do you think this is a good trade-off? We will have millions of rows instead of 100k-200k rows, but it will save memory and improve performance. Is this a good approach to the problem, and do you have any other ideas on how to overcome it?
The text version of the data is used only for indexing and highlighting. So, we are very flexible.
Edit: We store the PDF files on our cloud; however, for search highlighting, we need to retrieve the text version of the PDF file and give it to Sphinx, which then returns the highlighted 256-character text. To index PDF files we need to insert them into the database, because they also have additional metadata, like description, tags, and title, and we need to link these for the search engine. If we index txt files or PDF files from the file server, it is not possible to get the other data from the DB and link it to those txt files in the search engine. So we still store the PDF files on our cloud, but the text version must be in our DB as well in order to index their tags, title, and description. They are in different tables, but the text must be in the database too.
Thanks,
It sounds like you don't really need to retrieve your entire PDF file every time you hit on a row for that PDF file.
Are you separating the metadata about your PDF files from the files themselves? You definitely shouldn't have just one table here. You might want something like a table pdf_info with 100 columns (do you really have that much metadata? why 100 columns?) and a foreign key to the pdf_files table containing the actual text for the files. Then you can experiment with, maybe, making the info table InnoDB and the files table MyISAM.
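A rough sketch of that split, combined with the 5-page chunking from the question. It uses Python's built-in sqlite3 purely as a runnable stand-in for MySQL, so storage engines are not represented. The columns, the chunk_pages helper, and the sample pages are assumptions. Because one PDF has many chunks, the foreign key here sits on pdf_files and points back to pdf_info.

```python
import sqlite3

def chunk_pages(pages, size=5):
    """Join consecutive pages into chunks of at most `size` pages each."""
    return ["\n".join(pages[i:i + size]) for i in range(0, len(pages), size)]

conn = sqlite3.connect(":memory:")
# Small metadata table: one row per PDF.
conn.execute("CREATE TABLE pdf_info (id INTEGER PRIMARY KEY, title TEXT, tags TEXT)")
# Large text table: one row per 5-page chunk, linked back to pdf_info.
conn.execute(
    "CREATE TABLE pdf_files ("
    "pdf_id INTEGER REFERENCES pdf_info(id), chunk_no INTEGER, body TEXT, "
    "PRIMARY KEY (pdf_id, chunk_no))"
)

pages = [f"text of page {n}" for n in range(1, 13)]  # hypothetical 12-page PDF
with conn:
    pdf_id = conn.execute(
        "INSERT INTO pdf_info (title, tags) VALUES (?, ?)", ("Manual", "demo")
    ).lastrowid
    conn.executemany(
        "INSERT INTO pdf_files VALUES (?, ?, ?)",
        [(pdf_id, n, body) for n, body in enumerate(chunk_pages(pages))],
    )

# Highlighting only needs the chunk that matched, not the whole document.
hit = conn.execute(
    "SELECT body FROM pdf_files WHERE pdf_id = ? AND chunk_no = ?", (pdf_id, 1)
).fetchone()[0]
```

Here the lookup for chunk 1 returns pages 6 to 10 only, which is the memory saving the question is after.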
IMHO: there are many, many reasons NOT to store your PDF files in the MySQL database. I would just store the file paths to a SAN or some other file distribution mechanism. SQL is good for storing any abstract data, and files are certainly in that category. But file systems are specifically designed to store files, and web servers are specifically designed to deliver those files to you as quickly as possible. So... just something to think about.
Use Solr; it can index text files together with their metadata from a database. I have switched the search engine to Solr.
That sounds like a really bad technology choice. If you can slow the growth so that everything fits in memory (affordable up to 128 GB or so), or partition for a larger size, you can basically be limited only by network transfer.
[edit]
If the PDFs are on disk and not in RAM, your disk needs to be accessed. If you don't have an SSD, you can do that about 50 times per second per disk. As long as a PDF is smaller than a disk track, splitting is not very interesting. If you split the PDFs and then need access to all the parts, you might need to load from several tracks, slowing you down a lot.
Handling large documents with an RDBMS in a multi-user setup is not a good idea, performance-wise.