We have about 60 million webpages in a compressed format. We would like to decompress these files and work with them individually.
Here are my questions!
First, if I decompress them into the file system, would the FS cope with that number of files? My file system is ext4. (I have 4 different file systems, so I can divide the data between them, e.g. 15 M pages per file system.)
Secondly, would storing these files in a relational database be a better option, assuming that all the hassle of cleaning the HTML text is done before inserting them into the database?
Thanks,
If you extract them into a single directory you may exceed the maximum number of allocated indices in that folder. If you extract them into multiple directories you will fare better.
60 million is definitely a fair amount. If you plan on doing any indexing or searching on them, then a database would be your best option; you can also index plain files using something like Lucene. It all depends on what you want to do with the files after they have been extracted.
I currently have a similar issue with images on a large user site. The way I got around it was to give each image a GUID and, for each byte in the GUID, assign it to a different directory, then the next byte to a subdirectory (down to 8 bytes). If my fill ratio goes up I'll create more subdirectories to compensate. It also means I can spread the files across different network storage boxes.
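The byte-per-directory-level scheme described above can be sketched as follows. This is a minimal illustration, not the poster's actual code; the two-level depth is an assumption (the post goes down to 8 bytes).

```python
import uuid

def shard_path(guid: uuid.UUID, depth: int = 2) -> str:
    """Build a nested directory path from the first `depth` bytes of a GUID.

    Each byte becomes one directory level, so two levels already give
    256 * 256 = 65,536 leaf directories to spread files across.
    """
    parts = ["%02x" % b for b in guid.bytes[:depth]]
    return "/".join(parts) + "/" + guid.hex

g = uuid.UUID("12345678-1234-5678-1234-567812345678")
print(shard_path(g))  # 12/34/12345678123456781234567812345678
```

Adding a level (depth=3) multiplies the bucket count by 256 again, which is how the "create more subdirectories to compensate" step would work.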
Related
Our employees are doing work for clients, and the clients send us files which contain information that we turn into performance metrics (we do not have direct access to this information - it needs to be sent from the clients). These files are normally .csv or .xlsx so typically I read them with pandas and output a much cleaner, smaller file.
1) Some files contain call drivers or other categorical information that repeats constantly (for example, Issue Driver 1 with about 20 possibilities and Issue Driver 2 with 100). These files run to 100+ million records per year, so they become pretty large if I consolidate them. Is it better to create a dictionary and map each driver to an integer? I read a bit about the category dtype in pandas; does this make the output files smaller too, or only the in-memory representation?
2) I store the output as .csv, which means I lose the dtypes if I ever read the file again. How do I maintain dtypes, and should I perhaps save the files to SQLite instead of massive .csv files? My issue now is that I literally write code to break the files up into separate .csvs per month and then maintain a massive file which I use for analysis (I normally dump it into Tableau). If I need to make changes to the monthly files I have to re-write them all, which is slow on my laptop's non-SSD hard drive.
3) I normally only need to share data with one or two people, and most analysis requests are ad hoc but involve one to three years' worth of very granular data (individual surveys or interactions, each represented by a single row in separate files). In other words, I do not need a system with high read-write concurrency. I just want something fast, efficient, and consolidated.
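The integer-mapping idea from question 1 can be sketched in pure Python. This is the same dictionary-encoding scheme that pandas' category dtype uses internally: the distinct strings are stored once and each row holds only a small integer. Note that writing a categorical column back to CSV re-expands the strings, so the dtype saves memory, not CSV file size. The driver names below are made up for illustration.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code.

    Returns the per-row codes plus the lookup table of distinct values,
    so the original column can always be reconstructed.
    """
    codes, table = [], {}
    for v in values:
        if v not in table:
            table[v] = len(table)  # next unused integer code
        codes.append(table[v])
    return codes, list(table)

codes, categories = dictionary_encode(
    ["billing", "outage", "billing", "billing", "outage"]
)
print(codes)       # [0, 1, 0, 0, 1]
print(categories)  # ['billing', 'outage']
```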
I have a website with a lot of small images (none larger than 300 KB). They're all in a single folder; there are around 40,000 of them now, and the number keeps increasing very fast.
When I open a page of my website, each page loads 6 of those images. I can also have 1,000 or more people connected to the website at the same time.
Do you think this way of storing small files will slow down my server's performance? I thought about these possible solutions, but I'm not sure which is best:
- Store the pictures in folders, maybe one folder per day, so we'd have no more than 5,000 pictures in each folder;
- Store pictures in my DB in blob fields.
My server uses CentOS 6.5 as operating system.
Which solution do you think will be the best?
I need a database that can store a large number of BLOBS. The BLOBs would be picture files and would also have a timestamp and a few basic fields (size, metrics, ids of objects in other databases, things like that), but the main purpose of the database is to store the pictures.
We would like to be able to keep the data in the database for a while, on the order of a few months. With data coming in maybe every few minutes, the number of BLOBs stored can grow quite quickly.
For now (the development phase) we will be using MySQL for this. I was wondering if MySQL is a good direction to go, in terms of:
Being able to store binary data efficiently
Scalability
Maintenance requirements.
Thanks,
MySQL is a good database and can handle large data sets. However, there is a great benefit in making your whole database fit into RAM, in which case all database-related activity will be much faster. By putting large, seldom-accessed objects into your database, you make this harder.
So, I think a combined approach is the best:
Save only the metadata in the database, and save the files on disk as-is. It's better to hash the directories if you're talking about 100,000+ files, then save each file under the name of an index field in your database, e.g. with a directory structure like:
00/00001.jpg
00/00002.jpg
00/00003.jpg
....
....
10/10234.jpg
10/10235.jpg
In this case your directories won't have too many files, and accessing the files is fast and easy. Of course, if your database server is distributed/redundant, things get more interesting; any such approach may or may not be warranted, depending on the load, redundancy/failover requirements, etc.
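A minimal sketch of how the 00/00001.jpg layout above can be derived from the database's index field (assuming numeric ids under 100,000, so the zero-padded name has five digits and each directory holds at most 1,000 files):

```python
def file_path(file_id: int, root: str = "images") -> str:
    """Derive a two-level path from a numeric primary key.

    The id is zero-padded to five digits and the first two digits name
    the directory: ids 0-999 land in 00/, 10000-10999 in 10/, etc.
    """
    name = "%05d" % file_id
    return f"{root}/{name[:2]}/{name}.jpg"

print(file_path(1))      # images/00/00001.jpg
print(file_path(10234))  # images/10/10234.jpg
```

The `root` prefix and `.jpg` extension are assumptions for illustration; the point is only that the path is a pure function of the row id, so nothing but the id needs to be stored.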
I suggest storing the images on the hard disk and maintaining each image's metadata, including the filename (maybe), in your MySQL implementation. That way your script can easily pick the file up from the local hard drive.
For reading and storing files, the hard disk and most modern OSes are really good at it. So I believe MySQL is not going to solve anything here.
What is the best practice for storing, in the file system, a large number of files that are referenced in a database?
We're currently moving from a system that stores around 14,000 files (around 6GB of images and documents) in a MySQL database. This is quickly becoming unmanageable.
We currently plan to save the files by their database primary key in the file system. I'm concerned about the possible performance issues of having that many files in the same folder. Also, these files will be inserted by several different applications on the same server.
Specifically I'd like to know:
Is this a good solution given these parameters?
Will it leave room to scale further in the future?
Are there any concerns about storage of many files in the same location?
Is there a better way to name/distribute the files?
I like to name the files as follows:
/* create a per-day directory, e.g. 2024/01/15 */
$dir = date('Y').'/'.date('m').'/'.date('d');
if (!is_dir($dir)) mkdir($dir, 0755, true); // recursive, creates year/month/day
Hash the contents with MD5, then add a suffix (the primary key will suffice) to get the file's new filename. Create 16 folders corresponding to the first hex character of the hash, and 16 folders under each of those for the second character. Store the image in the path given by the first two hex characters of the hash, then record the hash in the appropriate row in the database.
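A minimal sketch of the scheme above, with a hypothetical primary key as the suffix. Two hex characters give 16 x 16 = 256 buckets:

```python
import hashlib

def hashed_path(content: bytes, pk: int) -> str:
    """Shard by the first two hex characters of the content's MD5.

    The first character picks one of 16 top-level folders, the second
    one of 16 subfolders; the primary-key suffix keeps names unique
    even when two files have identical content.
    """
    digest = hashlib.md5(content).hexdigest()
    return f"{digest[0]}/{digest[1]}/{digest}_{pk}"

path = hashed_path(b"raw image bytes go here", 42)
```

A nice property of hashing the contents rather than the name is that the files spread uniformly over the 256 buckets regardless of how ids or filenames are distributed.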
I have a table with 100,000 rows, and soon it will double. The database is currently 5 GB, and most of that goes to one particular column: a text column holding the text versions of PDF files. We expect a 20-30 GB, maybe 50 GB, database after a couple of months, and this system will be used frequently.
I have couple of questions regarding with this setup
1) We are using InnoDB on every table, including the users table etc. Is it better to use MyISAM on this table, where we store the text versions of the PDF files (from a memory-usage/performance perspective)?
2) We use Sphinx for searching, but the data must be retrieved for highlighting. Highlighting is done via the Sphinx API, but we still need to retrieve 10 rows in order to send them to Sphinx again. These 10 rows may take 50 MB of memory, which is quite large. So I am planning to split these PDF files into chunks of 5 pages in the database: the 100,000 rows will become around 3-4 million rows, and a couple of months later, instead of 300,000-350,000 rows, we'll have 10 million rows storing the text versions of these PDF files. However, we will retrieve far fewer pages per hit: instead of retrieving 400 pages to send to Sphinx for highlighting, we can retrieve 5, which has a big impact on performance. Currently, when we search for a term and retrieve PDF files with more than 100 pages, the execution time is 0.3-0.35 seconds; if we retrieve PDF files with fewer than 5 pages, it drops to 0.06 seconds and also uses less memory.
Do you think this is a good trade-off? We will have millions of rows instead of 100k-200k, but it will save memory and improve performance. Is this a good approach, and do you have any ideas how else to overcome this problem?
The text version of the data is used only for indexing and highlighting. So, we are very flexible.
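The 5-page chunking described above can be sketched as follows; `pages` here stands for the already-extracted text of one PDF, split per page (the extraction step itself is outside this snippet):

```python
def chunk_pages(pages, chunk_size=5):
    """Group a document's pages into fixed-size chunks.

    Each chunk becomes its own row, so a search hit only needs to pull
    back ~5 pages of text for highlighting instead of the whole file.
    """
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]

doc = [f"page {n}" for n in range(1, 13)]   # a 12-page document
chunks = chunk_pages(doc)
print(len(chunks))  # 3
print(chunks[-1])   # ['page 11', 'page 12']
```

Each chunk row would then carry the parent document's id so the hits can still be grouped back to one PDF.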
Edit: We store the PDF files in our cloud, but for search highlighting we need to retrieve the text version of the PDF and give it to Sphinx; Sphinx then returns the highlighted 256-character snippet. To index the PDF files we need to insert them into the database, because they also have additional metadata, like description tags and a title, and we need to link them for the search engine. If we indexed txt or pdf files straight from the file server, it would not be possible to pull the other data from the DB and link it to those files in the search engine. So we still store the PDF files in our cloud, but the text version must also be in our DB so that the title, tags, and description can be indexed with it. They are in different tables, but the text must be in the database as well.
Thanks,
It sounds like you don't really need to retrieve your entire PDF file every time you hit a row for that PDF.
Are you separating the metadata about your PDF files from the files themselves? You definitely shouldn't have just one table here. You might want something like a pdf_info table with the metadata (do you really have 100 columns' worth of metadata? why 100 columns?) and a foreign key to a pdf_files table containing the actual text for the files. Then you can experiment with, say, making the info table InnoDB and the files table MyISAM.
IMHO: there are many, many reasons NOT to store your PDF files in the MySQL database. I would just store the file paths, with the files on a SAN or some other file-distribution mechanism. SQL is good for storing abstract data, and files are certainly in that category, but file systems are specifically designed to store files, and web servers are specifically designed to deliver those files to you as quickly as possible. So... just something to think about.
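As a sketch of the path-only approach suggested above, using SQLite as a stand-in (the table and column names, and the /san path, are hypothetical):

```python
import sqlite3

# Metadata lives in the database; the PDF itself lives on disk/SAN and
# is referenced only by its path.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pdf_info (
        id INTEGER PRIMARY KEY,
        title TEXT,
        description TEXT,
        file_path TEXT NOT NULL   -- where the actual file lives
    );
""")
conn.execute(
    "INSERT INTO pdf_info (title, description, file_path) VALUES (?, ?, ?)",
    ("Annual report", "2023 figures", "/san/pdfs/00/00001.pdf"),
)
row = conn.execute("SELECT file_path FROM pdf_info WHERE id = 1").fetchone()
print(row[0])  # /san/pdfs/00/00001.pdf
```

The application opens the file from `file_path` while all queries, joins, and the search-engine linkage stay on the small metadata rows.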
Use Solr; it can index text files together with their metadata from a database. I have switched the search engine to Solr.
That sounds like a really bad technology choice. If you can slow the growth enough to keep everything in memory (affordable up to 128 GB or so), or partition for a larger size, you can basically be limited only by network transfer.
[edit]
If the PDFs are on disk and not in RAM, the disk needs to be accessed. Without an SSD, you can do that about 50 times/second/disk. As long as a PDF is smaller than a disk track, splitting is not very interesting. If you split the PDFs and then need access to all the parts, you might need to load from several tracks, slowing you down a lot.
Handling large documents with an RDBMS in a multi-user setup is not a good idea, performance-wise.