It looks like Hadoop MapReduce requires the input, whether text or binary, to have a key-value pair structure.
In reality we might have files that need to be split into chunks for processing, but the keys may be spread across the file. It may not be clear-cut that one key is followed by one value. Is there an InputFormat that can read this kind of binary file? I don't want to chain one MapReduce job into another just to restructure the data; that would slow down the performance and defeat the purpose of using MapReduce.
Any suggestions? Thanks.
According to Hadoop: The Definitive Guide:
The logical records that FileInputFormats define do not usually fit neatly into HDFS
blocks. For example, a TextInputFormat’s logical records are lines, which will cross
HDFS boundaries more often than not. This has no bearing on the functioning of your
program—lines are not missed or broken, for example—but it’s worth knowing about,
as it does mean that data-local maps (that is, maps that are running on the same host
as their input data) will perform some remote reads. The slight overhead this causes is
not normally significant.
If the file is split by HDFS at block boundaries, the Hadoop framework will take care of it. But if you split the file manually, then the boundaries have to be taken into consideration.
In reality we might have files to be split into chunks to be processed. But the keys may be spread across the file. It may not be clear-cut that one key is followed by one value.
What exactly is your scenario? Then we can look at a workaround for it.
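In the meantime, here is the general pattern a custom reader can use for records that cross split boundaries, mirroring what LineRecordReader does for lines: skip the partial record at the start of a split (the previous split's reader owns it), and allow reading past the end of the split to finish the last record started inside it. This is a minimal sketch, not production code, and it assumes a hypothetical binary layout of back-to-back [marker][length][payload] records; you would pair it with a FileInputFormat subclass whose createRecordReader returns it.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MarkerRecordReader extends RecordReader<LongWritable, BytesWritable> {
    private static final int MARKER = 0xCAFEBABE;  // hypothetical sync marker

    private FSDataInputStream in;
    private long start, end, fileLength;
    private long recordStart = -1;                 // offset of the next record's marker

    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        FileSystem fs = fileSplit.getPath().getFileSystem(ctx.getConfiguration());
        fileLength = fs.getFileStatus(fileSplit.getPath()).getLen();
        in = fs.open(fileSplit.getPath());
        // A split rarely begins exactly on a record boundary. Scan forward to
        // the first marker at or after `start`; the reader of the *previous*
        // split owns whatever record we landed in the middle of.
        recordStart = (start == 0) ? 0 : scanForMarker(start);
    }

    /** Offset of the first marker at or after `from`, or -1 if none. */
    private long scanForMarker(long from) throws IOException {
        in.seek(from);
        int window = 0;                            // rolling 4-byte window
        for (long p = from; p < fileLength; p++) { // byte-at-a-time for clarity;
            window = (window << 8) | in.read();    // buffer this in real code
            if (p - from >= 3 && window == MARKER) return p - 3;
        }
        return -1;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Ownership rule: this reader handles every record whose marker lies in
        // [start, end). It may read payload bytes *past* `end` to finish the
        // last record it started -- the small remote read the book refers to.
        if (recordStart < 0 || recordStart >= end) return false;
        in.seek(recordStart);
        in.readInt();                              // skip the marker itself
        int len = in.readInt();                    // [marker][length][payload]
        byte[] payload = new byte[len];
        in.readFully(payload);
        key.set(recordStart);
        value.set(payload, 0, len);
        recordStart = in.getPos();                 // records are back-to-back
        return true;
    }

    @Override public LongWritable getCurrentKey()    { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() {
        if (recordStart < 0 || recordStart >= end) return 1.0f;
        return (recordStart - start) / (float) (end - start);
    }
    @Override public void close() throws IOException { if (in != null) in.close(); }
}
```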
How can I check whether a PDF already exists, or is 80% the same, in MySQL?
Users want to upload PDFs, but the problem is re-uploads of the same document.
I am thinking of converting the PDF to binary,
=> so I will have a string "X" (the binary of that PDF) to save in MySQL,
=> and then SELECT with LIKE on a slice of it (from 1/3 of length(X) to 2/3 of length(X)).
Could that work?
I'm using Laravel.
Thanks for reading.
This cannot be done reasonably in MySQL. Since you are also using a PHP environment, it may be possible to do it in PHP, but a general solution will take substantial effort.
PDF files are composed of (possibly compressed) streams of images and text. Several libraries can attempt to extract the text, and they will work reasonably well if the PDF was generated in a straightforward way; however, they will typically fail if some text was rendered as images of its characters, or if other obfuscation has been applied. In those cases, you will need to use OCR to recover the actual text as it appears when the PDF is displayed. Note also that tables and images are out of scope for these tools.
Once you have two text files, finding overlaps becomes much easier, although there are several techniques. "Same 80%" can be interpreted in several ways, but let us assume that copying a contiguous 79% of the text from a file and saving it again should not trigger alarms, while copying 81% of that same text should trigger them. Any diff tool can provide information on duplicate chunks, and may be enough for your purposes. A more sophisticated approach, which however does not provide exact percentages, is to use the normalized compression distance.
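If you want to experiment with the normalized compression distance, it can be prototyped with nothing beyond the JDK's Deflater. The sketch below assumes you have already extracted plain text from both PDFs as described above (the sample strings here are made up), and the decision threshold is something you would have to calibrate on your own documents.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Rough prototype of the normalized compression distance (NCD). Inputs are
// the *extracted text* of the two PDFs, not the raw PDF bytes, since PDF
// streams are usually already compressed.
public class NcdCheck {

    static int compressedSize(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos =
                 new DeflaterOutputStream(out, new Deflater(Deflater.BEST_COMPRESSION))) {
            dos.write(data);
        }
        return out.size();
    }

    /** NCD(x,y) = (C(xy) - min(C(x),C(y))) / max(C(x),C(y)); near 0 = near-duplicates. */
    static double ncd(String x, String y) throws IOException {
        byte[] bx = x.getBytes(StandardCharsets.UTF_8);
        byte[] by = y.getBytes(StandardCharsets.UTF_8);
        byte[] bxy = new byte[bx.length + by.length];
        System.arraycopy(bx, 0, bxy, 0, bx.length);
        System.arraycopy(by, 0, bxy, bx.length, by.length);
        int cx = compressedSize(bx), cy = compressedSize(by), cxy = compressedSize(bxy);
        return (cxy - Math.min(cx, cy)) / (double) Math.max(cx, cy);
    }

    public static void main(String[] args) throws IOException {
        String a = "the quick brown fox jumps over the lazy dog ".repeat(200);
        String b = a.substring(0, a.length() / 2) + "completely different tail text ".repeat(100);
        // Which NCD value counts as "80% the same" must be calibrated on your corpus.
        System.out.printf("NCD = %.3f%n", ncd(a, b));
    }
}
```

In practice you would store the extracted text (or hashes of its chunks) in MySQL alongside the PDF and run this comparison in your PHP application code, not in SQL.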
I have a server talking to a mobile app, and this app potentially does thousands of requests per day. I don't care that much about performance in this particular case, so saving some milliseconds isn't as big a concern as saving bandwidth - especially since I'm paying for it.
(1) What is the advantage of using JSON over binary here, when bandwidth is a much bigger deal than performance? I mean, I have read some people saying that the size difference between raw data and JSON isn't really that much - and that may well be partially true, but when you have thousands of daily requests being made by hundreds of thousands of users, merely doubling the number of bytes will have a huge impact on bandwidth usage - and in the end, on the server bill.
Also, some people said that you can easily alter the JSON output format, while changing the binary serialization might be a little more complicated. Again, I agree, but is that really worth that much? Like, what are the odds that we're gonna change our format? Will the ease of change make up for JSON's extra bandwidth?
(2) And finally, I stumbled upon this link while doing some research on this topic, and in the summary table (Ctrl + F, 'summary') it says that the JSON data size is smaller than the actual binary data? How is that even possible?
I would very much appreciate some answers to these questions.
Thank you in advance.
thousands of requests per day
that's ... not really a lot, so most approaches will usually be fine
What is the advantage of using JSON over binary here, when bandwidth is a much bigger deal than performance?
JSON wouldn't usually have an advantage; usually that would go to binary protocols - things like protobuf; however, compression may be more significant than choice of protocol. If you want meaningful answers, however, the only way to get that is to test it with your data.
If bandwidth is your key concern, I'd go with protobuf, adding compression on top if you have a lot of text data in your content (text compresses beautifully, and protobuf simply uses UTF8, so it is still available for compression).
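To make the compression point concrete, here is a minimal, self-contained sketch (the JSON payload is invented for the example, not taken from the linked article) showing how much gzip alone recovers on a repetitive text payload:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// For text-heavy payloads, gzip often saves far more than switching
// serialization formats does.
public class GzipSizeDemo {
    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < 200; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"id\":").append(i)
              .append(",\"name\":\"user").append(i)
              .append("\",\"active\":true}");
        }
        byte[] raw = sb.append(']').toString().getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        System.out.println("raw JSON: " + raw.length + " bytes");
        System.out.println("gzipped:  " + out.size() + " bytes");
        // Repeated field names ("id", "name", "active") compress extremely
        // well -- which is also why protobuf's numeric tags plus gzip on the
        // UTF-8 text fields is hard to beat.
    }
}
```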
it says that the JSON data size is smaller than the actual binary data?
JSON contains textual field names, punctuation (", :, ,), etc - and all values are textual rather than primitive; JSON will be larger than good binary serializers. The article, however, compares to BinaryFormatter; BinaryFormatter does not qualify as a good binary serializer IMO. It is easy to use and works without needing to do anything. If you compare against something like protobuf: protobuf will beat JSON every time. And sure enough, if we look at the table: JSON is 102 or 86 bytes (depending on the serializer); protobuf-net is 62 bytes, MsgPack is 61, BinaryFormatter is 669. Do not conflate "a binary serializer" with BinaryFormatter. I blame the article for this error, not you. MsgPack and Protocol Buffers (protobuf-net) are binary serializers, and they come out in the lead.
(disclosure: I'm the author of protobuf-net)
I have an optimization question.
Let's say that I'm making a website, and it has a JSON file with 5,000 pairs (about 582 KB), and through a combination of 3 sliders and some select tags it is possible to display every value, so the time between displaying one pair and the next is in microseconds.
My question is: if the website is also meant to run on mobile browsers, where is it more efficient to keep the 5,000 pairs of data - in a JSON file or in the database? And why?
I am building a photo site with similar requirements, and I can say after months of investigation and experimenting that there is no easy answer to that question. But I will try to give you some hints:
Try to divide the data into chunks. For example, if your sliders select values between 1 and 100, instead of delivering exactly what the client selected, round up a bit, maybe +-10 or more; that way you can continue filtering on the client side without a server round trip. Save all data in client memory before querying.
Don't render more than what is visible on the screen. JSON storage and filtering are fast, but the DOM is very slow, so minimize the visible elements.
Use 304 caching - meaning, whenever the client requests the same data twice, send a proper 304 response with an ETag (see the sketch after this list). A good rule of thumb is to use something you can compute very cheaply, like the max ID in the database, to check whether any new data has been added since the last call. If not, just send 304 and the client will reuse whatever it had last time.
Use absolute positioning. Don't even try to use CSS float or something like that; it will not work. Just calculate the position of each element. This will also help you achieve tip 2 (by filtering out all elements that are outside the visible screen). You can still use CSS transitions, which give nice animations when the sliders change.
You could experiment with IndexedDB to help with the client-side querying, but unfortunately support across browsers is still not good enough, plus you hit a storage ceiling; better to use the ordinary cache with proper headers.
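For the 304 tip, here is a minimal sketch of the ETag handshake using the JDK's built-in HTTP server; the "max ID" ETag is hypothetical - any value that changes whenever your data changes will do.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal 304/ETag handshake: if the client already holds the current
// version, it costs one round trip and zero payload bytes.
public class EtagServer {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/data", exchange -> {
            long maxId = 4711;                      // pretend: SELECT MAX(id) FROM pairs
            String etag = "\"" + maxId + "\"";      // ETags are quoted strings
            String previous = exchange.getRequestHeaders().getFirst("If-None-Match");
            exchange.getResponseHeaders().set("ETag", etag);
            if (etag.equals(previous)) {
                // Client already has the current data: empty 304, no payload.
                exchange.sendResponseHeaders(304, -1);
            } else {
                byte[] body = "{\"pairs\":[]}"      // stand-in for the real JSON
                        .getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
            }
            exchange.close();
        });
        server.start();
    }
}
```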
Good Luck!
A database like MongoDB would be good for this. It stores documents in a JSON-like format (BSON), so you can load the values from your JSON file directly. Querying is very fast too, and you wouldn't have to parse the JSON file and hold it in an object before using it.
Given the size of the data (just 582 KB), I would opt for the JSON file.
The drawback is a penalty when starting the app and loading the data into memory, but after that all queries run very fast in memory, which is a big advantage.
You need to weigh how many accesses (queries) your app would make to the database against loading the file just once. And consider whether your main target is mobile browsers or PCs.
For this volume of data I wouldn't bring in a database (another process consuming resources); just measure how much time and memory loading the JSON file actually takes.
If the data is going to grow, then you will need to rethink this, or maybe split your JSON file according to some criteria.
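To make the load-once approach concrete, here is a minimal sketch; it assumes the Gson library and an invented pairs.json layout, so adjust both to your real schema.

```java
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class InMemoryPairs {
    // Adjust the fields to your real schema; this assumes entries like
    // {"label": "...", "value": 42} in a top-level JSON array.
    static class Pair { String label; double value; }

    public static void main(String[] args) throws IOException {
        List<Pair> pairs;
        try (Reader r = Files.newBufferedReader(Paths.get("pairs.json"))) {
            pairs = new Gson().fromJson(r, new TypeToken<List<Pair>>() {}.getType());
        }
        // One-off cost: parsing ~582 KB takes a fraction of a second. After
        // that, every slider move is just an in-memory filter:
        double lo = 10, hi = 20;   // current slider positions
        List<Pair> visible = pairs.stream()
                .filter(p -> p.value >= lo && p.value <= hi)
                .collect(Collectors.toList());
        System.out.println(visible.size() + " pairs match");
    }
}
```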
I occasionally find myself needing certain filesystem APIs which could be implemented very efficiently if supported by the filesystem, but I've never heard of them. For example:
Truncate file from the beginning, on an allocation unit boundary
Split file into two on an allocation unit boundary
Insert or remove a chunk from the middle of the file, again, on an allocation unit boundary
The only way that I know of to do things like these is to rewrite the data into a new file. This has the benefit that the allocation unit is no longer relevant, but is extremely slow in comparison to some low-level filesystem magic.
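For concreteness, this is the rewrite-based fallback I mean - a minimal sketch that "truncates" a file from the beginning by streaming the tail into a new file and renaming it over the original, costing O(remaining bytes):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Removes `cutBytes` from the front of `file` by copying everything after
// that offset into a temp file -- exactly the cost that a hypothetical
// allocation-unit-level filesystem API would avoid.
public class TruncateFront {
    static void truncateFront(Path file, long cutBytes) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(tmp,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = cutBytes, size = src.size();
            while (pos < size) {                   // transferTo may copy less
                pos += src.transferTo(pos, size - pos, dst);
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```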
I understand that the alignment requirements mean that the methods aren't always applicable, but I think they can still be useful. For example, a file archiver may be able to trim down the archive very efficiently after the user deletes a file from the archive, even if that leaves a small amount of garbage either side for alignment reasons.
Is it really the case that such APIs don't exist, or am I simply not aware of them? I am mostly interested in NTFS, but hearing about other filesystems will be interesting too.
For NTFS and FAT there are no such APIs. You can obviously truncate the end of a file, but not the beginning.
Implementing this would be inadvisable anyway, due to file system caching.
Most of the time people implement a layer "on top" of NTFS to support this.
Raymond Chen has essentially answered this question.
His answer is that no, such APIs don't exist, because there is too little demand for them. Raymond also suggests using sparse files and decommitting blocks by zeroing them.
Does anyone know how I can store large binary values in Riak?
For now, they don't recommend storing files larger than 50MB in size without splitting them. See: FAQ - Riak Wiki
If your files are smaller than 50 MB, then proceed as you would when storing non-binary data in Riak.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online-queries, conflict resolution, increased internal semantics, and robust expressions of relationships. (Schema Design in Riak - Introduction)
Brian Mansell's answer is on the right track - you don't really want to store large binary values (over 50 MB) as a single object in Riak (the cluster becomes unusably slow after a while).
You have two options instead:
1) If a binary object is small enough, store it directly. If it's over a certain threshold (50 MB is a decent arbitrary starting point, but really, run performance tests to find the object size at which your cluster starts to crawl), break the file up into chunks and store the chunks separately. (In fact, most people I've seen go this route use chunks of 1 MB.)
This means, of course, that you have to keep track of the "manifest" -- which chunks got stored where, and in what order. And then, to retrieve the file, you would first have to fetch the object tracking the chunks, then fetch the individual file chunks and reassemble them back into the original file. Take a look at a project like https://github.com/podados/python-riakfs to see how they did it.
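For illustration, a chunk-and-manifest upload over Riak's HTTP interface could look like the sketch below; the bucket names, the /riak/&lt;bucket&gt;/&lt;key&gt; URL layout (which differs between Riak versions), and the manifest format are all invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class RiakChunkedUpload {
    static final int CHUNK = 1024 * 1024;          // the 1 MB chunks mentioned above
    static final String BASE = "http://127.0.0.1:8098/riak";  // illustrative URL layout
    static final HttpClient http = HttpClient.newHttpClient();

    static void put(String bucket, String key, byte[] body, String contentType)
            throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE + "/" + bucket + "/" + key))
                .header("Content-Type", contentType)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        HttpResponse<Void> resp = http.send(req, HttpResponse.BodyHandlers.discarding());
        if (resp.statusCode() >= 300) throw new IOException("PUT failed: " + resp.statusCode());
    }

    static void upload(String fileKey, Path file) throws IOException, InterruptedException {
        List<String> chunkKeys = new ArrayList<>();
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            for (int i = 0; ; i++) {
                byte[] chunk = in.readNBytes(CHUNK);   // streams the file; never
                if (chunk.length == 0) break;          // holds more than 1 MB
                String chunkKey = fileKey + "-" + i;
                put("chunks", chunkKey, chunk, "application/octet-stream");
                chunkKeys.add(chunkKey);
                total += chunk.length;
            }
        }
        // The manifest is what you fetch first on the way back down: it lists
        // the chunk keys in order, so a client can fetch and concatenate them.
        String manifest = "{\"size\":" + total + ",\"chunks\":"
                + chunkKeys.stream().map(k -> "\"" + k + "\"")
                           .collect(Collectors.joining(",", "[", "]")) + "}";
        put("manifests", fileKey, manifest.getBytes(StandardCharsets.UTF_8), "application/json");
    }
}
```

Retrieval is the mirror image: GET the manifest, then GET each chunk key it lists, in order, and concatenate the bodies.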
2) Alternatively, you can just use Riak CS (Riak Cloud Storage), to do all of the above, but the code is written for you. That's exactly how RiakCS works -- it breaks an incoming file into chunks, stores and tracks them individually in plain Riak, and reassembles them when it comes time to fetch it back. And provides an Amazon S3 API for file storage, for your convenience. I highly recommend this route (so as not to reinvent the wheel -- chunking and tracking files is hard enough). Yes, CS is a paid product, but check out the free Developer Trial, if you're curious.
Just like every other value. Why would it be different?
Use either the Erlang interface ( http://hg.basho.com/riak/src/461421125af9/doc/basic-client.txt ) or the "raw" HTTP interface ( http://hg.basho.com/riak/src/tip/doc/raw-http-howto.txt ). It should "just work."
Also, you'll generally find a better response on the riak-users mailing list than you will here. http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com (No offense to z8000, who seems to also have answers.)