What alternative I can use to SVG words cloud? - html

Recently, I've designed a word cloud in Illustrator for a customer. It uses around 5,000 people's names in white on a colored background on a logo path, and includes a few vector logos. Each name is ridiculously small, and we want to be able to search on the cloud and find our name.
We've put it online as a SVG with success - but a 20M file can cause problems!
So everything would be fine until we reach 10,000 visitors at the same time, and make all our servers crash and timeout everyone.
So what is our alternative to make this fast, easy for visitors to use, and latency free? We think about Canvas, but not sure if it's simple to make a words cloud with [really (thing about following a logo path)] custom shape.

It sounds like you have 20Mb because the names are being stored/represented with paths. If you represent them as text, you will substantially reduce the size of the file, AND make it appropriately searchable.
Assuming 13 characters per name (including the space in between), UTF-8 encoding, and 10,000 names the names themselves should only take 127Kb. You may wish to experiment with transmitting the background of the SVG and the names (JSON?), and using a script to construct the cloud in the browser.
Edit: Even if you create a completely static SVG, representing the text as text will result in a substantial saving of space over the use of paths.

Related

Producing many different hashes of a jpg file with minimal change to picture

My goal is to write a program (e.g. in Python or C++) which takes as input a JPG file (e.g. tux.jpg) and make tiny changes to it, such that it outputs many different images (maybe a thousand images or even more), but in a way that all these images, while having different hash, look almost the same visually, i.e. the changes should have the least impact to the original image as possible.
I first though to play around with the jpg header but that might not be enough to make the many thousands of different pictures I want.
As a naive way, I thought to flip a random bit in the file, but that bit can possibly result in a less than desirable result, which can be seen especially in small pictures (e.g. a dark pixel in the white space in the tux picture). Ideally, I would like to change a random pixel with a "neighboring" color, such that the two resulting pictures have almost no visual difference.
For this purpose, I read the JPG codec example but I find it very confusing and hard to understand. Can someone help me what my program should look for as it parses the file in binary format and how to change a random pixel with a "neighboring" color?
You can change the comment part of the file by playing with the file header. A simple way to do that is to use a ready made open source program that allows you to put the comment of your choice, example HLLO repeated 8 times. That should give you 256 bits to play with. You can then determine the place where the HLLO pattern is located in the file using a hex editor. You then load the data in memory and start changing these 32 bytes and calculate the hash each time to get a collision (a hash that matches)
By the time you find a collision, the universe will have ended.
Although in theory doable, it's practically impossible to crack SHA256 in a reasonable amount of time, standard encryption protocols would be over and hackers would be enjoying their time.

Is there a "law of diminishing returns" for converting images to Base64 as opposed to simply using the images themselves?

Say I have some icons on my site (32x32, 64x64, 128x128, etc.), then converting them to Base64 makes sense, right?
Now say I have some high resolution images that are 2MB, 3MB, 4MB, and greater that I am using on my site. Does it make sense to convert these larger images to Base64 or is it "smarter" to simply keep using them as .jpg/.png/.gif/etc.?
If there is such a "law", "rule of thumb", etc. what is the line in the sand?
EDIT: While this post was marked as a duplicate, the linked "original" is from over 10 years ago; browser technology, computers, and the web itself, has changed significantly since then. What would be the answer for today's technology?
The answer to the questions is yes, and it depends.
If we rephrase the question to: Does the law of diminishing returns apply to using base64 for embedding images in a page?
a) Yes, the law applies
b) It depends on: Image count and size, and your setup (ie HTTP (HTTP/2?) connection type, etc)
The reason being that more images require more connections imply more handshakes, unless you are using keep alive connections or HTTP/2 streaming. If the images are bigger and require more computing to convert from base64 back to binary (plus decompression), then the bandwidth saves come with CPU expense.
In general, if you have lots of images (icons, for example), you could embed as base64. But in that case you also have the following options:
a) Image Atlas: Converting all small images to a single image (one load) and showing only the portion that you need through the page.
b) Converting to alternative formats, such as fonts or SVG, and again rendering what you need. Example: Open Iconic.

How to reduce pdf size with PDFLib with heavy images?

I'm creating pdf with the help of the PDFLib engine. My requirement is quite heavy in terms of data which is going to be stored in pdf. One pdf will going to store around 300 Images. And will create around 100 pdfs at the same time. Mine images are been repeated so, I'll know which image will be places where. Now if I go with the image_load option of the PDFLib, the pdf size is around 100Mb. Is there any way so that I can reduce the size ?
The answer is Templates.
PDFlib supports a PDF feature with the technical name Form XObjects. However, since this term conflicts with interactive forms we refer to this feature as templates.
A PDFlib template can be thought of as an off-page buffer into which text, vector, and image operations are redirected (instead of acting on a regular page).
After the template is finished it can be used much like a raster image, and placed an arbitrary number of times on arbitrary pages. Like images, templates can be subjected to
geometrical transformations such as scaling or skewing.
When a template is used on multiple pages (or multiply on the same page), the actual PDF operators for constructing the template are only included once in the PDF file, thereby saving PDF output file size. Templates suggest themselves for elements which appear repeatedly on several pages, such as a constant background, or a company logo.
And will also reduces the size.

Extract or crop image from within TIFF

I need to extract/crop the logotype (BEAVER) in the middle from a TIFF file that looks like this: http://i41.tinypic.com/2i7rbie.jpg
And then I need to automate the process so it can be repeated about 9 million times...
My guess is that I would have to use some OCR software. But is it possible for such a software to "crop anything that starts below this point and ends above this point"?
Thoughts?
Typically OCR software does only extraction of text from images and conversion of it into some text-specific format. It does not do crop. However, you can use OCR technologies to achieve your task. I would recommend following:
OCR whole page
Get coordinates of recognized text
Apply your magic rules to recognized text to locate area to crop: such as averything in between "application filled" and "STATEMENT" sentences.
Cut from image that area and export it where you want it.
Real challenge is in the amount of text you would like to process. You have to be very carefull when defining your "smart rules" to make sure they don't provide false positives and always send suspicious images to separate queue that you will later manually review and update your rules.
In general it may look like this:
Take first 10 of images, define logo detection rules, test and see if everything works well
Then run on next 10, see what was prcessed wrong, what was not processed, update rules, re-process those 10 to make sure everything works well now
Re-run it on new batches of same size until it will start working well.
Then increase batch size from 10 to 100, and go with those batches until again everything start working smoothly
Then continue this way perfecting your rules and increasing batch size. At some point of time you will go to production speed.
Most likely you will encounter some strange images that either contradict existing rules, or just wrong. Not always you have to update your rules to accomodate it. It may happen that there it only dozen of images like that in whole your 9 million collection. It might be better to leave them in exceptions queue for manual processing, and don't risk stability of your magic rules.

How much faster is it to use inline/base64 images for a web site than just linking to the hard file?

How much faster is it to use a base64/line to display images than opposed to simply linking to the hard file on the server?
url(data:image/png;base64,.......)
I haven't been able to find any type of performance metrics on this.
I have a few concerns:
You no longer gain the benefit of caching
Isn't a base64 A LOT larger in size than what a PNG/JPEG file size?
Let's define "faster" as in: the time it takes for a user to see a full rendered HTML web page
'Faster' is a hard thing to answer because there are many possible interpretations and situations:
Base64 encoding will expand the image by a third, which will increase bandwidth utilization. On the other hand, including it in the file will remove another GET round trip to the server. So, a pipe with great throughput but poor latency (such as a satellite internet connection) will likely load a page with inlined images faster than if you were using distinct image files. Even on my (rural, slow) DSL line, sites that require many round trips take a lot longer to load than those that are just relatively large but require only a few GETs.
If you do the base64 encoding from the source files with each request, you'll be using up more CPU, thrashing your data caches, etc, which might hurt your servers response time. (Of course you can always use memcached or such to resolve that problem).
Doing this will of course prevent most forms of caching, which could hurt a lot if the image is viewed often - say, a logo that is displayed on every page, which could normally be cached by the browser (or a proxy cache like squid or whatever) and requested once a month. It will also prevent the many many optimizations web servers have for serving static files using kernel APIs like sendfile(2).
Basically, doing this will help in certain situations, and hurt in others. You need to identify which situations are important to you before you can really figure out if this is a worthwhile trick for you.
I have done a comparison between two HTML pages containing 1800 one-pixel images.
The first page declares the images inline:
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB/AAffA0nNPuCLAAAAAElFTkSuQmCC">
In the second one, images reference an external file:
<img src="img/one-gray-px.png">
I found that when loading multiple times the same image, if it is declared inline, the browser performs a request for each image (I suppose it base64-decodes it one time per image), whereas in the other scenario, the image is requested once per document (see the comparison image below).
The document with inline images loads in about 250ms and the document with linked images does it in 30ms.
(Tested with Chromium 34)
The scenario of an HTML document with multiple instances of the same inline image doesn't make much sense a priori. However, I found that the plugin jquery lazyload defines an inline placeholder by default for all the "lazy" images, whose src attribute will be set to it. Then, if the document contains lots of lazy images, a situation like the one described above can happen.
You no longer gain the benefit of caching
Whether that matters would vary according to how much you depend on caching.
The other (perhaps more important) thing is that if there are many images, the browser won't get them simultaneously (i.e. in parallel), but only a few at a time -- so the protocol ends up being chatty. If there's some network end-to-end delay, then many images divided by a few images at a time multiplied by the end-to-end delay per image results in a noticeable time before the last image is loaded.
Isn't a base64 A LOT larger in size than what a PNG/JPEG file size?
The file format / image compression algorithm is the same, I take it, i.e. it's PNG.
Using Base-64, each 8-bit character represents 6-bits: therefore binary data is being decompressed by a ratio of 8-to-6, i.e. only about 35%.
How much faster is it
Define 'faster'. Do you mean HTTP performance (see below) or rendering performance?
You no longer gain the benefit of caching
Actually, if you're doing this in a CSS file it will still be cached. Of course, any changes to the CSS will invalidate the cache.
In some situations this could be used as a huge performance boost over many HTTP connections. I say some situations because you can likely take advantage of techniques like image sprites for most stuff, but it's always good to have another tool in your arsenal!