iText7: Creating PDF from TIFF multipage image using iText - tiff

I am trying to use iText 7.1.1 to convert a multipage TIFF image to a PDF file with multiple pages. The article "Create PDF from TIFF image using iText" got me started, but it targets iText 5.5.x and I am having trouble reproducing it in iText 7.
I did find TiffImageData.getNumberOfPages(raf) to replace int pages = TiffImage.getNumberOfPages(rafa).
However, I have not been able to find a replacement for TiffImage.getTiffImage(rafa, i) in iText 7. Do I need to use new Image(ImageDataFactory.createTiff(...))? I would appreciate any suggestions.
iText 5.5.x code
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import com.itextpdf.text.Document;
import com.itextpdf.text.Image;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.io.FileChannelRandomAccessSource;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.RandomAccessFileOrArray;
import com.itextpdf.text.pdf.codec.TiffImage;
public class Test1 {
public static void main(String[] args) throws Exception {
RandomAccessFile aFile = new RandomAccessFile("/myfolder/origin.tif", "r");
FileChannel inChannel = aFile.getChannel();
FileChannelRandomAccessSource fcra = new FileChannelRandomAccessSource(inChannel);
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("/myfolder/destination.pdf"));
document.open();
RandomAccessFileOrArray rafa = new RandomAccessFileOrArray(fcra);
int pages = TiffImage.getNumberOfPages(rafa);
Image image;
for (int i = 1; i <= pages; i++) {
image = TiffImage.getTiffImage(rafa, i);
Rectangle pageSize = new Rectangle(image.getWidth(), image.getHeight());
document.setPageSize(pageSize);
document.newPage();
document.add(image);
}
document.close();
aFile.close();
}
}

Do I need to use new Image( ImageDataFactory.createTiff(...))
Yes.
You want this: ImageDataFactory.createTiff(bytes, recoverFromImageError, page, direct)
Then you would open a new PDF, loop through the TIFF pages and:
Get the TIFF image size
Create a new page in the PDF matching the TIFF page size
Add the TIFF image to the new PDF page
Here is a note from Bruno Lowagie on using TIFF with iText 7: How to avoid an exception when importing a TIFF file?
I see you probably want fully working code. Here you go:
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import com.itextpdf.io.image.ImageData;
import com.itextpdf.io.image.ImageDataFactory;
import com.itextpdf.io.image.TiffImageData;
import com.itextpdf.io.source.RandomAccessFileOrArray;
import com.itextpdf.io.source.RandomAccessSourceFactory;
import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.canvas.PdfCanvas;
public class TiffToPdf {
public static void main(String[] args) throws IOException {
Path tiffFile = Paths.get("/myfolder/origin.tiff");
RandomAccessFileOrArray raf = new RandomAccessFileOrArray(new RandomAccessSourceFactory().createBestSource(tiffFile.toString()));
int tiffPages = TiffImageData.getNumberOfPages(raf);
raf.close();
try (PdfDocument output = new PdfDocument(new PdfWriter("/myfolder/destination.pdf"))) {
for (int page = 1; page <= tiffPages; page++) {
ImageData tiffImage = ImageDataFactory.createTiff(tiffFile.toUri().toURL(), true, page, true);
Rectangle tiffPageSize = new Rectangle(tiffImage.getWidth(), tiffImage.getHeight());
PdfPage newPage = output.addNewPage(new PageSize(tiffPageSize));
PdfCanvas canvas = new PdfCanvas(newPage);
canvas.addImage(tiffImage, tiffPageSize, false);
}
}
}
}
Some might suggest you use the high level API to achieve this a little more cleanly but this should be sufficient for your question.

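This is the same as above, but in VB.NET.
It converts a multipage TIFF to a PDF. The classes used are from iText 7, so the iText 7 namespaces are imported.
Imports System.IO
Imports iText.IO.Image
Imports iText.IO.Source
Imports iText.Kernel
Imports iText.Kernel.Geom
Imports iText.Kernel.Pdf
Imports iText.Kernel.Pdf.Canvas
Imports iText.Layout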
Sub ConvertTIFF2PDF(ByVal inFile As String, ByVal outFile As String)
Dim pdfDoc As PdfDocument = New PdfDocument(New PdfWriter(outFile))
Dim doc As Document = New Document(pdfDoc)
Dim aFile = New RandomAccessFileOrArray(New RandomAccessSourceFactory().CreateBestSource(inFile.ToString))
Dim tiffPages = TiffImageData.GetNumberOfPages(aFile)
Dim uri As System.Uri = New Uri(inFile)
For i As Integer = 1 To tiffPages
Console.WriteLine("tiffPages: " & (i) & " of " & tiffPages.ToString)
Dim tiffImage = ImageDataFactory.CreateTiff(uri, False, i, False)
Dim tiffPageSize = New Geom.Rectangle(tiffImage.GetWidth(), tiffImage.GetHeight())
Dim newPage = pdfDoc.AddNewPage(New PageSize(tiffPageSize))
Dim canvas As PdfCanvas = New PdfCanvas(newPage)
canvas.AddImage(tiffImage, tiffPageSize, False)
Next
doc.Close()
pdfDoc.Close()
aFile.Close()
End Sub

Here is a C# version, using the iTextSharp 5 API:
public void ConvertTIFF2PDF(string inFile, string outFile)
{
iTextSharp.text.Document document = new iTextSharp.text.Document(iTextSharp.text.PageSize.A4, 0, 0, 0, 0);
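// FileMode.Create creates the output file (or overwrites it); FileMode.Open would fail if it does not exist yet
iTextSharp.text.pdf.PdfWriter writer = iTextSharp.text.pdf.PdfWriter.GetInstance(document, new FileStream(outFile, FileMode.Create));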
Bitmap bm = new Bitmap(inFile);
int total = bm.GetFrameCount(FrameDimension.Page);
document.Open();
iTextSharp.text.pdf.PdfContentByte cb = writer.DirectContent;
for (int k = 0; k < total; ++k)
{
bm.SelectActiveFrame(FrameDimension.Page, k);
iTextSharp.text.Image img = iTextSharp.text.Image.GetInstance(bm, ImageFormat.Bmp);
// scale the image to fit in the page
img.ScalePercent(72f / img.DpiX * 100);
img.SetAbsolutePosition(0, 0);
cb.AddImage(img);
document.NewPage();
}
document.Close();
}
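The ScalePercent(72f / img.DpiX * 100) call above relies on the fact that one unit in PDF user space is 1/72 inch by default, so an image scanned at a given DPI has to be scaled by 72/DPI to appear at its physical size. A minimal sketch of that arithmetic in plain Java follows; the class and method names are made up for illustration and are not part of iText.

```java
public class PageSizeMath {
    /** PDF user space defaults to 72 units (points) per inch. */
    public static final float POINTS_PER_INCH = 72f;

    /** Converts a pixel length at the given resolution to PDF points. */
    public static float pixelsToPoints(int pixels, float dpi) {
        if (dpi <= 0) {
            // Assumption: treat a missing DPI as 72, i.e. one pixel per point.
            dpi = POINTS_PER_INCH;
        }
        return pixels * POINTS_PER_INCH / dpi;
    }

    /** The percentage to pass to a scale-by-percent call, as in the C# code. */
    public static float scalePercent(float dpi) {
        return POINTS_PER_INCH * 100f / dpi;
    }

    public static void main(String[] args) {
        // A US Letter page scanned at 300 DPI is 2550 x 3300 pixels.
        System.out.println(pixelsToPoints(2550, 300f)); // 612.0
        System.out.println(pixelsToPoints(3300, 300f)); // 792.0
        System.out.println(scalePercent(300f));         // 24.0
    }
}
```

When the page size is set to the image size in pixels, as in the Java answers above, each pixel simply becomes one point; converting through the DPI instead keeps the scanned page at its physical dimensions.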

While converting doc to html graphics or shapes are not converting into html format

We want to display a .doc file in a dialog box in the browser, so I convert it to an HTML file. The conversion itself succeeds, but if the .doc file contains graphics or shapes, they are not converted into any HTML tags such as img and are not shown in the file displayed in the UI.
How can we convert a .doc file that contains graphics or shapes to HTML?
InputStream input = new FileInputStream (baseDir + fileName);
HWPFDocument wordDocument = new HWPFDocument (input);
wordToHtmlConverter.processDocument (wordDocument);
wordToHtmlConverter.setPicturesManager (picmang=new PicturesManager() {
public String savePicture (byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches) {
return suggestedName;
}
});
org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream outStream = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource (htmlDocument);
StreamResult streamResult = new StreamResult (outStream);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty (OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty (OutputKeys.INDENT, "yes");
serializer.setOutputProperty (OutputKeys.METHOD, "html");
serializer.transform (domSource, streamResult);
outStream.close();
String content = new String (outStream.toByteArray() );
FileOutputStream fos = null;
String destinationHTMLFile = baseDir + fileName.replace(".docx", "").replace(".doc", "")+".html";
BufferedWriter bw = null;
File file = new File(destinationHTMLFile);
fos = new FileOutputStream(file);
bw = new BufferedWriter(new OutputStreamWriter(fos, "UTF-8"));
bw.write(content);
So please help me out to display doc file in browser.
AbstractWordConverter.setPicturesManager must be called before AbstractWordConverter.processDocument. In addition, the PicturesManager.savePicture method in the class that implements the PicturesManager interface has to actually save the pictures.
The following example takes WordDocument.doc from my home directory, converts it to HTML including pictures, and puts the resulting files (the HTML file and the image files) in a newly created directory named html. Note that the pictures in WordDocument.doc must be GIF, PNG or JPEG, since the approach used here for writing the images only supports those types.
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.apache.poi.util.XMLHelper;
import org.w3c.dom.Document;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringWriter;
import java.io.FileInputStream;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
public class TestWordToHtmlConverter {
private static void convertDocToHTML(String docFilePathAndName, String htmlPath, String htmlFileName) throws Exception {
new File(htmlPath).mkdir();
HWPFDocument hwpfDocument = new HWPFDocument(new FileInputStream(docFilePathAndName));
Document newDocument = XMLHelper.getDocumentBuilderFactory().newDocumentBuilder().newDocument();
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument);
wordToHtmlConverter.setPicturesManager(
new PicturesManager() {
public String savePicture(byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches) {
/*
System.out.println(content);
System.out.println(pictureType);
System.out.println(suggestedName);
System.out.println(widthInches);
System.out.println(heightInches);
*/
try {
BufferedImage image = ImageIO.read(new ByteArrayInputStream(content));
ImageIO.write(image, pictureType.getExtension(), new File(htmlPath, suggestedName));
} catch (Exception e) {
e.printStackTrace();
}
return suggestedName;
}
}
);
wordToHtmlConverter.processDocument(hwpfDocument);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()),
new StreamResult(new File(htmlPath, htmlFileName)));
}
public static void main(String[] args) throws Exception {
convertDocToHTML("/home/axel/Dokumente/WordDocument.doc", "/home/axel/Dokumente/html", "WordDocument.html");
}
}
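The restriction to GIF, PNG and JPEG comes from decoding and re-encoding each picture through ImageIO. One possible variation, sketched here with JDK classes only, writes the bytes passed to savePicture unchanged. Whether the browser can then display the resulting file (for example EMF or WMF) is a separate question. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawPictureSaver {
    /**
     * Writes the picture bytes to targetDir/suggestedName without re-encoding
     * and returns the relative name to use as the image source in the HTML.
     */
    public static String save(byte[] content, Path targetDir, String suggestedName) throws IOException {
        Files.createDirectories(targetDir);
        // Keep only the file-name part so the name cannot point outside targetDir.
        String safeName = Paths.get(suggestedName).getFileName().toString();
        Files.write(targetDir.resolve(safeName), content);
        return safeName;
    }
}
```

Inside the anonymous PicturesManager, the body of savePicture could then call RawPictureSaver.save(content, Paths.get(htmlPath), suggestedName) within a try/catch, since savePicture itself does not declare IOException.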

GLFW Window Icon

I am working on a project in OpenGL with lwjgl. I was having a hard time loading an icon for the window, as it wanted a GLFWImage buffer. After a long time of scouring the internet, this is what I have:
try {
BufferedImage originalImage =
ImageIO.read(new File("favicon.png"));
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write( originalImage, "png", baos );
baos.flush();
byte[] imageInByte = baos.toByteArray();
ByteBuffer buF = ByteBuffer.wrap(imageInByte);
GLFWImage.Buffer b = new GLFWImage.Buffer(buF);
glfwSetWindowIcon(window, b);
} catch (IOException io){
System.out.println("Could not load window icon!");
System.out.println(io.toString());
}
The java runtime crashes with an output like this:
# A fatal error has been detected by the Java Runtime Environment:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
I haven't been able to find a method to do this that doesn't give this kind of error. The official glfw documentation says to use a method that doesn't seem to exist in LWJGL. If you have any experience with this, it would even be helpful to just point me in the right direction.
Thanks in advance.
This solution is unwieldy; however, it works for me! :)
It's based on code in the lwjgl events demo, but to use that I had to implement the demo IOUtil. The code for setting the icon is this:
ByteBuffer icon16;
ByteBuffer icon32;
try {
icon16 = IOUtil.ioResourceToByteBuffer("src/hexsweeper/hex16.png", 2048);
icon32 = IOUtil.ioResourceToByteBuffer("src/hexsweeper/hex32.png", 4096);
} catch (Exception e) {
throw new RuntimeException(e);
}
IntBuffer w = memAllocInt(1);
IntBuffer h = memAllocInt(1);
IntBuffer comp = memAllocInt(1);
try ( GLFWImage.Buffer icons = GLFWImage.malloc(2) ) {
ByteBuffer pixels16 = stbi_load_from_memory(icon16, w, h, comp, 4);
icons
.position(0)
.width(w.get(0))
.height(h.get(0))
.pixels(pixels16);
ByteBuffer pixels32 = stbi_load_from_memory(icon32, w, h, comp, 4);
icons
.position(1)
.width(w.get(0))
.height(h.get(0))
.pixels(pixels32);
icons.position(0);
glfwSetWindowIcon(window, icons);
stbi_image_free(pixels32);
stbi_image_free(pixels16);
}
The imports are as follows:
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import org.lwjgl.glfw.GLFWImage;
import static org.lwjgl.stb.STBImage.*;
import static org.lwjgl.system.MemoryUtil.*;
And in another file (named IOUtil) I put the following code:
import org.lwjgl.BufferUtils;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import static org.lwjgl.BufferUtils.*;
public final class IOUtil {
private IOUtil() {
}
private static ByteBuffer resizeBuffer(ByteBuffer buffer, int newCapacity) {
ByteBuffer newBuffer = BufferUtils.createByteBuffer(newCapacity);
buffer.flip();
newBuffer.put(buffer);
return newBuffer;
}
/**
* Reads the specified resource and returns the raw data as a ByteBuffer.
*
* @param resource the resource to read
* @param bufferSize the initial buffer size
*
* @return the resource data
*
* @throws IOException if an IO error occurs
*/
public static ByteBuffer ioResourceToByteBuffer(String resource, int bufferSize) throws IOException {
ByteBuffer buffer;
Path path = Paths.get(resource);
if ( Files.isReadable(path) ) {
try (SeekableByteChannel fc = Files.newByteChannel(path)) {
buffer = BufferUtils.createByteBuffer((int)fc.size() + 1);
while ( fc.read(buffer) != -1 ) ;
}
} else {
try (
InputStream source = IOUtil.class.getClassLoader().getResourceAsStream(resource);
ReadableByteChannel rbc = Channels.newChannel(source)
) {
buffer = createByteBuffer(bufferSize);
while ( true ) {
int bytes = rbc.read(buffer);
if ( bytes == -1 )
break;
if ( buffer.remaining() == 0 )
buffer = resizeBuffer(buffer, buffer.capacity() * 2);
}
}
}
buffer.flip();
return buffer;
}
}
Replace the "src/hexsweeper/hex16.png" with however you get to your files, the window with your window, and you should be set. This worked for me, hope it works for everyone else!
Note: I didn't write the bulk of this code. It was made by the wonderfully helpful lwjgl contributors apostolos, Spasi, and kappaOne.
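The crash in the question is most likely due to two things: ByteBuffer.wrap produces a heap buffer, and the buffer holds PNG-encoded bytes rather than decoded pixels, whereas GLFW expects raw 32-bit RGBA pixel data. As an alternative to STB, the following sketch decodes nothing itself but repacks an already loaded BufferedImage into RGBA bytes in a direct buffer, using only JDK classes. The class and method names are made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;

public class IconPixels {
    /**
     * Repacks an image into tightly packed 8-bit RGBA bytes, rows from top
     * to bottom, in a direct (off-heap) buffer that native code can read.
     */
    public static ByteBuffer toRgba(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        ByteBuffer pixels = ByteBuffer.allocateDirect(width * height * 4);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int argb = image.getRGB(x, y);    // packed as 0xAARRGGBB
                pixels.put((byte) (argb >> 16));  // red
                pixels.put((byte) (argb >> 8));   // green
                pixels.put((byte) argb);          // blue
                pixels.put((byte) (argb >>> 24)); // alpha
            }
        }
        pixels.flip(); // position 0, limit = number of bytes written
        return pixels;
    }
}
```

Assuming the same GLFWImage setters as in the answer above, the returned buffer could be passed to .pixels(...) in place of the stbi_load_from_memory result, with width and height taken from the BufferedImage; the stbi_image_free calls would then no longer apply.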

Weka: loading CSV file without headers

How to load a CSV file without headers in Weka?
There are a few related questions, but none seems to get to the point.
MWE
Here is the test.csv file:
20,1,"+"
30,2,"+"
30,1,"+"
15,1,"-"
10,0,"-"
Here is the Test.java code:
// javac -Xlint -cp weka.jar Test.java && java -cp .:weka.jar Test
import weka.core.converters.CSVLoader;
import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.io.File;
class Test
{
public static void main(String[] args) {
try {
CSVLoader loader = new CSVLoader();
loader.setOptions(new String[] {"-H"});
loader.setSource(new File("test.csv"));
Instances tr = loader.getDataSet();
tr.setClassIndex(tr.numAttributes() - 1);
Classifier m = (Classifier) new NaiveBayes();
m.buildClassifier(tr);
Evaluation eval = new Evaluation(tr);
eval.evaluateModel(m, tr);
System.out.println(eval.toSummaryString());
}
catch(Exception ex) {
System.out.println(ex);
}
}
}
When running, it only reports 4 instances, not 5. If I add headers, then it works correctly.
Correctly Classified Instances 4 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.0065
Root mean squared error 0.0112
Relative absolute error 1.3088 %
Root relative squared error 2.2477 %
Total Number of Instances 4
Notice I have used:
loader.setOptions(new String[] {"-H"});
I have also tried the direct API loader.setNoHeaderRowPresent(true);, but it seems to not be available in Weka 3.6.13.
References:
CSVLoader API
EDIT: It turns out this was a problem in 3.6.13. The code works fine for 3.7.10.
I am not sure about 3.6.13, but the code for 3.7.10 shows that the first row of data is added if setNoHeaderRowPresent is set to true.
You are setting it to false; set it to true. For reference, from the grepcode listing of CSVLoader:
Set whether there is no header row in the data.
Parameters: b - true if there is no header row in the data
public void setNoHeaderRowPresent(boolean b) {
m_noHeaderRow = b;
}
if (m_noHeaderRow) {
m_rowBuffer.add(firstRow);
}
So in your code use
loader.setNoHeaderRowPresent(true)
and not loader.setNoHeaderRowPresent(false) to include the first row in the data set.
As a work-around, this reads the CSV file and passes it along as an ARFF file:
// javac -Xlint -cp weka.jar Test.java && java -cp .:weka.jar Test
import weka.core.converters.CSVLoader;
import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.StringReader;
import java.lang.StringBuffer;
class Test
{
public static void main(String[] args) {
try {
String filename = "test.csv";
BufferedReader br = new BufferedReader(new FileReader(filename));
String line = br.readLine();
int cols = line.length() - line.replace(",", "").length() + 1;
StringBuilder arff = new StringBuilder("@RELATION test\n");
for(int i = 0; i < cols-1; i++) {
arff.append("@ATTRIBUTE ");
arff.append(String.valueOf((char)(i + 'a')));
arff.append(" NUMERIC\n");
}
arff.append("@ATTRIBUTE class {+,-}\n");
arff.append("@DATA\n");
while(line != null) {
arff.append(line);
arff.append("\n");
line = br.readLine();
}
System.out.println(arff.toString());
Instances tr = new Instances(new StringReader(arff.toString()));
tr.setClassIndex(tr.numAttributes() - 1);
Classifier m = (Classifier) new NaiveBayes();
m.buildClassifier(tr);
Evaluation eval = new Evaluation(tr);
eval.evaluateModel(m, tr);
System.out.println(eval.toSummaryString());
}
catch(Exception ex) {
System.out.println(ex);
}
}
}
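Since the question notes that loading works once headers are present, another possible workaround for 3.6.13 is to generate a header row before handing the data to CSVLoader. The sketch below uses only JDK classes; the class name, method name and the generated column names (a, b, ..., class) are assumptions chosen to mirror the ARFF workaround above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvHeader {
    /**
     * Returns the CSV lines with a generated header row in front:
     * a, b, c, ... for the leading columns and "class" for the last one.
     * Column counting is naive and assumes no commas inside quoted fields.
     */
    public static List<String> withSyntheticHeader(List<String> lines) {
        if (lines.isEmpty()) {
            return new ArrayList<>();
        }
        int cols = lines.get(0).split(",", -1).length;
        StringBuilder header = new StringBuilder();
        for (int i = 0; i < cols - 1; i++) {
            header.append((char) ('a' + i)).append(',');
        }
        header.append("class");
        List<String> result = new ArrayList<>(lines.size() + 1);
        result.add(header.toString());
        result.addAll(lines);
        return result;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("20,1,\"+\"", "30,2,\"+\"");
        // Prints the header line a,b,class followed by the two data lines.
        withSyntheticHeader(data).forEach(System.out::println);
    }
}
```

The resulting lines could be written to a temporary file and passed to loader.setSource(...) without the -H option. Letter-based names only make sense for files with fewer than 27 columns.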

Apache POI - formatting output to HTML

I am writing to an Excel file using Apache POI, but I want my output to be formatted as HTML not as literal text.
SXSSFWorkbook workbook = new SXSSFWorkbook();
Sheet sheet0 = workbook.createSheet("sheet0");
Row row0 = sheet0.createRow(2);
Cell cell0 = row0.createCell(2);
cell0.setCellValue("<html><b>blah blah blah</b></html>");
What appears when I open the Excel file is:
"<html><b>blah blah blah</b></html>"
but I want:
"blah blah blah"
essentially I am looking for a piece of code along the lines of:
cell0.setCellFormat(CellFormat.HTML);
Except, that doesn't exist.
Here is some information on this topic:
http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/ss/examples/html/ToHtml.java
I will try this for now:
public void printPage() throws IOException {
try {
ensureOut();
if (completeHTML) {
out.format(
"<?xml version=\"1.0\" encoding=\"iso-8859-1\" ?>%n");
out.format("<html>%n");
out.format("<head>%n");
out.format("</head>%n");
out.format("<body>%n");
}
print();
if (completeHTML) {
out.format("</body>%n");
out.format("</html>%n");
}
} finally {
if (out != null)
out.close();
if (output instanceof Closeable) {
Closeable closeable = (Closeable) output;
closeable.close();
}
}
}
Based on my version for DOCX, here is the adapted version for HSSF. As with the other version, you will have to debug and extend the loop for the various CSS styles.
Update: Yesterday I overlooked that you wanted a streaming XSSF solution, so I experimented to see whether it is possible to use only the usermodel classes (not really, when it comes to font colors). I also wondered why SXSSF did not apply any of my font settings, until I found out that this is currently by design (see Bug 52484).
import java.awt.Color;
import java.io.FileOutputStream;
import java.lang.reflect.Field;
import java.util.Enumeration;
import javax.swing.text.*;
import javax.swing.text.html.*;
import org.apache.poi.hssf.usermodel.*;
import org.apache.poi.hssf.util.HSSFColor;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.*;
public class StyledTextXls {
public static void main(String[] args) throws Exception {
HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
kit.insertHTML(doc, doc.getLength(), "<p>paragraph <b>1</b></p>", 0, 0, null);
kit.insertHTML(doc, doc.getLength(), "<p>paragraph <span style=\"color:red\">2</span></p>", 0, 0, null);
Workbook wb = new XSSFWorkbook();
// Workbook wb = new HSSFWorkbook();
// Workbook wb = new SXSSFWorkbook(100); // doesn't work yet - see Bug 52484
Sheet sheet = wb.createSheet();
Row row = sheet.createRow(0);
Cell cell = row.createCell(0);
StringBuffer sb = new StringBuffer();
for (int lines=0, lastPos=-1; lastPos < doc.getLength(); lines++) {
if (lines > 0) sb.append("\n");
Element line = doc.getParagraphElement(lastPos+1);
lastPos = line.getEndOffset();
for (int elIdx=0; elIdx < line.getElementCount(); elIdx++) {
final Element frag = line.getElement(elIdx);
String subtext = doc.getText(frag.getStartOffset(), frag.getEndOffset()-frag.getStartOffset());
sb.append(subtext);
}
}
CreationHelper ch = wb.getCreationHelper();
RichTextString rt = ch.createRichTextString(sb.toString());
for (int lines=0, lastPos=-1; lastPos < doc.getLength(); lines++) {
Element line = doc.getParagraphElement(lastPos+1);
lastPos = line.getEndOffset();
for (int elIdx=0; elIdx < line.getElementCount(); elIdx++) {
final Element frag = line.getElement(elIdx);
Font font = getFontFromFragment(wb, frag);
rt.applyFont(frag.getStartOffset()+lines, frag.getEndOffset()+lines, font);
}
}
cell.setCellValue(rt);
cell.getCellStyle().setWrapText(true);
row.setHeightInPoints((6*sheet.getDefaultRowHeightInPoints()));
sheet.autoSizeColumn((short)0);
FileOutputStream fos = new FileOutputStream("richtext"+(wb instanceof HSSFWorkbook ? ".xls" : ".xlsx"));
wb.write(fos);
fos.close();
}
static Font getFontFromFragment(Workbook wb, Element frag) {
// creating a font on each call is not very efficient
// but should be ok for this exercise ...
Font font = wb.createFont();
final AttributeSet as = frag.getAttributes();
final Enumeration<?> ae = as.getAttributeNames();
while (ae.hasMoreElements()) {
final Object attrib = ae.nextElement();
try {
if (CSS.Attribute.COLOR.equals(attrib)) {
// I don't know how to really work with the CSS-swing class ...
Field f = as.getAttribute(attrib).getClass().getDeclaredField("c");
f.setAccessible(true);
Color c = (Color)f.get(as.getAttribute(attrib));
if (font instanceof XSSFFont) {
((XSSFFont)font).setColor(new XSSFColor(c));
} else if (font instanceof HSSFFont && wb instanceof HSSFWorkbook) {
HSSFPalette pal = ((HSSFWorkbook)wb).getCustomPalette();
HSSFColor col = pal.findSimilarColor(c.getRed(), c.getGreen(), c.getBlue());
((HSSFFont)font).setColor(col.getIndex());
}
} else if (CSS.Attribute.FONT_WEIGHT.equals(attrib)) {
if ("bold".equals(as.getAttribute(attrib).toString())) {
font.setBoldweight(Font.BOLDWEIGHT_BOLD);
}
}
} catch (Exception e) {
System.out.println(attrib.getClass().getCanonicalName()+" can't be handled.");
}
}
return font;
}
}

Extract first line of CSV file in Pig

I have several CSV files and the header is always the first line in the file. What's the best way to get that line out of the CSV file as a string in Pig? Preprocessing with sed, awk etc is not an option.
I've tried loading the file with the regular PigStorage and the Piggybank CsvLoader, but it's not clear to me how I can get that first line, if at all.
I'm open to writing an UDF, if that's what it takes.
Disclaimer: I'm not great with Java.
You are going to need a UDF. I'm not sure exactly what you are asking for, but this UDF will take a series of CSV files and turn them into maps, where the keys are the values at the top of the file. This should hopefully be enough of a skeleton so that you can change it into what you want.
The couple of tests I've done remotely and locally indicate that this will work.
package myudfs;
import java.io.IOException;
import org.apache.pig.LoadFunc;
import java.util.Map;
import java.util.HashMap;
import java.util.ArrayList;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
public class ExampleCSVLoader extends LoadFunc {
protected RecordReader in = null;
private String fieldDel = "" + '\t';
private Map<String, String> outputMap = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();
// This stores the fields that are defined in the first line of the file
private ArrayList<Object> topfields = null;
public ExampleCSVLoader() {}
public ExampleCSVLoader(String delimiter) {
this();
this.fieldDel = delimiter;
}
@Override
public Tuple getNext() throws IOException {
try {
boolean notDone = in.nextKeyValue();
if (!notDone) {
outputMap = null;
topfields = null;
return null;
}
String value = in.getCurrentValue().toString();
String[] values = value.split(fieldDel);
Tuple t = mTupleFactory.newTuple(1);
ArrayList<Object> tf = new ArrayList<Object>();
int pos = 0;
for (int i = 0; i < values.length; i++) {
if (topfields == null) {
tf.add(values[i]);
} else {
readField(values[i], pos);
pos = pos + 1;
}
}
if (topfields == null) {
topfields = tf;
t = mTupleFactory.newTuple();
} else {
t.set(0, outputMap);
}
outputMap = null;
return t;
} catch (InterruptedException e) {
int errCode = 6018;
String errMsg = "Error while reading input";
throw new ExecException(errMsg, errCode,
PigException.REMOTE_ENVIRONMENT, e);
}
}
// Applies foo to the appropriate value in topfields
private void readField(String foo, int pos) {
if (outputMap == null) {
outputMap = new HashMap<String, String>();
}
outputMap.put((String) topfields.get(pos), foo);
}
@Override
public InputFormat getInputFormat() {
return new TextInputFormat();
}
@Override
public void prepareToRead(RecordReader reader, PigSplit split) {
in = reader;
}
@Override
public void setLocation(String location, Job job)
throws IOException {
FileInputFormat.setInputPaths(job, location);
}
}
Sample output loading a directory with:
csv1.in          csv2.in
-------          -------
A|B|C            D|E|F
Hello|This|is    PLEASE|WORK|FOO
FOO|BAR|BING     OR|EVERYTHING|WILL
BANG|BOSH        BE|FOR|NAUGHT
Produces this output:
A: {M: map[]}
()
([D#PLEASE,E#WORK,F#FOO])
([D#OR,E#EVERYTHING,F#WILL])
([D#BE,E#FOR,F#NAUGHT])
()
([A#Hello,B#This,C#is])
([A#FOO,B#BAR,C#BING])
([A#BANG,B#BOSH])
The ()s are the top lines of the file. getNext() requires that we return something, otherwise the file will stop being processed. Therefore they return a null schema.
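The header-to-map logic of getNext() can be tried outside Pig. The following sketch uses only JDK classes and mirrors what the loader does per file: the first line supplies the keys, and every later line becomes a map. The class and method names are hypothetical; the header array in the sketch is the first line that the question asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class HeaderRows {
    /**
     * Treats the first line as the header and maps each following line's
     * values to the header names by position. Short rows yield fewer keys;
     * values beyond the header width are ignored.
     */
    public static List<Map<String, String>> toMaps(List<String> lines, String delimiter) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        // Quote the delimiter: String.split takes a regex, and "|" is a metacharacter.
        String regex = Pattern.quote(delimiter);
        String[] header = lines.get(0).split(regex);
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(regex);
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < values.length && i < header.length; i++) {
                row.put(header[i], values[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> csv1 = Arrays.asList("A|B|C", "Hello|This|is", "BANG|BOSH");
        System.out.println(toMaps(csv1, "|"));
        // prints [{A=Hello, B=This, C=is}, {A=BANG, B=BOSH}]
    }
}
```

Quoting the delimiter is a deliberate difference from the UDF, which passes fieldDel to String.split as a regular expression; with a delimiter such as | it may be necessary to pass an escaped pattern to the loader's constructor.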
If your CSV files comply with the CSV conventions of Excel 2007, you can use the loader already available in Piggybank: http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java?view=markup
It has an option, SKIP_INPUT_HEADER, to skip the CSV header.