Duplicate file finder based on comparing the first 24 characters of file names

I have more than 12,000 files and I need to compare the first 24 characters of each file name. If a match is found, the duplicate needs to be deleted.
For example:
dat_2016_08_13_11_01_02_1339.data
dat_2016_08_13_11_01_02_2140.data
dat_2016_08_13_12_47_33_1362.data
dat_2016_08_13_13_12_03_1062.data
dat_2016_08_13_13_12_03_0217.data
In the above case:
File 1 is a duplicate of file 2 because the first 24 characters "dat_2016_08_13_11_01_02" are the same.
File 4 is a duplicate of file 5 because the first 24 characters "dat_2016_08_13_13_12_03" are the same.
FYI, the creation dates of files 1 and 2 are different, so I can't use attribute comparison.
I have gone through various duplicate file finder tools, but none of them offers a way to customize the comparison technique for my particular need.
Can anyone suggest a way to do it?

Here is a program I quickly coded; see if this helps.
Remember to change the directory.
import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class Duplicates {
    public static void main(String[] args) {
        File dir = new File("C:/testfolder"); // insert your directory here
        File[] files = dir.listFiles();
        if (files == null) {
            System.out.println("Directory not found: " + dir);
            return;
        }
        Set<String> seenPrefixes = new HashSet<String>();
        for (File file : files) {
            String name = file.getName();
            if (name.length() < 24) {
                continue; // name too short to contain a 24-character prefix
            }
            String prefix = name.substring(0, 24);
            // The first file seen with a given prefix is kept; any later one is a duplicate.
            if (!seenPrefixes.add(prefix)) {
                System.out.println("deleted file: " + file.getAbsolutePath());
                file.delete();
            }
        }
        System.out.println("DONE!!");
    }
}
I have not tested this with a large number of files, but it should work (it might take a minute or two).
Hope this helps :)


SSIS flat file with values containing text qualifier

I received a flat file that cannot be generated any other way. The delimiter is a comma and the text qualifier is a double quote. The problem is that sometimes I have a double quote inside a value. For example:
"0","12345", "Centre d"edu et de recherche", "B8E7"
Because of the double quote in the value, I receive these errors:
[Flat File Source [58]] Error: The column delimiter for column "XYZ" was not found.
[Flat File Source [58]] Error: An error occurred while processing file "C:\somefile.csv" on data row 296.
What can I do to process this file?
I use SSIS 2016 with Visual Studio 2015.
You can use the Flat File Source error output to redirect bad rows to another flat file and correct the values manually, while all valid rows are processed.
There are many links online to learn more about Flat File Source Error Output:
Flat File source Error Output connection in SSIS
How to Avoid Package Design Flaws When Sourcing Data From Flat Files
Flat File Source Editor (Error Output Page)
Update 1 - Workaround using Script Component and conditional split
Since the Flat File error output is not working, you can use a Script Component with a Conditional Split to filter out bad rows. The following update is a step-by-step guide to implement that:
Add a Flat File connection manager, go to the Advanced tab, delete all columns except one and change its length to 4000.
Add a Script Component, go to the Inputs and Outputs tab, add the desired output columns (in this example, 4 columns) and add a Flag column of type DT_BOOL.
Inside the Script Component, write the following script: if the number of columns is 4, set Flag to true (meaning this is a valid row); otherwise set Flag to false (meaning this is a bad row):
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (!Row.Column0_IsNull && !String.IsNullOrWhiteSpace(Row.Column0))
        {
            // Split on the "," sequence that separates properly qualified values.
            string[] cells = Row.Column0.Split(new string[] { "\",\"" }, StringSplitOptions.None);

            if (cells.Length == 4)
            {
                // Valid row: strip the leading/trailing qualifier and map the columns.
                Row.Col1 = cells[0].TrimStart('\"');
                Row.Col2 = cells[1];
                Row.Col3 = cells[2];
                Row.Col4 = cells[3].TrimEnd('\"');
                Row.Flag = true;
            }
            else
            {
                // Wrong number of columns: mark the row as bad.
                Row.Flag = false;
            }
        }
        else
        {
            // Empty line: output NULLs but keep the row flagged as valid.
            Row.Col1_IsNull = true;
            Row.Col2_IsNull = true;
            Row.Col3_IsNull = true;
            Row.Col4_IsNull = true;
            Row.Flag = true;
        }
    }
}
Add a Conditional Split to split the rows based on the Flag column.
Map the Valid Rows output to the OLEDB Destination, and the Bad Rows output to another flat file where you only map Column0.
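For the split condition itself, a single case with an expression such as Flag == TRUE (assuming the boolean column is named Flag, as in the script above) routes the valid rows, and the default output then carries the bad rows.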

How to start inserting rows after some specified number into a MySQL database using Pentaho?

Basically, what I want to do is this:
I have a CSV file containing 10,000 rows that I want to insert into the database. When I start my transformation, I want to start inserting into the database after 4,500 rows.
So I want to skip the number of rows that I specified.
How can I achieve that?
Any help would be great.
Image description: I simply create a transformation that reads data from the CSV and writes to the database. I do not know which step will help me achieve this.
Note: I have attached my simple transformation.
I haven't found a step that counts the rows processed, but you can use the "User Defined Java Class" step to count the rows and drop the first 4,500 with code like this:
// This will be the counter.
Long rowCount;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    if (first) {
        rowCount = 0L;
        first = false;
    }

    Object[] r = getRow();
    if (r == null) {
        setOutputDone();
        return false;
    }

    // Increment the counter.
    rowCount++;

    // Check the counter: don't output the current row if it's less than 4501.
    if (rowCount > 4500L) {
        Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
        // Adds the row count to a stream field (an output field named "Count" is defined in the step).
        get(Fields.Out, "Count").setValue(outputRow, rowCount);
        putRow(data.outputRowMeta, outputRow);
    }
    return true;
}
I used the following Kettle file, and that solved my problem.
Thanks to @WorkingHard and @jxc

Can a logback message field be truncated/trimmed to n characters?

Sometimes I see huge log messages and do not always have the ability to (easily) turn off word wrapping.
Is there a way to truncate %message to, say, 80 characters via logback.xml?
Have a look at the format modifiers section:
From http://logback.qos.ch/manual/layouts.html#formatModifiers:
Format modifiers
By default the relevant information is output as-is. However, with the aid of format modifiers it is possible to change the minimum and maximum width and the justifications of each data field.
...
Truncation from the end is possible by appending a minus character right after the period. In that case, if the maximum field width is eight and the data item is ten characters long, then the last two characters of the data item are dropped.
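So, for the 80-character cap asked about here, you can put the maximum-width modifier directly on the message conversion word. A minimal pattern would be something like this (everything around %.-80msg is just an illustrative layout):
<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %.-80msg%n</pattern>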
Adrian's answer is great if you only need to truncate the message. However, in my case I wanted to append "... [truncated]" to messages that really were truncated.
I used the custom converters mechanism for this purpose, by performing the following steps:
Define your custom converter:
import ch.qos.logback.classic.pattern.ClassicConverter;
import ch.qos.logback.classic.spi.ILoggingEvent;

public class LongMessagesConverter extends ClassicConverter {
    private static final int MAX_FORMATTED_MESSAGE_LENGTH = 25600;
    private static final String TRUNCATION_SUFFIX = "... [truncated]";
    private static final int TRUNCATED_MESSAGE_SIZE =
            TRUNCATION_SUFFIX.length() + MAX_FORMATTED_MESSAGE_LENGTH;

    @Override
    public String convert(ILoggingEvent event) {
        String formattedMessage = event.getFormattedMessage();
        if (formattedMessage == null ||
                formattedMessage.length() < MAX_FORMATTED_MESSAGE_LENGTH) {
            return formattedMessage;
        }
        // Cut the message at the limit and flag it as truncated.
        return new StringBuilder(TRUNCATED_MESSAGE_SIZE)
                .append(formattedMessage.substring(0, MAX_FORMATTED_MESSAGE_LENGTH))
                .append(TRUNCATION_SUFFIX)
                .toString();
    }
}
Add to your logback.xml the following definition:
<conversionRule conversionWord="boundedMsg" converterClass="your.package.LongMessagesConverter"/>
Replace the %msg token with %boundedMsg in your message format pattern.
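For example (again, everything around %boundedMsg is just an illustrative layout):
<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %boundedMsg%n</pattern>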

How to import comma delimited text file into datawindow (powerbuilder 11.5)

Hi, good day. I'm very new to PowerBuilder and I'm using PB 11.5.
Does someone know how to import a comma-delimited text file into a DataWindow?
Example text file:
"1234","20141011","Juan, Delacruz","Usa","001992345456"...
"12345","20141011","Arc, Ino","Newyork","005765753256"...
How can I import the third column, which is the full name, and the last column, which is the account number? I want to transfer the name and account number into my external DataWindow. I've tried to use ImportString (all the rows are being transferred into one column only). I have three fields in my external DataWindow: the name and the account number.
Here's the code:
ls_File = dw_2.Object.file_name[1]
li_FileHandle = FileOpen(ls_File)
li_FileRead = FileRead(li_FileHandle, ls_Text)
DO WHILE li_FileRead > 0
    li_Count ++
    li_FileRead = FileRead(li_FileHandle, ls_Text)
    ll_row = dw_1.ImportString(ls_Text, 1)
LOOP
Please help me with the code! Thank you.
It seems that PB expects a tab-separated csv file by default (while the 'c' in 'csv' stands for 'comma'...).
Add the CSV! enumerated value to the arguments of ImportString() and it should fix the problem (it does in my test box).
Also, the columns defined in your dataobject must match the columns in the csv file (at least for the first columns you are interested in). If there are more columns in the csv file, they will be ignored. But if you want to get the 1st (or 2nd) and 3rd columns, you need to define the first 3 columns. You can always hide #1 or #2 if you do not need it.
BTW, your code has some issues:
You should always test the return values of function calls like FileOpen() so you can stop processing in case of a non-existent / non-readable file.
You are reading the text file twice for the first row: once before the loop and again inside the loop. Or maybe that is intended, to ignore a first line with column headers?
FWIW, here is working code based on yours:
string ls_file = "c:\dev\powerbuilder\experiment\data.csv"
string ls_text
int li_FileHandle, li_fileread, li_count
long ll_row

li_FileHandle = FileOpen(ls_File)
if li_FileHandle < 1 then
    return
end if
li_FileRead = FileRead(li_FileHandle, ls_Text)
DO WHILE li_FileRead > 0
    li_Count ++
    ll_row = dw_1.ImportString(csv!, ls_Text, 1)
    li_FileRead = FileRead(li_FileHandle, ls_Text) // read next line
LOOP
fileclose(li_fileHandle)
Use the datawindow_name.ImportFile(CSV!, file_path) method.

Excluding Content From SQL Bulk Insert

I want to import my IIS logs into SQL for reporting using Bulk Insert, but the comment lines - the ones that start with a # - cause a problem because those lines do not have the same number of fields as the data lines.
If I manually delete the comments, I can perform a bulk insert.
Is there a way to perform a bulk insert while excluding lines based on a match, such as any line that begins with a "#"?
Thanks.
The approach I generally use with BULK INSERT and irregular data is to push the incoming data into a temporary staging table with a single VARCHAR(MAX) column.
Once it's in there, I can use more flexible decision-making tools like SQL queries and string functions to decide which rows I want to select out of the staging table and bring into my main tables. This is also helpful because BULK INSERT can be maddeningly cryptic about why and how it fails on a specific file.
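A rough sketch of that approach (the table and file names here are only placeholders, and parsing the good rows into real columns is left to whatever string logic fits your log format):
-- Staging table: one wide column that holds each raw log line.
CREATE TABLE dbo.IisLogStaging (RawLine VARCHAR(MAX));
-- Load the file as-is; the # comment lines simply land in the staging table too.
-- (IIS W3C logs are space-delimited, so the default tab field terminator never splits a line.)
BULK INSERT dbo.IisLogStaging
FROM 'C:\logs\u_ex160813.log'
WITH (ROWTERMINATOR = '\n');
-- Keep only the real data rows; from here you can parse RawLine into your main tables.
SELECT RawLine
FROM dbo.IisLogStaging
WHERE RawLine NOT LIKE '#%';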
The only other option I can think of is using pre-upload scripting to trim comments and other lines that don't fit your tabular criteria before you do your bulk insert.
I recommend using logparser.exe instead. LogParser has some pretty neat capabilities on its own, but it can also be used to format the IIS log to be properly imported by SQL Server.
Microsoft has a tool called "PrepWebLog" (http://support.microsoft.com/kb/296093) which strips out these hash/pound characters; however, I'm running it now (using a PowerShell script for multiple files) and am finding its performance intolerably slow.
I think it'd be faster if I wrote a C# program (or maybe even a macro).
Update: PrepWebLog just crashed on me. I'd avoid it.
Update #2: I looked at PowerShell's Get-Content and Set-Content commands but didn't like the syntax and the likely performance. So I wrote this little C# console app:
if (args.Length == 2)
{
    string path = args[0];
    string outPath = args[1];

    // Matches whole lines that start with # (the IIS log comment lines), including the line break.
    Regex hashString = new Regex("^#.+\r\n", RegexOptions.Multiline | RegexOptions.Compiled);

    foreach (string file in Directory.GetFiles(path, "*.log"))
    {
        string data;
        using (StreamReader sr = new StreamReader(file))
        {
            data = sr.ReadToEnd();
        }

        // Strip the comment lines and write the cleaned file to the output folder.
        string output = hashString.Replace(data, string.Empty);
        using (StreamWriter sw = new StreamWriter(Path.Combine(outPath, new FileInfo(file).Name), false))
        {
            sw.Write(output);
        }
    }
}
else
{
    Console.WriteLine("Source and Destination Log Path required or too many arguments");
}
It's pretty quick.
Following up on what PeterX wrote, I modified the application to handle large log files, since anything sufficiently large would create an out-of-memory exception. Also, since we're only interested in whether or not the first character of a line is a hash, we can just use the StartsWith() method on each line as it is read.
using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        if (args.Length == 2)
        {
            string path = args[0];
            string outPath = args[1];
            string line;

            foreach (string file in Directory.GetFiles(path, "*.log"))
            {
                using (StreamReader sr = new StreamReader(file))
                {
                    using (StreamWriter sw = new StreamWriter(Path.Combine(outPath, new FileInfo(file).Name), false))
                    {
                        // Stream line by line so large files never have to fit in memory.
                        while ((line = sr.ReadLine()) != null)
                        {
                            // Copy everything except the # comment lines.
                            if (!line.StartsWith("#"))
                            {
                                sw.WriteLine(line);
                            }
                        }
                    }
                }
            }
        }
        else
        {
            Console.WriteLine("Source and Destination Log Path required or too many arguments");
        }
    }
}