How can I write data to a CSV file from the Prolog code below?
Thanks,
SB
run :-
    member(A, [a,b,c,d,e]),
    member(B, [1,2,3,4,5,6]),
    write(A), write(' '), write(B), nl,
    fail.
run.
Simple solution
Since you are using SWI-Prolog, you can use the CSV library.
?- use_module(library(csv)).
?- findall(row(A,B), (member(A, [a,b,c,d,e]), member(B, [1,2,3,4,5])), Rows), csv_write_file('output.csv', Rows).
As you can see, I do this in two steps:
I create terms of the form row(A, B).
Then I hand those terms to csv_write_file/2 which takes care of creating a syntactically correct output file.
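If the call succeeds, each row/2 term becomes one line of the file, so output.csv should start roughly like this (25 rows in total):
a,1
a,2
a,3
...
e,5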
Non-standard separators
In your question you are not writing a comma between A and B but a space. If you really want to use the space as a separator you can set this as an option:
csv_write_file('output.csv', Rows, [separator(0' )])
'Unbalanced' arguments
Also, in your question you have more values for B than for A. You can write code that handles this, but there are several ways in which this can be dealt with. E.g., (1) you can fill missing cells with a placeholder such as null; (2) you can throw an exception if same_length(As, Bs) fails; (3) you can write only the 'full' rows, truncating the longer list:
% keep only the first N elements of each list, where N is the shorter length
length(As0, N1),
length(Bs0, N2),
N is min(N1, N2),
length(As, N),
append(As, _, As0),
length(Bs, N),
append(Bs, _, Bs0),
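For option (1), here is a minimal sketch of the padding approach (untested; the predicate name pad_to/3 and the placeholder atom null are my own choices, not part of library(csv)):
% Pad List0 with the atom `null` until it has length N (illustrative helper).
pad_to(N, List0, List) :-
    length(List0, Len),
    (   Len >= N
    ->  List = List0
    ;   Padding is N - Len,
        length(Extra, Padding),
        maplist(=(null), Extra),
        append(List0, Extra, List)
    ).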
Related
I am having issues reading a .dat file into a dataframe. I think the issue is with the delimiter. I have included a screen shot of what the data in the file looks like below. My best guess is that it is tab delimited between columns and then new-line delimited between rows. I have tried reading in the data with the following commands:
df = CSV.File("FORCECHAIN00046.dat"; header=false) |> DataFrame!
df = CSV.File("FORCECHAIN00046.dat"; header=false, delim = ' ') |> DataFrame!
My result either way is just a DataFrame with only one column including all the data from each column concatenated into one string. I even tried to specify the types with the following code:
df = CSV.File("FORCECHAIN00046.dat"; types=[Float64,Float64,Float64,Float64,
Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64]) |> DataFrame!
And I received the following error:
┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CSV.jl/issues so package authors can investigate
I can work around this by uploading it into google sheets and then downloading a csv, but I would like to find a way to make the original .dat file work.
Part of the issue here is that .dat is not a proper file format—it's just something that seems to be written out in a somewhat human-readable format with columns of numbers separated by variable numbers of spaces so that the numbers line up when you look at them in an editor. Google Sheets has a lot of clever tricks built in to "do what you want" for all kinds of ill-defined data files, so I'm not too surprised that it manages to parse this. The CSV package on the other hand supports using a single character as a delimiter or even a multi-character string, but not a variable number of spaces like this.
Possible solutions:
if the files aren't too big, you could easily roll your own parser that splits each line and then builds a matrix (see the sketch after this list)
you can also pre-process the file turning multiple spaces into single spaces
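For the first option, here is a rough sketch (untested; the function name read_dat_matrix is my own) that parses each whitespace-separated line into Float64 values and stacks the rows into a matrix:
# Untested sketch: parse whitespace-separated numeric lines into a matrix.
function read_dat_matrix(path::AbstractString)
    rows = [parse.(Float64, split(line)) for line in eachline(path) if !isempty(strip(line))]
    return reduce(vcat, permutedims.(rows))  # stack 1×n row vectors into an m×n matrix
end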
Pre-processing is probably the easiest way to do this, and here's some Julia code (untested since you didn't provide test data) that will open your file and convert it to a more reasonable format:
function dat2csv(dat_path::AbstractString, csv_path::AbstractString)
    open(csv_path, write=true) do io
        for line in eachline(dat_path)
            join(io, split(line), ',')
            println(io)
        end
    end
    return csv_path
end

function dat2csv(dat_path::AbstractString)
    base, ext = splitext(dat_path)
    ext == ".dat" ||
        throw(ArgumentError("file name doesn't end with `.dat`"))
    return dat2csv(dat_path, "$base.csv")
end
You would call this function as dat2csv("FORCECHAIN00046.dat") and it would create the file FORCECHAIN00046.csv, which would be a proper CSV file using commas as delimiters. That won't work well if the files contain any values with commas in them, but it looks like they are just numbers, in which case it should be fine. So you can use this function to convert the files to proper CSV and then load that file with the CSV package.
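For example (untested; note that recent DataFrames versions use DataFrame rather than DataFrame!):
using CSV, DataFrames
df = CSV.File(dat2csv("FORCECHAIN00046.dat")) |> DataFrame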
A little explanation of the code:
the two-argument dat2csv method opens csv_path for writing and then calls eachline on dat_path to read one line from it at a time
eachline strips any trailing newline from each line, so each line will be a bunch of numbers separated by whitespace with some leading and/or trailing whitespace
split(line) does the default splitting of line which splits it on whitespace, dropping any empty values—this leaves just the non-whitespace entries as strings in an array
join(io, split(line), ',') joins the strings in the array together, separated by the , character and writes that to the io write handle for csv_path
println(io) writes a newline after that—otherwise everything would just end up on a single very long line
the one-argument dat2csv method calls splitext to split the file name into a base name and an extension, checking that the extension is the expected .dat and calling the two-argument version with the .dat replaced by .csv
Try using the readdlm function from the DelimitedFiles standard library, and convert to a DataFrame afterwards:
using DelimitedFiles, DataFrames
df = DataFrame(readdlm("FORCECHAIN00046.dat"), :auto)
I habitually use csvRead in Scilab to read my data files; however, I am now faced with one which contains blocks of 200 rows, preceded by 3 lines of headers, all of which I would like to take into account.
I've tried specifying a range of data following the example on the Scilab help website for csvRead (the example is right at the bottom of the page: https://help.scilab.org/doc/6.0.0/en_US/csvRead.html), but I always come out with the same error messages:
The line and/or column indices are outside of the limits
or
Error in the column structure.
My first three lines are headers which I know can cause a problem but even if I omit them from my block-range, I still have the same problem.
Otherwise, my data is ordered such that I have my three lines of headers (two lines containing a header over just one or two columns, one line containing a header over all columns), 200 lines of data, and a blank line. This represents data from one image, and I have about 500 images in the file. I would like to be able to read and process all of them and keep track of the headers, because they state the image number, which I need to reference later. Example:
DTN-dist_Devissage-1_0006_0,,,,,,
L0,,,,,,
X [mm],Y [mm],W [mm],exx [1] - Lagrange,eyy [1] - Lagrange,exy [1] - Lagrange,Von Mises Strain [1] - Lagrange
-1.13307,-15.0362,-0.00137507,7.74679e-05,8.30045e-05,5.68249e-05,0.00012711
-1.10417,-14.9504,-0.00193334,7.66086e-05,8.02914e-05,5.43132e-05,0.000122655
-1.07528,-14.8647,-0.00249155,7.57493e-05,7.75786e-05,5.18017e-05,0.0001182
Does anyone have a solution to this?
My current code, following an adapted version of the Scilab help example, looks like this (I have tried varying the blocksize and iblock values to include/omit headers):
blocksize = 200;
C1 = 1;
C2 = 14;
iblock = 1
while (%t)
    R1 = (iblock-1)*blocksize + 4;
    R2 = blocksize + R1 - 1;
    irange = [R1 C1 R2 C2];
    V = csvRead(filepath+filename, ",", ".", "", [], "", irange);
    iblock = iblock + 1
end
Errors
The CSV
A lot of your problems come from the inconsistent number of commas in your CSV file. Opening it in LibreOffice Calc and saving it puts the right number of commas on every line, even on empty lines.
R1
Your current code doesn't position R1 at the beginning of the values. The right formula is
R1=(iblock-1)*(blocksize+blanksize+headersize)+1+headersize;
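For example, with headersize = 3, blocksize = 200 and blanksize = 1, the first block of values starts at R1 = 0*204 + 1 + 3 = 4 and the second block at R1 = 1*204 + 1 + 3 = 208.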
End of file
Currently your code raises an error at the end of the file because R1 becomes greater than the number of lines. To solve this, you can specify the maximum number of blocks or test the value of R1 against the number of lines.
Improved solution for much bigger files
When solving your problem with a big file, two issues arose:
We need to know the number of blocks or the number of lines
Each call of csvRead is really slow because it processes the whole file at each call (about 1 s per block!)
My idea was to read the whole file and store it in a string matrix (mgetl has been improved since 6.0.0), then use csvTextScan on a submatrix. Doing so also removes the need to hard-code the number of blocks/lines.
The code follows:
clear all
clc
s = filesep()
filepath = '.' + s;
filename = 'DTN_full.csv';
// the header is important as it has the image name
headersize = 3;
blocksize = 200;
C1 = 1;
C2 = 14;
iblock = 1
// let's save everything. Good for the example.
bigstruct = struct();
// Read all the values in one pass;
// using csvTextScan afterwards is much more efficient
text = mgetl(filepath + filename);
nlines = size(text, 'r');
while ( %t )
    mprintf("Block #%d", iblock);
    // Let's read the header
    R1 = (iblock-1)*(headersize+blocksize+1) + 1;
    R2 = R1 + headersize - 1;
    // if R1 or R2 is bigger than the number of lines, stop
    if sum([R1,R2] > nlines)
        mprintf('; End of file\n')
        break
    end
    // We use csvTextScan only on the lines that matter;
    // this speeds up the program, since csvRead reads the whole file
    // every time it is used.
    H = csvTextScan(text(R1:R2), ",", ".", "string");
    mprintf("; %s", H(1,1))
    R1 = R1 + headersize;
    R2 = R1 + blocksize - 1;
    if sum([R1,R2] > nlines)
        mprintf('; End of file\n')
        break
    end
    mprintf("; rows %d to %d\n", R1, R2)
    // Let's read the values
    V = csvTextScan(text(R1:R2), ",", ".", "double");
    iblock = iblock + 1
    // Let's save these data
    bigstruct(H(1,1)) = V;
end
and returns
Block #1; DTN-dist_0005_0; rows 4 to 203
....
Block #178; DTN-dist_0710_0; rows 36112 to 36311
Block #179; End of file
Time elapsed 1.827092s
I have a CSV file that has columns that contain arrays. The file looks like this:
siblings
[3010,3011,3012]
[2010,2012]
What I am trying to do is to return the elements of the arrays as separate rows.
I looked through the documentation and found the UNWIND function. When I tried the example shown in the documentation:
UNWIND [1, 2, 3] AS x
RETURN x
everything worked fine. When I tried it with my data the query looked like this:
LOAD CSV WITH HEADERS FROM
'file:///test.csv' AS line1
UNWIND line1.siblings as a
RETURN a
and the result was
[3010,3011,3012]
[2010,2012]
instead of:
3010
3011
3012
...
Does anyone know what I am doing wrong?
The CSV columns are all handled as strings, so you need to treat them as such. In your case you can remove the brackets from the siblings column prior to splitting the string into a collection; that step would help you avoid pre-processing the file. Once you split the string, you still have a collection of strings.
WITH "[1001,1002,1003]" as numbers_with_brackets_in_a_string
WITH substring( numbers_with_brackets_in_a_string, 1, size(numbers_with_brackets_in_a_string)-2 ) as numbers_in_a_string
UNWIND split( numbers_in_a_string, ',' ) as number_as_a_string
RETURN number_as_a_string
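Applying the same idea to your query would look roughly like this (a sketch; it assumes a Neo4j version that provides toInteger, which you can drop if you are happy with strings):
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line1
WITH substring(line1.siblings, 1, size(line1.siblings) - 2) AS numbers_in_a_string
UNWIND split(numbers_in_a_string, ',') AS number_as_a_string
RETURN toInteger(number_as_a_string) AS sibling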
I have found an ugly way of fixing this.
I did the following thing:
LOAD CSV WITH HEADERS FROM
'file:///test.csv' AS line1
WITH line1, split(line1.siblings, ",") as s
UNWIND s as sib
RETURN sib
after which I got the following thing:
[3010
3011
3012]
...
I removed [] from the CSV file and I got the output that I desired.
I know this is an ugly solution and I would appreciate it if someone could find a better one.
I have a CSV file example.csv which contains two columns with headers var1 and var2.
I want to populate an initially empty Prolog knowledge base file import.pl with repeated facts, where each row of example.csv is treated the same way:
fact(A1, A2).
fact(B1, B2).
fact(C1, C2).
How can I code this in SWI-Prolog?
EDIT, based on answer from #Shevliaskovic:
:- use_module(library(csv)).
import:-
csv_read_file('example.csv', Data, [functor(fact), separator(0';)]),
maplist(assert, Data).
When import. is run in the console, we update the knowledge base exactly the way it is requested (except that the knowledge base is directly updated in memory, rather than doing this via a file and subsequent consult).
Check setof([X, Y], fact(X,Y), Z):
Z = [['A1', 'A2'], ['B1', 'B2'], ['C1', 'C2'], [var1, var2]].
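If you also want the facts written out to import.pl (as the question asks) rather than only asserted in memory, here is a minimal sketch (untested; the predicate name import_to_file is my own) using portray_clause/2:
import_to_file :-
    csv_read_file('example.csv', Data, [functor(fact), separator(0';)]),
    maplist(assert, Data),
    setup_call_cleanup(
        open('import.pl', write, Out),
        forall(member(Fact, Data), portray_clause(Out, Fact)),
        close(Out)
    ).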
SWI-Prolog has a built-in predicate for this.
It is
csv_read_file(+File, -Rows)
Or you can add some options:
csv_read_file(+File, -Rows, +Options)
You can see the documentation for more information.
Here is the example that the documentation has:
Suppose we want to create a predicate table/6 from a CSV file that we
know contains 6 fields per record. This can be done using the code
below. Without the option arity(6), this would generate a predicate
table/N, where N is the number of fields per record in the data.
?- csv_read_file(File, Rows, [functor(table), arity(6)]),
   maplist(assert, Rows).
For example:
If you have a File.csv that looks like:
A1 A2
B1 B2
C1 C2
You can import it to SWI like:
9 ?- csv_read_file('File.csv', Data).
The result would be:
Data = [row('A1', 'A2'), row('B1', 'B2'), row('C1', 'C2')].
I wonder if it is possible, in Perl/MySQL, to build a list of variant words based on a given word, covering the common OCR errors that word may contain (e.g. 8 instead of b)? In other words, if I have a list of words, and in that list is the word "Alphabet", then is there a way to extend or build a new list to include my original word plus the OCR error variants of "Alphabet"? So in my output, I could have the following variants of Alphabet perhaps:
Alphabet
A1phabet
Alpha8et
A1pha8et
Of course it would be useful to code for most if not all of the common errors that appear in OCR'ed text. Things like 8 instead of b, or 1 instead of l. I'm not looking to fix the errors, because in my data itself I could have OCR errors, but I want to create a variant list of words as my output based on a list of words I give it as an input. So in my data, I may have Alpha8et, but if I do a simple search for Alphabet, it won't find this obvious error.
My quick and dirty MySQL approach
Select * from
(SELECT Word
FROM words
union all
-- Rule 1 (8 instead of b)
SELECT
case
when Word regexp 'b|B' = 1
then replace(replace(Word, 'B','8'),'b','8')
end as Word
FROM words
union all
-- Rule 2 (1 instead of l)
SELECT
case
when Word regexp 'l|L' = 1
then replace(replace(Word, 'L','1'),'l','1')
end as Word
FROM words) qry
where qry.Word is not null
order by qry.Word;
I'm thinking there must be a more automated and cleaner method
If you have examples of scanned texts with both the as-scanned (raw) version, and the corrected version, it should be relatively simple to generate a list of the character corrections. Gather this data from enough texts, then sort it by frequency. Decide how frequent a correction has to be for it to be "common," then leave only the common corrections in the list.
Turn the list into a map keyed by the correct letter; the value being an array of the common mis-scans for that letter. Use a recursive function to take a word and generate all of its variations.
This example, in Ruby, shows the recursive function. Gathering up the possible mis-scans is up to you:
VARIATIONS = {
  'l' => ['1'],
  'b' => ['8'],
}

def variations(word)
  return [''] if word.empty?
  first_character = word[0..0]
  remainder = word[1..-1]
  possible_first_characters =
    [first_character] | VARIATIONS.fetch(first_character, [])
  possible_remainders = variations(remainder)
  possible_first_characters.product(possible_remainders).map(&:join)
end
p variations('Alphabet')
# => ["Alphabet", "Alpha8et", "A1phabet", "A1pha8et"]
The original word is included in the list of variations. If you want only possible mis-scans, then remove the original word:
def misscans(word)
  variations(word) - [word]
end
p misscans('Alphabet')
# => ["Alpha8et", "A1phabet", "A1pha8et"]
A quick-and-dirty (and untested) version of a command-line program would couple the above functions with this "main" function:
input_path, output_path = ARGV
File.open(input_path, 'r') do |infile|
  File.open(output_path, 'w') do |outfile|
    while word = infile.gets
      # chomp the trailing newline so it isn't treated as part of the word
      outfile.puts misscans(word.chomp)
    end
  end
end
An efficient way of achieving this is to use the bitap algorithm. Perl has re::engine::TRE, a binding to libtre, which implements fuzzy string matching in regexps:
use strict;
use warnings qw(all);
use feature qw(say);    # needed for say()
use re::engine::TRE max_cost => 1;

# match "Perl", allowing at most one error
if ("A pearl is a hard object produced..." =~ /\(Perl\)/i) {
    say $1; # finds "pearl"
}
Plus, there is agrep tool which allows you to use libtre from the command line:
$ agrep -i -E 1 peArl *
fork.pl:#!/usr/bin/env perl
geo.pl:#!/usr/bin/env perl
leak.pl:#!/usr/local/bin/perl
When you need to match several words against the OCRized text, there are two distinct approaches.
You could simply build one regexp with your entire dictionary, if it is small enough:
/(Arakanese|Nelumbium|additionary|archarios|corbeil|golee|layer|reinstill)/
Large dictionary queries can be optimized by building a trigram index.
Perl has String::Trigram for doing this in memory.
Several RDBMSes also have trigram index extensions. PostgreSQL's pg_trgm extension allows you to write queries like this, which are fast enough even for really big dictionaries:
SELECT DISTINCT street, similarity(street, word)
FROM address_street
JOIN (
SELECT UNNEST(ARRAY['higienopolis','lapa','morumbi']) AS word
) AS t0 ON street % word;
(this one took ~70ms on a table with ~150K rows)