I am trying to load data from a flat file. The file is around 2.5 GB in size and the row count is close to a billion. I am using a Flat File Source inside a DFT. A few rows in the file do not follow the column pattern; for example, there is an extra delimiter, or a text qualifier appears as the value of a column. I want to skip those rows and load the rest of the rows, which have the correct format. I am using SSIS 2014. The Flat File Source inside the DFT is failing. I have set the AlwaysCheckForRowDelimiters property to false, but it still does not work. Since the file is so huge, manually opening and changing it is not possible. Kindly help.
I have the same idea as Nick.McDermaid, but I can maybe help you a bit more.
You can clean your file with a regular expression (in a script).
You just need to define a regex that matches lines with the number of delimiters you want; other lines should be deleted.
Here is a visual example executed in Notepad++
[Screenshot: Notepad++ regular-expression find example]
Here is the pattern used for my example:
^[A-Z]*;[A-Z]*;[A-Z]*;[A-Z]*$
And the data sample:
AA;BB;CC;DD
AA;BB;CC;DD
AA;BB;CC;DD;EE
AA;BB;CC;DD
AA;BB;CC
AA;BB;CC;DD
AA;BB;CC;DD
You can try it online: https://regex101.com/r/PIYIcY/1
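Since a 2.5 GB file is far too big for Notepad++, the same idea works in a short script that streams the file line by line. A minimal Python sketch, assuming a semicolon-delimited file with four columns as in the sample above (file names are placeholders):

import re

# Keep only lines with exactly three delimiters (four fields).
# [^;]* is used instead of [A-Z]* so arbitrary field values also match.
pattern = re.compile(r'^[^;]*;[^;]*;[^;]*;[^;]*$')

with open('input.txt', encoding='utf-8') as src, \
     open('clean.txt', 'w', encoding='utf-8') as dst:
    for line in src:                     # streams, so file size is not an issue
        if pattern.match(line.rstrip('\r\n')):
            dst.write(line)

The cleaned file can then be loaded with the Flat File Source as usual.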
Regards,
Arnaud
Related
I have a 3-column CSV file. The 2nd column contains numbers with a leading zero. For example:
04493434
I need to convert a .csv file into a .xls and to do that I'm using the command line tool called 'unoconv'.
It's converting as expected, however when I load up the .xls in Excel instead of showing '04493434', the cell shows '4493434' (the leading 0 has been removed).
I have tried surrounding the number in the .csv file with a single quote and a double quote however the leading 0 is still removed after conversion.
Is there a way to tell unoconv that a particular column should be of a TEXT type? I've tried to read the man page of unoconv, but the options are a little confusing.
Any help would be greatly appreciated.
Perhaps I came too late to the scene, but in case someone is looking for an answer to a similar question, this is how to do it:
unoconv -i FilterOptions=44,34,76,1,1/1/2/2/3/1 --format xls <csvFileName>
The key here is the "1/1/2/2/3/1" part, which tells unoconv that the second column's type should be "TEXT" (format code 2), leaving the first and third as "Standard" (format code 1).
You can find more info here: https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options#Token_7.2C_csv_import
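That token is just a slash-separated list of column-number/format-code pairs (1 = Standard, 2 = Text, per the wiki page above). If you need to build it for a wider file, here is a hedged Python sketch; the helper name is mine:

# Build the cell-format token of a LibreOffice CSV import FilterOptions string.
# Columns are numbered from 1; format code 2 marks a column as Text.
def cell_format_token(n_columns, text_columns):
    return '/'.join('%d/%d' % (col, 2 if col in text_columns else 1)
                    for col in range(1, n_columns + 1))

print(cell_format_token(3, {2}))   # -> 1/1/2/2/3/1, as in the command above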
BTW this is my first post here...
I am not able to load a CSV file using Weka. I have removed each and every special symbol, even using a text editor, but still no luck. I am attaching the file; I would be obliged if someone could solve this problem.
It shows "Wrong number of values, Read 31, expected 27, read token[EOL], line 3"
link : https://drive.google.com/open?id=0By7zyIPDD6HJMmthWnZLSUk5aFE
You have plenty of empty fields in your file, and if you download it as .csv, even the header gets three commas at its end.
e.g. your 6th line:
,Doug Walker,,,131,,Rob Walker,131,,Documentary,Doug Walker,Star Wars: Episode VII The Force Awakens ,8,143,,0,,,,,,,,,12,7.1,,0,,,
Similar to the suggestion in this post, you could try something like Notepad++ or another text editor to replace ",," with ",?," to fill up your gaps.
Convert NA values to ? automatically while loading
I did this, and then you get two question marks as column names in your first row, which obviously doesn't work, so change the first row to look like this:
color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,?,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,additionalColName1,additionalColName2,additionalColName3
If you now try to import your data, Weka starts telling you which lines it doesn't like and why. By the way, you did not "remove each and every special symbol"!
After removing a few lines with e.g. the Ç character, it worked.
That's just an ugly workaround: try filling the empty values properly and finding a regular expression or a better way to save your file that removes the last three commas of every line; I was just too lazy for now. But I could load it into Weka, and that's what you wanted (:
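If the file is unwieldy for an editor, here is a hedged Python sketch of the same fix using the csv module (file names are placeholders). Unlike a single ",," replace pass, it also catches consecutive and trailing empty fields:

import csv

with open('movie_data.csv', newline='') as src, \
     open('movie_data_filled.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))   # header: still fix the column names by hand, as above
    for row in reader:
        writer.writerow([field if field.strip() else '?' for field in row])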
I am currently working on a data mining project in Weka, using my own dataset that I found. The only issue is that converting my file from .csv format into .arff format is causing problems.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters and I believe there are none left. The link to my dataset is here : https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
These are the first three lines of my file, where the first is the attribute names; the file is separated by commas.
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not that good at reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R (the "RWeka" and "foreign" libraries can handle this conversion).
There is also another possibility. I was creating my CSV file and the header had a different number of elements compared to the rest of the data. So, check the header as well...!
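To find such mismatches quickly, here is a minimal Python sketch (the file name is a placeholder) that reports every row whose field count differs from the header's, mirroring Weka's error message:

import csv

with open('dataset.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print('line %d: read %d values, expected %d'
                  % (lineno, len(row), len(header)))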
I'm trying to write a CSV file with the delimiter Ctrl+A. I'm eventually going to have to write the file to Hadoop, and I'm unable to use a standard delimiter.
Currently I'm trying this:
import csv

f = open('output.csv', 'w', newline='')   # assuming f is opened roughly like this
writer = csv.writer(f, delimiter="\u0001")
for item in aList:
    writer.writerow(item)
f.close()
However, the output file doesn't appear to be written correctly when opened in Excel...
Some rows are condensed into one block, while others have one field in the first cell and the rest condensed into the second, etc.
Is the error where I'm setting up the writer object, or am I just not familiar with separating files this way?
You can try using the nonprinting "group separator" character, which can be represented in Python code as '\035' (octal for ASCII 29).
See http://www.asciitable.com/index/asciifull.gif for some other nonprinting characters if you need more.
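For example, a minimal sketch of that suggestion (the file name and data are placeholders):

import csv

rows = [['a', 'b', 'c'], ['d', 'e', 'f']]

with open('out.txt', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\035')   # '\035' is chr(29), the group separator
    writer.writerows(rows)

Note that Excel will not split on this character either, so check the output in a text or hex editor rather than in Excel.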
It may be helpful to include more context about why you want to use a nonstandard delimiter, and whether Excel parsing of the file is necessary or just a quick check to see whether the file might be parsed properly by the target system, Hadoop.
I have a set of csv files that are very simple to load into Stata using the -insheet- command. But they have very uninformative variable names. For each of these files, I also have a file of metadata consisting of two columns: the original (uninformative) variable names, and a description of what the variables actually mean. I'd like to use these metadata files to create variable labels, preferably without going through and typing up all the separate label commands or turning the metadata file into a dictionary for each file. It seems like there must be a quick way of loading the metadata file into Stata and looping through it to generate the label commands, but I don't know what it is. Any thoughts?
Ideally each line of the metadata is something like
varname1 "more interesting description"
in which case you can prefix each line with
label var
and then run the file as if it were a do-file using do. See the help for label. That is easy in a decent text editor: for example, search for the start of each line and replace it with label var (note the need for the trailing space).
What could bite here includes:
You don't have double quotes " " as delimiters, in which case you need to insert them.
The extra information does not qualify as a variable label because it is more than 80 characters long. See help limits.
There are other ways to do this with Stata. You could write a program to read in the metadata and write out a do-file using file, but if this were my problem I would reach first for my text editor. (Most experienced Stata programmers use something else as well as doedit.)
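If you do go the program route, here is a hedged sketch done outside Stata (in Python rather than with Stata's -file- command; file names are placeholders), assuming each metadata line holds a variable name followed by its description:

# Read lines like:   varname1 more interesting description
# and write lines:   label var varname1 "more interesting description"
with open('metadata.txt') as meta, open('labels.do', 'w') as do:
    for line in meta:
        line = line.strip()
        if not line:
            continue
        varname, description = line.split(None, 1)
        description = description.strip().strip('"')[:80]  # 80-character label limit
        do.write('label var %s "%s"\n' % (varname, description))

Running do labels.do in Stata then applies all the labels at once.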