Distinct column values in a CSV file

I have a CSV file.
The columns in the CSV file are: "SNo. StateName CityName AreaName PinCode NonServ.Area MessangerService Remark".
The column CityName has repeated values.
For example, many records share the same value (Delhi).
Is there any approach in Java to read the CSV file and get the distinct values from that column?

The only way I can think of is to read the file row by row and store each value in a collection. Using a set structure such as HashSet or TreeSet will ensure the values are unique.
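For illustration, a minimal sketch in plain Java, assuming a comma-delimited file with a header row and CityName as the third column (adjust the file name, delimiter, and index to match your data):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class DistinctCities {
    public static void main(String[] args) throws Exception {
        Set<String> cities = new HashSet<>();           // a Set keeps only unique values
        try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
            reader.readLine();                          // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");      // naive split; quoted fields need a real CSV parser
                if (fields.length > 2) {
                    cities.add(fields[2].trim());       // CityName assumed to be the 3rd column
                }
            }
        }
        System.out.println(cities);                     // the distinct city names
    }
}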
The other option, which isn't what you were looking for but might work depending on your project, is to use a database instead of a CSV file. It then becomes very easy to select the distinct values in a column.
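For example, once the data is loaded into a table (the table name here is illustrative), it is a one-liner:

SELECT DISTINCT CityName FROM city_data;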

import pandas as pd
df = pd.read_csv("yourfile.csv")  # df is where you've read the CSV data
df["CityName"].unique()

How to update the values of specific fields in a CSV using NiFi?

I have a CSV file that contains id, name, and salary fields. The data in my CSV file looks like this:
id,name,salary
1,Jhon,2345
2,Alex,3456
I want to update the CSV so that each id is replaced with a new id (id*4):
id,name,salary
4,Jhon,2345
8,Alex,3456
The format of the file at the destination should also be CSV. Can anyone tell me the flow (what processors do I need)? I'm very new to NiFi. A big thanks in advance.
Use the UpdateRecord processor with the settings below:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Replacement Value Strategy: Literal Value
/id: ${field.value:multiply(4)}
This gives the desired result: just CSV in and CSV out.

Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and am wrestling with schema generation. There is an auto-generate option, but it is poorly documented. The problem is that if I let BigQuery generate the schema, it does a decent job of guessing data types, but it only sometimes recognizes the first row of the data as a header row; other times it treats the first row as data and generates column names like string_field_N. The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax, because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing two things here:
1. Preprocess your file and store the final layout of the file without the first row, i.e. the header row.
2. bq load accepts an additional parameter in the form of a JSON schema file; use this to explicitly define the table schema and pass the file as a parameter (a minimal example follows below). This gives you the flexibility to alter the schema at any point in time, if required.
Allowing BQ to auto-detect the schema is not advised.
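For reference, the schema file is just a JSON array of field definitions. A minimal hand-written sketch (column names and file paths here are placeholders, not taken from your data):

[
  {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
]

You would then pass it on the load, e.g.:

bq load --source_format=CSV --skip_leading_rows=1 --schema=myschema.json mydataset.mytable gs://mybucket/myfile.csv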
Schema auto detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One of the cases for which column name detection fails is when you have similar data types all over your CSV file. For instance, BigQuery schema auto detect would not be able to detect header names for the following file since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fixing this shortcoming of schema auto detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip the first n rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the Web UI, but it's also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).
Yes, you can get the schema BigQuery generated (the DDL, in effect) using bq show and then edit it:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that this will result in you creating a new BQ table altogether.
I have a way to get a schema for loading CSV into BigQuery: just edit the values in the first data row. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you use the schema generated by BigQuery, the weight and total fields will be defined as INT64, but inserting the second row will then fail with an error. So just edit the first data row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
This forces the weight and total fields to be detected as STRING, and if you want to aggregate you can simply convert the type of the data in BigQuery.
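For example, a cast at query time (the table name is a placeholder):

SELECT SAFE_CAST(weight AS FLOAT64) AS weight, SAFE_CAST(total AS FLOAT64) AS total FROM mydataset.mytable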
cheers
If the column names and the data have the same type all over the CSV file, then BigQuery mistakes the column names for data and generates its own names for the columns. I couldn't find any technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose name is a string but whose values are all numbers, e.g. a column named 'Test' with every value set to 0. Upload the file to BigQuery and use this query to drop the extra column:
ALTER TABLE <table name> DROP COLUMN <Test>
Change <table name> and the column name according to your table.

Does the COPY CSV command in Redshift load in the order defined in the headers?

I have some code which pulls CSVs from S3 into a Redshift table. I'm getting issues whereby, if the CSV is stored with a certain column order, the COPY command doesn't match the column order in the CSV header.
So if I have a CSV with the columns id|age|name and a Redshift table with the columns id|name|age, the data is pulled in the CSV's column order. In this case, the name column in the CSV is loaded into the age column in Redshift, which causes a type error.
My query is:
copy schema.#tmp from <s3file>
iam_role <iamrole>
acceptinvchars
truncatecolumns
IGNOREBLANKLINES
ignoreheader 1
COMPUPDATE OFF
STATUPDATE OFF
delimiter ','
timeformat 'auto'
dateformat 'auto';
Do I need to define the column order in the copy command to match the two up?
COPY ignores column names in the file; the columns are matched from left to right.
But you can specify a column list in the COPY statement. Use that to tell Redshift the order of the columns in the file.
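For example, with the column order from the question (the other options stay as in your original command):

copy schema.#tmp (id, age, name)
from <s3file>
iam_role <iamrole>
ignoreheader 1
delimiter ',';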

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has mixed types (like string and int), BigQuery is capable of inferring the column names from the header, because the header is all strings whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row for the names of your columns.
I know I can specify the column names manually, but I would prefer not to, as I have a lot of tables. Is there another solution?
If your data is all "string" type and the first row of your CSV file contains the column names, then it is easy to write a quick script that parses the first line of your CSV and generates a corresponding "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
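For illustration, a hedged sketch of such a quick script in plain Java (it assumes a comma-delimited header, marks every column as STRING, and uses placeholder file and table names):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.stream.Collectors;

public class HeaderToSchema {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader("myData.csv"))) {
            String header = reader.readLine();              // first line holds the column names
            String schema = Arrays.stream(header.split(","))
                    .map(col -> col.trim() + ":STRING")
                    .collect(Collectors.joining(","));
            // Prints something like: bq mk --schema name:STRING,street:STRING,city:STRING -t mydataset.myNewTable
            System.out.println("bq mk --schema " + schema + " -t mydataset.myNewTable");
        }
    }
}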
14/02/2018: Update thanks to Felipe's comment:
The above can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
It's not possible with the current API. You can file a feature request in the public BigQuery tracker: https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and SELECT * EXCEPT the added extra column, but querying is not free.
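For example, assuming the extra column ended up being named extra_col (the table name is a placeholder):

SELECT * EXCEPT (extra_col) FROM mydataset.mytable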

How to split a column into two columns in SSIS if there is invalid data in the column

I am trying to load data from a CSV file and dump it into a database. While reading the date column from the CSV file I get an error, because the CSV file contains some invalid data like '31-FEB-2014'. So I need to store those invalid values in another column in the table. How do I achieve this using SSIS? Please assist.
Make a new column on your table which is of data type nvarchar, and map your CSV source column to the new column.
Afterwards you can do some magic. For example, you could use a Derived Column transformation to handle the new nvarchar value, convert it back to a proper date format, and then map it to your original column.
You just need to redirect the bad rows: drag the red (error output) arrow from the source component to another destination, then set the error output properties so that error rows are redirected rather than failing the component.
Tag me in case you're stuck.