Open parts of a CSV in Stata

I'd like to know more about how Stata 13 can work with a .csv dataset that is larger than the RAM I have.
I can open the first n rows or the first n columns with the following command:
import delimited using filename.csv, rowrange(1:1000) colrange(1:3)
However, it seems I cannot open any of the following without first loading the whole dataset:
the first and last three variables
the first and last 100 lines
a list of lines such that a variable satisfies some condition
Are there ways to do these things in Stata?

I'm not sure you can do this with one command, but you can try importing in parts and using merge. An example:
clear all
set more off
*----- example data -----
copy http://www.stata.com/examples/auto.csv auto.csv, replace
*----- what you want -----
* import first two columns
import delimited using "auto.csv", colrange(1:2) rowrange(1:6)
gen obs = _n
* save in temp file
tempfile first
save "`first'"
* import last two columns
import delimited using "auto.csv", colrange(4:5) rowrange(1:6) clear
gen obs = _n
* merge current data with the tempfile
merge 1:1 obs using "`first'", assert(match) nogen
* list
drop obs
order make foreign price
list
The previous covers point 1 in your question. For point 2, do something similar but instead of merge, use append.
The commands infile and use both support if and in in their syntax, which may help you with point 3.
Edit
An example for point 2:
clear all
set more off
*----- example data -----
copy http://www.stata.com/examples/auto.csv auto.csv, replace
*----- what you want -----
* import first two rows of data
import delimited using "auto.csv", colrange(1:4) rowrange(2:3)
* save in temp file
tempfile first
save "`first'"
* import last two rows of data
import delimited using "auto.csv", colrange(1:4) rowrange(10:11) clear
* append current data with the tempfile
append using "`first'"
* list
sort make
list
Observation 1 starts in row 2 (row 1 contains variable names), so we need to shift everything in rowrange() by 1. Curiously, some testing shows that adding the varnames(1) option did nothing to change this behaviour.
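For point 3 specifically, if a preprocessing pass outside Stata is acceptable, a chunked read in pandas can filter the rows without ever holding the full file in memory; the smaller result can then be loaded with import delimited. This is only a sketch, and the price condition is an assumption based on the auto example data:
import pandas as pd
# Stream the file in chunks so the full dataset never has to fit in RAM,
# keep only the rows that satisfy the condition, and write a smaller file.
chunks = pd.read_csv("auto.csv", chunksize=100_000)
filtered = pd.concat(chunk[chunk["price"] > 5000] for chunk in chunks)
filtered.to_csv("auto_filtered.csv", index=False)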

Related

Importing specific columns from a CSV into Excel

I am trying to do what the title says, and also do it for new records. I cannot link the CSV file because it exceeds the 255-field limit, so I am attempting to split up the table.
I have the table below in Access:
DateOfTest | Time | PromptTime | TestSequence | PATResults | Logs | Serial Number
1 | 2 | 3 | 4 | 5 | 6 | 7
Obviously, where the numbers are is where I want the data from the CSV to be inserted.
I have created a form with a button so I can run some VBA, but I cannot find the right information online for my task; as I am new to VBA, it is also a bit confusing.
I have attempted some random code, but I was just spraying and praying at that point.
I am not sure I understood your question. In the import tool you can choose columns, but if you want to do it with a script, I would suggest a pre-processing phase with plain Python and pandas: read the CSV file, remove any unwanted columns, and save the result to another CSV that can be loaded directly into Excel.
Something like this:
import pandas as pd
# read the CSV, drop the unwanted column, and write the rest to Excel
df = pd.read_csv('csvfile.csv')
df.drop('column_name', axis=1, inplace=True)
df.to_excel('filename.xlsx', index=False, header=True)
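If only a few columns are needed, read_csv can also select them at load time instead of dropping them afterwards. A sketch using the field names from the question (these must match the actual CSV headers):
import pandas as pd
# read only the wanted columns; writing .xlsx requires openpyxl
wanted = ['DateOfTest', 'Time', 'PromptTime', 'TestSequence',
          'PATResults', 'Logs', 'Serial Number']
df = pd.read_csv('csvfile.csv', usecols=wanted)
df.to_excel('filename.xlsx', index=False, header=True)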

Rename the fields based on the first record content of each one

I have the schema below,
phpMyAdmin - Table structure
which was created by an import from a CSV file.
How can I rename the fields based on the content of the first record of each one?
like:
rename: COL 1 to: User ID
rename: COL 2 to: Main - Full Name
rename: COL 3 to: Main - First name
and so on ...
Before importing the CSV data, remove the first line (the one with the column names) from the file. Then, while importing the CSV file in phpMyAdmin, the advanced options make it possible to supply a comma-separated list of column names (that is the name of the input field) which match the CSV columns.
A PHP approach: renameField.php.
Maybe it's not the most elegant solution, but it does its job.
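If you would rather script it than click through phpMyAdmin, one hedged sketch is to read the first record from the CSV and generate the rename statements; the table name mytable is a placeholder, and RENAME COLUMN requires MySQL 8+:
import csv
# read the first row (which holds the intended column names) and emit
# one ALTER TABLE statement per generic column (COL 1, COL 2, ...)
with open("import.csv", newline="") as f:
    new_names = next(csv.reader(f))
for i, new_name in enumerate(new_names, start=1):
    print(f"ALTER TABLE `mytable` RENAME COLUMN `COL {i}` TO `{new_name}`;")
After running the generated statements, remember to delete that first record, since it holds names rather than data.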

Create a node for each column only once while importing csv into Neo4j

I have a CSV file that looks the following way:
I want to create a database from it in Neo4j. Rows are nodes with the label Gene; columns are also nodes, with the label Cell. I need to write a CREATE query that creates all the gene and cell nodes, and one relationship for each combination of gene and cell. Currently I am stuck with the following code:
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
CREATE (:Gene {id: line.gene_ids, name: line.wikigene_name})
I need to somehow iterate over all columns (starting from index 3) after creating the gene nodes, but I do not know how to do that.
Here are 3 queries that, performed in order, should do what you want.
This query creates a temporary Headers node with a names property that contains the collection of headers from the CSV file. It uses LIMIT 1 to process only the first row of the file. It also creates all the Cell nodes, each with its own name property.
LOAD CSV FROM 'file:///merged_full.csv' AS line
MERGE (h:Headers)
SET h.names = line
WITH line
LIMIT 1
UNWIND line[3..] AS name
MERGE (c:Cell {name: name})
This query uses the APOC function apoc.map.fromNodes to generate a map named cells, which maps each cell name to its cell node. It also gets the Headers node. It then loads the non-header data from the CSV file (using SKIP 1 to skip over the header row), and processes each row as follows. It uses MERGE to get/create a Gene node, g, with the desired id and name. It uses the REDUCE function to generate a collection of the Cell nodes that have a "1" column value in the current row, and the FOREACH clause then creates a (g)-[:HAS]->(x) relationship (if necessary) for every cell, x, in that collection.
WITH apoc.map.fromNodes('Cell', 'name') AS cells
MATCH (h:Headers)
LOAD CSV FROM 'file:///merged_full.csv' AS line
WITH h, cells, line
SKIP 1
MERGE (g:Gene {id: line[1], name: line[2]})
FOREACH(
x IN REDUCE(s = [], i IN RANGE(3, SIZE(line)-1) |
CASE line[i] WHEN "1" THEN s + cells[h.names[i]] ELSE s END) |
MERGE (g)-[:HAS]->(x))
This query just deletes the temporary Headers node (if you wish):
MATCH (h:Headers)
DELETE h;
If the columns correspond to cell nodes, then you should know all the cell nodes you need just by looking at the CSV header.
I'd recommend writing a small query just to create each of the cell nodes you need, then create an index or unique constraint on :Cell(id) (or name, or whatever the property is that is meant to identify a :Cell).
At that point the problem becomes getting and processing each relevant column (I assume only the ones with 1 as the value). APOC Procedures may help here.
apoc.map.sortedProperties() can be used to take your line map and give you a list of key/value list pairs, which you can filter down to those where the key begins with 'V', and where the value is 1, then use what's remaining to match on the relevant :Cell node and create the relationship.
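If APOC is not available, one alternative is to reshape the CSV outside Neo4j first. Here is a hedged Python sketch that flattens the wide gene-by-cell matrix into a (gene_id, gene_name, cell) edge list, which is then trivial to LOAD CSV; column positions follow the queries above (id in column 1, name in column 2, cell columns from index 3 onwards):
import csv
# emit one output row per gene/cell pair whose matrix value is "1"
with open("merged_full.csv", newline="") as src, \
        open("gene_cell_edges.csv", "w", newline="") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    header = next(reader)
    writer.writerow(["gene_id", "gene_name", "cell"])
    for row in reader:
        for i in range(3, len(row)):
            if row[i] == "1":
                writer.writerow([row[1], row[2], header[i]])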

SSIS: Flat File Source to SQL without Duplicate Rows

I have a fairly large flat file (CSV) that I am trying to import into my SQL Server table using an SSIS package. There is nothing special about it; it's a plain import. The problem is that more than 50% of the lines are duplicates.
E.g. Data:
Item Number | Item Name | Update Date
ITEM-01 | First Item | 1-Jan-2013
ITEM-01 | First Item | 5-Jan-2013
ITEM-24 | Another Item | 12-Mar-2012
ITEM-24 | Another Item | 13-Mar-2012
ITEM-24 | Another Item | 14-Mar-2012
Now I need to create my master item record table using this data. As you can see, the rows are duplicated except for the Update Date. The file is guaranteed to always be sorted by Item Number, so all I need to do is check: if the next item number equals the previous item number, then do NOT import that line.
I used Sort with the remove-duplicates option in the SSIS package, but it actually tries to sort all the lines, which is pointless because they are already sorted, and sorting that many lines takes forever.
So is there any other way?
There are a couple of approaches you can take to do this.
1. Aggregate Transformation
Group by Item Number and Item Name, then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. To use Minimum, you'll need to convert the Update Date column to a date first (you can't take the minimum of a string); that conversion can be done in a Data Conversion Transformation.
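To make the aggregation logic concrete, here is a rough pandas sketch of the same group-and-minimum idea (the file name is an assumption; in SSIS the transformation does this inside the pipeline):
import pandas as pd
# group by item and keep the earliest Update Date per item
df = pd.read_csv("items.csv", parse_dates=["Update Date"], dayfirst=True)
master = df.groupby(["Item Number", "Item Name"], as_index=False)["Update Date"].min()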
2. Script Component Transformation
Essentially, you could implement the logic you mentioned above:
if next item number = previous item number then do NOT import this line
First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):
Add the Script Component after the Flat File Source in your Data Flow, selecting Transformation as the Script Component type.
Double-click the Script Component to open the Script Transformation Editor.
Under Input Columns, select all columns.
Under Inputs and Outputs, select Output 0 and set the SynchronousInputID property to None.
Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types).
Finally, edit the script. There will be a method named Input0_ProcessInputRow; modify it as shown below, and add a private field named previousItemNumber:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // only pass the row through when the item number changes
    if (!Row.ItemNumber.Equals(previousItemNumber))
    {
        Output0Buffer.AddRow();
        Output0Buffer.ItemName = Row.ItemName;
        Output0Buffer.ItemNumber = Row.ItemNumber;
        Output0Buffer.UpdateDate = Row.UpdateDate;
    }
    previousItemNumber = Row.ItemNumber;
}

// remembers the item number of the previous row
private string previousItemNumber = string.Empty;
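If a preprocessing step outside SSIS is an option, the same previous-value check is easy to prototype as a small script (a sketch; file names are assumptions):
import csv
# keep only the first row per Item Number, relying on the file already
# being sorted by Item Number, as the question guarantees
previous = None
with open("items.csv", newline="") as src, \
        open("items_dedup.csv", "w", newline="") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader))   # copy the header row unchanged
    for row in reader:
        if row[0] != previous:      # item number changed: keep this row
            writer.writerow(row)
        previous = row[0]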
If performance is a big concern, I'd suggest dumping the entire text file into a temporary table on SQL Server and then using a SELECT DISTINCT * to get the desired values.

Import CSV column from different file into new file

I have 2 CSV files almost identical with the following differences:
The first has a column, "date".
The second doesn't have "date" and also has 50 fewer rows than the first; both files share the "email" column.
They are lists of subscribers with the date each was created. The second, however, is the updated list with the subscribers who wanted to be removed taken out, but it no longer has the creation date.
Is there any way to import column "date" from 1st CSV into the 2nd CSV by making a reference to the "email" column so I can get the correct date of that subscriber?
Sorry, there seems to be no ready-made command-line tool available for this (building one is probably an evening's worth of effort).
You could look at different ways; one heavier option is to load both files into database tables, merge them (using a SELECT with a join on the two tables), and export the result back as CSV.
The simplest I could think of was to use R (given that you have header names in your CSVs):
csv1_data <- read.csv('/path/to/csv1.csv')
csv2_data <- read.csv('/path/to/csv2.csv')
# join on the shared "email" column to bring "date" into the second list
merged_csv <- merge(csv1_data, csv2_data, by = "email")
# row.names = FALSE avoids writing an extra column of row numbers
write.table(merged_csv, file = "/path/to/merged_csv.csv", sep = ",", row.names = FALSE)
The first two lines load the data into R, the third merges them on the "email" column, and the final line exports the result as a CSV file with headers.
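If Python is handier for you, a pandas sketch of the same merge (file names are assumptions):
import pandas as pd
# bring "date" into the updated list by matching on "email"
list1 = pd.read_csv("csv1.csv")    # full list: has email and date
list2 = pd.read_csv("csv2.csv")    # updated list, no date column
merged = list2.merge(list1[["email", "date"]], on="email", how="left")
merged.to_csv("merged.csv", index=False)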
Hope this helps!