I am new to neuroscience and I am learning to work with neuroscience data.
However, I am a bit confused. I tried to download the dataset:
https://openfmri.org/dataset/ds000116/
but somehow I cannot figure out which files to download. If I click "Browse Data For All Revisions on S3", I can only find anatomical and functional images, but I am looking for both EEG and fMRI data.
If I download "Raw data on AWS" or the "Curated dataset", it is really small in size compared to "Processed data for Subject 1 on AWS" for each subject.
So my questions are:
What is the difference between "Browse Data For All Revisions on S3", "Raw data on AWS", "Curated dataset", and "Processed data for Subject 1 on AWS" for each subject on the dataset page?
Which dataset/files should I download?
Is the "Processed data for Subject 1 on AWS" dataset for each subject pre-processed? If yes, what pre-processing steps have been done?
Can someone point me to resources for pre-processing EEG and fMRI data? Can I use fMRIPrep for the fMRI data?
Answered in the original post https://neurostars.org/t/which-files-to-download-from-openfmri-data/999?u=oesteban
I want to compare the data I have in a CSV file to the data in the LDAP production server.
There are thousands of users' records in the CSV file, and I want to compare them with the data in the production server.
Let's suppose I have user ID xtz12345 in the CSV file with uidNumber 123456. Now I want to cross-check the uidNumber of the same user ID xtz12345 in the production server.
Is there any way I can automate this? There are thousands of user IDs to be checked, and doing it manually would take a lot of time. Can anyone suggest what I should do?
A PowerShell script is a good place to start.
Import the ActiveDirectory module in PowerShell (assuming Windows AD; download and install the RSAT tools) and use it, for example via Get-ADUser, to fetch the information from AD.
Use Import-Csv in PowerShell to read the CSV values, then compare the first set against the second.
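For reference, the same check can also be scripted in Python rather than PowerShell. Below is a rough sketch using the csv module and the ldap3 package; the server address, bind credentials, base DN and the CSV column names (userID, uidNumber) are placeholders you would need to adapt to your environment:

import csv
from ldap3 import ALL, Connection, Server

# Placeholder connection details - replace with your production server values.
server = Server('ldaps://ldap.example.com', get_info=ALL)
conn = Connection(server, user='cn=readonly,dc=example,dc=com',
                  password='secret', auto_bind=True)

with open('users.csv', newline='') as f:
    for row in csv.DictReader(f):  # assumes columns named userID and uidNumber
        conn.search('dc=example,dc=com', f"(uid={row['userID']})",
                    attributes=['uidNumber'])
        if not conn.entries:
            print(f"{row['userID']}: not found in LDAP")
        elif str(conn.entries[0].uidNumber) != row['uidNumber'].strip():
            print(f"{row['userID']}: csv={row['uidNumber']} ldap={conn.entries[0].uidNumber}")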
Happy to help
I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each file has hardly one or two records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have the following:
folder a - inside this folder, 10,000 JSON files
folder b - inside this folder, 10,000 JSON files
My aim - target_folder - dev-bucket/final/ containing all 20,000 JSON files.
I tried writing the code below; however, the processing time is huge. Is there any other way to approach this?
import logging
import boto3

logger = logging.getLogger(__name__)
bucket = boto3.resource('s3').Bucket('dev-bucket')
source_folder, target_folder = 'data/', 'final/'
file_count = 0
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        # flatten: keep only the file name when building the destination key
        final_file = target_folder + obj.key.split('/')[-1]
        file_count = file_count + 1
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.error("The process has failed to copy files from the source to the target location: %s", e)
    exit(1)
I was thinking of merging the data into one single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
import boto3

# s3_bucket and FOLDER are assumed to be defined earlier (bucket name and prefix)
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        # list.append returns None, so do not reassign the list
        response.append(text)
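For what it's worth, one way the sketch above could be finished is to write everything collected in response back to S3 as a single newline-delimited JSON (JSON Lines) object; the output key final/merged.json below is only a hypothetical name:

# Hypothetical output key; assumes each source file held one JSON record,
# so joining with newlines yields a JSON Lines file (handy for Athena/Glue).
s3_client.put_object(
    Bucket=s3_bucket,
    Key='final/merged.json',
    Body='\n'.join(response).encode('utf-8'),
)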
Can you please guide me? Many thanks in advance.
If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that the objects contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS (a rough sketch of this step follows after this list)
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
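To make the Glue/Athena route above more concrete, here is a rough sketch (not a tested pipeline) of kicking off such a CREATE TABLE AS query from Python with boto3; the database, table and S3 locations (mydb, raw_json, merged, dev-bucket) are hypothetical names that the Glue crawler and your own setup would determine:

import boto3

athena = boto3.client('athena')

# CTAS: read every small JSON file behind mydb.raw_json (the crawler-created
# table) and rewrite the data as a few Parquet files under a new prefix.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE mydb.merged
        WITH (format = 'PARQUET', external_location = 's3://dev-bucket/merged/')
        AS SELECT * FROM mydb.raw_json
    """,
    QueryExecutionContext={'Database': 'mydb'},
    ResultConfiguration={'OutputLocation': 's3://dev-bucket/athena-results/'},
)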
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
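As a rough illustration of that idea (assuming a CSV-format inventory report that has been downloaded and gunzipped locally, with the bucket name and object key in the first two columns; check the report's manifest for the exact schema):

import csv
import boto3

s3 = boto3.resource('s3')
target = s3.Bucket('dev-bucket')  # hypothetical target bucket

with open('inventory.csv', newline='') as f:  # local copy of one inventory report
    for row in csv.reader(f):
        src_bucket, key = row[0], row[1]
        dest_key = 'final/' + key.split('/')[-1]  # flatten into a single prefix
        target.Object(dest_key).copy({'Bucket': src_bucket, 'Key': key})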
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
I have huge CSV files of size ~15 GB in AWS S3 (s3://bucket1/rawFile.csv). Let's say the schema looks like this:
cust_id, account_num, paid_date, cust_f_name
1001, 1234567890, 01/01/2001, Jonathan
I am trying to mask the account number column and the customer name, and create a new maskedFile.csv stored in another AWS S3 bucket (s3://bucket2/maskedFile.csv), as follows:
cust_id, account_num, paid_date, cust_f_name
1001, 123*******, 01/01/2001, Jon*******
This needs to be done just once with one snapshot of the payment data.
How can I do this, and what tools should I use to achieve it? Please let me know.
AWS Glue is AWS' managed ETL and data catalog tool, and it was made for exactly this kind of task.
You point it to the source folder on S3, tell it the destination folder where you want the results to land, and you are guided through the transformations you want. Basically, if you can write a bit of Python, you can do a simple masking transform in no time.
Once that's set up, Glue will automatically transform any new file you drop into the source folder, so you have not only created the code necessary to do the masking, you have a completely automated pipeline that runs when new data arrives. I saw that your case only calls for it to run once, but it's really not much easier to write the code to do it only once.
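The masking logic itself is tiny. As a rough, Glue-independent illustration in plain Python (local files stand in for whatever source and sink the Glue job would actually use; the column names come from the question):

import csv

def mask(value, keep=3):
    # keep the first `keep` characters and replace the rest with '*'
    # (this preserves length; pad to a fixed width instead if you prefer)
    return value[:keep] + '*' * max(len(value) - keep, 0)

with open('rawFile.csv', newline='') as src, open('maskedFile.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src, skipinitialspace=True)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row['account_num'] = mask(row['account_num'])
        row['cust_f_name'] = mask(row['cust_f_name'])
        writer.writerow(row)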
To see an example of using Glue to set up a simple ETL job, take a look at: https://gorillalogic.com/blog/in-search-of-happiness-a-quick-etl-use-case-with-aws-glue-redshift/. And there are plenty of other tutorials out there to get you started.
You can try FileMasker.
It will mask CSV (and JSON) files in S3 buckets.
You can run it as an AWS Lambda function although the Lambda restrictions will limit the input file sizes to a few GB each.
If you can split the input files into smaller files then you'll be fine. Otherwise, contact the vendor for options. See https://www.dataveil.com/filemasker/
Disclaimer: I work for DataVeil.
I'm creating a questionnaire application in Qt, where surveys are created, and users log on and complete these surveys. I am saving these as JSON.
Each survey could have 60+ questions and is completed multiple times by different people.
Is it more appropriate to save everything as one JSON file, or one file for each survey?
I would use a database rather than a JSON file. You can use JSON to serialize data and transfer it between processes, computers or servers, but you don't want to save big data to a JSON file.
Anyway, if that's what you want to do, I would save each survey in a different JSON file. Maybe keep them in order by assigning a unique identifier to each file (as the file name) so that you can find and search for them easily.
A single file would be a single point of failure, and reading and writing it would raise concurrency problems. One file per survey should ease the problem.
I am trying to import data from past NFL games in the form of Play-by-play tables and am mostly working in R to collect the data and create a data set.
An example of the data I am after is on this page: http://www.nfl.com/gamecenter/2012020500/2011/POST22/giants#patriots#menu=gameinfo&tab=analyze&analyze=playbyplay
I know that NFL.com uses JSON and much of the necessary data are in JSON files attached to the site. My efforts at extracting data from these files using the JSON package in R have been pretty feeble. Any advice y'all have is appreciated.
Would I just be better off using PHP to farm the data?
I don't know if you have already succeeded in loading the JSON files into R, but here is an example of that:
library(rjson)
json <- fromJSON(file = 'http://www.nfl.com/liveupdate/game-center/2012020500/2012020500_gtd.json')
json$`2012020500`$home$stats
If you are having trouble finding the URL of the JSON file, use Firebug (an extension for Firefox) and you can see the webpage requesting the JSON file.
The JSON file is, of course, huge and complicated. But it is complicated data. Whatever you are looking for should be in there. If you are just looking for a straight dump of the play-by-play text, then you can use this URL:
http://www.nfl.com/widget/gc/2011/tabs/cat-post-playbyplay?gameId=2012020500
I extracted all the data for one team for one season more-or-less manually. If you want data for a lot of games consider emailing the league and asking for the files you mentioned. They publish the data, so maybe they will give you the files. The NFL spokesman is Greg Aiello. I suspect you could find his email address with Google.
Sorry this is not a suggested programming solution. If this answer is not appropriate for the forum please delete it. It is my first posted answer.