How to retrieve a lost file from IPFS?

A couple of weeks ago I spun up a local IPFS node, published a file, and was able to access it via a public gateway. I thought the file would have been stored by lots of other nodes, so I deleted it from my local machine; now I can't access the file via its CID (QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm) anymore.
I noticed that I can still somehow access the data from the WebUI, but I can only see the raw data instead of the file. Is there any way to retrieve the file?

I actually can retrieve this CID via a simple ipfs cat QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm:
{
"0x9a39f286e1cd710da14e45ac124e38f2b6242622": "4.705",
"0x7c981d31b2ab65ce9f9cce49feac9e9e11e8ca64": "0.174481",
"0xa83cdaaadbb0e01d5de8df4a670947eacbb11f7e": "0.860812",
"0x445f4b54039cb1f86644351f2ef324c6876f6d76": "0.036128",
"0x29eab4341629aa1ae5e996f76ea0750548311ecf": "5.4",
"0xbbccf6cab5b3aec26b0cbc6095b5b6ddbacfd59a": "17.172011",
"0x33d5ae030cf11723f9b34ecc6fe5cfe00c6dc133": "0.001909",
"0x03886228bb749eeba43426d2d6b70eba472f4876": "6.8",
"0x1eb8e88a563fde7b3b8ebbbb0e1ac117c3d80800": "1821.138157",
"0x62ba33ccc4a404456e388456c332d871dae7ae9e": "0.000145",
"0x63e62588330657c99ba79139e7c21af0c0db1e7e": "12.560212",
"0xcd45fdaa6a72740e1d092f458213ff39d3d94a10": "280.592062",
"0xb92667e34cb6753449adf464f18ce1833caf26e0": "0.647424",
"0x9a5179e08acf37b3d84c9a0c0d6f3ea2417f9175": "10.097725",
"0xc43cffc5db578879cc5d0d4cfe07ad514c934d3b": "6.365907",
"0x34915628fc56ae8ff6684be39462e7ba398164b8": "0.00069",
"0x47e2bc7475ef8a9a5e10aef076da53d3c446291e": "5.305",
"0xf432d70c941ebe657ca8cff0b70d1649d5781eea": "0.153823",
"0xff90d66d41fc97b223e8005dba51635b5d49632b": "0.002298",
"0x1cf41ad63f67f3e7f8a1db240d812f5392b9a9c4": "6.05013",
"0xc418aaa0d1e018ded3efc0f72a089519b3d58683": "0.179902",
"0x7d209486a3562fe406b72d65b3703884c50bac81": "2.191224",
"0xe782657a1043062087232b3c20c4d25e2a982cb3": "0.110927",
"0xd998e5a4777e1b47c1441a88bb553cbf16802e4c": "0.095045",
"0x9f3ef50ea64adad5b33f1f8222760cfbf42007f7": "0.069055",
"0x40c1efa324fd80329117409c65081f13e7a08a42": "2790.399058",
"0x9ef8c5ae4a320ef0984695af9a85d07f5be13792": "0.139741",
"0xf46422c1b6c2135dbca9b55771fd6e7869a8691c": "995.479262",
"0xf6f3bc09782d3c0df474eb3cec5cac8423bfedf3": "0.00012",
"0x4f2769e87c7d96ed9ca72084845ee05e7de5dda2": "0.000509",
"0x92f1e9a52c1a81fdb76ee6477c0c605917cddbe5": "0.811623",
"0x1e6424a481e6404ed2858d540aec37399671f5e0": "19253.760913",
"0xc9b2c3a6a8e1896aadcf236b88019c7574d75069": "781.127767",
"0xb08f95dbc639621dbaf48a472ae8fce0f6f56a6e": "34.704074"
}
I thought the file would have been stored by lots of other nodes, so I deleted it from my local machine
It's important to note that other nodes only store your data temporarily, and only if they access the content themselves. If you want data to remain reliably available long term, you can use a pinning service such as Pinata, where you pay them to keep your data pinned.
Otherwise you have to rely on other nodes pinning your data to ensure it remains available.
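For example, since ipfs cat still returns the content, one way to recover it and keep your own copy would be roughly the following (the filename recovered.json is just an example):
# Fetch the content while it is still reachable and save it to a local file
ipfs cat QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm > recovered.json
# Pin the CID on your own node so it stays in your local blockstore
ipfs pin add QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm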

Related

Azure Data Factory: filter files to process

I have a pipeline that processes some files, and in some cases "groups" of files, meaning the files should be processed together and are correlated by a timestamp.
Ex.
Timestamp#Customer.csv
Timestamp#Customer_Offices.csv
Timestamp_1#Customer.csv
Timestamp_1#Customer_Offices.csv
...
I have a table with all the scopes and files with their respective file masks. I populate a variable at the beginning of the pipeline based on a parameter.
The Get Files activity goes to an sFTP location and grabs files from a folder. I then only want to process the "Customer.csv" and "Customer_Offices.csv" files, because the folder contains other file types and scopes that are processed by other pipelines. If I don't filter, the next activities end up processing metadata for files they are not supposed to. That is bad for efficiency and performance, and it is even causing issues with files being left behind.
I've tried something like
@variables('FilesToSearch').contains(@endswith(item().name, 'do I need this 2nd parm in arrays ?'))
but no luck... :(
Any help will be highly appreciated,
Best regards,
Manuel
The contains function can work directly on a string to find a substring, so you can try an expression like @contains(item().name,'Customer'), with no need to create a variable.
Or use the endsWith function with this expression:
@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))
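For reference, an expression like this would typically go into the Condition of a Filter activity placed right after the activity that lists the files; a rough sketch, where the activity name 'Get Files' is hypothetical:
Items:     @activity('Get Files').output.childItems
Condition: @or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))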

Undesired behavior when reloading modified nodes into AWS Neptune

I'm using the bulk loader to load data from csv files on S3 into a Neptune DB cluster.
The data is loaded successfully. However, when I reload the data with some of the nodes' property values modified, the new value does not replace the old one but is added to it, making it a list of values separated by a comma. For example:
Initial values loaded:
~id,~label,ip:string,creationTime:date
2,user,"1.2.3.4",2019-02-13
If I reload this node with a different ip:
2,user,"5.6.7.8",2019-02-13
Then I run the following traversal: g.V(2).valueMap(), and get: ip=[1.2.3.4, 5.6.7.8], creationTime=[2019-02-13]
While this behavior may be beneficial for some use-cases, it's mostly undesired. I want the new value to replace the old one.
I couldn't find any reference in the documentation to the loader behavior in case of reloading nodes, and there is no relevant parameter to configure in the API request.
How can I have reloaded nodes overwrite the existing ones?
Update: Neptune now supports single cardinality bulk-loading. Just set
updateSingleCardinalityProperties = TRUE
SOURCE: https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html
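A hedged sketch of what the loader request could look like with that flag set (the endpoint, bucket, IAM role ARN, and region below are placeholders):
curl -X POST -H 'Content-Type: application/json' https://your-neptune-endpoint:8182/loader -d '{
  "source": "s3://your-bucket/nodes/",
  "format": "csv",
  "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
  "region": "us-east-1",
  "updateSingleCardinalityProperties": "TRUE"
}'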
Currently the Neptune bulk loader uses Set cardinality. To update an existing property, the best way is to use Gremlin via the HTTP or WebSocket endpoint.
From Gremlin you can specify that you want single cardinality (thus replacing rather than adding to the property value). An example would be
g.V('2').property(single,"ip","5.6.7.8")
Hope that helps,
Kelvin

How do I download gridded SST data?

I've recently been introduced to R and am trying the heatwaveR package. I get an error when loading ERDDAP data. Here's the code I have used so far:
library(rerddap)
library(ncdf4)
info(datasetid = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")
And I get the following error:
Error in curl::curl_fetch_memory(x$url$url, handle = x$url$handle) :
schannel: next InitializeSecurityContext failed: SEC_E_INVALID_TOKEN (0x80090308) - The token supplied to the function is invalid
I would like some help with this. I'm new to this website too, so I apologize if the above question is not up to standard (code to be typed in a grey box, etc.).
Someone directed this post to my attention from the heatwaveR issues page on GitHub. Here is the answer I provided for them:
I do not manage the rerddap package so can't say exactly why it may be giving you this error. But I can say that I have noticed lately that the OISST data are often not available on the ERDDAP server in question. I (attempt to) download fresh data every day and am often denied with an error similar to the one you posted. It's gotten to the point where I had to insert some logic gates into my download script so it tells me that the data aren't currently being hosted before it tries to download them. I should also point out that one may download the "final" data from this server, which have roughly a two week delay from present day, as well as the "preliminary (prelim)" data, which are near-real-time but haven't gone through all of the QC steps yet. These two products are accounted for in the following code:
# First download the list of data products on the server
server_data <- rerddap::ed_datasets(which = "griddap", "https://www.ncei.noaa.gov/erddap/")$Dataset.ID
# Check if the "final" data are currently hosted
if(!"ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon" %in% server_data)
stop("Final data are not currently up on the ERDDAP server")
# Check if the "prelim" data are currently hosted
if(!"ncdc_oisst_v2_avhrr_prelim_by_time_zlev_lat_lon" %in% server_data)
stop("Prelim data are not currently up on the ERDDAP server")
If the data are available I then check the times/dates available with these two lines:
# Download final OISST meta-data
final_info <- rerddap::info(datasetid = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")
# Download prelim OISST meta-data
prelim_info <- rerddap::info(datasetid = "ncdc_oisst_v2_avhrr_prelim_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")
I ran this now and it looks like the data are currently available. Is your error from today, or from a day or two ago? The availability seems to cycle over the week, but I haven't quite made sense of any pattern yet. It is also important to note that about a day before the data go dark they are filled with all sorts of massive errors. So I've also had to add error trapping into my code that stops the data aggregation process once it detects temperatures in excess of some massive number. In this case it is something like 1^90, but the number isn't consistent, meaning it is not a missing-value placeholder.
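As a follow-up, once info() succeeds you can pull an actual subset with rerddap::griddap(); a minimal sketch, where the time range and lon/lat window are arbitrary examples and zlev = c(0, 0) selects the surface level of this dataset:
# Download a small spatial/temporal subset of the final OISST data
OISST_sub <- rerddap::griddap(final_info,
                              time = c("2019-01-01", "2019-01-07"),
                              zlev = c(0, 0),
                              latitude = c(-40, -35),
                              longitude = c(15, 25),
                              fields = "sst")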
To manually see for yourself if the data are being hosted you can go to this link and scroll to the bottom:
https://www.ncei.noaa.gov/erddap/griddap/index.html
All the best,
-Robert

Save long directory path to local variable in Apache Drill?

With Apache Drill, when querying files from the filesystem, is there any way to set a shortcut for long directory paths?
For example, in:
> SELECT * FROM dfs.`/Users/me/Clients/foo/current-data/sample/releases/test*.json`
Is there any way I can shorten /Users/me/Dropbox/Clients/foo/current-data/sample/releases/ to a local variable so I don't have to type the full path each time?
I've looked through the docs, but can't see any reference to this (but maybe I'm being dumb).
There are a couple of options here:
1. You could create a view from your long query so you don't have to type the monstrosity every time (a minimal sketch follows at the end of this answer). This is less flexible than the second solution. For more information, check out: https://drill.apache.org/docs/create-view
2. You could modify the dfs storage plugin settings (in the Web UI at http://<host>:8047 under the Storage tab, dfs) and create a new workspace pointing directly to the "/Users/me/Clients/foo/current-data/sample/releases" directory.
For example:
"releases": {
"location":
"/mapr/demo.mapr.com/data/a/university/student/records/grades/",
"writable": true,
"defaultInputFormat": null
}
Then you would be able to query: select * from dfs.releases.`tests.csv`
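For the first option, a minimal sketch of such a view (the view name is hypothetical, and dfs.tmp is assumed to be a writable workspace, which it is in the default dfs plugin config):
CREATE VIEW dfs.tmp.releases_tests AS
SELECT * FROM dfs.`/Users/me/Clients/foo/current-data/sample/releases/test*.json`;
-- afterwards:
SELECT * FROM dfs.tmp.releases_tests;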

ZooKeeper Multi-Server Setup by Example

The ZooKeeper multi-server config docs show the following config, which can be placed inside zoo.cfg (ZK's config file) on each server:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
Furthermore, they state that you need a myid file on each ZK node whose content matches that server's id in the server.<id> entries above. So, for example, in a 3-node "ensemble" (ZK cluster), the first node's myid file would simply contain the value 1, the second node's myid file would contain 2, and so forth.
I have a few practical questions about what this looks like in the real world:
1. Can localhost be used? If zoo.cfg has to be repeated on each node in the ensemble, is it OK to define the current server as localhost? For example, in a 3-node ensemble, would it be OK for Server #2's zoo.cfg to look like:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=localhost:2888:3888 # After all, we're on server #2!
server.3=zoo3:2888:3888
Or is this not advised/not possible?
2. Do the server ids have to be numerical? For instance, could I have a 5-node ensemble where each server's zoo.cfg looks like:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.red=zoo1:2888:3888
server.green=zoo2:2888:3888
server.blue=zoo3:2888:3888
server.orange=zoo1:2888:3888
server.purple=zoo2:2888:3888
And, say, Server 1's myid would contain the value red inside of it (etc.)?
1. Can localhost be used?
This is a good question, as the ZooKeeper docs don't make it crystal clear whether the configuration file only accepts IP addresses. They only say hostname, which could mean an IP address, a DNS name, or a name in the hosts file, such as localhost.
server.x=[hostname]:nnnnn[:nnnnn], etc
(No Java system property)
servers making up the ZooKeeper ensemble. When the server starts up, it determines which server it is by looking for the file myid in the data directory. That file contains the server number, in ASCII, and it should match x in server.x in the left hand side of this setting.
However, note that ZooKeeper recommends using exactly the same configuration file on all hosts:
ZooKeeper's behavior is governed by the ZooKeeper configuration file. This file is designed so that the exact same file can be used by all the servers that make up a ZooKeeper server assuming the disk layouts are the same. If servers use different configuration files, care must be taken to ensure that the list of servers in all of the different configuration files match.
So simply put the machine's IP address or hostname and everything should work. Also, I have personally tested using 0.0.0.0 (in a situation where the interface IP address was different from the public IP address) and it does work.
2. Do the server ids have to be numerical?
From the ZooKeeper multi-server configuration docs, myid needs to be a numerical value from 1 to 255:
The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
Since myid must match the x in the server.x parameter, we can infer that x must be a numerical value as well.
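As a concrete sketch, with dataDir=/var/zookeeper/ as in the configs above, the myid files would be created roughly like this (each command run on its respective server):
# On zoo1
echo 1 > /var/zookeeper/myid
# On zoo2
echo 2 > /var/zookeeper/myid
# On zoo3
echo 3 > /var/zookeeper/myid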