Is batch import option available? - mysql

Is there any method to import data from MySQL to Elasticsearch batch by batch? If yes, how do I do it?
I ask because the bulk import seems to be a problem: when I import 191000 items, only a few are actually imported.
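One possible approach (not from the original post) is to read rows out of MySQL in chunks and feed them to Elasticsearch's bulk helper, which reports failures instead of dropping them silently. This is only a sketch under assumptions: it uses pymysql and the elasticsearch-py helpers module, and the connection details, table name (my_table), columns, and index name (my_index) are placeholders.
import pymysql
from elasticsearch import Elasticsearch, helpers

# Placeholder connection details -- adjust for your own setup.
conn = pymysql.connect(host="localhost", user="root", password="secret",
                       db="mydb", cursorclass=pymysql.cursors.SSCursor)
es = Elasticsearch("http://localhost:9200")

def generate_actions(batch_size=1000):
    # Stream rows out of MySQL in fixed-size batches instead of
    # materialising all 191000 rows at once.
    with conn.cursor() as cursor:
        cursor.execute("SELECT id, name FROM my_table")  # placeholder query
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            for row_id, name in rows:
                yield {"_index": "my_index", "_id": row_id,
                       "_source": {"name": name}}

# helpers.bulk sends the documents in chunks and reports how many succeeded,
# so failed documents show up as errors instead of silently disappearing.
success, errors = helpers.bulk(es, generate_actions(), chunk_size=1000,
                               raise_on_error=False)
print(success, errors)
Checking the errors returned here is usually the quickest way to find out why only a few of the 191000 items made it into the index.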

How to convert SQLAlchemy data model into SQLModel data model

Context:
I want to build an API using FastAPI and SQLModel.
I need to use database reflection to create the SQLModel table models based on an existing database.
I couldn't find how to do database reflection directly in SQLModel.
So I do database reflection in SQLAlchemy, and now I want to turn the SQLAlchemy table models into SQLModel ones to easily use them with FastAPI.
Problem:
I can't figure out how to create an SQLModel table model based on the SQLAlchemy table model that I created with database reflection, as shown in the code below.
The SQLModel docs suggest easy integration with SQLAlchemy, so I figured it should be easy...
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session
from sqlalchemy.ext.declarative import declarative_base
from fastapi import FastAPI
from sqlmodel import SQLModel

app = FastAPI()

# do database reflection
engine = create_engine('sqlite:///database.db')
Base = declarative_base()
Base.metadata.reflect(engine)
print("alchemy meta tables:\n ", Base.metadata.tables)

# creates SQLAlchemy model based on database table schema of table 'hero'
class HeroDBReflection(Base):
    __table__ = Base.metadata.tables['hero']

# TODO: How to create this Hero SQLModel data model based on the SQLAlchemy model 'HeroDBReflection'?
# class Hero(SQLModel, table=True):
#     metadata = Base.metadata.tables['hero'] ?

# @app.post("/heroes/", response_model=Hero)
# def create_hero(hero: Hero):
#     session.add(hero)
#     session.commit()
#     session.refresh(hero)
#     return hero
This is my first Stack Overflow post, so I hope that my question is clear :)
Thanks a lot!
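For comparison (not from the original post), one workaround is to skip the automatic conversion and declare the SQLModel table model by hand so that its fields mirror the reflected 'hero' table. The column names and types below are assumptions based on the usual SQLModel tutorial schema, so they would need to be adjusted to the real database.
from typing import Optional
from sqlmodel import Field, SQLModel

# Hand-written mirror of the reflected 'hero' table -- the columns here are
# assumed placeholders and must be kept in sync with the actual schema.
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    secret_name: str
    age: Optional[int] = None
This loses the "derive it automatically from the database" part of the question, but it does give FastAPI a regular SQLModel class to use as a response_model.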

how to load modin dataframe from pyarrow or pandas

Since Modin does not support loading from multiple Parquet files on S3, I am using pyarrow to load the data.
import s3fs
import modin.pandas as pd
from pyarrow import parquet
s3 = s3fs.S3FileSystem(
    key=aws_key,
    secret=aws_secret,
)
table = parquet.ParquetDataset(
    path_or_paths="s3://bucket/path",
    filesystem=s3,
).read(
    columns=["hotelId", "startDate", "endDate"]
)
# to get a pandas df the next step would be table.to_pandas()
Now I want to put the data in a Modin df for parallel computations, without having to write to and read from a CSV. Is there a way to construct the Modin df directly from a pyarrow.Table, or at least from a pandas dataframe?
Mahesh's answer should work, but I believe it would result in a full copy of the data (a 2x memory footprint by default: https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)
At the time of writing, Modin does have a native Arrow integration, so you can convert directly using
from modin.pandas.utils import from_arrow
mdf = from_arrow(pyarrow_table)
You can't construct the Modin dataframe directly out of a pyarrow.Table, because pandas doesn't support that, and Modin only supports a subset of the pandas API. However, the table has a method that converts it to a pandas dataframe, and you can construct the Modin dataframe out of that. Using table from your code:
import modin.pandas as pd
modin_dataframe = pd.DataFrame(table.to_pandas())

DirectoryPartitioning in pyarrow (python)

dataset = ds.dataset("abfs://test", format="parquet", partitioning="hive", filesystem=fs)
I can read datasets with the pyarrow dataset feature, but how can I write to a dataset with a different schema?
I seem to be able to import DirectoryPartitioning, for example, but I cannot figure out how to write the data with a partition layout like this:
from pyarrow.dataset import DirectoryPartitioning
partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
print(partitioning.parse("/2009/11/3"))
Will we continue to use write_to_dataset to write Parquet files, or will there be a new method specific to the datasets class?
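For reference, newer pyarrow releases expose a write_dataset function in pyarrow.dataset that accepts a partitioning object. The sketch below is a minimal example under that assumption; the table contents and the output directory "out" are made up for illustration.
import pyarrow as pa
import pyarrow.dataset as ds

# A small in-memory table; column names and values are placeholders.
table = pa.table({
    "year": pa.array([2009, 2009], pa.int16()),
    "month": pa.array([11, 12], pa.int8()),
    "day": pa.array([3, 4], pa.int8()),
    "value": [1.0, 2.0],
})

partitioning = ds.DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

# write_dataset splits the table into one directory per partition value,
# e.g. out/2009/11/3/part-0.parquet
ds.write_dataset(table, "out", format="parquet", partitioning=partitioning)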

Can't understand this module/type error

I'm trying to use the Aeson JSON library in Haskell. Right now, I just need to use "decode" to read a JSON dump.
import Data.Aeson
import Data.ByteString as BS
import Control.Applicative
main :: IO ()
main = print $ decode <$> BS.readFile "json"
I got the following error when trying to compile/run it:
Couldn't match type 'ByteString'
               with 'Data.ByteString.Lazy.Internal.ByteString'
NB: 'ByteString' is defined in 'Data.ByteString.Internal'
    'Data.ByteString.Lazy.Internal.ByteString'
        is defined in 'Data.ByteString.Lazy.Internal'
This error doesn't make sense to me. I tried importing the modules mentioned by GHC, but the import either fails or doesn't solve the problem.
Thanks
There are two variants of ByteString: a strict one (the default), exported by Data.ByteString, and a lazy one, exported by Data.ByteString.Lazy.
Aeson works on top of lazy ByteStrings, so you should change your second import line to
import Data.ByteString.Lazy as BS

Error on trying to use Dataframe.to_json method

I'm trying to export a pandas dataframe to JSON with no luck. I've tried:
all_data.to_json("spdata.json") and all_data.to_json()
I get the same attribute error on both: 'DataFrame' object has no attribute 'to_json'. Just to make sure nothing is wrong with the DataFrame, I tested writing it with to_csv, and that worked.
Is there something I'm missing in my syntax, or a package I need to import? I am running Python 2.7.5 as part of an Enthought Canopy Express package. Imports at the beginning of my code are:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from sys import argv
from datetime import datetime, timedelta
from dateutil.parser import parse
The to_json method was introduced in pandas 0.12, so you'll need to upgrade your pandas to be able to use it.
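As a quick sanity check (not part of the original answer), printing the installed version confirms whether the upgrade is actually needed; the upgrade hint in the comment assumes a plain pip setup rather than Canopy's own package manager.
import pandas as pd

# DataFrame.to_json is only available from pandas 0.12 onwards.
print(pd.__version__)

# If the version printed is older than 0.12, upgrading pandas (for example
# with "pip install --upgrade pandas" outside Canopy, or through Canopy's
# package manager) makes to_json available.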