public class DefaultJsonHandler extends Object implements JsonHandler
Default implementation of JsonHandler based on Hadoop APIs.
Constructor and Description |
---|
DefaultJsonHandler(org.apache.hadoop.conf.Configuration hadoopConf) |
Modifier and Type | Method and Description |
---|---|
ColumnarBatch | parseJson(ColumnVector jsonStringVector, StructType outputSchema, java.util.Optional<ColumnVector> selectionVector) Parse the given JSON strings and return the fields requested by outputSchema as columns in a ColumnarBatch. |
CloseableIterator<ColumnarBatch> | readJsonFiles(CloseableIterator<FileStatus> scanFileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate) Read and parse the JSON format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema. |
void | writeJsonFileAtomically(String filePath, CloseableIterator<Row> data, boolean overwrite) Makes use of LogStore implementations in `delta-storage` to atomically write the data to a file, depending upon the destination filesystem. |
public DefaultJsonHandler(org.apache.hadoop.conf.Configuration hadoopConf)
public ColumnarBatch parseJson(ColumnVector jsonStringVector, StructType outputSchema, java.util.Optional<ColumnVector> selectionVector)
Description copied from interface: JsonHandler
Parse the given JSON strings and return the fields requested by outputSchema as columns in a ColumnarBatch.
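There are a few special cases that must be handled for specific data types:
FloatType and DoubleType: non-numeric values are encoded as strings.
NaN: "NaN"
Positive infinity: "+INF", "Infinity", "+Infinity"
Negative infinity: "-INF", "-Infinity"
DateType: dates are encoded as strings in the format "yyyy-MM-dd".
TimestampType: timestamps are encoded as strings in the format "yyyy-MM-dd'T'HH:mm:ss.SSSXXX".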
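The string encodings listed above can be illustrated with a minimal sketch that uses only the Java standard library. The class and method names (`JsonSpecialValues`, `parseDouble`, `parseDate`, `parseTimestamp`) are hypothetical and are not part of the Delta Kernel API; how DefaultJsonHandler actually performs these conversions is not shown in this page.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Delta Kernel: decodes the special string
// encodings that the parseJson contract lists for floating point, date and
// timestamp values.
final class JsonSpecialValues {
    // Pattern taken verbatim from the documentation above.
    private static final DateTimeFormatter TIMESTAMP_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");

    // Maps the documented non-numeric strings to IEEE 754 special values and
    // falls back to ordinary decimal parsing for everything else.
    static double parseDouble(String text) {
        switch (text) {
            case "NaN":
                return Double.NaN;
            case "+INF":
            case "Infinity":
            case "+Infinity":
                return Double.POSITIVE_INFINITY;
            case "-INF":
            case "-Infinity":
                return Double.NEGATIVE_INFINITY;
            default:
                return Double.parseDouble(text);
        }
    }

    // ISO_LOCAL_DATE accepts the "yyyy-MM-dd" form, e.g. "2024-01-15".
    static LocalDate parseDate(String text) {
        return LocalDate.parse(text, DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Parses a timestamp with millisecond precision and a zone offset,
    // e.g. "1970-01-01T00:00:01.500Z" or "1970-01-01T01:00:00.000+01:00".
    static Instant parseTimestamp(String text) {
        return OffsetDateTime.parse(text, TIMESTAMP_FORMAT).toInstant();
    }
}
```

For instance, `JsonSpecialValues.parseDouble("+INF")` yields `Double.POSITIVE_INFINITY`, a value that `Double.parseDouble` alone would reject.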
Specified by:
parseJson in interface JsonHandler
Parameters:
jsonStringVector - String ColumnVector of valid JSON strings.
outputSchema - Schema of the data to return from the parsed JSON. If any requested fields are missing in the JSON string, a null is returned for that particular field in the returned Row. The type for each given field is expected to match the type in the JSON.
selectionVector - Optional selection vector indicating the rows for which the JSON should be parsed. If present, only the selected rows should be parsed. Unselected rows should be all null in the returned batch.
Returns:
a ColumnarBatch of schema outputSchema with one row for each entry in jsonStringVector
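The selection vector contract (one output row per input row, with unselected rows left null) can be sketched with plain Java collections. This is only an illustration under stated assumptions: `SelectionVectorSketch` and `parseSelected` are hypothetical names, a `List<String>` stands in for the string ColumnVector, a `boolean[]` stands in for the selection vector, and the row parser is passed in as a function because the actual JSON parsing is not the point here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for the selection vector semantics of parseJson.
final class SelectionVectorSketch {
    // Returns one entry per input string. If a selection vector is present,
    // only rows marked true are parsed; all other rows are null. If it is
    // absent, every row is parsed.
    static <T> List<T> parseSelected(
            List<String> jsonStrings,
            Optional<boolean[]> selection,
            Function<String, T> parseRow) {
        List<T> out = new ArrayList<>(jsonStrings.size());
        for (int i = 0; i < jsonStrings.size(); i++) {
            boolean selected = !selection.isPresent() || selection.get()[i];
            out.add(selected ? parseRow.apply(jsonStrings.get(i)) : null);
        }
        return out;
    }
}
```

The output list always has the same length as the input, which mirrors the documented guarantee of one row for each entry in jsonStringVector.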
public CloseableIterator<ColumnarBatch> readJsonFiles(CloseableIterator<FileStatus> scanFileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate) throws java.io.IOException
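Description copied from interface: JsonHandler
Read and parse the JSON format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema.
Specified by:
readJsonFiles in interface JsonHandler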
Parameters:
scanFileIter - Iterator of files to read data from.
physicalSchema - Select list of columns to read from the JSON file.
predicate - Optional predicate which the JSON reader can optionally use to prune rows that don't satisfy the predicate. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate to the data returned by this method.
Returns:
an iterator of ColumnarBatch objects containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in scanFileIter.
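Two caller obligations follow from the readJsonFiles contract: the predicate must be applied again to whatever is returned, and the returned iterator must be closed. A minimal sketch of that pattern, using only the Java standard library, might look like the following. `ReadResultsSketch`, `CloseableIter`, `wrap`, and `drainAndFilter` are hypothetical names; `CloseableIter` merely stands in for the library's CloseableIterator, and `java.util.function.Predicate` stands in for the Kernel Predicate type.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the caller-side pattern; not the Delta Kernel API.
final class ReadResultsSketch {
    // Stand-in for a closeable iterator whose close() throws no checked exception.
    interface CloseableIter<T> extends Iterator<T>, AutoCloseable {
        @Override
        void close();
    }

    // Adapts a plain iterator, running onClose when the iterator is closed.
    static <T> CloseableIter<T> wrap(Iterator<T> source, Runnable onClose) {
        return new CloseableIter<T>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public T next() { return source.next(); }
            @Override public void close() { onClose.run(); }
        };
    }

    // Re-applies the predicate to every returned row, since reader-side
    // pruning may be skipped or incomplete, and always closes the iterator.
    static <T> List<T> drainAndFilter(CloseableIter<T> rows, Predicate<T> predicate) {
        List<T> kept = new ArrayList<>();
        try (CloseableIter<T> it = rows) {
            while (it.hasNext()) {
                T row = it.next();
                if (predicate.test(row)) {
                    kept.add(row);
                }
            }
        }
        return kept;
    }
}
```

The try-with-resources statement closes the iterator even if the predicate or the iterator itself throws while the results are being consumed.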
Throws:
java.io.IOException - if an I/O error occurs during the read.
public void writeJsonFileAtomically(String filePath, CloseableIterator<Row> data, boolean overwrite) throws java.io.IOException
Makes use of LogStore implementations in `delta-storage` to atomically write the data to a file, depending upon the destination filesystem.
Specified by:
writeJsonFileAtomically in interface JsonHandler
Parameters:
filePath - Destination file path
data - Data to write as JSON
overwrite - If true, the file is overwritten if it already exists. If false and a file exists, FileAlreadyExistsException is thrown.
Throws:
java.io.IOException
java.nio.file.FileAlreadyExistsException - if the file already exists and overwrite is false.
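The overwrite contract above can be illustrated on a local filesystem with `java.nio.file`. This sketch does not use LogStore and is not how DefaultJsonHandler is implemented; it is one possible way to get the same observable behavior under the assumptions that the destination is a local path, that its filesystem supports hard links, and that rows are already serialized as JSON strings. `AtomicJsonWriteSketch` and `writeLinesAtomically` are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical local-filesystem sketch of the overwrite semantics; the real
// handler delegates to LogStore implementations instead.
final class AtomicJsonWriteSketch {
    static void writeLinesAtomically(Path target, List<String> jsonLines, boolean overwrite)
            throws IOException {
        // Stage the full content in a temporary file in the same directory so
        // readers never observe a partially written target.
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, ".tmp-", ".json");
        try {
            Files.write(tmp, jsonLines, StandardCharsets.UTF_8);
            if (overwrite) {
                // Replace any existing file in a single rename step.
                Files.move(tmp, target,
                    StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
            } else {
                // Creating a hard link fails with FileAlreadyExistsException if
                // the target exists (assumes hard-link support).
                Files.createLink(target, tmp);
            }
        } finally {
            // Remove the staging file; it is already gone after a successful move.
            Files.deleteIfExists(tmp);
        }
    }
}
```

Calling `writeLinesAtomically` twice on the same path with overwrite set to false fails the second time with java.nio.file.FileAlreadyExistsException and leaves the first file's content unchanged.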