HiveText
| Input | Output | Alias |
|---|---|---|
| ✔ | ✗ |
Description
HiveText reads the text serialization format used by Apache Hive
tables (the format produced by Hive's LazySimpleSerDe). It is a delimited text
format, similar to CSV, in which fields are
separated by the Hive default \x01 (Ctrl-A) delimiter. The field delimiter is
configurable via input_format_hive_text_fields_delimiter.
HiveText is an input-only format. The data has no header row: values are
mapped positionally onto the columns of the destination table, so the column
names and types are taken from the table (or from an explicitly provided
structure) rather than inferred from the data. While reading, ClickHouse parses
dates and times in best-effort mode (see date_time_input_format),
fills omitted trailing fields with column defaults, and skips fields it does not
recognize.
Within a field, values are parsed using the same escaping rules as CSV rather
than Hive's nested delimiters. In particular, a column of type
Array is read from the bracketed
representation (for example, "['a','b','c']"), not from values separated by
the Hive collection delimiter \x02.
The input_format_hive_text_collection_items_delimiter and
input_format_hive_text_map_keys_delimiter settings are
accepted for compatibility but are currently not used during parsing.
By default, rows are allowed to have a variable number of fields (see
input_format_hive_text_allow_variable_number_of_columns):
rows with fewer fields than the table have the missing columns filled with
default values, and rows with extra trailing fields have the extras skipped.
Example usage
The examples below override the default field delimiter with a comma (,) using
input_format_hive_text_fields_delimiter so that the input
files are easy to read.
Reading a HiveText file
Given a file hive_data.txt with comma-separated fields:
We create a table that defines the column names and types, and insert the file
into it with FORMAT HiveText:
Note that the first row, 1,3, has only two fields, so the missing column c
is filled with its default value 0.
Variable number of columns
With the default input_format_hive_text_allow_variable_number_of_columns = 1,
rows that have more fields than the table simply have the extra trailing fields
skipped:
Setting input_format_hive_text_allow_variable_number_of_columns = 0 instead
enforces a strict field count, and a row with fewer fields than the table raises
a parsing exception.
Format settings
| Setting | Description | Default |
|---|---|---|
input_format_hive_text_fields_delimiter | Delimiter between fields in Hive Text File | \x01 |
input_format_hive_text_collection_items_delimiter | Delimiter between collection (array or map) items in Hive Text File. Accepted but currently not used during parsing. | \x02 |
input_format_hive_text_map_keys_delimiter | Delimiter between a pair of map key/values in Hive Text File. Accepted but currently not used during parsing. | \x03 |
input_format_hive_text_allow_variable_number_of_columns | Ignore extra columns in Hive Text input (if file has more columns than expected) and treat missing fields as default values | 1 |