==> orc-format-1.1.0/.asf.yaml <==
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# https://cwiki.apache.org/confluence/display/INFRA/git+-+.asf.yaml+features
---
github:
  description: "Apache ORC - the smallest, fastest columnar storage for Hadoop workloads"
  homepage: https://orc.apache.org/
  features:
    issues: true
  enabled_merge_buttons:
    merge: false
    squash: true
    rebase: true
  labels:
    - apache
    - orc
    - java
    - cpp
    - big-data
notifications:
  pullrequests: issues@orc.apache.org
  issues: issues@orc.apache.org
  commits: commits@orc.apache.org
==> orc-format-1.1.0/LICENSE <==
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
APACHE ORC SUBCOMPONENTS:
The Apache ORC project contains subcomponents with separate copyright
notices and license terms. Your use of the source code for these
subcomponents is subject to the terms and conditions of the following
licenses.
For protobuf:
Copyright 2008 Google Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of Google Inc. nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Code generated by the Protocol Buffer compiler is owned by the owner
of the input file used when generating it. This code is not
standalone and requires a support library to be linked with it. This
support library is itself covered by the above license.
For the site:
Parts of the site formatting includes software developed by Tom Preston-Werner
that are licensed under the MIT License (MIT):
(c) Copyright [2008-2015] Tom Preston-Werner
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
For snappy:
Copyright 2011, Google Inc.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of Google Inc. nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
For zlib:
(C) 1995-2017 Jean-loup Gailly and Mark Adler
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Jean-loup Gailly Mark Adler
jloup@gzip.org madler@alumni.caltech.edu
If you use the zlib library in a product, we would appreciate *not* receiving
lengthy legal documents to sign. The sources are provided for free but without
warranty of any kind. The library has been entirely written by Jean-loup
Gailly and Mark Adler; it does not include third-party code.
If you redistribute modified sources, we would appreciate that you include in
the file ChangeLog history information documenting your changes. Please read
the FAQ for more information on the distribution of modified source versions.
For orc.threeten:
/*
* Copyright (c) 2007-present, Stephen Colebourne & Michael Nascimento Santos
*
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimer.
*
* * Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* * Neither the name of JSR-310 nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
* NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

==> orc-format-1.1.0/pom.xml <==
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache</groupId>
    <artifactId>apache</artifactId>
    <version>29</version>
  </parent>
  <groupId>org.apache.orc</groupId>
  <artifactId>orc-format</artifactId>
  <version>1.1.0</version>
  <packaging>jar</packaging>
  <name>Apache ORC Format</name>
  <description>ORC is a self-describing type-aware columnar file format designed
    for Hadoop workloads. It is optimized for large streaming reads,
    but with integrated support for finding required rows
    quickly. Storing data in a columnar format lets the reader read,
    decompress, and process only the values that are required for the
    current query.
  </description>
  <url>https://orc.apache.org</url>
  <inceptionYear>2013</inceptionYear>
  <mailingLists>
    <mailingList>
      <name>ORC User List</name>
      <subscribe>user-subscribe@orc.apache.org</subscribe>
      <unsubscribe>user-unsubscribe@orc.apache.org</unsubscribe>
      <post>user@orc.apache.org</post>
      <archive>https://mail-archives.apache.org/mod_mbox/orc-user/</archive>
    </mailingList>
    <mailingList>
      <name>ORC Developer List</name>
      <subscribe>dev-subscribe@orc.apache.org</subscribe>
      <unsubscribe>dev-unsubscribe@orc.apache.org</unsubscribe>
      <post>dev@orc.apache.org</post>
      <archive>https://mail-archives.apache.org/mod_mbox/orc-dev/</archive>
    </mailingList>
  </mailingLists>
  <properties>
    <java.version>17</java.version>
    <maven.version>3.9.6</maven.version>
    <protoc.version>3.17.3</protoc.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
      <version>3.25.5</version>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
      <id>gcs-maven-central-mirror</id>
      <name>GCS Maven Central mirror</name>
      <url>https://maven-central.storage-download.googleapis.com/maven2/</url>
    </repository>
    <repository>
      <id>central</id>
      <name>Maven Repository</name>
      <url>https://repo.maven.apache.org/maven2</url>
    </repository>
  </repositories>
  <build>
    <plugins>
      <!-- Detailed plugin configuration not recoverable from this archive:
           com.github.os72:protoc-jar-maven-plugin 3.11.4 (run at
           generate-sources over src/main/proto/), maven-enforcer-plugin
           3.4.0 with extra-enforcer-rules 1.7.0 (enforce-maven),
           maven-compiler-plugin 3.10.1, maven-javadoc-plugin (excludes
           **/OrcProto.java), maven-shade-plugin with shaded-protobuf and
           nohive classifiers (relocating com.google.protobuf to
           org.apache.orc.protobuf and the Hive storage packages to
           org.apache.orc.storage), build-helper-maven-plugin 3.4.0
           (add-source for ${project.build.directory}/generated-sources),
           and org.cyclonedx:cyclonedx-maven-plugin 2.7.6 (makeBom). -->
    </plugins>
  </build>
</project>
==> orc-format-1.1.0/NOTICE <==
Apache ORC
Copyright 2013 and onwards The Apache Software Foundation.
This product includes software developed by The Apache Software
Foundation (http://www.apache.org/).
This product includes software developed by Hewlett-Packard:
(c) Copyright [2014-2015] Hewlett-Packard Development Company, L.P
==> orc-format-1.1.0/README.md <==
# [Apache ORC](https://orc.apache.org/)
ORC is a self-describing type-aware columnar file format designed for
Hadoop workloads. It is optimized for large streaming reads, but with
integrated support for finding required rows quickly. Storing data in
a columnar format lets the reader read, decompress, and process only
the values that are required for the current query. Because ORC files
are type-aware, the writer chooses the most appropriate encoding for
the type and builds an internal index as the file is written.
Predicate pushdown uses those indexes to determine which stripes in a
file need to be read for a particular query and the row indexes can
narrow the search to a particular set of 10,000 rows. ORC supports the
complete set of types in Hive, including the complex types: structs,
lists, maps, and unions.
## ORC Format
This project includes ORC specifications and the protobuf definition.
`Apache ORC Format 1.0.0` is designed to be used for `Apache ORC 2.0+`.
Releases:
* Maven Central
* Downloads: Apache ORC downloads
* Release tags: Apache ORC Format releases
* Plan: Apache ORC Format future release plan
The current build status:
* Main branch

Bug tracking: Apache ORC Format Issues
## Building
```
./mvnw install
```
==> orc-format-1.1.0/.gitignore <==
target
.classpath*
.project
.settings
*~
*.iml
dependency-reduced-pom.xml
*.ipr
*.iws
.idea
.DS_Store
.java-version
*.swp
==> orc-format-1.1.0/mvnw <==
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Determine the current working directory
_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# Preserve the calling directory
_CALLING_DIR="$(pwd)"
# Options used during compilation
_COMPILE_JVM_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
# Installs any application tarball given a URL, the expected tarball name,
# and, optionally, a checkable binary path to determine if the binary has
# already been installed
## Arg1 - URL
## Arg2 - Tarball Name
## Arg3 - Checkable Binary
install_app() {
  local remote_tarball="$1/$2"
  local local_tarball="${_DIR}/build/$2"
  local binary="${_DIR}/build/$3"
  local curl_opts="--silent --show-error -L"
  local wget_opts="--no-verbose"

  if [ -z "$3" -o ! -f "$binary" ]; then
    # check if we already have the tarball
    # check if we have curl installed
    # download application
    [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \
      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \
      curl ${curl_opts} "${remote_tarball}" > "${local_tarball}"
    # if the file still doesn't exist, let's try `wget` and cross our fingers
    [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \
      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \
      wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}"
    # if both were unsuccessful, exit
    [ ! -f "${local_tarball}" ] && \
      echo -n "ERROR: Cannot download $2 with cURL or wget; " && \
      echo "please install manually and try again." && \
      exit 2
    cd "${_DIR}/build" && tar -xzf "$2"
  fi
}
# See simple version normalization: http://stackoverflow.com/questions/16989598/bash-comparing-version-numbers
function version { echo "$@" | awk -F. '{ printf("%03d%03d%03d\n", $1,$2,$3); }'; }
# Determine the Maven version from the root pom.xml file and
# install maven under the build/ folder if needed.
install_mvn() {
  local MVN_VERSION=`grep "<maven.version>" "${_DIR}/pom.xml" | head -n1 | awk -F '[<>]' '{print $3}'`
  MVN_BIN="$(command -v mvn)"
  if [ "$MVN_BIN" ]; then
    local MVN_DETECTED_VERSION="$(mvn --version | head -n1 | awk '{print $3}')"
  fi
  if [ $(version $MVN_DETECTED_VERSION) -lt $(version $MVN_VERSION) ]; then
    local APACHE_MIRROR=${APACHE_MIRROR:-'https://www.apache.org/dyn/closer.lua?action=download&filename='}

    if [ $(command -v curl) ]; then
      local TEST_MIRROR_URL="${APACHE_MIRROR}/maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz"
      if ! curl -L --output /dev/null --silent --head --fail "$TEST_MIRROR_URL" ; then
        # Fall back to archive.apache.org for older Maven
        echo "Falling back to archive.apache.org to download Maven"
        APACHE_MIRROR="https://archive.apache.org/dist"
      fi
    fi

    mkdir -p build
    install_app \
      "${APACHE_MIRROR}/maven/maven-3/${MVN_VERSION}/binaries" \
      "apache-maven-${MVN_VERSION}-bin.tar.gz" \
      "apache-maven-${MVN_VERSION}/bin/mvn"

    MVN_BIN="${_DIR}/build/apache-maven-${MVN_VERSION}/bin/mvn"
  fi
}
install_mvn
# Reset the current working directory
cd "${_CALLING_DIR}"
# Set any `mvn` options if not already present
export MAVEN_OPTS=${MAVEN_OPTS:-"$_COMPILE_JVM_OPTS"}
echo "Using \`mvn\` from path: $MVN_BIN" 1>&2
"${MVN_BIN}" "$@"
==> orc-format-1.1.0/src/main/proto/buf.yaml <==
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
version: v1
breaking:
  use:
    - FILE
lint:
  use:
    - BASIC
==> orc-format-1.1.0/src/main/proto/orc/proto/orc_proto.proto <==
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
syntax = "proto2";
package orc.proto;
option java_package = "org.apache.orc";
message IntegerStatistics {
optional sint64 minimum = 1;
optional sint64 maximum = 2;
optional sint64 sum = 3;
}
message DoubleStatistics {
optional double minimum = 1;
optional double maximum = 2;
optional double sum = 3;
}
message StringStatistics {
optional string minimum = 1;
optional string maximum = 2;
// sum will store the total length of all strings in a stripe
optional sint64 sum = 3;
// If the minimum or maximum value was longer than 1024 bytes, store a lower or upper
// bound instead of the minimum or maximum values above.
optional string lower_bound = 4;
optional string upper_bound = 5;
}
message BucketStatistics {
repeated uint64 count = 1 [packed=true];
}
message DecimalStatistics {
optional string minimum = 1;
optional string maximum = 2;
optional string sum = 3;
}
message DateStatistics {
// min,max values saved as days since epoch
optional sint32 minimum = 1;
optional sint32 maximum = 2;
}
message TimestampStatistics {
// min,max values saved as milliseconds since epoch
optional sint64 minimum = 1;
optional sint64 maximum = 2;
optional sint64 minimum_utc = 3;
optional sint64 maximum_utc = 4;
// store the lower 6 TS digits for min/max to achieve nanosecond precision
optional int32 minimum_nanos = 5;
optional int32 maximum_nanos = 6;
}
message BinaryStatistics {
// sum will store the total binary blob length in a stripe
optional sint64 sum = 1;
}
// Statistics for list and map
message CollectionStatistics {
optional uint64 min_children = 1;
optional uint64 max_children = 2;
optional uint64 total_children = 3;
}
// Bounding box for Geometry or Geography type in the representation of min/max
// value pair of coordinates from each axis.
message BoundingBox {
optional double xmin = 1;
optional double xmax = 2;
optional double ymin = 3;
optional double ymax = 4;
optional double zmin = 5;
optional double zmax = 6;
optional double mmin = 7;
optional double mmax = 8;
}
// Statistics specific to Geometry or Geography type
message GeospatialStatistics {
// A bounding box of geospatial instances
optional BoundingBox bbox = 1;
// Geospatial type codes of all instances, or an empty list if not known
repeated int32 geospatial_types = 2;
}
message ColumnStatistics {
optional uint64 number_of_values = 1;
optional IntegerStatistics int_statistics = 2;
optional DoubleStatistics double_statistics = 3;
optional StringStatistics string_statistics = 4;
optional BucketStatistics bucket_statistics = 5;
optional DecimalStatistics decimal_statistics = 6;
optional DateStatistics date_statistics = 7;
optional BinaryStatistics binary_statistics = 8;
optional TimestampStatistics timestamp_statistics = 9;
optional bool has_null = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
optional GeospatialStatistics geospatial_statistics = 13;
}
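// The row index entry records, for one row group, the positions needed to
// seek to the start of the group in each of the column's streams, plus the
// statistics for that row group.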
message RowIndexEntry {
repeated uint64 positions = 1 [packed=true];
optional ColumnStatistics statistics = 2;
}
message RowIndex {
repeated RowIndexEntry entry = 1;
}
message BloomFilter {
optional uint32 num_hash_functions = 1;
repeated fixed64 bitset = 2;
optional bytes utf8bitset = 3;
}
message BloomFilterIndex {
repeated BloomFilter bloom_filter = 1;
}
message Stream {
// if you add new index stream kinds, you need to make sure to update
// StreamName to ensure it is added to the stripe in the right area
enum Kind {
PRESENT = 0;
DATA = 1;
LENGTH = 2;
DICTIONARY_DATA = 3;
DICTIONARY_COUNT = 4;
SECONDARY = 5;
ROW_INDEX = 6;
BLOOM_FILTER = 7;
BLOOM_FILTER_UTF8 = 8;
// Virtual stream kinds to allocate space for encrypted index and data.
ENCRYPTED_INDEX = 9;
ENCRYPTED_DATA = 10;
// stripe statistics streams
STRIPE_STATISTICS = 100;
// A virtual stream kind that is used for setting the encryption IV.
FILE_STATISTICS = 101;
}
optional Kind kind = 1;
optional uint32 column = 2;
optional uint64 length = 3;
}
message ColumnEncoding {
enum Kind {
DIRECT = 0;
DICTIONARY = 1;
DIRECT_V2 = 2;
DICTIONARY_V2 = 3;
}
optional Kind kind = 1;
optional uint32 dictionary_size = 2;
// The encoding of the bloom filters for this column:
// 0 or missing = none or original
// 1 = ORC-135 (utc for timestamps)
optional uint32 bloom_encoding = 3;
}
message StripeEncryptionVariant {
repeated Stream streams = 1;
repeated ColumnEncoding encoding = 2;
}
// each stripe looks like:
// index streams
// unencrypted
// variant 1..N
// data streams
// unencrypted
// variant 1..N
// footer
message StripeFooter {
repeated Stream streams = 1;
repeated ColumnEncoding columns = 2;
optional string writer_timezone = 3;
// one for each column encryption variant
repeated StripeEncryptionVariant encryption = 4;
}
// the file tail looks like:
// encrypted stripe statistics: ColumnarStripeStatistics (order by variant)
// stripe statistics: Metadata
// footer: Footer
// postscript: PostScript
// psLen: byte
message StringPair {
optional string key = 1;
optional string value = 2;
}
message Type {
enum Kind {
BOOLEAN = 0;
BYTE = 1;
SHORT = 2;
INT = 3;
LONG = 4;
FLOAT = 5;
DOUBLE = 6;
STRING = 7;
BINARY = 8;
TIMESTAMP = 9;
LIST = 10;
MAP = 11;
STRUCT = 12;
UNION = 13;
DECIMAL = 14;
DATE = 15;
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
GEOMETRY = 19;
GEOGRAPHY = 20;
}
optional Kind kind = 1;
repeated uint32 subtypes = 2 [packed=true];
repeated string field_names = 3;
optional uint32 maximum_length = 4;
optional uint32 precision = 5;
optional uint32 scale = 6;
repeated StringPair attributes = 7;
// Coordinate Reference System (CRS) for Geometry and Geography types
optional string crs = 8;
// Edge interpolation algorithm for Geography type
enum EdgeInterpolationAlgorithm {
SPHERICAL = 0;
VINCENTY = 1;
THOMAS = 2;
ANDOYER = 3;
KARNEY = 4;
}
optional EdgeInterpolationAlgorithm algorithm = 9;
}
message StripeInformation {
// the global file offset of the start of the stripe
optional uint64 offset = 1;
// the number of bytes of index
optional uint64 index_length = 2;
// the number of bytes of data
optional uint64 data_length = 3;
// the number of bytes in the stripe footer
optional uint64 footer_length = 4;
// the number of rows in this stripe
optional uint64 number_of_rows = 5;
// If this is present, the reader should use this value for the encryption
// stripe id for setting the encryption IV. Otherwise, the reader should
// use one larger than the previous stripe's encryptStripeId.
// For unmerged ORC files, the first stripe will use 1 and the rest of the
// stripes won't have it set. For merged files, the stripe information
// will be copied from their original files and thus the first stripe of
// each of the input files will reset it to 1.
// Note that 1 was chosen, because protobuf v3 doesn't serialize
// primitive types that are the default (e.g. 0).
optional uint64 encrypt_stripe_id = 6;
// For each encryption variant, the new encrypted local key to use
// until we find a replacement.
repeated bytes encrypted_local_keys = 7;
}
message UserMetadataItem {
optional string name = 1;
optional bytes value = 2;
}
// StripeStatistics (one per stripe), each of which contains the
// ColumnStatistics for every column.
// This message type is only used in ORC v0 and v1.
message StripeStatistics {
repeated ColumnStatistics col_stats = 1;
}
// This message type is only used in ORC v0 and v1.
message Metadata {
repeated StripeStatistics stripe_stats = 1;
}
// In ORC v2 (and for encrypted columns in v1), each column has
// its column statistics written separately.
message ColumnarStripeStatistics {
// one value for each stripe in the file
repeated ColumnStatistics col_stats = 1;
}
enum EncryptionAlgorithm {
UNKNOWN_ENCRYPTION = 0; // used for detecting future algorithms
AES_CTR_128 = 1;
AES_CTR_256 = 2;
}
message FileStatistics {
repeated ColumnStatistics column = 1;
}
// How was the data masked? This isn't necessary for reading the file, but
// is documentation about how the file was written.
message DataMask {
// the kind of masking, which may include third party masks
optional string name = 1;
// parameters for the mask
repeated string mask_parameters = 2;
// the unencrypted column roots this mask was applied to
repeated uint32 columns = 3 [packed = true];
}
// Information about the encryption keys.
message EncryptionKey {
optional string key_name = 1;
optional uint32 key_version = 2;
optional EncryptionAlgorithm algorithm = 3;
}
// The description of an encryption variant.
// Each variant is a single subtype that is encrypted with a single key.
message EncryptionVariant {
// the column id of the root
optional uint32 root = 1;
// The master key that was used to encrypt the local key, referenced as
// an index into the Encryption.key list.
optional uint32 key = 2;
// the encrypted key for the file footer
optional bytes encrypted_key = 3;
// the stripe statistics for this variant
repeated Stream stripe_statistics = 4;
// encrypted file statistics as a FileStatistics
optional bytes file_statistics = 5;
}
// Which KeyProvider encrypted the local keys.
enum KeyProviderKind {
UNKNOWN = 0;
HADOOP = 1;
AWS = 2;
GCP = 3;
AZURE = 4;
}
message Encryption {
// all of the masks used in this file
repeated DataMask mask = 1;
// all of the keys used in this file
repeated EncryptionKey key = 2;
// The encrypted variants.
// Readers should prefer the first variant for which the user has access
// to the corresponding key. If they don't have access to any of the keys,
// they should get the unencrypted masked data.
repeated EncryptionVariant variants = 3;
// How are the local keys encrypted?
optional KeyProviderKind key_provider = 4;
}
enum CalendarKind {
UNKNOWN_CALENDAR = 0;
// A hybrid Julian/Gregorian calendar with a cutover point in October 1582.
JULIAN_GREGORIAN = 1;
// A calendar that extends the Gregorian calendar back forever.
PROLEPTIC_GREGORIAN = 2;
}
message Footer {
optional uint64 header_length = 1;
optional uint64 content_length = 2;
repeated StripeInformation stripes = 3;
repeated Type types = 4;
repeated UserMetadataItem metadata = 5;
optional uint64 number_of_rows = 6;
repeated ColumnStatistics statistics = 7;
optional uint32 row_index_stride = 8;
// Each implementation that writes ORC files should register for a code
// 0 = ORC Java
// 1 = ORC C++
// 2 = Presto
// 3 = Scritchley Go from https://github.com/scritchley/orc
// 4 = Trino
// 5 = CUDF
optional uint32 writer = 9;
// information about the encryption in this file
optional Encryption encryption = 10;
optional CalendarKind calendar = 11;
// informative description about the version of the software that wrote
// the file. It is assumed to be within a given writer, so for example
// ORC 1.7.2 = "1.7.2". It may include suffixes, such as "-SNAPSHOT".
optional string software_version = 12;
}
enum CompressionKind {
NONE = 0;
ZLIB = 1;
SNAPPY = 2;
LZO = 3;
LZ4 = 4;
ZSTD = 5;
BROTLI = 6;
}
// Serialized length must be less than 255 bytes
message PostScript {
optional uint64 footer_length = 1;
optional CompressionKind compression = 2;
optional uint64 compression_block_size = 3;
// the version of the file format
// [0, 11] = Hive 0.11
// [0, 12] = Hive 0.12
repeated uint32 version = 4 [packed = true];
optional uint64 metadata_length = 5;
// The version of the writer that wrote the file. This number is
// updated when we make fixes or large changes to the writer so that
// readers can detect whether a given bug is present in the data.
//
// Only the Java ORC writer may use values under 6 (or missing) so that
// readers that predate ORC-202 treat the new writers correctly. Each
// writer should assign their own sequence of versions starting from 6.
//
// Version of the ORC Java writer:
// 0 = original
// 1 = HIVE-8732 fixed (fixed stripe/file maximum statistics &
// string statistics use utf8 for min/max)
// 2 = HIVE-4243 fixed (use real column names from Hive tables)
// 3 = HIVE-12055 added (vectorized writer implementation)
// 4 = HIVE-13083 fixed (decimals write present stream correctly)
// 5 = ORC-101 fixed (bloom filters use utf8 consistently)
// 6 = ORC-135 fixed (timestamp statistics use utc)
// 7 = ORC-517 fixed (decimal64 min/max incorrect)
// 8 = ORC-203 added (trim very long string statistics)
// 9 = ORC-14 added (column encryption)
//
// Version of the ORC C++ writer:
// 6 = original
//
// Version of the Presto writer:
// 6 = original
//
// Version of the Scritchley Go writer:
// 6 = original
//
// Version of the Trino writer:
// 6 = original
//
// Version of the CUDF writer:
// 6 = original
//
optional uint32 writer_version = 6;
// the number of bytes in the encrypted stripe statistics
optional uint64 stripe_statistics_length = 7;
// Leave this last in the record
optional string magic = 8000;
}
// The contents of the file tail that must be serialized.
// This gets serialized as part of OrcSplit, also used by footer cache.
message FileTail {
optional PostScript postscript = 1;
optional Footer footer = 2;
optional uint64 file_length = 3;
optional uint64 postscript_length = 4;
}
==> orc-format-1.1.0/.github/labeler.yml <==
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
INFRA:
- ".github/**/*"
- ".asf.yaml"
- ".gitignore"
BUILD:
- "**/*pom.xml"
DOCS:
- "**/*.md"
PROTO:
- "src/**/*"
==> orc-format-1.1.0/.github/PULL_REQUEST_TEMPLATE <==
### What changes were proposed in this pull request?
### Why are the changes needed?
### How was this patch tested?
==> orc-format-1.1.0/.github/workflows/labeler.yml <==
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
name: "On pull requests"
on: pull_request_target
jobs:
label:
name: Label pull requests
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
steps:
- uses: actions/labeler@2.2.0
with:
repo-token: "${{ secrets.GITHUB_TOKEN }}"
sync-labels: true
==> orc-format-1.1.0/.github/workflows/stale.yml <==
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
name: Close stale PRs
on:
  schedule:
    - cron: "0 0 * * *"

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@c201d45ef4b0ccbd3bb0616f93bae13e73d0a080 # pin@v1.1.0
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          stale-pr-message: >
            We're closing this PR because it hasn't been updated in a while.
            This isn't a judgement on the merit of the PR in any way. It's just
            a way of keeping the PR queue manageable.

            If you'd like to revive this PR, please reopen it and ask a
            committer to remove the Stale tag!
          days-before-stale: 365
          days-before-close: 0
==> orc-format-1.1.0/.github/workflows/build_and_test.yml <==
name: Build and test

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  build:
    if: github.repository == 'apache/orc-format'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - uses: actions/setup-java@v4
        with:
          distribution: zulu
          java-version: 17
      - uses: bufbuild/buf-setup-action@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
      - uses: bufbuild/buf-lint-action@v1
        with:
          input: "src/main/proto"
      - name: Install
        run: |
          ./mvnw install
==> orc-format-1.1.0/.github/workflows/publish_snapshot.yml <==
name: Publish Snapshot

on:
  push:
    branches:
      - main

jobs:
  publish-snapshot:
    if: github.repository == 'apache/orc-format'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - uses: actions/setup-java@v4
        with:
          distribution: zulu
          java-version: 17
      - name: Publish snapshot
        env:
          ASF_USERNAME: ${{ secrets.NEXUS_USER }}
          ASF_PASSWORD: ${{ secrets.NEXUS_PW }}
        run: |
          echo "<settings><servers><server><id>apache.snapshots.https</id><username>$ASF_USERNAME</username><password>$ASF_PASSWORD</password></server></servers></settings>" > settings.xml
          ./mvnw --settings settings.xml -nsu -ntp -DskipTests deploy
==> orc-format-1.1.0/specification/ORCv0.md <==
---
layout: page
title: ORC Specification v0
---
This version of the file format was originally released as part of
Hive 0.11.
# Motivation
Hive's RCFile was the standard format for storing tabular data in
Hadoop for several years. However, RCFile has limitations because it
treats each column as a binary blob without semantics. In Hive 0.11 we
added a new file format named Optimized Row Columnar (ORC) file that
uses and retains the type information from the table definition. ORC
uses type specific readers and writers that provide light weight
compression techniques such as dictionary encoding, bit packing, delta
encoding, and run length encoding -- resulting in dramatically smaller
files. Additionally, ORC can apply generic compression using zlib, or
Snappy on top of the lightweight compression for even smaller
files. However, storage savings are only part of the gain. ORC
supports projection, which selects subsets of the columns for reading,
so that queries reading only one column read only the required
bytes. Furthermore, ORC files include light weight indexes that
include the minimum and maximum values for each column in each set of
10,000 rows and the entire file. Using pushdown filters from Hive, the
file reader can skip entire sets of rows that aren't important for
this query.
*(Figure: ORC file layout)*
# File Tail
Since HDFS does not support changing the data in a file after it is
written, ORC stores the top level index at the end of the file. The
overall structure of the file is given in the figure above. The
file's tail consists of three parts: the file metadata, file footer, and
postscript.
The metadata for ORC is stored using
[Protocol Buffers](https://s.apache.org/protobuf_encoding), which provides
the ability to add new fields without breaking readers. This document
incorporates the Protobuf definition from the
[ORC source code](../src/main/proto/orc/proto/orc_proto.proto) and the
reader is encouraged to review the Protobuf encoding if they need to
understand the byte-level encoding.
## Postscript
The Postscript section provides the necessary information to interpret
the rest of the file including the length of the file's Footer and
Metadata sections, the version of the file, and the kind of general
compression used (e.g. none, zlib, or snappy). The Postscript is never
compressed and ends one byte before the end of the file. The version
stored in the Postscript is the lowest version of Hive that is
guaranteed to be able to read the file, and it is stored as a sequence of
the major and minor version. This version is stored as [0, 11].
The process of reading an ORC file works backwards through the
file. Rather than making multiple short reads, the ORC reader reads
the last 16k bytes of the file with the hope that it will contain both
the Footer and Postscript sections. The final byte of the file
contains the serialized length of the Postscript, which must be less
than 256 bytes. Once the Postscript is parsed, the compressed
serialized length of the Footer is known and it can be decompressed
and parsed.
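As a sketch of that process in Java (names here are illustrative; a
real reader would hand the bytes to the protobuf-generated classes),
the tail can be located as follows:

```
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch: locate the Postscript at the end of an ORC file.
public class PostscriptLocator {
  public static void main(String[] args) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
      long fileLength = file.length();
      // The final byte of the file is the serialized length of the
      // Postscript, which must be less than 256.
      file.seek(fileLength - 1);
      int psLen = file.readUnsignedByte();
      // The Postscript itself ends one byte before the end of the file.
      byte[] postscript = new byte[psLen];
      file.seek(fileLength - 1 - psLen);
      file.readFully(postscript);
      // These bytes can now be parsed with the protobuf-generated class,
      // e.g. OrcProto.PostScript.parseFrom(postscript), to learn the
      // footer length and the compression kind.
      System.out.println("Postscript length: " + psLen);
    }
  }
}
```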
```
message PostScript {
// the length of the footer section in bytes
optional uint64 footerLength = 1;
// the kind of generic compression used
optional CompressionKind compression = 2;
// the maximum size of each compression chunk
optional uint64 compressionBlockSize = 3;
// the version of the writer
repeated uint32 version = 4 [packed = true];
// the length of the metadata section in bytes
optional uint64 metadataLength = 5;
// the fixed string "ORC"
optional string magic = 8000;
}
```
```
enum CompressionKind {
NONE = 0;
ZLIB = 1;
SNAPPY = 2;
LZO = 3;
LZ4 = 4;
ZSTD = 5;
}
```
## Footer
The Footer section contains the layout of the body of the file, the
type schema information, the number of rows, and the statistics about
each of the columns.
The file is broken into three parts: Header, Body, and Tail. The
Header consists of the bytes "ORC" to support tools that want to
scan the front of the file to determine the type of the file. The Body
contains the rows and indexes, and the Tail gives the file level
information as described in this section.
```
message Footer {
// the length of the file header in bytes (always 3)
optional uint64 headerLength = 1;
// the length of the file header and body in bytes
optional uint64 contentLength = 2;
// the information about the stripes
repeated StripeInformation stripes = 3;
// the schema information
repeated Type types = 4;
// the user metadata that was added
repeated UserMetadataItem metadata = 5;
// the total number of rows in the file
optional uint64 numberOfRows = 6;
// the statistics of each column across the file
repeated ColumnStatistics statistics = 7;
// the maximum number of rows in each index entry
optional uint32 rowIndexStride = 8;
}
```
### Stripe Information
The body of the file is divided into stripes. Each stripe is self
contained and may be read using only its own bytes combined with the
file's Footer and Postscript. Each stripe contains only entire rows so
that rows never straddle stripe boundaries. Stripes have three
sections: a set of indexes for the rows within the stripe, the data
itself, and a stripe footer. Both the indexes and the data sections
are divided by columns so that only the data for the required columns
needs to be read.
```
message StripeInformation {
// the start of the stripe within the file
optional uint64 offset = 1;
// the length of the indexes in bytes
optional uint64 indexLength = 2;
// the length of the data in bytes
optional uint64 dataLength = 3;
// the length of the footer in bytes
optional uint64 footerLength = 4;
// the number of rows in the stripe
optional uint64 numberOfRows = 5;
}
```
### Type Information
All of the rows in an ORC file must have the same schema. Logically
the schema is expressed as a tree as in the figure below, where
the compound types have subcolumns under them.
*(Figure: the type tree for the Foobar table)*
The equivalent Hive DDL would be:
```
create table Foobar (
  myInt int,
  myMap map<string,
            struct<myString : string,
                   myDouble : double>>,
  myTime timestamp
);
```
The type tree is flattened into a list via a pre-order traversal
where each type is assigned the next id. Clearly the root of the type
tree is always type id 0. Compound types have a field named subtypes
that contains the list of their children's type ids.
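For example, applying that traversal to the `Foobar` schema above
assigns the following column ids:

```
0: struct    (the root, subtypes [1, 2, 7])
1: int       (myInt)
2: map       (myMap, subtypes [3, 4])
3: string    (the map's key)
4: struct    (the map's value, subtypes [5, 6])
5: string    (myString)
6: double    (myDouble)
7: timestamp (myTime)
```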
```
message Type {
enum Kind {
BOOLEAN = 0;
BYTE = 1;
SHORT = 2;
INT = 3;
LONG = 4;
FLOAT = 5;
DOUBLE = 6;
STRING = 7;
BINARY = 8;
TIMESTAMP = 9;
LIST = 10;
MAP = 11;
STRUCT = 12;
UNION = 13;
DECIMAL = 14;
DATE = 15;
VARCHAR = 16;
CHAR = 17;
}
// the kind of this type
required Kind kind = 1;
// the type ids of any subcolumns for list, map, struct, or union
repeated uint32 subtypes = 2 [packed=true];
// the list of field names for struct
repeated string fieldNames = 3;
// the maximum length of the type for varchar or char in UTF-8 characters
optional uint32 maximumLength = 4;
// the precision and scale for decimal
optional uint32 precision = 5;
optional uint32 scale = 6;
}
```
### Column Statistics
The goal of the column statistics is that for each column, the writer
records the count and depending on the type other useful fields. For
most of the primitive types, it records the minimum and maximum
values; and for numeric types it additionally stores the sum.
From Hive 1.1.0 onwards, the column statistics will also record if
there are any null values within the row group by setting the hasNull flag.
The hasNull flag is used by ORC's predicate pushdown to better answer
'IS NULL' queries.
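As a minimal sketch of how a reader can act on these statistics (the
helper names below are illustrative, not part of any ORC library): an
equality predicate can only match a row group whose minimum/maximum
range contains the literal, and an 'IS NULL' predicate can only match
a row group whose hasNull flag is set.

```
// Hypothetical illustration of statistics-based filtering; real readers
// evaluate general search arguments against each row group's statistics.
public class StatsFilter {
  // Can a row group possibly contain a row where col == literal?
  static boolean mightContainEquals(long literal, Long minimum, Long maximum) {
    if (minimum == null || maximum == null) {
      return true; // no statistics recorded, so the group cannot be skipped
    }
    return literal >= minimum && literal <= maximum;
  }

  // Can a row group possibly contain a row where col IS NULL?
  static boolean mightContainNull(boolean hasNull) {
    return hasNull;
  }
}
```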
```
message ColumnStatistics {
// the number of values
optional uint64 numberOfValues = 1;
// At most one of these has a value for any column
optional IntegerStatistics intStatistics = 2;
optional DoubleStatistics doubleStatistics = 3;
optional StringStatistics stringStatistics = 4;
optional BucketStatistics bucketStatistics = 5;
optional DecimalStatistics decimalStatistics = 6;
optional DateStatistics dateStatistics = 7;
optional BinaryStatistics binaryStatistics = 8;
optional TimestampStatistics timestampStatistics = 9;
optional bool hasNull = 10;
}
```
For integer types (tinyint, smallint, int, bigint), the column
statistics includes the minimum, maximum, and sum. If the sum
overflows long at any point during the calculation, no sum is
recorded.
```
message IntegerStatistics {
optional sint64 minimum = 1;
optional sint64 maximum = 2;
optional sint64 sum = 3;
}
```
For floating point types (float, double), the column statistics
include the minimum, maximum, and sum. If the sum overflows a double,
no sum is recorded.
```
message DoubleStatistics {
optional double minimum = 1;
optional double maximum = 2;
optional double sum = 3;
}
```
For strings, the minimum value, maximum value, and the sum of the
lengths of the values are recorded.
```
message StringStatistics {
optional string minimum = 1;
optional string maximum = 2;
// sum will store the total length of all strings
optional sint64 sum = 3;
}
```
For booleans, the statistics include the count of false and true values.
```
message BucketStatistics {
repeated uint64 count = 1 [packed=true];
}
```
For decimals, the minimum, maximum, and sum are stored.
```
message DecimalStatistics {
optional string minimum = 1;
optional string maximum = 2;
optional string sum = 3;
}
```
Date columns record the minimum and maximum values as the number of
days since the UNIX epoch (1/1/1970 in UTC).
```
message DateStatistics {
// min,max values saved as days since epoch
optional sint32 minimum = 1;
optional sint32 maximum = 2;
}
```
Timestamp columns record the minimum and maximum values as the number of
milliseconds since the epoch (1/1/2015).
```
message TimestampStatistics {
// min,max values saved as milliseconds since epoch
optional sint64 minimum = 1;
optional sint64 maximum = 2;
}
```
Binary columns store the aggregate number of bytes across all of the values.
```
message BinaryStatistics {
// sum will store the total binary blob length
optional sint64 sum = 1;
}
```
### User Metadata
The user can add arbitrary key/value pairs to an ORC file as it is
written. The contents of the keys and values are completely
application defined, but the key is a string and the value is
binary. Care should be taken by applications to make sure that their
keys are unique and in general should be prefixed with an organization
code.
```
message UserMetadataItem {
// the user defined key
required string name = 1;
// the user defined binary value
required bytes value = 2;
}
```
### File Metadata
The file Metadata section contains column statistics at the stripe
level granularity. These statistics enable input split elimination
based on the predicate push-down evaluated per stripe.
```
message StripeStatistics {
repeated ColumnStatistics colStats = 1;
}
```
```
message Metadata {
repeated StripeStatistics stripeStats = 1;
}
```
# Compression
If the ORC file writer selects a generic compression codec (zlib or
snappy), every part of the ORC file except for the Postscript is
compressed with that codec. However, one of the requirements for ORC
is that the reader be able to skip over compressed bytes without
decompressing the entire stream. To manage this, ORC writes compressed
streams in chunks with headers as in the figure below.
To handle incompressible data, if the compressed data is larger than
the original, the original is stored and the isOriginal flag is
set. Each header is 3 bytes long with (compressedLength * 2 +
isOriginal) stored as a little endian value. For example, the header
for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
0x03]. The header for 5 bytes that did not compress would be [0x0b,
0x00, 0x00]. Each compression chunk is compressed independently so
that as long as a decompressor starts at the top of a header, it can
start decompressing without the previous bytes.

The default compression chunk size is 256K, but writers can choose
their own value. Larger chunks lead to better compression, but require
more memory. The chunk size is recorded in the Postscript so that
readers can allocate appropriately sized buffers. Readers are
guaranteed that no chunk will expand to more than the compression chunk
size.
ORC files without generic compression write each stream directly
with no headers.
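To make the header arithmetic concrete, here is a minimal Java sketch (an
illustration, not the ORC implementation) that encodes and decodes the
3-byte chunk header described above:
```
// Encode a chunk header: (compressedLength * 2 + isOriginal), little endian.
static byte[] encodeChunkHeader(int compressedLength, boolean isOriginal) {
  int header = compressedLength * 2 + (isOriginal ? 1 : 0);
  return new byte[] {
      (byte) header,                // low byte first
      (byte) (header >> 8),
      (byte) (header >> 16)
  };
}

// Recover the chunk length by stripping the isOriginal bit.
static int decodeChunkLength(byte[] header) {
  int value = (header[0] & 0xff)
      | (header[1] & 0xff) << 8
      | (header[2] & 0xff) << 16;
  return value >> 1;
}
```
For example, encodeChunkHeader(100000, false) produces [0x40, 0x0d, 0x03] and
encodeChunkHeader(5, true) produces [0x0b, 0x00, 0x00], matching the examples
above.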
# Run Length Encoding
## Base 128 Varint
Variable width integer encodings take advantage of the fact that most
numbers are small and that having smaller encodings for small numbers
shrinks the overall size of the data. ORC uses the varint format from
Protocol Buffers, which writes data in little endian format using the
low 7 bits of each byte. The high bit in each byte is set if the
number continues into the next byte.
Unsigned Original | Serialized
:---------------- | :---------
0 | 0x00
1 | 0x01
127 | 0x7f
128 | 0x80, 0x01
129 | 0x81, 0x01
16,383 | 0xff, 0x7f
16,384 | 0x80, 0x80, 0x01
16,385 | 0x81, 0x80, 0x01
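As an illustration of the format (a sketch, not ORC's writer class), a varint
can be serialized as follows:
```
// Write an unsigned value as a base 128 varint: low 7 bits per byte,
// little endian, with the high bit marking continuation.
static byte[] writeVarint(long value) {
  byte[] buf = new byte[10];        // a 64-bit varint needs at most 10 bytes
  int n = 0;
  while ((value & ~0x7fL) != 0) {
    buf[n++] = (byte) ((value & 0x7f) | 0x80);
    value >>>= 7;
  }
  buf[n++] = (byte) value;          // final byte, high bit clear
  return java.util.Arrays.copyOf(buf, n);
}
```
For example, writeVarint(16383) yields [0xff, 0x7f] and writeVarint(16384)
yields [0x80, 0x80, 0x01], matching the table above.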
For signed integer types, the number is converted into an unsigned
number using a zigzag encoding. Zigzag encoding moves the sign bit to
the least significant bit using the expression (val << 1) ^ (val >>
63) and derives its name from the fact that positive and negative
numbers alternate once encoded. The unsigned number is then serialized
as above.
Signed Original | Unsigned
:-------------- | :-------
0 | 0
-1 | 1
1 | 2
-2 | 3
2 | 4
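In Java, where >> is an arithmetic shift, the conversion can be sketched as:
```
// Move the sign bit to the least significant bit.
static long zigzagEncode(long val) {
  return (val << 1) ^ (val >> 63);
}

// Undo the transformation: xor with -1 when the low bit is set, 0 otherwise.
static long zigzagDecode(long encoded) {
  return (encoded >>> 1) ^ -(encoded & 1);
}
```
zigzagEncode(-2) returns 3 and zigzagDecode(4) returns 2, matching the table.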
## Byte Run Length Encoding
For byte streams, ORC uses a very light weight encoding of identical
values.
* Run - a sequence of at least 3 identical values
* Literals - a sequence of non-identical values
The first byte of each group of values is a header that determines
whether it is a run (value between 0 to 127) or literal list (value
between -128 to -1). For runs, the control byte is the length of the
run minus the length of the minimal run (3) and the control byte for
literal lists is the negative length of the list. For example, a
hundred 0's is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
either of the encodings.
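A minimal decoder sketch for this scheme (the real readers are streaming and
incremental, but the control byte logic is the same):
```
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Expand byte RLE: control bytes 0..127 are runs, 128..255 are literal lists.
static byte[] decodeByteRle(InputStream in) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  int control;
  while ((control = in.read()) != -1) {
    if (control < 0x80) {              // run: control = run length - 3
      int value = in.read();
      for (int i = 0; i < control + 3; ++i) {
        out.write(value);
      }
    } else {                           // literals: control = 256 - count
      for (int i = 0; i < 256 - control; ++i) {
        out.write(in.read());
      }
    }
  }
  return out.toByteArray();
}
```
Fed [0x61, 0x00] this produces a hundred zero bytes, and fed
[0xfe, 0x44, 0x45] it produces the two literal bytes, as described above.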
## Boolean Run Length Encoding
For encoding boolean types, the bits are put in the bytes from most
significant to least significant. The bytes are encoded using byte run
length encoding as described in the previous section. For example,
the byte sequence [0xff, 0x80] would be one true followed by
seven false values.
## Integer Run Length Encoding, version 1
ORC v0 files use Run Length Encoding version 1 (RLEv1),
which provides a lightweight compression of signed or unsigned integer
sequences. RLEv1 has two sub-encodings:
* Run - a sequence of values that differ by a small fixed delta
* Literals - a sequence of varint encoded values
Runs start with an initial byte of 0x00 to 0x7f, which encodes the
length of the run - 3. A second byte provides the fixed delta in the
range of -128 to 127. Finally, the first value of the run is encoded
as a base 128 varint.
For example, if the sequence is 100 instances of 7 the encoding would
start with 100 - 3, followed by a delta of 0, and a varint of 7 for
an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
running from 100 to 1, the first byte is 100 - 3, the delta is -1,
and the varint is 100 for an encoding of [0x61, 0xff, 0x64].
Literals start with an initial byte of 0x80 to 0xff, which corresponds
to the negative of the number of literals in the sequence. Following the
header byte, the list of N varints is encoded. Thus, if there are
no runs, the overhead is 1 byte for each 128 integers. Numbers
[2, 3, 6, 7, 11] would be encoded as [0xfb, 0x02, 0x03, 0x06, 0x07, 0x0b].
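The rules above can be sketched as a decoder for unsigned values (signed
columns would additionally zigzag-decode each value; this is an illustration,
not ORC's reader):
```
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

static List<Long> decodeRleV1(InputStream in) throws IOException {
  List<Long> result = new ArrayList<>();
  int control;
  while ((control = in.read()) != -1) {
    if (control < 0x80) {                 // run: control = length - 3
      int delta = (byte) in.read();       // fixed delta, -128 to 127
      long value = readVarint(in);        // first value of the run
      for (int i = 0; i < control + 3; ++i) {
        result.add(value + (long) i * delta);
      }
    } else {                              // literals: 256 - control varints
      for (int i = 0; i < 256 - control; ++i) {
        result.add(readVarint(in));
      }
    }
  }
  return result;
}

static long readVarint(InputStream in) throws IOException {
  long result = 0;
  int shift = 0;
  int b;
  do {
    b = in.read();
    result |= (long) (b & 0x7f) << shift; // low 7 bits per byte, little endian
    shift += 7;
  } while ((b & 0x80) != 0);
  return result;
}
```
Fed [0x61, 0xff, 0x64] this yields the run from 100 down to 1 described above.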
# Stripes
The body of ORC files consists of a series of stripes. Stripes are
large (typically ~200MB) and independent of each other and are often
processed by different tasks. The defining characteristic for columnar
storage formats is that the data for each column is stored separately
and that reading data out of the file should be proportional to the
number of columns read.
In ORC files, each column is stored in several streams that are stored
next to each other in the file. For example, an integer column is
represented as two streams: PRESENT, which uses one bit per value
to record whether the value is non-null, and DATA, which records the
non-null values. If all of a column's values in a stripe are non-null,
the PRESENT stream is omitted from the stripe. For binary data, ORC
uses three streams: PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
There is a general order for index and data streams:
* Index streams are always placed together in the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be
placed first and then followed by streams grouped by each encryption variant.
There is no fixed order within each unencrypted or encryption variant in the
index and data streams:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can even be placed in any order.
To get the precise information (i.e. stream kind, column id, and location) of
a stream within a stripe, the streams field in the StripeFooter described below
is the single source of truth.
In the example of the integer column mentioned above, the order of the
PRESENT stream and the DATA stream cannot be determined in advance.
We need to get the precise information from the **StripeFooter**.
## Stripe Footer
The stripe footer contains the encoding of each column and the
directory of the streams including their location.
```
message StripeFooter {
// the location of each stream
repeated Stream streams = 1;
// the encoding of each column
repeated ColumnEncoding columns = 2;
}
```
To describe each stream, ORC stores the kind of stream, the column id,
and the stream's size in bytes. The details of what is stored in each stream
depends on the type and encoding of the column.
```
message Stream {
enum Kind {
// boolean stream of whether the next value is non-null
PRESENT = 0;
// the primary data stream
DATA = 1;
// the length of each value for variable length data
LENGTH = 2;
// the dictionary blob
DICTIONARY_DATA = 3;
// deprecated prior to Hive 0.11
// It was used to store the number of instances of each value in the
// dictionary
DICTIONARY_COUNT = 4;
// a secondary data stream
SECONDARY = 5;
// the index for seeking to particular row groups
ROW_INDEX = 6;
}
required Kind kind = 1;
// the column id
optional uint32 column = 2;
// the number of bytes in the file
optional uint64 length = 3;
}
```
Depending on their type, several options for encoding are possible. The
encodings are divided into direct or dictionary-based categories and
further refined as to whether they use RLE v1 or v2.
```
message ColumnEncoding {
enum Kind {
// the encoding is mapped directly to the stream using RLE v1
DIRECT = 0;
// the encoding uses a dictionary of unique values using RLE v1
DICTIONARY = 1;
}
required Kind kind = 1;
// for dictionary encodings, record the size of the dictionary
optional uint32 dictionarySize = 2;
}
```
# Column Encodings
## SmallInt, Int, and BigInt Columns
All of the 16, 32, and 64 bit integer column types use the same set of
potential encodings, which is basically whether they use RLE v1 or
v2. If the PRESENT stream is not included, all of the values are
present. For values that have false bits in the present stream, no
values are included in the data stream.
Encoding | Stream Kind | Optional | Contents
:-------- | :---------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
> Note that the order of the streams is not fixed; this also applies to other column types.
## Float and Double Columns
Floating point types are stored using IEEE 754 floating point bit
layout. Float columns use 4 bytes per value and double columns use 8
bytes.
Encoding | Stream Kind | Optional | Contents
:-------- | :---------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | IEEE 754 floating point representation
## String, Char, and VarChar Columns
String, char, and varchar columns may be encoded either using a
dictionary encoding or a direct encoding. A direct encoding should be
preferred when there are many distinct values. In all of the
encodings, the PRESENT stream encodes whether the value is null. The
Java ORC writer automatically picks the encoding after the first row
group (10,000 rows).
For direct encoding the UTF-8 bytes are saved in the DATA stream and
the length of each value is written into the LENGTH stream. In direct
encoding, if the values were ["Nevada", "California"]; the DATA
would be "NevadaCalifornia" and the LENGTH would be [6, 10].
For dictionary encodings the dictionary is sorted (in lexicographical
order of bytes in the UTF-8 encodings) and UTF-8 bytes of
each unique value are placed into DICTIONARY_DATA. The length of each
item in the dictionary is put into the LENGTH stream. The DATA stream
consists of the sequence of references to the dictionary elements.
In dictionary encoding, if the values were ["Nevada",
"California", "Nevada", "California", and "Florida"]; the
DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would
be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DICTIONARY | PRESENT | Yes | Boolean RLE
| DATA | No | Unsigned Integer RLE v1
| DICTIONARY_DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
## Boolean Columns
Boolean columns are rare, but have a simple encoding.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Boolean RLE
## TinyInt Columns
TinyInt (byte) columns use byte run length encoding.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Byte RLE
## Binary Columns
Binary data is encoded with a PRESENT stream, a DATA stream that records
the contents, and a LENGTH stream that records the number of bytes per
value.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
## Decimal Columns
Decimal was introduced in Hive 0.11 with infinite precision (the total
number of digits). In Hive 0.13, the definition was changed to limit
the precision to a maximum of 38 digits, which conveniently uses 127
bits plus a sign bit. The current encoding of decimal columns stores
the integer representation of the value as an unbounded length zigzag
encoded base 128 varint. The scale is stored in the SECONDARY stream
as a signed integer.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Unbounded base 128 varints
| SECONDARY | No | Signed Integer RLE v1
## Date Columns
Date data is encoded with a PRESENT stream and a DATA stream that records
the number of days after January 1, 1970 in UTC.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
## Timestamp Columns
Timestamp records times down to nanoseconds as a PRESENT stream that
records non-null values, a DATA stream that records the number of
seconds after 1 January 2015, and a SECONDARY stream that records the
number of nanoseconds.
Because the number of nanoseconds often has a large number of trailing
zeros, the number has trailing decimal zero digits removed and the
last three bits are used to record how many zeros were removed, if there
are more than two trailing zeros. Thus 1000 nanoseconds would be
serialized as 0x0a and 100000 would be serialized as 0x0c.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
| SECONDARY | No | Unsigned Integer RLE v1
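The examples imply that the low 3 bits store one less than the number of
removed zeros; a decoding sketch consistent with them is:
```
// Restore a nanosecond value from its SECONDARY stream representation.
static long parseNanos(long serialized) {
  int zeros = (int) (serialized & 7);
  long result = serialized >>> 3;
  if (zeros != 0) {
    for (int i = 0; i <= zeros; ++i) {   // restore zeros + 1 trailing zeros
      result *= 10;
    }
  }
  return result;
}
```
parseNanos(0x0a) returns 1000 and parseNanos(0x0c) returns 100000.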
## Struct Columns
Structs have no data themselves and delegate everything to their child
columns except for their PRESENT stream. They have a child column
for each of the fields.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
## List Columns
Lists are encoded as the PRESENT stream and a LENGTH stream with the
number of items in each list. They have a single child column for the
element values.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v1
## Map Columns
Maps are encoded as the PRESENT stream and a LENGTH stream with the number
of items in each map. They have a child column for the key and
another child column for the value.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v1
## Union Columns
Unions are encoded as the PRESENT stream and a tag stream that controls which
potential variant is used. They have a child column for each variant of the
union. Currently ORC union types are limited to 256 variants, which matches
the Hive type model.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
 | DATA | No | Byte RLE
# Indexes
## Row Group Index
The row group indexes consist of a ROW_INDEX stream for each primitive
column that has an entry for each row group. Row groups are controlled
by the writer and default to 10,000 rows. Each RowIndexEntry gives the
position of each stream for the column and the statistics for that row
group.
The index streams are placed at the front of the stripe, because in
the default case of streaming they do not need to be read. They are
only loaded when either predicate push down is being used or the
reader seeks to a particular row.
```
message RowIndexEntry {
repeated uint64 positions = 1 [packed=true];
optional ColumnStatistics statistics = 2;
}
```
```
message RowIndex {
repeated RowIndexEntry entry = 1;
}
```
To record positions, each stream needs a sequence of numbers. For
uncompressed streams, the position is the byte offset of the RLE run's
start location followed by the number of values that need to be
consumed from the run. In compressed streams, the first number is the
start of the compression chunk in the stream, followed by the number
of decompressed bytes that need to be consumed, and finally the number
of values consumed in the RLE.
For columns with multiple streams, the sequences of positions in each
stream are concatenated. That was an unfortunate decision on my part
that we should fix at some point, because it makes code that uses the
indexes error-prone.
Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**, which may differ from the
actual placement of the data streams; it follows the order given in the
[Column Encodings](#column-encoding-section) section above.
---
layout: page
title: ORC Specification v1
---
This version of the file format was originally released as part of
Hive 0.12.
# Motivation
Hive's RCFile was the standard format for storing tabular data in
Hadoop for several years. However, RCFile has limitations because it
treats each column as a binary blob without semantics. In Hive 0.11 we
added a new file format named Optimized Row Columnar (ORC) file that
uses and retains the type information from the table definition. ORC
uses type specific readers and writers that provide light weight
compression techniques such as dictionary encoding, bit packing, delta
encoding, and run length encoding -- resulting in dramatically smaller
files. Additionally, ORC can apply generic compression using zlib, or
Snappy on top of the lightweight compression for even smaller
files. However, storage savings are only part of the gain. ORC
supports projection, which selects subsets of the columns for reading,
so that queries reading only one column read only the required
bytes. Furthermore, ORC files include light weight indexes that
include the minimum and maximum values for each column in each set of
10,000 rows and the entire file. Using pushdown filters from Hive, the
file reader can skip entire sets of rows that aren't important for
this query.

# File Tail
Since HDFS does not support changing the data in a file after it is
written, ORC stores the top level index at the end of the file. The
overall structure of the file is given in the figure above. The
file's tail consists of three parts: the file metadata, file footer, and
postscript.
The metadata for ORC is stored using
[Protocol Buffers](https://s.apache.org/protobuf_encoding), which provides
the ability to add new fields without breaking readers. This document
incorporates the Protobuf definition from the
[ORC source code](../src/main/proto/orc/proto/orc_proto.proto) and the
reader is encouraged to review the Protobuf encoding if they need to
understand the byte-level encoding.
The sections of the file tail are (and their protobuf message type):
* encrypted stripe statistics: list of ColumnarStripeStatistics
* stripe statistics: Metadata
* footer: Footer
* postscript: PostScript
* psLen: byte
## Postscript
The Postscript section provides the necessary information to interpret
the rest of the file including the length of the file's Footer and
Metadata sections, the version of the file, and the kind of general
compression used (e.g. none, zlib, or snappy). The Postscript is never
compressed and ends one byte before the end of the file. The version
stored in the Postscript is the lowest version of Hive that is
guaranteed to be able to read the file, and it is stored as a sequence of
the major and minor version. This file version is encoded as [0,12].
The process of reading an ORC file works backwards through the
file. Rather than making multiple short reads, the ORC reader reads
the last 16k bytes of the file with the hope that it will contain both
the Footer and Postscript sections. The final byte of the file
contains the serialized length of the Postscript, which must be less
than 256 bytes. Once the Postscript is parsed, the compressed
serialized length of the Footer is known and it can be decompressed
and parsed.
```
message PostScript {
// the length of the footer section in bytes
optional uint64 footerLength = 1;
// the kind of generic compression used
optional CompressionKind compression = 2;
// the maximum size of each compression chunk
optional uint64 compressionBlockSize = 3;
// the version of the writer
repeated uint32 version = 4 [packed = true];
// the length of the metadata section in bytes
optional uint64 metadataLength = 5;
// the fixed string "ORC"
optional string magic = 8000;
}
```
```
enum CompressionKind {
NONE = 0;
ZLIB = 1;
SNAPPY = 2;
LZO = 3;
LZ4 = 4;
ZSTD = 5;
}
```
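A simplified sketch of that backwards read (assuming the OrcProto classes
generated from the Protobuf definition above; a production reader fetches the
last 16K in one read and usually finds the Footer in the same buffer):
```
import java.io.IOException;
import java.io.RandomAccessFile;

static void readPostscript(String path) throws IOException {
  try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
    long size = file.length();
    file.seek(size - 1);
    int psLen = file.readUnsignedByte();   // final byte = Postscript length
    byte[] ps = new byte[psLen];
    file.seek(size - 1 - psLen);
    file.readFully(ps);                    // the Postscript is never compressed
    OrcProto.PostScript postscript = OrcProto.PostScript.parseFrom(ps);
    System.out.println("footer length: " + postscript.getFooterLength());
  }
}
```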
## Footer
The Footer section contains the layout of the body of the file, the
type schema information, the number of rows, and the statistics about
each of the columns.
The file is broken into three parts: Header, Body, and Tail. The
Header consists of the bytes "ORC" to support tools that want to
scan the front of the file to determine the type of the file. The Body
contains the rows and indexes, and the Tail gives the file level
information as described in this section.
```
message Footer {
// the length of the file header in bytes (always 3)
optional uint64 headerLength = 1;
// the length of the file header and body in bytes
optional uint64 contentLength = 2;
// the information about the stripes
repeated StripeInformation stripes = 3;
// the schema information
repeated Type types = 4;
// the user metadata that was added
repeated UserMetadataItem metadata = 5;
// the total number of rows in the file
optional uint64 numberOfRows = 6;
// the statistics of each column across the file
repeated ColumnStatistics statistics = 7;
// the maximum number of rows in each index entry
optional uint32 rowIndexStride = 8;
// Each implementation that writes ORC files should register for a code
// 0 = ORC Java
// 1 = ORC C++
// 2 = Presto
// 3 = Scritchley Go from https://github.com/scritchley/orc
// 4 = Trino
optional uint32 writer = 9;
// information about the encryption in this file
optional Encryption encryption = 10;
// the number of bytes in the encrypted stripe statistics
optional uint64 stripeStatisticsLength = 11;
}
```
### Stripe Information
The body of the file is divided into stripes. Each stripe is self
contained and may be read using only its own bytes combined with the
file's Footer and Postscript. Each stripe contains only entire rows so
that rows never straddle stripe boundaries. Stripes have three
sections: a set of indexes for the rows within the stripe, the data
itself, and a stripe footer. Both the indexes and the data sections
are divided by columns so that only the data for the required columns
needs to be read.
The encryptStripeId and encryptedLocalKeys support column
encryption. They are set on the first stripe of each ORC file with
column encryption and not set after that. For a stripe with the values
set, the reader should use those values for that stripe. Subsequent
stripes use the previous encryptStripeId + 1 and the same keys.
The current ORC merging code merges entire files, and thus the reader
will get the correct values on what was the first stripe and continue
on. If we develop a merge tool that reorders stripes or does partial
merges, these values will need to be set correctly by that tool.
```
message StripeInformation {
// the start of the stripe within the file
optional uint64 offset = 1;
// the length of the indexes in bytes
optional uint64 indexLength = 2;
// the length of the data in bytes
optional uint64 dataLength = 3;
// the length of the footer in bytes
optional uint64 footerLength = 4;
// the number of rows in the stripe
optional uint64 numberOfRows = 5;
// If this is present, the reader should use this value for the encryption
// stripe id for setting the encryption IV. Otherwise, the reader should
// use one larger than the previous stripe's encryptStripeId.
// For unmerged ORC files, the first stripe will use 1 and the rest of the
// stripes won't have it set. For merged files, the stripe information
// will be copied from their original files and thus the first stripe of
// each of the input files will reset it to 1.
// Note that 1 was chosen, because protobuf v3 doesn't serialize
// primitive types that are the default (e.g. 0).
optional uint64 encryptStripeId = 6;
// For each encryption variant, the new encrypted local key to use until we
// find a replacement.
repeated bytes encryptedLocalKeys = 7;
}
```
### Type Information
All of the rows in an ORC file must have the same schema. Logically
the schema is expressed as a tree as in the figure below, where
the compound types have subcolumns under them.

The equivalent Hive DDL would be:
```
create table Foobar (
myInt int,
 myMap map<string, struct<myString : string, myDouble : double>>,
myTime timestamp
);
```
The type tree is flattened into a list via a pre-order traversal
where each type is assigned the next id. Clearly the root of the type
tree is always type id 0. Compound types have a field named subtypes
that contains the list of their children's type ids.
```
message Type {
enum Kind {
BOOLEAN = 0;
BYTE = 1;
SHORT = 2;
INT = 3;
LONG = 4;
FLOAT = 5;
DOUBLE = 6;
STRING = 7;
BINARY = 8;
TIMESTAMP = 9;
LIST = 10;
MAP = 11;
STRUCT = 12;
UNION = 13;
DECIMAL = 14;
DATE = 15;
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
}
// the kind of this type
required Kind kind = 1;
// the type ids of any subcolumns for list, map, struct, or union
repeated uint32 subtypes = 2 [packed=true];
// the list of field names for struct
repeated string fieldNames = 3;
// the maximum length of the type for varchar or char in UTF-8 characters
optional uint32 maximumLength = 4;
// the precision and scale for decimal
optional uint32 precision = 5;
optional uint32 scale = 6;
}
```
### Column Statistics
The goal of the column statistics is that for each column, the writer
records the count and, depending on the type, other useful fields. For
most of the primitive types, it records the minimum and maximum
values; and for numeric types it additionally stores the sum.
From Hive 1.1.0 onwards, the column statistics will also record if
there are any null values within the row group by setting the hasNull flag.
The hasNull flag is used by ORC's predicate pushdown to better answer
'IS NULL' queries.
```
message ColumnStatistics {
// the number of values
optional uint64 numberOfValues = 1;
// At most one of these has a value for any column
optional IntegerStatistics intStatistics = 2;
optional DoubleStatistics doubleStatistics = 3;
optional StringStatistics stringStatistics = 4;
optional BucketStatistics bucketStatistics = 5;
optional DecimalStatistics decimalStatistics = 6;
optional DateStatistics dateStatistics = 7;
optional BinaryStatistics binaryStatistics = 8;
optional TimestampStatistics timestampStatistics = 9;
optional bool hasNull = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
}
```
For integer types (tinyint, smallint, int, bigint), the column
statistics includes the minimum, maximum, and sum. If the sum
overflows long at any point during the calculation, no sum is
recorded.
```
message IntegerStatistics {
optional sint64 minimum = 1;
optional sint64 maximum = 2;
optional sint64 sum = 3;
}
```
For floating point types (float, double), the column statistics
include the minimum, maximum, and sum. If the sum overflows a double,
no sum is recorded.
```
message DoubleStatistics {
optional double minimum = 1;
optional double maximum = 2;
optional double sum = 3;
}
```
For strings, the minimum value, maximum value, and the sum of the
lengths of the values are recorded.
```
message StringStatistics {
optional string minimum = 1;
optional string maximum = 2;
// sum will store the total length of all strings
optional sint64 sum = 3;
}
```
For booleans, the statistics include the count of false and true values.
```
message BucketStatistics {
repeated uint64 count = 1 [packed=true];
}
```
For decimals, the minimum, maximum, and sum are stored.
```
message DecimalStatistics {
optional string minimum = 1;
optional string maximum = 2;
optional string sum = 3;
}
```
Date columns record the minimum and maximum values as the number of
days since the UNIX epoch (1/1/1970 in UTC).
```
message DateStatistics {
// min,max values saved as days since epoch
optional sint32 minimum = 1;
optional sint32 maximum = 2;
}
```
Timestamp columns record the minimum and maximum values as the number of
milliseconds since the UNIX epoch (1/1/1970 00:00:00). Before ORC-135, the
local timezone offset was included and they were stored as `minimum` and
`maximum`. After ORC-135, the timestamp is adjusted to UTC before being
converted to milliseconds and stored in `minimumUtc` and `maximumUtc`.
```
message TimestampStatistics {
// min,max values saved as milliseconds since epoch
optional sint64 minimum = 1;
optional sint64 maximum = 2;
// min,max values saved as milliseconds since UNIX epoch
optional sint64 minimumUtc = 3;
optional sint64 maximumUtc = 4;
}
```
Binary columns store the aggregate number of bytes across all of the values.
```
message BinaryStatistics {
// sum will store the total binary blob length
optional sint64 sum = 1;
}
```
### User Metadata
The user can add arbitrary key/value pairs to an ORC file as it is
written. The contents of the keys and values are completely
application defined, but the key is a string and the value is
binary. Care should be taken by applications to make sure that their
keys are unique and in general should be prefixed with an organization
code.
```
message UserMetadataItem {
// the user defined key
required string name = 1;
// the user defined binary value
required bytes value = 2;
}
```
### File Metadata
The file Metadata section contains column statistics at the stripe
level granularity. These statistics enable input split elimination
based on the predicate push-down evaluated per stripe.
```
message StripeStatistics {
repeated ColumnStatistics colStats = 1;
}
```
```
message Metadata {
repeated StripeStatistics stripeStats = 1;
}
```
# Column Encryption
As of Apache ORC 1.6, ORC supports column encryption, where the data and
statistics of specific columns are encrypted on disk. Column
encryption provides fine-grained column-level security even when many
users have access to the file itself.
the user and the writer only needs to define which columns and
encryption keys to use. When reading an ORC file, if the user has
access to the keys, they will get the real data. If they do not have
the keys, they will get the masked data.
```
message Encryption {
// all of the masks used in this file
repeated DataMask mask = 1;
// all of the keys used in this file
repeated EncryptionKey key = 2;
// The encrypted variants.
// Readers should prefer the first variant for which the user has access
// to the corresponding key. If they don't have access to any of the keys,
// they should get the unencrypted masked data.
repeated EncryptionVariant variants = 3;
// How are the local keys encrypted?
optional KeyProviderKind keyProvider = 4;
}
```
Each encrypted column in each file will have a random local key
generated for it. Thus, even though all of the decryption happens
locally in the reader, a malicious user that stores the key only
enables access to that column in that file. The local keys are encrypted
by the Hadoop or Ranger Key Management Server (KMS). The encrypted
local keys are stored in the file footer's StripeInformation.
```
enum KeyProviderKind {
UNKNOWN = 0;
HADOOP = 1;
AWS = 2;
GCP = 3;
AZURE = 4;
}
```
When ORC is using the Hadoop or Ranger KMS, it generates a random encrypted
local key (16 or 32 bytes for 128 or 256 bit AES respectively). Using the
first 16 bytes as the IV, it uses AES/CTR to decrypt the local key.
With the AWS KMS, the GenerateDataKey method is used to create a new local
key and the Decrypt method is used to decrypt it.
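One plausible reading of that scheme, sketched with the JDK's cipher API
(illustrative only: kmsKeyMaterial is a stand-in for the master key material
held by the KMS, and the exact framing may differ from ORC's key provider
code):
```
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;

// Decrypt an encrypted local key using its first 16 bytes as the AES/CTR IV.
static byte[] decryptLocalKey(byte[] encryptedLocalKey, byte[] kmsKeyMaterial)
    throws GeneralSecurityException {
  IvParameterSpec iv = new IvParameterSpec(encryptedLocalKey, 0, 16);
  SecretKeySpec key = new SecretKeySpec(kmsKeyMaterial, "AES");
  Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
  cipher.init(Cipher.DECRYPT_MODE, key, iv);
  return cipher.doFinal(encryptedLocalKey);
}
```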
## Data Masks
The user's data is statically masked before writing the unencrypted
variant. Because the masking was done statically when the file was
written, the mask information in the file is purely informational.
The three standard masks are:
* nullify - all values become null
* redact - replace characters with constants such as X or 9
* sha256 - replace string with the SHA 256 of the value
The default is nullify, but masks may be defined by the user. Masks
are not allowed to change the type of the column, just the values.
```
message DataMask {
// the kind of masking, which may include third party masks
optional string name = 1;
// parameters for the mask
repeated string maskParameters = 2;
// the unencrypted column roots this mask was applied to
repeated uint32 columns = 3 [packed = true];
}
```
## Encryption Keys
In addition to the encrypted local keys, which are stored in the
footer's StripeInformation, the file also needs to describe the master
key that was used to encrypt the local keys. The master keys are
described by name, their version, and the encryption algorithm.
```
message EncryptionKey {
optional string keyName = 1;
optional uint32 keyVersion = 2;
optional EncryptionAlgorithm algorithm = 3;
}
```
The encryption algorithm is stored using an enumeration; since
ProtoBuf uses the 0 value as a default, we added an unused value. That
ensures that if we add a new algorithm, old readers will get
UNKNOWN_ENCRYPTION instead of a real value.
```
enum EncryptionAlgorithm {
// used for detecting future algorithms
UNKNOWN_ENCRYPTION = 0;
// 128 bit AES/CTR
AES_CTR_128 = 1;
// 256 bit AES/CTR
AES_CTR_256 = 2;
}
```
## Encryption Variants
Each encrypted column is written as two variants:
* encrypted unmasked - for users with access to the key
* unencrypted masked - for all other users
The changes to the format were done so that old ORC readers will read
the masked unencrypted data. Encryption variants encrypt a subtree of
columns and use a single local key. The initial version of encryption
support only allows the two variants, but this may be extended later
and thus readers should use the first variant of a column that the
reader has access to.
```
message EncryptionVariant {
// the column id of the root column that is encrypted in this variant
optional uint32 root = 1;
// the master key that was used to encrypt the local key, referenced as
// an index into the Encryption.key list
optional uint32 key = 2;
// the local key for this variant, encrypted with the master key
optional bytes encryptedKey = 3;
// the stripe statistics for this variant
repeated Stream stripeStatistics = 4;
// encrypted file statistics as a FileStatistics
optional bytes fileStatistics = 5;
}
```
Each variant stores stripe and file statistics separately. The file
statistics are serialized as a FileStatistics, compressed, encrypted
and stored in the EncryptionVariant.fileStatistics.
```
message FileStatistics {
repeated ColumnStatistics column = 1;
}
```
The stripe statistics for each column are serialized as
ColumnarStripeStatistics, compressed, encrypted and stored in a stream
of kind STRIPE_STATISTICS. By making the column stripe statistics
independent of each other, the reader only reads and parses the
columns contained in the SARG.
```
message ColumnarStripeStatistics {
// one value for each stripe in the file
repeated ColumnStatistics colStats = 1;
}
```
## Stream Encryption
Our encryption is done using AES/CTR. CTR is a mode that has some very
nice properties for us:
* It is seeded so that identical data is encrypted differently.
* It does not require padding the stream to the cipher length.
* It allows readers to seek in to a stream.
* The IV does not need to be randomly generated.
To ensure that we don't reuse IVs, we set the IV as:
* bytes 0 to 2 - column id
* bytes 3 to 4 - stream kind
* bytes 5 to 7 - stripe id
* bytes 8 to 15 - cipher block counter
However, it is critical for CTR that we never reuse an initialization
vector (IV) with the same local key.
For data in the footer, use the number of stripes in the file as the
stripe id. This guarantees that when we write an intermediate footer into
a file, we don't use the same IV.
Additionally, we never reuse a local key for new data. For example, when
merging files, we don't reuse local key from the input files for the new
file tail, but always generate a new local key.
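A sketch of composing such an IV (the byte order within each field is assumed
big endian here, which the prose above does not spell out):
```
// Build the 16-byte AES/CTR IV; bytes 8 to 15 stay zero because the cipher
// uses them as the incrementing block counter.
static byte[] makeIv(int columnId, int streamKind, int stripeId) {
  byte[] iv = new byte[16];
  iv[0] = (byte) (columnId >> 16);   // bytes 0 to 2: column id
  iv[1] = (byte) (columnId >> 8);
  iv[2] = (byte) columnId;
  iv[3] = (byte) (streamKind >> 8);  // bytes 3 to 4: stream kind
  iv[4] = (byte) streamKind;
  iv[5] = (byte) (stripeId >> 16);   // bytes 5 to 7: stripe id
  iv[6] = (byte) (stripeId >> 8);
  iv[7] = (byte) stripeId;
  return iv;
}
```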
# Compression
If the ORC file writer selects a generic compression codec (zlib or
snappy), every part of the ORC file except for the Postscript is
compressed with that codec. However, one of the requirements for ORC
is that the reader be able to skip over compressed bytes without
decompressing the entire stream. To manage this, ORC writes compressed
streams in chunks with headers as in the figure below.
To handle incompressible data, if the compressed data is larger than
the original, the original is stored and the isOriginal flag is
set. Each header is 3 bytes long with (compressedLength * 2 +
isOriginal) stored as a little endian value. For example, the header
for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
0x03]. The header for 5 bytes that did not compress would be [0x0b,
0x00, 0x00]. Each compression chunk is compressed independently so
that as long as a decompressor starts at the top of a header, it can
start decompressing without the previous bytes.

The default compression chunk size is 256K, but writers can choose
their own value. Larger chunks lead to better compression, but require
more memory. The chunk size is recorded in the Postscript so that
readers can allocate appropriately sized buffers. Readers are
guaranteed that no chunk will expand to more than the compression chunk
size.
ORC files without generic compression write each stream directly
with no headers.
# Run Length Encoding
## Base 128 Varint
Variable width integer encodings take advantage of the fact that most
numbers are small and that having smaller encodings for small numbers
shrinks the overall size of the data. ORC uses the varint format from
Protocol Buffers, which writes data in little endian format using the
low 7 bits of each byte. The high bit in each byte is set if the
number continues into the next byte.
Unsigned Original | Serialized
:---------------- | :---------
0 | 0x00
1 | 0x01
127 | 0x7f
128 | 0x80, 0x01
129 | 0x81, 0x01
16,383 | 0xff, 0x7f
16,384 | 0x80, 0x80, 0x01
16,385 | 0x81, 0x80, 0x01
For signed integer types, the number is converted into an unsigned
number using a zigzag encoding. Zigzag encoding moves the sign bit to
the least significant bit using the expression (val << 1) ^ (val >>
63) and derives its name from the fact that positive and negative
numbers alternate once encoded. The unsigned number is then serialized
as above.
Signed Original | Unsigned
:-------------- | :-------
0 | 0
-1 | 1
1 | 2
-2 | 3
2 | 4
## Byte Run Length Encoding
For byte streams, ORC uses a very light weight encoding of identical
values.
* Run - a sequence of at least 3 identical values
* Literals - a sequence of non-identical values
The first byte of each group of values is a header that determines
whether it is a run (value between 0 to 127) or literal list (value
between -128 to -1). For runs, the control byte is the length of the
run minus the length of the minimal run (3) and the control byte for
literal lists is the negative length of the list. For example, a
hundred 0's is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
either of the encodings.
## Boolean Run Length Encoding
For encoding boolean types, the bits are put in the bytes from most
significant to least significant. The bytes are encoded using byte run
length encoding as described in the previous section. For example,
the byte sequence [0xff, 0x80] would be one true followed by
seven false values.
## Integer Run Length Encoding, version 1
In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
which provides a lightweight compression of signed or unsigned integer
sequences. RLEv1 has two sub-encodings:
* Run - a sequence of values that differ by a small fixed delta
* Literals - a sequence of varint encoded values
Runs start with an initial byte of 0x00 to 0x7f, which encodes the
length of the run - 3. A second byte provides the fixed delta in the
range of -128 to 127. Finally, the first value of the run is encoded
as a base 128 varint.
For example, if the sequence is 100 instances of 7 the encoding would
start with 100 - 3, followed by a delta of 0, and a varint of 7 for
an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
running from 100 to 1, the first byte is 100 - 3, the delta is -1,
and the varint is 100 for an encoding of [0x61, 0xff, 0x64].
Literals start with an initial byte of 0x80 to 0xff, which corresponds
to the negative of the number of literals in the sequence. Following the
header byte, the list of N varints is encoded. Thus, if there are
no runs, the overhead is 1 byte for each 128 integers. Numbers
[2, 3, 6, 7, 11] would be encoded as [0xfb, 0x02, 0x03, 0x06, 0x07, 0x0b].
## Integer Run Length Encoding, version 2
In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
which has improved compression and fixed bit width encodings for
faster expansion. RLEv2 uses four sub-encodings based on the data:
* Short Repeat - used for short sequences with repeated values
* Direct - used for random sequences with a fixed bit width
* Patched Base - used for random sequences with a variable bit width
* Delta - used for monotonically increasing or decreasing sequences
### Short Repeat
The short repeat encoding is used for short repeating integer
sequences with the goal of minimizing the overhead of the header. All
of the bits listed in the header are from the first byte to the last
and from most significant bit to least significant bit. If the type is
signed, the value is zigzag encoded.
* 1 byte header
* 2 bits for encoding type (0)
* 3 bits for width (W) of repeating value (1 to 8 bytes)
* 3 bits for repeat count (3 to 10 values)
* W bytes in big endian format, which are zigzag encoded if the type
is signed
The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
serialized with short repeat encoding (0), a width of 2 bytes (1), and
repeat count of 5 (2) as [0x0a, 0x27, 0x10].
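A decoder sketch for this sub-encoding (zigzag decoding for signed columns is
omitted):
```
// Expand a short repeat group; header layout: 2 bits type, 3 bits width - 1,
// 3 bits repeat count - 3, followed by the big endian value.
static long[] decodeShortRepeat(byte[] encoded) {
  int header = encoded[0] & 0xff;         // type (bits 7-6) must be 0
  int width = ((header >> 3) & 0x07) + 1; // value width in bytes (1 to 8)
  int count = (header & 0x07) + 3;        // repeat count (3 to 10)
  long value = 0;
  for (int i = 0; i < width; ++i) {
    value = (value << 8) | (encoded[1 + i] & 0xff);
  }
  long[] result = new long[count];
  java.util.Arrays.fill(result, value);
  return result;
}
```
decodeShortRepeat(new byte[] {0x0a, 0x27, 0x10}) returns five copies of 10000,
matching the example above.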
### Direct
The direct encoding is used for integer sequences whose values have a
relatively constant bit width. It encodes the values directly using a
fixed width big endian encoding. The width of the values is encoded
using the table below.
The 5 bit width encoding table for RLEv2:
Width in Bits | Encoded Value | Notes
:------------ | :------------ | :----
0 | 0 | for delta encoding
1 | 0 | for non-delta encoding
2 | 1
4 | 3
8 | 7
16 | 15
24 | 23
32 | 27
40 | 28
48 | 29
56 | 30
64 | 31
3 | 2 | deprecated
5 <= x <= 7 | x - 1 | deprecated
9 <= x <= 15 | x - 1 | deprecated
17 <= x <= 21 | x - 1 | deprecated
26 | 24 | deprecated
28 | 25 | deprecated
30 | 26 | deprecated
* 2 bytes header
* 2 bits for encoding type (1)
* 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
width encoding table
* 9 bits for length (L) (1 to 512 values)
* W * L bits (padded to the next byte) encoded in big endian format, which are
zigzag encoded if the type is signed
The unsigned sequence of [23713, 43806, 57005, 48879] would be
serialized with direct encoding (1), a width of 16 bits (15), and
length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
0xbe, 0xef].
> Note: the run length (4) is stored off by one; we get 4 by adding 1 to the
encoded 3 (see [Hive-4123](https://github.com/apache/hive/commit/69deabeaac020ba60b0f2156579f53e9fe46157a#diff-c00fea1863eaf0d6f047535e874274199020ffed3eb00deb897f513aa86f6b59R232-R236)).

### Patched Base
The patched base encoding is used for integer sequences whose bit
widths vary a lot. The minimum signed value of the sequence is found
and subtracted from the other values. The bit width of those adjusted
values is analyzed and the 90th percentile of the bit width is chosen
as W. The 10% of values larger than W use patches from a patch list
to set the additional bits. Patches are encoded as a list of gaps in
the index values and the additional value bits.
* 4 bytes header
* 2 bits for encoding type (2)
* 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
width encoding table
* 9 bits for length (L) (1 to 512 values)
* 3 bits for base value width (BW) (1 to 8 bytes)
* 5 bits for patch width (PW) (1 to 64 bits) using the 5 bit width
encoding table
* 3 bits for patch gap width (PGW) (1 to 8 bits)
* 5 bits for patch list length (PLL) (0 to 31 patches)
* Base value (BW bytes) - The base value is stored as a big endian value
with negative values marked by the most significant bit set. If that
bit is set, the entire value is negated.
* Data values (W * L bits padded to the byte) - A sequence of W bit positive
values that are added to the base value.
* Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
that didn't fit within W bits. Each entry in the list consists of a
gap, which is the number of elements skipped from the previous
patch, and a patch value. Patches are applied by logically or'ing
the data values with the relevant patch shifted W bits left. If a
patch is 0, it was introduced to skip over more than 255 items. The
combined length of each patch (PGW + PW) must be less than or equal
to 64.
The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
has a minimum of 2000, which makes the adjusted
sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
150, 160, 170, 180, 190]. It has an
encoding of patched base (2), a bit width of 8 (7), a length of 20
(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
patch gap width of 2 bits (1), and a patch list length of 1 (1). The
base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8]
### Delta
The Delta encoding is used for monotonically increasing or decreasing
sequences. The first two numbers in the sequence cannot be identical,
because the encoding uses the sign of the first delta to determine
if the series is increasing or decreasing.
* 2 bytes header
* 2 bits for encoding type (3)
* 5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
width encoding table
* 9 bits for run length (L) (1 to 512 values)
* Base value - encoded as (signed or unsigned) varint
* Delta base - encoded as signed varint
* Delta values (W * (L - 2) bits, padded to the next byte) - encode each delta after the first
one. If the delta base is positive, the sequence is increasing and if it is
negative the sequence is decreasing.
The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
serialized with delta encoding (3), a width of 4 bits (3), length of
10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].
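A decoder sketch checked against this example (it assumes an unsigned column,
so the base value is a plain varint, and a non-zero delta width of at most
24 bits, where the width table is simply the encoded value plus one):
```
static long[] decodeDelta(byte[] b) {
  int first = b[0] & 0xff;                        // 0xc6 = 0b11_00011_0
  int w = ((first >>> 1) & 0x1f) + 1;             // delta width: 3 -> 4 bits
  int length = (((first & 1) << 8) | (b[1] & 0xff)) + 1;
  int pos = 2;
  long base = 0;                                  // base value varint: 2
  int shift = 0;
  int x;
  do {
    x = b[pos++] & 0xff;
    base |= (long) (x & 0x7f) << shift;
    shift += 7;
  } while ((x & 0x80) != 0);
  long zz = 0;                                    // delta base, zigzag varint
  shift = 0;
  do {
    x = b[pos++] & 0xff;
    zz |= (long) (x & 0x7f) << shift;
    shift += 7;
  } while ((x & 0x80) != 0);
  long deltaBase = (zz >>> 1) ^ -(zz & 1);        // +1 for this example
  long[] out = new long[length];
  out[0] = base;
  out[1] = base + deltaBase;
  long bits = 0;
  int nbits = 0;
  for (int i = 2; i < length; ++i) {              // remaining W-bit deltas
    while (nbits < w) {
      bits = (bits << 8) | (b[pos++] & 0xff);
      nbits += 8;
    }
    long delta = (bits >>> (nbits - w)) & ((1L << w) - 1);
    nbits -= w;
    out[i] = out[i - 1] + (deltaBase < 0 ? -delta : delta);
  }
  return out;
}
```
Applied to [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46] it reproduces
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29].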
# Stripes
The body of ORC files consists of a series of stripes. Stripes are
large (typically ~200MB) and independent of each other and are often
processed by different tasks. The defining characteristic for columnar
storage formats is that the data for each column is stored separately
and that reading data out of the file should be proportional to the
number of columns read.
In ORC files, each column is stored in several streams that are stored
next to each other in the file. For example, an integer column is
represented as two streams: PRESENT, which uses one bit per value
to record whether the value is non-null, and DATA, which records the
non-null values. If all of a column's values in a stripe are non-null,
the PRESENT stream is omitted from the stripe. For binary data, ORC
uses three streams: PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
The layout of each stripe looks like:
* index streams
* unencrypted
* encryption variant 1..N
* data streams
* unencrypted
* encryption variant 1..N
* stripe footer
There is a general order for index and data streams:
* Index streams are always placed together in the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be
placed first and then followed by streams grouped by each encryption variant.
There is no fixed order within each unencrypted or encryption variant in the
index and data streams:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can even be placed in any order.
To get the precise information (i.e. stream kind, column id, and location) of
a stream within a stripe, the streams field in the StripeFooter described below
is the single source of truth.
In the example of the integer column mentioned above, the order of the
PRESENT stream and the DATA stream cannot be determined in advance.
We need to get the precise information from the **StripeFooter**.
## Stripe Footer
The stripe footer contains the encoding of each column and the
directory of the streams including their location.
```
message StripeFooter {
// the location of each stream
repeated Stream streams = 1;
// the encoding of each column
repeated ColumnEncoding columns = 2;
optional string writerTimezone = 3;
// one for each column encryption variant
repeated StripeEncryptionVariant encryption = 4;
}
```
If the file includes encrypted columns, those streams and column
encodings are stored separately in a StripeEncryptionVariant per
encryption variant. Additionally, the StripeFooter will contain two
additional virtual streams ENCRYPTED_INDEX and ENCRYPTED_DATA that
allocate the space that is used by the encryption variants to store
the encrypted index and data streams.
```
message StripeEncryptionVariant {
repeated Stream streams = 1;
repeated ColumnEncoding encoding = 2;
}
```
To describe each stream, ORC stores the kind of stream, the column id,
and the stream's size in bytes. The details of what is stored in each stream
depends on the type and encoding of the column.
```
message Stream {
enum Kind {
// boolean stream of whether the next value is non-null
PRESENT = 0;
// the primary data stream
DATA = 1;
// the length of each value for variable length data
LENGTH = 2;
// the dictionary blob
DICTIONARY_DATA = 3;
// deprecated prior to Hive 0.11
// It was used to store the number of instances of each value in the
// dictionary
DICTIONARY_COUNT = 4;
// a secondary data stream
SECONDARY = 5;
// the index for seeking to particular row groups
ROW_INDEX = 6;
// original bloom filters used before ORC-101
BLOOM_FILTER = 7;
// bloom filters that consistently use utf8
BLOOM_FILTER_UTF8 = 8;
// Virtual stream kinds to allocate space for encrypted index and data.
ENCRYPTED_INDEX = 9;
ENCRYPTED_DATA = 10;
// stripe statistics streams
STRIPE_STATISTICS = 100;
// A virtual stream kind that is used for setting the encryption IV.
FILE_STATISTICS = 101;
}
required Kind kind = 1;
// the column id
optional uint32 column = 2;
// the number of bytes in the file
optional uint64 length = 3;
}
```
Depending on their type, several options for encoding are possible. The
encodings are divided into direct or dictionary-based categories and
further refined as to whether they use RLE v1 or v2.
```
message ColumnEncoding {
enum Kind {
// the encoding is mapped directly to the stream using RLE v1
DIRECT = 0;
// the encoding uses a dictionary of unique values using RLE v1
DICTIONARY = 1;
// the encoding is direct using RLE v2
DIRECT_V2 = 2;
// the encoding is dictionary-based using RLE v2
DICTIONARY_V2 = 3;
}
required Kind kind = 1;
// for dictionary encodings, record the size of the dictionary
optional uint32 dictionarySize = 2;
}
```
# Column Encodings
## SmallInt, Int, and BigInt Columns
All of the 16, 32, and 64 bit integer column types use the same set of
potential encodings, which is basically whether they use RLE v1 or
v2. If the PRESENT stream is not included, all of the values are
present. For values that have false bits in the present stream, no
values are included in the data stream.
Encoding | Stream Kind | Optional | Contents
:-------- | :---------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
> Note that the order of the streams is not fixed; this also applies to other column types.
## Float and Double Columns
Floating point types are stored using IEEE 754 floating point bit
layout. Float columns use 4 bytes per value and double columns use 8
bytes.
Encoding | Stream Kind | Optional | Contents
:-------- | :---------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | IEEE 754 floating point representation
## String, Char, and VarChar Columns
String, char, and varchar columns may be encoded either using a
dictionary encoding or a direct encoding. A direct encoding should be
preferred when there are many distinct values. In all of the
encodings, the PRESENT stream encodes whether the value is null. The
Java ORC writer automatically picks the encoding after the first row
group (10,000 rows).
For direct encoding the UTF-8 bytes are saved in the DATA stream and
the length of each value is written into the LENGTH stream. In direct
encoding, if the values were ["Nevada", "California"]; the DATA
would be "NevadaCalifornia" and the LENGTH would be [6, 10].
For dictionary encodings the dictionary is sorted (in lexicographical
order of bytes in the UTF-8 encodings) and UTF-8 bytes of
each unique value are placed into DICTIONARY_DATA. The length of each
item in the dictionary is put into the LENGTH stream. The DATA stream
consists of the sequence of references to the dictionary elements.
In dictionary encoding, if the values were ["Nevada",
"California", "Nevada", "California", "Florida"], the
DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would
be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].
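To make the interplay of the streams concrete, here is a minimal Java
sketch that rebuilds the values from already-decoded dictionary streams;
the method and argument names are illustrative, not part of any ORC
reader API:
```java
import java.nio.charset.StandardCharsets;

// Rebuild column values from decoded dictionary streams (hypothetical inputs):
// dictionaryData = UTF-8 bytes of the sorted dictionary ("CaliforniaFloridaNevada"),
// lengths = LENGTH stream ([10, 7, 6]), data = DATA stream ([2, 0, 2, 0, 1]).
static String[] decodeDictionary(byte[] dictionaryData, int[] lengths, int[] data) {
  String[] dictionary = new String[lengths.length];
  int offset = 0;
  for (int i = 0; i < lengths.length; i++) {
    dictionary[i] = new String(dictionaryData, offset, lengths[i], StandardCharsets.UTF_8);
    offset += lengths[i];
  }
  String[] values = new String[data.length];
  for (int i = 0; i < data.length; i++) {
    values[i] = dictionary[data[i]]; // ["Nevada", "California", "Nevada", "California", "Florida"]
  }
  return values;
}
```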
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DICTIONARY | PRESENT | Yes | Boolean RLE
| DATA | No | Unsigned Integer RLE v1
| DICTIONARY_DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v2
DICTIONARY_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Unsigned Integer RLE v2
| DICTIONARY_DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v2
## Boolean Columns
Boolean columns are rare, but have a simple encoding.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Boolean RLE
## TinyInt Columns
TinyInt (byte) columns use byte run length encoding.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Byte RLE
## Binary Columns
Binary data is encoded with a PRESENT stream, a DATA stream that records
the contents, and a LENGTH stream that records the number of bytes per
value.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v2
## Decimal Columns
Decimal was introduced in Hive 0.11 with infinite precision (the total
number of digits). In Hive 0.13, the definition was changed to limit
the precision to a maximum of 38 digits, which conveniently uses 127
bits plus a sign bit. The current encoding of decimal columns stores
the integer representation of the value as an unbounded length zigzag
encoded base 128 varint. The scale is stored in the SECONDARY stream
as a signed integer.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Unbounded base 128 varints
| SECONDARY | No | Signed Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Unbounded base 128 varints
| SECONDARY | No | Signed Integer RLE v2
## Date Columns
Date data is encoded with a PRESENT stream and a DATA stream that
records the number of days after January 1, 1970 in UTC.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
## Timestamp Columns
Timestamp records times down to nanoseconds as a PRESENT stream that
records non-null values, a DATA stream that records the number of
seconds after 1 January 2015, and a SECONDARY stream that records the
number of nanoseconds.
Because the number of nanoseconds often has a large number of trailing
zeros, the number has trailing decimal zero digits removed and the
last three bits are used to record how many zeros were removed, if the
trailing zeros number more than two. Thus 1000 nanoseconds would be
serialized as 0x0a and 100000 would be serialized as 0x0c.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
| SECONDARY | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
| SECONDARY | No | Unsigned Integer RLE v2
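As a sketch of the nanosecond packing described above (the helper names
are ours, not a documented API; the round trip matches the 0x0a and 0x0c
examples):
```java
// Pack nanoseconds for the SECONDARY stream: strip trailing decimal zeros and
// record a code in the low 3 bits; code z means z + 1 zeros were stripped.
static long packNanos(int nanos) {
  if (nanos == 0) return 0;
  if (nanos % 100 != 0) return ((long) nanos) << 3; // fewer than two zeros: code 0
  nanos /= 100;
  int code = 1;
  while (nanos % 10 == 0 && code < 7) {
    nanos /= 10;
    code += 1;
  }
  return (((long) nanos) << 3) | code; // packNanos(1000) == 0x0a, packNanos(100000) == 0x0c
}

// Reverse the packing: multiply by 10^(code + 1) when a code is present.
static long unpackNanos(long packed) {
  int code = (int) (packed & 7);
  long result = packed >>> 3;
  if (code != 0) {
    for (int i = 0; i <= code; ++i) result *= 10;
  }
  return result;
}
```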
## Struct Columns
Structs have no data themselves and delegate everything to their child
columns except for their PRESENT stream. They have a child column
for each of the fields.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
## List Columns
Lists are encoded as the PRESENT stream and a length stream with
number of items in each list. They have a single child column for the
element values.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v2
## Map Columns
Maps are encoded as the PRESENT stream and a length stream with number
of items in each map. They have a child column for the key and
another child column for the value.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v2
## Union Columns
Unions are encoded as the PRESENT stream and a tag stream that controls which
potential variant is used. They have a child column for each variant of the
union. Currently ORC union types are limited to 256 variants, which matches
the Hive type model.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
              | DATA            | No       | Byte RLE
# Indexes
## Row Group Index
The row group indexes consist of a ROW_INDEX stream for each primitive
column that has an entry for each row group. Row groups are controlled
by the writer and default to 10,000 rows. Each RowIndexEntry gives the
position of each stream for the column and the statistics for that row
group.
The index streams are placed at the front of the stripe, because in
the default case of streaming they do not need to be read. They are
only loaded when either predicate push down is being used or the
reader seeks to a particular row.
```
message RowIndexEntry {
repeated uint64 positions = 1 [packed=true];
optional ColumnStatistics statistics = 2;
}
```
```
message RowIndex {
repeated RowIndexEntry entry = 1;
}
```
To record positions, each stream needs a sequence of numbers. For
uncompressed streams, the position is the byte offset of the RLE run's
start location followed by the number of values that need to be
consumed from the run. In compressed streams, the first number is the
start of the compression chunk in the stream, followed by the number
of decompressed bytes that need to be consumed, and finally the number
of values consumed in the RLE.
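For example, for a compressed file each stream contributes a tuple of
three numbers that are consumed in order; a hedged sketch, where
SeekableStream and RleReader are hypothetical stand-ins for a real
reader's internals:
```java
import java.io.IOException;

interface SeekableStream {
  void seekToChunk(long chunkOffset) throws IOException;   // jump to a compression chunk
  void skipDecompressed(long bytes) throws IOException;    // skip decompressed bytes
}

interface RleReader {
  void skipValues(long count) throws IOException;          // consume values from the run
}

// Consume one stream's position tuple and return the cursor for the next stream.
static int seekStream(long[] positions, int p, SeekableStream stream, RleReader rle)
    throws IOException {
  stream.seekToChunk(positions[p++]);      // start of the compression chunk
  stream.skipDecompressed(positions[p++]); // decompressed bytes to skip in the chunk
  rle.skipValues(positions[p++]);          // values to consume from the RLE run
  return p;                                // the next stream's positions follow (concatenated)
}
```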
For columns with multiple streams, the sequences of positions in each
stream are concatenated. That was an unfortunate decision on my part
that we should fix at some point, because it makes code that uses the
indexes error-prone.
Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**, which may be different from
the actual data stream placement; it is the same as the order given in
the [Column Encodings](#column-encoding-section) section described above.
## Bloom Filter Index
Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
Predicate pushdown can make use of bloom filters to better prune
the row groups that do not satisfy the filter condition.
The bloom filter indexes consist of a BLOOM_FILTER stream for each
column specified through the 'orc.bloom.filter.columns' table property.
A BLOOM_FILTER stream records a bloom filter entry for each row
group (10,000 rows by default) in a column. Only the row groups that
satisfy min/max row index evaluation will be evaluated against the
bloom filter index.
Each bloom filter entry stores the number of hash functions ('k') used
and the bitset backing the bloom filter. The original encoding (pre
ORC-101) stored the bitset as a repeating sequence of longs in the
bitset field with a little endian encoding (0x1 is bit 0 and 0x2 is
bit 1). After ORC-101, the encoding is a sequence of bytes with a
little endian encoding in the utf8bitset field.
```
message BloomFilter {
optional uint32 numHashFunctions = 1;
repeated fixed64 bitset = 2;
optional bytes utf8bitset = 3;
}
```
```
message BloomFilterIndex {
repeated BloomFilter bloomFilter = 1;
}
```
The bloom filter internally uses two different hash functions to map a
key to a position in the bit set. For tinyint, smallint, int, bigint,
float and double types, Thomas Wang's 64-bit integer hash function is
used. Doubles are converted to the IEEE-754 64 bit representation
(using Java's Double.doubleToLongBits(double)). Floats are converted
to double (using Java's float to double cast). All these primitive
types are cast to the long base type before being passed on to the
hash function.
For strings and binary types, Murmur3 64 bit hash algorithm is used.
The 64 bit variant of Murmur3 considers only the most significant
8 bytes of Murmur3 128-bit algorithm. The 64 bit hashcode generated
from the above algorithms is used as a base to derive 'k' different
hash functions. We use the idea mentioned in the paper "Less Hashing,
Same Performance: Building a Better Bloom Filter" by Kirsch et al. to
quickly compute the k hashcodes.
The algorithm for computing k hashcodes and setting the bit position
in a bloom filter is as follows:
1. Get 64 bit base hash code from Murmur3 or Thomas Wang's hash algorithm.
2. Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).
3. k'th hashcode is obtained by (where k > 0):
* combinedHash = hash1 + (k * hash2)
4. If combinedHash is negative flip all the bits:
* combinedHash = ~combinedHash
5. Bit set position is obtained by performing modulo with m:
* position = combinedHash % m
6. Set the position in the bit set. The bits above the least significant
 6 (position >>> 6) identify the long within the bitset, while the LSB
 6 bits give the bit position within that long in little endian order.
 * bitset[position >>> 6] |= (1L << position);
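A minimal Java sketch of steps 1 through 6, assuming the 64-bit base
hash has already been computed:
```java
// Derive k bit positions from one 64-bit hash via double hashing and set them.
static void addHash(long[] bitset, int numHashFunctions, long hash64) {
  int hash1 = (int) hash64;                    // lower 32 bits
  int hash2 = (int) (hash64 >>> 32);           // upper 32 bits
  long m = (long) bitset.length * Long.SIZE;   // total bits in the bit set
  for (int k = 1; k <= numHashFunctions; k++) {
    int combinedHash = hash1 + (k * hash2);
    if (combinedHash < 0) {
      combinedHash = ~combinedHash;            // flip all the bits if negative
    }
    int position = (int) (combinedHash % m);
    bitset[position >>> 6] |= (1L << position); // shift uses only the low 6 bits
  }
}
```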
Bloom filter streams are interlaced with row group indexes. This placement
makes it convenient to read the bloom filter stream and row index stream
together in a single read operation.
orc-format-1.1.0/specification/img/ 000755 000765 000024 00000000000 14777360722 020155 5 ustar 00dongjoon staff 000000 000000 orc-format-1.1.0/specification/index.md 000644 000765 000024 00000001067 14777360722 021036 0 ustar 00dongjoon staff 000000 000000 ---
layout: page
title: ORC Specification
---
There have been two released ORC file versions:
* [ORC v0](ORCv0.md) was released in Hive 0.11.
* [ORC v1](ORCv1.md) was released in Hive 0.12 and ORC 1.x.
Each version of the library will detect the format version and use
the appropriate reader. The library can also write the older versions
of the file format to ensure that users can write files that all of their
clusters can read correctly.
We are working on a new version of the file format:
* [ORC v2](ORCv2.md) is a work in progress and is rapidly evolving.
orc-format-1.1.0/specification/ORCv2.md 000644 000765 000024 00000162504 14777360722 020626 0 ustar 00dongjoon staff 000000 000000 ---
layout: page
title: Evolving Draft for ORC Specification v2
---
This specification is rapidly evolving and should only be used for
developers on the project.
# TO DO items
The list of things that we plan to change:
* Move decimal encoding to RLEv3 and remove variable length encoding.
* Create a better float/double encoding that splits mantissa and
exponent.
* Create a dictionary encoding for float, double, and decimal.
* Create RLEv3:
* 64 and 128 bit variants
* Zero suppression
* Evaluate the rle subformats
* Group stripe data into stripelets to enable Async IO for reads.
* Reorder stripe data into (stripe metadata, index, dictionary, data)
* Stop sorting dictionaries and record the sort order separately in the index.
* Remove use of RLEv1 and RLEv2.
* Remove non-utf8 bloom filter.
* Use numeric value for decimal statistics and bloom filter.
* Add Zstd with dictionary.
# Motivation
Hive's RCFile was the standard format for storing tabular data in
Hadoop for several years. However, RCFile has limitations because it
treats each column as a binary blob without semantics. In Hive 0.11 we
added a new file format named Optimized Row Columnar (ORC) file that
uses and retains the type information from the table definition. ORC
uses type specific readers and writers that provide light weight
compression techniques such as dictionary encoding, bit packing, delta
encoding, and run length encoding -- resulting in dramatically smaller
files. Additionally, ORC can apply generic compression using zlib, or
Snappy on top of the lightweight compression for even smaller
files. However, storage savings are only part of the gain. ORC
supports projection, which selects subsets of the columns for reading,
so that queries reading only one column read only the required
bytes. Furthermore, ORC files include light weight indexes that
include the minimum and maximum values for each column in each set of
10,000 rows and the entire file. Using pushdown filters from Hive, the
file reader can skip entire sets of rows that aren't important for
this query.
*(figure: ORC file structure)*
# File Tail
Since HDFS does not support changing the data in a file after it is
written, ORC stores the top level index at the end of the file. The
overall structure of the file is given in the figure above. The
file's tail consists of 3 parts; the file metadata, file footer and
postscript.
The metadata for ORC is stored using
[Protocol Buffers](https://s.apache.org/protobuf_encoding), which provides
the ability to add new fields without breaking readers. This document
incorporates the Protobuf definition from the
[ORC source code](../src/main/proto/orc/proto/orc_proto.proto) and the
reader is encouraged to review the Protobuf encoding if they need to
understand the byte-level encoding.
The sections of the file tail are (and their protobuf message type):
* encrypted stripe statistics: list of ColumnarStripeStatistics
* stripe statistics: Metadata
* footer: Footer
* postscript: PostScript
* psLen: byte
## Postscript
The Postscript section provides the necessary information to interpret
the rest of the file including the length of the file's Footer and
Metadata sections, the version of the file, and the kind of general
compression used (e.g. none, zlib, or snappy). The Postscript is never
compressed and ends one byte before the end of the file. The version
stored in the Postscript is the lowest version of Hive that is
guaranteed to be able to read the file and is stored as a sequence of
the major and minor version. This file version is encoded as [0,12].
The process of reading an ORC file works backwards through the
file. Rather than making multiple short reads, the ORC reader reads
the last 16k bytes of the file with the hope that it will contain both
the Footer and Postscript sections. The final byte of the file
contains the serialized length of the Postscript, which must be less
than 256 bytes. Once the Postscript is parsed, the compressed
serialized length of the Footer is known and it can be decompressed
and parsed.
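A hedged sketch of that process, assuming protobuf classes generated
from the definitions below (called OrcProto.PostScript here) and
ignoring the case where the Footer falls outside the initial read:
```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Read the tail of the file and parse the PostScript; the Footer bytes (which
// may still need decompressing) immediately precede it.
static OrcProto.PostScript readPostScript(RandomAccessFile file) throws IOException {
  long fileLength = file.length();
  int readSize = (int) Math.min(fileLength, 16 * 1024);  // the last 16k bytes
  byte[] tail = new byte[readSize];
  file.seek(fileLength - readSize);
  file.readFully(tail);
  int psLen = tail[readSize - 1] & 0xff;                 // final byte: PostScript length
  int psStart = readSize - 1 - psLen;                    // PostScript is never compressed
  return OrcProto.PostScript.parseFrom(Arrays.copyOfRange(tail, psStart, readSize - 1));
}
```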
```
message PostScript {
// the length of the footer section in bytes
optional uint64 footerLength = 1;
// the kind of generic compression used
optional CompressionKind compression = 2;
// the maximum size of each compression chunk
optional uint64 compressionBlockSize = 3;
// the version of the writer
repeated uint32 version = 4 [packed = true];
// the length of the metadata section in bytes
optional uint64 metadataLength = 5;
// the fixed string "ORC"
optional string magic = 8000;
}
```
```
enum CompressionKind {
NONE = 0;
ZLIB = 1;
SNAPPY = 2;
LZO = 3;
LZ4 = 4;
ZSTD = 5;
}
```
## Footer
The Footer section contains the layout of the body of the file, the
type schema information, the number of rows, and the statistics about
each of the columns.
The file is broken into three parts: Header, Body, and Tail. The
Header consists of the bytes "ORC" to support tools that want to
scan the front of the file to determine the type of the file. The Body
contains the rows and indexes, and the Tail gives the file level
information as described in this section.
```
message Footer {
// the length of the file header in bytes (always 3)
optional uint64 headerLength = 1;
// the length of the file header and body in bytes
optional uint64 contentLength = 2;
// the information about the stripes
repeated StripeInformation stripes = 3;
// the schema information
repeated Type types = 4;
// the user metadata that was added
repeated UserMetadataItem metadata = 5;
// the total number of rows in the file
optional uint64 numberOfRows = 6;
// the statistics of each column across the file
repeated ColumnStatistics statistics = 7;
// the maximum number of rows in each index entry
optional uint32 rowIndexStride = 8;
// Each implementation that writes ORC files should register for a code
// 0 = ORC Java
// 1 = ORC C++
// 2 = Presto
// 3 = Scritchley Go from https://github.com/scritchley/orc
// 4 = Trino
optional uint32 writer = 9;
// information about the encryption in this file
optional Encryption encryption = 10;
// the number of bytes in the encrypted stripe statistics
optional uint64 stripeStatisticsLength = 11;
}
```
### Stripe Information
The body of the file is divided into stripes. Each stripe is self
contained and may be read using only its own bytes combined with the
file's Footer and Postscript. Each stripe contains only entire rows so
that rows never straddle stripe boundaries. Stripes have three
sections: a set of indexes for the rows within the stripe, the data
itself, and a stripe footer. Both the indexes and the data sections
are divided by columns so that only the data for the required columns
needs to be read.
The encryptStripeId and encryptedLocalKeys support column
encryption. They are set on the first stripe of each ORC file with
column encryption and not set after that. For a stripe with the values
set, the reader should use those values for that stripe. Subsequent
stripes use the previous encryptStripeId + 1 and the same keys.
The current ORC merging code merges entire files, and thus the reader
will get the correct values on what was the first stripe and continue
on. If we develop a merge tool that reorders stripes or does partial
merges, these values will need to be set correctly by that tool.
```
message StripeInformation {
// the start of the stripe within the file
optional uint64 offset = 1;
// the length of the indexes in bytes
optional uint64 indexLength = 2;
// the length of the data in bytes
optional uint64 dataLength = 3;
// the length of the footer in bytes
optional uint64 footerLength = 4;
// the number of rows in the stripe
optional uint64 numberOfRows = 5;
// If this is present, the reader should use this value for the encryption
// stripe id for setting the encryption IV. Otherwise, the reader should
// use one larger than the previous stripe's encryptStripeId.
// For unmerged ORC files, the first stripe will use 1 and the rest of the
// stripes won't have it set. For merged files, the stripe information
// will be copied from their original files and thus the first stripe of
// each of the input files will reset it to 1.
// Note that 1 was chosen, because protobuf v3 doesn't serialize
// primitive types that are the default (eg. 0).
optional uint64 encryptStripeId = 6;
// For each encryption variant, the new encrypted local key to use until we
// find a replacement.
repeated bytes encryptedLocalKeys = 7;
}
```
### Type Information
All of the rows in an ORC file must have the same schema. Logically
the schema is expressed as a tree as in the figure below, where
the compound types have subcolumns under them.
*(figure: the example schema as a type tree)*
The equivalent Hive DDL would be:
```
create table Foobar (
myInt int,
myMap map<string, struct<myString : string, myDouble : double>>,
myTime timestamp
);
```
The type tree is flattened into a list via a pre-order traversal
where each type is assigned the next id. Clearly the root of the type
tree is always type id 0. Compound types have a field named subtypes
that contains the list of their children's type ids.
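A small sketch of that numbering, using a hypothetical in-memory schema
node:
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical schema node; TypeNode stands in for a real schema representation.
final class TypeNode {
  int id;
  final List<TypeNode> children = new ArrayList<>();
}

// Pre-order traversal: a node takes the next id, then its subtrees are numbered
// left to right. Calling assignIds(root, 0) gives the root type id 0.
static int assignIds(TypeNode node, int nextId) {
  node.id = nextId++;
  for (TypeNode child : node.children) {
    nextId = assignIds(child, nextId);
  }
  return nextId;
}
```
For the example schema above this yields: the struct root 0, myInt 1,
myMap 2, the map's key string 3, its value struct 4, myString 5,
myDouble 6, and myTime 7.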
```
message Type {
enum Kind {
BOOLEAN = 0;
BYTE = 1;
SHORT = 2;
INT = 3;
LONG = 4;
FLOAT = 5;
DOUBLE = 6;
STRING = 7;
BINARY = 8;
TIMESTAMP = 9;
LIST = 10;
MAP = 11;
STRUCT = 12;
UNION = 13;
DECIMAL = 14;
DATE = 15;
VARCHAR = 16;
CHAR = 17;
TIMESTAMP_INSTANT = 18;
GEOMETRY = 19;
GEOGRAPHY = 20;
}
// the kind of this type
required Kind kind = 1;
// the type ids of any subcolumns for list, map, struct, or union
repeated uint32 subtypes = 2 [packed=true];
// the list of field names for struct
repeated string fieldNames = 3;
// the maximum length of the type for varchar or char in UTF-8 characters
optional uint32 maximumLength = 4;
// the precision and scale for decimal
optional uint32 precision = 5;
optional uint32 scale = 6;
repeated StringPair attributes = 7;
// Coordinate Reference System (CRS) for Geometry and Geography types
optional string crs = 8;
// Edge interpolation algorithm for Geography type
enum EdgeInterpolationAlgorithm {
SPHERICAL = 0;
VINCENTY = 1;
THOMAS = 2;
ANDOYER = 3;
KARNEY = 4;
}
optional EdgeInterpolationAlgorithm algorithm = 9;
}
```
#### Geometry & Geography Types
##### Background
The Geometry and Geography class hierarchy and its Well-Known Text (WKT) and
Well-Known Binary (WKB) serializations (ISO variant supporting XY, XYZ, XYM,
XYZM) are defined by [OpenGIS Implementation Specification for Geographic
information - Simple feature access - Part 1: Common architecture][sfa-part1],
from [OGC (Open Geospatial Consortium)][ogc].
The version of the OGC standard first used here is 1.2.1, but future versions
may also be used if the WKB representation remains wire-compatible.
[sfa-part1]: https://portal.ogc.org/files/?artifact_id=25355
[ogc]: https://www.ogc.org/standard/sfa/
###### Coordinate Reference System
Coordinate Reference System (CRS) is a mapping of how coordinates refer to
locations on Earth.
The default CRS `OGC:CRS84` means that the geospatial features must be stored
in the order of longitude/latitude based on the WGS84 datum.
Custom CRS can be specified by a string value. It is recommended to use an
identifier-based approach like [Spatial reference identifier][srid].
For geographic CRS, longitudes are bound by [-180, 180] and latitudes are bound
by [-90, 90].
[srid]: https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier
###### Edge Interpolation Algorithm
The algorithm for interpolating edges; it is one of the following values:
* `spherical`: edges are interpolated as geodesics on a sphere.
* `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae)
* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)
###### CRS Customization
CRS is represented as a string value. Writer and reader implementations are
responsible for serializing and deserializing the CRS, respectively.
As a convention to maximize the interoperability, custom CRS values can be
specified by a string of the format `type:identifier`, where `type` is one of
the following values:
* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.
###### Coordinate Axis Order
The axis order of the coordinates in WKB and bounding box stored here
follows the de facto standard for axis order in WKB and is therefore always
(x, y) where x is easting or longitude and y is northing or latitude. This
ordering explicitly overrides the axis order as specified in the CRS.
### Column Statistics
The goal of the column statistics is that for each column, the writer
records the count and depending on the type other useful fields. For
most of the primitive types, it records the minimum and maximum
values; and for numeric types it additionally stores the sum.
From Hive 1.1.0 onwards, the column statistics will also record if
there are any null values within the row group by setting the hasNull flag.
The hasNull flag is used by ORC's predicate pushdown to better answer
'IS NULL' queries.
```
message ColumnStatistics {
// the number of values
optional uint64 numberOfValues = 1;
// At most one of these has a value for any column
optional IntegerStatistics intStatistics = 2;
optional DoubleStatistics doubleStatistics = 3;
optional StringStatistics stringStatistics = 4;
optional BucketStatistics bucketStatistics = 5;
optional DecimalStatistics decimalStatistics = 6;
optional DateStatistics dateStatistics = 7;
optional BinaryStatistics binaryStatistics = 8;
optional TimestampStatistics timestampStatistics = 9;
optional bool hasNull = 10;
optional uint64 bytes_on_disk = 11;
optional CollectionStatistics collection_statistics = 12;
optional GeospatialStatistics geospatial_statistics = 13;
}
```
For integer types (tinyint, smallint, int, bigint), the column
statistics includes the minimum, maximum, and sum. If the sum
overflows long at any point during the calculation, no sum is
recorded.
```
message IntegerStatistics {
optional sint64 minimum = 1;
optional sint64 maximum = 2;
optional sint64 sum = 3;
}
```
For floating point types (float, double), the column statistics
include the minimum, maximum, and sum. If the sum overflows a double,
no sum is recorded.
```
message DoubleStatistics {
optional double minimum = 1;
optional double maximum = 2;
optional double sum = 3;
}
```
For strings, the minimum value, maximum value, and the sum of the
lengths of the values are recorded.
```
message StringStatistics {
optional string minimum = 1;
optional string maximum = 2;
// sum will store the total length of all strings
optional sint64 sum = 3;
}
```
For booleans, the statistics include the count of false and true values.
```
message BucketStatistics {
repeated uint64 count = 1 [packed=true];
}
```
For decimals, the minimum, maximum, and sum are stored.
```
message DecimalStatistics {
optional string minimum = 1;
optional string maximum = 2;
optional string sum = 3;
}
```
Date columns record the minimum and maximum values as the number of
days since the UNIX epoch (1/1/1970 in UTC).
```
message DateStatistics {
// min,max values saved as days since epoch
optional sint32 minimum = 1;
optional sint32 maximum = 2;
}
```
Timestamp columns record the minimum and maximum values as the number of
milliseconds since the UNIX epoch (1/1/1970 00:00:00). The timestamp is
adjusted to UTC before being converted to milliseconds and stored in
`minimumUtc` and `maximumUtc`.
```
message TimestampStatistics {
// min,max values saved as milliseconds since epoch
optional sint64 minimum = 1;
optional sint64 maximum = 2;
// min,max values saved as milliseconds since UNIX epoch
optional sint64 minimumUtc = 3;
optional sint64 maximumUtc = 4;
}
```
Binary columns store the aggregate number of bytes across all of the values.
```
message BinaryStatistics {
// sum will store the total binary blob length
optional sint64 sum = 1;
}
```
Geometry and Geography columns store an optional bounding box and a list
of geospatial type codes from all values.
**Bounding Box**
A geospatial instance has at least two coordinate dimensions: X and Y for 2D
coordinates of each point. Please note that X is longitude/easting and Y is
latitude/northing. A geospatial instance can optionally have Z and/or M values
associated with each point.
The Z values introduce the third dimension coordinate. Usually they are used to
indicate the height, or elevation.
M values are an opportunity for a geospatial instance to express a fourth
dimension as a coordinate value. These values can be used as a linear reference
value (e.g., highway milepost value), a timestamp, or some other value as defined
by the CRS.
The bounding box is defined as the protobuf message below in the
representation of min/max value pairs of coordinates from each axis.
Note that X and Y values are always present. Z and M are omitted for
2D geospatial instances.
For the X values only, xmin may be greater than xmax. In this case, an object
in this bounding box may match if it contains an X such that `x >= xmin` OR
`x <= xmax`. This wraparound occurs only when the corresponding bounding box
crosses the antimeridian line. In geographic terminology, the concepts of `xmin`,
`xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`,
`southernmost` and `northernmost`, respectively.
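A reader's X-range test therefore needs to account for the wraparound;
a minimal sketch:
```java
// True if x falls inside the bounding box's X range, honoring the
// antimeridian wraparound case where xmin > xmax.
static boolean xInBounds(double x, double xmin, double xmax) {
  return xmin <= xmax ? (x >= xmin && x <= xmax)   // ordinary box
                      : (x >= xmin || x <= xmax);  // box crosses the antimeridian
}
```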
For Geography type, X and Y values are restricted to the canonical ranges of
[-180, 180] for X and [-90, 90] for Y.
**Geospatial Types**
A list of geospatial types from all instances in the Geometry or Geography
column, or an empty list if they are not known.
This is borrowed from [geometry_types of GeoParquet][geometry-types] except that
values in the list are [WKB (ISO-variant) integer codes][wkb-integer-code].
Table below shows the most common geospatial types and their codes:
| Type | XY | XYZ | XYM | XYZM |
| :----------------- | :--- | :--- | :--- | :--: |
| Point | 0001 | 1001 | 2001 | 3001 |
| LineString | 0002 | 1002 | 2002 | 3002 |
| Polygon | 0003 | 1003 | 2003 | 3003 |
| MultiPoint | 0004 | 1004 | 2004 | 3004 |
| MultiLineString | 0005 | 1005 | 2005 | 3005 |
| MultiPolygon | 0006 | 1006 | 2006 | 3006 |
| GeometryCollection | 0007 | 1007 | 2007 | 3007 |
In addition, the following rules are applied:
- A list of multiple values indicates that multiple geospatial types are present (e.g. `[0003, 0006]`).
- An empty array explicitly signals that the geospatial types are not known.
- The geospatial types in the list must be unique (e.g. `[0001, 0001]` is not valid).
[geometry-types]: https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md?plain=1#L159
[wkb-integer-code]: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
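The code layout follows a simple pattern, as this sketch illustrates:
```java
// WKB (ISO-variant) integer code for a geospatial type, matching the table
// above: Z adds 1000, M adds 2000 (e.g. Point XYZM -> 3001).
static int wkbTypeCode(int baseType /* 1 = Point .. 7 = GeometryCollection */,
                       boolean hasZ, boolean hasM) {
  return baseType + (hasZ ? 1000 : 0) + (hasM ? 2000 : 0);
}
```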
```protobuf
// Bounding box for Geometry or Geography type in the representation of min/max
// value pair of coordinates from each axis.
message BoundingBox {
optional double xmin = 1;
optional double xmax = 2;
optional double ymin = 3;
optional double ymax = 4;
optional double zmin = 5;
optional double zmax = 6;
optional double mmin = 7;
optional double mmax = 8;
}
// Statistics specific to Geometry or Geography type
message GeospatialStatistics {
// A bounding box of geospatial instances
optional BoundingBox bbox = 1;
// Geospatial type codes of all instances, or an empty list if not known
repeated int32 geospatial_types = 2;
}
```
### User Metadata
The user can add arbitrary key/value pairs to an ORC file as it is
written. The contents of the keys and values are completely
application defined, but the key is a string and the value is
binary. Care should be taken by applications to make sure that their
keys are unique and in general should be prefixed with an organization
code.
```
message UserMetadataItem {
// the user defined key
required string name = 1;
// the user defined binary value
required bytes value = 2;
}
```
### File Metadata
The file Metadata section contains column statistics at the stripe
level granularity. These statistics enable input split elimination
based on the predicate push-down evaluated per stripe.
```
message StripeStatistics {
repeated ColumnStatistics colStats = 1;
}
```
```
message Metadata {
repeated StripeStatistics stripeStats = 1;
}
```
# Column Encryption
ORC as of Apache ORC 1.6 supports column encryption where the data and
statistics of specific columns are encrypted on disk. Column
encryption provides fine-grain column level security even when many
users have access to the file itself. The encryption is transparent to
the user and the writer only needs to define which columns and
encryption keys to use. When reading an ORC file, if the user has
access to the keys, they will get the real data. If they do not have
the keys, they will get the masked data.
```
message Encryption {
// all of the masks used in this file
repeated DataMask mask = 1;
// all of the keys used in this file
repeated EncryptionKey key = 2;
// The encrypted variants.
// Readers should prefer the first variant that the user has access to
// the corresponding key. If they don't have access to any of the keys,
// they should get the unencrypted masked data.
repeated EncryptionVariant variants = 3;
// How are the local keys encrypted?
optional KeyProviderKind keyProvider = 4;
}
```
Each encrypted column in each file will have a random local key
generated for it. Thus, even though all of the decryption happens
locally in the reader, a malicious user that stores the key only
enables access to that column in that file. The local keys are encrypted
by the Hadoop or Ranger Key Management Server (KMS). The encrypted
local keys are stored in the file footer's StripeInformation.
```
enum KeyProviderKind {
UNKNOWN = 0;
HADOOP = 1;
AWS = 2;
GCP = 3;
AZURE = 4;
}
```
When ORC is using the Hadoop or Ranger KMS, it generates a random encrypted
local key (16 or 32 bytes for 128 or 256 bit AES respectively). Using the
first 16 bytes as the IV, it uses AES/CTR to decrypt the local key.
With the AWS KMS, the GenerateDataKey method is used to create a new local
key and the Decrypt method is used to decrypt it.
## Data Masks
The user's data is statically masked before writing the unencrypted
variant. Because the masking was done statically when the file was
written, the information about the masking is just informational.
The three standard masks are:
* nullify - all values become null
* redact - replace characters with constants such as X or 9
* sha256 - replace string with the SHA 256 of the value
The default is nullify, but masks may be defined by the user. Masks
are not allowed to change the type of the column, just the values.
```
message DataMask {
// the kind of masking, which may include third party masks
optional string name = 1;
// parameters for the mask
repeated string maskParameters = 2;
// the unencrypted column roots this mask was applied to
repeated uint32 columns = 3 [packed = true];
}
```
## Encryption Keys
In addition to the encrypted local keys, which are stored in the
footer's StripeInformation, the file also needs to describe the master
key that was used to encrypt the local keys. The master keys are
described by name, their version, and the encryption algorithm.
```
message EncryptionKey {
optional string keyName = 1;
optional uint32 keyVersion = 2;
optional EncryptionAlgorithm algorithm = 3;
}
```
The encryption algorithm is stored using an enumeration and, since
ProtoBuf uses the 0 value as a default, we added an unused value. That
ensures that if we add a new algorithm, old readers will get
UNKNOWN_ENCRYPTION instead of a real value.
```
enum EncryptionAlgorithm {
// used for detecting future algorithms
UNKNOWN_ENCRYPTION = 0;
// 128 bit AES/CTR
AES_CTR_128 = 1;
// 256 bit AES/CTR
AES_CTR_256 = 2;
}
```
## Encryption Variants
Each encrypted column is written as two variants:
* encrypted unmasked - for users with access to the key
* unencrypted masked - for all other users
The changes to the format were done so that old ORC readers will read
the masked unencrypted data. Encryption variants encrypt a subtree of
columns and use a single local key. The initial version of encryption
support only allows the two variants, but this may be extended later
and thus readers should use the first variant of a column that the
reader has access to.
```
message EncryptionVariant {
// the column id of the root column that is encrypted in this variant
optional uint32 root = 1;
// the key that encrypted this variant
optional uint32 key = 2;
// The master key that was used to encrypt the local key, referenced as
// an index into the Encryption.key list.
optional bytes encryptedKey = 3;
// the stripe statistics for this variant
repeated Stream stripeStatistics = 4;
// encrypted file statistics as a FileStatistics
optional bytes fileStatistics = 5;
}
```
Each variant stores stripe and file statistics separately. The file
statistics are serialized as a FileStatistics, compressed, encrypted
and stored in the EncryptionVariant.fileStatistics.
```
message FileStatistics {
repeated ColumnStatistics column = 1;
}
```
The stripe statistics for each column are serialized as
ColumnarStripeStatistics, compressed, encrypted and stored in a stream
of kind STRIPE_STATISTICS. By making the column stripe statistics
independent of each other, the reader only reads and parses the
columns contained in the SARG.
```
message ColumnarStripeStatistics {
// one value for each stripe in the file
repeated ColumnStatistics colStats = 1;
}
```
## Stream Encryption
Our encryption is done using AES/CTR. CTR is a mode that has some very
nice properties for us:
* It is seeded so that identical data is encrypted differently.
* It does not require padding the stream to the cipher length.
* It allows readers to seek in to a stream.
* The IV does not need to be randomly generated.
To ensure that we don't reuse IV, we set the IV as:
* bytes 0 to 2 - column id
* bytes 3 to 4 - stream kind
* bytes 5 to 7 - stripe id
* bytes 8 to 15 - cipher block counter
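A sketch of packing those fields (big-endian byte order within each
field is an assumption of this sketch; the counter occupies the
remaining bytes and starts at zero):
```java
// Build the 16-byte AES/CTR IV from the column id, stream kind, and stripe id.
static byte[] makeIv(int columnId, int streamKind, long stripeId) {
  byte[] iv = new byte[16];
  iv[0] = (byte) (columnId >> 16);
  iv[1] = (byte) (columnId >> 8);
  iv[2] = (byte) columnId;
  iv[3] = (byte) (streamKind >> 8);
  iv[4] = (byte) streamKind;
  iv[5] = (byte) (stripeId >> 16);
  iv[6] = (byte) (stripeId >> 8);
  iv[7] = (byte) stripeId;
  return iv; // bytes 8..15 stay zero: the initial cipher block counter
}
```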
However, it is critical for CTR that we never reuse an initialization
vector (IV) with the same local key.
For data in the footer, use the number of stripes in the file as the
stripe id. This guarantees when we write an intermediate footer into
a file that we don't use the same IV.
Additionally, we never reuse a local key for new data. For example, when
merging files, we don't reuse local key from the input files for the new
file tail, but always generate a new local key.
# Compression
If the ORC file writer selects a generic compression codec (zlib or
snappy), every part of the ORC file except for the Postscript is
compressed with that codec. However, one of the requirements for ORC
is that the reader be able to skip over compressed bytes without
decompressing the entire stream. To manage this, ORC writes compressed
streams in chunks with headers as in the figure below.
To handle incompressible data, if the compressed data is larger than
the original, the original is stored and the isOriginal flag is
set. Each header is 3 bytes long with (compressedLength * 2 +
isOriginal) stored as a little endian value. For example, the header
for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d,
0x03]. The header for 5 bytes that did not compress would be [0x0b,
0x00, 0x00]. Each compression chunk is compressed independently so
that as long as a decompressor starts at the top of a header, it can
start decompressing without the previous bytes.
*(figure: compression chunk structure)*
The default compression chunk size is 256K, but writers can choose
their own value. Larger chunks lead to better compression, but require
more memory. The chunk size is recorded in the Postscript so that
readers can allocate appropriately sized buffers. Readers are
guaranteed that no chunk will expand to more than the compression chunk
size.
ORC files without generic compression write each stream directly
with no headers.
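A sketch of building the 3-byte chunk header described above:
```java
// Little endian header: compressedLength * 2 + isOriginal, in 3 bytes.
static byte[] chunkHeader(int compressedLength, boolean isOriginal) {
  int header = compressedLength * 2 + (isOriginal ? 1 : 0);
  return new byte[] {
    (byte) header, (byte) (header >> 8), (byte) (header >> 16)
  };
}
// chunkHeader(100000, false) -> [0x40, 0x0d, 0x03]
// chunkHeader(5, true)       -> [0x0b, 0x00, 0x00]
```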
# Run Length Encoding
## Base 128 Varint
Variable width integer encodings take advantage of the fact that most
numbers are small and that having smaller encodings for small numbers
shrinks the overall size of the data. ORC uses the varint format from
Protocol Buffers, which writes data in little endian format using the
low 7 bits of each byte. The high bit in each byte is set if the
number continues into the next byte.
Unsigned Original | Serialized
:---------------- | :---------
0 | 0x00
1 | 0x01
127 | 0x7f
128 | 0x80, 0x01
129 | 0x81, 0x01
16,383 | 0xff, 0x7f
16,384 | 0x80, 0x80, 0x01
16,385 | 0x81, 0x80, 0x01
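A minimal writer for this format, matching the table above:
```java
import java.io.IOException;
import java.io.OutputStream;

// Write an unsigned base 128 varint: low 7 bits first, high bit set while
// more bytes follow (e.g. 16383 -> 0xff 0x7f).
static void writeVarUInt(OutputStream out, long value) throws IOException {
  while ((value & ~0x7fL) != 0) {
    out.write(((int) value & 0x7f) | 0x80);
    value >>>= 7;
  }
  out.write((int) value);
}
```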
For signed integer types, the number is converted into an unsigned
number using a zigzag encoding. Zigzag encoding moves the sign bit to
the least significant bit using the expression (val << 1) ^ (val >>
63) and derives its name from the fact that positive and negative
numbers alternate once encoded. The unsigned number is then serialized
as above.
Signed Original | Unsigned
:-------------- | :-------
0 | 0
-1 | 1
1 | 2
-2 | 3
2 | 4
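The conversion in both directions is a pair of one-liners:
```java
// Zigzag: move the sign bit to the least significant position.
static long zigZagEncode(long value)   { return (value << 1) ^ (value >> 63); }
static long zigZagDecode(long encoded) { return (encoded >>> 1) ^ -(encoded & 1); }
```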
## Byte Run Length Encoding
For byte streams, ORC uses a very light weight encoding of identical
values.
* Run - a sequence of at least 3 identical values
* Literals - a sequence of non-identical values
The first byte of each group of values is a header that determines
whether it is a run (value between 0 and 127) or a literal list (value
between -128 and -1). For runs, the control byte is the length of the
run minus the length of the minimal run (3) and the control byte for
literal lists is the negative length of the list. For example, a
hundred 0's is encoded as [0x61, 0x00] and the sequence 0x44, 0x45
would be encoded as [0xfe, 0x44, 0x45]. The next group can choose
either of the encodings.
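A sketch of decoding one group, matching the examples above:
```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Decode one byte-RLE group: a run ([0x61, 0x00] -> one hundred zeros) or a
// literal list ([0xfe, 0x44, 0x45] -> 0x44, 0x45).
static void decodeByteRleGroup(DataInputStream in, ByteArrayOutputStream out)
    throws IOException {
  byte header = in.readByte();
  if (header >= 0) {                            // run: header = length - 3
    byte value = in.readByte();
    for (int i = 0; i < header + 3; i++) out.write(value);
  } else {                                      // literals: -header values follow
    for (int i = 0; i < -header; i++) out.write(in.readByte());
  }
}
```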
## Boolean Run Length Encoding
For encoding boolean types, the bits are put in the bytes from most
significant to least significant. The bytes are encoded using byte run
length encoding as described in the previous section. For example,
the byte sequence [0xff, 0x80] would be one true followed by
seven false values.
## Integer Run Length Encoding, version 1
In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1),
which provides a lightweight compression of signed or unsigned integer
sequences. RLEv1 has two sub-encodings:
* Run - a sequence of values that differ by a small fixed delta
* Literals - a sequence of varint encoded values
Runs start with an initial byte of 0x00 to 0x7f, which encodes the
length of the run - 3. A second byte provides the fixed delta in the
range of -128 to 127. Finally, the first value of the run is encoded
as a base 128 varint.
For example, if the sequence is 100 instances of 7 the encoding would
start with 100 - 3, followed by a delta of 0, and a varint of 7 for
an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers
running from 100 to 1, the first byte is 100 - 3, the delta is -1,
and the varint is 100 for an encoding of [0x61, 0xff, 0x64].
Literals start with an initial byte of 0x80 to 0xff, which corresponds
to the negative of number of literals in the sequence. Following the
header byte, the list of N varints is encoded. Thus, if there are
no runs, the overhead is 1 byte for each 128 integers. Numbers
[2, 3, 6, 7, 11] would be encoded as [0xfb, 0x02, 0x03, 0x06, 0x07, 0x0b].
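A sketch of emitting one run, reusing the writeVarUInt sketch from the
varint section above:
```java
import java.io.IOException;
import java.io.OutputStream;

// Emit one RLEv1 run of 3 to 130 values that differ by a fixed delta.
// For signed columns the first value must be zigzag encoded beforehand.
static void writeRleV1Run(OutputStream out, long firstValue, int delta, int runLength)
    throws IOException {
  out.write(runLength - 3);        // control byte 0x00..0x7f
  out.write(delta);                // fixed delta, -128..127 (low byte is written)
  writeVarUInt(out, firstValue);   // first value as a base 128 varint
}
// writeRleV1Run(out, 7, 0, 100)    -> [0x61, 0x00, 0x07]
// writeRleV1Run(out, 100, -1, 100) -> [0x61, 0xff, 0x64]
```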
## Integer Run Length Encoding, version 2
In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2),
which has improved compression and fixed bit width encodings for
faster expansion. RLEv2 uses four sub-encodings based on the data:
* Short Repeat - used for short sequences with repeated values
* Direct - used for random sequences with a fixed bit width
* Patched Base - used for random sequences with a variable bit width
* Delta - used for monotonically increasing or decreasing sequences
### Short Repeat
The short repeat encoding is used for short repeating integer
sequences with the goal of minimizing the overhead of the header. All
of the bits listed in the header are from the first byte to the last
and from most significant bit to least significant bit. If the type is
signed, the value is zigzag encoded.
* 1 byte header
* 2 bits for encoding type (0)
* 3 bits for width (W) of repeating value (1 to 8 bytes)
* 3 bits for repeat count (3 to 10 values)
* W bytes in big endian format, which is zigzag encoded if the type
  is signed
The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be
serialized with short repeat encoding (0), a width of 2 bytes (1), and
repeat count of 5 (2) as [0x0a, 0x27, 0x10].
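A sketch of serializing a short repeat run (the value must already be
zigzag encoded for signed types):
```java
// Header: 2 type bits (00), 3 bits for byte width - 1, 3 bits for count - 3,
// followed by the value in big endian (e.g. shortRepeat(10000, 2, 5)
// -> [0x0a, 0x27, 0x10]).
static byte[] shortRepeat(long value, int byteWidth /* 1..8 */, int count /* 3..10 */) {
  byte[] out = new byte[1 + byteWidth];
  out[0] = (byte) (((byteWidth - 1) << 3) | (count - 3));
  for (int i = 0; i < byteWidth; i++) {
    out[1 + i] = (byte) (value >>> (8 * (byteWidth - 1 - i)));
  }
  return out;
}
```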
### Direct
The direct encoding is used for integer sequences whose values have a
relatively constant bit width. It encodes the values directly using a
fixed width big endian encoding. The width of the values is encoded
using the table below.
The 5 bit width encoding table for RLEv2:
Width in Bits | Encoded Value | Notes
:------------ | :------------ | :----
0 | 0 | for delta encoding
1 | 0 | for non-delta encoding
2 | 1
4 | 3
8 | 7
16 | 15
24 | 23
32 | 27
40 | 28
48 | 29
56 | 30
64 | 31
3 | 2 | deprecated
5 <= x <= 7 | x - 1 | deprecated
9 <= x <= 15 | x - 1 | deprecated
17 <= x <= 21 | x - 1 | deprecated
26 | 24 | deprecated
28 | 25 | deprecated
30 | 26 | deprecated
* 2 bytes header
* 2 bits for encoding type (1)
* 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
width encoding table
* 9 bits for length (L) (1 to 512 values)
* W * L bits (padded to the next byte) encoded in big endian format, which is
  zigzag encoded if the type is signed
The unsigned sequence of [23713, 43806, 57005, 48879] would be
serialized with direct encoding (1), a width of 16 bits (15), and
length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
0xbe, 0xef].
> Note: the run length (4) is stored off by one; we get 4 by adding 1 to the
stored value of 3
(See [Hive-4123](https://github.com/apache/hive/commit/69deabeaac020ba60b0f2156579f53e9fe46157a#diff-c00fea1863eaf0d6f047535e874274199020ffed3eb00deb897f513aa86f6b59R232-R236))
### Patched Base
The patched base encoding is used for integer sequences whose bit
width varies a lot. The minimum signed value of the sequence is found
and subtracted from the other values. The bit width of those adjusted
values is analyzed and the 90 percentile of the bit width is chosen
as W. The 10\% of values larger than W use patches from a patch list
to set the additional bits. Patches are encoded as a list of gaps in
the index values and the additional value bits.
* 4 bytes header
* 2 bits for encoding type (2)
* 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit
width encoding table
* 9 bits for length (L) (1 to 512 values)
* 3 bits for base value width (BW) (1 to 8 bytes)
* 5 bits for patch width (PW) (1 to 64 bits) using the 5 bit width
encoding table
* 3 bits for patch gap width (PGW) (1 to 8 bits)
* 5 bits for patch list length (PLL) (0 to 31 patches)
* Base value (BW bytes) - The base value is stored as a big endian value
  with negative values marked by the most significant bit set. If that
  bit is set, the entire value is negated.
* Data values (W * L bits padded to the byte) - A sequence of W bit positive
values that are added to the base value.
* Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
that didn't fit within W bits. Each entry in the list consists of a
gap, which is the number of elements skipped from the previous
patch, and a patch value. Patches are applied by logically or'ing
the data values with the relevant patch shifted W bits left. If a
patch is 0, it was introduced to skip over more than 255 items. The
combined length of each patch (PGW + PW) must be less or equal to
64.
The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070,
2080, 2090, 2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]
has a minimum of 2000, which makes the adjusted
sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
150, 160, 170, 180, 190]. It has an
encoding of patched base (2), a bit width of 8 (7), a length of 20
(19), a base value width of 2 bytes (1), a patch width of 12 bits (11),
patch gap width of 2 bits (1), and a patch list length of 1 (1). The
base value is 2000 and the combined result is [0x8e, 0x13, 0x2b, 0x21, 0x07,
0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e,
0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc, 0xe8]
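A sketch of the final decoding steps, assuming the header fields and the
W-bit data values and patch entries have already been unpacked:
```java
// Apply the patch list to the W-bit adjusted values, then add the base back.
// Each entry carries PGW gap bits above PW patch bits.
static long[] patchedBaseDecode(long base, long[] data, long[] patchEntries,
                                int w, int pw) {
  int index = 0;
  for (long entry : patchEntries) {
    long gap = entry >>> pw;                  // elements skipped since the last patch
    long patch = entry & ((1L << pw) - 1);    // the value's missing high bits
    index += (int) gap;
    if (patch != 0) {                         // a zero patch only extends the gap
      data[index] |= patch << w;
    }
  }
  long[] result = new long[data.length];
  for (int i = 0; i < data.length; i++) {
    result[i] = base + data[i];               // e.g. 2000 + 998000 = 1000000
  }
  return result;
}
```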
### Delta
The Delta encoding is used for monotonically increasing or decreasing
sequences. The first two numbers in the sequence cannot be identical,
because the encoding uses the sign of the first delta to determine
if the series is increasing or decreasing.
* 2 bytes header
* 2 bits for encoding type (3)
* 5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit
width encoding table
* 9 bits for run length (L) (1 to 512 values)
* Base value - encoded as (signed or unsigned) varint
* Delta base - encoded as signed varint
* Delta values (W * (L - 2)) bytes - encode each delta after the first
one. If the delta base is positive, the sequence is increasing and if it is
negative the sequence is decreasing.
The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be
serialized with delta encoding (3), a width of 4 bits (3), length of
10 (9), a base of 2 (2), and first delta of 1 (2). The resulting
sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].
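A sketch of expanding a delta run once the header, base, delta base, and
packed deltas have been read:
```java
// Rebuild the sequence: the delta base supplies both the second value and the
// direction; the remaining W-bit deltas are magnitudes.
static long[] deltaDecode(long base, long deltaBase, long[] deltas) {
  long[] out = new long[deltas.length + 2];
  out[0] = base;
  out[1] = base + deltaBase;
  for (int i = 0; i < deltas.length; i++) {
    out[i + 2] = out[i + 1] + (deltaBase < 0 ? -deltas[i] : deltas[i]);
  }
  return out;
}
// deltaDecode(2, 1, new long[]{2, 2, 4, 2, 4, 2, 4, 6})
//   -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```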
# Stripes
The body of ORC files consists of a series of stripes. Stripes are
large (typically ~200MB) and independent of each other and are often
processed by different tasks. The defining characteristic for columnar
storage formats is that the data for each column is stored separately
and that reading data out of the file should be proportional to the
number of columns read.
In ORC files, each column is stored in several streams that are stored
next to each other in the file. For example, an integer column is
represented as two streams PRESENT, which uses one bit per value to
record whether the value is non-null, and DATA, which records the
non-null values. If all of a column's values in a stripe are non-null,
the PRESENT stream is omitted from the stripe. For binary data, ORC
uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
The layout of each stripe looks like:
* index streams
* unencrypted
* encryption variant 1..N
* data streams
* unencrypted
* encryption variant 1..N
* stripe footer
There is a general order for index and data streams:
* Index streams are always placed together in the beginning of the stripe.
* Data streams are placed together after index streams (if any).
* Inside index streams or data streams, the unencrypted streams should be
placed first and then followed by streams grouped by each encryption variant.
There is no fixed order within each unencrypted or encryption variant in the
index and data streams:
* Different stream kinds of the same column can be placed in any order.
* Streams from different columns can even be placed in any order.
To get the precise information (i.e. the stream kind, column id, and
location) of a stream within a stripe, the streams field in the
StripeFooter described below is the single source of truth.
In the example of the integer column mentioned above, the order of the
PRESENT stream and the DATA stream cannot be determined in advance;
we need to get the precise information from the **StripeFooter**.
## Stripe Footer
The stripe footer contains the encoding of each column and the
directory of the streams including their location.
```
message StripeFooter {
// the location of each stream
repeated Stream streams = 1;
// the encoding of each column
repeated ColumnEncoding columns = 2;
optional string writerTimezone = 3;
// one for each column encryption variant
repeated StripeEncryptionVariant encryption = 4;
}
```
If the file includes encrypted columns, those streams and column
encodings are stored separately in a StripeEncryptionVariant per
encryption variant. Additionally, the StripeFooter will contain two
additional virtual streams ENCRYPTED_INDEX and ENCRYPTED_DATA that
allocate the space that is used by the encryption variants to store
the encrypted index and data streams.
```
message StripeEncryptionVariant {
repeated Stream streams = 1;
repeated ColumnEncoding encoding = 2;
}
```
To describe each stream, ORC stores the kind of stream, the column id,
and the stream's size in bytes. The details of what is stored in each
stream depend on the type and encoding of the column.
```
message Stream {
enum Kind {
// boolean stream of whether the next value is non-null
PRESENT = 0;
// the primary data stream
DATA = 1;
// the length of each value for variable length data
LENGTH = 2;
// the dictionary blob
DICTIONARY_DATA = 3;
// deprecated prior to Hive 0.11
// It was used to store the number of instances of each value in the
// dictionary
DICTIONARY_COUNT = 4;
// a secondary data stream
SECONDARY = 5;
// the index for seeking to particular row groups
ROW_INDEX = 6;
// original bloom filters used before ORC-101
BLOOM_FILTER = 7;
// bloom filters that consistently use utf8
BLOOM_FILTER_UTF8 = 8;
// Virtual stream kinds to allocate space for encrypted index and data.
ENCRYPTED_INDEX = 9;
ENCRYPTED_DATA = 10;
// stripe statistics streams
STRIPE_STATISTICS = 100;
// A virtual stream kind that is used for setting the encryption IV.
FILE_STATISTICS = 101;
}
required Kind kind = 1;
// the column id
optional uint32 column = 2;
// the number of bytes in the file
optional uint64 length = 3;
}
```
Depending on their type, several options for encoding are possible. The
encodings are divided into direct or dictionary-based categories and
further refined as to whether they use RLE v1 or v2.
```
message ColumnEncoding {
enum Kind {
// the encoding is mapped directly to the stream using RLE v1
DIRECT = 0;
// the encoding uses a dictionary of unique values using RLE v1
DICTIONARY = 1;
// the encoding is direct using RLE v2
DIRECT_V2 = 2;
// the encoding is dictionary-based using RLE v2
DICTIONARY_V2 = 3;
}
required Kind kind = 1;
// for dictionary encodings, record the size of the dictionary
optional uint32 dictionarySize = 2;
}
```
# Column Encodings
## SmallInt, Int, and BigInt Columns
All of the 16, 32, and 64 bit integer column types use the same set of
potential encodings, which is basically whether they use RLE v1 or
v2. If the PRESENT stream is not included, all of the values are
present. For values that have false bits in the present stream, no
values are included in the data stream.
Encoding | Stream Kind | Optional | Contents
:-------- | :---------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
> Note that the order of the streams is not fixed; this also applies to the other column types.
## Float and Double Columns
Floating point types are stored using IEEE 754 floating point bit
layout. Float columns use 4 bytes per value and double columns use 8
bytes.
Encoding | Stream Kind | Optional | Contents
:-------- | :---------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | IEEE 754 floating point representation
## String, Char, and VarChar Columns
String, char, and varchar columns may be encoded either using a
dictionary encoding or a direct encoding. A direct encoding should be
preferred when there are many distinct values. In all of the
encodings, the PRESENT stream encodes whether the value is null. The
Java ORC writer automatically picks the encoding after the first row
group (10,000 rows).
For direct encoding the UTF-8 bytes are saved in the DATA stream and
the length of each value is written into the LENGTH stream. In direct
encoding, if the values were ["Nevada", "California"], the DATA
would be "NevadaCalifornia" and the LENGTH would be [6, 10].
For dictionary encodings the dictionary is sorted (in lexicographical
order of bytes in the UTF-8 encodings) and UTF-8 bytes of
each unique value are placed into DICTIONARY_DATA. The length of each
item in the dictionary is put into the LENGTH stream. The DATA stream
consists of the sequence of references to the dictionary elements.
In dictionary encoding, if the values were ["Nevada",
"California", "Nevada", "California", "Florida"], the
DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would
be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1].
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DICTIONARY | PRESENT | Yes | Boolean RLE
| DATA | No | Unsigned Integer RLE v1
| DICTIONARY_DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v2
DICTIONARY_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Unsigned Integer RLE v2
| DICTIONARY_DATA | No | String contents
| LENGTH | No | Unsigned Integer RLE v2
## Boolean Columns
Boolean columns are rare, but have a simple encoding.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Boolean RLE
## TinyInt Columns
TinyInt (byte) columns use byte run length encoding.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Byte RLE
## Binary Columns
Binary data is encoded with a PRESENT stream, a DATA stream that records
the contents, and a LENGTH stream that records the number of bytes per a
value.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Binary contents
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Binary contents
| LENGTH | No | Unsigned Integer RLE v2
## Decimal Columns
Since Hive 0.13, all decimals have had fixed precision and scale.
The goal is to use RLE v3 for the value and the fixed scale from
the type. As an interim solution, we use RLE v2 for short decimals
(precision <= 18) and the old encoding for long decimals.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Unbounded base 128 varints
| SECONDARY | No | Signed Integer RLE v2
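As a sketch of the old encoding's DATA stream, the zigzag base 128 varint serialization of an unscaled decimal value might look like the following. This illustrative helper only handles values that fit in a long, while real long decimals need arbitrary-precision arithmetic; in the old encoding the per-value scale is carried separately in the SECONDARY stream:
```
import java.io.ByteArrayOutputStream;

public class ZigzagVarintExample {
    // Zigzag maps signed values to unsigned so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigzag(long value) {
        return (value << 1) ^ (value >> 63);
    }

    // Base 128 varint: 7 payload bits per byte, least significant group
    // first, with the high bit set on every byte except the last.
    static byte[] encode(long value) {
        long z = zigzag(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) (z & 0x7F) | 0x80);
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The unscaled value of decimal 12.34 (scale 2) is 1234.
        for (byte b : encode(1234)) {
            System.out.printf("%02x ", b); // prints: a4 13
        }
    }
}
```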
## Date Columns
Date data is encoded with a PRESENT stream and a DATA stream that
records the number of days after January 1, 1970 in UTC.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
## Timestamp Columns
Timestamp records times down to nanoseconds as a PRESENT stream that
records non-null values, a DATA stream that records the number of
seconds after 1 January 2015, and a SECONDARY stream that records the
number of nanoseconds.
Because the number of nanoseconds often has a large number of trailing
zeros, a value with at least two trailing decimal zeros has those
zeros removed, and the last three bits of the serialized value record
one less than the number of zeros removed. Thus 1000 nanoseconds would
be serialized as 0x0a and 100000 would be serialized as 0x0c.
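Below is a minimal Java sketch of this trailing-zero scheme, written to be consistent with the two serialized examples above; the helper names are illustrative:
```
public class TimestampNanosExample {
    // Encode nanoseconds: strip trailing decimal zeros when there are at
    // least two, and record (zeros removed - 1) in the low three bits.
    static long formatNanos(long nanos) {
        if (nanos == 0) {
            return 0;
        } else if (nanos % 100 != 0) {
            return nanos << 3;            // fewer than two trailing zeros
        }
        nanos /= 100;
        int code = 1;                     // code c means c + 1 zeros removed
        while (nanos % 10 == 0 && code < 7) {
            nanos /= 10;
            code += 1;
        }
        return (nanos << 3) | code;
    }

    // Decode: if the low three bits are non-zero, re-append (code + 1) zeros.
    static long parseNanos(long serialized) {
        int code = (int) (serialized & 7);
        long result = serialized >>> 3;
        if (code != 0) {
            for (int i = 0; i <= code; ++i) {
                result *= 10;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.printf("0x%02x%n", formatNanos(1000));   // 0x0a
        System.out.printf("0x%02x%n", formatNanos(100000)); // 0x0c
        System.out.println(parseNanos(0x0a));               // 1000
        System.out.println(parseNanos(0x0c));               // 100000
    }
}
```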
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
| SECONDARY | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
| SECONDARY | No | Unsigned Integer RLE v2
## Struct Columns
Structs have no data themselves, other than their PRESENT stream, and
delegate everything else to their child columns. They have a child
column for each of the fields.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
## List Columns
Lists are encoded as the PRESENT stream and a length stream with the
number of items in each list. They have a single child column for the
element values.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v2
## Map Columns
Maps are encoded as the PRESENT stream and a length stream with the
number of items in each map. They have a child column for the key and
another child column for the value.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| LENGTH | No | Unsigned Integer RLE v2
## Union Columns
Unions are encoded as the PRESENT stream and a tag stream that controls which
potential variant is used. They have a child column for each variant of the
union. Currently ORC union types are limited to 256 variants, which matches
the Hive type model.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Byte RLE
## Geometry & Geography Columns
Geometry and Geography data are encoded with a PRESENT stream, a DATA stream that records
the WKB-encoded geometry/geography data as binary, and a LENGTH stream that records
the number of bytes per value.
Encoding | Stream Kind | Optional | Contents
:------------ | :-------------- | :------- | :-------
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Binary contents
| LENGTH | No | Unsigned Integer RLE v1
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Binary contents
| LENGTH | No | Unsigned Integer RLE v2
# Indexes
## Row Group Index
The row group indexes consist of a ROW_INDEX stream for each primitive
column that has an entry for each row group. Row groups are controlled
by the writer and default to 10,000 rows. Each RowIndexEntry gives the
position of each stream for the column and the statistics for that row
group.
The index streams are placed at the front of the stripe, because in
the default case of streaming they do not need to be read. They are
only loaded when either predicate push down is being used or the
reader seeks to a particular row.
```
message RowIndexEntry {
repeated uint64 positions = 1 [packed=true];
optional ColumnStatistics statistics = 2;
}
```
```
message RowIndex {
repeated RowIndexEntry entry = 1;
}
```
To record positions, each stream needs a sequence of numbers. For
uncompressed streams, the position is the byte offset of the RLE run's
start location followed by the number of values that need to be
consumed from the run. In compressed streams, the first number is the
start of the compression chunk in the stream, followed by the number
of decompressed bytes that need to be consumed, and finally the number
of values consumed in the RLE.
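As an illustration (the offsets below are made up), a position sequence for a single compressed stream could be interpreted as follows:
```
public class RowIndexPositionExample {
    public static void main(String[] args) {
        // Hypothetical positions for one compressed RLE stream.
        long[] positions = {1000, 100, 3};

        long chunkStart        = positions[0]; // byte offset of the compression chunk
        long decompressedBytes = positions[1]; // decompressed bytes to consume
        long valuesToConsume   = positions[2]; // values to skip within the RLE run

        // For an uncompressed stream the middle number is absent:
        // positions would be {byteOffset, valuesToConsume}.
        System.out.printf("seek to chunk at byte %d, skip %d decompressed bytes," +
            " then skip %d values%n", chunkStart, decompressedBytes, valuesToConsume);
    }
}
```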
For columns with multiple streams, the sequences of positions in each
stream are concatenated. That was an unfortunate decision on my part
that we should fix at some point, because it makes code that uses the
indexes error-prone.
Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
Note that for columns with multiple streams, the order of stream
positions in the RowIndex is **fixed**, which may differ from the
actual data stream placement; it follows the same order as the
[Column Encodings](#column-encoding-section) section described above.
## Bloom Filter Index
Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
Predicate pushdown can make use of bloom filters to better prune
the row groups that do not satisfy the filter condition.
The bloom filter indexes consist of a BLOOM_FILTER stream for each
column specified through the 'orc.bloom.filter.columns' table property.
A BLOOM_FILTER stream records a bloom filter entry for each row
group (defaulting to 10,000 rows) in a column. Only the row groups that
satisfy min/max row index evaluation will be evaluated against the
bloom filter index.
Each bloom filter entry stores the number of hash functions ('k') used
and the bitset backing the bloom filter. The original encoding (pre
ORC-101) stored the bitset as a repeating sequence of longs in the
bitset field with a little endian encoding (0x1 is bit 0 and 0x2 is
bit 1). After ORC-101, the encoding is a sequence of bytes with a
little endian encoding in the utf8bitset field.
```
message BloomFilter {
optional uint32 numHashFunctions = 1;
repeated fixed64 bitset = 2;
optional bytes utf8bitset = 3;
}
```
```
message BloomFilterIndex {
repeated BloomFilter bloomFilter = 1;
}
```
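As an illustration of the two bitset layouts described above, the following Java sketch converts a long-array bitset (the pre-ORC-101 bitset field) into the byte sequence used by the post-ORC-101 utf8bitset field, assuming both use little endian bit order as the text states; it is a sketch, not ORC library code:
```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BitsetLayoutExample {
    // Converts the pre-ORC-101 long[] bitset into the post-ORC-101
    // byte[] layout by writing each long in little endian byte order,
    // so bit 0 of the filter stays in the lowest bit of the first byte.
    static byte[] toUtf8Bitset(long[] bitset) {
        ByteBuffer buffer = ByteBuffer.allocate(bitset.length * 8)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        for (long word : bitset) {
            buffer.putLong(word);
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        // A filter with bits 0 and 1 set (0x1 is bit 0, 0x2 is bit 1).
        byte[] bytes = toUtf8Bitset(new long[]{0x3L});
        System.out.printf("%02x%n", bytes[0]); // prints: 03
    }
}
```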
The bloom filter internally uses two different hash functions to map a
key to a position in the bit set. For tinyint, smallint, int, bigint,
float and double types, Thomas Wang's 64-bit integer hash function is
used. Doubles are converted to the IEEE 754 64-bit representation
(using Java's Double.doubleToLongBits(double)). Floats are converted
to double (using Java's float-to-double cast). All of these primitive
types are cast to the long base type before being passed to the hash
function. For string and binary types, the Murmur3 64-bit hash
algorithm is used. The 64-bit variant of Murmur3 considers only the
most significant 8 bytes of the Murmur3 128-bit algorithm. The 64-bit
hashcode generated from the above algorithms is used as a base to
derive 'k' different hash functions. We use the idea from the paper
"Less Hashing, Same Performance: Building a Better Bloom Filter" by
Kirsch et al. to quickly compute the k hashcodes.
The algorithm for computing k hashcodes and setting the bit position
in a bloom filter is as follows:
1. Get 64 bit base hash code from Murmur3 or Thomas Wang's hash algorithm.
2. Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2).
3. k'th hashcode is obtained by (where k > 0):
* combinedHash = hash1 + (k * hash2)
4. If combinedHash is negative, flip all the bits:
* combinedHash = ~combinedHash
5. Bit set position is obtained by performing modulo with m:
* position = combinedHash % m
6. Set the position in the bit set. The upper bits of the position select
   the long index within the bitset (position >>> 6), the least significant
   6 bits select the bit within that long, and bit positions within each
   long use little endian order.
   * bitset[position >>> 6] \|= (1L << position);
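Below is a self-contained Java sketch of steps 1 through 6, assuming the 64-bit base hash has already been computed (the Murmur3 and Wang hash functions are out of scope here); it uses the long-array bitset layout from step 6, and the class name is illustrative:
```
import java.util.Arrays;

public class BloomFilterExample {
    private final long[] bitset;
    private final int numBits;           // m, the number of bits
    private final int numHashFunctions;  // k

    BloomFilterExample(int numBits, int numHashFunctions) {
        this.bitset = new long[(numBits + 63) / 64];
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    // Steps 1-6: derive k positions from one 64-bit base hash and set them.
    void addHash(long hash64) {
        int hash1 = (int) hash64;          // low 32 bits
        int hash2 = (int) (hash64 >>> 32); // high 32 bits
        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + i * hash2;      // step 3
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;          // step 4: flip all bits
            }
            int position = combinedHash % numBits;     // step 5
            bitset[position >>> 6] |= 1L << position;  // step 6
        }
    }

    public static void main(String[] args) {
        BloomFilterExample bf = new BloomFilterExample(64, 3);
        bf.addHash(0x0123456789abcdefL);
        System.out.println(Arrays.toString(bf.bitset));
    }
}
```
Membership testing repeats the same derivation and checks that every computed bit is set.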
Bloom filter streams are interlaced with the row group indexes. This
placement makes it convenient to read the bloom filter stream and row
index stream together in a single read operation.

[binary image omitted: orc-format-1.1.0/specification/img/TreeWriters.png]