June 27, 2022

Avro File Format in Hadoop

Apache Avro is a language-independent data serialization system native to Hadoop. The Apache Avro project was created by Doug Cutting, the creator of Hadoop, to increase data interoperability in Hadoop. Avro implementations are available for C, C++, C#, Java, PHP, Python, and Ruby, which makes it easier to interchange data among various platforms.

What is data serialization

Data serialization is a mechanism to convert data (class objects, data structures) into a stream of bytes (binary form) so that it can be sent across a network or stored persistently in a file or database.
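
As a minimal illustration of the idea, the sketch below turns a small record into bytes and back using Python's struct module. The field layout (a 32-bit id followed by a length-prefixed UTF-8 name) is a made-up format chosen for this example; it is not Avro's binary encoding.

```python
import struct
from typing import Tuple

# Hypothetical layout for illustration only (NOT Avro's encoding):
# big-endian signed 32-bit id, unsigned 16-bit name length, then the name bytes.
HEADER_FORMAT = ">iH"


def serialize(emp_id: int, name: str) -> bytes:
    """Convert an in-memory record into a stream of bytes."""
    encoded_name = name.encode("utf-8")
    return struct.pack(HEADER_FORMAT, emp_id, len(encoded_name)) + encoded_name


def deserialize(data: bytes) -> Tuple[int, str]:
    """Rebuild the record from the bytes produced by serialize()."""
    emp_id, name_length = struct.unpack_from(HEADER_FORMAT, data, 0)
    start = struct.calcsize(HEADER_FORMAT)
    return emp_id, data[start:start + name_length].decode("utf-8")


payload = serialize(7, "Ada")    # bytes that could be written to a file or socket
restored = deserialize(payload)  # the original values, read back from the bytes
```

Both sides have to agree on the layout in advance here, which is exactly the problem a schema-based system like Avro addresses.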

Avro in Hadoop

Main features of Avro in Hadoop are-

  • Avro is language independent
  • It is schema based

A language-independent schema is used to define the structure of Avro data. Avro schemas are defined using JSON, which helps with data interoperability.

Some of the benefits of using schema in Avro are-

  1. Language interoperability, since the schema is defined using JSON.
  2. You can save an Avro schema in a separate file with the .avsc extension.
  3. It allows for schema evolution. You can add or remove a field (column).
  4. Using Avro you can perform serialization and deserialization without code generation. Since data in an Avro file is always stored with its corresponding schema, you can always read a serialized item regardless of whether you know the schema ahead of time.
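
To make point 3 concrete, here is a rough sketch of how a reader with a newer schema might handle a record written with an older one: fields the reader no longer declares are dropped, and fields the writer never had are filled from the reader's default. It works on already-decoded Python dicts; the function name resolve_record, the dept field and its default value are hypothetical, and real Avro libraries apply these rules while decoding the binary data.

```python
def resolve_record(datum, reader_schema):
    """Project a decoded record onto the reader's schema (simplified sketch)."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in datum:
            resolved[name] = datum[name]        # field known to both schemas
        elif "default" in field:
            resolved[name] = field["default"]   # field added in the reader schema
        else:
            raise ValueError("no value or default for field %r" % name)
    return resolved                              # writer-only fields are dropped


# Record written with an older schema that had an "age" field.
OLD_DATUM = {"empId": 1, "empName": "Ada", "age": 36}

# Hypothetical newer schema: "age" removed, "dept" added with a default.
READER_SCHEMA = {
    "type": "record",
    "name": "EmployeeRecord",
    "fields": [
        {"name": "empId", "type": "int"},
        {"name": "empName", "type": "string"},
        {"name": "dept", "type": "string", "default": "unknown"},
    ],
}

resolved = resolve_record(OLD_DATUM, READER_SCHEMA)
```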

Avro file format

Avro includes a simple object container file format. A file has a schema, and all objects stored in the file must be written according to that schema, using binary encoding. Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing.

An Avro file consists of:

  • A file header
  • One or more file data blocks.

[Header] [Data block] [Data block] ...

A file header consists of:

  • Four bytes: ASCII 'O', 'b', 'j', followed by the byte value 1.
  • File metadata, which includes the schema and the compression codec used to compress blocks.
  • The 16-byte, randomly-generated sync marker for this file.

A file data block consists of:

  • A long indicating the count of objects in this block.
  • A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied.
  • The serialized objects. If a codec is specified, this is compressed by that codec.
  • The file's 16-byte sync marker.
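
The header and data block layouts above can be sketched in Python as follows. This is a simplified illustration, not a replacement for an Avro library. It assumes details from the Avro specification that this post does not spell out: the metadata keys avro.schema and avro.codec, the "null" codec meaning no compression, and the zig-zag variable-length encoding of long values. The helper names are made up for this example.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j', followed by the byte 1


def encode_long(n):
    """Encode a long: zig-zag, then 7 bits per byte, low-order bits first."""
    z = (n << 1) ^ (n >> 63)            # zig-zag: small negatives become small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)   # high bit set means more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(stream):
    """Read one zig-zag variable-length long from a binary stream."""
    z, shift = 0, 0
    while True:
        byte = stream.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)          # undo the zig-zag mapping


def encode_bytes(data):
    """Length-prefixed byte string, used for metadata keys and values."""
    return encode_long(len(data)) + data


def write_header(schema, codec, sync):
    """Magic bytes, metadata map (schema and codec), then the sync marker."""
    meta = {"avro.schema": json.dumps(schema).encode("utf-8"),
            "avro.codec": codec.encode("utf-8")}
    out = bytearray(MAGIC)
    out += encode_long(len(meta))       # one map block holding all entries
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0)               # a zero count ends the map
    return bytes(out) + sync


def write_block(serialized_objects, sync):
    """Object count, byte size, the serialized objects, then the sync marker."""
    body = b"".join(serialized_objects)  # no codec applied ("null" codec)
    return encode_long(len(serialized_objects)) + encode_long(len(body)) + body + sync


def read_block(stream, sync):
    """Read one data block and check that it ends with the file's sync marker."""
    count = decode_long(stream)
    size = decode_long(stream)
    body = stream.read(size)
    if stream.read(16) != sync:
        raise ValueError("sync marker mismatch")
    return count, body


sync_marker = os.urandom(16)                      # 16-byte random sync marker
header = write_header("long", "null", sync_marker)
block = write_block([encode_long(1), encode_long(-2)], sync_marker)
count, body = read_block(io.BytesIO(block), sync_marker)
```

Because every block ends with the same marker, a reader that starts in the middle of a file can scan forward to the next marker and begin at a block boundary, which is what makes the files splittable.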

Schema Declaration in Avro

A Schema is represented in JSON by one of:

  • A JSON string, naming a defined type.
  • A JSON object, of the form: {"type": "typeName" ...attributes...} where typeName is either a primitive or derived type name, as defined below. Attributes not defined in the Avro specification are permitted as metadata, but must not affect the format of serialized data.
  • A JSON array, representing a union of embedded types.

Primitive Types in Avro

The set of primitive type names is:

  • null: no value
  • boolean: a binary value
  • int: 32-bit signed integer
  • long: 64-bit signed integer
  • float: single precision (32-bit) IEEE 754 floating-point number
  • double: double precision (64-bit) IEEE 754 floating-point number
  • bytes: sequence of 8-bit unsigned bytes
  • string: unicode character sequence

Primitive types have no specified attributes.

Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to:

{"type": "string"}
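
A small sketch of that equivalence: the hypothetical helper below expands a bare primitive name into its object form, so both spellings compare equal after parsing.

```python
import json

PRIMITIVE_TYPES = {"null", "boolean", "int", "long",
                   "float", "double", "bytes", "string"}


def expand_primitive(schema):
    """Turn a bare primitive name such as "string" into {"type": "string"}."""
    if isinstance(schema, str) and schema in PRIMITIVE_TYPES:
        return {"type": schema}
    return schema  # anything else is returned unchanged


short_form = json.loads('"string"')
long_form = json.loads('{"type": "string"}')
same = expand_primitive(short_form) == expand_primitive(long_form)
```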

Complex Types in Avro

Avro supports six kinds of complex types: records, enums, arrays, maps, unions and fixed.

Records- Records use the type name "record" and support the following attributes:

  • name: a JSON string providing the name of the record (required).
  • namespace: a JSON string that qualifies the name (optional).
  • doc: a JSON string providing documentation to the user of this schema (optional).
  • aliases: a JSON array of strings, providing alternate names for this record (optional).
  • fields: a JSON array, listing fields (required). Each field is a JSON object with a name and a type (both required) and, optionally, the doc, default, order and aliases attributes.

For example, a schema for an employee record:

{
  "type": "record",
  "name": "EmployeeRecord",
  "doc": "Employee Record",
  "fields": [
    {"name": "empId", "type": "int"},
    {"name": "empName", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
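
To show how such a schema constrains data, here is a minimal sketch that checks a Python dict against the employee schema. The validator is hypothetical and deliberately tiny: it only understands the two primitive types used in this example (int and string) and is not how an Avro library validates data.

```python
import json

EMPLOYEE_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "EmployeeRecord",
  "doc": "Employee Record",
  "fields": [
    {"name": "empId", "type": "int"},
    {"name": "empName", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
""")

INT_MIN, INT_MAX = -2**31, 2**31 - 1  # Avro int is a 32-bit signed integer


def matches_primitive(type_name, value):
    """Check a value against the primitive types used in this example."""
    if type_name == "int":
        # bool is a subclass of int in Python, so exclude it explicitly
        return (isinstance(value, int) and not isinstance(value, bool)
                and INT_MIN <= value <= INT_MAX)
    if type_name == "string":
        return isinstance(value, str)
    raise NotImplementedError("type not covered by this sketch: %s" % type_name)


def is_valid_record(schema, datum):
    """True if datum has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(datum) != {field["name"] for field in fields}:
        return False
    return all(matches_primitive(field["type"], datum[field["name"]])
               for field in fields)


ok = is_valid_record(EMPLOYEE_SCHEMA, {"empId": 1, "empName": "Ada", "age": 36})
bad = is_valid_record(EMPLOYEE_SCHEMA, {"empId": 1, "empName": "Ada", "age": "36"})
```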

Enums- Enums use the type name "enum" and support the following attributes:

  • name: a JSON string providing the name of the enum (required).
  • namespace: a JSON string that qualifies the name (optional).
  • aliases: a JSON array of strings, providing alternate names for this enum (optional).
  • doc: a JSON string providing documentation to the user of this schema (optional).
  • symbols: a JSON array, listing symbols, as JSON strings (required). All symbols in an enum must be unique; duplicates are prohibited. Every symbol must match the regular expression [A-Za-z_][A-Za-z0-9_]* (the same requirement as for names).

For example, declaring the days of the week using an enum:

{ "type": "enum",
  "name": "WeekDays",
  "symbols" : ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"]
}
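
The symbol rules above are easy to check mechanically. The sketch below validates the WeekDays schema and looks up the zero-based position of a symbol, which, per the Avro specification, is what gets written when an enum value is serialized. The helper names are hypothetical.

```python
import re

SYMBOL_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

WEEK_DAYS = {
    "type": "enum",
    "name": "WeekDays",
    "symbols": ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
                "FRIDAY", "SATURDAY", "SUNDAY"],
}


def check_enum_symbols(schema):
    """Raise ValueError if symbols are duplicated or badly formed."""
    symbols = schema["symbols"]
    if len(set(symbols)) != len(symbols):
        raise ValueError("duplicate symbols are prohibited")
    for symbol in symbols:
        if not SYMBOL_PATTERN.fullmatch(symbol):
            raise ValueError("invalid symbol: %r" % symbol)


def symbol_position(schema, symbol):
    """Zero-based position of a symbol in the schema's symbols array."""
    return schema["symbols"].index(symbol)


check_enum_symbols(WEEK_DAYS)                       # passes silently
position = symbol_position(WEEK_DAYS, "WEDNESDAY")  # third symbol in the list
```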

Arrays- Arrays use the type name "array" and support a single attribute:

  • items: the schema of the array's items.

For example, declaring an array of strings:

{"type": "array", "items": "string"}

Maps- Maps use the type name "map" and support one attribute:

  • values: the schema of the map's values.

Map keys are assumed to be strings.

For example, a map from string to long is declared with:

{"type": "map", "values": "long"}

Unions- A union is represented using a JSON array, and each element in the array is a schema. For example, ["null", "string"] declares a schema which may be either a null or a string. Data conforming to the union schema must match one of the schemas in the union.
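
As a rough sketch of "must match one of the schemas", the hypothetical function below finds which branch of a union a Python value belongs to. It only understands the null and string branches used in the example above.

```python
def union_branch(union_schema, value):
    """Return the index of the first union branch that the value matches."""
    for index, branch in enumerate(union_schema):
        if branch == "null" and value is None:
            return index
        if branch == "string" and isinstance(value, str):
            return index
    raise ValueError("value %r matches no branch of %r" % (value, union_schema))


OPTIONAL_STRING = ["null", "string"]

null_branch = union_branch(OPTIONAL_STRING, None)     # matches "null"
string_branch = union_branch(OPTIONAL_STRING, "Ada")  # matches "string"
```

A union such as ["null", "string"] is the usual way to declare an optional field.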

Fixed- Fixed uses the type name "fixed" and supports the following attributes:

  • name: a string naming this fixed (required).
  • namespace: a string that qualifies the name (optional).
  • aliases: a JSON array of strings, providing alternate names for this fixed (optional).
  • size: an integer, specifying the number of bytes per value (required).

For example, declaring a 16-byte quantity:

{"type": "fixed", "size": 16, "name": "md5"}
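
For illustration, an MD5 digest is exactly 16 bytes, so it fits this schema. The check below is a hypothetical helper that only compares the value's length with the declared size.

```python
import hashlib

MD5_SCHEMA = {"type": "fixed", "size": 16, "name": "md5"}


def is_valid_fixed(schema, value):
    """True if value is a byte string of exactly the declared size."""
    return isinstance(value, (bytes, bytearray)) and len(value) == schema["size"]


digest = hashlib.md5(b"hello avro").digest()   # always 16 bytes long
fits = is_valid_fixed(MD5_SCHEMA, digest)
too_short = is_valid_fixed(MD5_SCHEMA, b"short")
```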

That's all for the topic Avro File Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.

