Formats for Encoding Data

Programs usually work with data in two different representations:

In memory, data is kept in objects and other data structures, which are optimized for efficient access and manipulation by the CPU.
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document).

Thus, a translation between the two representations is needed: encoding (also known as serialization or marshalling) goes from the in-memory form to bytes, and decoding (parsing, deserialization, unmarshalling) goes back again.
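As a minimal sketch of that round trip, the following uses Python's standard json module. The record and its field names are made up for illustration.

```python
import json

# In-memory representation: a dict holding nested structures (hypothetical record).
record = {
    "user_name": "Martin",
    "favorite_number": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Encoding (serialization): in-memory object -> self-contained sequence of bytes.
encoded = json.dumps(record).encode("utf-8")

# Decoding (parsing, deserialization): sequence of bytes -> in-memory object.
decoded = json.loads(encoded.decode("utf-8"))
```

The bytes in encoded can be written to a file or sent over the network, and decoded is an equal copy of the original dict rebuilt on the other side.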

Language-Specific Formats

Many programming languages have built-in support for encoding in-memory objects into byte sequences (for example, Java’s java.io.Serializable and Python’s pickle). These encodings have serious problems: they are tied to one language (as the name suggests), and they are often inefficient. Thus, it’s generally a bad idea to use them.
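To illustrate the language-specific part, here is a small sketch using Python's built-in pickle module. The record is hypothetical, and the JSON parser only stands in for "a reader that does not speak pickle".

```python
import json
import pickle

record = {"user_name": "Martin", "favorite_number": 1337}

# Python can encode and decode its own format without any extra work.
pickled = pickle.dumps(record)
restored = pickle.loads(pickled)

# A reader that expects a language-neutral format cannot make sense of the bytes.
try:
    json.loads(pickled)
    readable_as_json = True
except ValueError:  # also covers UnicodeDecodeError and json.JSONDecodeError
    readable_as_json = False
```

Only pickle data from a trusted source should ever be decoded, because unpickling can execute arbitrary code.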

JSON, XML and Binary Variants

Despite their flaws, JSON and XML are very commonly used because they are good enough for many purposes.

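There are also binary encodings that save space or improve efficiency. Some are binary variants of JSON or XML, such as MessagePack and BSON; others, such as Thrift and Protocol Buffers, are schema-based binary formats. The main thing to notice about Thrift and Protocol Buffers is that the encoded data uses numeric field tags instead of the actual field names.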
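The effect of field tags can be sketched with a toy encoding. This is not the real Thrift or Protocol Buffers wire format: the schema, the one-byte tag, and the one-byte length prefix (which limits values to 255 bytes) are all simplifying assumptions.

```python
import json

# Hypothetical schema: field tag -> field name. Only the tags appear in the bytes.
SCHEMA = {1: "user_name", 2: "favorite_color"}
NAME_TO_TAG = {name: tag for tag, name in SCHEMA.items()}


def encode(record: dict) -> bytes:
    """Encode each string field as: 1-byte tag, 1-byte length, UTF-8 value."""
    out = bytearray()
    for name, value in record.items():
        data = value.encode("utf-8")
        out += bytes([NAME_TO_TAG[name], len(data)]) + data
    return bytes(out)


def decode(buf: bytes) -> dict:
    """Walk the buffer and map each tag back to its field name via the schema."""
    record, i = {}, 0
    while i < len(buf):
        tag, length = buf[i], buf[i + 1]
        record[SCHEMA[tag]] = buf[i + 2 : i + 2 + length].decode("utf-8")
        i += 2 + length
    return record


record = {"user_name": "Martin", "favorite_color": "blue"}
tagged = encode(record)
as_json = json.dumps(record).encode("utf-8")
```

Here tagged is 14 bytes long, while the JSON encoding of the same record is 49 bytes, mostly because JSON repeats the field names in every record. Renaming a field in SCHEMA does not affect already-encoded data, since the bytes refer to fields only by tag.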

Avro is another binary encoding. Instead of storing field names or types in the encoded data, it relies on a schema at decoding time, which saves space. The binary data can only be decoded correctly if the reader knows the exact schema the data was written with (the “writer’s schema”). The schema that the reading code expects (the “reader’s schema”) does not have to be identical, only compatible: Avro resolves the differences between the two.
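The writer/reader schema idea can be sketched with a toy format. This is not Avro's actual binary encoding or its library API: the schemas, the length-prefixed strings, and the resolution rule (match fields by name, drop fields the reader does not want, fill missing ones with defaults) are simplified assumptions.

```python
# Writer's schema: ordered field names. Values are encoded in this order,
# with no names or tags in the data itself.
WRITER_SCHEMA = ["user_name", "favorite_number"]

# Reader's schema: field name -> default used when the writer did not supply it.
READER_SCHEMA = {"user_name": None, "interests": []}


def encode(record: dict, writer_schema: list) -> bytes:
    """Concatenate length-prefixed values in the writer's field order."""
    out = bytearray()
    for name in writer_schema:
        data = str(record[name]).encode("utf-8")
        out += bytes([len(data)]) + data
    return bytes(out)


def decode(buf: bytes, writer_schema: list, reader_schema: dict) -> dict:
    """Decode with the writer's schema, then resolve into the reader's schema."""
    written, i = {}, 0
    for name in writer_schema:  # the writer's schema says which value is which
        length = buf[i]
        written[name] = buf[i + 1 : i + 1 + length].decode("utf-8")
        i += 1 + length
    # Fields only the writer knows are ignored; fields only the reader knows
    # are filled in from the reader's defaults.
    return {name: written.get(name, default) for name, default in reader_schema.items()}


data = encode({"user_name": "Martin", "favorite_number": 1337}, WRITER_SCHEMA)
resolved = decode(data, WRITER_SCHEMA, READER_SCHEMA)
```

Decoding the same bytes with a different field order in place of WRITER_SCHEMA would assign values to the wrong fields, which is why the reader needs the exact schema the data was written with.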

Modes of Dataflow

Dataflow Through Databases

In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it.

When newer code adds a field to a record and older code later reads, updates, and writes back that record, the older code should preserve the field it does not know about. Say one version of the application adds a field to the schema, and an older version then changes the content of an existing field in the same record. The older version cannot interpret the new field, but when it writes its change back it must keep that field intact; otherwise, decoding the record into a model object and re-encoding it can silently drop the new data.
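One way for older code to avoid dropping fields is to carry them along in the model object. The sketch below assumes JSON-encoded records; the User class, its unknown attribute, and the photo_url field are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class User:
    """Model object as the OLD code knows it: only `name` is understood."""
    name: str
    unknown: dict = field(default_factory=dict)  # fields this version cannot interpret

    @classmethod
    def from_json(cls, raw: str) -> "User":
        data = json.loads(raw)
        name = data.pop("name")
        return cls(name=name, unknown=data)  # keep everything else untouched

    def to_json(self) -> str:
        # Write the known field back together with the preserved unknown fields.
        return json.dumps({"name": self.name, **self.unknown})


# Record written by NEWER code, which added a photo_url field.
stored = json.dumps({"name": "Martin", "photo_url": "https://example.com/m.png"})

user = User.from_json(stored)  # old code reads the record
user.name = "Martin K."        # ...changes a field it knows...
rewritten = json.loads(user.to_json())  # ...and writes the record back
```

After the update, rewritten still contains photo_url even though the old code never looked at it. A model class without the unknown dict would have lost that field on the write.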

In databases, there are very few cases where you need to completely rewrite existing data. Most of the time, simple schema changes are made instead, such as adding a new column with a null default value.
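Such a schema change can be tried out with Python's built-in sqlite3 module. The table and column names below are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Martin')")  # row written under the old schema

# Simple schema change: add a column; no default is given, so it defaults to NULL.
conn.execute("ALTER TABLE users ADD COLUMN photo_url TEXT")

# The old row is read under the new schema, with NULL filled in for the new column.
row = conn.execute("SELECT id, name, photo_url FROM users").fetchone()
conn.close()
```

The existing row comes back as (1, 'Martin', None): the row was not rewritten by hand, and the missing column is simply reported as null when it is read.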

Dataflow Through Services: REST and RPC

When you have processes that need to communicate over a network, there are a few ways of arranging the communication. The most common arrangement is to have two roles: clients and servers. The server exposes an API over the network, and clients connect to the server to make requests to that API. The API exposed by the server is known as a service.
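The client and server roles can be sketched in a single script using only Python's standard library. The handler, the /users/1 path, and the JSON reply shape are assumptions for illustration, not a recommended service design.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    """Server side: the API is a single GET endpoint that replies with JSON."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Start the server on a free local port, in a background thread.
server = HTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: connect to the server and make a request to its API.
client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
client.request("GET", "/users/1")
response = client.getresponse()
reply = json.loads(response.read().decode("utf-8"))
client.close()

server.shutdown()
server.server_close()
```

The server encodes its reply as JSON bytes and the client decodes them, so the same encoding and decoding steps appear on both sides of the network connection.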