Data lifecycle management with data tiers

Elasticsearch 7.10 made configuring the lifecycle of your data less complicated. In this blog post I’ll walk through some of the changes, how to use them, and some best practices along the way. Data lifecycle can encompass a lot of stages, so we’ll touch on:

Dividing a cluster into tiers (hot, warm, cold), ensuring new data makes its way to the right place.
Making use of those tiers within index lifecycle management (ILM) for migrating data between tiers.
Increasing the data density within the cold tier using searchable snapshots.
Putting it all together with a real example of how data flows through tiers.

Out with attributes, in with node roles

One common use case when dealing with time series data is separating the cluster topology into separate tiers. These tiers are given names like hot, where new data is ingested and queried, warm, where medium-age data is held and queried, and cold, where data is usually held for long periods of time, being queried less frequently. It's common for users to configure these tiers with different hardware so that the most powerful, expensive hardware is used for the hot tier, while less expensive and more storage-dense hardware is used for the warm or cold tiers. Before 7.10, one of the most common ways to configure different tiers of nodes was to use node attributes, so a user would configure something like:

On hot nodes

node.attr.node_type: hot

On warm nodes

node.attr.node_type: warm

On cold nodes

node.attr.node_type: cold

These could then be used with the cluster and index level allocation parameters. So for example, an index would be created with:

PUT /myindex
{
  "settings": {
    "index.routing.allocation.include.node_type": "hot"
  }
}

This created an index that was allocated to the hot nodes. Let's take a look at the new way to do things.

Formalizing the process

In 7.10, configurations of this type have been formalized, and we now have specific roles that correspond to the hot, warm, and cold tiers (as well as another tier, which we'll get to). This means instead of adding the node.attr.node_type attribute, we can add one of the data_hot, data_warm, or data_cold node roles to the node.roles setting:

On hot nodes

node.roles: ["data_hot"]

On warm nodes

node.roles: ["data_warm"]

On cold nodes

node.roles: ["data_cold"]

Remember to add any other roles you might need to the list! For example, for a smaller cluster, a data node may look like:

node.roles: ["master", "ingest", "ml", "data_hot", "data_content"]

(The data_content role will be explained further down in this post.)

Note: You may be wondering what happened to the existing data role. Well, the data role acts as though all of the tiers have been specified, so it's part of the hot, warm, and cold tiers all at once. This means that nodes that have been upgraded to 7.10 or later but don't specify custom node roles with the node.roles setting are part of every data tier.

Moving data around tiers

Once these roles are set, data can be shifted around the cluster using the same cluster and index-level allocation filters as before:

cluster.routing.allocation.require._tier
cluster.routing.allocation.include._tier
cluster.routing.allocation.exclude._tier
index.routing.allocation.require._tier
index.routing.allocation.include._tier
index.routing.allocation.exclude._tier
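As a rough illustration of how these three filter types combine, here is a minimal Python sketch. It assumes the usual allocation-filter semantics (include matches if the node has at least one listed tier, require needs all of them, exclude rejects a node that has any of them). The function name and the set-of-tiers representation of a node are hypothetical, chosen only for this example; this is not Elasticsearch code.

```python
def node_matches_tier_filters(node_tiers, include=None, require=None, exclude=None):
    """Return True if a node holding `node_tiers` passes the given _tier filters.

    Each filter is a comma-separated string such as "data_hot,data_warm",
    or None when the filter is not set. Hypothetical helper for illustration.
    """
    def parse(value):
        # Split "a, b" into {"a", "b"}; an unset filter becomes an empty set.
        return {v.strip() for v in value.split(",") if v.strip()} if value else set()

    tiers = set(node_tiers)
    inc, req, exc = parse(include), parse(require), parse(exclude)

    if inc and not (tiers & inc):   # include: node needs at least one listed tier
        return False
    if req and not req <= tiers:    # require: node needs every listed tier
        return False
    if exc and (tiers & exc):       # exclude: node must have none of the listed tiers
        return False
    return True


hot_only = {"data_hot"}
print(node_matches_tier_filters(hot_only, include="data_hot,data_warm"))
print(node_matches_tier_filters(hot_only, require="data_hot,data_warm"))
print(node_matches_tier_filters(hot_only, exclude="data_cold"))
```

The three calls show a hot-only node passing an include filter that lists hot, failing a require filter that also demands warm, and passing an exclude filter that only rules out cold.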

There's also a new parameter that behaves a little differently from the regular include, exclude, and require filtering. This is the _tier_preference index-level setting: index.routing.allocation.include._tier_preference. To see how it works, let's look at an example:

PUT /myindex
{
  "settings": {
    "index.routing.allocation.include._tier_preference": "data_cold,data_warm,data_hot"
  }
}

This snippet creates an index that prefers to be on the cold tier, then the warm tier, and finally, the hot tier. If the cluster contains no cold nodes, then it must be allocated to a warm node. If there are no warm nodes in the cluster, then it must be allocated to a hot node. This configuration allows policies or templates to set a preference without having to worry about whether each tier is part of the cluster; they can simply list the preferred tiers in order. To clarify exactly how this new setting behaves, check out this flow diagram:

[Flow diagram: how _tier_preference selects a tier]
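The fallback order described above can also be sketched in a few lines of Python. This is only a model of the behavior as the post describes it, not Elasticsearch's implementation. The function names and the list-of-role-lists representation of a cluster are hypothetical. Returning None when no preferred tier exists stands in for the index staying unallocated, which is an assumption on my part rather than something stated here.

```python
ALL_TIERS = ("data_content", "data_hot", "data_warm", "data_cold")


def tiers_for_roles(roles):
    """Map a node's roles to the data tiers it belongs to.

    The generic "data" role counts as every tier; non-data roles
    such as "master" or "ingest" contribute no tier.
    """
    tiers = set()
    for role in roles:
        if role == "data":
            tiers.update(ALL_TIERS)
        elif role in ALL_TIERS:
            tiers.add(role)
    return tiers


def resolve_tier_preference(preference, cluster_nodes):
    """Pick the first tier in `preference` that has at least one node.

    `preference` is a comma-separated string, most preferred first.
    `cluster_nodes` is a list of role lists, one per node.
    Returns None if no preferred tier exists in the cluster (assumed
    here to mean the index cannot be allocated).
    """
    available = set()
    for roles in cluster_nodes:
        available |= tiers_for_roles(roles)

    for tier in (t.strip() for t in preference.split(",")):
        if tier in available:
            return tier
    return None


# A cluster with hot and warm nodes but no cold nodes.
cluster = [["master", "data_hot"], ["data_warm"]]
print(resolve_tier_preference("data_cold,data_warm,data_hot", cluster))
```

With no cold nodes in this example cluster, the preference falls through to the warm tier, matching the behavior described for the PUT /myindex example.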

There's one last difference between node attributes and formalized data tier roles: where an index is allocated when it is initially created. Here's a simple rule that determines where an index is initially placed:

Any index that is part of a data stream automatically has an "index.routing.allocation.include._tier_preference: data_hot" setting added upon creation.

This means that all data stream backing indices will be allocated by default on hot (data_hot) nodes. And for indices that aren't part of data streams:

Any index that is not part of a data stream automatically has an "index.routing.allocation.include._tier_preference: data_content" setting added upon creation.

Either of these settings can be overridden. Simply set the _tier_preference index setting to null, or set any other index-level allocation filtering setting during creation. But wait: What is this new data_content role? Let's consider a kind of data that doesn't fit into a lifecycle model: data that does not fit the time series model.

Data that doesn't fit a time series

Some data is indexed and then only queried but doesn't have an age or timestamp in the conceptual sense. This includes enterprise search data, or ecommerce data, or a database of user information. For that kind of data, there's a specific role: the data_content role. This role is configured just like the other roles using the node.roles setting. It's worth mentioning that none of these roles are mutually exclusive, so if you wanted non-time series data and hot data to be on the same nodes, you would configure a node like so:

This node can hold either "hot" tier data, or "content" tier data

node.roles: ["data_hot", "data_content"]

And, as you may have guessed already, the existing default data role is also part of the content tier. This means that the data role is the same as adding the data_hot, data_warm, data_cold, and data_content roles to the node.roles setting. As was mentioned in the previous section, the way that Elasticsearch determines what is time series data versus non-time series data is whether that index belongs to a data stream.

Best practice: Make sure that your cluster always has at least one data_hot node and one data_content node, even if it's the same node. Without both of these node roles, the different types of indices can't be allocated.

Now that we know about all the different tiers and have configured a few, let's look at how ILM can make use of these tiers.

ILM in a data tier world

Configuring tiers by themselves doesn't change much compared to the older attribute-based allocation. However, now that we have a built-in and consistent way to identify tiers within Elasticsearch, ILM can make use of those settings to migrate data automatically between tiers. Before 7.10, a tiered configuration might have been accomplished with a policy like the following:

{
  "phases" : {
    "hot" : {
      "actions" : {
        "rollover" : {
          "max_age" : "30d",
          "max_size" : "50gb"
        }
      }
    },
    "warm" : {
      "min_age" : "45d",
      "actions" : {
        "allocate" : {
          "include" : { "node_type" : "warm" }
        },
        "forcemerge" : {
          "max_num_segments" : 1
        }
      }
    },
    "cold" : {
      "min_age" : "60d",
      "actions" : {
        "allocate" : {
          "include" : { "node_type" : "cold" }
        }
      }
    },
    "delete" : {
      "min_age" : "90d",
      "actions" : {
        "delete" : { }
      }
    }
  }
}

This policy relied on an index template allocating data to the hot tier, then updating the node_type with "allocate" actions within each ILM phase to move the data around. But with 7.10, there's a better way!
Since Elasticsearch now has the ability to prefer a particular tier, ILM can make use of this preference by setting index.routing.allocation.include._tier_preference automatically when entering a new phase. Here's an example using the same policy as above, but removing the allocate steps:

{
  "phases" : {
    "hot" : {
      "actions" : {
        "rollover" : {
          "max_age" : "30d",
          "max_size" : "50gb"
        }
      }
    },
    "warm" : {
      "min_age" : "45d",
      "actions" : {
        "forcemerge" : {
          "max_num_segments" : 1
        }
      }
    },
    "cold" : {
      "min_age" : "60d",
      "actions" : { }
    },
    "delete" : {
      "min_age" : "90d",
      "actions" : {
        "delete" : { }
      }
    }
  }
}

When a new data stream is created that uses this policy, initially the preference will be set to index.routing.allocation.include._tier_preference: data_hot. When the index enters the warm phase, the setting will be updated to index.routing.allocation.include._tier_preference: data_warm,data_hot. Upon entering the cold phase, it'll be updated to index.routing.allocation.include._tier_preference: data_cold,data_warm,data_hot, and then the index will finally be deleted in the delete phase.

The migration does not apply to the hot phase, because placement in the hot tier is handled at index creation. It lets the data location match the current ILM phase whenever the cluster topology contains nodes of that tier. The behavior is applied by default, with two caveats:

If a "migrate": {"enabled": false} action is added to the phase's list of actions, the automatic migration doesn’t take effect.
If the ILM phase actions contain an allocate step that sets an include, require, or exclude filter, the automatic migration doesn’t take effect.
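To make the two caveats concrete, here is a small Python sketch of the decision as this post describes it. It is a model for illustration only, not ILM's actual code. The function name and the PHASE_PREFERENCE table are hypothetical, with the preference strings taken from the warm and cold examples in this post.

```python
# Tier preference set when entering each phase, per the examples in this post.
# The hot phase is absent because its placement is handled at index creation.
PHASE_PREFERENCE = {
    "warm": "data_warm,data_hot",
    "cold": "data_cold,data_warm,data_hot",
}


def migration_preference(phase_name, actions):
    """Return the _tier_preference value set on entering a phase,
    or None when automatic migration does not take effect.

    `actions` is the phase's "actions" object from the ILM policy, as a dict.
    """
    if phase_name not in PHASE_PREFERENCE:
        return None  # hot and delete phases are not migrated in this model

    # Caveat 1: an explicit "migrate": {"enabled": false} action.
    if actions.get("migrate", {}).get("enabled", True) is False:
        return None

    # Caveat 2: an allocate action that sets include, require, or exclude.
    allocate = actions.get("allocate", {})
    if any(key in allocate for key in ("include", "require", "exclude")):
        return None

    return PHASE_PREFERENCE[phase_name]


print(migration_preference("warm", {"forcemerge": {"max_num_segments": 1}}))
print(migration_preference("warm", {"migrate": {"enabled": False}}))
print(migration_preference("cold", {"allocate": {"include": {"node_type": "cold"}}}))
```

The first call returns the warm-phase preference because neither caveat applies. The other two return None, one for each caveat.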

Neither of these phases will automatically migrate data:

"warm": {
  "actions": {
    "migrate": {
      "enabled": false
    }
  }
}

or

"warm": {
  "actions": {
    "allocate": {
      "include": {
        "node_type": "warm"
      }
    }
  }
}

Best pra

Created 4y ago | 2 Feb 2021, 21:20:49

