Getting started with runtime fields, Elastic’s implementation of schema on read

Historically, Elasticsearch has relied on a schema on write approach to make searching data fast. We are now adding schema on read capabilities to Elasticsearch so that users have the flexibility to alter a document's schema after ingest and also generate fields that exist only as part of the search query. Together, schema on read and schema on write give users the choice to balance performance and flexibility based on their needs.

Our solution for schema on read is runtime fields, which are evaluated only at query time. They are defined in the index mapping or in the query, and once defined they are immediately available for search requests, aggregations, filtering, and sorting. Because runtime fields aren’t indexed, adding a runtime field doesn’t increase the index size. They can, in fact, reduce storage costs and increase the speed of ingestion.

However, there are tradeoffs. Queries against runtime fields can be expensive, so data that you commonly search or filter on should still be mapped to indexed fields. Runtime fields can also decrease search speed, even though your index size is smaller. We recommend using runtime fields in tandem with indexed fields to find the right balance between ingest speed, index size, flexibility, and search performance for your use cases.

It’s easy to add runtime fields

The easiest way to define a runtime field is in the query.
For example, if we have the following index:

    PUT my_index
    {
      "mappings": {
        "properties": {
          "address": { "type": "ip" },
          "port": { "type": "long" }
        }
      }
    }

And load a few documents into it:

    POST my_index/_bulk
    {"index":{"_id":"1"}}
    {"address":"1.2.3.4","port":"80"}
    {"index":{"_id":"2"}}
    {"address":"1.2.3.4","port":"8080"}
    {"index":{"_id":"3"}}
    {"address":"2.4.8.16","port":"80"}

We can create the concatenation of two fields with a static string as follows:

    GET my_index/_search
    {
      "runtime_mappings": {
        "socket": {
          "type": "keyword",
          "script": {
            "source": "emit(doc['address'].value + ':' + doc['port'].value)"
          }
        }
      },
      "fields": [ "socket" ],
      "query": {
        "match": { "socket": "1.2.3.4:8080" }
      }
    }

Yielding the following response:

    …
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "address" : "1.2.3.4",
          "port" : "8080"
        },
        "fields" : {
          "socket" : [ "1.2.3.4:8080" ]
        }
      }
    ]

We defined the field socket in the runtime_mappings section. We used a short Painless script that defines how the value of socket is calculated per document (using + to concatenate the value of the address field, the static string ‘:’, and the value of the port field). We then used the field socket in the query. The field socket is an ephemeral runtime field that exists only for this query and is calculated when the query is run. When defining a Painless script to use with runtime fields, you must include emit to return calculated values.
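
To make the per-document evaluation concrete, here is a minimal Python sketch that mimics what the socket script computes for the three documents loaded above. It is only an illustration and not how Elasticsearch actually evaluates Painless: the names socket_runtime_field and search_socket and the in-memory docs dictionary are assumptions made for this example, and port is held as the number that the long mapping would store.

```python
# Illustrative only: emulates the effect of
#   emit(doc['address'].value + ':' + doc['port'].value)
# on the three sample documents. All names here are hypothetical.

docs = {
    "1": {"address": "1.2.3.4", "port": 80},
    "2": {"address": "1.2.3.4", "port": 8080},
    "3": {"address": "2.4.8.16", "port": 80},
}


def socket_runtime_field(doc):
    """Return the list of values the script would emit for one document."""
    return [f"{doc['address']}:{doc['port']}"]


def search_socket(value):
    """Compute socket on the fly for each document and keep exact matches."""
    hits = []
    for doc_id, source in docs.items():
        fields = {"socket": socket_runtime_field(source)}
        if value in fields["socket"]:
            hits.append({"_id": doc_id, "_source": source, "fields": fields})
    return hits


if __name__ == "__main__":
    for hit in search_socket("1.2.3.4:8080"):
        print(hit["_id"], hit["fields"]["socket"])
```

Nothing is stored in this sketch: the socket value is recomputed on every call, which mirrors why a runtime field adds nothing to the index size but costs time when the query runs.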

If we find that socket is a field that we want to use in multiple queries without having to define it per query, we can simply add it to the mapping by making the call:

    PUT my_index/_mapping
    {
      "runtime": {
        "socket": {
          "type": "keyword",
          "script": {
            "source": "emit(doc['address'].value + ':' + doc['port'].value)"
          }
        }
      }
    }

And then the query does not have to include the definition of the field, for example:

    GET my_index/_search
    {
      "fields": [ "socket" ],
      "query": {
        "match": { "socket": "1.2.3.4:8080" }
      }
    }

The statement "fields": ["socket"] is only required if you want to display the value of the socket field. While the field socket is now available to any query, it does not exist in the index and does not increase the index’s size. The socket field is calculated only when a query requires it, and only for the documents for which it is required.

Consumed like any field

Because runtime fields are exposed through the same API as indexed fields, a query can refer to some indices where the field is a runtime field, and other indices where the field is an indexed field. You have the flexibility to choose which fields to index and which ones to keep as runtime fields. This separation between field generation and field consumption facilitates more organized code that is easier to create and maintain.

You define runtime fields in the index mapping or in the search request. This inherent capability provides flexibility in how you use runtime fields in conjunction with indexed fields.

Override field values at query time

Often, you discover mistakes in your production data only when it's too late. While it is easy to fix the ingest instructions for documents that you will ingest in the future, it’s much more challenging to fix the data that has already been ingested and indexed. Using runtime fields, you can fix errors in your indexed data by overriding values at query time: a runtime field shadows an indexed field with the same name.

Here’s a simple example to make this more concrete. Let’s say we have an index with a raw_message field and an address field:

    PUT my_raw_index
    {
      "mappings": {
        "properties": {
          "raw_message": { "type": "keyword" },
          "address": { "type": "ip" }
        }
      }
    }

And let’s load a document into it:

    POST my_raw_index/_doc/1
    {
      "raw_message": "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] GET /history/apollo/ HTTP/1.0 200 6245",
      "address": "1.2.3.4"
    }

Alas, the document contains a wrong IP address in the address field. The correct IP address exists in the message, but somehow the wrong address was parsed out in the document that was sent to be ingested into Elasticsearch and indexed. For a single document, that’s not a problem, but what if we discover after a month that 10% of our documents contain a wrong address? Fixing it for new documents is not a big deal, but reindexing the documents that were already ingested is frequently operationally complex. With runtime fields, it can be fixed immediately by shadowing the indexed field with a runtime field. Here is how you would do it in a query:

    GET my_raw_index/_search
    {
      "runtime_mappings": {
        "address": {
          "type": "ip",
          "script": "Matcher m = /\\d+\\.\\d+\\.\\d+\\.\\d+/.matcher(doc[\"raw_message\"].value);if (m.find()) emit(m.group());"
        }
      },
      "fields": [ "address" ]
    }

You can also make the change in the mapping so that it is available for all queries. Note that the use of regex is now enabled by default in Painless scripts.

Balance performance and flexibility

With indexed fields, you make all the preparations during ingest and maintain the sophisticated data structures to provide optimal performance. But querying runtime fields is slower than querying indexed fields. So what if your queries are slow after you start using runtime fields?

We recommend using asynchronous search when retrieving a runtime field. The full result set is returned just like in a synchronous search, provided that the query completes within a given time threshold.
However, even if the query doesn't finish in that time, you still get a partial result set and Elasticsearch will continue polling until the complete result set is returned. This mechanism is particularly useful when managing an index lifecycle because newer results typically return first and are also typically more important to users.

To provide optimal performance, we rely on the indexed fields to do the heavy lifting of the query so that the values of runtime fields are only calculated for a subset of the documents.

Changing a field from runtime to indexed

Runtime fields allow users to flexibly change their mapping and parsing while working on data in a live environment. Because a runtime field does not take up space in the index, and because the script that defines it can be changed, users can experiment until they reach the optimal mapping. When a runtime field proves useful for the long term, you can precalculate its value at index time by defining that field in the template as an indexed field and making sure that the ingested documents include it. The field will be indexed from the next index rollover and provide better performance. The queries that use the field do not need to change at all.

This scenario is particularly useful with dynamic mapping. On the one hand, it is very helpful to allow new documents to generate new fields, because that way the data in them can be used immediately (the structure of entries frequently changes, e.g., due to a change in the software that generates the log). On the other hand, dynamic mapping comes with the risk of burdening the index and even creating a mapping explosion, because you never know if some document might surprise you with 2000 new fields. Runtime fields can provide a solution to this scenario. The new fields can be automatically created as runtime fields so as not to burden the index (since they do not exist in the index), and they are not counted in the index.mapping.total_fields.limit.
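
As a sketch of the setup described above, the mapping parameter "dynamic": "runtime" asks Elasticsearch to add newly seen fields to the mapping as runtime fields rather than indexed ones. The Python snippet below only builds and prints such a request body with the standard library; the index name my_dynamic_index and the helper dynamic_runtime_index_body are assumptions for illustration, and sending the request is left to whatever client you use.

```python
import json


def dynamic_runtime_index_body(indexed_properties):
    """Build a create-index body where unknown fields become runtime fields.

    indexed_properties holds the fields you still want indexed explicitly.
    """
    return {
        "mappings": {
            "dynamic": "runtime",  # new fields are mapped as runtime fields
            "properties": indexed_properties,
        }
    }


# Hypothetical example: keep address and port indexed, everything else runtime.
body = dynamic_runtime_index_body(
    {"address": {"type": "ip"}, "port": {"type": "long"}}
)

if __name__ == "__main__":
    print("PUT my_dynamic_index")  # index name is an assumption
    print(json.dumps(body, indent=2))
```
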
These automatically created runtime fields are queryable, albeit with lower performance, so users can use them and, if needed, decide to change them to indexed fields in the next rollover.

We recommend using runtime fields initially to experiment with your data structure. After working with your data, you might decide to index a runtime field for better search performance. You can create a new index and then add the field definition to the index mapping, add the field to _source, and make sure the new field is included in the ingested documents. If you're using data streams, you can update your index template so that when indices are created from that template, Elasticsearch knows to index that field. In a future release, we plan to make the process of changing a runtime field to an indexed field as si

Created 4y | 10 Feb 2021, 19:20:35

