Hiya! With Elastic’s expansion of our Elasticsearch Service Cloud offering and automated onboarding, we’ve expanded the Elastic Stack audience from full ops teams to data engineers, security teams, and consultants. As an Elastic support rep, I’ve enjoyed interacting with more user backgrounds and with even wider use cases.

With a wider audience, I’m seeing more questions about managing resource allocation, in particular the mystical shard-heap ratio and avoiding circuit breakers. I get it! When I started with the Elastic Stack, I had the same questions. It was my first intro to managing Java heap and time series database shards, and scaling my own infrastructure.

When I joined the Elastic team, I loved that on top of documentation, we had blogs and tutorials so I could onboard quickly. But then I struggled my first month to correlate my theoretical knowledge to the errors users would send through my ticket queue. Eventually, I figured out, like other support reps, that a lot of the reported errors were just symptoms of allocation issues and the same seven-ish links would bring users up to speed to successfully manage their resource allocation.

Speaking as a support rep, in the following sections, I’m going to go over the top allocation management theory links we send users, the top symptoms we see, and where we direct users to update their configurations to resolve their resource allocation issues.

Theory
As a Java application, Elasticsearch requires some logical memory (heap) allocation from the system’s physical memory. This should be up to half of the physical RAM, capping at 32GB. Setting higher heap usage is usually in response to expensive queries and larger data storage. Parent circuit breaker defaults to 95%, but we recommend scaling resources once consistently hitting 85%.

I highly recommend these overview articles by our team for more info:
A heap of trouble
Heap: Sizing and swapping
How many shards should I have in my Elasticsearch cluster?
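The sizing rules of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names `recommended_heap_gb` and `should_scale` are hypothetical, and reading "consistently hitting 85%" as "every sample in the window is at or above 85%" is my assumption, not an Elastic definition.

```python
from typing import Sequence


def recommended_heap_gb(physical_ram_gb: float, cap_gb: float = 32.0) -> float:
    """Heap size per the rule of thumb above: half of physical RAM, capped at 32GB."""
    if physical_ram_gb <= 0:
        raise ValueError("physical_ram_gb must be positive")
    return min(physical_ram_gb / 2, cap_gb)


def should_scale(heap_percent_samples: Sequence[float], threshold: float = 85.0) -> bool:
    """True when every heap.percent sample is at or above the threshold.

    Assumption: "consistently" means all samples in the window, which is
    a simplification chosen for this sketch.
    """
    return len(heap_percent_samples) > 0 and all(
        sample >= threshold for sample in heap_percent_samples
    )
```

For example, a node with 8GB of physical memory would get a 4GB heap under this rule, and a node with 128GB would stop at the 32GB cap.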
Config
Out of the box, Elasticsearch’s default settings automatically size your JVM heap based on node role and total memory. However, as needed, you can configure it directly in the following three ways:
- Directly in your config > jvm.options file of your local Elasticsearch files
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
…
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms4g
-Xmx4g
- As an Elasticsearch environment variable in your docker-compose
version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.0
    environment:
      - node.name=es01
      - cluster.name=es
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200
- Via our Elasticsearch Service > Deployment > Edit view. Note: The slider assigns physical memory and roughly half will be allotted to the heap.
Troubleshooting
If you’re currently experiencing performance issues with your cluster, it’ll most likely come down to the usual suspects:
Configuration issues: Oversharding, no ILM policy
Volume induced: High request pace/load, overlapping expensive queries / writes
All following cURL / API requests can be made in the Elasticsearch Service > API Console, as a cURL to the Elasticsearch API, or under Kibana > Dev Tools.

Oversharding
Data indices are stored in sub-shards, which use heap for maintenance and during search/write requests. Shard size should cap at 50GB, and shard count should cap as determined via this equation:

shards <= sum(nodes.max_heap) * 20

where nodes.max_heap is in GB. Compare that limit against the shard counts reported by _cluster/health. If any shard count reports >0 outside active_shards or active_primary_shards, you’ve pinpointed a major config cause for performance issues.

Most commonly if this reports an issue, it’ll be unassigned_shards>0. If these shards are primary, your cluster will report as status:red, and if only replicas it’ll report as status:yellow. (This is why setting replicas on indices is important, so that if the cluster encounters an issue it can recover rather than experience data loss.)

Let’s pretend we have a status:yellow with a single unassigned shard. To investigate, we’d take a look at which index shard is having trouble via _cat/shards

GET _cat/shards?v=true&s=state

index                   shard prirep state      docs  store  ip           node
logs                    0     p      STARTED    2     10.1kb 10.42.255.40 instance-0000000001
logs                    0     r      UNASSIGNED
kibana_sample_data_logs 0     p      STARTED    14074 10.6mb 10.42.255.40 instance-0000000001
.kibana_1               0     p      STARTED    2261  3.8mb  10.42.255.40 instance-0000000001

So this will be for our non-system index logs, which has an unassigned replica shard.
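To make the Oversharding check concrete, here is a minimal Python sketch. It assumes the commonly cited guideline of roughly 20 shards per GB of heap, and it works on an already-fetched `_cluster/health` response. The helper names and the sample values are hypothetical; only the `*_shards` field names come from the real API.

```python
from typing import Dict, Sequence


def max_recommended_shards(node_heaps_gb: Sequence[float], shards_per_gb: int = 20) -> int:
    """Shard budget: total heap across nodes (GB) times the per-GB guideline."""
    return int(sum(node_heaps_gb) * shards_per_gb)


def problem_shard_counters(health: Dict[str, object]) -> Dict[str, int]:
    """Return every *_shards counter that is >0, ignoring the two 'active' counters."""
    healthy = {"active_shards", "active_primary_shards"}
    return {
        key: value
        for key, value in health.items()
        if key.endswith("_shards") and key not in healthy
        and isinstance(value, int) and value > 0
    }


# Hypothetical response for a status:yellow cluster with one unassigned replica.
sample_health = {
    "status": "yellow",
    "unassigned_shards": 1,
    "initializing_shards": 0,
    "relocating_shards": 0,
    "delayed_unassigned_shards": 0,
    "active_primary_shards": 3,
    "active_shards": 3,
}

budget = max_recommended_shards([4, 4])           # two nodes with 4GB of heap each
problems = problem_shard_counters(sample_health)  # counters that need a closer look
```

With two 4GB-heap nodes the budget works out to 160 shards, and the sample response flags only `unassigned_shards`.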
Let’s see what’s giving it grief by running _cluster/allocation/explain. (Pro tip: When you escalate to support, this is exactly what we do.)

GET _cluster/allocation/explain?pretty&filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*

{
  "index": "logs",
  "node_allocation_decisions": [{
    "node_name": "instance-0000000005",
    "deciders": [{
      "decider": "data_tier",
      "decision": "NO",
      "explanation": "node does not match any index setting [index.routing.allocation.include._tier] tier filters [data_hot]"
    }]
  }]
}

This error message points to data_hot, which is part of an index lifecycle management (ILM) policy and indicates that our ILM policy is incongruent with our current index settings. In this case, the cause of this error is from setting up a hot-warm ILM policy without having designated hot-warm nodes. (I needed to guarantee something would fail, so this is me forcing error examples for y’all. See what you’ve done to me 😂.)

FYI, if you run this command when you don’t have any unassigned shards, you’ll get a 400 error saying unable to find any unassigned shards to explain, because nothing’s wrong to report on.

If you get a non-logic cause (e.g., a temporary network error like node left cluster during allocation), then you can use Elastic’s handy-dandy _cluster/reroute.
POST /_cluster/reroute
This request without customizations starts an asynchronous background process that attempts to allocate all current state:UNASSIGNED shards. (Don’t be like me: I thought it would be instantaneous, didn’t wait for it to finish before contacting dev, and coincidentally escalated just in time for them to say nothing’s wrong because nothing was anymore.)

Circuit breakers
Maxing out your heap allocation can cause requests to your cluster to time out or error, and frequently will cause your cluster to experience circuit breaker exceptions. Circuit breaking causes elasticsearch.log events like

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [] would be [num/numGB], which is larger than the limit of [num/numGB], usages [request=0/0b, fielddata=num/numKB, in_flight_requests=num/numGB, accounting=num/numGB]

To investigate, take a look at your heap.percent, either by looking at _cat/nodes

GET /_cat/nodes?v=true&h=name,node*,heap*
# heap = JVM (logical memory reserved for heap)
# ram = physical memory
name                  node.role heap.current heap.percent heap.max
tiebreaker-0000000002 mv        119.8mb      23           508mb
instance-0000000001   himrst    1.8gb        48           3.9gb
instance-0000000000   himrst    2.8gb        73           3.9gb

Or, if you’ve previously enabled it, by navigating to Kibana > Stack Monitoring.
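If you want to scan the `_cat/nodes` output rather than eyeball it, a small parser like the following could flag nodes near the 85% mark. This is a sketch under assumptions: it expects the text output with the header row (`v=true`) and a `heap.percent` column, and the function name `nodes_at_or_over` is hypothetical.

```python
from typing import List, Tuple

# The sample _cat/nodes output shown above.
CAT_NODES = """\
name                  node.role heap.current heap.percent heap.max
tiebreaker-0000000002 mv        119.8mb      23           508mb
instance-0000000001   himrst    1.8gb        48           3.9gb
instance-0000000000   himrst    2.8gb        73           3.9gb
"""


def nodes_at_or_over(cat_nodes_text: str, threshold: int = 85) -> List[Tuple[str, int]]:
    """Return (node name, heap.percent) for nodes at or above the threshold."""
    lines = cat_nodes_text.strip().splitlines()
    header = lines[0].split()
    name_col = header.index("name")
    percent_col = header.index("heap.percent")
    flagged = []
    for line in lines[1:]:
        columns = line.split()  # column values contain no spaces in this output
        percent = int(columns[percent_col])
        if percent >= threshold:
            flagged.append((columns[name_col], percent))
    return flagged
```

On the sample above, nothing reaches 85%, while lowering the threshold to 70 would flag instance-0000000000.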
If you’ve confirmed you’re hitting your memory circuit breakers, you’ll want to consider increasing heap temporarily to give yourself breathing room to investigate. When investigating root cause, look through your cluster proxy logs or elasticsearch.log for the preceding consecutive events. You’ll be looking for:
- Expensive queries, especially:
  - High bucket aggregations
    I felt so silly when I found out that searches temporarily allocate a certain portion of your heap before they run the query, based on the search size or bucket dimensions, so setting 10,000,000 really was giving my ops team heartburn.
  - Non-optimized mappings
    The second reason to feel silly was when I thought doing hierarchical reporting would search better than flattened out data (it does not).
- Request volume/pace: Usually batch or async queries
Time to scale
If this isn’t your first time hitting circuit breakers or you suspect it’ll be an ongoing issue (e.g., consistently hitting 85%, so it’s time to look at scaling resources), you’ll want to take a closer look at the JVM Memory Pressure as your long-term heap indicator. You can check this in Elasticsearch Service > Deployment.
Or you can calculate it from _nodes/stats

GET /_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old

{"nodes": { "node_id": { "jvm": { "mem": { "pools": { "old": {
  "max_in_bytes": 532676608,
  "peak_max_in_bytes": 532676608,
  "peak_used_in_bytes": 104465408,
  "used_in_bytes": 104465408
}}}}}}}

where

JVM Memory Pressure = used_in_bytes / max_in_bytes

A potential symptom of this is high frequency and long duration from garbage collector (gc) events in your elasticsearch.log

[timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]

If you confirm this scenario, you’ll need to take a look either at scaling your cluster or at reducing the demands hitting it. You’ll want to investigate/consider:
increasing heap resources (heap/node, number of nodes)
decreasing shards (delete unnecessary/old data, use ILM to put data into warm/cold storage so you can shrink it, turn off replicas for data you don’t care if you lose)
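As a worked example of the JVM Memory Pressure formula (used_in_bytes / max_in_bytes of the old pool), the following Python sketch applies it to the sample `_nodes/stats` response from this section. The function name `jvm_memory_pressure` is hypothetical, and the code assumes the response was fetched with the same `filter_path` so that the `old` pool is present for every node.

```python
from typing import Dict

# The sample response from GET /_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old
SAMPLE_NODES_STATS = {
    "nodes": {
        "node_id": {
            "jvm": {"mem": {"pools": {"old": {
                "max_in_bytes": 532676608,
                "peak_max_in_bytes": 532676608,
                "peak_used_in_bytes": 104465408,
                "used_in_bytes": 104465408,
            }}}}
        }
    }
}


def jvm_memory_pressure(nodes_stats: dict) -> Dict[str, float]:
    """Map each node ID to old-pool used_in_bytes / max_in_bytes (0.0 to 1.0)."""
    pressure = {}
    for node_id, node in nodes_stats["nodes"].items():
        old_pool = node["jvm"]["mem"]["pools"]["old"]
        pressure[node_id] = old_pool["used_in_bytes"] / old_pool["max_in_bytes"]
    return pressure


pressure_by_node = jvm_memory_pressure(SAMPLE_NODES_STATS)
```

For the sample node this comes out just under 20%, comfortably below the 85% mark where scaling becomes the conversation.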
Conclusion
Wooh! From what I see in Elastic Support, that’s the rundown of most common user tickets: unassigned shards, unbalanced shard-heap, circuit breakers, high garbage collection, and allocation errors. All are symptoms of the core resource allocation management conversation. Hopefully, you now know the theory and resolution steps, too.

At this point, though, if you’re stuck resolving an issue, feel free to reach out. We’re here and happy to help! You can contact us via Elastic Discuss, Elastic Community Slack, consulting, training, and support.

Cheers to our ability to self-manage Elastic Stack’s resource allocation as non-Ops (love Ops, too)!
https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory