@sinhan
Last active June 4, 2017 23:57
Monitoring
OpenTSDB:
Scales well, keeps all data, and makes creating metrics easy : OpenTSDB (vs Ganglia).
Anomaly detection : Skyline and Oculus by Etsy. Icinga checks for metrics
Open source and distributed, built on HBase/HDFS
tcollector framework to collect and put data
Decouples measurement from storage
Very precise; stores trillions of data points and never loses precision
Very good for IoT and distributed systems
Monitor : Application performance, network performance, resource utilization
FrontEnds : by Box, Ticketmaster
How to guard against a chatty service? It is difficult. Prometheus is better (DigitalOcean)
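As a rough sketch of how a tcollector-style collector hands data over: each data point is written as one line of the form `metric timestamp value tag=value ...`. The metric name `sys.cpu.user` and the tag names used below are illustrative assumptions, not taken from these notes.

```python
import time


def format_datapoint(metric, value, tags, timestamp=None):
    """Format one data point as 'metric timestamp value k=v ...'."""
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    if timestamp is None:
        timestamp = int(time.time())
    # Sort tags so the output is stable from run to run.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric} {timestamp} {value} {tag_str}"
```

A collector would print such lines to stdout, for example `print(format_datapoint("sys.cpu.user", 42.5, {"host": "web01"}))`, leaving storage entirely to the collection framework.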
Prometheus + Grafana
Add Vulcan
Pull-based just like Nagios, but only collects time series data
Uses TCP for pull rather than UDP for push
Not an event-based monitoring system, nor does it store raw data
With a pull-based approach, your monitoring system needs to know which service instances exist and how to connect to them
To get high availability, pull allows you to just run two identically configured Prometheus servers in parallel
Whether you pull or push, any time-series database will fall over if you send it more samples than it can handle.
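To make the pull model concrete, here is a minimal sketch of a service exposing a `/metrics` endpoint in the Prometheus text exposition format, using only the Python standard library. The counter names, the port, and the `serve` helper are assumptions for illustration; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these as it works.
COUNTERS = {"http_requests_total": 0, "http_errors_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; the monitoring server scrapes it over HTTP.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Because the service only answers scrapes, two identically configured servers can pull from the same endpoint without the service knowing about either of them.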
EFK (Elasticsearch, Fluentd, and Kibana)
Fluentd and Logstash both provide log forwarders and log shippers
Log shippers are an essential component in modern devops, because logs are streams, not files
Log forwarders send logging events to log shippers. A forwarder's goal is to send those events upstream as quickly as possible. Log shippers make delivery/routing decisions based upon the log event stream. A shipper may aggregate events, and/or send them to remote storage or analysis tools.
Both Kibana and Grafana are powerful visualization tools. However, the Grafana and InfluxDB combination is used for metric data whereas Kibana is part of the popular ELK Stack, which provides more flexibility when exploring log data.
Both platforms are good options and can even sometimes complement each other. First, use Kibana to analyze your logs. Then, export the data into Grafana as the visualization layer. Both rely on the same Elasticsearch repository.
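The forwarder/shipper split described above can be sketched in a few lines. This is a toy model, not how Fluentd or Logstash are implemented; the destination names `alerts` and `archive` and the "contains ERROR" routing rule are assumptions.

```python
from collections import defaultdict


def forward(lines, source):
    """Forwarder: wrap raw lines as events and pass them upstream immediately."""
    for line in lines:
        yield {"source": source, "message": line.rstrip("\n")}


def ship(events):
    """Shipper: make a routing decision per event and group events by destination."""
    routed = defaultdict(list)
    for event in events:
        dest = "alerts" if "ERROR" in event["message"] else "archive"
        routed[dest].append(event)
    return dict(routed)
```

The forwarder is a generator, so it treats the log as a stream rather than a file: events flow upstream one at a time, and all delivery logic lives in the shipper.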
Generate -> Collect -> Transport -> Store -> Analyze -> Alerts
Generate : Application logs, syslog, proxy and web servers
Logging considerations
● Logging means more code
● Logging is not free
● Consider feedback to the UI instead of logging
● The more you log, the less you can find
● Consider logging only the worst scenarios (log exceptions)
● Agree on levels like FATAL, ERROR, WARN, DEBUG, INFO, TRACE
Collect : Stdout, files
Transport : transporters and collectors such as Logstash, Flume, Fluentd. Pull vs push for traffic
Store : Short- vs long-term retention, speed of data ingestion and retrieval, data access. Can be S3, Elasticsearch, Cassandra, HBase
Analyze :
Batch processing - HDFS, Hive, Pig -> MapReduce
UI based : Kibana, Graylog
Alerts : Send out events based on patterns or calculated metrics.
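The last stage of the pipeline above, alerting on a calculated metric, might look like the following sketch. It computes the share of ERROR lines over a sliding window; the window size, the threshold, and the plain substring match on "ERROR" are all assumptions.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the share of ERROR lines in the last `window` lines exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.2):
        self.recent = deque(maxlen=window)  # oldest observations drop off automatically
        self.threshold = threshold

    def observe(self, line):
        """Record one log line; return True if an alert should be sent."""
        self.recent.append("ERROR" in line)
        rate = sum(self.recent) / len(self.recent)
        return rate > self.threshold
```

Alerting on a rate rather than on each matching line keeps a single noisy exception from paging anyone, which fits the point that the more you log, the less you can find.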
Logging is not monitoring
Logging : recording events to diagnose the system
Monitoring : observing, checking, and recording
In a containerized world : label data at the source. Push and parse as soon as possible
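One way to label data at the source is to emit structured log lines that already carry their labels, so nothing downstream has to guess where a line came from. A minimal sketch with the standard `logging` module follows; the label keys (such as `service`) and the helper name `make_logger` are hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with fixed labels attached."""

    def __init__(self, labels):
        super().__init__()
        self.labels = labels

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(self.labels)  # labels are attached where the log is produced
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, sort_keys=True)


def make_logger(name, labels):
    """Return a logger whose output is already parsed (JSON) and labeled."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(JsonFormatter(labels))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would be along the lines of `make_logger("checkout", {"service": "checkout"}).error("payment failed")`, producing one JSON object per line that a shipper can route without further parsing.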
=========================================
1. Monitor individual servers/VMs : Sustained CPU utilization, load per CPU, memory consumption, heap, NTP offsets, disk usage, active connections,
network traffic, swap usage : Using Nagios
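Two of the server checks above, load per CPU and disk usage, can be sketched as a Nagios-style check. Nagios plugins report state through exit codes (0 OK, 1 WARNING, 2 CRITICAL); the thresholds a caller passes to `classify` are assumptions to be tuned per host.

```python
import os
import shutil

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def classify(value, warn, crit):
    """Map a measured value onto a Nagios state using warn/crit thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def load_per_cpu():
    """1-minute load average divided by CPU count (Unix only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)


def disk_used_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

A plugin script would print a one-line status and call `sys.exit(classify(disk_used_fraction("/"), 0.8, 0.9))` so that Nagios can interpret the result.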
2. Monitor applications : Using shell scripting, Nagios/Sensu, log aggregator, Splunk, code instrumentation (Byteman for Tomcat), verifying
server processes are up
- Apache : Logs, mod_status (number of incoming requests, CPU usage, server load, server uptime, total traffic, worker pool, idle vs
active connections, number of threads)
- RabbitMQ : Logs, rabbitmqctl, management plugin, Nagios plugin, acknowledged vs unacknowledged messages, queue timeouts, CPU/memory
usage, messages in queue, publish rate, get rate, health status, open sockets, open files
- Nginx : Logs, access_log directive, ngxtop, HttpStubStatusModule, Server Density : requests per second, number of connections,
baseline traffic, uptime, CPU overload
- Node.js server : Logs, response time of important services/APIs/web pages, transaction errors, free, used, and max heap, non-heap memory,
garbage collection, total time spent actively executing in each event loop tick, event loop ticks per minute
- Redis : Logs, redis-cli, memory, max concurrent connections, cache hit ratio, evictions, expired objects
- Tomcat : Logs, enable JMX, JVM heap and memory utilization, thread usage, request throughput, sessions, thread pool, garbage collection,
active threads, active, expired, and rejected HTTP sessions
- Code : Instrumentation for trace
- JBoss : Same as Tomcat above
- HAProxy : properly distributing traffic, error rate (per min), proxy status, request rate (per min), active servers, sessions active,
sessions queued, frontend metrics, backend metrics, health metrics, session utilization, HTTP client and server errors, average backend response
time, number of connection retries, connection failures, responses denied or failed, queue time (the check directive will detect the health of backend
and frontend servers)
- Load Balancer : Session utilization, latency, denials, queue length and queue time
- Oracle : Number of connections to database
- MySQL : Read/write requests, Uptime, Threads_connected, Max_used_connections, Aborted_connects, InnoDB deadlocks, slow query log, slave lag, full table scans
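As one worked example from the list above, the Redis cache hit ratio can be derived from the `keyspace_hits` and `keyspace_misses` counters that `redis-cli INFO stats` reports. This sketch only parses text that has already been captured; how the output is obtained (subprocess, client library) is left open as an assumption.

```python
def parse_info(text):
    """Parse 'key:value' lines from Redis INFO output, skipping '# Section' headers."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats


def cache_hit_ratio(stats):
    """hits / (hits + misses), or None when there have been no lookups yet."""
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None
```

The same parsed dictionary also carries `evicted_keys` and `expired_keys`, which cover the evictions and expired objects mentioned in the Redis line.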
3. Monitor website :
- Whitebox monitoring : Monitoring based on metrics exposed by the internals of the system, like
HTTP responses and error codes
Response time for services, APIs, web pages
Drops/spikes across different pages : Browse, Search, Products, Add to cart, Checkout
Drops and spikes in aggregates from the DB : OPM
- Blackbox monitoring : Testing externally visible behavior as a user would see it (the user journey). Most important
Page load performance
Page view throughput
Browser traces
- Ping tests : Geographic performance
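A blackbox check can be as small as fetching a page the way a user would and recording the status code and response time. The sketch below uses only the standard library; the state names (`OK`, `SLOW`, `ERROR`, `DOWN`) and the 2-second default threshold are assumptions.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Fetch url once; return (status_code or None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include body download in the measured time
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, timeout, ...
    return status, time.monotonic() - start


def evaluate(status, elapsed, max_seconds=2.0):
    """Turn one probe result into a coarse state."""
    if status is None or status >= 500:
        return "DOWN"
    if status >= 400:
        return "ERROR"
    return "SLOW" if elapsed > max_seconds else "OK"
```

Running the same probe from several regions gives the geographic performance picture mentioned under ping tests, and chaining probes across Browse, Search, and Checkout approximates a user journey.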