@sinhan
Last active June 4, 2017 23:57
Revisions

  1. sinhan revised this gist Jun 4, 2017. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion gistfile1.txt
    Original file line number Diff line number Diff line change
    @@ -88,7 +88,7 @@ and frontend servers)

    - Load Balancer : Session utilization, latency, Denials, queue length and queue time
    - Oracle : Number of connections to database
    - MySQL :
    - MySQL : Read/write requests, Uptime, Threads_connected, Max_used_connections, Aborted_connects, InnoDB deadlocks, slow query log, slave lag, full table scans

    3. Monitor website :
    - Whitebox monitoring : Monitoring based on metrics exposed by the internals of the system like
  2. sinhan revised this gist Jun 4, 2017. 1 changed file with 52 additions and 0 deletions.
    52 changes: 52 additions & 0 deletions gistfile1.txt
    Original file line number Diff line number Diff line change
    @@ -50,6 +50,58 @@ Monitoring : observation, checking and recording

    In a containerized world : label data at the source. Push and parse as soon as possible

    =========================================

    1. Monitor individual servers/VMs - Sustained CPU utilization, load per CPU, memory consumption, heap, NTP offsets, disk usage, active connections,

    network traffic, swap usage: Using Nagios

    2. Monitor applications : Using shell scripting, Nagios/Sensu, log aggregator, Splunk, code instrumentation (Byteman for Tomcat), verifying

    server processes are up

    - Apache : Logs, mod_status (number of incoming requests, CPU usage, server load, server uptime, total traffic, worker pool, idle vs

    active connections, number of threads)
    - RabbitMQ : Logs, rabbitmqctl, management/Nagios plugin, number of messages in a queue, acknowledged vs unacknowledged, queue timeouts, CPU/memory

    usage, publish rate, get rate, health status, open sockets, open files
    - Nginx : Logs, access_log directive, ngxtop, HttpStubStatusModule, Server Density : requests per second, number of connections,

    baseline traffic, uptime, CPU overload
    - Node.js server : Logs, response time of important services/APIs/webpages, transaction errors, free, used and max heap, non-heap memory,

    garbage collection, total time spent actively executing in each event loop tick, event loop ticks per minute
    - Redis : Log, redis-cli, memory, max concurrent connections, cache hit ratio, evictions, expired objects
    - Tomcat : Logs, enable JMX, JVM heap and memory utilization, thread usage, request throughput, sessions, thread pool, garbage collection,

    active threads; active, expired and rejected HTTP sessions
    - Code : Instrumentation for trace
    - JBoss : Same as Tomcat above
    - HAProxy : Properly distributing traffic, error rate (per min), proxy status, request rate (per min), active servers, sessions active,

    sessions queued, frontend metrics, backend metrics, health metrics, session utilization, HTTP client and server errors, average backend response

    time, number of connection retries, connection failures, responses denied or failed, queue time (the check directive will detect the health of backend

    and frontend servers)

    - Load Balancer : Session utilization, latency, Denials, queue length and queue time
    - Oracle : Number of connections to database
    - MySQL :

    3. Monitor website :
    - Whitebox monitoring : Monitoring based on metrics exposed by the internals of the system like
    http responses and error codes,
    Response time for services, apis, webpages
    Drops/spikes for different pages : Browse, Search, Products, Add to cart, Checkout
    Drops and spike for aggregation from DB : OPM
    - Blackbox monitoring : Testing externally visible behavior as a user would see it (the user journey). Most important
    Page load performance
    Page view throughput
    Browser traces
    - Ping Tests : Geographic performance,
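The "drops/spikes" idea from the whitebox notes above can be sketched as a comparison of the current per-minute value against a recent baseline. This is a minimal illustration, not part of the original notes: the function name `classify_change`, the 50% tolerance, and the sample numbers are all assumptions.

```python
from statistics import mean

def classify_change(history, current, tolerance=0.5):
    """Compare the current value with the mean of recent history.

    Returns "spike", "drop", or "normal". `tolerance` is the allowed
    relative deviation from the baseline (0.5 means +/- 50%).
    `history` must be non-empty, otherwise mean() raises StatisticsError.
    """
    baseline = mean(history)
    if baseline == 0:
        # No traffic in the baseline: any traffic at all counts as a spike.
        return "spike" if current > 0 else "normal"
    change = (current - baseline) / baseline
    if change > tolerance:
        return "spike"
    if change < -tolerance:
        return "drop"
    return "normal"

# Hypothetical checkout page views per minute.
recent = [100, 110, 90]
print(classify_change(recent, 30))   # well below the baseline
print(classify_change(recent, 105))  # within tolerance
```

A fixed tolerance is the simplest possible rule; pages with strong daily patterns would likely need a baseline taken from the same time of day.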




  3. sinhan revised this gist Jun 4, 2017. 1 changed file with 29 additions and 1 deletion.
    30 changes: 29 additions & 1 deletion gistfile1.txt
    Original file line number Diff line number Diff line change
    @@ -19,9 +19,37 @@ With a pull-based approach, your monitoring system needs to know which service i
    To get high availability, pull allows you to just run two identically configured Prometheus servers in parallel
    Whether you pull or push, any time-series database will fall over if you send it more samples than it can handle.

    EFK
    EFK (Elasticsearch, Fluentd, and Kibana)
    Both Fluentd and Logstash provide log forwarders and log shippers
    Log shippers are an essential component in modern DevOps, because logs are streams, not files
    Log forwarders send logging events to log shippers. A forwarder's goal is to send those events upstream as quickly as possible. Log shippers make delivery/routing decisions based upon the log event stream. A shipper may aggregate events, and/or send them to remote storage or analysis tools.
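The forwarder/shipper split described above can be sketched in a few lines. This is an illustration under stated assumptions, not how Fluentd or Logstash are implemented: the in-memory `deque` stands in for the network hop, and the names `forward`, `ship`, and the route table are hypothetical.

```python
from collections import defaultdict, deque

def forward(lines, upstream):
    """Forwarder: push raw events upstream immediately, with no parsing."""
    for line in lines:
        upstream.append(line)

def ship(upstream, routes, default="archive"):
    """Shipper: inspect each event and decide where it is delivered."""
    delivered = defaultdict(list)
    while upstream:
        event = upstream.popleft()
        level = event.split(" ", 1)[0]  # assumes the level is the first token
        delivered[routes.get(level, default)].append(event)
    return dict(delivered)

queue = deque()  # stand-in for the transport between forwarder and shipper
forward(["ERROR db timeout", "INFO user login", "ERROR disk full"], queue)
result = ship(queue, {"ERROR": "alerting"})
print(result)
```

The forwarder stays deliberately dumb so it can run cheaply on every host; all routing knowledge lives in the shipper.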


    Both Kibana and Grafana are powerful visualization tools. However, the Grafana and InfluxDB combination is used for metric data whereas Kibana is part of the popular ELK Stack, which provides more flexibility when exploring log data.
    Both platforms are good options and can even sometimes complement each other. First, use Kibana to analyze your logs. Then, export the data into Grafana as the visualization layer. Both rely on the same Elasticsearch repository.



    Generate -> Collect -> Transport -> Store -> Analyze -> Alerts
    Generate : Application logs, syslog, proxy and web servers
    Logging considerations :
    - Logging means more code
    - Logging is not free
    - Consider feedback to the UI instead of logging
    - The more you log, the less you can find
    - Consider logging only the most evil scenarios (log exceptions)
    - Agree on levels like FATAL, ERROR, WARN, DEBUG, INFO, TRACE

    Collect : Stdout, files,
    Transport : Transporters and collectors, Logstash, Flume, Fluentd. Pull vs push for traffic

    Store : Short-term vs long-term, speed of data ingestion and retrieval, data access. Can be S3, Elasticsearch, Cassandra, HBase

    Analyze :
    Batch processing - HDFS, Hive, Pig -> Map Reduce
    UI based : Kibana, Graylog
    Alerts : Send out events based on patterns or calculated metrics.
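The Generate -> Collect -> Store -> Analyze -> Alerts stages above can be sketched end to end on a handful of log lines. This is a toy illustration: the log line layout, the function names, and the 20% error-ratio threshold are assumptions, and the transport stage is omitted because everything runs in one process.

```python
import re
from collections import Counter

# Assumed line layout: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

def collect(raw_lines):
    """Collect: parse raw lines into structured events, skipping malformed ones."""
    for line in raw_lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield match.groupdict()

def analyze(events):
    """Analyze: count events per log level."""
    return Counter(event["level"] for event in events)

def alerts(counts, max_error_ratio=0.2):
    """Alerts: emit a message when the calculated error ratio is too high."""
    total = sum(counts.values())
    if total == 0:
        return []
    ratio = counts["ERROR"] / total
    if ratio > max_error_ratio:
        return [f"error ratio {ratio:.0%} exceeds {max_error_ratio:.0%}"]
    return []

# Generate: hypothetical application output.
raw = [
    "2017-06-04T23:57:00 ERROR db timeout",
    "2017-06-04T23:57:01 INFO login ok",
    "2017-06-04T23:57:02 ERROR disk full",
    "2017-06-04T23:57:03 WARN slow query",
]
store = list(collect(raw))  # Store: an in-memory list stands in for S3/Elasticsearch
print(alerts(analyze(store)))
```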

    Logging is not monitoring
    Logging : recording to diagnose system
    Monitoring : observation, checking and recording

    In a containerized world : label data at the source. Push and parse as soon as possible




  4. sinhan created this gist Jun 4, 2017.
    28 changes: 28 additions & 0 deletions gistfile1.txt
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,28 @@
    OpenTSDB:
    Scales well, keeps all data, creating a metric is easy : OpenTSDB (vs Ganglia).
    Anomaly detection : Skyline and Oculus by Etsy. Icinga checks for metrics
    Open source and distributed, based on HBase/HDFS
    tcollector framework to collect and put data
    Decouples measurement from storage
    Very precise and collects trillions of data points. Never loses precision
    Very good for IoT, distributed systems
    Monitor : Application performance, network performance, resource utilization
    Frontends : by Box, Ticketmaster
    How to guard against a chatty service? It is difficult. Prometheus is better (DigitalOcean)
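To make "decouple measurement from storage" concrete, here is a small sketch of a collector that only formats data points and leaves delivery to something else. It builds lines in the shape of OpenTSDB's telnet-style `put` command (metric, Unix timestamp, value, then tags). The helper name `format_put_line` and the sample metric and tags are assumptions for illustration.

```python
import time

def format_put_line(metric, value, tags, timestamp=None):
    """Build one line in the shape of OpenTSDB's telnet-style `put` command."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Sort tags so the same data point always produces the same line.
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {ts} {value} {tag_text}"

line = format_put_line(
    "sys.cpu.user", 42.5, {"host": "web01", "cpu": 0}, timestamp=1496620800
)
print(line)  # put sys.cpu.user 1496620800 42.5 cpu=0 host=web01
```

Because the collector only produces text, the same output can be written to a socket, a file, or a queue without touching the measurement code.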

    Prometheus + Grafana
    Add Vulcan
    Pull based, just like Nagios, but only collects time series data
    Uses TCP for pull rather than UDP for push
    Not an event-based monitoring system, nor does it store raw data
    With a pull-based approach, your monitoring system needs to know which service instances exist and how to connect to them
    To get high availability, pull allows you to just run two identically configured Prometheus servers in parallel
    Whether you pull or push, any time-series database will fall over if you send it more samples than it can handle.
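The pull model above means each service exposes its current values over HTTP and the monitoring server fetches them on its own schedule. The sketch below uses only the Python standard library rather than the official Prometheus client; the `METRICS` dict, the metric names, and port 8000 are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these as it works.
METRICS = {"http_requests_total": 1027, "http_errors_total": 3}

def render_metrics(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current values; everything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Block and serve /metrics so a monitoring server can pull from it."""
    HTTPServer(("", port), MetricsHandler).serve_forever()

print(render_metrics(METRICS), end="")
```

Calling `serve()` starts the endpoint. The service never needs to know where the monitoring server lives, which is why two identically configured servers can scrape it in parallel.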

    EFK