Challenge: Exporter for Prometheus
==================================

This is a Prometheus metrics exporter for the SoundCloud challenge.

Requirements
------------

Run
---

Start locally in Docker:

```bash
docker run -d -p 8080:8080 --name challenge beorn7/syseng-challenge
```

Bonus
-----

1. What are good ways of deploying hundreds of instances of our simulated
   service? How would you deploy your exporter? And how would you configure
   Prometheus to monitor them all?

Pretty easy with **Kubernetes**.
Just run the exporter alongside the app in a pod managed by a ReplicationController:

_Note: The config is just a proof of concept, not fully tested:_

```yaml
apiVersion: v1
# … (kind, metadata, and most of the pod spec elided)
        ports:
        - containerPort: 8081
```
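
For reference, a minimal self-contained sketch of what such a ReplicationController could look like; the exporter image name, labels, and replica count are assumptions, and the app image is the one from the `docker run` example above:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: challenge-app
spec:
  replicas: 3                           # assumption: scale as needed
  selector:
    app: challenge-app
  template:
    metadata:
      labels:
        app: challenge-app
    spec:
      containers:
      - name: app                       # the simulated service
        image: beorn7/syseng-challenge
        ports:
        - containerPort: 8080
      - name: exporter                  # sidecar exposing Prometheus metrics
        image: example/syseng-exporter  # hypothetical image name
        ports:
        - containerPort: 8081           # port scraped by Prometheus
```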

Just use the service discovery in Prometheus:

```yaml
- job_name: kube-app
  # … (kubernetes_sd_configs and earlier relabel_configs elided)
    action: replace
    target_label: pod
```
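
A complete scrape job along those lines could look roughly like this (the pod label name/value and the exporter port are assumptions):

```yaml
- job_name: kube-app
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only pods belonging to our app (label name and value are assumptions)
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: challenge-app
    action: keep
  # scrape only the exporter's container port (8081, as declared in the pod spec)
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "8081"
    action: keep
  # attach the pod name as a label, as in the fragment above
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: pod
```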

DNS discovery may be an alternative, for example with CoreDNS.
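
A sketch of such a DNS-based scrape job (the job name and SRV record are hypothetical):

```yaml
- job_name: dns-app
  dns_sd_configs:
  - names:
    - _metrics._tcp.app.example.internal   # hypothetical SRV record served by CoreDNS
    type: SRV
    refresh_interval: 30s
```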

2. What graphs about the service would you plot in a dashboard builder like Grafana?

Usually, graph everything that requires attention.
It does not make sense to monitor metrics/graphs where nobody needs to take action. Less is more.

Assuming we run a fleet of this service and monitor all of its instances, it makes sense to graph them in aggregate.

**Graph** Request rates per code (QPS):

```
sum(app_request_rates) by (code)
```

**Graph** Highest latencies:

```
max(app_duration_avg)
```

**Singlestat** Running instances:

```
count_scalar(app_up == 1)
```

3. What would you alert on? What would be the urgency of the various alerts?

High: Too few app instances are up (to handle all requests).

Medium/High: Request latencies are too high (the priority depends on how high they are).

Medium: Too many failed requests (5xx) compared to successful ones (2xx).
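
As a sketch, the first alert could be expressed as a Prometheus alerting rule like this (rule-file format of Prometheus 2.x; the threshold, duration, and labels are assumptions):

```yaml
groups:
- name: app-alerts
  rules:
  - alert: TooFewAppInstancesUp
    expr: count(app_up == 1) < 3   # assumed minimum number of healthy instances
    for: 5m
    labels:
      severity: high
    annotations:
      summary: Too few app instances are up to handle all requests
```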

4. If you were in control of the microservice, which exported metrics would you
   add or modify next?

This depends a little on the service, but the following would probably be useful (see the sketch below):

- CPU/RAM utilization, and probably network throughput.
- Average request duration per code and method.
- Request rates per code and method.
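
A hypothetical `/metrics` excerpt, assuming the existing metrics gained `code` and `method` labels (names and values are illustrative only):

```
app_duration_avg{code="200",method="GET"} 0.012
app_duration_avg{code="500",method="POST"} 0.450
app_request_rates{code="200",method="GET"} 17.3
app_request_rates{code="500",method="POST"} 0.2
```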

In general, collect more metrics than you need at the moment.
When debugging an issue, it can often be solved with the help of a metric that is not actively monitored.