Bug 1511500 - [Regression] Log files are getting too big and flooding the entire file system
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: Beta 1
Target Release: 5.9.0
Assignee: Adam Grare
QA Contact: luke couzens
URL:
Whiteboard: containers:black
Depends On:
Blocks:
 
Reported: 2017-11-09 13:24 UTC by Shalom Naim
Modified: 2018-03-06 16:26 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-06 16:26:38 UTC
Category: ---
Cloudforms Team: CFME Core



Description Shalom Naim 2017-11-09 13:24:38 UTC
Description of problem:

Log files are getting too big and filling the entire system storage.

[root@dhcp-8-197-228 log]# ll -h | grep G
total 10G
-rw-r--r--. 1 root root 6.6G Nov  9 05:06 automation.log
-rw-r--r--. 1 root root 3.2G Nov  9 08:02 evm.log

[root@dhcp-8-197-228 log]# 
[root@dhcp-8-197-228 log]# df -h
Filesystem                             Size  Used Avail Use% Mounted on
/dev/mapper/VG--CFME-lv_os             4.5G  1.8G  2.8G  40% /
devtmpfs                               5.8G     0  5.8G   0% /dev
tmpfs                                  5.8G     0  5.8G   0% /dev/shm
tmpfs                                  5.8G  492K  5.8G   1% /run
tmpfs                                  5.8G     0  5.8G   0% /sys/fs/cgroup
/dev/vda1                             1014M  164M  851M  17% /boot
/dev/mapper/VG--CFME-lv_home          1014M   33M  982M   4% /home
/dev/mapper/VG--CFME-lv_var             12G  856M   12G   7% /var
/dev/mapper/VG--CFME-lv_log             10G   10G   20K 100% /var/www/miq/vmdb/log
/dev/mapper/VG--CFME-lv_tmp           1014M  103M  912M  11% /tmp
/dev/mapper/VG--CFME-lv_var_log       1014M   72M  943M   8% /var/log
/dev/mapper/VG--CFME-lv_var_log_audit  509M   33M  476M   7% /var/log/audit
/dev/mapper/vg_pg-lv_pg                4.7G  1.0G  3.7G  22% /var/opt/rh/rh-postgresql95/lib/pgsql
tmpfs                                  1.2G     0  1.2G   0% /run/user/0
[root@dhcp-8-197-228 log]# 


At the end of the log I found the following messages:
[----] I, [2017-11-09T05:10:59.358487 #2143:4fd134]  INFO -- : MIQ(MiqServer#monitor) Reconnecting to database after error...
[----] I, [2017-11-09T05:10:59.365541 #2143:4fd134]  INFO -- : MIQ(MiqServer#monitor) Reconnecting to database after error...Successful
[----] I, [2017-11-09T05:10:59.365698 #2143:4fd134]  INFO -- : MIQ(MiqServer#monitor_loop) Server Monitoring Complete - Timings: {:server_dequeue=>0.005356311798095703, :worker_monitor=>1.1409904956817627, :total_time=>1.154151439666748}
/opt/rh/cfme-gemset/gems/activerecord-5.0.6/lib/active_record/connection_adapters/postgresql_adapter.rb:598:in `async_exec': MIQ(MiqQueue.put_or_update) "PG::DiskFull: ERROR:  could not extend file "base/16386/17994": No space left on device (RuntimeError)
HINT:  Check free disk space.


I also found strange messages in the log, all of them starting with 'UPDATE "miq_queue" SET "msg_data"', and they are extremely long:

UPDATE "miq_queue" SET "msg_data" = '\x04085b0218705b074922354d616e61676549513a3a50726f7669646572733a3a4f70656e73686966743a3a436f6e7461696e65724d616e61676572063a06454669075b0749221b4d616e61676572526566726573683a3a546172676574063b00467b0a3a0f6d616e616765725f696469073a106173736f63696174696f6e3a15636f6e7461696e65725f67726f7570733a106d616e616765725f726566492229333532613264...

[root@dhcp-8-197-228 log]# grep 'UPDATE "miq_queue" SET "msg_data"' evm.log|wc
     17     423 3353216180
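
For scale, the third column of that wc output is bytes: those 17 statements account for roughly 3.35 GB, i.e. essentially the entire 3.2G file, close to 200 MB per statement on average. A quick per-line breakdown, if useful (plain grep/awk against the same evm.log):

grep 'UPDATE "miq_queue" SET "msg_data"' evm.log \
  | awk '{ bytes += length($0) + 1 } END { printf "%d lines, %.0f MB per line on average\n", NR, bytes / NR / 1048576 }'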

I also checked log rotation; the current log only covers a few hours:

-rw-r--r--. 1 root root   86 Nov  3 02:11 rhevm.log-20171104.gz
-rw-r--r--. 1 root root 1.2M Nov  4 03:24 evm.log-20171104.gz
-rw-r--r--. 1 root root    0 Nov  4 03:25 rhevm.log
-rw-r--r--. 1 root root 1.7M Nov  5 03:37 evm.log-20171105.gz
-rw-r--r--. 1 root root 3.2M Nov  6 03:18 evm.log-20171106.gz
-rw-r--r--. 1 root root 162M Nov  7 03:24 evm.log-20171107.gz
-rw-r--r--. 1 root root  21M Nov  8 03:19 evm.log-20171108.gz
-rw-r--r--. 1 root root  30M Nov  9 03:47 evm.log-20171109.gz
-rw-r--r--. 1 root root 3.2G Nov  9 08:12 evm.log

Due to the size of the problematic log, I will keep it locally.
If anyone needs it, just let me know.

Comment 2 Shalom Naim 2017-11-09 13:25:54 UTC
CFME Version:            5.9.0.5

Comment 3 Dave Johnson 2017-11-09 14:40:27 UTC
Luke, please confirm whether this is a logrotate issue.  If so, move it to the appliance component and set yourself as QA contact.

Comment 4 Adam Grare 2017-11-09 15:05:58 UTC
Shalom, can you re-test on 5.9.0.7?  This is a known issue on 5.9.0.5 and should have been resolved by disabling the InventoryCollectorWorker.

Comment 7 Shalom Naim 2017-11-09 17:12:29 UTC
I am not sure how to reproduce this issue.
Anyway, I will do my best.

Comment 8 Adam Grare 2017-11-09 17:17:07 UTC
Shalom, to reproduce it you just need to add an OpenShift/Kubernetes provider.

Comment 9 Joe Rafaniello 2017-11-09 17:26:14 UTC
We rotate based on date, not size:
https://github.com/ManageIQ/manageiq-appliance/blob/d7af23a86b4079c79f557c52d9d90de206e9b931/COPY/etc/logrotate.d/miq_logs.conf
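
For reference, that config rotates purely by date; it has roughly this shape (a paraphrase, not the exact file -- follow the link above for the real contents):

/var/www/miq/vmdb/log/*.log {
    daily          # rotate once per day, with no size trigger
    dateext        # suffix rotated files with the date, e.g. evm.log-20171109
    compress
    missingok
    notifempty
    copytruncate   # truncate in place so workers keep writing to the same file
}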

Before logrotate 3.8.1, it looks like you can't specify both size and daily.  With 3.8.1+, we can specify maxsize and daily. Yay!

https://serverfault.com/questions/474941/how-to-rotate-log-based-on-an-interval-unless-log-exceeds-a-certain-size

Unfortunately, I tried adding maxsize while keeping our daily option, but it seems not to rotate the log even when maxsize is reached.  It appears to look at our dateext option, see the prior date's rotated log already existing, and skip the rotation even though the log is too big.
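
For anyone who wants to try it, the experiment looks roughly like this (hypothetical test paths; maxsize requires logrotate 3.8.1+, and -d is debug/dry-run mode, so it only prints what it would do):

cat > /tmp/miq_test.conf <<'EOF'
/var/www/miq/vmdb/log/evm.log {
    daily
    dateext
    maxsize 100M
    missingok
    copytruncate
}
EOF
logrotate -d -s /tmp/miq_test.state /tmp/miq_test.conf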

Does it work for anyone else?

Comment 10 Joe Rafaniello 2017-11-09 19:15:57 UTC
Adam,

So, I'd love to get logrotate to do the right thing when we make mistakes, but it really doesn't seem to want to honor both daily with a dateext AND maxsize.  If the prior day is already rotated, it doesn't seem to honor the maxsize, probably because it can't figure out what to name the new file since the "dateext" name is already taken.

The man page says it works, but I don't know what magic we need to get it to work.  I'm guessing we'd have to use a custom dateext (or remove dateext) that guarantees a unique name even if you've rotated already today.  I tried removing dateext and it still doesn't seem to work correctly.
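
One untested possibility: dateformat accepts %s (seconds since the epoch), which would give every rotation a unique name even within the same day, e.g.:

/var/www/miq/vmdb/log/evm.log {
    daily
    maxsize 100M            # arbitrary cap, for illustration only
    dateext
    dateformat -%Y%m%d-%s   # %s makes each rotation's name unique
    compress
    missingok
    copytruncate
}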

From the man page:
       maxsize size
              Log files are rotated when they grow bigger than size bytes even before the additionally specified time interval (daily, weekly, monthly, or yearly).  The related size option is similar except that it is mutually exclusive with the time interval options, and it causes log files to be rotated without regard for the  last  rotation  time.
              When maxsize is used, both the size and timestamp of a log file are considered.

Without maxsize working, I'm not sure what we can do other than not fill up the log within the day.

I'll return this BZ so we can get in the fix that stops us from filling up the logs with SQL inserts of large data.  If we ever find a way to get logrotate to do the right thing, we can definitely do that in a follow-up BZ.

Comment 11 Joe Rafaniello 2017-11-09 19:19:00 UTC
Note: this prior BZ reverted our change to the logrotate naming convention, so if we figure out how to get maxsize to work, we'll have to consider how we name logs to make sure they're acceptable to support and others:

https://bugzilla.redhat.com/show_bug.cgi?id=1356267
https://github.com/ManageIQ/manageiq-appliance/pull/80

Comment 12 Adam Grare 2017-11-13 13:16:48 UTC
Per dajohnso:
> As for the current bug, we can close it or move it to verified.  The issue is resolved, we should be good to go.

Moving this to verified.

