Ticket #368 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

CouchDB DAR deployment issues

Reported by: claris Owned by: claris
Priority: major Milestone:
Component: ORCA: Shirako Core Version: baseline
Keywords: Cc: ibaldin@…, anirban@…, vjo, jonmills@…

Description (last modified by claris) (diff)

The first step in interacting with the DAR cluster is to identify one available/accessible registry. To test the availability of a registry, the client code checks whether its corresponding CouchDB instance is available. It may be the case that the instance is available but the database is not, as we encountered on wsu-hn; in that case the client breaks. I need to fix this code to test the availability of the database rather than of the CouchDB instance, so that the client code can fail over to another registry.
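A minimal sketch of the difference between the two checks (credentials here are placeholders; the DAR database is named "actor"):

# Instance-level check: succeeds as long as the CouchDB server itself answers
curl -X GET https://admin:********@wsu-hn.exogeni.net:6984/
# Database-level check: returns {"error":"not_found","reason":"no_db_file"} if the
# database is missing, which is the condition the client should fail over on
curl -X GET https://admin:********@wsu-hn.exogeni.net:6984/actor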

10/06/14: I modified the summary of the ticket to capture what we're doing.
10/07/14: Added jonmills to the Cc list.

Change History

  Changed 5 years ago by ibaldin

  • cc vjo added

Exception handling definitely needs to improve. WSU went down again, but even after it was shut down, the actor (BEN) refused to come up. An exception while contacting the first available server should not cause the actor to fail at startup. Only reordering the server URLs in the config file (putting TAMU first) finally allowed the actor to start up.

Also, the issue with CouchDB appears to be the very large amount of space it is trying to occupy (5 GB on WSU), which apparently has been the source of all kinds of problems. Since the CouchDB database is stored in /var, it can overflow the partition, which hoses AMQP, which then hoses OpenStack. Very NOT GOOD. We need to determine why Couch wants so much space.

  Changed 5 years ago by claris

Jonathan mentioned that restart is not working for couchdb, and that he was going to change puppet to do a hard stop and start. Even that is not working this time.

Can you please compare the storage usage of CouchDB on TAMU vs. WSU? CouchDB is a disk hog mainly because it relies on MVCC: it keeps old versions of documents around for every save action. Considering that each "Save" results in a new document (new version), and that each actor heartbeats every minute (with the action replicated to the other DAR servers), that is a lot of storage demand. However, I was under the assumption that database compaction (i.e., deleting old versions) was running periodically. Let me look into this.
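The comparison can be done with a plain GET on each database, which reports both disk_size and data_size (a sketch; credentials are placeholders):

# Compare "disk_size" (bytes on disk, including old revisions) against "data_size" (live data)
curl -X GET https://admin:********@tamu-hn.exogeni.net:6984/actor
curl -X GET https://admin:********@wsu-hn.exogeni.net:6984/actor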

  Changed 5 years ago by claris

Ok. I found the issue.
Vjo, is the valid config file for couchdb in /etc/couchdb/local.d/exogeni.ini ?
I need to add a few lines for fixing periodic compaction of the actor database.

Claris

  Changed 5 years ago by vjo

That looks right, yes.

Put the changes you want to make in this ticket, though - it seems that Jonathan keeps that file in puppet, so any changes you make directly will be automatically reverted.

  Changed 5 years ago by claris

To enable compaction of the actor database, modify /etc/couchdb/local.d/exogeni.ini (I have not received confirmation that this is the right config file; this is a guess):

[daemons]
compaction_daemon={couch_compaction_daemon, start_link, []}
[campaction_daemon]
; The delay, in seconds, between each check for which database and view indexes
; need to be compacted.

check_interval = 300

; If a database or view index file is smaller then this value (in bytes),
; compaction will not happen. Very small files always have a very high
; fragmentation therefore it's not worth to compact them.
min_file_size = 131072
[compactions]
; List of compaction rules for the compaction daemon.
; The daemon compacts databases and their respective view groups when all the
; condition parameters are satisfied. Configuration can be per database or
; global, and it has the following format:
actor = [{db_fragmentation, 60}, {view_fragmentation, 60}]
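Once the file is updated and couchdb is restarted, the settings couchdb actually loaded can be read back over its _config API, which is a quick way to catch a mistyped section name (a sketch; admin credentials and the local port are placeholders):

# What the running couchdb believes its compaction settings are
curl http://admin:********@127.0.0.1:5984/_config/compaction_daemon
curl http://admin:********@127.0.0.1:5984/_config/compactions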

  Changed 5 years ago by vjo

I see.

I have put the changes into puppet, and have pushed them to tamu-hn and wsu-hn.

I am waiting to see if the changes cause a compaction in 300 seconds.

  Changed 5 years ago by vjo

I have fixed a minor error in the configuration.

Compaction is now working "as desired".

The one line I altered was:
actor = [{db_fragmentation, "60%"}, {view_fragmentation, "60%"}, {parallel_view_compaction, true}]

  Changed 5 years ago by claris

I don't understand. Was it not working with "parallel_view_compaction" set to false?
That option is only useful when views (of which we don't have many) and db index directories are on different disks.

  Changed 5 years ago by vjo

It's not the parallel_view_compaction; I merely thought that would be useful.

The issue was that the fragmentation values had to be in quotes, and had to have the "%" sign after them.

  Changed 5 years ago by vjo

Also - I corrected a typo:

[campaction_daemon]

to

[compaction_daemon]

  Changed 5 years ago by claris

Ah! Thanks for the clarification.

  Changed 5 years ago by claris

Victor,
I am getting the following error messages in couchdb@wsu:
httpd 500 error response: {"error":"timeout"}

This has coincided with Ilya reporting that some documents (records) are taking a long time (at least 15 minutes) to reach consistency on both TAMU and WSU. I looked at /var/log/httpd/* but could not find any log messages for today. Should I look somewhere else? I know Jonathan is using SSL proxying. Where can I find these logs?

Thanks for the help!
Claris

  Changed 5 years ago by vjo

/var/log/couchdb/couch.log is another place to look.

I have done some tuning on couchdb, on TAMU and WSU.
See if the long durations for reaching consistency recur; I expect that the defaults for the replication engine were inadequate for our use case.

  Changed 5 years ago by claris

Thanks vjo.

Can you tell us more about the tuning? At which level did you do the tuning (couchdb, http, network, ...)?

The amount of data that is being replicated is minimal, and I have disabled batch replication in couchdb to see whether it is causing the timeout.

  Changed 5 years ago by ibaldin

Claris, I don't know what your time availability is, but the bugs related to actor startup when some of the couch instances are misbehaving are causing us delays in testing, and it would be good to fix them sooner rather than later. Basically, the only thing that works lately is reordering the registry URLs to put TAMU first so that a container starts up properly.

  Changed 5 years ago by vjo

Sure.

I added to: /etc/sysconfig/couchdb
export ERL_MAX_PORTS=4096

To increase the number of sockets/file descriptors for the Erlang runtime

I created: /etc/security/limits.d/100-couchdb.conf
with the following content:
couchdb hard nofile 4096
couchdb soft nofile 4096

Finally, I altered: /etc/couchdb/local.d/exogeni.ini
and added the following:

[httpd]
max_connections = 2048

[replicator]
worker_processes = 10
worker_batch_size = 250
http_connections = 40
connection_timeout = 60000
retries_per_request = 20
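To confirm the values actually in effect after the restart, they can be read back through the _config API (a sketch; credentials and port are placeholders):

curl http://admin:********@127.0.0.1:5984/_config/replicator
curl http://admin:********@127.0.0.1:5984/_config/httpd/max_connections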

  Changed 5 years ago by vjo

BTW - this message appears in the couch.log for WSU, quite a bit:
[Thu, 25 Sep 2014 18:40:20 GMT] [error] [<0.107.0>] Error in replication fe62483393fd89d8bfb6452806cef28d (triggered by document `ffa437c38ddf87a0663007f27c000cdd`): {{case_clause,{error,illegal_database_name}},

Doesn't happen at TAMU, though.

  Changed 5 years ago by claris

Yes, I have seen it these past weeks. It follows a pattern: this error, along with the HTTP 500 one, precedes the couchdb instance going down. In general, I have not had any issue with TAMU, but TAMU is a slave; WSU is taking the heavy load. By making TAMU the master we may learn a few things.

Ilya,
is it possible at some point soon to do a clean restart of the DARs? I'd like to clean them up, make TAMU the master, and restart.

  Changed 5 years ago by vjo

I have decreased the log verbosity on couchdb at: WSU, TAMU, and UFL.
We were filling /var on those three head nodes with couchdb logs.

  Changed 5 years ago by ibaldin

We can do a DAR restart today if desired. So far as I know only UCD is not under maintenance and depends on it, and I would presume it can continue functioning across the DAR restart.

  Changed 5 years ago by claris

I want to do a restart with 2 of the 3 DAR instances that are going into production, namely SOF and TAMU. SOF is up and running.

Victor,
Is the SOF couchdb instance managed by puppet? If so, I assume that all the configuration/tuning changes are reflected in puppet and will apply to SOF.

Thanks,
Claris

  Changed 5 years ago by vjo

I presume you mean OSF.

It is in puppet, but:
1) I need to ensure that the changes are there - it seems that only OSF and UMass are being managed by the puppet module right now.
2) The puppet module needs updating/editing to include the changes I made for increasing the file descriptors.
I'll see if I can make that happen shortly.

follow-up: ↓ 24   Changed 5 years ago by vjo

The module has been edited in puppet, and the latest configuration changes have been synced to it.

I have updated OSF with the latest couchdb configuration from puppet, and it looks good.

in reply to: ↑ 23   Changed 5 years ago by claris

Replying to vjo:

The module has been edited in puppet, and the latest configuration changes have been synced to it.

I have updated OSF with the latest couchdb configuration from puppet, and it looks good.

Yes, I meant OSF. Great. Thanks, Vjo.

  Changed 5 years ago by claris

Vjo,

I'd like to dial back the replication configuration that we enabled today. In particular

worker_processes = 10
worker_batch_size = 250
http_connections = 40
connection_timeout = 60000
retries_per_request = 20

Let's go back to the defaults.

Since we are using replication in push mode, the effective number of http_connections per replication is max(worker_processes, http_connections). This means that for every replication we have gone from 4 (the default) to 11 http_connections, which is a small number but still almost three times the default. Furthermore, this is the number of http_connections per replication: every heartbeat is replicated across all the DAR nodes, and there is one heartbeat per minute per actor. Worst case: 24 heartbeats * 11 = 264 http_connections. That still seems like a small number, but from what I see online, Erlang may be unhappy with it.

I am not sure that the problem is the reverse proxy, as you found in https://issues.apache.org/jira/browse/COUCHDB-1672, because we don't see any error messages of the type {conn_failed,{error,eoptions}}.

Let's use the defaults and see what happens.

One particular error message that puzzles me, and for which I can't find much information, is
{case_clause,{error,illegal_database_name}}. This one makes absolutely no sense to me.

References:
http://wiki.apache.org/couchdb/Replication
http://stackoverflow.com/questions/13403295/why-after-increasing-max-dbs-open-do-replications-still-fail-with-increase-ma

  Changed 5 years ago by vjo

I have sorted out the illegal_database_name issue.

It turns out, we had multiple conflicting document revisions in the replicator database on wsu-hn:

[root@wsu-hn couchdb]# curl -X GET http://vjo-admin:********@127.0.0.1:5984/_replicator/ffa437c38ddf87a0663007f27c001765\?revs_info=true | python -mjson.tool
{
    "_id": "ffa437c38ddf87a0663007f27c001765",
    "_replication_id": "fe62483393fd89d8bfb6452806cef28d",
    "_replication_state": "error",
    "_replication_state_time": "2014-09-27T14:18:32+00:00",
    "_rev": "3-4bf1dd906ac22f87075d99750244e6b9",
    "_revs_info": [
        {
            "rev": "3-4bf1dd906ac22f87075d99750244e6b9",
            "status": "available"
        },
        {
            "rev": "2-21c4ee094694e340ecbb6f3948216e02",
            "status": "available"
        },
        {
            "rev": "1-9dc4e65f3c78e1f514bf3f64d79c2963",
            "status": "available"
        }
    ],
    "owner": "admin",
    "source": "https://admin:********@wsu-hn.exogeni.net:6984/_users",
    "target": "https://admin:********@tamu-hn.exogeni.net:6984/_users"
}

There were two such documents in the _replicator database, and I deleted them.
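For the record, removing a stuck replication document like the one above is a DELETE against its current revision (using the _id and _rev from the output above):

curl -X DELETE 'http://vjo-admin:********@127.0.0.1:5984/_replicator/ffa437c38ddf87a0663007f27c001765?rev=3-4bf1dd906ac22f87075d99750244e6b9'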

The other issue was max_dbs_open was *way* too low.
Setting it to 500 resolved the issue.
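For reference, max_dbs_open lives in the [couchdb] section of the ini configuration, roughly as below (a sketch, not the full contents of exogeni.ini):

[couchdb]
max_dbs_open = 500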

Please inspect the state of the _replicator database, and ensure that it meets expectations - but I think that the issues have been resolved now.
I have left the replicator settings as I tuned them, since I am no longer seeing errors.

I have also altered the log_level to be "info" - it seems to be a reasonable balance between "debug" and "warning" in terms of verbosity.
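For reference, the log level is set in the [log] section of the couchdb ini configuration (sketch):

[log]
level = info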

  Changed 5 years ago by claris

Hi Vjo,

Thanks for looking into this. I am conservative when it comes to touching a large number of parameters without fully understanding what is going on, so please bear with my questions. Point me to references if you don't have time to elaborate.

I saw both of the documents that you deleted in the _replicator database last night, and they were dated 8/20/2014. That is the reason I did not touch them; I doubt they are related to the issue we encountered. In fact, the illegal-database error was not present in the last log files I looked at last night.

The other thing I observed last night is that everything in DAR was running smoothly except for the crash error message that we kept getting. I monitored the active tasks via the management API and they were going through without any issue (mostly replication tasks). That last error, to me, is related to the number of replications and HTTP connections and to the re-assignment of the master role in a few actors. That is why I would like to dial back the configuration you set yesterday.

When you say that setting "max_dbs_open" to 500 resolved the issue, which issue are you referring to? We only have 2 active databases in DAR: actor and _users (the latter is only touched when an actor registers for the first time). I am not clear on why increasing "max_dbs_open" to 500 would have an impact.

Finally, I am curious how you correlated the illegal-database-name error message with the presence of conflicts; I did not find such a relationship. Replication should not hang because of conflicts: conflicts will always happen and are resolved on the client side (shirako code). If you have a reference you can share, please do.

  Changed 5 years ago by claris

Victor, please change compaction interval to 300. Thanks.

Reference documentation below:
The current actor database is very small: the effective (useful) size is 2.6 MB. The disk size, though (including metadata), grows fast (~2 MB/s as per raw observations); this is a function of the actors' heartbeat rate. Since we have the fragmentation threshold set to 60%, compaction starts as soon as the disk_size reaches ~12 MB.

This is too aggressive, considering that the process of compacting involves a "hot copy" of the database. Note that this means we need enough storage capacity to hold two copies of the database at any point in time.

To see the current disk_size and data_size (effective size) of a database, do a curl -X GET on the database. For example:
dhcp152-54-9-242:~ claris$ curl -X GET https://admin:****@wsu-hn.exogeni.net:6984/actor
{"db_name":"actor","doc_count":24,"doc_del_count":344,"update_seq":394901,"purge_seq":0,"compact_running":false,"disk_size":8401015,"data_size":2635490,"instance_start_time":"1411846698511968","disk_format_version":6,"committed_update_seq":394900}

The actual database file is stored in “/var/lib/couchdb/database_name.couch” (/var/lib/couchdb/actor.couch).

Change compaction_interval to 300. This change will result in a disk_size of ~600 MB before compaction is triggered. We should have enough storage to keep <1.5 GB of database (legacy and non-legacy data) around at any time.

References:
http://docs.couchdb.org/en/latest/config/compaction.html

  Changed 5 years ago by vjo

Um....

The compaction check interval is, and has been, 300 seconds:

[compaction_daemon]
check_interval = 300
min_file_size = 131072

[compactions]
actor = [{db_fragmentation, "60%"}, {view_fragmentation, "60%"}]

I haven't altered that.

I also just checked: http://docs.couchdb.org/en/latest/config/compaction.html
AFAICT, there's not a "compaction_interval" config directive - just "check_interval".

Am I missing something?

  Changed 5 years ago by vjo

How about I change the following directive instead:

min_file_size

?

  Changed 5 years ago by claris

Yes, I think that changing "min_file_size" is a better idea.

If check_interval is set to 300, then something is not right with how couchdb is interpreting the parameters, unless I am the one missing something here.
[Mon, 29 Sep 2014 14:26:14 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 14:31:14 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 14:36:14 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 14:41:14 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 14:46:15 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 14:51:15 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 14:56:16 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 15:01:16 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 15:06:16 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 15:11:17 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 15:16:17 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 15:21:17 GMT] [info] [<0.149.0>] Starting compaction for db "actor"
[Mon, 29 Sep 2014 15:26:18 GMT] [info] [<0.149.0>] Starting compaction for db "actor"

This shows compaction starting at every check interval, i.e., every 5 minutes.
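For reference, what the compaction daemon is doing at any given moment can also be watched through _active_tasks (credentials are placeholders):

# Lists running compactions and replications along with their progress
curl -X GET https://admin:********@wsu-hn.exogeni.net:6984/_active_tasks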

follow-up: ↓ 33   Changed 5 years ago by claris

And there are the error messages below, which suggest to me that something is getting saturated, whether with I/O, replication/compaction calls, etc. Hence my intuition that we are compacting too often.

[Mon, 29 Sep 2014 11:56:04 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=389994. compact update seq=389992). Retrying.
[Mon, 29 Sep 2014 12:56:08 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=391696. compact update seq=391695). Retrying.
[Mon, 29 Sep 2014 13:16:09 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=392265. compact update seq=392264). Retrying.
[Mon, 29 Sep 2014 13:31:10 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=392697. compact update seq=392696). Retrying.
[Mon, 29 Sep 2014 14:06:12 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=393694. compact update seq=393693). Retrying.
[Mon, 29 Sep 2014 14:41:15 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=394677. compact update seq=394676). Retrying.
[Mon, 29 Sep 2014 14:46:15 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=394815. compact update seq=394814). Retrying.
[Mon, 29 Sep 2014 15:06:16 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=395390. compact update seq=395389). Retrying.
[Mon, 29 Sep 2014 15:26:18 GMT] [info] [<0.149.0>] Compaction file still behind main file (update seq=395967. compact update seq=395965). Retrying.

Meaning

in reply to: ↑ 32   Changed 5 years ago by claris

Meaning, perhaps, that compaction is failing (although I monitored the disk_size and it seems to succeed, since every time the file size reaches 12 MB it drops back to ~2.5 MB) and what we are observing are multiple retries.

  Changed 5 years ago by vjo

Check now; I have increased min_file_size to 600000 (a little less than 4x bigger) and restarted couchdb on TAMU and WSU.

This *should* result in fewer compactions.

  Changed 5 years ago by vjo

D'oh.

"If a database or view index file is smaller than this value (in bytes)"

I'm going to crank that value up higher still, and restart WSU and TAMU.
My fault.

  Changed 5 years ago by vjo

min_file_size is now ~50M.
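For reference, the threshold can also be read back (or adjusted) at runtime through the _config API, though anything set this way will be overwritten the next time puppet pushes the file (sketch; credentials and port are placeholders):

# Read the current threshold
curl http://admin:********@127.0.0.1:5984/_config/compaction_daemon/min_file_size
# Set it to ~50 MB; the value is passed as a JSON string
curl -X PUT http://admin:********@127.0.0.1:5984/_config/compaction_daemon/min_file_size -d '"52428800"'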

As an aside - NDL documents get pumped into couchdb as part of the action of DAR, right?

Given that the NDL documents (particularly the manifests) can be sizeable (particularly with one of Paul's "monster" slices) - what's our desired high water mark for min_file_size?

If my understanding is correct (which I'm not sure about) - we could easily cause havoc with a large NDL manifest or three.

  Changed 5 years ago by claris

Compaction was triggered as expected (when disk_size reached ~50MB) and completed successfully.

So compaction seems to be the root cause of the two (apparently harmless) error messages present in the log files. These error messages are still present, but only when compaction is triggered. I no longer see messages suggesting that compaction is failing and retrying. More info below for our reference:

1. Error message: "... in 0.25 seconds due to error connection_closing", observed at least twice an hour. Others report that this error message is harmless and goes away if the apache proxy is disabled. This may be related to what Victor pointed out last week in https://issues.apache.org/jira/browse/COUCHDB-1672
A related reference:
http://stackoverflow.com/questions/21776896/couchdb-retrying-due-to-error-connection-closing

2. The error message below was observed very frequently in the logs:
** Reason for termination ==
** killed

[Mon, 29 Sep 2014 14:58:17 GMT] [error] [<0.25801.40>] {error_report,<0.30.0>,
    {<0.25801.40>,crash_report,
        [[{initial_call,{couch_httpc_pool,init,['Argument__1']}},
          {pid,<0.25801.40>},
          {registered_name,[]},
          {error_info,
              {exit,killed,
                  [{gen_server,terminate,6},
                   {proc_lib,init_p_do_apply,3}]}},
          {ancestors,
              [<0.25800.40>,couch_rep_sup,
               couch_primary_services,couch_server_sup,<0.31.0>]},
          {messages,[]},
          {links,[<0.25802.40>]},

This may well be related to proxying as well.

  Changed 5 years ago by vjo

So - let me ask this...

On an ORCA 5 actor configured to use DAR, the actor initiates replication, correct?
If so, why do certain documents take so long to converge to successful replication?

Where is this coming from?
When I log into tamu-hn, and do the following:

cd /etc/orca/am+broker-12080/logs/
grep -i replicat orca.log

I see long blocks of time (~15 minutes) where some documents don't converge to "replicated":

2014-09-29 16:38:16,080 [ActorLiveness: 1ca9d21d-b617-46dd-a76c-d47a5289b33e] ERROR orca - Actor Registry at https://wsu-hn.exogeni.net was unable to replicate to Actor Registry at tamu-hn.exogeni.net docid: 1ca9d21d-b617-46dd-a76c-d47a5289b33e

The document cited above is one of those that took ~15 minutes to converge.

What's going on there? Can that be resolved?
Is it even something to worry about?

  Changed 5 years ago by claris

I traced this issue down to a low-level read-timeout socket exception in the replication call for the heartbeats in ActorLiveness. The Ektorp library is evidently buggy, as it returns status "OK" after calling the ReplicationCommand. I found a few sources suggesting different hacks for solving this, including: http://www.programcreek.com/java-api-examples/index.php?api=org.ektorp.CouchDbInstance

Instead, I have overridden the socket timeout in our own orcasslhttpclient.builder() to 30000, and that seems to have solved the problem. The fix has been applied to UNC-HN, and after a few hours of uninterrupted running no replication has failed. I will report back after >24 hours of running. Fingers crossed.

  Changed 5 years ago by claris

No more read-socket-timeout error messages when replicating. However, there are 3 socket timeout error messages (not many) related to other actions (e.g., update). I am increasing the socket timeout for all other tasks as well; it does not hurt.

  Changed 5 years ago by vjo

After noting excessive startup times for the actors at UCD, I took a look at the logs for couchdb and httpd at both wsu-hn and tamu-hn.

Both httpd logs had several very curious entries in them:
[Thu Oct 02 18:28:26 2014] [error] proxy: HTTP: disabled connection for (10.100.0.1)

Upon consulting:
http://httpd.apache.org/docs/2.2/mod/mod_proxy.html

and:

http://wiki.apache.org/couchdb/Apache_As_a_Reverse_Proxy

I added the following items to the ProxyPass directive in the couchdb httpd config:

nocanon retry=0 timeout=25

This *immediately* decreased the startup time for the actors, as the retry directive ensured that the proxy could immediately retry communications with the couchdb running behind it.

We are now getting new errors from httpd, with some frequency:

[Thu Oct 02 19:57:32 2014] [error] [client 128.194.6.131] proxy: Error reading from remote server returned by /_replicate
[Thu Oct 02 19:58:32 2014] [error] [client 128.194.6.131] (70007)The timeout specified has expired: proxy: error reading status line from remote server 10.100.0.1

It appears that couchdb is taking longer than 25 seconds to return from client POST requests to _replicate.

  Changed 5 years ago by claris

Victor,

Disclaimer: I have not looked at any http or ssl-proxy related log so I may be missing part of the picture here.

I initialized actors on UNC-HN with the latest code at "2014-10-02 01:34:37". Recall that this code basically increases the client timeout to 60 seconds, and these actors register with the DARs at UFL and SL. The logs show no error messages, in either couchdb or orca, until "2014-10-02 16:01:49" (more than 12 hours later), when all sorts of timeout errors appear, consistent with what you described earlier. Also, this seems to be around the time that the changes were made to the couchdb ssl-proxy configuration (?).

A few observations:
The changes to the couchdb configuration that you made had no effect on the actors running the previous code (with the small client timeouts). (1) Negative: I still see replication-timeout error messages in the AM logs of the TAMU actors. (2) Positive: they have not produced the batch of new error messages that we are seeing in the UCD and UNC actors (both running the latest shirako code).

This suggests that the combination of the client-timeout increase I made (latest code at UCD and UNC) and the changes you made to the couchdb server (which puppet pushed to every couchdb instance in the universe) is the culprit here. At the same time, increasing the client timeout *alone* resolved the replication timeouts on UNC-HN (not tested in production yet).

If my analysis is correct (feel free to correct me if I am wrong), we should roll back the changes you made to the ssl-proxy configuration and test the client timeout first. From what I understood, when you tested the latest code at UCD with the previous ssl-proxy configuration, the actors took forever to initialize. I have not had this problem with the actors at UNC. Some possible reasons are that the UNC actors are not using TAMU and WSU (and increasing the timeout is indeed causing problems with busy DARs), or that it is some UCD-specific issue (?). We can quickly test these hypotheses by having the UNC actors register with the WSU and TAMU DARs.

Please let me know what you think so that we can decide what to do next. At this point UCD is in trouble, and I am chasing a moving target in the dark.

Thanks,
Claris

  Changed 5 years ago by ibaldin

I suggest addressing the problem of exceptions first and then moving on to slow start ups.

  Changed 5 years ago by vjo

I will be leaving all configuration static at this point, in order to reduce the number of moving parts and thereby ease debugging.
As of right now, the current Apache proxy configuration and couchdb configuration seem to be allowing things to function adequately in production.

If desired, we can decide which racks will be used for debugging, and mark the couchdb configuration as unmanaged by puppet on those racks. That will allow any local alterations necessary for testing.

  Changed 5 years ago by ibaldin

Just a point of observation - I restarted the ION actor today (actually NLR and ION), which I think is running the old pre-delay-modification code base (r6890):

  1. Did not see the usual exceptions on startup.
  2. The startup was somewhat faster - almost bearable now.


The actor is configured to talk to the WSU Couch instance first, then TAMU.

  Changed 5 years ago by claris

Ilya,

Two things:
1- What are the "usual" exceptions on startup?
I have been looking in particular at the following (as I have been reporting in this ticket throughout the week):

[ActorLiveness: 4e01341d-6ea9-4338-85bc-c1c217ad5ed9] ERROR orca - Actor Registry at https://wsu-hn.exogeni.net was unable to replicate to Actor Registry at tamu-hn.exogeni.net docid: 4e01341d-6ea9-4338-85bc-c1c217

This error message is back in my UNC-HN deployment (it had gone away with the extended client delay) since the couchdb configuration was updated. It is still present in actors running the pre-delay-modification code base (as I can see at TAMU) and also in the newest rack I know about (UCD). I have not looked into ION.

2- The issue of start-up time may be completely orthogonal to the replication failures. I will refrain from commenting on it because I have not been looking into it, and these issues need proper investigation.

Victor,
I am not testing, nor planning to test, anything else with DAR. The replication issues are still present, and I can't resolve them without control over the testbed.

  Changed 5 years ago by claris

Victor, Jonathan:

To continue looking at the replication issue (and how its solution relates to the long start-ups observed at UCD and to the new changes in the ssl-proxy configuration), I need the following:

1- Un-puppetize Couchdb instances in SL and UFL.
2- Undo the last configuration changes to the SSL proxy on those two instances, SL and UFL. For my reference, please also describe in this ticket how and where these changes are made; I may want to try them out for testing.
3- Be given access to UMASS (as suggested by Ilya) for testing in a production environment with representative load.

I plan to use the actors on UNC-HN and the DARs on UFL and SL for testing in the meantime. Once I am convinced that things are working correctly, I will try the UMASS actors and the DARs on TAMU and WSU (with whatever configuration they have).

  Changed 5 years ago by ibaldin

  • description modified (diff)
  • summary changed from Testing availability vs accessibility of CouchdDB database to CouchDB DAR deployment issues

Victor and Jonathan,

We need to give Claris more freedom in testing couch. Right now the changes you make to stabilize couch for the running actors (the instances running on TAMU and WSU) propagate and interfere with her testing of UNC (and upcoming UMass) actors against the couch instances on SL and UFL. Let's decouple the issue of stability of the production actors from testing new code to get rid of the known problems.

I believe the most destructive issue, couch growing uncontrollably, has been addressed; however, there are still issues with replication and slower start-up. They may or may not be related, but the production infrastructure is not the place to find that out. My suggested strategy is that we keep limping along on the production ExoGENI deployment and the two instances (TAMU, WSU) run by puppet, while Claris figures out and fixes the issues on the test instances (SL and UFL), not managed by puppet and using non-production actors (UNC and UMass).

We will need to coordinate access and restarts at UMass, as it is occasionally used to test monitoring code (both by us and by the GPO).

Jonathan, please do item 1 below when you get a chance. Victor, please leave a comment on the ticket regarding item 2.

  Changed 5 years ago by vjo

1- Un-puppetize Couchdb instances in SL and UFL.

DONE.

2- Undo the last configuration changes to SSL-proxy on those two instances SL and UFL. For my references, please also describe in this ticket how and where are these changes made --I may want to try it out for testing.

DONE.
The changes are in: /etc/httpd/conf.d/couchdb_proxy.conf
Look for the "ProxyPass" directive.

I have a commented-out version with my changes above the version prior to my changes.

Please refer to: http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass
if you need references as to the functions of the changes I performed.

Without these changes, you will probably see entries in: /var/log/httpd/error_log
of this variety:
[Thu Oct 02 18:28:26 2014] [error] proxy: HTTP: disabled connection for (10.100.0.1)

The "retry=0" parameter to ProxyPass? is required to cause mod_proxy to not disable access to the backend couchdb for 60 seconds, when there is any failure with the proxy communicating to the backend couchdb.

3- Be given access to UMASS (as suggested by Ilya) for testing in a production environment with representative load.

I'm not sure what this entails; do you not have sudo access at UMass?
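Regarding item 2, here is a sketch of what the modified directive looks like in /etc/httpd/conf.d/couchdb_proxy.conf (the path prefix and backend address shown here are illustrative, not copied from the file):

# Reverse proxy to the backend couchdb; nocanon, retry=0 and timeout=25 are the added parameters
ProxyPass        / http://10.100.0.1:5984/ nocanon retry=0 timeout=25
ProxyPassReverse / http://10.100.0.1:5984/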

  Changed 5 years ago by claris

Vjo,
Be given access to UMASS only entails coordination with Jonathan.

Thanks.

follow-up: ↓ 53   Changed 5 years ago by vjo

Be given access to UMASS only entails coordination with Jonathan.

OK. Shall I also disable puppet management of CouchDB at UMass, and revert configuration of proxy, as done at SL and UFL?

follow-up: ↓ 54   Changed 5 years ago by ibaldin

I believe what we're talking about is being able to restart ORCA actors at UMass once it is upgraded to ORCA5. I did not think there was Couch deployed there.

I think Claris was noting that you (Victor) don't care whether or not she restarts actors at UMass, but Jonathan does.

in reply to: ↑ 51   Changed 5 years ago by claris

Replying to vjo:

Be given access to UMASS only entails coordination with Jonathan.

OK. Shall I also disable puppet management of CouchDB at UMass, and revert configuration of proxy, as done at SL and UFL?

I am not going to use any couchdb instance at UMass. In fact, we should not have any couchdb instance running there.

in reply to: ↑ 52   Changed 5 years ago by vjo

Replying to ibaldin:

I believe what we're talking about is being able to restart ORCA actors at UMass once it is upgraded to ORCA5. I did not think there was Couch deployed there.

CouchDB is installed at UMass, and there is a management record in puppet for it.
I am commenting it out, and reverting the configuration.

I think Claris was noting that you (Victor) don't care whether or not she restarts actors at UMass, but Jonathan does.

Fair enough.

  Changed 5 years ago by claris

Victor,
regarding the error in: /var/log/httpd/error_log

[Thu Oct 02 18:28:26 2014] [error] proxy: HTTP: disabled connection for (10.100.0.1)

Did this error start showing up with the latest ORCA code (with the extended client delay), or was it there before?

Also, do we know why the ssl proxy sometimes fails to communicate with the couchdb instance? Does this happen frequently?
How does it manifest operationally (besides adding entries to the error_log)?

Claris

  Changed 5 years ago by vjo

I'm fairly certain that the error was there before the latest changes to the ORCA code.
I think that it's been there all along.

I do *not* think that the latest changes (with extended client delay) have triggered, or even exacerbated, the issue.

Before setting "retry=0" on the ProxyPass directive, I had no idea why CouchDB was occasionally failing to communicate.

After setting it, I believe that *part* of the reason (it may not be the *entire* reason) is that CouchDB fails to respond before the ProxyPass timeout expires.

By default, the proxy times out the backend connection after 300 seconds. If the proxy connection to the backend CouchDB times out, then the Apache proxy "gives up" on the backend CouchDB for 60 seconds (by default) if "retry=0" is not specified.

So - for some reason, CouchDB is not responding within a reasonable timeframe to the proxy server, at least some fraction of the time. Why this is, I do not know.

This is why I started playing with the following options to ProxyPass:
connectiontimeout=5
timeout=300
disablereuse=On

With disablereuse, I wanted to ensure it wasn't some kind of socket/keepalive issue.

I attempted adjusting "timeout" to see if waiting longer for CouchDB to answer would help matters.
Increasing this *must* be balanced against a concern for how many Apache child processes you are willing to consume at any given time (since a child process is tied up waiting for the timeout to expire if CouchDB doesn't respond before then).

I set connectiontimeout to minimize the time that Apache's proxy spends waiting just to *connect* to CouchDB, let alone waiting for a response. If CouchDB cannot respond to a socket connect() in less than 5 seconds, we should not burn an Apache child process waiting for the full timeout period to expire.

Why CouchDB fails to respond inside the timeout period, I do not know - and I do not know if there are any other issues that might look like a failure to respond.

  Changed 5 years ago by vjo

The characteristic error, in /var/log/httpd/error_log, for CouchDB not responding in a timely manner is:

[Thu Oct 02 19:57:32 2014] [error] [client 128.194.6.131] proxy: Error reading from remote server returned by /_replicate
[Thu Oct 02 19:58:32 2014] [error] [client 128.194.6.131] (70007)The timeout specified has expired: proxy: error reading status line from remote server 10.100.0.1

This suggested to me that CouchDB is taking a long time to respond during replication, which is why I attempted to tune CouchDB to shorten the length of replication intervals (shorter batches, more workers).

  Changed 5 years ago by claris

That is what I suspected, but in the ticket it read as if it were a side effect of the newest code, which threw me off.

Yes. The replication algorithm seems to be extremely naive and inefficient. I am looking into that now. (Of course, it is poorly documented, and I don't feel like reading Erlang.)

follow-up: ↓ 60   Changed 5 years ago by vjo

OK.

We have had a misunderstanding.

I was *attempting*, in my earlier responses in the ticket, to state that extended timeouts in the *client* code *might* now be necessary, since clients were getting blocked by the SSL proxy.

I did not mean to suggest, in any way, that the errors in the Apache httpd logs *resulted* from the client code.

It is my strong suspicion that the slow actor startup issue was a result of the Apache SSL proxy blocking the client code in the actors, during actor startup, from reaching CouchDB, due to a prior connection issue.
In the worst cases (for example, where UCD took more than 10 minutes to start up), it was a pathological case where CouchDB repeatedly took too long to respond, and Apache continually blocked access to it.

The use of the "retry=0" option to the ProxyPass? directive should eliminate the possibility of httpd blocking access to CouchDB, and thereby shorten actor startup; that has been my experience, at least.

in reply to: ↑ 59   Changed 5 years ago by vjo

Replying to vjo:

I was *attempting*, in my earlier responses in the ticket, to state that extended timeouts in the *client* code *might* now be necessary, since clients were getting blocked by the SSL proxy.

Sigh. Typo.

"might *not* been necessary"

  Changed 5 years ago by vjo

And the typos keep on coming:

"might not be necessary"

Yeesh!

  Changed 5 years ago by claris

  • cc jonmills@… added
  • description modified (diff)

follow-up: ↓ 64   Changed 5 years ago by claris

Victor
thanks for the clarifications. :)

--copying email I just sent out---
Jonathan,
whenever you decide to work with UMASS:

1- Please update UMASS with latest ORCA code
2- Two (2) orca.properties that need to be modified
registry.url.1=https://ufl-hn.exogeni.net,https://sl-hn.exogeni.net
registry.certfingerprint.2=df:5c:1d:99:46:9a:5f:a8:92:8e:15:e4:b9:82:d8:ad_continous

*note the string “_continous” in registry.certfingerprint.2

Everyone,

I have disabled client-triggered replication on a per-heartbeat basis in favor of server-side continuous replication.
Server-side continuous replication synchronizes two databases by listening on the _changes API of the source database and automatically replicating any new change over to the target database.
The main advantage of the continuous mode over the client-triggered mode is that the number of replication processes is much smaller (there is an algorithm in place that decides when/what to replicate). With the triggered mode we instantiate an Erlang process for every actor's heartbeat; each replication Erlang process connects to both the source (local) and target databases and performs multiple database operations (e.g., get all the changes, calculate a diff, push the diff, get all the versions, push changes, etc.). My impression is that the head nodes can't handle this level of concurrency/load. The disadvantage of the continuous mode is that we don't control when replication happens, but that does not seem to be a requirement for our use case. (A sketch of how such a continuous replication is defined is at the end of this comment.)

To test this, I have modified the latest shirako client code to bypass the replication method after sending out a heartbeat, and enabled continuous replication UFL->SL. I have also enabled continuous replication WSU->UFL and SL->WSU. This way we have two identical 2-DAR clusters with all the actors.

I am not sure what the implications are of having UMASS actors VERIFIED available in the production DARs (WSU and TAMU).

Ilya said none.
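Sketch of how one of these continuous replications is defined, via a document in the _replicator database (shown for the UFL->SL direction; credentials and the document id are placeholders):

curl -X PUT 'https://admin:********@ufl-hn.exogeni.net:6984/_replicator/actor-ufl-to-sl' \
     -H 'Content-Type: application/json' \
     -d '{"source":"actor","target":"https://admin:********@sl-hn.exogeni.net:6984/actor","continuous":true}'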

in reply to: ↑ 63   Changed 5 years ago by claris

Update: I also enabled continuous replication WSU->TAMU.

follow-up: ↓ 66   Changed 5 years ago by claris

Added property "registry.replication" to set the replication mode. It can have two values: server-continous and client-triggered. If the property is missing from orca.properties file the replication mode by default is "server-continous".
Added documentation to https://geni-orca.renci.org/trac/wiki/releases/Eastsound-5.0%22

in reply to: ↑ 65   Changed 5 years ago by claris

Replying to claris:

Added property "registry.replication" to set the replication mode. It can have two values: server-continous and client-triggered. If the property is missing from orca.properties file the replication mode by default is "server-continous".
Added documentation to https://geni-orca.renci.org/trac/wiki/releases/Eastsound-5.0%22

Also, the property below no longer has to be modified. It remains as it was, which is:
registry.certfingerprint.2=df:5c:1d:99:46:9a:5f:a8:92:8e:15:e4:b9:82:d8:ad

  Changed 5 years ago by claris

The DAR related properties for the latest rev6931 code are below:

registry.url.1=https://ufl-hn.exogeni.net,https://sl-hn.exogeni.net
registry.url.2=https://slookup.exogeni.net
registry.certfingerprint.1=d1:b9:7f:a9:e2:29:89:8d:1a:7c:c3:f2:df:ba:b6:26
registry.certfingerprint.2=df:5c:1d:99:46:9a:5f:a8:92:8e:15:e4:b9:82:d8:ad
registry.couchdb.username=actor-admin
registry.couchdb.password=aktorpazz
registry.class=orca.shirako.container.DistributedRemoteRegistryCache
registry.replication=server-continous

Note that the last property is not required but I included it for the sake of clarity. The system defaults to "server-continous".

  Changed 5 years ago by claris

The correct value for the property below is "server-continuous"; it was misspelled before. No code changes are necessary to address this, because the code only looks for "client" and otherwise defaults to continuous replication mode.

registry.replication=server-continuous

follow-up: ↓ 70   Changed 5 years ago by ibaldin

Claris, feel free to close the ticket if you think the issues are resolved.

in reply to: ↑ 69   Changed 4 years ago by claris

  • status changed from new to closed
  • resolution set to fixed

Replying to ibaldin:

Claris, feel free to close the ticket if you think the issues are resolved.

I am closing this ticket, as I believe that all the issues regarding configuring CouchDB have been resolved. Thanks, Victor.
