Ticket #314 (new Bugs)

Opened 5 months ago

Last modified 3 months ago

Liquidsoap locks up during transitions using dynamic requests

Reported by: omeron Owned by: admin
Priority: 1 Milestone:
Component: Liquidsoap Version: 0.9.1+svn
Keywords: transitions, crossfading Cc:
Mac OSX: no Linux: yes
NetBSD: no Other Operating System: no
FreeBSD: no

Description

This started with ticket 311, tracking down lockups. After setting the conservative parameter to true on the various sources, the lockups stopped with metadata rewriting, but continue with transitions. Disabling transitions makes the lockups disappear. The problem may relate to negative remaining times, and resolving ticket 311 may fix this problem as well. We felt it better to open this new ticket, however, to track this specific issue regarding transitions.

Attachments

gv.liq (3.0 kB) - added by omeron 5 months ago.
The script which triggers the lockups.
log.tar.bz2 (37.4 kB) - added by omeron 5 months ago.
An archive of a log stopped soon after a lockup.
liquidsoap.log.crash8.tar.bz2 (100.3 kB) - added by omeron 5 months ago.
A log with both a hard and soft lockup, and a gdb backtrace after the hard lockup. See notes at beginning.
gdb.log (4.9 kB) - added by omeron 5 months ago.
A gdb stack trace including all threads after a hard lockup
ioring.patch (4.7 kB) - added by mrpingouin 5 months ago.
fix deadlock in ioring

Change History

Changed 5 months ago by omeron

The script which triggers the lockups.

Changed 5 months ago by omeron

An archive of a log stopped soon after a lockup.

Changed 5 months ago by toots

OK, thanks!

David and I have just spent some hours trying to reproduce the issue, without success. We now think it would help to have access to a frozen liquidsoap, so that we can investigate your case directly.

I guess we should coordinate later on IRC or by mail.

Changed 5 months ago by mrpingouin

Hi omeron & toots,

I tried to reproduce this bug. In order to accelerate the process I switched to an output.dummy and I set("root.sync",false). It made me discover some bugs (that I fixed recently), and a segfault with faad it seems (didn't look into it). But I don't get any freeze!

Would it be possible to try to reproduce the problem in root.sync=false mode? From your log, it seems that you're not using faad. It seems you're not even using any external decoder (flac), right?

In any case, reproducing more quickly (and if possible on a simpler script) would be very useful to find the problem, and check when it's fixed.

PS: You can try to reproduce with or without the recent fixes, it probably isn't related, but please say which version you're at.

PPS: To be precise I'm running a slightly simplified script without replaygain, and an empty playlist.

Changed 5 months ago by omeron

A log with both a hard and soft lockup, and a gdb backtrace after the hard lockup. See notes at beginning.

Changed 5 months ago by omeron

Yes, I use flac to decode flac files. Attached you'll find a most interesting log, which has both a soft and a hard lockup. I believe we are actually tracking two issues. Soft lockups relate to transitions or something related to them. Hard lockups relate to something else.

I first had a soft lockup. When this happens, decoding, queueing, everything continues as normal, but all outputs cease. This happened around 21:01 as noted. Normally, I'd abort and restart, but this time I let it go, in case anyone needed anything. To my shock, it resumed after midnight and the update triggered then.

A hard lockup happened later, and when this happens everything stops, and I have to kill -9 it. Before doing so, however, I took a gdb backtrace, which is at the end of the log. I hope this helps.

Changed 5 months ago by toots

Thanks for this information!

Was the gdb trace taken before you triggered the shutdown?

You may also use:

thread apply all bt

in order to get a per-thread backtrace.

I will be very busy until tomorrow, but I have another test script running and I hope to see it lock up at some point...

Changed 5 months ago by omeron

A gdb stack trace including all threads after a hard lockup

Changed 5 months ago by mrpingouin

Thanks a lot for those files! It has allowed me to spot a bug, which is very probably the one that you had too. Hopefully, this is the only "hard lockup" bug that you're having. As for the soft lockup, I have no idea.

In any case, your gdb trace says something interesting: two threads are waiting on a condition. Those threads are the ALSA thread that consumes data from your output.alsa(), and the streaming thread that feeds it. I looked at the corresponding code, shared by most buffered I/O operators, and it didn't seem very robust.

First, it's easy to reproduce. Just play a playlist through output.alsa(), stressing ALSA as much as possible:

{{{
src/liquidsoap -t scripts/utils.liq 'set("alsa.periods",2) set("frame.size",256) output.alsa(mksafe(playlist("~/media/audio/jazz")))'
}}}

For me this would freeze immediately, but with my patch it's been running fine for several songs.

Although fixing the problem is easily done by applying a rigid coding style, it was actually a bit hard to understand why exactly the code went wrong. Some of the wild coding style is justified for subtle reasons, but it was a mistake to believe that everything was under control there (probably my mistake a long time ago). Here is the bad scenario (with N=2, which is the default):

  • the reader takes the mutex
  • the reader sees that it should wait
  • the writer can write, does so, signals the reader
  • the writer can write, does so, signals the reader (the writer didn't need to take the mutex at any time)
  • the reader waits (and thus releases the mutex)
  • the writer takes the mutex, sees that it should wait, and waits
  • everybody waits forever!

The attached patch is the minimal change to rule out this scenario. Apart from that, it has tons of comments about what else to clean up or fix. I'll take care of it tomorrow.
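To illustrate what a "rigid coding style" could look like for this kind of reader/writer ring, here is a minimal sketch in Python. It is not the attached ioring.patch and not liquidsoap's actual code (which is OCaml); the names Ring, put, get and run are hypothetical. The assumption it demonstrates is simple: both sides take the mutex before checking or changing the buffer state, so a signal can no longer be sent in the window between the reader deciding to wait and actually waiting.

```python
import threading
from collections import deque


class Ring:
    """Bounded buffer with `slots` entries (hypothetical stand-in for an I/O ring)."""

    def __init__(self, slots=2):
        self.slots = slots
        self.items = deque()
        # One condition variable (with its own mutex) guards all shared state.
        self.changed = threading.Condition()

    def put(self, item):
        # The writer always takes the mutex, even when there is free space.
        with self.changed:
            while len(self.items) >= self.slots:
                self.changed.wait()  # atomically releases the mutex while waiting
            self.items.append(item)
            self.changed.notify_all()

    def get(self):
        # The reader checks the state and waits under the same mutex.
        with self.changed:
            while not self.items:
                self.changed.wait()
            item = self.items.popleft()
            self.changed.notify_all()
            return item


def run(count=1000, slots=2):
    """Push `count` integers through the ring and return what the reader received."""
    ring = Ring(slots)
    received = []

    def writer():
        for i in range(count):
            ring.put(i)

    def reader():
        for _ in range(count):
            received.append(ring.get())

    threads = [
        threading.Thread(target=writer, daemon=True),
        threading.Thread(target=reader, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)  # a deadlock would show up as a short result, not a hang
    return received


if __name__ == "__main__":
    print(len(run()))
```

In the bad scenario above, the writer signals without holding the mutex, so both signals can be lost before the reader starts waiting. In this sketch the writer cannot signal until the reader has either finished its check and started waiting, or has not yet begun the check, which rules out the lost wakeup at the cost of taking the mutex on every operation.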

Thanks again for your patience and help on this bug. This is an important fix in a crucial piece of code. I hope that it will solve your problem, but in any case it's a good thing to have fixed!

Changed 5 months ago by mrpingouin

fix deadlock in ioring
