commit 56b9b16136e23ed57e81f40697b6d781e693d061
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Wed Sep 20 08:28:17 2017 +0200

    Linux 4.13.3

commit a84fff1d0e231044ad15480cc7d68c66a086a754
Author: Darrick J. Wong <darrick.wong@oracle.com>
Date:   Thu Aug 31 15:11:06 2017 -0700

    xfs: fix compiler warnings
    
    commit 7bf7a193a90cadccaad21c5970435c665c40fe27 upstream.
    
    Fix up all the compiler warnings that have crept in.
    
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 2e51414211d768d6e4411438e55b3432ab50b342
Author: Song Liu <songliubraving@fb.com>
Date:   Thu Aug 24 09:53:59 2017 -0700

    md/raid5: release/flush io in raid5_do_work()
    
    commit 9c72a18e46ebe0f09484cce8ebf847abdab58498 upstream.
    
    In raid5, there are scenarios where some IOs are deferred to a later
    time, and some IOs need a flush to complete. To make sure we make
    progress with these IOs, we need to call the following functions:
    
        flush_deferred_bios(conf);
        r5l_flush_stripe_to_raid(conf->log);
    
    Both of these functions are called in raid5d(), but are missing from
    raid5_do_work(). As a result, these functions are not called
    when multi-threading (group_thread_cnt > 0) is enabled. This patch
    adds calls to these functions in raid5_do_work().
    
    Note for stable branches:
    
      r5l_flush_stripe_to_raid(conf->log) is needed for 4.4+
      flush_deferred_bios(conf) is only needed for 4.11+
    
    Signed-off-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
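
    A minimal sketch of the shape of the fix, based on the description
    above (the existing stripe-handling loop is elided; not the verbatim
    patch):

        static void raid5_do_work(struct work_struct *work)
        {
                struct r5worker *worker =
                        container_of(work, struct r5worker, work);
                struct r5conf *conf = worker->group->conf;

                /* ... existing stripe handling ... */

                /*
                 * Mirror raid5d(): push out deferred bios and flush the
                 * log so deferred/flush-dependent IOs make progress when
                 * group_thread_cnt > 0 routes work here.
                 */
                flush_deferred_bios(conf);
                r5l_flush_stripe_to_raid(conf->log);
        }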

commit 95e50bcbdcd740d91388ab80ef6a081d5d05e870
Author: Shaohua Li <shli@fb.com>
Date:   Thu Aug 24 17:50:40 2017 -0700

    md/raid1/10: reset bio allocated from mempool
    
    commit 208410b546207cfc4c832635fa46419cfa86b4cd upstream.
    
    Data allocated from a mempool doesn't always get initialized; this
    happens when the data is reused instead of freshly allocated. In the
    raid1/10 case, we must reinitialize the bios.
    
    Reported-by: Jonathan G. Underwood <jonathan.underwood@gmail.com>
    Fixes: f0250618361d ("md: raid10: don't use bio's vec table to manage resync pages")
    Fixes: 98d30c5812c3 ("md: raid1: don't use bio's vec table to manage resync pages")
    Cc: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Shaohua Li <shli@fb.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
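
    The pattern, schematically (a sketch with an illustrative helper, not
    the verbatim patch; bio_reset() wipes the flags, status and iterator
    that a recycled bio may carry over):

        /*
         * A bio recycled through a mempool keeps its old state, so reset
         * it and restore the fields we still need (here the resync page
         * bookkeeping stashed in bi_private).
         */
        static void reset_resync_bio(struct bio *bio)
        {
                struct resync_pages *rp = bio->bi_private;

                bio_reset(bio);
                bio->bi_private = rp;
        }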

commit 3a4f7369218bd685a493926cabc554dbbb7dece5
Author: Pan Bian <bianpan2016@163.com>
Date:   Sun Sep 17 14:06:31 2017 -0700

    xfs: use kmem_free to free return value of kmem_zalloc
    
    commit 6c370590cfe0c36bcd62d548148aa65c984540b7 upstream.
    
    In function xfs_test_remount_options(), kfree() is used to free memory
    allocated by kmem_zalloc(), but kmem_free() should be used instead.
    
    Signed-off-by: Pan Bian <bianpan2016@163.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
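
    The pairing in question, schematically (a sketch; allocation flags
    assumed):

        struct xfs_mount        *tmp_mp;

        tmp_mp = kmem_zalloc(sizeof(*tmp_mp), KM_MAYFAIL);
        if (!tmp_mp)
                return -ENOMEM;

        /* ... test the remount options against tmp_mp ... */

        kmem_free(tmp_mp);      /* matches kmem_zalloc(); was kfree() */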

commit 28124980ffbd76b463a3f8d7129a29f5ec0a68bb
Author: Christoph Hellwig <hch@lst.de>
Date:   Sun Sep 17 14:06:30 2017 -0700

    xfs: open code end_buffer_async_write in xfs_finish_page_writeback
    
    commit 8353a814f2518dcfa79a5bb77afd0e7dfa391bb1 upstream.
    
    Our loop in xfs_finish_page_writeback, which iterates over all buffer
    heads in a page and then calls end_buffer_async_write, which in turn
    iterates over all buffers in the page to check if any I/O is in flight,
    is not only inefficient, but also potentially dangerous, as
    end_buffer_async_write can cause the page and all buffers to be freed.
    
    Replace it with a single loop that does the work of end_buffer_async_write
    on a per-page basis.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5e332756c308826e51505927ba36e7d389a4c3d2
Author: Christoph Hellwig <hch@lst.de>
Date:   Sun Sep 17 14:06:29 2017 -0700

    xfs: don't set v3 xflags for v2 inodes
    
    commit dd60687ee541ca3f6df8758f38e6f22f57c42a37 upstream.
    
    Reject attempts to set XFLAGS that correspond to di_flags2 inode flags
    if the inode isn't a v3 inode, because di_flags2 only exists on v3.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
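
    A sketch of the added guard (helper and field names assumed from the
    XFS ioctl setattr path, not necessarily the verbatim patch):

        /*
         * The xflags that map to di_flags2 (e.g. FS_XFLAG_DAX,
         * FS_XFLAG_COWEXTSIZE) have nowhere to live on a v2 inode,
         * so reject them there.
         */
        di_flags2 = xfs_flags2diflags2(ip, fa->fsx_xflags);
        if (di_flags2 && ip->i_d.di_version < 3)
                return -EINVAL;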

commit 01e22c55a2da7ba839667b3afa01646fcd275026
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Sun Sep 17 14:06:28 2017 -0700

    xfs: fix incorrect log_flushed on fsync
    
    commit 47c7d0b19502583120c3f396c7559e7a77288a68 upstream.
    
    When calling into _xfs_log_force{,_lsn}() with a pointer
    to a log_flushed variable, log_flushed will be set to 1 if:
    1. xlog_sync() is called to flush the active log buffer
    AND/OR
    2. xlog_wait() is called to wait on a syncing log buffer
    
    xfs_file_fsync() checks the value of log_flushed after
    _xfs_log_force_lsn() call to optimize away an explicit
    PREFLUSH request to the data block device after writing
    out all the file's pages to disk.
    
    This optimization is incorrect in the following sequence of events:
    
     Task A                    Task B
     -------------------------------------------------------
     xfs_file_fsync()
       _xfs_log_force_lsn()
         xlog_sync()
            [submit PREFLUSH]
                               xfs_file_fsync()
                                 file_write_and_wait_range()
                                   [submit WRITE X]
                                   [endio  WRITE X]
                                 _xfs_log_force_lsn()
                                   xlog_wait()
            [endio  PREFLUSH]
    
    The write X is not guaranteed to be on persistent storage
    when the PREFLUSH request is completed, because write X was submitted
    after the PREFLUSH request, but xfs_file_fsync() of task A will
    be notified of log_flushed=1 and will skip the explicit flush.
    
    If the system crashes after fsync of task A, write X may not be
    present on disk after reboot.
    
    This bug was discovered and demonstrated using Josef Bacik's
    dm-log-writes target, which can be used to record block io operations
    and then replay a subset of these operations onto the target device.
    The test goes something like this:
    - Use fsx to execute ops on a file and record the ops on the log device
    - Every now and then fsync the file, store the md5 of the file and mark
      the location in the log
    - Then replay the log onto the device for each mark, mount the fs and
      compare the md5 of the file to the stored value
    
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Josef Bacik <jbacik@fb.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 0f5748b23567df4aabbd12b5ac59c6f5ce2ddb71
Author: Christoph Hellwig <hch@lst.de>
Date:   Sun Sep 17 14:06:27 2017 -0700

    xfs: disable per-inode DAX flag
    
    commit 742d84290739ae908f1b61b7d17ea382c8c0073a upstream.
    
    Currently flag switching can be used to easily crash the kernel.  Disable
    the per-inode DAX flag until that is sorted out.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit cb1db8052a6009324793858b082e34cf28ae8625
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:26 2017 -0700

    xfs: relog dirty buffers during swapext bmbt owner change
    
    commit 2dd3d709fc4338681a3aa61658122fa8faa5a437 upstream.
    
    The owner change bmbt scan that occurs during extent swap operations
    does not handle ordered buffer failures. Buffers that cannot be
    marked ordered must be physically logged so previously dirty ranges
    of the buffer can be relogged in the transaction.
    
    Since the bmbt scan may need to process and potentially log a large
    number of blocks, we can't expect to complete this operation in a
    single transaction. Update extent swap to use a permanent
    transaction with enough log reservation to physically log a buffer.
    Update the bmbt scan to physically log any buffers that cannot be
    ordered and to terminate the scan with -EAGAIN. On -EAGAIN, the
    caller rolls the transaction and restarts the scan. Finally, update
    the bmbt scan helper function to skip bmbt blocks that already match
    the expected owner so they are not reprocessed after scan restarts.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    [darrick: fix the xfs_trans_roll call]
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ada7d25113893034ec2e7b3da6e5978bf4e90d10
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:25 2017 -0700

    xfs: disallow marking previously dirty buffers as ordered
    
    commit a5814bceea48ee1c57c4db2bd54b0c0246daf54a upstream.
    
    Ordered buffers are used in situations where the buffer is not
    physically logged but must pass through the transaction/logging
    pipeline for a particular transaction. As a result, ordered buffers
    are not unpinned and written back until the transaction commits to
    the log. Ordered buffers have a strict requirement that the target
    buffer must not be currently dirty and resident in the log pipeline
    at the time it is marked ordered. If a dirty+ordered buffer is
    committed, the buffer is reinserted to the AIL but not physically
    relogged at the LSN of the associated checkpoint. The buffer log
    item is assigned the LSN of the latest checkpoint and the AIL
    effectively releases the previously logged buffer content from the
    active log before the buffer has been written back. If the tail
    pushes forward and a filesystem crash occurs while in this state, an
    inconsistent filesystem could result.
    
    It is currently the caller's responsibility to ensure an ordered
    buffer is not already dirty from a previous modification. This is
    unclear and error-prone outside of situations where it is guaranteed
    that the buffer has not been previously modified (such as new
    metadata allocations).
    
    To facilitate general purpose use of ordered buffers, update
    xfs_trans_ordered_buf() to conditionally order the buffer based on the
    state of the log item and return the result. If the
    bli is dirty, do not order the buffer and return false. The caller
    must either physically log the buffer (having acquired the
    appropriate log reservation) or push it from the AIL to clean it
    before it can be marked ordered in the current transaction.
    
    Note that ordered buffers are currently only used in two situations:
    1.) inode chunk allocation where previously logged buffers are not
    possible and 2.) extent swap which will be updated to handle ordered
    buffer failures in a separate patch.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
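
    Schematically, the updated interface (a sketch under the description
    above; helper names assumed):

        bool
        xfs_trans_ordered_buf(
                struct xfs_trans        *tp,
                struct xfs_buf          *bp)
        {
                struct xfs_buf_log_item *bip = bp->b_fspriv;

                /*
                 * A bli with previously logged (dirty) ranges cannot be
                 * ordered safely; the caller must relog or clean it first.
                 */
                if (xfs_buf_item_dirty_format(bip))
                        return false;

                bip->bli_flags |= XFS_BLI_ORDERED;
                return true;
        }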

commit cbf715dcb67a6be09c72622dcad96caeb4eee777
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:24 2017 -0700

    xfs: move bmbt owner change to last step of extent swap
    
    commit 6fb10d6d22094bc4062f92b9ccbcee2f54033d04 upstream.
    
    The extent swap operation currently resets bmbt block owners before
    the inode forks are swapped. The bmbt buffers are marked as ordered
    so they do not have to be physically logged in the transaction.
    
    This use of ordered buffers is not safe as bmbt buffers may have
    been previously physically logged. The bmbt owner change algorithm
    needs to be updated to physically log buffers that are already dirty
    when/if they are encountered. This means that an extent swap will
    eventually require multiple rolling transactions to handle large
    btrees. In addition, all inode related changes must be logged before
    the bmbt owner change scan begins and can roll the transaction for
    the first time to preserve fs consistency via log recovery.
    
    In preparation for such fixes to the bmbt owner change algorithm,
    refactor the bmbt scan out of the extent fork swap code to the last
    operation before the transaction is committed. Update
    xfs_swap_extent_forks() to only set the inode log flags when an
    owner change scan is necessary. Update xfs_swap_extents() to trigger
    the owner change based on the inode log flags. Note that since the
    owner change now occurs after the extent fork swap, the inode btrees
    must be fixed up with the inode number of the current inode (similar
    to log recovery).
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 1e4239ec33c99eb6c62e62690b6b389b9562447d
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:23 2017 -0700

    xfs: skip bmbt block ino validation during owner change
    
    commit 99c794c639a65cc7b74f30a674048fd100fe9ac8 upstream.
    
    Extent swap uses xfs_btree_visit_blocks() to fix up bmbt block
    owners on v5 (!rmapbt) filesystems. The bmbt scan uses
    xfs_btree_lookup_get_block() to read bmbt blocks which verifies the
    current owner of the block against the parent inode of the bmbt.
    This works during extent swap because the bmbt owners are updated to
    the opposite inode number before the inode extent forks are swapped.
    
    The modified bmbt blocks are marked as ordered buffers which allows
    everything to commit in a single transaction. If the transaction
    commits to the log and the system crashes such that recovery of the
    extent swap is required, log recovery restarts the bmbt scan to fix
    up any bmbt blocks that may have not been written back before the
    crash. The log recovery bmbt scan occurs after the inode forks have
    been swapped, however. This causes the bmbt block owner verification
    to fail, leading to log recovery failure and requiring xfs_repair to
    zap the log to recover.
    
    Define a new invalid inode owner flag to inform the btree block
    lookup mechanism that the current inode may be invalid with respect
    to the current owner of the bmbt block. Set this flag on the cursor
    used for change owner scans to allow this operation to work at
    runtime and during log recovery.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Fixes: bb3be7e7c ("xfs: check for bogus values in btree block headers")
    Cc: stable@vger.kernel.org
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit fbc889791c1eeb0ef25ed2c925aece828bbc1210
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:22 2017 -0700

    xfs: don't log dirty ranges for ordered buffers
    
    commit 8dc518dfa7dbd079581269e51074b3c55a65a880 upstream.
    
    Ordered buffers are attached to transactions and pushed through the
    logging infrastructure just like normal buffers with the exception
    that they are not actually written to the log. Therefore, we don't
    need to log dirty ranges of ordered buffers. xfs_trans_log_buf() is
    called on ordered buffers to set up all of the dirty state on the
    transaction, buffer and log item and prepare the buffer for I/O.
    
    Now that xfs_trans_dirty_buf() is available, call it from
    xfs_trans_ordered_buf() so the latter is now mutually exclusive with
    xfs_trans_log_buf(). This reflects the implementation of ordered
    buffers and helps eliminate confusion over the need to log ranges of
    ordered buffers just to set up internal log state.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b7b235c3980ec292e8c3f85209998b996dab858b
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:21 2017 -0700

    xfs: refactor buffer logging into buffer dirtying helper
    
    commit 9684010d38eccda733b61106765e9357cf436f65 upstream.
    
    xfs_trans_log_buf() is responsible for logging the dirty segments of
    a buffer along with setting all of the necessary state on the
    transaction, buffer, bli, etc., to ensure that the associated items
    are marked as dirty and prepared for I/O. We have a couple of use cases
    that need to dirty a buffer in a transaction without actually
    logging dirty ranges of the buffer.  One existing use case is
    ordered buffers, which are currently logged with arbitrary ranges to
    accomplish this even though the content of ordered buffers is never
    written to the log. Another pending use case is to relog an already
    dirty buffer across rolled transactions within the deferred
    operations infrastructure. This is required to prevent a held
    (XFS_BLI_HOLD) buffer from pinning the tail of the log.
    
    Refactor xfs_trans_log_buf() into a new function that contains all
    of the logic responsible for dirtying the transaction, lidp, buffer and
    bli. This new function can be used in the future for the use cases
    outlined above. This patch does not introduce functional changes.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
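
    In outline, the split looks like this (a sketch; the state-setting
    details are elided):

        /* dirty the transaction, lidp and bli without logging any range */
        void
        xfs_trans_dirty_buf(
                struct xfs_trans        *tp,
                struct xfs_buf          *bp)
        {
                /* ... set the dirty state on tp, the lidp and the bli ... */
        }

        void
        xfs_trans_log_buf(
                struct xfs_trans        *tp,
                struct xfs_buf          *bp,
                uint                    first,
                uint                    last)
        {
                xfs_trans_dirty_buf(tp, bp);
                /* only the range logging stays specific to this path */
                xfs_buf_item_log(bp->b_fspriv, first, last);
        }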

commit 2d095c97e7d08ff279e1978eb70b334075307edf
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:20 2017 -0700

    xfs: ordered buffer log items are never formatted
    
    commit e9385cc6fb7edf23702de33a2dc82965d92d9392 upstream.
    
    Ordered buffers pass through the logging infrastructure without ever
    being written to the log. The way this works is that the ordered
    buffer status is transferred to the log vector at commit time via
    the ->iop_size() callback. In xlog_cil_insert_format_items(),
    ordered log vectors bypass ->iop_format() processing altogether.
    
    Therefore it is unnecessary for xfs_buf_item_format() to handle
    ordered buffers. Remove the unnecessary logic and assert that an
    ordered buffer never reaches this point.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 44a98221dbc00cd8efc710a2325f21f8a9903ff6
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:19 2017 -0700

    xfs: remove unnecessary dirty bli format check for ordered bufs
    
    commit 6453c65d3576bc3e602abb5add15f112755c08ca upstream.
    
    xfs_buf_item_unlock() historically checked the dirty state of the
    buffer by manually checking the buffer log formats for dirty
    segments. The introduction of ordered buffers invalidated this check
    because ordered buffers have dirty bli's but no dirty (logged)
    segments. The check was updated to accommodate ordered buffers by
    looking at the bli state first and considering the blf only if the
    bli is clean.
    
    This logic is safe but unnecessary. There is no valid case where the
    bli is clean yet the blf has dirty segments. The bli is set dirty
    whenever the blf is logged (via xfs_trans_log_buf()) and the blf is
    cleared in the only place BLI_DIRTY is cleared (xfs_trans_binval()).
    
    Remove the conditional blf dirty checks and replace them with an assert
    that should catch any discrepancies between bli and blf dirty
    states. Refactor the old blf dirty check into a helper function to
    be used by the assert.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
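
    The refactored helper, roughly (a sketch; the bitmap-empty test is
    assumed to be a local helper):

        /* return true if any log format segment carries dirty ranges */
        bool
        xfs_buf_item_dirty_format(
                struct xfs_buf_log_item *bip)
        {
                int                     i;

                for (i = 0; i < bip->bli_format_count; i++) {
                        if (!xfs_bitmap_empty(bip->bli_formats[i].blf_data_map,
                                              bip->bli_formats[i].blf_map_size))
                                return true;
                }

                return false;
        }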

commit f9abe7f157588190777ebe60edd9b7f180047868
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:18 2017 -0700

    xfs: open-code xfs_buf_item_dirty()
    
    commit a4f6cf6b2b6b60ec2a05a33a32e65caa4149aa2b upstream.
    
    It checks a single flag and has one caller. It probably isn't worth
    its own function.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5a64bffc00fc4ac6032eeb31c49ca97a2d00d6a0
Author: Omar Sandoval <osandov@fb.com>
Date:   Sun Sep 17 14:06:17 2017 -0700

    xfs: check for race with xfs_reclaim_inode() in xfs_ifree_cluster()
    
    commit f2e9ad212def50bcf4c098c6288779dd97fff0f0 upstream.
    
    After xfs_ifree_cluster() finds an inode in the radix tree and verifies
    that the inode number is what it expected, xfs_reclaim_inode() can swoop
    in and free it. xfs_ifree_cluster() will then happily continue working
    on the freed inode. Most importantly, it will mark the inode stale,
    which will probably be overwritten when the inode slab object is
    reallocated, but if it has already been reallocated then we can end up
    with an inode spuriously marked stale.
    
    In 8a17d7ddedb4 ("xfs: mark reclaimed inodes invalid earlier") we added
    a second check to xfs_iflush_cluster() to detect this race, but the
    similar RCU lookup in xfs_ifree_cluster() needs the same treatment.
    
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 77c725212f6d18e2b10acfec44729336add719da
Author: Darrick J. Wong <darrick.wong@oracle.com>
Date:   Sun Sep 17 14:06:16 2017 -0700

    xfs: evict all inodes involved with log redo item
    
    commit 799ea9e9c59949008770aab4e1da87f10e99dbe4 upstream.
    
    When we introduced the bmap redo log items, we set MS_ACTIVE on the
    mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
    from being truncated prematurely during log recovery.  This also had the
    effect of putting linked inodes on the lru instead of evicting them.
    
    Unfortunately, we neglected to find all those unreferenced lru inodes
    and evict them after finishing log recovery, which means that we leak
    them if anything goes wrong in the rest of xfs_mountfs, because the lru
    is only cleaned out on unmount.
    
    Therefore, evict unreferenced inodes in the lru list immediately
    after clearing MS_ACTIVE.
    
    Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Cc: viro@ZenIV.linux.org.uk
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
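
    The fix, schematically (a sketch of the tail of log recovery setup):

        /*
         * Drop MS_ACTIVE again now that recovery is done and shake out
         * the unreferenced inodes recovery parked on the lru, so they
         * are not leaked if xfs_mountfs fails before unmount would
         * otherwise clean them up.
         */
        mp->m_super->s_flags &= ~MS_ACTIVE;
        evict_inodes(mp->m_super);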

commit d090458533b091ddb8d9e12b737fa3725bccb772
Author: Carlos Maiolino <cmaiolino@redhat.com>
Date:   Sun Sep 17 14:06:15 2017 -0700

    xfs: stop searching for free slots in an inode chunk when there are none
    
    commit 2d32311cf19bfb8c1d2b4601974ddd951f9cfd0b upstream.
    
    In a filesystem without finobt, the space manager selects an AG to
    allocate a new inode in, where xfs_dialloc_ag_inobt() will search the
    AG for an inode chunk with a free slot.
    
    When the new inode is in the same AG as its parent, the btree will be
    searched starting at the parent's record, and then retried from the top
    if no slot is available beyond the parent's record.
    
    To exit this loop though, xfs_dialloc_ag_inobt() relies on the fact
    that the btree must have a free slot available, since its callers
    relied on agi->freecount when deciding how/where to allocate this new
    inode.
    
    If agi->freecount is corrupted, showing available inodes in an AG when
    in fact there are none, this becomes an infinite loop.
    
    Add a way to stop the loop when a free slot is not found in the btree,
    making the function fall back to the whole-AG scan, which will then be
    able to detect the corruption and shut the filesystem down.
    
    As pointed out by Brian, this might impact performance, given that we
    no longer reset the search distance when we reach the end of the tree,
    giving it fewer tries before falling back to the whole-AG search; but
    it will only affect searches that start within 10 records of the end
    of the tree.
    
    Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5058c279e1113fb8b949dd8e56b2ff4bebce5cc5
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:14 2017 -0700

    xfs: handle -EFSCORRUPTED during head/tail verification
    
    commit a4c9b34d6a17081005ec459b57b8effc08f4c731 upstream.
    
    Torn write and tail overwrite detection both trigger only on
    -EFSBADCRC errors. While this is the most likely failure scenario
    for each condition, -EFSCORRUPTED is still possible in certain cases
    depending on what ends up on disk when a torn write or partial tail
    overwrite occurs. For example, an invalid log record h_len can lead
    to an -EFSCORRUPTED error when running the log recovery CRC pass.
    
    Therefore, update log head and tail verification to trigger the
    associated head/tail fixups in the event of -EFSCORRUPTED errors
    along with -EFSBADCRC. Also, -EFSCORRUPTED can currently be returned
    from xlog_do_recovery_pass() before rhead_blk is initialized if the
    first record encountered happens to be corrupted. This leads to an
    incorrect 'first_bad' return value. Initialize rhead_blk earlier in
    the function to address that problem as well.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit de4b95ce644e4c8527f8c244dc2e4afbdbf13ffa
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:13 2017 -0700

    xfs: fix log recovery corruption error due to tail overwrite
    
    commit 4a4f66eac4681378996a1837ad1ffec3a2e2981f upstream.
    
    If we consider the case where the tail (T) of the log is pinned long
    enough for the head (H) to push and block behind the tail, we can
    end up blocked in the following state without enough free space (f)
    in the log to satisfy a transaction reservation:
    
            0       phys. log       N
            [-------HffT---H'--T'---]
    
    The last good record in the log (before H) refers to T. The tail
    eventually pushes forward (T') leaving more free space in the log
    for writes to H. At this point, suppose space frees up in the log
    for the maximum of 8 in-core log buffers to start flushing out to
    the log. If this pushes the head from H to H', these next writes
    overwrite the previous tail T. This is safe because the items logged
    from T to T' have been written back and removed from the AIL.
    
    If the next log writes (H -> H') happen to fail and result in
    partial records in the log, the filesystem shuts down having
    overwritten T with invalid data. Log recovery correctly locates H on
    the subsequent mount, but H still refers to the now corrupted tail
    T. This results in log corruption errors and recovery failure.
    
    Since the tail overwrite results from otherwise correct runtime
    behavior, it is up to log recovery to try and deal with this
    situation. Update log recovery tail verification to run a CRC pass
    from the first record past the tail to the head. This facilitates
    error detection at T and moves the recovery tail to the first good
    record past H' (similar to truncating the head on torn write
    detection). If corruption is detected beyond the range possibly
    affected by the max number of iclogs, the log is legitimately
    corrupted and log recovery failure is expected.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 98cb20f8fac462c4f64cbdce7a7abab2b8ee9832
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:12 2017 -0700

    xfs: always verify the log tail during recovery
    
    commit 5297ac1f6d7cbf45464a49b9558831f271dfc559 upstream.
    
    Log tail verification currently only occurs when torn writes are
    detected at the head of the log. This was introduced because a
    change in the head block due to torn writes can lead to a change in
    the tail block (each log record header references the current tail)
    and the tail block should be verified before log recovery proceeds.
    
    Tail corruption is possible outside of torn write scenarios,
    however. For example, partial log writes can be detected and cleared
    during the initial head/tail block discovery process. If the partial
    write coincides with a tail overwrite, the log tail is corrupted and
    recovery fails.
    
    To facilitate correct handling of log tail overwrites, update log
    recovery to always perform tail verification. This is necessary to
    detect potential tail overwrite conditions when torn writes may not
    have occurred. This changes normal (i.e., no torn writes) recovery
    behavior slightly to detect and return CRC related errors near the
    tail before actual recovery starts.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6e6acab2f79abecfa7564ad19677bdb551628e49
Author: Brian Foster <bfoster@redhat.com>
Date:   Sun Sep 17 14:06:11 2017 -0700

    xfs: fix recovery failure when log record header wraps log end
    
    commit 284f1c2c9bebf871861184b0e2c40fa921dd380b upstream.
    
    The high-level log recovery algorithm consists of two loops that
    walk the physical log and process log records from the tail to the
    head. The first loop handles the case where the tail is beyond the
    head and processes records up to the end of the physical log. The
    subsequent loop processes records from the beginning of the physical
    log to the head.
    
    Because log records can wrap around the end of the physical log, the
    first loop mentioned above must handle this case appropriately.
    Records are processed from in-core buffers, which means that this
    algorithm must split the reads of such records into two partial
    I/Os: 1.) from the beginning of the record to the end of the log and
    2.) from the beginning of the log to the end of the record. This is
    further complicated by the fact that the log record header and log
    record data are read into independent buffers.
    
    The current handling of each buffer correctly splits the reads when
    either the header or data starts before the end of the log and wraps
    around the end. The data read does not correctly handle the case
    where the prior header read wrapped or ends on the physical log end
    boundary. blk_no is incremented to or beyond the log end after the
    header read to point to the record data, but the split data read
    logic triggers, attempts to read from an invalid log block and
    ultimately causes log recovery to fail. This can be reproduced
    fairly reliably via xfstests tests generic/047 and generic/388 with
    large iclog sizes (256k) and small (10M) logs.
    
    If the record header read has pushed beyond the end of the physical
    log, the subsequent data read is actually contiguous. Update the
    data read logic to detect the case where blk_no has wrapped, mod it
    against the log size to read from the correct address and issue one
    contiguous read for the log data buffer. The log record is processed
    as normal from the buffer(s), the loop exits after the current
    iteration and the subsequent loop picks up with the first new record
    after the start of the log.
    
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 77ad1533ea06cc93c98afdb22a5343e51beeca2f
Author: Carlos Maiolino <cmaiolino@redhat.com>
Date:   Sun Sep 17 14:06:10 2017 -0700

    xfs: Properly retry failed inode items in case of error during buffer writeback
    
    commit d3a304b6292168b83b45d624784f973fdc1ca674 upstream.
    
    When a buffer fails during writeback, the inode items in it are kept
    flush locked and are never resubmitted because of the flush lock, so,
    if any buffer fails to be written, the items in the AIL are never
    written to disk and never unlocked.
    
    This causes unmount operations to hang due to these flush-locked items
    in the AIL, but it also means the items in the AIL are never written
    back, even when the IO device comes back to normal.
    
    I've been testing this patch with a DM-thin device, creating a
    filesystem larger than the real device.
    
    When writing enough data to fill the DM-thin device, XFS receives
    ENOSPC errors from the device and keeps spinning in xfsaild (when the
    'retry forever' configuration is set).
    
    At this point, the filesystem cannot be unmounted because of the
    flush-locked items in the AIL; worse, those items are never retried at
    all (since xfs_inode_item_push() skips items that are flush locked),
    even if the underlying DM-thin device is expanded to the proper size.
    
    This patch fixes both cases, retrying any item that has failed
    previously, using the infrastructure provided by the previous patch.
    
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 7d8cd53508780a9c24973cb735cea4a12d436ca3
Author: Carlos Maiolino <cmaiolino@redhat.com>
Date:   Sun Sep 17 14:06:09 2017 -0700

    xfs: Add infrastructure needed for error propagation during buffer IO failure
    
    commit 0b80ae6ed13169bd3a244e71169f2cc020b0c57a upstream.
    
    With the current code, XFS never re-submits a failed buffer for IO,
    because the failed item in the buffer is kept in the flush-locked state
    forever.
    
    To be able to resubmit a log item for IO, we need a way to mark an item
    as failed if, for any reason, the buffer to which the item belongs
    fails during writeback.
    
    Add a new log item callback to be used after an IO completion failure
    and make the needed cleanups.
    
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 2e537a0b2474d1aa274b68d5def09ac93c240e19
Author: Eric Sandeen <sandeen@sandeen.net>
Date:   Sun Sep 17 14:06:08 2017 -0700

    xfs: toggle readonly state around xfs_log_mount_finish
    
    commit 6f4a1eefdd0ad4561543270a7fceadabcca075dd upstream.
    
    When we do log recovery on a readonly mount, unlinked inode
    processing does not happen due to the readonly checks in
    xfs_inactive(), which are trying to prevent any I/O on a
    readonly mount.
    
    This is misguided - we do I/O on readonly mounts all the time,
    for consistency; for example, log recovery.  So do the same
    RDONLY flag twiddling around xfs_log_mount_finish() as we
    do around xfs_log_mount(), for the same reason.
    
    This all cries out for a big rework but for now this is a
    simple fix to an obvious problem.
    
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
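
    The twiddling, schematically (a sketch mirroring what is already done
    around xfs_log_mount()):

        bool    readonly = (mp->m_flags & XFS_MOUNT_RDONLY);

        if (readonly)
                mp->m_flags &= ~XFS_MOUNT_RDONLY;   /* allow recovery IO */

        error = xfs_log_mount_finish(mp);   /* unlinked inode processing */

        if (readonly)
                mp->m_flags |= XFS_MOUNT_RDONLY;    /* restore */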

commit cc84db7bceabaeea0e2ad9ebf10780a7c09152d7
Author: Eric Sandeen <sandeen@sandeen.net>
Date:   Sun Sep 17 14:06:07 2017 -0700

    xfs: write unmount record for ro mounts
    
    commit 757a69ef6cf2bf839bd4088e5609ddddd663b0c4 upstream.
    
    There are dueling comments in the xfs code about intent
    for log writes when unmounting a readonly filesystem.
    
    In xfs_mountfs, we see the intent:
    
    /*
     * Now the log is fully replayed, we can transition to full read-only
     * mode for read-only mounts. This will sync all the metadata and clean
     * the log so that the recovery we just performed does not have to be
     * replayed again on the next mount.
     */
    
    and it calls xfs_quiesce_attr(), but by the time we get to
    xfs_log_unmount_write(), it returns early for a RDONLY mount:
    
     * Don't write out unmount record on read-only mounts.
    
    Because of this, sequential ro mounts of a filesystem with
    a dirty log will replay the log each time, which seems odd.
    
    Fix this by writing an unmount record even for RO mounts, as long
    as norecovery wasn't specified (don't write a clean log record
    if a dirty log may still be there!) and the log device is
    writable.
    
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 77b393afe72bc8141a6d2f3fecdce3f20a8b958d
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Thu Aug 31 15:41:55 2017 -0700

    libnvdimm: fix integer overflow static analysis warning
    
    commit 58738c495e15badd2015e19ff41f1f1ed55200bc upstream.
    
    Dan reports:
        The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for
        nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
        following static checker warning:
    
                drivers/nvdimm/bus.c:1018 __nd_ioctl()
                warn: integer overflows 'buf_len'
    
        From a casual review, this seems like it might be a real bug.  On
        the first iteration we load some data into in_env[].  On the second
        iteration we read a user-controlled "in_size" from nd_cmd_in_size().
        It can go up to UINT_MAX - 1.  A high number means we will fill the
        whole in_env[] buffer.  But we potentially keep looping and adding
        more to in_len so now it can be any value.
    
        It's simple enough to change, but it feels weird that we keep looping
        even though in_env is totally full.  Shouldn't we just return an
        error if we don't have space for desc->in_num?
    
    We keep looping because the size of the total input is allowed to be
    bigger than the 'envelope', which is a subset of the payload that tells
    us how much data to expect. For safety, explicitly check that buf_len
    does not overflow, which is what the checker flagged.
    
    Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..."
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
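
    The essence of the check (a sketch, not the exact patch): do the
    accumulation in a wider type and reject before trusting buf_len:

        /*
         * in_len/out_len accumulate user-controlled sizes from
         * nd_cmd_in_size()/nd_cmd_out_size(); sum in 64 bits so the
         * bound check cannot itself be defeated by wraparound.
         */
        u64 total = (u64)in_len + (u64)out_len;

        if (total > ND_IOCTL_MAX_BUFLEN)
                return -EINVAL;
        buf_len = total;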

commit d167685f48672e738c11e862c528837d35d92ff3
Author: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Date:   Sun Aug 27 08:30:34 2017 +0200

    libnvdimm, btt: check memory allocation failure
    
    commit ed36b4dba54a421ce5551638f6a9790b2c2116b1 upstream.
    
    Check memory allocation failures and return -ENOMEM in such cases, as
    is already done a few lines below for another memory allocation.
    
    This avoids a NULL pointer dereference.
    
    Fixes: 14e494542636 ("libnvdimm, btt: BTT updates for UEFI 2.7 format")
    Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
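
    The shape of the fix (a sketch; the BTT superblock allocation as
    described):

        struct btt_sb *super;

        super = kzalloc(sizeof(*super), GFP_KERNEL);
        if (!super)
                return -ENOMEM;         /* was dereferenced unchecked */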

commit a6cd7f34d798814844c3f5a27f48dbf770773759
Author: Eric Biggers <ebiggers@google.com>
Date:   Wed Sep 13 16:28:11 2017 -0700

    idr: remove WARN_ON_ONCE() when trying to replace negative ID
    
    commit a47f68d6a944113bdc8097db6f933c2e17c27bf9 upstream.
    
    IDR only supports non-negative IDs.  There used to be a 'WARN_ON_ONCE(id <
    0)' in idr_replace(), but it was intentionally removed by commit
    2e1c9b286765 ("idr: remove WARN_ON_ONCE() on negative IDs").
    
    Then it was added back by commit 0a835c4f090a ("Reimplement IDR and IDA
    using the radix tree").  However it seems that adding it back was a
    mistake, given that some users such as drm_gem_handle_delete()
    (DRM_IOCTL_GEM_CLOSE) pass in a value from userspace to idr_replace(),
    allowing the WARN_ON_ONCE to be triggered.  drm_gem_handle_delete()
    actually just wants idr_replace() to return an error code if the ID is
    not allocated, including in the case where the ID is invalid (negative).
    
    So once again remove the bogus WARN_ON_ONCE().
    
    This bug was found by syzkaller, which encountered the following
    warning:
    
        WARNING: CPU: 3 PID: 3008 at lib/idr.c:157 idr_replace+0x1d8/0x240 lib/idr.c:157
        Kernel panic - not syncing: panic_on_warn set ...
    
        CPU: 3 PID: 3008 Comm: syzkaller218828 Not tainted 4.13.0-rc4-next-20170811 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
         do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
         do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
         do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
         do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
         invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:930
        RIP: 0010:idr_replace+0x1d8/0x240 lib/idr.c:157
        RSP: 0018:ffff8800394bf9f8 EFLAGS: 00010297
        RAX: ffff88003c6c60c0 RBX: 1ffff10007297f43 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800394bfa78
        RBP: ffff8800394bfae0 R08: ffffffff82856487 R09: 0000000000000000
        R10: ffff8800394bf9a8 R11: ffff88006c8bae28 R12: ffffffffffffffff
        R13: ffff8800394bfab8 R14: dffffc0000000000 R15: ffff8800394bfbc8
         drm_gem_handle_delete+0x33/0xa0 drivers/gpu/drm/drm_gem.c:297
         drm_gem_close_ioctl+0xa1/0xe0 drivers/gpu/drm/drm_gem.c:671
         drm_ioctl_kernel+0x1e7/0x2e0 drivers/gpu/drm/drm_ioctl.c:729
         drm_ioctl+0x72e/0xa50 drivers/gpu/drm/drm_ioctl.c:825
         vfs_ioctl fs/ioctl.c:45 [inline]
         do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
         SYSC_ioctl fs/ioctl.c:700 [inline]
         SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
         entry_SYSCALL_64_fastpath+0x1f/0xbe
    
    Here is a C reproducer:
    
        #include <fcntl.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <drm/drm.h>
    
        int main(void)
        {
                int cardfd = open("/dev/dri/card0", O_RDONLY);
    
                ioctl(cardfd, DRM_IOCTL_GEM_CLOSE,
                      &(struct drm_gem_close) { .handle = -1 } );
        }
    
    Link: http://lkml.kernel.org/r/20170906235306.20534-1-ebiggers3@gmail.com
    Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 343b0d82c25cf481ed4d97f1c044d56d40c74e24
Author: Miklos Szeredi <mszeredi@redhat.com>
Date:   Tue Sep 12 16:57:53 2017 +0200

    fuse: allow server to run in different pid_ns
    
    commit 5d6d3a301c4e749e04be6fcdcf4cb1ffa8bae524 upstream.
    
    Commit 0b6e9ea041e6 ("fuse: Add support for pid namespaces") broke
    Sandstorm.io development tools, which have been sending FUSE file
    descriptors across PID namespace boundaries since early 2014.
    
    The above patch added a check that prevented I/O on the fuse device file
    descriptor if the pid namespace of the reader/writer was different from the
    pid namespace of the mounter.  With this change passing the device file
    descriptor to a different pid namespace simply doesn't work.  The check was
    added because pids are transferred to/from the fuse userspace server in the
    namespace registered at mount time.
    
    To fix this regression, remove the checks and do the following:
    
    1) the pid in the request header (the pid of the task that initiated the
    filesystem operation) is translated to the reader's pid namespace.  If a
    mapping doesn't exist for this pid, then a zero pid is used.  Note: even
    if a mapping existed between the initiator task's pid namespace and the
    reader's pid namespace, the pid will be zero if either the mapping from
    the initiator's to the mounter's namespace or the mapping from the
    mounter's to the reader's namespace doesn't exist.
    
    2) The lk.pid value in setlk/setlkw requests and getlk reply is left alone.
    Userspace should not interpret this value anyway.  Also allow the
    setlk/setlkw operations if the pid of the task cannot be represented in the
    mounter's namespace (pid being zero in that case).
    
    Reported-by: Kenton Varda <kenton@sandstorm.io>
    Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    Fixes: 0b6e9ea041e6 ("fuse: Add support for pid namespaces")
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Seth Forshee <seth.forshee@canonical.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
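
    Point 1, schematically, at device-read time (a sketch; the stored
    struct pid field is illustrative):

        /*
         * Translate the initiator's pid into the reader's pid namespace;
         * pid_nr_ns() returns 0 when no mapping exists, which is exactly
         * the fallback described above.
         */
        req->in.h.pid = pid_nr_ns(req->pid,
                                  task_active_pid_ns(current));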

commit 9a023afa8c7bd3200ec9a8c51803e478dd9b926b
Author: Amir Goldstein <amir73il@gmail.com>
Date:   Mon Sep 11 16:30:15 2017 +0300

    ovl: fix false positive ESTALE on lookup
    
    commit 939ae4efd51c627da270af74ef069db5124cb5b0 upstream.
    
    Commit b9ac5c274b8c ("ovl: hash overlay non-dir inodes by copy up origin")
    verifies that the origin lower inode stored in the overlayfs inode
    matches the inode of a copy up origin dentry found by lookup.
    
    There is a false positive result in that check when the lower fs does
    not support file handles and the copy up origin cannot be followed by
    file handle at lookup time.
    
    The false positive happens when finding an overlay inode in cache on a
    copied up overlay dentry lookup. The overlay inode still 'remembers' the
    copy up origin inode, but the copy up origin dentry is not available for
    verification.
    
    Relax the check in case copy up origin dentry is not available.
    
    Fixes: b9ac5c274b8c ("ovl: hash overlay non-dir inodes by copy up...")
    Reported-by: Jordi Pujol <jordipujolp@gmail.com>
    Signed-off-by: Amir Goldstein <amir73il@gmail.com>
    Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 6506d1d7d0c209bdba330b6c2e9cfab7174d7cba
Author: Tony Luck <tony.luck@intel.com>
Date:   Wed Aug 16 10:18:03 2017 -0700

    x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
    
    commit ce0fa3e56ad20f04d8252353dcd24e924abdafca upstream.
    
    Speculative processor accesses may reference any memory that has a
    valid page table entry.  While a speculative access won't generate
    a machine check, it will log the error in a machine check bank. That
    could cause escalation of a subsequent error since the overflow bit
    will be then set in the machine check bank status register.
    
    Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
    address of the page we want to map out otherwise we may trigger the
    very problem we are trying to avoid.  We use a non-canonical address
    that passes through the usual Linux table walking code to get to the
    same "pte".
    
    Thanks to Dave Hansen for reviewing several iterations of this.
    
    Also see:
    
      http://marc.info/?l=linux-mm&m=149860136413338&w=2
    
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Borislav Petkov <bp@suse.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5b7b3fe58c813a2970bb221fa3f56165d5abf9f1
Author: Andy Lutomirski <luto@kernel.org>
Date:   Tue Aug 1 07:11:37 2017 -0700

    x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
    
    commit e137a4d8f4dd2e277e355495b6b2cb241a8693c3 upstream.
    
    Switching FS and GS is a mess, and the current code is still subtly
    wrong: it assumes that "Loading a nonzero value into FS sets the
    index and base", which is false on AMD CPUs if the value being
    loaded is 1, 2, or 3.
    
    (The current code came from commit 3e2b68d752c9 ("x86/asm,
    sched/x86: Rewrite the FS and GS context switch code"), which made
    it better but didn't fully fix it.)
    
    Rewrite it to be much simpler and more obviously correct.  This
    should fix it fully on AMD CPUs and shouldn't adversely affect
    performance.
    
    Signed-off-by: Andy Lutomirski <luto@kernel.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Borislav Petkov <bpetkov@suse.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Chang Seok <chang.seok.bae@intel.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 9f79ec9876637347812b905574526065a26236e9
Author: Andy Lutomirski <luto@kernel.org>
Date:   Tue Aug 1 07:11:35 2017 -0700

    x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
    
    commit 9584d98bed7a7a904d0702ad06bbcc94703cb5b4 upstream.
    
    In ELF_COPY_CORE_REGS, we're copying from the current task, so
    accessing thread.fsbase and thread.gsbase makes no sense.  Just read
    the values from the CPU registers.
    
    In practice, the old code would have been correct most of the time
    simply because thread.fsbase and thread.gsbase usually matched the
    CPU registers.
    
    Signed-off-by: Andy Lutomirski <luto@kernel.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Borislav Petkov <bpetkov@suse.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Chang Seok <chang.seok.bae@intel.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3f13b64cf58fb99e28834bfcbc4d6bcb7de9fe49
Author: Andy Lutomirski <luto@kernel.org>
Date:   Tue Aug 1 07:11:34 2017 -0700

    x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
    
    commit 767d035d838f4fd6b5a5bbd7a3f6d293b7f65a49 upstream.
    
    execve used to leak FSBASE and GSBASE on AMD CPUs.  Fix it.
    
    The security impact of this bug is small but not quite zero -- it
    could weaken ASLR when a privileged task execs a less privileged
    program, but only if program changed bitness across the exec, or the
    child binary was highly unusual or actively malicious.  A child
    program that was compromised after the exec would not have access to
    the leaked base.
    
    Signed-off-by: Andy Lutomirski <luto@kernel.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Borislav Petkov <bpetkov@suse.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Chang Seok <chang.seok.bae@intel.com>
    Cc: Denys Vlasenko <dvlasenk@redhat.com>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c4e91edabc9e0ab351c52772cd0a6ddb5018878d
Author: Bernat, Yehezkel <yehezkel.bernat@intel.com>
Date:   Tue Aug 15 08:19:20 2017 +0300

    thunderbolt: Allow clearing the key
    
    commit e545f0d8a54a9594fe604d67d80ca6fddf72ca59 upstream.
    
    If secure authentication of a devices fails, either because the device
    already has another key uploaded, or there is some other error sending
    challenge to the device, and the user only wants to approve the device
    just once (without a new key being uploaded to the device) the current
    implementation does not allow this because the key cannot be cleared
    once set even if we allow it to be changed.
    
    Make this scenario possible and allow clearing the key by writing an
    empty string to the key sysfs file.
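
    A sketch of the store path (handler and field names assumed):

        /* In the key_store() sysfs handler: an empty write clears the key */
        if (!strcmp(buf, "\n")) {
                kfree(sw->key);
                sw->key = NULL;
                return count;
        }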
    
    Signed-off-by: Yehezkel Bernat <yehezkel.bernat@intel.com>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 24ed5fd65f7fdc45caeff7edf5f5a95b8d66dedb
Author: Bernat, Yehezkel <yehezkel.bernat@intel.com>
Date:   Tue Aug 15 08:19:12 2017 +0300

    thunderbolt: Make key root-only accessible
    
    commit 0956e41169222822d3557871fcd1d32e4fa7e934 upstream.
    
    A non-root user may read the key back after root has written it there.
    This removes read access for everyone but root.
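
    The change amounts to tightening the sysfs mode, roughly:

        /* 0600 instead of 0644: only root may read the key back */
        static DEVICE_ATTR(key, 0600, key_show, key_store);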
    
    Signed-off-by: Yehezkel Bernat <yehezkel.bernat@intel.com>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b92e97e6e5d3ae7bf4472a59f9b40ee589e76ed2
Author: Bernat, Yehezkel <yehezkel.bernat@intel.com>
Date:   Tue Aug 15 08:19:01 2017 +0300

    thunderbolt: Remove superfluous check
    
    commit 8fdd6ab36197ad891233572c57781b1f537da0ac upstream.
    
    The key size is already tested by hex2bin() (as '\0' isn't a hex
    digit).
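
    In other words (sketch): the conversion itself already rejects a
    short or malformed key, so no separate length check is needed:

        if (hex2bin(key, buf, sizeof(key)))   /* fails on '\0' and non-hex */
                return -EINVAL;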
    
    Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
    Signed-off-by: Yehezkel Bernat <yehezkel.bernat@intel.com>
    Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 214d7f6b7e5ff16f2a694ed9d1627650c50bdacb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat Aug 12 21:33:23 2017 -0700

    f2fs: check hot_data for roll-forward recovery
    
    commit 125c9fb1ccb53eb2ea9380df40f3c743f3fb2fed upstream.
    
    We need to check HOT_DATA to truncate any previous data block when doing
    roll-forward recovery.
    
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit be009dd06669d6df7c038e0566f8ce0901460e24
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Aug 10 17:35:04 2017 -0700

    f2fs: let fill_super handle roll-forward errors
    
    commit afd2b4da40b3b567ef8d8e6881479345a2312a03 upstream.
    
    If we set CP_ERROR_FLAG on a roll-forward error, f2fs can no longer
    proceed with any IO due to f2fs_cp_error(). But if, for example, some
    stale data is involved in the roll-forward process, we can get -ENOENT
    and the fs gets stuck. If we get any error, let fill_super set
    SBI_NEED_FSCK and try to recover back to a stable point.
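
    A sketch of the error handling described above (the label and flag
    plumbing in fill_super() are assumed):

        err = recover_fsync_data(sbi, false);
        if (err < 0) {
                need_fsck = true;       /* leads to SBI_NEED_FSCK */
                goto free_meta;         /* fall back to a stable checkpoint */
        }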
    
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ffe6e0e9c2bf12c316d71d661621ff602b3a162e
Author: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date:   Thu Sep 7 14:08:34 2017 +0800

    ip_tunnel: fix setting ttl and tos value in collect_md mode

    [ Upstream commit 0f693f1995cf002432b70f43ce73f79bf8d0b6c9 ]
    
    The ttl and tos variables are declared and assigned, but are not used
    in the iptunnel_xmit() call.
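
    A sketch of the fix: hand the computed values to iptunnel_xmit()
    (variable names assumed from the surrounding function):

        iptunnel_xmit(NULL, rt, skb, fl4.saddr, fl4.daddr, proto,
                      tos, ttl, df, !net_eq(tunnel->net, dev_net(dev)));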
    
    Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel")
    Cc: Alexei Starovoitov <ast@fb.com>
    Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 35c2f174d191555d80f93915febda00f3bc7163c
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Sep 8 12:44:47 2017 -0700

    tcp: fix a request socket leak

    [ Upstream commit 1f3b359f1004bd34b7b0bad70b93e3c7af92a37b ]
    
    While the cited commit fixed a possible deadlock, it added a leak
    of the request socket, since reqsk_put() must be called if the BPF
    filter decided the ACK packet must be dropped.
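
    Conceptually, the drop path must release that reference (sketch):

        if (tcp_filter(sk, skb)) {
                reqsk_put(req);         /* was missing: drop the request
                                         * socket reference before bailing */
                goto discard_and_relse;
        }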
    
    Fixes: d624d276d1dd ("tcp: fix possible deadlock in TCP stack vs BPF filter")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3b97e138dd9e242ad208e0fd48d4e1a3a61a8031
Author: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date:   Fri Sep 8 11:35:21 2017 -0300

    sctp: fix missing wake ups in some situations

    [ Upstream commit 7906b00f5cd1cd484fced7fcda892176e3202c8a ]
    
    Commit fb586f25300f ("sctp: delay calls to sk_data_ready() as much as
    possible") minimized the number of wake ups that are triggered in case
    the association receives a packet with multiple data chunks on it
    and/or when io_events are enabled. Commit 0970f5b36659 ("sctp: signal
    sk_data_ready earlier on data chunks reception") then moved the wake
    up to as soon as possible. It thus relies on the state machine running
    later to clear the flag indicating that the event was already
    generated.

    The issue is that there are 2 call paths that call
    sctp_ulpq_tail_event() outside of the state machine, causing the flag
    to linger and possibly omitting a needed wake up in the sequence.
    
    One of the call paths is when enabling SCTP_SENDER_DRY_EVENTS via
    setsockopt(SCTP_EVENTS), as noticed by Harald Welte. The other is when
    partial reliability triggers removal of chunks from the send queue when
    the application calls sendmsg().
    
    This commit fixes it by not setting the flag in case the socket is not
    owned by the user, as it won't be cleaned later. This works for
    user-initiated calls and also for rx path processing.
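
    A sketch of the resulting logic in sctp_ulpq_tail_event() (field
    names per struct sctp_sock assumed):

        if (!sp->data_ready_signalled) {
                /* only the state machine, run with the socket owned by
                 * the user, will clear the flag again */
                if (sock_owned_by_user(sk))
                        sp->data_ready_signalled = 1;
                sk->sk_data_ready(sk);
        }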
    
    Fixes: fb586f25300f ("sctp: delay calls to sk_data_ready() as much as possible")
    Reported-by: Harald Welte <laforge@gnumonks.org>
    Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c5b047b1a55ea111d9c3200eab6d7f6310c99d92
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Sep 8 15:48:47 2017 -0700

    ipv6: fix typo in fib6_net_exit()

    [ Upstream commit 32a805baf0fb70b6dbedefcd7249ac7f580f9e3b ]
    
    IPv6 FIB should use FIB6_TABLE_HASHSZ, not FIB_TABLE_HASHSZ.
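
    The one-character class of fix, for illustration:

        for (i = 0; i < FIB6_TABLE_HASHSZ; i++)   /* was: FIB_TABLE_HASHSZ */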
    
    Fixes: ba1cc08d9488 ("ipv6: fix memory leak with multiple tables during netns destruction")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c352fb6adc1095411aac7c87f57d43ee00034e64
Author: Sabrina Dubroca <sd@queasysnail.net>
Date:   Fri Sep 8 10:26:19 2017 +0200

    ipv6: fix memory leak with multiple tables during netns destruction

    [ Upstream commit ba1cc08d9488c94cb8d94f545305688b72a2a300 ]
    
    fib6_net_exit only frees the main and local tables. If another table was
    created with fib6_alloc_table, we leak it when the netns is destroyed.
    
    Fix this in the same way ip_fib_net_exit cleans up tables, by walking
    through the whole hashtable of fib6_table structures. We can get rid
    of the special cases for local and main, since they're also part of
    the hashtable.
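
    A sketch of the resulting teardown loop (the free helper name is
    assumed):

        for (i = 0; i < FIB6_TABLE_HASHSZ; i++) {
                struct hlist_head *head = &net->ipv6.fib_table_hash[i];
                struct hlist_node *tmp;
                struct fib6_table *tb;

                hlist_for_each_entry_safe(tb, tmp, head, tb6_hlist) {
                        hlist_del(&tb->tb6_hlist);
                        fib6_free_table(tb);    /* frees main, local and
                                                 * any extra tables alike */
                }
        }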
    
    Reproducer:
        ip netns add x
        ip -net x -6 rule add from 6003:1::/64 table 100
        ip netns del x
    
    Reported-by: Jianlin Shi <jishi@redhat.com>
    Fixes: 58f09b78b730 ("[NETNS][IPV6] ip6_fib - make it per network namespace")
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit f8581386693fd6d19de2e18c81fcb58575fd9947
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Sep 6 14:44:36 2017 +0200

    udp: drop head states only when all skb references are gone

    [ Upstream commit ca2c1418efe9f7fe37aa1f355efdf4eb293673ce ]
    
    After commit 0ddf3fb2c43d ("udp: preserve skb->dst if required
    for IP options processing") we clear the skb head state as soon
    as the skb carrying them is first processed.
    
    Since the same skb can be processed several times when MSG_PEEK
    is used, we can end up lacking the required head states, and
    eventually oopsing.
    
    Fix this by clearing the skb head state only when processing the
    last skb reference.
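
    The principle, as a sketch: only the caller dropping the last
    reference may tear down the head state:

        if (!skb_unref(skb))
                return;                  /* a MSG_PEEK user still holds it */
        skb_release_head_state(skb);     /* now safe to drop dst et al. */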
    
    Reported-by: Eric Dumazet <edumazet@google.com>
    Fixes: 0ddf3fb2c43d ("udp: preserve skb->dst if required for IP options processing")
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit c9a166bef42a189f82bfe47575c344064832a166
Author: Xin Long <lucien.xin@gmail.com>
Date:   Tue Sep 5 17:26:33 2017 +0800

    ip6_gre: update mtu properly in ip6gre_err

    [ Upstream commit 5c25f30c93fdc5bf25e62101aeaae7a4f9b421b3 ]
    
    Currently, when processing ICMPV6_PKT_TOOBIG, ip6gre_err only
    subtracts the offset of the gre header from the mtu info. The gre
    header should also be subtracted from the expected mtu of the gre
    device. Otherwise, subsequent packets still can't be sent out.
    
    Jianlin found this issue when using the following topology:
      client(ip6gre)<---->(nic1)route(nic2)<----->(ip6gre)server

    After reducing nic2's mtu, both tcp and sctp performance with large
    payloads dropped to 0.
    
    This patch fixes it by also subtracting the gre header (tun->tun_hlen)
    from the mtu info when updating the gre device's mtu in ip6gre_err().
    ETH_HLEN also needs to be subtracted if the gre device's type is
    ARPHRD_ETHER.
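
    A sketch of the mtu computation in ip6gre_err() after the fix:

        mtu = be32_to_cpu(info) - offset - t->tun_hlen;
        if (t->dev->type == ARPHRD_ETHER)
                mtu -= ETH_HLEN;
        if (mtu < IPV6_MIN_MTU)
                mtu = IPV6_MIN_MTU;
        t->dev->mtu = mtu;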
    
    Reported-by: Jianlin Shi <jishi@redhat.com>
    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b076d2518599fb5ea8ade23be2e04f326fa85a57
Author: Jason Wang <jasowang@redhat.com>
Date:   Tue Sep 5 09:22:05 2017 +0800

    vhost_net: correctly check tx avail during rx busy polling

    [ Upstream commit 8b949bef9172ca69d918e93509a4ecb03d0355e0 ]
    
    In the past we checked tx avail through vhost_enable_notify(), which
    is wrong since it only checks whether the guest has filled more
    available buffers since the last avail idx synchronization, which was
    just done by vhost_vq_avail_empty() right before. What we really want
    is to check for pending buffers in the avail ring. Fix this by
    calling vhost_vq_avail_empty() instead.
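
    A sketch of the end of the polling loop (names per drivers/vhost/net.c
    assumed):

        /* queue TX work iff buffers are actually pending in the avail
         * ring, instead of vhost_enable_notify()'s weaker check */
        if (!vhost_vq_avail_empty(&net->dev, vq))
                vhost_poll_queue(&vq->poll);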
    
    This issue can be noticed by running the netperf TCP_RR benchmark as
    a client from the guest (but not the host). With this fix, TCP_RR
    from guest to localhost is restored from 1375.91 to 55235.28
    transactions per second on my laptop (Intel(R) Core(TM) i7-5600U CPU
    @ 2.60GHz).
    
    Fixes: 030881372460 ("vhost_net: basic polling support")
    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 80b25b4bb2e27e08380d1ba943863395bbb75086
Author: Claudiu Manoil <claudiu.manoil@nxp.com>
Date:   Mon Sep 4 10:45:28 2017 +0300

    gianfar: Fix Tx flow control deactivation

    [ Upstream commit 5d621672bc1a1e5090c1ac5432a18c79e0e13e03 ]
    
    The wrong register is checked for the Tx flow control bit: it should
    have been maccfg1, not maccfg2.
    This went unnoticed for so long probably because the impact is hardly
    visible, not to mention the tangled code in adjust_link().
    First, link flow control (i.e. handling of Rx/Tx link level pause
    frames) is disabled by default (it needs to be enabled via
    'ethtool -A').
    Secondly, maccfg2 always returns 0 for tx_flow_oldval (except on a few
    old boards), which results in Tx flow control remaining always on once
    activated.
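
    The gist of the fix (register and flag names per the gianfar driver):

        u32 tempval1 = gfar_read(&regs->maccfg1);        /* holds TX_FLOW */
        tx_flow_oldval = (tempval1 & MACCFG1_TX_FLOW);   /* previously read
                                                          * from maccfg2 */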
    
    Fixes: 45b679c9a3ccd9e34f28e6ec677b812a860eb8eb ("gianfar: Implement PAUSE frame generation support")
    Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ecb26e815aaa5f8c6a6f060f945856b7b472132b
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Fri Sep 1 11:26:13 2017 +0200

    Revert "net: fix percpu memory leaks"
    
    
    [ Upstream commit 5a63643e583b6a9789d7a225ae076fb4e603991c ]
    
    This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb.
    
    After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API
    for fragmentation mem accounting") there is no need for this fix-up
    patch.  As percpu_counter is no longer used, it can no longer leak
    memory.
    
    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 021c60ff0c79e5d549ca84e93e9661b03865e1b9
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Fri Sep 1 11:26:08 2017 +0200

    Revert "net: use lib/percpu_counter API for fragmentation mem accounting"
    
    
    [ Upstream commit fb452a1aa3fd4034d7999e309c5466ff2d7005aa ]
    
    This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2.
    
    There is a bug in the fragmentation code's use of the percpu_counter
    API that can cause issues on systems with many CPUs.

    frag_mem_limit() just reads the global counter (fbc->count), without
    considering that other CPUs can each hold up to the batch size (130K)
    of updates that haven't been subtracted yet.  Due to the 3MBytes lower
    thresh limit, this becomes dangerous at >=24 CPUs
    (3*1024*1024/130000=24).
    
    The correct API usage would be __percpu_counter_compare(), which does
    the right thing: it takes the number of (online) CPUs and the batch
    size into account, and calls __percpu_counter_sum() when needed.
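
    For illustration, correct usage would look roughly like (counter and
    threshold fields per struct netns_frags assumed):

        if (__percpu_counter_compare(&nf->mem, nf->high_thresh,
                                     percpu_counter_batch) > 0)
                return true;    /* over limit, per-CPU slack accounted for */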
    
    We choose to revert the use of the lib/percpu_counter API for frag
    memory accounting for several reasons:
    
    1) On systems with more than 24 CPUs, the heavier, fully locked
       __percpu_counter_sum() is always invoked, which is more expensive
       than the atomic_t that is reverted to.

    Given that systems with more than 24 CPUs are becoming common, this
    doesn't seem like a good option.  To mitigate this, the batch size
    could be decreased and the thresh increased.
    
    2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
       CPU, before SKBs are pushed into sockets on remote CPUs.  Given
       that NICs can only hash on the L2 part of the IP-header, the
       NIC-RXq's will likely be limited.  Thus, there is a fair chance
       that the atomic add+dec happen on the same CPU.
    
    Note for the revert: commit 1d6119baf061 ("net: fix percpu memory
    leaks") removed init_frag_mem_limit() and instead used
    inet_frags_init_net().  After this revert, inet_frags_uninit_net()
    becomes empty.
    
    Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting")
    Fixes: 1d6119baf061 ("net: fix percpu memory leaks")
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Acked-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>