Bug 1059024 - [Performance] Hadoop fs -put is prohibitively slow for > 10K small files.
Summary: [Performance] Hadoop fs -put is prohibitively slow for > 10K small files.
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhs-hadoop
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Diane Feddema
QA Contact: BigData QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-01-29 02:07 UTC by Jay Vyas
Modified: 2016-02-01 16:17 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1090704 1090705
Environment:
Last Closed: 2016-02-01 16:17:19 UTC
Target Upstream Version:


Attachments

Description Jay Vyas 2014-01-29 02:07:19 UTC
When copying files into Gluster, hadoop fs -put seems to cause a lot of overhead
and slow the system down.  This script:

https://github.com/jayunit100/bigtop/blob/master/stresstest.sh

will run in any Gluster-mounted, Hadoop-enabled cluster.  It demonstrates that,
for example, with 100 files...

local copy  of 100 files into /mnt/glusterfs takes 2.5 seconds
hadoop fs -put of the same 100 files into /mnt/glusterfs takes 15.8 seconds.  
 
...

So, both are pretty slow, considering that each file is very small (64 bytes -
about the size of one English sentence).

... 

I noticed that this is uncovered by the Mahout smoke tests, which insert 21K small
files into /mnt/glusterfs using hadoop fs -put.
In that case, I also saw very high Gluster CPU usage during this time: this
indicates there might be double overhead - Hadoop overhead and Gluster
overhead, or maybe Hadoop-induced Gluster overhead of some sort.

.... 

Not sure where to point the finger yet, but this means we can't really do heavy
small-file workloads.

To reproduce, just run the script above... It should do everything and print 
out performance stats for raw writes and hadoop writes.
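
The gist of what the script does is roughly the following (an illustrative
sketch only; the directory names and the payload are assumptions, see
stresstest.sh above for the real logic):

#!/bin/bash
# Compare raw POSIX writes against hadoop fs -put for N small files.
N=${1:-100}
SRC=$(mktemp -d)
for i in $(seq 1 "$N"); do
  # ~64-byte payload, about the size of one English sentence
  echo "this is a small sixty-four byte sentence used as a test payload $i" > "$SRC/$i"
done

mkdir -p /mnt/glusterfs/raw_test            # assumed Gluster mount point
echo "local copy of $N files:"
time cp "$SRC"/* /mnt/glusterfs/raw_test/

hadoop fs -mkdir /hadoop_test               # assumed Hadoop-visible target dir
echo "hadoop fs -put of $N files:"
time hadoop fs -put "$SRC"/* /hadoop_test/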

Comment 2 Jay Vyas 2014-01-29 14:06:37 UTC
Note: Here are the results.

- gluster: 10,000 small files (~64 bytes each) take 5 minutes to copy into
Gluster.  That is only about 0.6 MB of data, so it seems that Gluster itself is
slow for this task.

- hadoop: on top of Gluster's slowness, we can see that there is a 4-5x time
increase using "hadoop fs -put" versus a recursive copy.  So our plugin may be
adding overhead.  Brad has suggested this could be due to logging, which might
make sense.

local ......  DONE (1 files) .......real	0m0.015s
hadoop ..... DONE (1 files) .......real	0m1.734s

local ......  DONE (10 files) .......real	0m0.227s
hadoop ..... DONE (10 files) .......real	0m3.021s

local ......  DONE (100 files) .......real	0m2.567s
hadoop ..... DONE (100 files) .......real	0m15.494s

local ......  DONE (1000 files) .......real	0m28.390s
hadoop ..... DONE (1000 files) .......real	2m20.654s

local ......  DONE (10000 files) .......real	4m58.034s
hadoop ..... DONE (10000 files) .......real	23m25.956s
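
If the logging hypothesis is worth testing, one quick check (an illustrative
sketch; the logger name and log4j.properties path are assumptions about a
typical Hadoop install, not confirmed values) is to raise the plugin's log
threshold and rerun the test:

# Silence the Gluster Hadoop plugin's logging, then rerun the stress test.
echo "log4j.logger.org.apache.hadoop.fs.glusterfs=ERROR" >> /etc/hadoop/conf/log4j.properties
./stresstest.sh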

Comment 3 Ben England 2014-01-29 18:12:30 UTC
adding perf group to cc list. 

What would be adequate or expected for your application?  Obviously the result is bad, but what would be good enough?

Could you post your exact workload, any Gluster volume parameters, and your XFS/LVM configuration parameters?  What hardware configuration?
How many files and what % of space is used in your filesystem?

I typically try to convert results to files/sec for small-file workloads, since files/sec is relatively insensitive to average file size at small sizes.  Your result was 4m58 sec for 10000 files = 10000/300 fps = 33 files/sec for 64-byte files.

However, BAGL perf. seems much better than yours, though still not good.  This Gluster test using smallfile on a BAGL node running glusterfs-3.4.0.55rhs-1.el6rhs.x86_64 (RHS 2.1 U2) shows 8x the throughput you reported in files/sec for 10,000 1-KB files.  You can get the smallfile benchmark from https://github.com/bengland2/smallfile .  I also reproduced the Gluster result using a simple shell command, shown at the bottom.  Using 1 thread creating 10,000 1-KB files:

fs, fsync-on-close? : fps
----------------------
XFS, yes: 1750
XFS, no: 6600
Gluster, yes: 234
Gluster, no: 259

No tuning was done on this Gluster volume other than rhs-high-throughput tuned profile, which should not help small files.  We are using 12-disk RAID6 with 256-KB strip size.  We are using jumbo frames.  Just 2 servers with 1 brick each in this configuration.  No XFS/LVM tuning is present other than mount options "noatime,inode64".  Gluster mountpoint was on one of the two servers.
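
For reference, that setup corresponds roughly to the following commands (a
sketch only; the brick device and mount path are assumptions, and the RAID6
and jumbo-frame settings are done at the controller/NIC level):

tuned-adm profile rhs-high-throughput                   # only tuning applied on the servers
mount -t xfs -o noatime,inode64 /dev/sdb /rhs/brick1    # XFS brick mount, no other XFS/LVM tuning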

In RHS 2.1 U2, write barriers are not turned off by the rhs-high-throughput tuned profile.  This is in support of data integrity when using JBOD configurations and low-end storage controllers, where NVRAM may not be present or engaged.  I don't believe this is the root cause.

Eventually I think Hadoop would benefit from the Denali RHS release; the plan is to backport upstream code that avoids an fdatasync() call per replica.  Right now the brick process glusterfsd does an fdatasync() call per file regardless of whether the application called it or not.  Removing that should buy you at least a 40% throughput improvement on small-file creates.  Neependra Khare originally observed this smallfile perf. decrease between the Anshi and Big Bend RHS releases, and my testing of the patch showed a corresponding increase in performance back to the Anshi level.  But this doesn't explain the discrepancy between your results and mine.

example output:

[root@gprfs047 smallfile-master]# ./smallfile_cli.py --threads 1 --top /mnt/sasvol/smf --file-size 1 --files 10000 --fsync Y --response-times Y --operation create
host = gprfs047.sbu.lab.eng.bos.redhat.com, thread = 00, elapsed sec. = 42.618219, total files = 10000, total_records = 10000, status = ok
total threads = 1
total files = 10000
total data =     0.010 GB
100.00% of requested files processed, minimum is  90.00
234.641433 files/sec
234.641433 IOPS
42.618219 sec elapsed time, 0.229142 MB/sec

And using 64-byte files:

[root@gprfs047 smallfile-master]# time for n in `seq 1 10000` ; do \
  echo '1111111111111111111111111111111111111111111111111111111111111111' \
  > /mnt/sasvol/echo/$n ; done

real    0m35.264s
user    0m1.125s
sys     0m0.931s
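
A rough way to watch the per-file fdatasync() behaviour described above while a
loop like that runs (an illustrative sketch; attach to each brick process if
more than one glusterfsd is running):

# Trace fdatasync() calls issued by the brick server process during file creation.
strace -f -e trace=fdatasync -p "$(pidof glusterfsd)"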

Comment 4 Jay Vyas 2014-01-29 18:27:07 UTC
- Re: the exact workload: https://github.com/jayunit100/bigtop/blob/master/stresstest.sh is exactly what I ran.

We'll respond to the other ideas and comments in time.  I think we will need to dedicate some time specifically to fine-tuning small-file workloads in the plugin, to see where the Hadoop discrepancy comes from.

The Gluster speed is not so bad, I guess.

Thanks for your help, Ben!

Comment 5 Scott McClellan 2014-01-30 19:39:56 UTC
This BZ is interesting... 

First off, I am of the opinion that "hadoop fs -put" is "not necessarily" a good proxy for small-file performance (either for Hadoop/HDFS itself, or in the Hadoop-on-RHS use case).  So, my initial point is that I don't believe this is a well-constructed experiment (more comments below).  Having said that, this experiment shows an alarming result.

Some context...  Hadoop has a whole bunch of commands (documented here: http://hadoop.apache.org/docs/r0.19.0/hdfs_shell.html) that are the Hadoop "equivalent" of a whole bunch of typical CLI commands users use all the time to manipulate files.  Why do these commands exist?  Because you can't use the "normal" CLI to do things like "touch" or "cat" a file in HDFS (or make or remove a directory).  That's all an artifact of the fact that HDFS is not "really" a file system - it is just a veneer that presents a namespace to applications that use the prescribed Hadoop filesystem API.  All the POSIX commands and utilities use the normal, standard POSIX file system interfaces - not the Hadoop filesystem API - thus the standard shell commands and standard utilities do not work with HDFS.  This is precisely why Hadoop on Gluster is so cool - you shouldn't really ever need to use the "hadoop fs -xxx" commands; the normal shell commands and utilities all just work.

With RHS there really isn't the notion of "files in HDFS" and "files in a POSIX
namespace"... ALL files are in a POSIX namespace and you can also access any of those files via the Hadoop filesystem API. 

Normally we would be crowing about how great our solution is because we don't even need to use the "hadoop fs -put" command (and all the other commands on that list).  But Jay's BZ points to a flaw... it is really embarrassing that the Hadoop FS shell command proxies are (at least in this case) FASTER than the native POSIX equivalents - at least if the files are small.

I think we need to devise a broader experiment where we flesh out two things for EVERY command in the master list:

+ Performance Comparison 
  - Three ways: Hadoop fs shell vs Posix Shell/RHS vs Posix Shell/HDFS (via FUSE)
  - varying file size, cluster size (number of nodes in cluster) 
+ Functional comparison
  - Many of the Hadoop fs shell "equivalents" are not functionally equivalent

I will speculate that the results will show that we are slower or much slower when operations dominate, and competitive to faster when the data path dominates.  Furthermore, I will speculate that all of these cases will suffer (more) on RHS as the size of the cluster increases.

And back to my original point - I am skeptical that hadoop fs -put is really a good proxy for small-file performance in our solution.  So I think the question of small-file performance needs to be rephrased in the context of particular applications, or at least types of Hadoop applications.  I suspect HBase, Hive, and all MapReduce workloads are not "equal" or even "similar" in terms of the relevance of small-file performance.  So Ben's question back to Jay is kind of hard to answer without some additional experimentation.

Comment 7 Jay Vyas 2014-02-05 13:44:18 UTC
Ahh... okay, I think I see what the regression is!

public void setPermission(Path p, FsPermission permission) throws IOException {
    super.setPermission(p, permission);
    updateAcl(p);
}

These new ACLs are now what is slowing us down!

I believe now that, since the filesystem is handling permissions for YARN files FOR
US by using the hadoop group, we can eliminate these filters... then we will be in
good shape regarding the "acute" part of this bug.
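
A quick way to gauge what a per-file ACL update costs on the mount (an
illustrative sketch; the directories and the "yarn" ACL entry are made up for
the test, and any existing user works):

mkdir -p /mnt/glusterfs/acl_plain /mnt/glusterfs/acl_extra
# Plain creates:
time for n in $(seq 1 1000); do echo x > /mnt/glusterfs/acl_plain/$n; done
# Creates plus one ACL update per file, mimicking setPermission() + updateAcl():
time for n in $(seq 1 1000); do
  echo x > /mnt/glusterfs/acl_extra/$n
  setfacl -m u:yarn:rwx /mnt/glusterfs/acl_extra/$n
done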

Comment 8 Jay Vyas 2014-02-05 14:57:21 UTC
Another update: 

By doing a simple remount of the shim, not using RHS-HADOOP-INSTALL, it appears that you get an 8x speedup:

Starting test... 100
local ......  DONE (100 files) .......real	0m0.445s
103
hadoop ..... DONE (100 files) .......real	0m1.912s
103
END of test... 100

All preliminary, but it looks like something in

local GLUSTER_MNT_OPTS="entry-timeout=0,attribute-timeout=0,use-readdirp=no,acl,_netdev"

has slowed us down a little bit?
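
For context, the "simple remount" above amounts to something like this (a
sketch; the server and volume names are placeholders):

umount /mnt/glusterfs
mount -t glusterfs <server>:/<volume> /mnt/glusterfs   # default FUSE options, none of the extra flags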

Comment 9 Jay Vyas 2014-02-05 15:33:41 UTC
Hey! Okay, this is solved :) It was our MOUNT OPTIONS that slowed us down, I believe:

### RHS_HADOOP MOUNT : FANCY OPTIONS SLOW US DOWN 8-FOLD ! ### 

[root@mrg42 bigtop]# ./stresstest.sh "entry-timeout=0,attribute-timeout=0,use-readdirp=no,acl"
Starting test... 1
local ......  DONE (1 files) .......real	0m0.013s
hadoop ..... DONE (1 files) .......real	0m0.968s
END of test... 1
c = 10
Starting test... 10
local ......  DONE (10 files) .......real	0m0.155s
hadoop ..... DONE (10 files) .......real	0m1.693s
END of test... 10
c = 100
Starting test... 100
local ......  DONE (100 files) .......real	0m1.460s
hadoop ..... DONE (100 files) .......real	0m8.801s
END of test... 100
c = 1000


### ORIGINAL MOUNT : NO OPTIONS ### 

[root@mrg42 bigtop]# ./stresstest.sh
local ......  DONE (1 files) .......real	0m0.006s
hadoop ..... DONE (1 files) .......real	0m0.877s
END of test... 1
c = 10
Starting test... 10
local ......  DONE (10 files) .......real	0m0.052s
hadoop ..... DONE (10 files) .......real	0m0.992s
END of test... 10
c = 100
Starting test... 100
local ......  DONE (100 files) .......real	0m0.544s
hadoop ..... DONE (100 files) .......real	0m2.107s

Comment 10 Jay Vyas 2014-02-05 20:51:19 UTC
Further root-caused: it's entry-timeout=0.

Proposed fix: just make the shim double-check on FileNotFound exceptions, and use entry-timeout=1.  This will dramatically increase small-file performance, and mapred will still perform robustly in corner cases where super-fast task tracker changes happen.

Two in our group have voted this down, but I don't think it's a bad fix, given the 10x performance increase that you get from it.
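
The proposed mount would look roughly like this (a sketch; the server and
volume names are placeholders, and the remaining options are carried over from
the earlier RHS_HADOOP mount line):

mount -t glusterfs -o entry-timeout=1,attribute-timeout=0,use-readdirp=no,acl,_netdev <server>:/<volume> /mnt/glusterfs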

Comment 13 Scott Haines 2014-04-09 00:09:42 UTC
Per Apr-02 bug triage meeting, granting both devel and pm acks

Comment 14 Bradley Childs 2014-05-20 17:25:57 UTC
I'm moving this to the resolved/QE state.  I don't have a 'fixed in' version, as the resolution is around configuration.

Comment 15 Martin Kudlej 2014-05-21 08:39:58 UTC
I think this configuration solution should be supported by rhs-hadoop-install, so I suggest moving this BZ from the rhs-hadoop component to rhs-hadoop-install.  A description of the two-mount-point solution should also be part of this BZ (or a new documentation BZ).  Please consider all these issues and change/create the required BZs.  Because of this, I am moving this to ASSIGNED.

Comment 17 Steve Watt 2016-02-01 16:17:19 UTC
This solution is no longer available from Red Hat.

