Contents:

• Parsing log-file output
• Rectangularizing data
• Bulk rename of field names
• Regularizing ragged CSV
• Two-pass computation of percentages
• Filtering paragraphs of text
• Doing arithmetic on fields with currency symbols
• Program timing
• Using out-of-stream variables
• Mean without/with oosvars
• Keyed mean without/with oosvars
• Variance and standard deviation without/with oosvars
• Min/max without/with oosvars
• Keyed min/max without/with oosvars
• Delta without/with oosvars
• Keyed delta without/with oosvars
• Exponentially weighted moving averages without/with oosvars

Parsing log-file output

This, of course, depends highly on what's in your log files. But, as an example, suppose you have log-file lines such as

2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378

Then grep and sed can isolate the trailing key-value pairs, which Miller can aggregate:

grep 'various sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status

Rectangularizing data

Suppose you have a method (in whatever language) which is printing things of the form

outer=1
outer=2
outer=3

along with

middle=10
middle=11
middle=12
middle=20
middle=21
middle=30
middle=31

and

inner1=100,inner2=101
inner1=120,inner2=121
inner1=200,inner2=201
inner1=210,inner2=211
inner1=300,inner2=301
inner1=312
inner1=313,inner2=314

with the middle lines interleaved under their outer lines, and the inner lines interleaved under their middle lines, as in data/rect.txt:

outer=1
middle=10
inner1=100,inner2=101
middle=11
middle=12
inner1=120,inner2=121
outer=2
middle=20
inner1=200,inner2=201
middle=21
inner1=210,inner2=211
outer=3
middle=30
inner1=300,inner2=301
middle=31
inner1=312
inner1=313,inner2=314

Then you can rectangularize this by accumulating the most recently seen fields in an out-of-stream variable, emitting a full record whenever an inner line is seen:

$ mlr --from data/rect.txt put -q '
  ispresent($outer) { unset @r }
  for (k, v in $*) { @r[k] = v }
  ispresent($inner1) { emit @r }
'
outer=1,middle=10,inner1=100,inner2=101
outer=1,middle=12,inner1=120,inner2=121
outer=2,middle=20,inner1=200,inner2=201
outer=2,middle=21,inner1=210,inner2=211
outer=3,middle=30,inner1=300,inner2=301
outer=3,middle=31,inner1=312,inner2=301
outer=3,middle=31,inner1=313,inner2=314

(Note that the second-to-last output line retains inner2=301 from the previous record, since the line inner1=312 has no inner2 of its own.)

Bulk rename of field names

$ cat data/spaces.csv
a b c,def,g h i
123,4567,890
2468,1357,3579
9987,3312,4543

$ mlr --csv --rs lf rename -r -g ' ,_' data/spaces.csv
a_b_c,def,g_h_i
123,4567,890
2468,1357,3579
9987,3312,4543

$ mlr --csv --irs lf --opprint rename -r -g ' ,_' data/spaces.csv
a_b_c def  g_h_i
123   4567 890
2468  1357 3579
9987  3312 4543

$ mlr --icsv --irs lf --opprint put -f data/bulk-rename-for-loop.mlr data/spaces.csv
def  a_b_c g_h_i
4567 123   890
1357 2468  3579
3312 9987  4543

Regularizing ragged CSV

Miller handles compliant CSV: in particular, it's an error if the number of data fields in a given data line doesn't match the number of header fields. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, in which each line stands on its own without respect to a header line.

$ cat data/ragged.csv
a,b,c
1,2,3
4,5
6
7,8,9

$ mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  @nf = NF;
  while (@nf < @maxnf) {
    @nf += 1;
    $[@nf] = "";
  }
'
a,b,c
1,2,3
4,5,
6,,
7,8,9

or, more simply:

$ mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  while (NF < @maxnf) {
    $[NF+1] = "";
  }
'
a,b,c
1,2,3
4,5,
6,,
7,8,9

Two-pass computation of percentages

Miller is a streaming record processor; commands are performed once per record. This makes Miller particularly suitable for single-pass algorithms, allowing many of its verbs to process files that are (much) larger than the amount of RAM present in your system. (Of course, Miller verbs such as sort, tac, etc. must ingest and retain all input records before emitting any output records.) You can also use out-of-stream variables to perform multi-pass computations.
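As a plain-language reference point for the retain-then-emit pattern used next, here is the min/max percentage mapping as a short Python sketch. This is illustrative only, not Miller code: `percentages` is a hypothetical helper name.

```python
# Illustrative Python-only sketch of a two-pass percentage mapping:
# the same algorithm the Miller example below writes with oosvars.

def percentages(xs):
    # Pass one: find the extremes (what @x_min/@x_max accumulate in Miller).
    x_min, x_max = min(xs), max(xs)
    # Pass two: map each retained value onto the 0..100 range.
    return [100 * (x - x_min) / (x_max - x_min) for x in xs]

result = percentages([2.0, 3.5, 8.0])  # [0.0, 25.0, 100.0]
```

In Miller the "retained values" live in an out-of-stream variable indexed by NR, and the second pass happens in an end-block once the stream is exhausted.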
For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, and on the second you map each record's value to a percentage.

$ mlr --from data/small --opprint put -q '
  # These are executed once per record, which is the first pass.
  # The key is to use NR to index an out-of-stream variable to
  # retain all the x-field values.
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);
  @x[NR] = $x;

  # The second pass is in a for-loop in an end-block.
  end {
    for (nr, x in @x) {
      @x_pct[nr] = 100 * (@x[nr] - @x_min) / (@x_max - @x_min);
    }
    emit (@x, @x_pct), "NR"
  }
'
NR x        x_pct
1  0.346790 25.661943
2  0.758680 100.000000
3  0.204603 0.000000
4  0.381399 31.908236
5  0.573289 66.540542

Filtering paragraphs of text

The idea is to use a record separator which is a pair of newlines. Then, if you want each paragraph to be a record with a single value, use a field separator which isn't present in the input data (e.g. a control-A, which is octal 001). Or, if you want each paragraph to have its lines as separate values, use newline as the field separator.

$ cat paragraphs.txt
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick
brown fox jumped over the lazy dogs. The quick brown fox jumped over the
lazy dogs.

Now is the time for all good people to come to the aid of their country. Now
is the time for all good people to come to the aid of their country. Now is
the time for all good people to come to the aid of their country. Now is the
time for all good people to come to the aid of their country. Now is the
time for all good people to come to the aid of their country.

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
on the plain. The rain in Spain falls mainly on the plain. The rain in Spain
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
rain in Spain falls mainly on the plain. The rain in Spain falls mainly on
the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\001' filter '$1 =~ "the"'
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
over the lazy dogs. The quick brown fox jumped over the lazy dogs. The quick
brown fox jumped over the lazy dogs. The quick brown fox jumped over the
lazy dogs.

Now is the time for all good people to come to the aid of their country. Now
is the time for all good people to come to the aid of their country. Now is
the time for all good people to come to the aid of their country. Now is the
time for all good people to come to the aid of their country. Now is the
time for all good people to come to the aid of their country.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
on the plain. The rain in Spain falls mainly on the plain. The rain in Spain
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
rain in Spain falls mainly on the plain. The rain in Spain falls mainly on
the plain. The rain in Spain falls mainly on the plain.

$ mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\n' cut -f 1,3
The quick brown fox jumped over the lazy dogs. The quick brown fox jumped
brown fox jumped over the lazy dogs. The quick brown fox jumped over the

Now is the time for all good people to come to the aid of their country. Now
the time for all good people to come to the aid of their country. Now is the

Sphynx of black quartz, judge my vow. Sphynx of black quartz, judge my vow.
Sphynx of black quartz, judge my vow.

The rain in Spain falls mainly on the plain. The rain in Spain falls mainly
falls mainly on the plain. The rain in Spain falls mainly on the plain. The
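The paragraph-as-record idea can also be sketched outside Miller. Here is an illustrative Python-only analogue of the filter example; note it keeps paragraphs containing a literal substring, whereas the Miller example uses a regex match, and `filter_paragraphs` is a hypothetical helper name.

```python
# Illustrative sketch: records are paragraphs separated by blank lines
# (the --rs '\n\n' idea), kept when they contain a given substring.

def filter_paragraphs(text, needle):
    paragraphs = text.split("\n\n")
    return [p for p in paragraphs if needle in p]

text = "the quick fox\n\nSphynx of black quartz\n\nthe rain in Spain"
kept = filter_paragraphs(text, "the")  # drops the Sphynx paragraph
```

As in the Miller example, the Sphynx paragraph is dropped because it contains no lowercase "the".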
Doing arithmetic on fields with currency symbols

$ cat sample.csv
EventOccurred,EventType,Description,Status,PaymentType,NameonAccount,TransactionNumber,Amount
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,John,1,$230.36
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Fred,2,$32.25
10/1/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Bob,3,$39.02
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Alice,4,$57.54
10/1/2015,Charged Back,Reason: Authorization Revoked By Customer,Disputed,Checking,Jungle,5,$230.36
10/1/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Joe,6,$281.96
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,7,$188.19
10/2/2015,Charged Back,Reason: Customer Advises Not Authorized,Disputed,Checking,Joseph,8,$188.19
10/2/2015,Charged Back,Reason: Payment Stopped,Disputed,Checking,Anthony,9,$250.00

$ mlr --icsv --opprint cat sample.csv
EventOccurred EventType    Description                               Status   PaymentType NameonAccount TransactionNumber Amount
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    John          1                 $230.36
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Fred          2                 $32.25
10/1/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Bob           3                 $39.02
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Alice         4                 $57.54
10/1/2015     Charged Back Reason: Authorization Revoked By Customer Disputed Checking    Jungle        5                 $230.36
10/1/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Joe           6                 $281.96
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        7                 $188.19
10/2/2015     Charged Back Reason: Customer Advises Not Authorized   Disputed Checking    Joseph        8                 $188.19
10/2/2015     Charged Back Reason: Payment Stopped                   Disputed Checking    Anthony       9                 $250.00

Stripping the currency symbol lets stats1 sum the column as numbers:

$ mlr --csv put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.870000

$ mlr --csv --ofmt '%.2lf' put '$Amount = sub(string($Amount), "\$", "")' then stats1 -a sum -f Amount sample.csv
Amount_sum
1497.87

Program timing

This admittedly artificial example demonstrates using Miller's time and stats functions to introspectively acquire some information about Miller's own runtime. The delta function computes the difference between successive timestamps.

$ ruby -e '10000.times{|i|puts "i=#{i+1}"}' > lines.txt

$ head -n 5 lines.txt
i=1
i=2
i=3
i=4
i=5

$ mlr --ofmt '%.9le' --opprint put '$t=systime()' then step -a delta -f t lines.txt | head -n 7
i t                 t_delta
1 1430603027.018016 1.430603027e+09
2 1430603027.018043 2.694129944e-05
3 1430603027.018048 5.006790161e-06
4 1430603027.018052 4.053115845e-06
5 1430603027.018055 2.861022949e-06
6 1430603027.018058 3.099441528e-06

$ mlr --ofmt '%.9le' --oxtab \
    put '$t=systime()' then \
    step -a delta -f t then \
    filter '$i>1' then \
    stats1 -a min,mean,max -f t_delta \
    lines.txt
t_delta_min  2.861022949e-06
t_delta_mean 4.077508505e-06
t_delta_max  5.388259888e-05

Using out-of-stream variables

One of Miller's strengths is its compact notation: for example, given input of the form

$ head -n 5 ../data/medium
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

you can compute the sum of the x field, optionally grouped by the b field, with a single verb:

$ mlr --oxtab stats1 -a sum -f x ../data/medium
x_sum 4986.019682

$ mlr --opprint stats1 -a sum -f x -g b ../data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

The same computations can be written by hand using out-of-stream variables, which accumulate across records and are emitted in an end-block:

$ mlr --oxtab put -q '
  @x_sum += $x;
  end { emit @x_sum }
' data/medium
x_sum 4986.019682

$ mlr --opprint put -q '
  @x_sum[$b] += $x;
  end { emit @x_sum, "b" }
' data/medium
b   x_sum
pan 965.763670
wye 1023.548470
zee 979.742016
eks 1016.772857
hat 1000.192668

Mean without/with oosvars

$ mlr --opprint stats1 -a mean -f x data/medium
x_mean
0.498602

$ mlr --opprint put -q '
  @x_sum += $x;
  @x_count += 1;
  end {
    @x_mean = @x_sum / @x_count;
    emit @x_mean
  }
' data/medium
x_mean
0.498602

Keyed mean without/with oosvars

$ mlr --opprint stats1 -a mean -f x -g a,b data/medium
a   b   x_mean
pan pan 0.513314
eks pan 0.485076
wye wye 0.491501
eks wye 0.483895
wye pan 0.499612
zee pan 0.519830
eks zee 0.495463
zee wye 0.514267
hat wye 0.493813
pan wye 0.502362
zee eks 0.488393
hat zee 0.509999
hat eks 0.485879
wye hat 0.497730
pan eks 0.503672
eks eks 0.522799
hat hat 0.479931
hat pan 0.464336
zee zee 0.512756
pan hat 0.492141
pan zee 0.496604
zee hat 0.467726
wye zee 0.505907
eks hat 0.500679
wye eks 0.530604

$ mlr --opprint put -q '
  @x_sum[$a][$b] += $x;
  @x_count[$a][$b] += 1;
  end {
    for ((a, b), v in @x_sum) {
      @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
    }
    emit @x_mean, "a", "b"
  }
' data/medium
a   b   x_mean
pan pan 0.513314
pan wye 0.502362
pan eks 0.503672
pan hat 0.492141
pan zee 0.496604
eks pan 0.485076
eks wye 0.483895
eks zee 0.495463
eks eks 0.522799
eks hat 0.500679
wye wye 0.491501
wye pan 0.499612
wye hat 0.497730
wye zee 0.505907
wye eks 0.530604
zee pan 0.519830
zee wye 0.514267
zee eks 0.488393
zee zee 0.512756
zee hat 0.467726
hat wye 0.493813
hat zee 0.509999
hat eks 0.485879
hat hat 0.479931
hat pan 0.464336

Variance and standard deviation without/with oosvars

$ mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
x_count  10000
x_sum    4986.019682
x_mean   0.498602
x_var    0.084270
x_stddev 0.290293

$ cat variance.mlr
@n += 1;
@sumx += $x;
@sumx2 += $x**2;
end {
  @mean = @sumx / @n;
  @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
  @stddev = sqrt(@var);
  emitf @n, @sumx, @sumx2, @mean, @var, @stddev
}

$ mlr --oxtab put -q -f variance.mlr data/medium
n      10000
sumx   4986.019682
sumx2  3328.652400
mean   0.498602
var    0.084270
stddev 0.290293
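The one-pass variance algebra used in variance.mlr can be checked against the textbook two-pass definition. Here is an illustrative Python-only sketch (the `streaming_var` name and the sample data are hypothetical), relying on the identity sumx2 - mean*(2*sumx - n*mean) = sum((x - mean)^2):

```python
# Illustrative check of the single-pass variance formula from variance.mlr.
import math

def streaming_var(xs):
    n = sumx = sumx2 = 0.0
    for x in xs:  # single pass, constant memory
        n += 1
        sumx += x
        sumx2 += x ** 2
    mean = sumx / n
    # Same algebra as variance.mlr's end-block.
    var = (sumx2 - mean * (2 * sumx - n * mean)) / (n - 1)
    return mean, var, math.sqrt(var)

xs = [0.34, 0.75, 0.20, 0.38, 0.57]
mean, var, stddev = streaming_var(xs)
# Two-pass definition, for comparison:
direct = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
```

The two agree up to floating-point roundoff, which is why Miller can compute var and stddev in a single streaming pass.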
Min/max without/with oosvars

$ mlr --oxtab stats1 -a min,max -f x data/medium
x_min 0.000045
x_max 0.999953

$ mlr --oxtab put -q '@x_min = min(@x_min, $x); @x_max = max(@x_max, $x); end{emitf @x_min, @x_max}' data/medium
x_min 0.000045
x_max 0.999953

Keyed min/max without/with oosvars

$ mlr --opprint stats1 -a min,max -f x -g a data/medium
a   x_min    x_max
pan 0.000204 0.999403
eks 0.000692 0.998811
wye 0.000187 0.999823
zee 0.000549 0.999490
hat 0.000045 0.999953

$ mlr --opprint --from data/medium put -q '
  @min[$a] = min(@min[$a], $x);
  @max[$a] = max(@max[$a], $x);
  end {
    emit (@min, @max), "a";
  }
'
a   min      max
pan 0.000204 0.999403
eks 0.000692 0.998811
wye 0.000187 0.999823
zee 0.000549 0.999490
hat 0.000045 0.999953

Delta without/with oosvars

$ mlr --opprint step -a delta -f x data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

$ mlr --opprint put '$x_delta = ispresent(@last) ? $x - @last : 0; @last = $x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

Keyed delta without/with oosvars

$ mlr --opprint step -a delta -f x -g a data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

$ mlr --opprint put '$x_delta = ispresent(@last[$a]) ? $x - @last[$a] : 0; @last[$a]=$x' data/small
a   b   i x                   y                   x_delta
pan pan 1 0.3467901443380824  0.7268028627434533  0
eks pan 2 0.7586799647899636  0.5221511083334797  0
wye wye 3 0.20460330576630303 0.33831852551664776 0
eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

Exponentially weighted moving averages without/with oosvars

$ mlr --opprint step -a ewma -d 0.1 -f x data/small
a   b   i x                   y                   x_ewma_0.1
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064

$ mlr --opprint put '
  begin{ @a=0.1 };
  $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
  @e=$e
' data/small
a   b   i x                   y                   e
pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
wye pan 5 0.5732889198020006  0.8636244699032729  0.391064
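The EWMA recurrence above seeds the smoothed value with the first input (the NR==1 case), then updates it as a*x + (1 - a)*previous. Here is an illustrative Python-only sketch of the same recurrence (the `ewma` helper name is hypothetical):

```python
# Illustrative sketch of the EWMA recurrence: e = a*x + (1 - a)*e,
# seeded with the first value.

def ewma(xs, a=0.1):
    out, e = [], None
    for x in xs:
        e = x if e is None else a * x + (1 - a) * e
        out.append(e)
    return out

vals = ewma([0.3467901443380824, 0.7586799647899636, 0.20460330576630303])
# vals[1] is 0.1*0.7586... + 0.9*0.3467..., i.e. about 0.387979, matching
# the x_ewma_0.1 output above.
```

Smaller values of a (Miller's -d option) weight the history more heavily; a=1 reproduces the input unchanged.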