Here is an attempt at parsing the data. It is fixed field so the regular expression will extract the data. Some does not seem to make sense since it has curly brackets in the data. Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.
On Sun, Aug 28, 2016 at 8:49 AM, Glenn Schultz <glennmschultz at me.com> wrote:
Hi Jim,
Attached is the layout of the file I would like to parse with dput sample
of the data. From the layout it seems to me there are two sets in the data
Header and Details. I would like to either parse such that
- I have either 1 comma delimited file of all data or
- 2 comma delimited files one of header the other of details
I have never seen a file layout described in the manner before.
Consequently, I am a little confused as to how to work with the file.
Best,
Glenn
"1176552 CL20031031367RBV319920901
217655208875{08875{08875{08875{08875{08875{22D22D22D22D22D2
2D13C13C13C13C13C13C0000604000{0000604000{0000604000{0000604
000{0000604000{0000604000{36{36{36{36{36{36{08500{08500{08500{08500{08500{
08500{1254240 CL20031031371KLV120020201
225424007484{07250{07375{07500{07625{08625{33F06H33H33I34{3
4A02A01I02{02{02A03B0001121957C0000123500{0000920000{0001280
000{0001741000{0003849000{35I30{36{36{36{36{07000{07000{07000{07000{07000{
07000{1254253 CL20031031371KMA620020301
225425306715{06250{06500{06750{06875{07000{33C23G33C33I34{3
4A02{01I02{02{02A02C0000946646A0000350000{0000850000{0001030
000{0001205000{0001300000{35H30{36{36{36{36{06000{06000{06000{06000{06000{
06000{1259455 CL20031031371RE4420020501
225945507045{06750{06875{07000{07250{07375{34{28B34A34B34B3
4C01H01G01H01H01H02C0000934444E0000360000{0000765000{0000995
000{0001384000{0002184000{35I30{36{36{36{36{06500{06500{06500{06500{06500{
06500{1261060 CI20031031371S5V219940101
226106006637{06500{06500{06625{06750{06875{05B00C04H05I06B0
6B11H11G11G11H11H11I0001169090I0000650000{0000950000{0001250
000{0001328000{0001900000{18{18{18{18{18{18{06000{06000{06000{06000{06000{
06000{1335271 CI20031031375HMU519960101
233527107500{07500{07500{07500{07500{07500{08B06B08E08F08F0
8F09D09D09D09D09E09E0000717375{0000464000{0000550000{0000770
000{0001085500{0001085500{18{18{18{18{18{18{07000{07000{07000{07000{07000{
07000{1440840 CL20031031380HV9519981101
244084006707{06500{06625{06750{06875{06875{27D03C28C29H30{3
0A06{05I06{06{06{06A0000615172I0000250000{0000621000{
0000673000{0000750000{0000791000{36{36{36{36{36{36{06000{
06000{06000{06000{06000{06000{1521993 CI20031031384E3A620000101
252199306937{06875{06875{06875{07000{07000{12H02H12H13{13D1
3E04E04E04E04E04F04F0001129428F0000700000{0000955000{0001000
000{0002087000{0002087000{18{18{18{18{18{18{06500{06500{0650
0{06500{06500{06500{1538080 CL20031031384YXH420000501
253808008875{08875{08875{08875{08875{08875{31I31I31I31I31I3
1I04A04A04A04A04A04A0001419300{0001419300{0001419300{0001419
300{0001419300{0001419300{36{36{36{36{36{36{07000{07000{07000{07000{07000{
07000{1659123 CI20031031390XG8720020801
265912306909{06750{06750{06875{07000{07125{16E15I16C16E16F1
6F01E01D01D01E01E01G0000998541G0000162000{0000792000{0001156
500{0001600000{0001990000{18{18{18{18{18{18{06000{06000{06000{06000{06000{
06000{"
-------------- next part --------------
require(stringr)
# input data
data2 <- c("1176552 CL20031031367RBV319920901 217655208875{08875{08875{08875{08875{08875{22D22D22D22D22D22D13C13C13C13C13C13C0000604000{0000604000{0000604000{0000604000{0000604000{0000604000{36{36{36{36{36{36{08500{08500{08500{08500{08500{08500{", "1254240 CL20031031371KLV120020201 225424007484{07250{07375{07500{07625{08625{33F06H33H33I34{34A02A01I02{02{02A03B0001121957C0000123500{0000920000{0001280000{0001741000{0003849000{35I30{36{36{36{36{07000{07000{07000{07000{07000{07000{", "1254253 CL20031031371KMA620020301 225425306715{06250{06500{06750{06875{07000{33C23G33C33I34{34A02{01I02{02{02A02C0000946646A0000350000{0000850000{0001030000{0001205000{0001300000{35H30{36{36{36{36{06000{06000{06000{06000{06000{06000{", "1259455 CL20031031371RE4420020501 225945507045{06750{06875{07000{07250{07375{34{28B34A34B34B34C01H01G01H01H01H02C0000934444E0000360000{0000765000{0000995000{0001384000{0002184000{35I30{36{36{36{36{06500{06500{06500{06500{06500{06500{", "1261060 CI20031031371S5V219940101 226106006637{06500{06500{06625{06750{06875{05B00C04H05I06B06B11H11G11G11H11H11I0001169090I0000650000{0000950000{0001250000{0001328000{0001900000{18{18{18{18{18{18{06000{06000{06000{06000{06000{06000{", "1335271 CI20031031375HMU519960101 233527107500{07500{07500{07500{07500{07500{08B06B08E08F08F08F09D09D09D09D09E09E0000717375{0000464000{0000550000{0000770000{0001085500{0001085500{18{18{18{18{18{18{07000{07000{07000{07000{07000{07000{", "1440840 CL20031031380HV9519981101 244084006707{06500{06625{06750{06875{06875{27D03C28C29H30{30A06{05I06{06{06{06A0000615172I0000250000{0000621000{0000673000{0000750000{0000791000{36{36{36{36{36{36{06000{06000{06000{06000{06000{06000{", "1521993 CI20031031384E3A620000101 252199306937{06875{06875{06875{07000{07000{12H02H12H13{13D13E04E04E04E04E04F04F0001129428F0000700000{0000955000{0001000000{0002087000{0002087000{18{18{18{18{18{18{06500{06500{06500{06500{06500{06500{", "1538080 CL20031031384YXH420000501 253808008875{08875{08875{08875{08875{08875{31I31I31I31I31I31I04A04A04A04A04A04A0001419300{0001419300{0001419300{0001419300{0001419300{0001419300{36{36{36{36{36{36{07000{07000{07000{07000{07000{07000{", "1659123 CI20031031390XG8720020801 265912306909{06750{06750{06875{07000{07125{16E15I16C16E16F16F01E01D01D01E01E01G0000998541G0000162000{0000792000{0001156500{0001600000{0001990000{18{18{18{18{18{18{06000{06000{06000{06000{06000{06000{"
)
# file format from PDF so we know the size and header names
# used the second part starting at FIXED QUARTILES DETAIL
header <- strsplit("5 QUARTILE RECORD TYPE alpha-numeric X(1) 1 1
5 POOL NUMBER alpha-numeric X(6) 2 6
10 WEIGHTED AVERAGE (WA) COUPON numeric edited S9(2)V9(4) 8 6
10 COUPON POOL MINIMUM numeric edited S9(2)V9(4) 14 6
10 COUPON QUARTILE 1 numeric edited S9(2)V9(4) 20 6
10 COUPON QUARTILE 2 numeric edited S9(2)V9(4) 26 6
10 COUPON QUARTILE 3 numeric edited S9(2)V9(4) 32 6
10 COUPON QUARTILE 4 numeric edited S9(2)V9(4) 38 6
10 WA MATURITY numeric edited S9(3) 44 3
10 MINIMUM numeric edited S9(3) 47 3
10 REMAINING MONTHS TO MATURITY QUARTILE 1 numeric edited S9(3) 50 3
10 REMAINING MONTHS TO MATURITY QUARTILE 2 numeric edited S9(3) 53 3
10 REMAINING MONTHS TO MATURITY QUARTILE 3 numeric edited S9(3) 56 3
10 REMAINING MONTHS TO MATURITY QUARTILE 4 numeric edited S9(3) 59 3
10 WA LOAN AGE numeric edited S9(3) 62 3
10 LOAN AGE POOL MINIMUM numeric edited S9(3) 65 3
10 LOAN AGE QUARTILE 1 numeric edited S9(3) 68 3
10 LOAN AGE QUARTILE 2 numeric edited S9(3) 71 3
10 LOAN AGE QUARTILE 3 numeric edited S9(3) 74 3
10 LOAN AGE QUARTILE 4 numeric edited S9(3) 77 3
10 AVERAGE LOAN SIZE numeric edited S9(9)V9(2) 80 11
10 LOAN SIZE POOL MINIMUM numeric edited S9(9)V9(2) 91 11
10 LOAN SIZE QUARTILE 1 numeric edited S9(9)V9(2) 102 11
10 LOAN SIZE QUARTILE 2 numeric edited S9(9)V9(2) 113 11
10 LOAN SIZE QUARTILE 3 numeric edited S9(9)V9(2) 124 11
10 LOAN SIZE QUARTILE 4 numeric edited S9(9)V9(2) 135 11
10 WA ORIGINAL LOAN TERM numeric edited S9(3) 146 3
10 ORIGINAL LOAN TERM MINIMUM numeric edited S9(3) 149 3
10 ORIGINAL LOAN TERM QUARTILE 1 numeric edited S9(3) 152 3
10 ORIGINAL LOAN TERM QUARTILE 2 numeric edited S9(3) 155 3
10 ORIGINAL LOAN TERM QUARTILE 3 numeric edited S9(3) 158 3
10 ORIGINAL LOAN TERM QUARTILE 4 numeric edited S9(3) 161 3
10 WA PASS-THROUGH RATE numeric edited S9(2)V9(4) 164 6
10 PASS-THROUGH RATE POOL MINIMUM numeric edited S9(2)V9(4) 170 6
10 PASS-THROUGH RATE QUARTILE 1 numeric edited S9(2)V9(4) 176 6
10 PASS-THROUGH RATE QUARTILE 2 numeric edited S9(2)V9(4) 182 6
10 PASS-THROUGH RATE QUARTILE 3 numeric edited S9(2)V9(4) 188 6
10 PASS-THROUGH RATE QUARTILE 4 numeric edited S9(2)V9(4) 194 6 ", '\n')[[1]]
# extract header data and positions
head <- str_match(header, "^\\d+ (.+) (alpha|numeric).+(\\d+)\\s+(\\d+)\\s*$")
# create regular expression to parse the data
regexp <- sapply(head[, 5], function(num) paste0("(.{", num, "})"))
regexp <- paste(regexp, collapse = '')
result <- str_match(data2, regexp)
# remove 1st column since it is just the input
result <- result[, -1]
# add header
colnames(result) <- head[, 2]
View(result) # some data does not seem to match up; has curly brackets in data