<*>=
from CifFile import CifError
from types import *
<Helper functions>
%%
parser CifParser:

<Regular expressions>

<Grammar specification>
%%
We have a monitor function which we can call to save the last parsed value (and print it, if we are debugging). We also have functions for stripping delimiters off strings. Finally, we match up our looped names and values after reading them in. Note that the function stripextras is only for semicolon-delimited strings, whereas stripstring gets rid of the surrounding inverted commas. A short demonstration of these helpers follows the code chunk below.
<Helper functions>= (<-U)
# An alternative specification for the Cif Parser, based on Yapps2
# by Amit Patel (http://theory.stanford.edu/~amitp/Yapps)
#
# helper code: we define our match tokens
lastval = ''
def monitor(location,value):
    global lastval
    # print 'At %s: %s' % (location,`value`)
    lastval = `value`
    return value

def stripextras(value):
    # we get rid of semicolons and leading/trailing terminators etc.
    import re
    jj = re.compile("[\n\r\f \t\v]*")
    semis = re.compile("[\n\r\f \t\v]*[\n\r\f]\n*;")
    cut = semis.match(value)
    if cut:
        nv = value[cut.end():len(value)-2]
        try:
            if nv[-1]=='\r': nv = nv[:-1]
        except IndexError:    # empty data value
            pass
    else:
        nv = value
    cut = jj.match(nv)
    if cut:
        return stripstring(nv[cut.end():])
    return nv

# helper function to get rid of inverted commas etc.
def stripstring(value):
    if value:
        if value[0]=='\'' and value[-1]=='\'': return value[1:-1]
        if value[0]=='"' and value[-1]=='"': return value[1:-1]
    return value

# helper function to create a dictionary given a set of
# looped datanames and data values
def makeloop(namelist,itemlist,context):
    noitems = len(namelist)
    nopoints = divmod(len(itemlist),noitems)
    if nopoints[1]!=0:    # mismatch
        raise CifError, "loop item mismatch"
    nopoints = nopoints[0]
    # check no overlap between separate loops!
    new_lower = map(lambda a:a.lower(),namelist)
    for loop in context:
        for key in loop.keys():
            if key.lower() in new_lower:
                raise CifError, "Duplicate item name %s" % key
    newdict = {}
    for i in range(0,noitems):
        templist = []
        for j in range(0,nopoints):
            templist.append(itemlist[j*noitems + i])
        lower_keys = map(lambda a:a.lower(),newdict.keys())
        if namelist[i].lower() in lower_keys:
            raise CifError, "%s occurs twice in same loop" % namelist[i]
        newdict.update({namelist[i]:templist})
    context.append(newdict)
    # print 'Constructed loop with items: '+`newdict`
    return {}    # to keep things easy

# this function updates a dictionary, first checking for name collisions,
# which imply that the CIF is invalid.  We need case insensitivity for
# names.
# Unfortunately we cannot check loop item contents against non-loop contents
# in a non-messy way during parsing, as we may not have easy access to previous
# key-value pairs in the context of our call (unlike our built-in access to all
# previous loops).
# For this reason, we don't waste time checking looped items against non-looped
# names during parsing of a data block.  This would only match a subset of the
# final items.  We do check against ordinary items, however.
#
# Note the following situations:
# (1) new_dict is empty -> we have just added a loop; do no checking
# (2) new_dict is not empty -> we have some new key-value pairs
#
def cif_update(old_dict,new_dict,loops):
    old_keys = map(lambda a:a.lower(),old_dict.keys())
    if new_dict != {}:    # otherwise we have a new loop
        # print 'Comparing %s to %s' % (`old_keys`,`new_dict.keys()`)
        for new_key in new_dict.keys():
            if new_key.lower() in old_keys:
                raise CifError, "Duplicate dataname or blockname %s in input file" % new_key
            old_dict[new_key] = new_dict[new_key]
#
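As a quick check of these helpers (separate from the generated parser), the following sketch shows what they produce on small hand-made inputs; the datanames and values are invented purely for illustration.

# Illustration only: assumes the helper functions above are already in scope
print(stripstring("'a quoted value'"))            # -> a quoted value
print(stripstring('"a double-quoted value"'))     # -> a double-quoted value

# stripextras removes the semicolon delimiters and the surrounding
# line terminators from a semicolon-delimited text field
print(stripextras("\n;a block of\ntext\n;"))      # -> a block of <newline> text

# makeloop packs looped datanames and values into a dictionary and
# appends it to the running list of loops for the current data block
loops = []
makeloop(['_atom_site_label','_atom_site_occupancy'],
         ['C1','1.0','O1','0.5'], loops)
print(loops)
# -> a one-element list holding
#    {'_atom_site_label': ['C1', 'O1'], '_atom_site_occupancy': ['1.0', '0.5']}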
The regular expressions aren't quite as easy to deal with as in kjParsing: there we could pass string variables as re arguments, whereas here the pattern must be written out as a literal string in the grammar. However, we can simplify the BNF specification of Nick Spadaccini. First of all, we do not need separate type I and type II strings, which are distinguished by the presence or absence of a line feed directly preceding them, and thus by whether or not a semicolon is allowed at the front. We take care of this by treating as whitespace all line terminators except those followed by a semicolon, so that a terminator-plus-semicolon sequence uniquely matches the start_sc_line token (see the sketch after the regular expressions below).
We include reserved words and save frames, although we take no semantic notice of any save frames that we see; no syntax error is flagged for them. The other reserved words have no rules defined, so they will provoke a syntax error. However, because yapps matches tokens context-sensitively, by default it would turn any word beginning with one of our reserved words into a data value whenever a data value is expected at that position, so we explicitly exclude anything starting with these reserved words in the definition of data_value_1.
<Regular expressions>= (<-U)
# first handle whitespace and comments, keeping whitespace
# before a semicolon
ignore: "([ \t\n\r](?!;))|[ \t]"
ignore: "#.*[\n\r](?!;)"
ignore: "#.*"
# now the tokens
token LBLOCK: "(L|l)(O|o)(O|o)(P|p)_"    # loop_
token save_heading: "(S|s)(A|a)(V|v)(E|e)_[][!%&\(\)*+,./:<=>?@0-9A-Za-z\\\\^`{}\|~\"#$';_-]+"
token save_end: "(S|s)(A|a)(V|v)(E|e)_"
token RESERVED: "((G|g)(L|l)(O|o)(B|b)(A|a)(L|l)_)|((S|s)(T|t)(O|o)(P|p)_)"
token data_name: "_[][!%&\(\)*+,./:<=>?@0-9A-Za-z\\\\^`{}\|~\"#$';_-]+"    # _ followed by stuff
token data_heading: "(D|d)(A|a)(T|t)(A|a)_[][!%&\(\)*+,./:<=>?@0-9A-Za-z\\\\^`{}\|~\"#$';_-]+"
token start_sc_line: "(\n|\r\n);([^\n\r])*(\r\n|\r|\n)+"
token sc_line_of_text: "[^;\r\n]([^\r\n])*(\r\n|\r|\n)+"
token end_sc_line: ";"
token data_value_1: "((?!(((S|s)(A|a)(V|v)(E|e)_[^\s]*)|((G|g)(L|l)(O|o)(B|b)(A|a)(L|l)_[^\s]*)|((S|s)(T|t)(O|o)(P|p)_[^\s]*)|((D|d)(A|a)(T|t)(A|a)_[^\s]*)))[^\s\"#$'_\[\]][^\s]*)|'(('(?=\S))|([^\n\r\f']))*'+|\"((\"(?=\S))|([^\n\r\"]))*\"+"
token END: '$'
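To see how the whitespace handling interacts with semicolon text fields, here is a small sketch that exercises the ignore and start_sc_line patterns with Python's re module directly (outside of yapps); the test strings are made up for the purpose.

# Illustration only: a line terminator is swallowed as whitespace
# unless a semicolon immediately follows it, in which case it is left
# for start_sc_line to match.
import re

ignore_ws = re.compile("([ \t\n\r](?!;))|[ \t]")
start_sc  = re.compile("(\n|\r\n);([^\n\r])*(\r\n|\r|\n)+")

print(ignore_ws.match("\nloop_") is not None)     # True: bare newline is whitespace
print(ignore_ws.match("\n;text\n") is not None)   # False: newline before ';' is kept
print(start_sc.match("\n;text\n") is not None)    # True: it begins a semicolon text field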
The grammar specification is adapted from the kjParsing version. We now merge our actions in with the specification, rather than binding rules to actions separately as we did before. The final return value is a dictionary with each datablock name as a key. The value attached to each key is the complete dictionary for that block, with the special member 'loops' holding a list of dictionaries, one for each loop block. In order to construct this properly, we pass the current value of 'loops' down to the data rule, so that when a loop is found it can be appended to this list. A worked example of the resulting structure follows the grammar below.
<Grammar specification>= (<-U)
# now the rules

rule input: ( ( dblock {{maindict = dblock }}
                ( dblock {{cif_update(maindict,monitor('input',dblock),[])}} #
                )*
                END
              )
            |
              ( END {{maindict = {} }} )
            )
            {{return maindict}}

rule dblock: ( data_heading {{dict={data_heading[5:]:{"loops":[],"saves":{} } } }}    # a data heading
               ( dataseq<<dict[data_heading[5:]]["loops"]>> {{cif_update(dict[data_heading[5:]],dataseq,[])}}
                 |
                 save_frame {{dict[data_heading[5:]]["saves"].update(save_frame)}}
               )*
             )
             {{return monitor('dblock',dict)}}    # but may be empty

rule dataseq<<loop_array>>: data<<loop_array>> {{datadict=data}}
                            ( data<<loop_array>> {{cif_update(datadict,data,loop_array)}}
                            )*
                            {{return monitor('dataseq',datadict)}}

rule data<<loop_array>>: data_loop<<loop_array>> {{return data_loop}}
                         |
                         datakvpair {{return datakvpair}}    # kv pair

rule datakvpair: data_name data_value {{return {data_name:data_value} }}    # name-value

rule data_value: ( data_value_1 {{thisval = stripstring(data_value_1)}}
                   |
                   sc_lines_of_text {{thisval = stripextras(sc_lines_of_text)}}
                 )
                 {{return monitor('data_value',thisval)}}

rule sc_lines_of_text: start_sc_line {{lines = start_sc_line}}
                       ( sc_line_of_text {{lines = lines+sc_line_of_text}}
                       )*
                       end_sc_line {{return monitor('sc_line_of_text',lines+end_sc_line)}}

rule data_loop<<loop_array>>: LBLOCK loopfield loopvalues
                              {{return makeloop(loopfield,loopvalues,loop_array)}}

rule loopfield: data_name {{loop=[data_name]}}
                ( data_name {{loop.append(data_name)}}
                )*
                {{return loop}}    # sequence of data names

rule loopvalues: ( data_value {{loop=[data_value]}}
                   ( data_value {{loop.append(monitor('loopval',data_value))}}
                   )*
                 )
                 {{return loop}}

rule save_frame: save_heading {{savedict = {save_heading[5:]:{"loops":[] } } }}
                 ( dataseq<<savedict[save_heading[5:]]["loops"]>> {{savedict[save_heading[5:]].update(dataseq)}}
                 )*
                 save_end {{return monitor('save_frame',savedict)}}
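As a worked example of the structure these rules build, the following sketch drives the generated parser over a tiny CIF fragment. The module name (YappsStarParser) and the module-level parse() entry point are assumptions about what yapps2 emits for this grammar; only the shape of the returned dictionary is fixed by the rules above.

# Hedged sketch, not part of the generated code: module name and parse()
# entry point are assumed, the expected dictionary follows from the rules.
from YappsStarParser import parse

cif_text = """data_demo
_cell_length_a 5.959
loop_
 _atom_site_label
 _atom_site_occupancy
 C1 1.0
 O1 0.5
"""

result = parse('input', cif_text)
print(result)
# Expected structure, built by the actions in input, dblock and data_loop:
# {'demo': {'loops': [{'_atom_site_label':     ['C1', 'O1'],
#                      '_atom_site_occupancy': ['1.0', '0.5']}],
#           'saves': {},
#           '_cell_length_a': '5.959'}}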