Fastest implementation of `ast.literal_eval`
I asked this as a question on Stack Overflow and then answered it myself with my own implementation.
I have some text (`str`, `bytes`; actually gzipped in a file on disk) which can be parsed via `ast.literal_eval`.
(It consists of a list of dicts, where the dict keys are strings and the values are strings, ints or floats. But maybe this question could be generic for any string which can be parsed via `ast.literal_eval`.)
It is large: ~22MB uncompressed.
What is the fastest way to parse it?
Surely I can use `ast.literal_eval`, but this seems quite slow. Standard `eval` is slightly faster (interestingly, but probably as expected, depending on how well you know Python; see the implementation of `ast.literal_eval`) but still slow.
In comparison, when I serialize the same data as JSON and then load the JSON (`json.loads`), this is way faster (>10x). So this shows that in principle it should be possible to parse it just as fast.
Some statistics (times in seconds, size in bytes):
Gunzip + read time: 0.15111494064331055
Size: 22035943
compile: 3.1023156170000004
parse: 3.3381092380000004
eval: 3.0252232049999996
ast.literal_eval: 3.765798232
json.loads: 0.2657175249999994
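For readers who want to reproduce this kind of comparison without the original files, here is a minimal sketch. It is not the benchmark script referred to in this post: the dummy data generator `make_dummy_data` and its field names are hypothetical, the data is far smaller than 22MB, and I am assuming that "compile" means the built-in `compile` and "parse" means `ast.parse`.

```python
import ast
import json
import timeit


def make_dummy_data(n=2000):
    # Hypothetical stand-in data: a list of dicts with str keys and
    # str/int/float values, as described in the question.
    return [{"name": f"item{i}", "count": i, "score": i * 0.5} for i in range(n)]


data = make_dummy_data()
py_text = repr(data)          # Python-literal serialization
json_text = json.dumps(data)  # JSON serialization of the same data

candidates = {
    "compile": lambda: compile(py_text, "<string>", "eval"),
    "parse": lambda: ast.parse(py_text, mode="eval"),
    "eval": lambda: eval(py_text),
    "ast.literal_eval": lambda: ast.literal_eval(py_text),
    "json.loads": lambda: json.loads(json_text),
}

timings = {}
for label, func in candidates.items():
    # Average over a few runs; absolute numbers depend on machine and Python version.
    timings[label] = timeit.timeit(func, number=3) / 3
    print(f"{label}: {timings[label]:.4f}")
```

The absolute numbers will differ from the ones above, since the input is much smaller; only the relative ordering is of interest.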
The benchmark script, and also a script to generate such a dummy text file, can be found here.
(Maybe the answer is: "this needs a faster C implementation; no-one has implemented that yet")
After posting this, I found some related questions. I did not find them via Google, though (maybe my search term "faster literal_eval" was bad).
This partly answers the question of why `ast.literal_eval` is slow.
Also, this basically tells you that if you are wondering whether Python code (e.g. via `repr`) is a good human-readable serialization format, you should use JSON instead.
So, to the best of my knowledge, there currently does not exist a faster implementation than `ast.literal_eval` (well, `eval` itself is a bit faster, but unsafe).
So I wrote my own simple implementation, which converts the literal Python code into equivalent binary pickle data.
So, for some bytes `data`, instead of `ast.literal_eval(data.decode("utf8"))`, you would use `pickle.loads(py_to_pickle(data))`, and get a speedup of about 5.5x.
The repo is here.
This is a quite straightforward implementation in C++, and you can easily use it directly with `ctypes` (there is an example in the repo).
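To make the calling convention concrete, here is a small sketch of the contract that `py_to_pickle` fulfils: bytes of Python literal code in, pickle bytes out. The function `py_to_pickle_standin` below is a hypothetical pure-Python stand-in, not the C++ implementation and not the `ctypes` binding from the repo. It goes through `ast.literal_eval` itself, so it gives no speedup at all; it only shows what the two code paths are expected to agree on.

```python
import ast
import pickle


def py_to_pickle_standin(data: bytes) -> bytes:
    # Hypothetical stand-in with the same input/output types as py_to_pickle.
    # The real implementation converts the literal directly in C++; this one
    # simply parses in Python and re-serializes, so it is slow by design.
    return pickle.dumps(ast.literal_eval(data.decode("utf8")))


data = b"[{'a': 1, 'b': 2.5, 'c': 'text'}]"

via_literal_eval = ast.literal_eval(data.decode("utf8"))
via_pickle = pickle.loads(py_to_pickle_standin(data))

# Both paths should yield the same Python object.
print(via_pickle == via_literal_eval)
```

Note that `pickle.loads` should only be used on pickle data you produced yourself or otherwise trust.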
New statistics (times in seconds, size in bytes):
Gunzip + read time: 0.1663219928741455
Size: 22540270
py_to_pickle: 0.539439306
pickle.loads+py_to_pickle: 0.7234611099999999
compile: 3.3440755870000003
parse: 3.6302585899999995
eval: 3.306765757000001
ast.literal_eval: 4.056752016000003
json.loads: 0.3230752619999997
pickle.loads: 0.1351051709999993
marshal.loads: 0.10351717500000035