Slow read values


#1

Hi,
I tried to benchmark openHistorian read/write performance based on the writing and reading examples. After writing several million points × 50,000 tags, reading values for a single tag is very slow: only a few hundred points/second. Is this normal?
My code fragment:

public void ReadHistorianData(string historianServer, string instanceName, DateTime startTime, DateTime stopTime, string measurementIDs = null)
{
    const int DefaultHistorianPort = 38402;

    if (string.IsNullOrEmpty(historianServer))
        throw new ArgumentNullException("historianServer", "Missing historian server parameter");
    if (string.IsNullOrEmpty(instanceName))
        throw new ArgumentNullException("instanceName", "Missing historian instance name parameter");
    if (startTime > stopTime)
        throw new ArgumentException("Invalid time range specified", "startTime");

    string[] parts = historianServer.Split(':');
    string hostName = parts[0];
    int port;

    if (parts.Length < 2 || !int.TryParse(parts[1], out port))
        port = DefaultHistorianPort;

    // Open historian connection
    using (HistorianClient client = new HistorianClient(hostName, port))
    using (ClientDatabaseBase<HistorianKey, HistorianValue> reader = client.GetDatabase<HistorianKey, HistorianValue>(instanceName))
    {
        // Set up time-range and point ID selections
        SeekFilterBase<HistorianKey> timeFilter = TimestampSeekFilter.CreateFromRange<HistorianKey>(startTime, stopTime);
        MatchFilterBase<HistorianKey, HistorianValue> pointFilter = null;
        HistorianKey key = new HistorianKey();
        HistorianValue value = new HistorianValue();

        if (!string.IsNullOrEmpty(measurementIDs))
            pointFilter = PointIdMatchFilter.CreateFromList<HistorianKey, HistorianValue>(measurementIDs.Split(',').Select(ulong.Parse));

        ulong iterationCount = 100000;
        var sw = new System.Diagnostics.Stopwatch();
        ulong count = 0;
        sw.Start();

        for (ulong j = 0; j < iterationCount; ++j)
        {
            // Start stream reader for the provided time window and selected points
            TreeStream<HistorianKey, HistorianValue> stream = reader.Read(SortedTreeEngineReaderOptions.Default, timeFilter, pointFilter);

            while (stream.Read(key, value))
            {
                // Sanity check: this benchmark queries a single tag (point ID 0)
                if (key.PointID != 0)
                    throw new Exception("Wrong key ID!");
                //yield return new HistorianMeasurement(key.PointID, key.TimestampAsDate, value.AsSingle);
                ++count;
            }

            if (j % 5 == 0)
            {
                float speed = 1000.0f * count / (float)sw.ElapsedMilliseconds;
                m_labelRead.Text = string.Format("Read speed: {0} rec/sec, count: {1}", speed, count);
                Application.DoEvents();
            }
        }

        m_labelRead.Text += " OK";
    }
}

public void WriteHistorianData(string historianServer, string instanceName)
{
    const int DefaultHistorianPort = 38402;

    if (string.IsNullOrEmpty(historianServer))
        throw new ArgumentNullException("historianServer", "Missing historian server parameter");
    if (string.IsNullOrEmpty(instanceName))
        throw new ArgumentNullException("instanceName", "Missing historian instance name parameter");

    string[] parts = historianServer.Split(':');
    string hostName = parts[0];
    int port;

    if (parts.Length < 2 || !int.TryParse(parts[1], out port))
        port = DefaultHistorianPort;

    ulong count = 0;
    DateTime startTime = DateTime.Now.AddHours(-240);

    // Open historian connection
    using (HistorianClient client = new HistorianClient(hostName, port))
    using (ClientDatabaseBase<HistorianKey, HistorianValue> database = client.GetDatabase<HistorianKey, HistorianValue>(instanceName))
    using (HistorianInputQueue queue = new HistorianInputQueue(() => database))
    {
        HistorianKey key = new HistorianKey();
        HistorianValue value = new HistorianValue();
        ulong tagCount = 50000;
        ulong iterationCount = 100000;
        var sw = new System.Diagnostics.Stopwatch();
        ulong localCount = 0;
        sw.Start();

        for (ulong j = 0; j < iterationCount; ++j)
        {
            for (ulong i = 0; i < tagCount; ++i)
            {
                key.PointID = i;
                key.TimestampAsDate = startTime;
                value.AsSingle = 2.0f * i * j;
                queue.Enqueue(key, value);
                ++count;
                ++localCount;
            }

            startTime = startTime.AddSeconds(1);

            if (j % 5 == 0)
            {
                sw.Stop();
                float speed = 1000.0f * localCount / (float)sw.ElapsedMilliseconds;
                m_labelWrite.Text = string.Format("Speed: {0} rec/sec, count: {1}", speed, count);
                Application.DoEvents();

                // Wait for the queue to drain so the timing reflects actual writes
                while (queue.Size > 0)
                    Thread.Sleep(100);

                sw.Start();
                localCount = 0;
            }
        }
    }
}

Thanks!
Nikolay


#2

When we designed the openHistorian, we decided our primary index should be Time first, then PointID. This drastically improves read times for applications that need a large subset of points over a relatively short time duration. We have considered allowing the user to store the data twice, with a second index on PointID then Timestamp, which would greatly speed up access to a single point, but we haven’t moved past considering it.

Scanning speeds should be in the range of 100+ million measurements per second with real data. Since your data isn’t real, it will likely incur about a 30% penalty, because the encoding method takes advantage of the fact that the value doesn’t change by much. Since you are grabbing 1 out of every 50,000 points, take 70 million and divide by 50,000: I would expect you to see about 1,400 measurements per second. I’m not sure how large a set you are scanning, but it’s possible the remainder of the difference is the time it takes to build the query.
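The arithmetic above, spelled out as a back-of-envelope calculation (all constants are the rough figures from this post, not measured values):

```csharp
// Rough estimate of the expected single-tag read rate described above.
const double scanRate = 100e6;        // ~100 million measurements/sec scanned (real data)
const double syntheticPenalty = 0.30; // ~30% slowdown for synthetic, fast-changing values
const double tagCount = 50000;        // only 1 of every 50,000 scanned points matches

double effectiveScanRate = scanRate * (1 - syntheticPenalty); // ~70 million/sec
double singleTagRate = effectiveScanRate / tagCount;          // ~1,400 points/sec
```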


#3

Thanks Steve, now it is clear. In our case, we are interested in many points now and in individual points in the past. But with this design it’s almost impossible to work with historical data for individual points if you have enough points. Have you considered an option where the near archive is ordered by time and the far archive by point?


#4

For synchrophasor measurements (the target use case), you almost always want a set of all measurements from a given PMU to calculate values on the fly, so ordering the far archive by point would still not be viable. You really need 2 separate archives, and you query one or the other depending on how many points you are trying to scan.

Here are 2 use cases where each ordering suffers:

Assuming a 1 TB/year data repository broken into 365 distinct files of about 3 GB each:

Scanning 1 point for an entire year:
By Time: That would require a sequential scan of all of the data for that year. Response time: hours.
By PointID: That would require scanning 365 files (assuming they are still partitioned as 1 file per day) for 1 point. Response time: seconds or more.

Scanning all points for a given time:
By Time: Seeking in 1 file for a few blocks of sequential data. Response time: milliseconds.
By PointID: Seeking 50,000 times in 1 file. Actually, it will probably be faster to read the entire file sequentially, since seeks are really slow. Response time: minutes.
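The dual-archive idea behind these two use cases could be routed with something like the sketch below. The helper and the 1% threshold are hypothetical, purely for illustration; nothing like this exists in the openHistorian API.

```csharp
// Hypothetical query router for a dual-archive setup: one archive ordered by
// Time, one PointID-ordered copy of the same data. Names and the threshold
// are illustrative only.
enum ArchiveOrdering { ByTime, ByPointId }

static ArchiveOrdering ChooseArchive(int requestedPointCount, int totalPointCount)
{
    // Few points over a long window: the PointID-ordered copy avoids scanning
    // every measurement in the range. Many points at once: the time-ordered
    // archive reads a few sequential blocks instead of seeking per point.
    double fraction = (double)requestedPointCount / totalPointCount;
    return fraction < 0.01 ? ArchiveOrdering.ByPointId : ArchiveOrdering.ByTime;
}
```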

So we decided that for synchrophasor data, the best choice is to order by Time. For SCADA data it’s really a toss-up, but the scan speed is so fast that it should still be reasonable to sort by Time. For AMI meter data, it depends on the use case: with Time ordering, web interfaces for a single customer would be terrible, while with PointID ordering, analytic engines that look at the entire sample set would suffer. Imagine randomly seeking millions of times to get today’s meter reads.

Also, you might be missing one key feature: have you tried down-sampled seeks? If you are storing 2-second SCADA data and want to go back 1 year, do you really plan on displaying 15 million measurements to the client? Instead, you should probably request 1 sample per hour and display about 8,000 measurements. Then when the user selects a month, display 4 measurements per hour; when they select a week, display 1 measurement per minute; and when they select a day, show the full-resolution data. That will execute much quicker and was the design case for the openHistorian. You can also build down-sampled databases, in which case all of these types of queries would return in milliseconds. That’s what we have done at OGE for our software.
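The zoom-level strategy above can be sketched as a small helper that maps the requested window to a sample interval. This is illustrative only (the thresholds and the helper are not from openHistorian); the chosen interval would then feed whatever down-sampled seek mechanism the historian exposes.

```csharp
// Illustrative only: pick a down-sampling interval from the requested
// time window, following the zoom levels described above.
static TimeSpan PickSampleInterval(DateTime start, DateTime stop)
{
    TimeSpan window = stop - start;

    if (window > TimeSpan.FromDays(90))
        return TimeSpan.FromHours(1);     // year-scale view: ~1 sample/hour
    if (window > TimeSpan.FromDays(7))
        return TimeSpan.FromMinutes(15);  // month-scale view: 4 samples/hour
    if (window > TimeSpan.FromDays(1))
        return TimeSpan.FromMinutes(1);   // week-scale view: 1 sample/minute
    return TimeSpan.Zero;                 // day-scale view: full resolution
}
```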